thnx, will do
On Thu, Feb 4, 2016 at 11:49 PM, Jason Altekruse
wrote:
> We haven't turned on the 2.0 encodings in Drill's Parquet writer, so they
> have not been thoroughly tested. That being said we do use the standard
> parquet-mr interfaces for reading parquet files in our complex parquet
> r
We haven't turned on the 2.0 encodings in Drill's Parquet writer, so they
have not been thoroughly tested. That being said we do use the standard
parquet-mr interfaces for reading parquet files in our complex parquet
reader. We are currently depending on 1.8.1 in Drill, so it should be
compatible.
OK, the automatic handling and encoding options improve a lot in Parquet
2.0. (Manual override is not an option)
I'm using parquet-mr/parquet-avro to create parquet 2 files
(ParquetProperties.WriterVersion.PARQUET_2_0).
Drill seems to read them just fine but I wonder if there are any gotchas
Reg
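[For reference, a minimal sketch of writing such a file with parquet-avro 1.8.1's builder API; the schema, output path, and values below are made up for illustration, and the relevant line is the withWriterVersion(PARQUET_2_0) call mentioned above:]

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.column.ParquetProperties;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class Parquet2WriteSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical two-column schema, only for illustration.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Row\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"name\",\"type\":\"string\"}]}");

    ParquetWriter<GenericRecord> writer = AvroParquetWriter
        .<GenericRecord>builder(new Path("/tmp/parquet2_example.parquet"))
        .withSchema(schema)
        // The setting discussed in this thread: request the Parquet 2.0 encodings.
        .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .build();
    try {
      GenericRecord record = new GenericData.Record(schema);
      record.put("id", 1L);
      record.put("name", "foo");
      writer.write(record);
    } finally {
      writer.close();
    }
  }
}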
Is there any more information I can supply in this issue?
It's a blocker for Drill adoption for us, and my ability to diagnose
exceptions in Java-based systems is very limited ;)
On 27 Jan 2016, at 14:30, Matt wrote:
https://issues.apache.org/jira/browse/DRILL-4317
On 26 Jan 2016, at 23:50,
Can you share error details as well?
-Hanifi
On Thu, Feb 4, 2016 at 2:05 PM, Nicolas Paris wrote:
> Hello,
>
> I have a problem loading a CSV containing doubled double quotes, e.g.:
> col1;col2;col3
> 1;"\"\"foo\"\"";"bar"
> 1;"\"\"foo\"\"";"bar"
>
> Thanks !
>
Hello,
I have a problem loading a CSV containing doubled double quotes, e.g.:
col1;col2;col3
1;"\"\"foo\"\"";"bar"
1;"\"\"foo\"\"";"bar"
Thanks !
While both parquet and javascript are widely used, they kind of exist in
different worlds. I cannot find a javascript reader for parquet files.
That being said, I'm not so sure that one ought to exist, as parquet files
are designed specifically for storing volumes of data for scan efficiency.
Are
Hi Jason, sorry for the confusion; I'm generating both CSV files and
parquet. Parquet is just an experiment for me to see if I get better
performance than with CSV or loading the csv into something like TinyDB or
MongoDB.
I've found a way to read the parquet files with a python library; So
Ya, do you see where I am coming from here? Let's let the users submit
regex in the pure form if possible, and code the nuances of java regex
behind the scenes. I think it would be a great way to make Drill very
accessible and desirable. I think what happened in Hive is the regex
commands started
Are you even trying to write parquet files? In your original post you said
you are writing CSV files, but then gave files with parquet extensions as
what you are trying to concatenate.
I'm a little confused, though; if you are not working with tools for big
data, concatenating parquet files is not t
You mean:
userRegex=>javaRegex
"\d" => "\\d"
"\w" => "\\w"
"\n" => "\n"
I can do that with a regex, I guess.
I will give it a try.
2016-02-04 19:37 GMT+01:00 John Omernik :
> So my question on the double escape, is there no way to handle that so the
> user can use single escaped regex? I know many
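[Not Drill's actual handling, but a rough sketch of the kind of conversion described in the mapping above: double the backslashes in the user's pattern so it survives one extra round of un-escaping, while leaving escapes the downstream parser already understands alone. The pass-through list below is an invented assumption:]

public class RegexEscapeSketch {
  // Escapes assumed to be understood downstream and therefore left untouched.
  private static final String PASS_THROUGH = "nrt";

  // "\d" becomes "\\d", "\w" becomes "\\w", while "\n" is kept as "\n".
  static String toJavaRegex(String userRegex) {
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < userRegex.length(); i++) {
      char c = userRegex.charAt(i);
      if (c == '\\' && i + 1 < userRegex.length()
          && PASS_THROUGH.indexOf(userRegex.charAt(i + 1)) < 0) {
        out.append("\\\\");  // one backslash in, two backslashes out
      } else {
        out.append(c);
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // The user's pattern \d\d\d\d-\d\d-\d\d comes back double-escaped.
    System.out.println(toJavaRegex("\\d\\d\\d\\d-\\d\\d-\\d\\d"));
  }
}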
On a desktop you will likely be limited on memory.
Perhaps set width to 1 to force single-threaded execution, and use 512MB or 1GB
for the parquet block size, depending on how much memory the Drillbit has for
direct memory. This will limit the number of parquet files being created; see
how much smaller t
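[As a concrete illustration of those two settings, a rough sketch over JDBC; the connection URL and table paths are made up, and the option names reflect my understanding of Drill 1.4's session options, so verify them against sys.options on your install:]

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SingleFileCtasSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical connection URL; use your own zk= or drillbit= address.
    try (Connection conn = DriverManager.getConnection("jdbc:drill:drillbit=localhost");
         Statement stmt = conn.createStatement()) {
      // Width 1 forces single-threaded execution, so CTAS writes as few files as possible.
      stmt.execute("ALTER SESSION SET `planner.width.max_per_node` = 1");
      // 512 MB parquet block size, so each output file can hold more data.
      stmt.execute("ALTER SESSION SET `store.parquet.block-size` = 536870912");
      // Hypothetical CTAS; adjust the workspace and source to your setup.
      stmt.execute("CREATE TABLE dfs.tmp.`single_file_out` AS "
          + "SELECT * FROM dfs.tmp.`source_table`");
    }
  }
}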
Hi Andries, the trouble is that I run Drill on my desktop machine, but I
have no server available to me that is capable of running Drill. Most
$10/month hosting accounts do not permit you to run java apps. For this
reason I simply use Drill for "pre-processing" of the files that I
eventually wi
Hi,
What things do I need to know if I want to write Drill-compatible Parquet
in Java using Parquet-MR?
- The latest stable version of Parquet-MR is 1.8.1; is that too new?
- Will the standard Parquet work?
- Is any specific footer information required?
- Are there any dos and don'ts?
I wan
So my question on the double escape, is there no way to handle that so the
user can use single escaped regex? I know many folks who use a big data
platform to test large complex regexes for things like security appliances,
and having to convert the regex seems like a lot of work if you consider
every
You can create multiple parquet files and have the ability to query them all
through the Drill SQL interface with minimal overhead.
Creating a single 50GB parquet file is likely not the best option for
performance, perhaps use Drill partitioning for the parquet files to speed up
queries and
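[A sketch of the partitioning idea, again with invented table paths and column names; note that Drill's CTAS requires the PARTITION BY columns to appear in the SELECT list:]

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class PartitionedCtasSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical connection URL and table paths, for illustration only.
    try (Connection conn = DriverManager.getConnection("jdbc:drill:drillbit=localhost");
         Statement stmt = conn.createStatement()) {
      // Write the parquet files grouped by a partition column so Drill can
      // prune files when that column shows up in a WHERE clause.
      stmt.execute("CREATE TABLE dfs.tmp.`events_by_day` PARTITION BY (event_date) AS "
          + "SELECT event_date, col1, col2 FROM dfs.tmp.`source_table`");
      // e.g. SELECT count(*) FROM dfs.tmp.`events_by_day` WHERE event_date = '2016-01-01'
    }
  }
}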
Tip for navigating large Github repos. You can type 't' when looking at the
folder structure to open a fast global search. Searching for the functions
is a little extra-complicated in Drill because we actually generate a bunch
of them to cover all of the types. This means that source code templates
Sorry, bad typo: I have 50GB of data, NOT 500GB ;). And I usually only
query a 1 GB subset of this data using Drill.
On Thu, Feb 4, 2016 at 1:04 PM, Peder Jakobsen | gmail
wrote:
> On Thu, Feb 4, 2016 at 11:15 AM, Andries Engelbrecht <
> aengelbre...@maprtech.com> wrote:
>
>> Is there a rea
On Thu, Feb 4, 2016 at 11:15 AM, Andries Engelbrecht <
aengelbre...@maprtech.com> wrote:
> Is there a reason to create a single file? Typically you may want more
> files to improve parallel operation on distributed systems like drill.
>
Good question. I'm not actually using Drill for "big data
John, Jason,
2016-02-04 18:47 GMT+01:00 John Omernik :
> I'd be curious how you are implementing the regex... using Java's regex
> libraries? etc.
>
Yeah, I use
java.util.regex
> I know one thing with Hive that always bothered me was the need to double
> escape things.
>
> '\d\d\d\d-\d\d-\d
Yeah, not ideal. We should get a JIRA up and fix this.
Having seen the code, it isn't surprising either. An easier way to
understand this behavior is to run the query select dir0 from t limit 1 (where
t is one directory versus two). In the single-directory case, you'll see that dir0 is
null. (This is why t
I'd be curious how you are implementing the regex... using Java's regex
libraries? etc.
I know one thing with Hive that always bothered me was the need to double
escape things.
'\d\d\d\d-\d\d-\d\d' needed to be '\\d\\d\\d\\d-\\d\\d-\\d\\d'; if we can
avoid that it would be AWESOME.
On Thu, Feb
I think you should actually just put the function in Drill itself. System-
native functions are implemented through the same interface as UDFs, because our
mechanism for evaluating them is very efficient (we code-generate code
blocks by linking together the bodies of the individual functions to
evaluate
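[For anyone curious what that interface looks like, a rough sketch of a simple Drill UDF follows; the function name, the regex-matching behavior, and the string handling are my own illustration, not an existing Drill function:]

import org.apache.drill.exec.expr.DrillSimpleFunc;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.Output;
import org.apache.drill.exec.expr.annotations.Param;
import org.apache.drill.exec.expr.holders.BitHolder;
import org.apache.drill.exec.expr.holders.VarCharHolder;

@FunctionTemplate(name = "regex_matches",
    scope = FunctionTemplate.FunctionScope.SIMPLE,
    nulls = FunctionTemplate.NullHandling.NULL_IF_NULL)
public class RegexMatchesSketch implements DrillSimpleFunc {

  @Param VarCharHolder input;
  @Param VarCharHolder pattern;
  @Output BitHolder out;

  public void setup() {
    // A real function would compile the pattern once (via an @Workspace field)
    // rather than recompiling it on every row as eval() does below.
  }

  public void eval() {
    // Drill inlines this body into generated code, so classes other than the
    // holders are referenced by their fully qualified names.
    byte[] inBytes = new byte[input.end - input.start];
    input.buffer.getBytes(input.start, inBytes);
    byte[] patBytes = new byte[pattern.end - pattern.start];
    pattern.buffer.getBytes(pattern.start, patBytes);
    String in = new String(inBytes, java.nio.charset.StandardCharsets.UTF_8);
    String regex = new String(patBytes, java.nio.charset.StandardCharsets.UTF_8);
    out.value = java.util.regex.Pattern.matches(regex, in) ? 1 : 0;
  }
}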
Hi again,
I did a little test and ~5 million fairly wide records take 791 MB in
parquet without dictionary encoding and 550MB with dictionary encoding
enabled (the non-dictionary-encoded file is a whopping 45% bigger).
The plain, non-dictionary-encoded file returns results for identical
queries
I can package up both plans for you if you need them (let me know if you
still want them) but I can tell you the plans were EXACTLY the same,
however the data-sum table took 0.932 seconds to plan the query, and the
data table (the one with all the extra data) took 11.379 seconds to
plan the que
Sorry, I wasn't clear on that, but yes, when there is exactly ONE
subdirectory, and I run a query with no filter, it returns the correct count
and returns fast.
On Thu, Feb 4, 2016 at 10:26 AM, Neeraja Rentachintala <
nrentachint...@maprtech.com> wrote:
> John
> What happens if you do the select
Jason, I have it working.
Just tell me how to proceed with the PR.
1. Where do I put my Maven project? Which folder in my drill github fork?
2. Do I need a JIRA? How do I proceed?
For now, I have only published it on my github account in a separate project.
Thanks
2016-02-04 16:52 GMT+01:00 Jason Altekru
Hey John, can you try an explain plan for both queries and see how much
time it takes?
for example, for the first query you would run:
*explain plan for* select count(1) from `data/2016-02-03`;
It can also be helpful if you could share the query profiles for both
queries.
Thanks
On Thu, Feb
John
What happens if you do the select query with no filter?
The scenario you explained does seem like unexpected behavior.
-Neeraja
On Thu, Feb 4, 2016 at 8:21 AM, John Omernik wrote:
> Prior to posting a JIRA, I thought I'd toss this here:
>
> If I have a directory: data with subdirectori
Is there a reason to create a single file? Typically you may want more files to
improve parallel operation on distributed systems like drill.
That said, if you have a single node drill cluster (or embedded mode) you can
reduce the threads to a single thread and increase the parquet file size for
Prior to posting a JIRA, I thought I'd toss this here:
If I have a directory: data with subdirectories with parquet files in it
data/2016-01-01
data/2016-01-02
(Seem familiar? This came up in my other testing)
If I have MORE than one subdirectory,
then
select count(1) from `data/` where dir
Hey all, I think I am seeing an issue related to
https://issues.apache.org/jira/browse/DRILL-3759 but I want to describe it
out here, see if it's really the case, and then determine what the blockers
may be to resolution.
I am using the MapR Developer Release 1.4, and I have a directory with
subdi
Awesome, thanks!
On Thu, Feb 4, 2016 at 7:44 AM, Nicolas Paris wrote:
> Well I am creating a udf
> good exercise
> I hope a PR soon
>
> 2016-02-04 16:37 GMT+01:00 Jason Altekruse :
>
> > I didn't realize that we were lacking this functionality. As the
> > repeated_contains operator handles wildc
Hi Jason,
Thank you for the explanation.
I have several *low* cardinality fields that contain semi-long values and
they are, I think, a perfect candidate for dictionary encoding.
I assumed that the choice to use dictionary encoding was a bit smarter than
this and would rely on String-type colum
Well I am creating a udf
good exercise
I hope a PR soon
2016-02-04 16:37 GMT+01:00 Jason Altekruse :
> I didn't realize that we were lacking this functionality. As the
> repeated_contains operator handles wildcards it makes sense to add such a
> function to drill.
>
> It should be simple to imple
I didn't realize that we were lacking this functionality. As the
repeated_contains operator handles wildcards it makes sense to add such a
function to drill.
It should be simple to implement, would someone like to open a JIRA and
submit a PR for this?
- Jason
On Tue, Feb 2, 2016 at 8:56 AM, John
Hi Stefan,
There is a reason that dictionary encoding is disabled by default. The parquet-mr
library we leverage for writing parquet files currently writes nearly all
columns as dictionary encoded, for all types, when dictionary encoding is
enabled. This includes columns with integers,
dou
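[For what it's worth, when writing with parquet-avro directly (as earlier in this digest), the builder exposes a global dictionary toggle; as far as I know 1.8.x has no supported per-column switch, so this sketch just flips it for every column. Schema creation and record writing are omitted:]

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class DictionaryToggleSketch {
  // Returns a writer with dictionary encoding switched on or off; in parquet-mr
  // 1.8.x this toggle applies to every column of every type, which is the
  // behavior described above.
  static ParquetWriter<GenericRecord> newWriter(Schema schema, String file,
      boolean useDictionary) throws java.io.IOException {
    return AvroParquetWriter.<GenericRecord>builder(new Path(file))
        .withSchema(schema)
        .withDictionaryEncoding(useDictionary)
        .build();
  }
}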
Noted, I've updated the gist. Thanks John.
Chris Matta
cma...@mapr.com
215-701-3146
On Thu, Feb 4, 2016 at 10:12 AM, John Omernik wrote:
> Like I said, I don't believe read -s is posix compliant, hence why I went
> with the stty -echo based on
>
> http://stackoverflow.com/questions/3980668/how-
Like I said, I don't believe read -s is posix compliant, hence why I went
with the stty -echo based on
http://stackoverflow.com/questions/3980668/how-to-get-a-password-from-a-shell-script-without-echoing
Thus, I went that route for more portability.
On Thu, Feb 4, 2016 at 8:54 AM, Christopher Mat
Looks good. You can streamline the no-echo of the password by passing read
the -s flag. I've also updated it to allow for a -u or --user flag:
#!/bin/bash
USERNAME=
PASSWORD=
DRILL_VER=drill-1.4.0
DRILL_LOC=/opt/mapr/drill
URL=jdbc:drill:zk=10.10.15.10:5181,10.10.15.11:5181,10.10.15.12:5181/drill/s
Hi, is there a way to force drill to create a single file when performing a
CTAS command (or some other method)?
Right now, I'm creating CSV files, and then have to perform an extra step
to stitch 1_0_0.parquet 1_1_0.parquet 1_2_0.parquet etc. together into a
single file.
Thank you.
Peder
That works, here is the script I came up with (mostly based on Ted's script
with a few terminal reads). Feel free to include this script in Drill for
people to use. Security-wise, this is fairly sound: 5 seconds of a file
existing with the user's credentials, that is only readable by the user,
seem