Re: Parquet drill date fields

2016-02-04 Thread Stefán Baxter
thnx, will do On Thu, Feb 4, 2016 at 11:49 PM, Jason Altekruse wrote: > We haven't turned on the 2.0 encodings in Drill's Parquet writer, so they > have not been thoroughly tested. That being said we do use the standard > parquet-mr interfaces for reading parquet files in our complex parquet > r

Re: Parquet drill date fields

2016-02-04 Thread Jason Altekruse
We haven't turned on the 2.0 encodings in Drill's Parquet writer, so they have not been thoroughly tested. That being said we do use the standard parquet-mr interfaces for reading parquet files in our complex parquet reader. We are currently depending on 1.8.1 in Drill, so it should be compatible.

Re: Parquet drill date fields

2016-02-04 Thread Stefán Baxter
OK, the automatic handling and encoding options improve a lot in Parquet 2.0. (Manual override is not an option) I'm using parquet-mr/parquet-avro to create parquet 2 files (ParquetProperties.WriterVersion.PARQUET_2_0). Drill seems to read them just fine but I wonder if there are any gotchas Reg

Re: CTAS error with CSV data

2016-02-04 Thread Matt
Is there any more information I can supply on this issue? It's a blocker for Drill adoption for us, and my ability to diagnose exceptions in Java-based systems is very limited ;) On 27 Jan 2016, at 14:30, Matt wrote: https://issues.apache.org/jira/browse/DRILL-4317 On 26 Jan 2016, at 23:50,

Re: DRILL 1.4 - CSV - fail when value containing double quote

2016-02-04 Thread Hanifi Gunes
Can you share error details as well? -Hanifi On Thu, Feb 4, 2016 at 2:05 PM, Nicolas Paris wrote: > Hello, > > I have a problem loading a CSV containing doubled double quotes, e.g.: > col1;col2;col3 > 1;"\"\"foo\"\"";"bar" > 1;"\"\"foo\"\"";"bar" > > Thanks ! >

DRILL 1.4 - CSV - fail when value containing double quote

2016-02-04 Thread Nicolas Paris
Hello, I have a problem loading a CSV containing doubled double quotes, e.g.: col1;col2;col3 1;"\"\"foo\"\"";"bar" 1;"\"\"foo\"\"";"bar" Thanks !

Re: Creating a single parquet or csv file using CTAS command?

2016-02-04 Thread Jason Altekruse
While both parquet and JavaScript are widely used, they kind of exist in different worlds. I cannot find a JavaScript reader for parquet files. That being said, I'm not so sure that one ought to exist, as parquet files are designed specifically for storing volumes of data for scan efficiency. Are

Re: Creating a single parquet or csv file using CTAS command?

2016-02-04 Thread Peder Jakobsen | gmail
Hi Jason, sorry for the confusion; I'm generating both CSV files and parquet. Parquet is just an experiment for me to see if I get better performance than with CSV or loading the CSV into something like TinyDB or MongoDB. I've found a way to read the parquet files with a Python library; So

Re: REGEX search Operator

2016-02-04 Thread John Omernik
Ya, do you see where I am coming from here? Let's let the users submit regex in the pure form if possible, and code the nuances of java regex behind the scenes. I think it would be a great way to make Drill very accessible and desirable. I think what happened in Hive is the regex commands started

Re: Creating a single parquet or csv file using CTAS command?

2016-02-04 Thread Jason Altekruse
Are you even trying to write parquet files? In your original post you said you are writing CSV files, but then gave files with parquet extensions as what you are trying to concatenate. I'm a little confused, though; if you are not working with tools for big data, concatenating parquet files is not t

Re: REGEX search Operator

2016-02-04 Thread Nicolas Paris
You mean: userRegex=>javaRegex "\d" => "\\d" "\w" => "\\w" "\n" => "\n" I can do that thanks to regex I guess. I will give it a try 2016-02-04 19:37 GMT+01:00 John Omernik : > So my question on the double escape, is there no way to handle that so the > user can use single escaped regex? I know many

Re: Creating a single parquet or csv file using CTAS command?

2016-02-04 Thread Andries Engelbrecht
On a desktop you will likely be limited on memory. Perhaps set width to 1 to get single-threaded execution, and use 512MB or 1GB for the parquet block size, depending on how much direct memory the Drillbit has. This will limit the number of parquet files being created; see how much smaller t

Re: Creating a single parquet or csv file using CTAS command?

2016-02-04 Thread Peder Jakobsen | gmail
Hi Andries, the trouble is that I run Drill on my desktop machine, but I have no server available to me that is capable of running Drill. Most $10/month hosting accounts do not permit you to run Java apps. For this reason I simply use Drill for "pre-processing" of the files that I eventually wi

Writing Drill compatible Parquet in Java using parquet-mr

2016-02-04 Thread Stefán Baxter
Hi, What things do I need to know if I want to write Drill compatible Parquet in Java using Parquet-MR?
- The latest stable version of Parquet-MR is 1.8.1; is that too new?
- Will the standard Parquet work?
- Is any specific footer information required?
- Are there any dos and don'ts?
I wan

Re: REGEX search Operator

2016-02-04 Thread John Omernik
So my question on the double escape, is there no way to handle that so the user can use single escaped regex? I know many folks who use big data platforms to test large complex regexes for things like security appliances, and having to convert the regex seems like a lot of work if you consider every

Re: Creating a single parquet or csv file using CTAS command?

2016-02-04 Thread Andries Engelbrecht
You can create multiple parquet files and have the ability to query them all through the Drill SQL interface with minimal overhead. Creating a single 50GB parquet file is likely not the best option for performance; perhaps use Drill partitioning for the parquet files to speed up queries and

Re: REGEX search Operator

2016-02-04 Thread Jason Altekruse
Tip for navigating large GitHub repos. You can type 't' when looking at the folder structure to open a fast global search. Searching for the functions is a little extra-complicated in Drill because we actually generate a bunch of them to cover all of the types. This means that source code templates

Re: Creating a single parquet or csv file using CTAS command?

2016-02-04 Thread Peder Jakobsen | gmail
Sorry, bad typo: I have 50GB of data, NOT 500GB ;). And I usually only query a 1 GB subset of this data using Drill. On Thu, Feb 4, 2016 at 1:04 PM, Peder Jakobsen | gmail wrote: > On Thu, Feb 4, 2016 at 11:15 AM, Andries Engelbrecht < > aengelbre...@maprtech.com> wrote: > >> Is there a rea

Re: Creating a single parquet or csv file using CTAS command?

2016-02-04 Thread Peder Jakobsen | gmail
On Thu, Feb 4, 2016 at 11:15 AM, Andries Engelbrecht < aengelbre...@maprtech.com> wrote: > Is there a reason to create a single file? Typically you may want more > files to improve parallel operation on distributed systems like drill. > Good question. I'm not actually using Drill for "big data

Re: REGEX search Operator

2016-02-04 Thread Nicolas Paris
John, Jason, 2016-02-04 18:47 GMT+01:00 John Omernik : > I'd be curious how you are implementing the regex... using Java's regex > libraries? etc. > Yeah, I use java.util.regex > I know one thing with Hive that always bothered me was the need to double > escape things. > > '\d\d\d\d-\d\d-\d

Re: Bug or Feature?

2016-02-04 Thread Jacques Nadeau
Yeah, not ideal. We should get a JIRA up and fix this. Since I've seen the code, it isn't surprising either. An easier way to understand this behavior is to run the query select dir0 from t limit 1 (where t is one directory versus two). In the single case, you'll see that dir0 is null. (This is why t

Re: REGEX search Operator

2016-02-04 Thread John Omernik
I'd be curious how you are implementing the regex... using Java's regex libraries? etc. I know one thing with Hive that always bothered me was the need to double escape things. '\d\d\d\d-\d\d-\d\d' needed to be '\\d\\d\\d\\d-\\d\\d-\\d\\d'. If we can avoid that it would be AWESOME. On Thu, Feb
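
A minimal Java sketch of where the double escape comes from, for readers following this thread. It is not taken from any message here; the class and method names (`RegexEscapeDemo`, `isDate`, `matchesUserRegex`) are hypothetical, and it only illustrates `java.util.regex` behavior, not how Drill or Hive parse SQL string literals. The doubled backslashes belong to the string literal syntax: once the literal is parsed, the regex engine receives the single-escaped pattern.

```java
import java.util.regex.Pattern;

public class RegexEscapeDemo {
    // In Java source, the literal "\\d" is the two-character string \d.
    // The compiler consumes one level of backslashes before the regex
    // engine ever sees the pattern.
    static final String DATE_REGEX = "\\d\\d\\d\\d-\\d\\d-\\d\\d";

    // Full-string match against the date pattern above.
    static boolean isDate(String s) {
        return Pattern.matches(DATE_REGEX, s);
    }

    // A pattern that arrives at runtime (for example as a function
    // argument) is already a plain string, so it needs no extra escaping.
    static boolean matchesUserRegex(String userRegex, String s) {
        return Pattern.compile(userRegex).matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(DATE_REGEX);           // the single-escaped pattern
        System.out.println(isDate("2016-02-04"));
        System.out.println(isDate("2016-2-4"));
    }
}
```

Under this reading, whether a user has to type `\\d` or `\d` depends on how many layers of string-literal parsing sit between the query text and `Pattern.compile`, which is presumably the layer a Drill function would have to take care of.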

Re: REGEX search Operator

2016-02-04 Thread Jason Altekruse
I think you should actually just put the function in Drill itself. System native functions are implemented in the same interface as UDFs, because our mechanism for evaluating them is very efficient (we code generate code blocks by linking together the bodies of the individual functions to evaluate

Re: Parquet drill date fields

2016-02-04 Thread Stefán Baxter
Hi again, I did a little test and ~5 million fairly wide records take 791 MB in parquet without dictionary encoding and 550 MB with dictionary encoding enabled (the non-dictionary-encoded file is a whopping 45% bigger). The plain, non-dictionary-encoded file returns results for identical queries

Re: Query Planning and Directory Pruning

2016-02-04 Thread John Omernik
I can package up both plans for you if you need them (let me know if you still want them) but I can tell you the plans were EXACTLY the same; however, the data-sum table took 0.932 seconds to plan the query, and the data table (the one with all the extra data) took 11.379 seconds to plan the que

Re: Bug or Feature?

2016-02-04 Thread John Omernik
Sorry, I wasn't clear on that, but yes, when there is exactly ONE sub directory, and I run a query with no filter, it returns the correct count and returns fast. On Thu, Feb 4, 2016 at 10:26 AM, Neeraja Rentachintala < nrentachint...@maprtech.com> wrote: > John > What happens if you do the select

Re: REGEX search Operator

2016-02-04 Thread Nicolas Paris
Jason, I have it working. Just tell me how to proceed with a PR. 1. Where do I put my Maven project? Which folder in my Drill GitHub fork? 2. Do I need a JIRA? How do I proceed? For now, I have only published it on my GitHub account in a separate project. Thanks 2016-02-04 16:52 GMT+01:00 Jason Altekru

Re: Query Planning and Directory Pruning

2016-02-04 Thread Abdel Hakim Deneche
Hey John, can you try an explain plan for both queries and see how much time it takes? For example, for the first query you would run: *explain plan for* select count(1) from `data/2016-02-03`; It would also be helpful if you could share the query profiles for both queries. Thanks On Thu, Feb

Re: Bug or Feature?

2016-02-04 Thread Neeraja Rentachintala
John, what happens if you do the select query with no filter? The scenario you explained does seem like unexpected behavior. -Neeraja On Thu, Feb 4, 2016 at 8:21 AM, John Omernik wrote: > Prior to posting a JIRA, I thought I'd toss this here: > > If I have a directory: data with subdirectori

Re: Creating a single parquet or csv file using CTAS command?

2016-02-04 Thread Andries Engelbrecht
Is there a reason to create a single file? Typically you may want more files to improve parallel operation on distributed systems like Drill. That said, if you have a single-node Drill cluster (or embedded mode) you can reduce the threads to a single thread and increase the parquet file size for

Bug or Feature?

2016-02-04 Thread John Omernik
Prior to posting a JIRA, I thought I'd toss this here: If I have a directory: data with subdirectories with parquet files in them: data/2016-01-01 data/2016-01-02 (Seem familiar? This came up in my other testing) If I have MORE than one subdirectory, then select count(1) from `data/` where dir

Query Planning and Directory Pruning

2016-02-04 Thread John Omernik
Hey all, I think I am seeing an issue related to https://issues.apache.org/jira/browse/DRILL-3759 but I want to describe it out here, see if it's really the case, and then determine what the blockers may be to resolution. I am using the MapR Developer Release 1.4, and I have a directory with subdi

Re: REGEX search Operator

2016-02-04 Thread Jason Altekruse
Awesome, thanks! On Thu, Feb 4, 2016 at 7:44 AM, Nicolas Paris wrote: > Well I am creating a udf > good exercise > I hope a PR soon > > 2016-02-04 16:37 GMT+01:00 Jason Altekruse : > > > I didn't realize that we were lacking this functionality. As the > > repeated_contains operator handles wildc

Re: Parquet drill date fields

2016-02-04 Thread Stefán Baxter
Hi Jason, Thank you for the explanation. I have several *low* cardinality fields that contain semi-long values and they are, I think, perfect candidates for dictionary encoding. I assumed that the choice to use dictionary encoding was a bit smarter than this and would rely on Strings type colum

Re: REGEX search Operator

2016-02-04 Thread Nicolas Paris
Well, I am creating a UDF; good exercise. I hope to have a PR soon. 2016-02-04 16:37 GMT+01:00 Jason Altekruse : > I didn't realize that we were lacking this functionality. As the > repeated_contains operator handles wildcards it makes sense to add such a > function to drill. > > It should be simple to imple

Re: REGEX search Operator

2016-02-04 Thread Jason Altekruse
I didn't realize that we were lacking this functionality. As the repeated_contains operator handles wildcards it makes sense to add such a function to drill. It should be simple to implement, would someone like to open a JIRA and submit a PR for this? - Jason On Tue, Feb 2, 2016 at 8:56 AM, John

Re: Parquet drill date fields

2016-02-04 Thread Jason Altekruse
Hi Stefan, There is a reason that dictionary is disabled by default. The parquet-mr library we leverage for writing parquet files currently has the behavior to write nearly all columns as dictionary encoded for all types when dictionary encoding is enabled. This includes columns with integers, dou

Re: Sqlline Tricks

2016-02-04 Thread Christopher Matta
Noted, I've updated the gist. Thanks John. Chris Matta cma...@mapr.com 215-701-3146 On Thu, Feb 4, 2016 at 10:12 AM, John Omernik wrote: > Like I said, I don't believe read -s is posix compliant, hence why I went > with the stty -echo based on > > http://stackoverflow.com/questions/3980668/how-

Re: Sqlline Tricks

2016-02-04 Thread John Omernik
Like I said, I don't believe read -s is POSIX-compliant, hence why I went with the stty -echo based on http://stackoverflow.com/questions/3980668/how-to-get-a-password-from-a-shell-script-without-echoing Thus, I went that route for more portability. On Thu, Feb 4, 2016 at 8:54 AM, Christopher Mat

Re: Sqlline Tricks

2016-02-04 Thread Christopher Matta
Looks good. You can streamline the no echo of the password by passing read an -s flag. I’ve also updated it to allow for a -u or --user flag: #!/bin/bash USERNAME= PASSWORD= DRILL_VER=drill-1.4.0 DRILL_LOC=/opt/mapr/drill URL=jdbc:drill:zk=10.10.15.10:5181,10.10.15.11:5181,10.10.15.12:5181/drill/s

Creating a single parquet or csv file using CTAS command?

2016-02-04 Thread Peder Jakobsen | gmail
Hi, is there a way to force Drill to create a single file when performing a CTAS command (or some other method)? Right now, I'm creating CSV files, and then have to perform an extra step to stitch 1_0_0.parquet 1_1_0.parquet 1_2_0.parquet etc. together into a single file. Thank you. Peder

Re: Sqlline Tricks

2016-02-04 Thread John Omernik
That works; here is the script I came up with (mostly based on Ted's script with a few terminal reads). Feel free to include this script in Drill for people to use. Security-wise, this is fairly sound: 5 seconds of a file existing with the user's credentials, that is only readable by the user seem