Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

2015-07-23 Thread Jinfeng Ni
Parth, You are right. If we put t.others.additional in the select list, in addition to t.others, then the output is wrong. The JSON file I used has 2 rows: {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes"}} {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes",

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

2015-07-23 Thread Parth Chandra
Given the sample rows that Stefan provided, the query - select `some`, t.others, t.others.additional from `test.json` t; - does produce incorrect results: | yes | {"additional":"last entries only"} | last entries only | instead of | yes | {"other":"true","all":"false"
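For readers following along, the failing pattern can be sketched as a Drill query over a small two-row `test.json` like the one Jinfeng describes, where only the second row carries the `additional` field (file name and layout here are illustrative):

```sql
-- Hypothetical reproduction of the reported bug.
-- test.json has two rows; only the second has others.additional.
SELECT `some`, t.others, t.others.additional
FROM dfs.`test.json` t;
-- Reported symptom: once t.others.additional is also in the select
-- list, the t.others map collapses to just the "additional" key
-- instead of showing the full map for each row.
```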

Re: Foreman Parallelizer not working with compressed csv file?

2015-07-23 Thread Ted Dunning
On Thu, Jul 23, 2015 at 2:19 PM, Juergen Kneissl wrote: > On 07/23/15 22:04, Jason Altekruse wrote: > > I'm very glad to hear that it exceeded your expectations. An important > > point I would like to add, when you unzipped the file you likely allowed > > drill to read not only on both nodes, bu

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

2015-07-23 Thread Stefán Baxter
hi, I can provide you with a json file and statements to reproduce it if you wish. thank you for looking into this. regards, -Stefan On Jul 23, 2015 9:03 PM, "Jinfeng Ni" wrote: > Hi Stefán, > > Thanks a lot for bringing up this issue, which is really helpful to improve > Drill. > > I tried to

Re: Foreman Parallelizer not working with compressed csv file?

2015-07-23 Thread Juergen Kneissl
On 07/23/15 22:04, Jason Altekruse wrote: > I'm very glad to hear that it exceeded your expectations. An important > point I would like to add, when you unzipped the file you likely allowed > drill to read not only on both nodes, but also on multiple threads on each > node. When the file was compr

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

2015-07-23 Thread Jinfeng Ni
Hi Stefán, Thanks a lot for bringing up this issue, which is really helpful to improve Drill. I tried to reproduce the incorrect results, and I could reproduce the missing data issue of CTAS parquet, but I could not reproduce the missing data issue if I query the JSON file directly. Here is ho

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

2015-07-23 Thread Stefán Baxter
Thank you. On Thu, Jul 23, 2015 at 7:24 PM, Ted Dunning wrote: > On Thu, Jul 23, 2015 at 3:55 AM, Stefán Baxter > wrote: > > > Someone must review the underlying optimization errors to prevent this > from > > happening to others. > > > > Jinfeng and Parth are examining this issue to try to co

Re: Foreman Parallelizer not working with compressed csv file?

2015-07-23 Thread Jason Altekruse
I'm very glad to hear that it exceeded your expectations. An important point I would like to add: when you unzipped the file you likely allowed drill to read not only on both nodes, but also on multiple threads on each node. When the file was compressed, only a single thread was reading and proces

Re: Several questions...

2015-07-23 Thread Ted Dunning
On Thu, Jul 23, 2015 at 8:18 AM, Jacques Nadeau wrote: > The good news is, Drill does provide a nice simple way to abstract these > details away. You simply create a view on top of HBase [1]. The view can > contain the physical conversions. Then users can interact with the view > rather than t
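The view-on-HBase approach Jacques describes could be sketched roughly like this (the table, column family, and column names below are made up for illustration; `CONVERT_FROM` and view syntax follow the Drill docs referenced as [1]):

```sql
-- Hypothetical example: decode HBase byte arrays once, inside a view,
-- so users query typed columns instead of raw bytes.
CREATE VIEW dfs.tmp.users_view AS
SELECT
  CONVERT_FROM(row_key,     'UTF8')   AS user_id,
  CONVERT_FROM(t.info.name, 'UTF8')   AS name,
  CONVERT_FROM(t.info.age,  'INT_BE') AS age
FROM hbase.`users` t;

-- Users then interact with the view rather than the raw table:
SELECT user_id, age FROM dfs.tmp.users_view WHERE age > 30;
```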

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

2015-07-23 Thread Ted Dunning
On Thu, Jul 23, 2015 at 3:55 AM, Stefán Baxter wrote: > Someone must review the underlying optimization errors to prevent this from > happening to others. > Jinfeng and Parth are examining this issue to try to come to a deeper understanding. Not surprisingly, they are a little quiet as they do

Re: Foreman Parallelizer not working with compressed csv file?

2015-07-23 Thread Juergen Kneissl
Hi Jason, On 07/23/15 18:53, Jason Altekruse wrote: > I could be wrong, but I believe that gzip is not a compression that can be > split, you must read and decompress the file from start to end. In this > case we can not parallelize the read. This stackoverflow article mentions > bzip2 as an alter

Drill Hangout (2015-07-21) minutes

2015-07-23 Thread Parth Chandra
Drill Hangout 2015-07-21 Participants: Jacques, Parth (scribe), Sudheesh, Hakim, Khurram, Aman, Jinfeng, Kristine, Sean Feature list for Drill 1.2 was discussed. The following items were considered (discussion/comments, if any, are summarized with each item): 1. Memory allocator improvemen

Re: Foreman Parallelizer not working with compressed csv file?

2015-07-23 Thread Jason Altekruse
I could be wrong, but I believe that gzip is not a compression that can be split, you must read and decompress the file from start to end. In this case we can not parallelize the read. This stackoverflow article mentions bzip2 as an alternative compression used by hadoop to solve this problem and a
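Jason's point about gzip can be demonstrated outside Drill: a gzip file has a single header and a DEFLATE stream whose state at any byte depends on everything before it, so a reader cannot start at an arbitrary split point. A minimal Python sketch of this:

```python
import gzip
import zlib

data = b"col1,col2\n" * 10_000
compressed = gzip.compress(data)

# Decompressing the whole stream from the start works fine.
assert gzip.decompress(compressed) == data

# Starting mid-stream does not: the slice has no gzip header, and the
# DEFLATE state at that point depends on all bytes that came before it.
try:
    gzip.decompress(compressed[len(compressed) // 2:])
    splittable = True
except (OSError, zlib.error):
    splittable = False

print(splittable)  # False -- which is why a .gz file gets a single reader
```

This is why recompressing with a block-oriented codec (such as bzip2, as the stackoverflow article suggests) allows parallel reads.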

Re: Foreman Parallelizer not working with compressed csv file?

2015-07-23 Thread Juergen Kneissl
Yes, of course: I've added the SQL and the output of EXPLAIN PLAN FOR: - jdbc:drill:schema=dfs> explain plan for SELECT columns[4] stichtag, columns[10] geschlecht, count(columns[0]) anzahl FROM dfs.`/mon_ew_xt_uni_bus_11.csv.gz` where 1 = 1 and columns[23] = 1

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

2015-07-23 Thread Abdel Hakim Deneche
I don't think Drill is supposed to "ignore" data. My understanding is that the reader will read the new fields which will cause a schema change, and depending on the query (if all operators involved can handle the schema change or not) the query should either succeed or fail. My understanding is th

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

2015-07-23 Thread Stefán Baxter
Hi, The only right answer to this question must be a) "adapt to additional information" and b) "try the hardest to accommodate changes". The current behavior must be seen as completely worthless (sorry for the strong language). Regards, -Stefan On Thu, Jul 23, 2015 at 4:16 PM, Matt wrote:

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

2015-07-23 Thread Matt
On 23 Jul 2015, at 10:53, Abdel Hakim Deneche wrote: When you try to read schema-less data, Drill will first investigate the first 1000 rows to figure out a schema for your data, then it will use this schema for the remainder of the query. To clarify, if the JSON schema changes on the 1001st 1MMth
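The failure mode being discussed can be illustrated in a few lines of Python. This is a sketch of the sampling idea only, not Drill's actual reader code; the sample size is shrunk to keep the demo small:

```python
import json

SAMPLE_SIZE = 2  # Drill reportedly samples ~1000 rows; 2 keeps the demo tiny

rows = [
    '{"a": 1}',
    '{"a": 2}',
    '{"a": 3, "b": "late field"}',  # "b" first appears after the sample
]

# Infer the schema from the first SAMPLE_SIZE rows only.
schema = set()
for line in rows[:SAMPLE_SIZE]:
    schema.update(json.loads(line).keys())

# Read every row, but keep only the columns in the inferred schema.
result = [{k: v for k, v in json.loads(line).items() if k in schema}
          for line in rows]

print(schema)  # {'a'} -- "b" was never seen during sampling
print(result)  # the "late field" value is silently dropped from row 3
```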

Re: Foreman Parallelizer not working with compressed csv file?

2015-07-23 Thread Abdel Hakim Deneche
Hi Juergen, can you share the query you tried to run? Thanks On Thu, Jul 23, 2015 at 9:10 AM, Juergen Kneissl wrote: > Hi everybody, > > I installed and configured a small cluster with two machines (gnu/linux) > with the following setup: > > zookeeper in version 3.4.6 , drill in version 1.1.0

Foreman Parallelizer not working with compressed csv file?

2015-07-23 Thread Juergen Kneissl
Hi everybody, I installed and configured a small cluster with two machines (gnu/linux) with the following setup: zookeeper in version 3.4.6 , drill in version 1.1.0 and also using hadoop (version 2.7.1) hdfs as dist. filesystem. So, I am playing around a bit, but what I am still not understandin

Re: How drill works?

2015-07-23 Thread Jinfeng Ni
Even for csv or json format, directory-based partition pruning [1] could be leveraged to prune data. You have to use the special dir* field in your query to filter out unwanted data, or define a view which uses the dir* field and then query against the view. 1. https://drill.apache.org/docs/partition
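A hedged sketch of the dir* pattern Jinfeng mentions, assuming data laid out under hypothetical year/month subdirectories like `/logs/2015/07/*.csv`:

```sql
-- dir0 = first directory level (year), dir1 = second (month).
-- Only files under /logs/2015/07 are read; other directories are pruned.
SELECT dir0 AS `year`, dir1 AS `month`, columns[0] AS msg
FROM dfs.`/logs`
WHERE dir0 = '2015' AND dir1 = '07';

-- Or hide the dir* fields behind a view and filter on the view instead:
CREATE VIEW dfs.tmp.logs_view AS
SELECT dir0 AS `year`, dir1 AS `month`, columns[0] AS msg
FROM dfs.`/logs`;
```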

Re: Several questions...

2015-07-23 Thread Jacques Nadeau
Unfortunately, HBase hasn't embraced embedded schema last I checked. There are definitely tools on top of HBase that do provide this. For example I believe Wibi and Cask both provide a more structured approach on top of HBase. Someone could extend the plugin to support these systems. The good n

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

2015-07-23 Thread Stefán Baxter
Hi Abdel, Thank you for taking the time to respond. I know my frustration is leaking through, but that does not mean I don't appreciate everything you and the Drill team are doing, I do. I also understand the premise of the optimization, but I find it too restrictive and it certainly does not fit our d

Re: How drill works?

2015-07-23 Thread Abdel Hakim Deneche
Hi Hafiz, I guess it depends on the query. Generally Drill will try to push any filter you have in your query to the leaf nodes so they won't send any row that doesn't pass the filter. Also only the columns that appear in the query will be loaded from the file. The file format you are querying al

How drill works?

2015-07-23 Thread Hafiz Mujadid
Hi all! I want to know how Drill works. Suppose I query data on S3, and the volume of data is huge (GBs). When I query that data, what happens? Does Drill load the whole dataset onto the Drill nodes, or does it query the data without loading all of it?

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

2015-07-23 Thread Abdel Hakim Deneche
Hi Stefan, Sorry to hear about your misadventure in Drill land. I will try to give you some more information, but I also have limited knowledge of this specific case and other developers will probably jump in to correct me. When you try to read schema-less data, Drill will first investigate the

Re: Several questions...

2015-07-23 Thread Kristine Hahn
Sounds great. The docs are written in markdown and stored in github-pages. You can contribute to the docs using a pull request. Click the pencil icon on the top right side of the page, and go for it. Thanks much, really appreciate your feedback and help. Kristine Hahn Sr. Technical Writer 415-497-

Re: Several questions...

2015-07-23 Thread Alex Ott
Thank you for pointing me to this section - somehow I missed it. How do you maintain this documentation? Maybe I'll have time to add more examples, so it will be easier for other people to start working with the HBase/Drill combo. On Thu, Jul 23, 2015 at 3:38 PM, Kristine Hahn wrote: > These data type

Re: Several questions...

2015-07-23 Thread Kristine Hahn
These data types are listed at http://drill.apache.org/docs/supported-data-types/#convert_to-and-convert_from, but they need to be easier to find and should include useful examples, as Ted pointed out. Sorry you had a problem. We'll add links to the types from strategic places. On Thursday, July 23, 2015, Alex Ot

Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

2015-07-23 Thread Stefán Baxter
Hi, The workaround for this was to edit the first line in the json file and fake a value for the "additional" field. That way the optimizer could not decide to ignore it. Someone must review the underlying optimization errors to prevent this from happening to others. JSON data, which is unstruct
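Concretely, the workaround amounts to padding the first record so that every field is visible while the schema is being detected. An illustrative sample (not Stefán's actual data):

```json
{"some": "yes", "others": {"other": "true", "all": "false", "sometimes": "yes", "additional": "placeholder"}}
{"some": "yes", "others": {"other": "true", "all": "false", "sometimes": "yes", "additional": "last entries only"}}
```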

Re: Querying Apache Spark Generated Parquet

2015-07-23 Thread Usman Ali
Thank you very much for your kind interest. It's very unfortunate that I am currently stuck somewhere else. I will share sample data with you in a day or so. Usman Ali On Thu, Jul 23, 2015 at 6:59 AM, Neeraja Rentachintala < nrentachint...@maprtech.com> wrote: > Hi > Do you still see this issue.

Re: Several questions...

2015-07-23 Thread Alex Ott
Thank you Jacques The INT_BE did the trick - now I'm getting status 200 instead of the negative number. The problem is that I haven't seen any mention of this type anywhere in the documentation - maybe the corresponding section on conversions should be expanded, because it refers only to sta
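Alex's symptom (a negative number until INT_BE was used) is exactly what happens when big-endian bytes, the layout HBase's Bytes.toBytes(int) writes, are decoded with the wrong byte order. A quick Python illustration of the same bytes under both conventions:

```python
import struct

# A Java/HBase int is stored as 4 big-endian bytes; 200 becomes:
raw = struct.pack(">i", 200)
print(raw)  # b'\x00\x00\x00\xc8'

# Decoded big-endian (what Drill's CONVERT_FROM(col, 'INT_BE') does):
print(struct.unpack(">i", raw)[0])  # 200

# Decoded little-endian, the same bytes become a large negative number:
print(struct.unpack("<i", raw)[0])  # -939524096
```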