Parallelism / data locality in HDFS/MapRFS

2016-03-07 Thread Eric Pederson
We are using MapR M3 and are querying multiple JSON files - around 250 files at 1.5 GB per file. We have a small cluster of three machines running Drill 1.4. The JSON is nested three to four levels deep, in a format like: { { "group1": { "field1": 42, "field2": [ "a", "b", "c" ],

Re: Parallelism / data locality in HDFS/MapRFS

2016-03-07 Thread Andries Engelbrecht
You may want to look at the query plan between the 3 scenarios to see which operators the time is spent on and how well they are parallelized. The expectation would be that Parquet will perform better than JSON. --Andries > On Mar 7, 2016, at 3:02 PM, Eric Pederson wrote: > > We are using MapR

Re: Parallelism / data locality in HDFS/MapRFS

2016-03-07 Thread Jacques Nadeau
The flattening is surprising unless we're spending a long time in query setup. (This is shown by looking at the query start time for the 0-0 fragment in the query profile screen.) If you share the profile JSON files, we can also take a look and see what is up. thanks, Jacques -- Jacques Nadeau C

Re: Parallelism / data locality in HDFS/MapRFS

2016-03-07 Thread Ted Dunning
On Mon, Mar 7, 2016 at 3:02 PM, Eric Pederson wrote:
> I also tried converting the JSON files to Parquet using CTAS. The Parquet queries took much longer than the JSON queries. Is that expected as well?
No. That is not expected.

Re: Parallelism / data locality in HDFS/MapRFS

2016-03-08 Thread John Omernik
The slowness you saw with Parquet can be heavily dependent on how your CTAS was written. Did you cast to types as needed? Drill could be making some fast and loose assumptions about your data, and thus typing incorrectly. When I was in a similar scenario, I used some stronger typing and saw qu
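To make the "cast to types as needed" suggestion concrete, here is a minimal sketch that assembles a CTAS statement with an explicit CAST per column. It is not the CTAS anyone in this thread ran: the helper `build_ctas`, the target table `dfs.tmp.records_parquet`, the source path, and the choice of INT are all hypothetical; only `group1` and `field1` come from the example record.

```python
def build_ctas(target_table, source_path, typed_columns):
    """Return a CTAS statement that casts every selected column explicitly.

    typed_columns maps an output column name to (source expression, SQL type).
    """
    select_list = ",\n  ".join(
        f"CAST({expr} AS {sql_type}) AS {name}"
        for name, (expr, sql_type) in typed_columns.items()
    )
    return (
        f"CREATE TABLE {target_table} AS\n"
        f"SELECT\n  {select_list}\n"
        f"FROM {source_path} t"
    )


# Hypothetical table, path, and type; adjust to the real workspace and schema.
statement = build_ctas(
    "dfs.tmp.`records_parquet`",
    "dfs.`/data/records/*.json`",
    {"field1": ("t.group1.field1", "INT")},
)
print(statement)
```

The generated text would then be submitted to Drill through whatever client is already in use; the point is only that each output column carries a declared type rather than an inferred one.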

Re: Parallelism / data locality in HDFS/MapRFS

2016-03-08 Thread Jason Altekruse
Considering your description of the data, 1.5 GB per file with only 500 records in each gives you somewhere around 3 MB per record. This in itself doesn't necessarily cause an issue, but the structure of your example record makes me think you may have many individual columns in the nested structure, r
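As a back-of-the-envelope check on record size, the figures quoted in this thread (about 250 files, 1.5 GB per file, 500 records per file) can be plugged in directly. This is only a sketch: decimal units (1 GB = 10**9 bytes) are an assumption, and the variable names are illustrative.

```python
# Figures quoted in the thread; decimal units are an assumption.
file_count = 250
file_size_bytes = 1.5 * 10**9
records_per_file = 500

# Average size of one record and overall dataset totals.
avg_record_mb = file_size_bytes / records_per_file / 10**6
total_records = file_count * records_per_file
total_gb = file_count * file_size_bytes / 10**9

print(f"average record size: {avg_record_mb:.1f} MB")
print(f"total records: {total_records}")
print(f"total data: {total_gb:.0f} GB")
```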

Re: Parallelism / data locality in HDFS/MapRFS

2016-03-08 Thread Eric Pederson
Hi everyone - Thanks for your feedback. Answers to your questions below. The query plan JSON for the JSON query (where the performance flattened out) is at https://gist.github.com/sourcedelica/e826178a7de7e059fa9a. This was the plan with all three Drillbits running. Some more detail on the str

Re: Parallelism / data locality in HDFS/MapRFS

2016-03-08 Thread Eric Pederson
Here is the query plan JSON for the same query against the Parquet file that I CTASed: https://gist.github.com/sourcedelica/b05eeaf5df9e63b29654. It took 1794.618 seconds. -- Eric On Tue, Mar 8, 2016 at 1:44 PM, Eric Pederson wrote: > Hi everyone - > > Thanks for your feedback. Answers to yo