We are using MapR M3 and are querying multiple JSON files - around 250
files at 1.5 GB per file.  We have a small cluster of three machines
running Drill 1.4.  The JSON is nested three to four levels deep, in a
format like:
{
  "group1": {
     "field1": 42,
     "field2": [ "a", "b", "c" ],
     ...
  },
  "group2": {
     ...
  },
  ...
}

There are about 500 objects like this in each JSON file.
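
For what it's worth, pulling a leaf field out of this structure in Drill
looks roughly like the following (the path and aliases are placeholders,
not our real names; FLATTEN unnests the arrays):

SELECT t.group1.field1          AS field1,
       FLATTEN(t.group1.field2) AS field2_item
FROM   dfs.`/data/json/` t;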

I've been testing a set of queries that scan all of the data (we're
investigating a partitioning strategy but haven't settled on one that will
fit all of our queries, which are fairly ad hoc).  These full-scan queries
typically take 1 minute 20 seconds with the default settings.  If I limit
the query to a single file, it takes a few seconds.
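
The only difference between the two cases is the path in FROM, e.g.
(placeholder paths again):

-- full scan over all ~250 files in the directory
SELECT count(*) FROM dfs.`/data/json/` t WHERE t.group1.field1 = 42;

-- the same query against a single file
SELECT count(*) FROM dfs.`/data/json/file_001.json` t WHERE t.group1.field1 = 42;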

I wanted to see how the number of Drillbits would affect the query time, to
try to extrapolate the number of servers we'd need to hit a given
performance target.  Here are the times we saw:
- 1 Drillbit: 3:45
- 2 Drillbits: 1:56
- 3 Drillbits: 1:20
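
In seconds, that works out roughly to:

  1 -> 2 Drillbits: 225 s -> 116 s  (about 109 s saved)
  2 -> 3 Drillbits: 116 s ->  80 s  (about 36 s saved)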

The improvement flattens out between two and three Drillbits.  I was
surprised to see that, given the single-file query performance.  I was
hoping we could keep throwing hardware at the problem.  Is that result
surprising to you?

A somewhat related question: does Drill take advantage of HDFS data
locality?  That is, will it assign certain fragments to certain boxes
because it knows those boxes have the data replicated locally?  In our
setup (three servers) that might be a moot point, assuming every box has a
replica of every block.  I'm not sure whether MapR-FS changes that.

I also tried converting the JSON files to Parquet using CTAS.  The Parquet
queries took much longer than the JSON queries.  Is that expected as well?
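
For reference, the conversion was a plain CTAS over the same directory,
roughly along these lines (workspace and table names simplified here):

ALTER SESSION SET `store.format` = 'parquet';

CREATE TABLE dfs.tmp.`data_parquet` AS
SELECT * FROM dfs.`/data/json/`;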

Thanks,



-- 
Sent from Gmail Mobile
