Re: Parallelism / data locality in HDFS/MapRFS

Eric Pederson Tue, 08 Mar 2016 10:46:21 -0800

Hi everyone -

Thanks for your feedback. Answers to your questions are below.


The query plan JSON for the JSON query (where the performance flattened
out) is at https://gist.github.com/sourcedelica/e826178a7de7e059fa9a.
This was the plan with all three Drillbits running.
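
As a rough way to quantify how much the timings quoted below flatten out,
here is a minimal Python sketch. It converts the three wall-clock times to
seconds, computes speedup and parallel efficiency, and fits a simple
Amdahl-style model T(n) = serial + parallel / n to the 1- and 2-Drillbit
runs. The model and the helper names (speedup, efficiency, fit_amdahl,
predict) are assumptions for illustration only; nothing here comes from
Drill itself.

```python
# Wall-clock times from the quoted message, converted to seconds.
timings = {1: 3 * 60 + 45, 2: 1 * 60 + 56, 3: 1 * 60 + 20}


def speedup(timings, n):
    """Speedup of n Drillbits relative to a single Drillbit."""
    return timings[1] / timings[n]


def efficiency(timings, n):
    """Parallel efficiency: speedup divided by the number of Drillbits."""
    return speedup(timings, n) / n


def fit_amdahl(timings):
    """Fit T(n) = serial + parallel / n using only the n=1 and n=2 timings.

    This is an assumed model, not something Drill reports.
    """
    parallel = 2 * (timings[1] - timings[2])
    serial = timings[1] - parallel
    return serial, parallel


def predict(serial, parallel, n):
    """Predicted wall-clock time in seconds for n Drillbits under the model."""
    return serial + parallel / n


serial, parallel = fit_amdahl(timings)
for n in sorted(timings):
    print(
        f"{n} Drillbit(s): measured {timings[n]} s, "
        f"speedup {speedup(timings, n):.2f}, "
        f"efficiency {efficiency(timings, n):.2f}, "
        f"model {predict(serial, parallel, n):.1f} s"
    )
```

Comparing the model column with the measured 3-Drillbit time gives a quick
check on whether a fixed, non-parallel cost could explain the curve, though
three data points are too few to draw firm conclusions.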

Some more detail on the structure of the JSON.  There are 8 objects at the
first level.  The biggest one has a little over 500 fields, the majority of
them being arrays of numbers or arrays of strings.  The next biggest group
contains around 300 flat objects.  The rest of the groups are fairly small,
20-40 fields.

I didn't do any casting in the CTAS from JSON to Parquet due to the sheer
number of fields. :)
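
If casting does turn out to matter, writing the CTAS by hand for this many
fields is impractical, so one option is to generate it. The sketch below
builds a Drill CTAS statement with a CAST per scalar field from a
path-to-type mapping. The table name, source path, and mapping are
hypothetical placeholders, the build_ctas helper is my own, and array
fields are left out because they cannot be cast this way.

```python
def build_ctas(table, source, columns):
    """Build a Drill CTAS statement that casts each listed scalar column.

    columns maps a dotted path inside the JSON (e.g. "group1.field1")
    to a SQL type name (e.g. "INT"). All names are supplied by the caller.
    """
    select_items = []
    for path, sql_type in columns.items():
        # Nested fields are referenced through a table alias: t.`group1`.`field1`
        ref = "t." + ".".join(f"`{part}`" for part in path.split("."))
        # Flatten the dotted path into a single output column name.
        alias = path.replace(".", "_")
        select_items.append(f"CAST({ref} AS {sql_type}) AS `{alias}`")
    select_list = ",\n  ".join(select_items)
    return f"CREATE TABLE {table} AS\nSELECT\n  {select_list}\nFROM {source} t"


# Hypothetical usage: the workspace, path, and field types are placeholders.
sql = build_ctas(
    "dfs.tmp.`events_parquet`",
    "dfs.`/data/events`",
    {"group1.field1": "INT"},
)
print(sql)
```

The mapping could be produced from a sample of the JSON rather than typed
in, which keeps the approach workable with several hundred fields.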

Thanks,



-- Eric

On Mon, Mar 7, 2016 at 6:02 PM, Eric Pederson <[email protected]> wrote:

> We are using MapR M3 and are querying multiple JSON files - around 250
> files at 1.5 GB per file. We have a small cluster of three machines
> running Drill 1.4. The JSON is nested three to four levels deep, in a
> format like:
> {
>   "group1": {
>     "field1": 42,
>     "field2": [ "a", "b", "c" ],
>     ...
>   },
>   "group2": {
>     ....
>   },
>   ...
> }
>
> There are about 500 objects like this in each JSON file.
>
> I've been testing a set of queries that scan all of the data (we're
> investigating a partitioning strategy but haven't settled on one that will
> fit all of our queries, which are fairly ad hoc). These full-scan queries
> typically take 1 minute, 20 seconds using the default settings. If I limit
> the query to a single file, the query takes a few seconds.
>
> I wanted to see how the number of Drillbits would impact the query time,
> to try to extrapolate to the number of servers needed to reach a
> performance number.   Here are the numbers that we saw:
> - 1 Drillbit: 3:45
> - 2 Drillbits: 1:56
> - 3 Drillbits: 1:20
>
> The performance flattens out between two and three Drillbits.   I was
> surprised to see that, given the single file query performance.  I was
> hoping to throw hardware at the performance a bit more.   Is that
> surprising to you?
>
> A somewhat related question.  Does Drill take advantage of HDFS locality?
> That is, will it send certain fragments to certain boxes because it knows
> those boxes have the data replicated locally?  Actually in our setup (3
> servers) that might be a moot point assuming every box has all blocks.  I'm
> not sure if MapRFS changes that.
>
> I also tried converting the JSON files to Parquet using CTAS.  The Parquet
> queries took much longer than the JSON queries.  Is that expected as well?
>
> Thanks,
>
>
>
> --
> Sent from Gmail Mobile
>
