I feel like there is a little misunderstanding here.

Rahul, did you try setting the option that Steven suggested?
`store.partition.hash_distribute`
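
Steven also asked what its current value is; you can check it with a
query along these lines (a minimal sketch, and the exact columns of
sys.options differ a bit between Drill versions, so I just select *):

select * from sys.options where name = 'store.partition.hash_distribute';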

This will cause a re-distribution of the data so that the rows belonging
to a particular partition are all written by a single writer. They will
not necessarily all end up in one file, since we cap file sizes and I
don't think we cap partition size.
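
As far as I know, the file-size cap for parquet output comes from the
parquet block size (store.parquet.block-size), which you mentioned you
left at the default. If small files are what you are trying to avoid,
raising it is another knob, for example (purely illustrative value):

alter session set `store.parquet.block-size` = 1073741824;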

The default behavior is not to re-distribute, because re-distribution is
expensive. This, however, means that every fragment writes out a file for
whichever partition keys appear in the data that ends up at that fragment.
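
To put rough numbers on it with the figures from this thread: 8 fragments
each writing a file for up to 75780 distinct (l_shipdate, l_receiptdate)
combinations can produce up to 8 * 75780 = 606240 files, which matches
the count Rahul reported.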

If there is a large number of fragments and the data is spread fairly
randomly across them, there is a reasonable case for turning on this
option so that all of the data for a given partition is co-located on a
single writer, reducing the number of small files. There is no magic
formula for when it is best to turn it on, but in most cases it will
reduce the number of files produced.
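
If you want to try it, the session setting plus your original CTAS would
look roughly like this (the target table name here is just illustrative):

alter session set `store.partition.hash_distribute` = true;
create table lineitem_hashed partition by (l_shipdate, l_receiptdate) as
select * from dfs.`/drill/testdata/tpch100/lineitem`;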



On Wed, Aug 26, 2015 at 3:48 PM, rahul challapalli <
challapallira...@gmail.com> wrote:

> Well, this is for generating some test data.
>
> - Rahul
>
> On Wed, Aug 26, 2015 at 3:47 PM, Andries Engelbrecht <
> aengelbre...@maprtech.com> wrote:
>
> > Looks like Drill is doing the partitioning as requested then. May not be
> > optimal though.
> >
> > Is there a reason why you want to subpartition this much? You may be
> > better off to just partition by l_shipdate (not shipmate, autocorrect
> > got me there). Or use columns with much lower cardinality to test
> > subpartitioning.
> >
> > —Andries
> >
> >
> > > On Aug 26, 2015, at 3:05 PM, rahul challapalli <
> > challapallira...@gmail.com> wrote:
> > >
> > > Steven,
> > >
> > > You were right. The count is 606240, which is 8*75780.
> > >
> > >
> > > Stefan & Andries,
> > >
> > > Below is the distinct count or cardinality
> > >
> > > select count(*) from (select l_shipdate, l_receiptdate from
> > > dfs.`/drill/testdata/tpch100/lineitem`
> > > group by l_shipdate, l_receiptdate) sub;
> > > +---------+
> > > | EXPR$0  |
> > > +---------+
> > > | 75780   |
> > > +---------+
> > >
> > > - Rahul
> > >
> > >
> > >
> > >
> > >
> > > On Wed, Aug 26, 2015 at 1:26 PM, Andries Engelbrecht <
> > > aengelbre...@maprtech.com> wrote:
> > >
> > >> What is the distinct count for these columns? IIRC TPC-H has at least
> > >> 5 years of data irrespective of SF, so you are requesting a lot of
> > >> partitions. 76K sounds about right for 5 years of TPCH shipmate and
> > >> correlating receipt date data; note that your query doesn’t count the
> > >> actual files.
> > >>
> > >> Try to partition just on the shipmate column first.
> > >>
> > >> —Andries
> > >>
> > >>
> > >>> On Aug 26, 2015, at 12:34 PM, Stefán Baxter <ste...@activitystream.com>
> > >>> wrote:
> > >>>
> > >>> Hi,
> > >>>
> > >>> Is it possible that the combination of values of (l_shipdate,
> > >>> l_receiptdate) has a very high cardinality?
> > >>> I would think you are creating partition files for a small subset of
> > >>> the data.
> > >>>
> > >>> Please keep in mind that I know nothing about TPCH SF100 and only a
> > >>> little about Drill :).
> > >>>
> > >>> Regards,
> > >>> -Stefan
> > >>>
> > >>> On Wed, Aug 26, 2015 at 7:30 PM, Steven Phillips <s...@apache.org>
> > wrote:
> > >>>
> > >>>> It would be helpful if you could figure out what the file count is.
> > >>>> But here are some thoughts:
> > >>>>
> > >>>> What is the value of the option:
> > >>>> store.partition.hash_distribute
> > >>>>
> > >>>> If it is false, which it is by default, then every fragment will
> > >>>> potentially have data in every partition. In this case, that could
> > >>>> increase the number of files by a factor of 8.
> > >>>>
> > >>>> On Wed, Aug 26, 2015 at 12:21 PM, rahul challapalli <
> > >>>> challapallira...@gmail.com> wrote:
> > >>>>
> > >>>>> Drillers,
> > >>>>>
> > >>>>> I executed the below query on TPCH SF100 with drill and it took
> > >>>>> ~2hrs to complete on a 2-node cluster.
> > >>>>>
> > >>>>> alter session set `planner.width.max_per_node` = 4;
> > >>>>> alter session set `planner.memory.max_query_memory_per_node` = 8147483648;
> > >>>>> create table lineitem partition by (l_shipdate, l_receiptdate) as
> > >>>>> select * from dfs.`/drill/testdata/tpch100/lineitem`;
> > >>>>>
> > >>>>> The below query returned 75780, so I expected drill to create the
> > >>>>> same number of files, or maybe a little more. But drill created so
> > >>>>> many files that a "hadoop fs -count" command failed with a "GC
> > >>>>> overhead limit exceeded" error. (I did not change the default
> > >>>>> parquet block size.)
> > >>>>>
> > >>>>> select count(*) from (select l_shipdate, l_receiptdate from
> > >>>>> dfs.`/drill/testdata/tpch100/lineitem`
> > >>>>> group by l_shipdate, l_receiptdate) sub;
> > >>>>> +---------+
> > >>>>> | EXPR$0  |
> > >>>>> +---------+
> > >>>>> | 75780   |
> > >>>>> +---------+
> > >>>>>
> > >>>>>
> > >>>>> Any thoughts on why drill is creating so many files?
> > >>>>>
> > >>>>> - Rahul
> > >>>>>
> > >>>>
> > >>
> > >>
> >
> >
>
