We log timings for both Parquet footer reading and block map building.
What are the reported times for each in your scenario?  Are you on HDFS
or MFS?

Thx
On May 7, 2015 10:47 AM, "Adam Gilmore" <dragoncu...@gmail.com> wrote:

> Hey, sorry, my mistake - you're right.  I didn't see it executing those
> in TimedRunnable.  I wonder, then, why it has such a significant impact
> for only 70 files.  I can replicate it pretty easily by using globbing
> to select just a subset and then selecting the whole lot (e.g. 35 files
> takes about 200ms of "planning" time).
>
> Sounds like caching that metadata would be a great start, though.
> Especially with the Parquet pushdown filtering, Drill needs to
> re-evaluate those footers to see how many files it can eliminate, so
> it'll effectively be doing this work twice.
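>
> Roughly what I have in mind is the sketch below - untested, with a
> hypothetical FooterCache class and key format, assuming parquet-mr's
> ParquetFileReader.readFooter helper:
>
> import java.io.IOException;
> import java.util.concurrent.ConcurrentHashMap;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.parquet.hadoop.ParquetFileReader;
> import org.apache.parquet.hadoop.metadata.ParquetMetadata;
>
> /** Illustrative footer cache keyed on path + modification time. */
> public class FooterCache {
>   private final ConcurrentHashMap<String, ParquetMetadata> cache =
>       new ConcurrentHashMap<>();
>   private final Configuration conf;
>
>   public FooterCache(Configuration conf) { this.conf = conf; }
>
>   public ParquetMetadata get(FileSystem fs, Path file) throws IOException {
>     FileStatus status = fs.getFileStatus(file);
>     // Key includes the modification time so a rewritten file
>     // invalidates its cached footer.
>     String key = file.toString() + "@" + status.getModificationTime();
>     ParquetMetadata footer = cache.get(key);
>     if (footer == null) {
>       footer = ParquetFileReader.readFooter(conf, file);
>       cache.put(key, footer);
>     }
>     return footer;
>   }
> }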
>
> I might do a bit more debugging and see if I can pin down exactly where
> the extra cost is.  Unless the I/O is the bottleneck or something isn't
> parallelised, I wouldn't expect such a significant impact.
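>
> As a first step I'll probably time a raw footer read outside Drill to
> see how much of this is pure I/O - something like this untested sketch
> (FooterTimer is just a throwaway name):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.parquet.hadoop.ParquetFileReader;
> import org.apache.parquet.hadoop.metadata.ParquetMetadata;
>
> public class FooterTimer {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     long start = System.nanoTime();
>     // Reads only the footer (metadata), not the row group data.
>     ParquetMetadata footer =
>         ParquetFileReader.readFooter(conf, new Path(args[0]));
>     long elapsedMs = (System.nanoTime() - start) / 1_000_000;
>     System.out.println(args[0] + ": " + footer.getBlocks().size()
>         + " row groups, footer read in " + elapsedMs + " ms");
>   }
> }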
>
> On Thu, May 7, 2015 at 4:35 PM, Jacques Nadeau <jacq...@apache.org> wrote:
>
> > The read should be parallelized.  See FooterGatherer.  What makes you
> > think it isn't parallelized?
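> >
> > For reference, the basic pattern is just a fan-out/fan-in with one
> > footer read per file.  Conceptually it looks something like the
> > sketch below (illustrative only - the real code lives in
> > FooterGatherer and TimedRunnable; this uses a plain ExecutorService):
> >
> > import java.util.ArrayList;
> > import java.util.List;
> > import java.util.concurrent.Callable;
> > import java.util.concurrent.ExecutorService;
> > import java.util.concurrent.Executors;
> > import java.util.concurrent.Future;
> > import org.apache.hadoop.conf.Configuration;
> > import org.apache.hadoop.fs.Path;
> > import org.apache.parquet.hadoop.ParquetFileReader;
> > import org.apache.parquet.hadoop.metadata.ParquetMetadata;
> >
> > public class ParallelFooterRead {
> >   public static List<ParquetMetadata> readAll(final Configuration conf,
> >       List<Path> files, int threads) throws Exception {
> >     ExecutorService pool = Executors.newFixedThreadPool(threads);
> >     try {
> >       // Fan-out: one footer-read task per file.
> >       List<Future<ParquetMetadata>> futures = new ArrayList<>();
> >       for (final Path file : files) {
> >         futures.add(pool.submit(new Callable<ParquetMetadata>() {
> >           public ParquetMetadata call() throws Exception {
> >             return ParquetFileReader.readFooter(conf, file);
> >           }
> >         }));
> >       }
> >       // Fan-in: wait for every read to complete.
> >       List<ParquetMetadata> footers = new ArrayList<>();
> >       for (Future<ParquetMetadata> f : futures) {
> >         footers.add(f.get());
> >       }
> >       return footers;
> >     } finally {
> >       pool.shutdown();
> >     }
> >   }
> > }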
> >
> > We've seen this set of operations be expensive in some situations, and
> > quite bad in the case of hundreds of thousands of files.  We're
> > working on an improvement to this issue in this JIRA:
> >
> > https://issues.apache.org/jira/browse/DRILL-2743
> >
> > Note, I also think Steven has identified some places where we re-get
> > the FileStatus multiple times, which can also lead to poorer startup
> > performance.  I'm not sure there is an issue open against this, but we
> > should get one opened and resolved.
> >
> > On Wed, May 6, 2015 at 11:13 PM, Adam Gilmore <dragoncu...@gmail.com>
> > wrote:
> >
> > > Just a follow-up - I have isolated that planning time is almost
> > > linear in the number of Parquet files.  The footer read is quite
> > > expensive and not parallelised at all (the footers are used for
> > > query planning).
> > >
> > > Is there any way to control the row group size when creating Parquet
> > > files?  I could create fewer, larger files, but I still want the
> > > benefit of smaller row groups (as I have just implemented the
> > > Parquet pushdown filtering).
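> > >
> > > For files I write myself with parquet-mr I know I can set it on the
> > > writer - e.g. this untested sketch using the AvroParquetWriter
> > > builder (the 32 MB value is arbitrary) - but I'm not sure of the
> > > equivalent when Drill itself creates the files via CTAS
> > > (store.parquet.block-size, maybe?):
> > >
> > > import org.apache.avro.Schema;
> > > import org.apache.avro.generic.GenericRecord;
> > > import org.apache.hadoop.fs.Path;
> > > import org.apache.parquet.avro.AvroParquetWriter;
> > > import org.apache.parquet.hadoop.ParquetWriter;
> > >
> > > public class SmallRowGroupWriter {
> > >   public static ParquetWriter<GenericRecord> open(Path file,
> > >       Schema schema) throws Exception {
> > >     // Smaller row groups give finer-grained pushdown elimination,
> > >     // at the cost of more metadata per file.
> > >     return AvroParquetWriter.<GenericRecord>builder(file)
> > >         .withSchema(schema)
> > >         .withRowGroupSize(32 * 1024 * 1024)  // vs. 128 MB default
> > >         .build();
> > >   }
> > > }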
> > >
> > > On Thu, May 7, 2015 at 4:08 PM, Adam Gilmore <dragoncu...@gmail.com>
> > > wrote:
> > >
> > > > Hi guys,
> > > >
> > > > I've been looking at the speed of some of our queries and have
> > > > noticed that there is quite a significant delay before the query
> > > > actually starts.
> > > >
> > > > For example, when querying about 70 Parquet files in a directory,
> > > > it takes about 370ms before the first fragment starts.
> > > >
> > > > Obviously, since this time doesn't show up in the plan, it's very
> > > > hard to see exactly where those 370ms are spent without
> > > > instrumenting/debugging.
> > > >
> > > > How can I troubleshoot where Drill is spending this 370ms?
> > > >
> > >
> >
>
