I'll double check the debug logs.

We're getting about a 350ms delay for 70 files, about 200ms for 35 files,
and about 20-30ms for 1 file.
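For what it's worth, here's the back-of-envelope I'm working from (my own numbers, not anything pulled from the Drill code): those timings line up almost exactly with a fully serial read at ~5ms per footer, whereas a pooled read should be much flatter:

```java
// Back-of-envelope estimate only - assumes ~5ms per footer read,
// derived from the measurements above, not from Drill internals.
public class FooterReadEstimate {
    // If footer reads were serial, total delay ~ files * perFileMs.
    static long serialMs(int files, long perFileMs) {
        return files * perFileMs;
    }

    // If parallelized across a pool, ~ ceil(files / threads) * perFileMs.
    static long parallelMs(int files, int threads, long perFileMs) {
        return ((files + threads - 1) / threads) * perFileMs;
    }

    public static void main(String[] args) {
        long perFileMs = 5;  // ~350ms / 70 files from my runs
        System.out.println(serialMs(70, perFileMs));        // 350 - matches observed ~350ms
        System.out.println(serialMs(35, perFileMs));        // 175 - close to observed ~200ms
        System.out.println(parallelMs(70, 16, perFileMs));  // 25 - what a 16-way pool should give
    }
}
```

The serial model fitting this well is what makes me suspect the reads aren't actually overlapping.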

We're using HDFS.

It doesn't appear that it's just saturating HDFS with reads, either.
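I've been sanity-checking the threading side with a little harness along these lines - the sleep is a hypothetical stand-in for a single footer fetch (the ~5ms figure is from my measurements above), so this only exercises the pooling pattern, not HDFS itself:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Toy harness: checks that 70 pooled "footer reads" finish in far less
// wall time than the ~350ms a serial loop would take. The Thread.sleep(5)
// is a stand-in for one HDFS footer fetch, not a real read.
public class ParallelReadCheck {
    public static void main(String[] args) throws Exception {
        int files = 70;
        ExecutorService pool = Executors.newFixedThreadPool(16);
        long start = System.nanoTime();
        List<Future<?>> futures = new ArrayList<>();
        for (int i = 0; i < files; i++) {
            futures.add(pool.submit(() -> {
                try {
                    Thread.sleep(5);  // stand-in for one footer read (~5ms)
                } catch (InterruptedException ignored) {
                    Thread.currentThread().interrupt();
                }
            }));
        }
        for (Future<?> f : futures) {
            f.get();  // wait for all simulated reads to complete
        }
        pool.shutdown();
        long wallMs = (System.nanoTime() - start) / 1_000_000;
        // Serial would be ~350ms; a 16-thread pool should land well under that.
        System.out.println("wall time ms: " + wallMs);
    }
}
```

With the pool the toy version finishes in a fraction of the serial time, which is roughly the gap I'd expect to see in the real planning phase if the footer reads were overlapping properly.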

On Thu, May 7, 2015 at 8:30 PM, Jacques Nadeau <jacq...@apache.org> wrote:

> We log times for Parquet footer reading and block map building.  What are
> the reported times for each in your scenario?  Are you on HDFS or MFS?
>
> Thx
> On May 7, 2015 10:47 AM, "Adam Gilmore" <dragoncu...@gmail.com> wrote:
>
> > Hey, sorry, my mistake - you're right.  I didn't see it executing those
> > in TimedRunnables.  I wonder why, then, it's such a significant impact
> > for only 70 files.  I can pretty easily replicate it by using globbing to
> > select just a subset, then selecting the whole lot (i.e. 35 files takes
> > about 200ms of "planning" time).
> >
> > Sounds like caching that metadata would be a great start, though.
> > Especially with the Parquet pushdown filtering, it needs to reevaluate
> > those footers again to see how many files it may eliminate, so it'll
> > effectively be doing this work twice.
> >
> > I might do a bit more debugging and see if I can trap exactly where the
> > extra cost is.  I would think that unless the I/O is bottlenecking or
> > it's not parallelised somewhere, it shouldn't have such a significant
> > impact.
> >
> > On Thu, May 7, 2015 at 4:35 PM, Jacques Nadeau <jacq...@apache.org>
> > wrote:
> >
> > > The read should be parallelized.  See FooterGatherer.  What makes you
> > > think it isn't parallelized?
> > > We've seen this set of operations be expensive in some situations and
> > > quite bad in the case of hundreds of thousands of files.  We're working
> > > on improvements to this issue in this JIRA:
> > >
> > > https://issues.apache.org/jira/browse/DRILL-2743
> > >
> > > Note, I also think Steven has identified some places where we re-get
> > > FileStatus multiple times, which can also lead to poorer startup
> > > performance.  I'm not sure there is an issue open against this, but we
> > > should get one opened and resolved.
> > >
> > > On Wed, May 6, 2015 at 11:13 PM, Adam Gilmore <dragoncu...@gmail.com>
> > > wrote:
> > >
> > > > Just a follow-up - I have isolated that the delay is almost linear in
> > > > the number of Parquet files.  The footer read is quite expensive and
> > > > not parallelised at all (it is used for query planning).
> > > >
> > > > Is there any way to control the row group size when creating Parquet
> > > > files?  I could create fewer, larger files, but still want the
> > > > benefit of smaller row groups (as I have just added the Parquet
> > > > pushdown filtering).
> > > >
> > > > On Thu, May 7, 2015 at 4:08 PM, Adam Gilmore <dragoncu...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi guys,
> > > > >
> > > > > I've been looking at the speed of some of our queries and have
> > > > > noticed there is quite a significant delay before the query
> > > > > actually starts.
> > > > >
> > > > > For example, querying about 70 Parquet files in a directory, it
> > > > > takes about 370ms before it starts the first fragment.
> > > > >
> > > > > Obviously, considering it's not in the plan, it's very hard to see
> > > > > where exactly it's spending that 370ms without
> > > > > instrumenting/debugging.
> > > > >
> > > > > How can I troubleshoot where Drill is spending this 370ms?
> > > > >
> > > >
> > >
> >
>
