I'll double-check the debug logs. We're seeing about a 350ms delay for 70 files, about 200ms for 35 files, and about 20-30ms for a single file.
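Those measurements are close to linear in file count. A quick least-squares fit (pure Python, using only the numbers reported above, with 25ms taken as the midpoint of the 20-30ms single-file figure) makes the per-file overhead explicit:

```python
# Rough linear fit of planning delay vs. Parquet file count,
# using the (file_count, delay_ms) measurements reported above.
samples = [(1, 25), (35, 200), (70, 350)]

# Least-squares slope/intercept, no external libraries needed.
n = len(samples)
sx = sum(x for x, _ in samples)
sy = sum(y for _, y in samples)
sxx = sum(x * x for x, _ in samples)
sxy = sum(x * y for x, y in samples)
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n

print(f"~{slope:.1f} ms per file + ~{intercept:.0f} ms fixed cost")
```

The fit comes out to roughly 4-5ms per file on top of a small fixed cost, which is consistent with a per-file footer read happening (at least partly) serially.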
We're using HDFS. It doesn't appear that it's just saturating HDFS with reads, either.

On Thu, May 7, 2015 at 8:30 PM, Jacques Nadeau <jacq...@apache.org> wrote:

> We log for Parquet footer reading and block Map building. What are the
> reported times for each in your scenario? Are you on HDFS or MFS?
>
> Thx
>
> On May 7, 2015 10:47 AM, "Adam Gilmore" <dragoncu...@gmail.com> wrote:
>
> > Hey sorry my mistake - you're right. Didn't see it executing those in
> > TimedRunnables. I wonder why then it's such a significant impact for
> > only 70 files. I can pretty easily replicate it by using the globbing
> > to select just a subset, then select the whole lot (i.e. 35 files takes
> > about 200ms "planning" time).
> >
> > Sounds like caching that metadata would be a great start, though.
> > Especially with the Parquet pushdown filtering, it needs to reevaluate
> > those footers again to see how many files it may eliminate, thus it'll
> > effectively be doing this effort twice.
> >
> > I might do a bit more debugging and see if I can trap exactly where the
> > extra cost is. I would think unless the I/O is bottlenecking or it's
> > not parallelised somewhere, it shouldn't be such a significant impact.
> >
> > On Thu, May 7, 2015 at 4:35 PM, Jacques Nadeau <jacq...@apache.org> wrote:
> >
> > > The read should be parallelized. See FooterGatherer. What makes you
> > > think it isn't parallelized?
> > >
> > > We've seen this set of operations be expensive in some situations and
> > > quite bad in the case of 100,000s of files. We're working on
> > > improvements to this issue with this jira:
> > >
> > > https://issues.apache.org/jira/browse/DRILL-2743
> > >
> > > Note, I also think Steven has identified some places where we re-get
> > > FileStatus multiple times, which can also lead to poorer start
> > > performance. I'm not sure there is an issue open against this, but we
> > > should get one opened and resolved.
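For reference, the serial-vs-parallel distinction being debated here can be sketched with a thread pool. This is not Drill's FooterGatherer: the `read_footer` helper below is a hypothetical stand-in (it just stats the file) and the files are throwaway stubs, but it shows the fan-out pattern that keeps per-file footer reads from adding up linearly:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def read_footer(path):
    # Stand-in for a Parquet footer read. Drill issues one such read per
    # file during planning; fanning them out across a pool (rather than
    # looping serially) is what keeps 70 files from costing 70x one file.
    return path, os.stat(path).st_size

# Create a handful of throwaway stub files to "read footers" from.
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(8):
    p = os.path.join(tmpdir, f"part-{i}.parquet")
    with open(p, "wb") as f:
        f.write(b"PAR1" + bytes(16) + b"PAR1")  # stub, not a real Parquet file
    paths.append(p)

# Fan the per-file reads out across a small pool.
with ThreadPoolExecutor(max_workers=4) as pool:
    footers = dict(pool.map(read_footer, paths))

print(len(footers))  # one entry per file
```

On a remote filesystem like HDFS, each footer read is dominated by round-trip latency rather than CPU, which is exactly the case where this kind of pooling pays off.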
> > >
> > > On Wed, May 6, 2015 at 11:13 PM, Adam Gilmore <dragoncu...@gmail.com> wrote:
> > >
> > > > Just a follow up - I have isolated that it is almost linear
> > > > according to the number of Parquet files. The footer read is quite
> > > > expensive and not parallelised at all (it uses it for query
> > > > planning).
> > > >
> > > > Is there any way to control the row group size when creating
> > > > Parquet files? I could create fewer, larger files, but still want
> > > > the benefit of smaller row groups (as I have just done the Parquet
> > > > pushdown filtering).
> > > >
> > > > On Thu, May 7, 2015 at 4:08 PM, Adam Gilmore <dragoncu...@gmail.com> wrote:
> > > >
> > > > > Hi guys,
> > > > >
> > > > > I've been looking at the speed of some of our queries and have
> > > > > noticed there is quite a significant delay before the query
> > > > > actually starts.
> > > > >
> > > > > For example, querying about 70 Parquet files in a directory, it
> > > > > takes about 370ms before it starts the first fragment.
> > > > >
> > > > > Obviously, considering it's not in the plan, it's very hard to
> > > > > see where exactly it's spending that 370ms without
> > > > > instrumenting/debugging.
> > > > >
> > > > > How can I troubleshoot where Drill is spending this 370ms?
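On the row-group-size question raised above: if the files are being written by Drill itself (CTAS), Drill exposes a session option for the Parquet row group ("block") size. A minimal config sketch, assuming a Drill 1.x setup where `store.parquet.block-size` is the relevant option (verify the name and default against `sys.options` on your version; the table paths below are hypothetical):

```sql
-- Assumption: store.parquet.block-size (bytes) controls the row group
-- size of Parquet files written by CTAS. Check sys.options to confirm
-- on your Drill version.
ALTER SESSION SET `store.parquet.block-size` = 67108864;  -- 64 MB row groups

CREATE TABLE dfs.tmp.`example_out` AS SELECT * FROM dfs.`/data/source`;
```

This lets you write fewer, larger files while keeping row groups small enough for pushdown filtering to skip work, which is the trade-off described above.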