[ 
https://issues.apache.org/jira/browse/DRILL-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945559#comment-14945559
 ] 

Steven Phillips commented on DRILL-3901:
----------------------------------------

I'm not sure about doing the directory expansion twice, but I do know that in 
the case where there is a metadata file, we are loading the file twice. The 
first time we read the metadata file, we should pass the metadata object to 
ParquetGroupScan, and continue passing the metadata object to any clones of the 
ParquetGroupScan, so that we don't have to read and deserialize the file more 
than once. I didn't think this was a big enough deal to stop the release, but 
looking at these numbers, it might be worth fixing now rather than putting it 
off to the next release.
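
To make the suggestion concrete, here is a minimal sketch of the idea, not the 
actual Drill code. Every name in it (TableMetadata, MetadataLoader, 
GroupScanSketch, the path string) is a hypothetical stand-in; the real 
ParquetGroupScan constructor and clone methods have different signatures. The 
point is only that the deserialized metadata object is handed to the scan once 
and then shared by every copy, so the file is read a single time.

```java
import java.util.concurrent.atomic.AtomicInteger;

class MetadataReuseSketch {

    // Hypothetical stand-in for the deserialized metadata cache contents.
    static final class TableMetadata {
        final int fileCount;

        TableMetadata(int fileCount) {
            this.fileCount = fileCount;
        }
    }

    // Hypothetical loader; the counter records how often the expensive
    // read-and-deserialize step is performed.
    static final class MetadataLoader {
        static final AtomicInteger reads = new AtomicInteger();

        static TableMetadata read(String path) {
            reads.incrementAndGet();
            // A real implementation would read and deserialize the file here.
            return new TableMetadata(100_000);
        }
    }

    // Simplified stand-in for a group scan (assumption: not Drill's API).
    static final class GroupScanSketch {
        private final String path;
        private final TableMetadata metadata;

        GroupScanSketch(String path, TableMetadata metadata) {
            this.path = path;
            this.metadata = metadata;
        }

        // A copy shares the already-loaded metadata instead of re-reading it.
        GroupScanSketch copy() {
            return new GroupScanSketch(path, metadata);
        }

        TableMetadata metadata() {
            return metadata;
        }
    }

    public static void main(String[] args) {
        String path = "/hypothetical/table/dir"; // illustrative path only
        TableMetadata md = MetadataLoader.read(path);
        GroupScanSketch scan = new GroupScanSketch(path, md);
        GroupScanSketch clone = scan.copy().copy();
        System.out.println("reads=" + MetadataLoader.reads.get()
                + " shared=" + (clone.metadata() == md));
    }
}
```

With this shape, any number of copies made during planning still costs only 
one read of the metadata file.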

> Performance regression with doing Explain of COUNT(*) over 100K files
> ---------------------------------------------------------------------
>
>                 Key: DRILL-3901
>                 URL: https://issues.apache.org/jira/browse/DRILL-3901
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.1.0
>            Reporter: Aman Sinha
>            Assignee: Mehant Baid
>
> We are seeing a performance regression when doing an Explain of SELECT 
> COUNT(*) over 100K files in a flat directory (no subdirectories) on the latest 
> master branch compared to a run that was done on Sept 26. Some initial 
> details (I will have more later): 
> {code}
> master branch on Sept 26
>    No metadata cache: 71.452 secs
>    With metadata cache: 15.804 secs
> Latest master branch 
>    No metadata cache: 110 secs
>    With metadata cache: 32 secs
> {code}
> So, both cases show regression.  
> [~mehant] and I took an initial look at this and it appears we might be doing 
> the directory expansion twice.  
>    



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
