Hi Paul and Team,

As you suggested, I have created a Jira ticket: https://issues.apache.org/jira/browse/DRILL-7720. I have included the details you asked for in the ticket. Please have a look. As the data is sensitive, I am creating a dummy dataset and will provide it once it is ready.

Thanks,
Sreeparna Bhabani

On Fri, Apr 24, 2020 at 11:28 AM sreeparna bhabani <[email protected]> wrote:

> ---------- Forwarded message ---------
> From: Paul Rogers <[email protected]>
> Date: Thu, 23 Apr 2020, 23:59
> Subject: Re: Suggestion needed for UNION ALL performance in Apache drill
> To: <[email protected]>, sreeparna bhabani <[email protected]>
> Cc: <[email protected]>, <[email protected]>
>
> Hi Sreeparna,
>
> As suggested in the earlier e-mail, we would not expect to see different performance in UNION ALL than in a simple scan. Clearly you've found some kind of issue. The next step is to investigate that issue, which is a bit hard to do over e-mail.
>
> Please file a JIRA ticket to describe the issue and provide a reproducible test case including query and data. If your data is sensitive, please create a dummy data set, or use the provided TPC-H data set to recreate the issue. We can then take a look to see what might be happening.
>
> Thanks,
> - Paul
>
> On Thursday, April 23, 2020, 10:18:13 AM PDT, sreeparna bhabani <[email protected]> wrote:
>
> Hi Team,
>
> In addition to the below mail I have another finding. Please consider below scenarios. The first 2 scenarios are giving expected results in terms of performance. But we are not getting expected performance for 3rd scenario which is UNION ALL with 2 different types of datasets.
>
> *Scenario 1 - Parquet UNION ALL Parquet*
> Individual execution time of 1st query - 5 secs
> Individual execution time of 2nd query - 5 secs
> UNION ALL of both queries execution time - 10 secs
>
> *Scenario 2 - DB query UNION ALL DB query*
> Individual execution time of 1st query - 5 secs
> Individual execution time of 2nd query - 5 secs
> UNION ALL of both queries execution time - 10 secs
>
> *Scenario 3 - Parquet UNION ALL DB query*
> Individual execution time of 1st query - 5 secs
> Individual execution time of 2nd query - 1 sec
> UNION ALL execution time - 20 secs
> Ideally the execution time should not be more than 6 secs.
>
> May I request you to check whether the UNION ALL performance of the 3rd scenario is expected with different dataset types.
>
> Please suggest if there is any specific way to bring down the execution time of the 3rd scenario.
>
> Thanks in advance.
>
> Sreeparna Bhabani
>
> On Thu, 23 Apr 2020, 12:18 sreeparna bhabani, <[email protected]> wrote:
>
> Hi Team,
>
> Apart from the below issue I have another question.
>
> Is there any relation between the number of row groups and performance?
>
> In the below query the number of files is 13 and numRowGroups is 69. Does the UNION ALL take more time if the number of row groups is high like that?
>
> Please note that the individual Parquet query takes 6 secs. But UNION ALL takes 20 secs. Details are given in the trailing mail.
>
> Thanks,
> Sreeparna Bhabani
>
> On Thu, 23 Apr 2020, 11:08 sreeparna bhabani, <[email protected]> wrote:
>
> Hi Paul,
>
> Please find the details below. We are using 2 drillbits. Heap memory 16 G, Max direct memory 32 G. One query selects from Parquet. Another one selects from JDBC. The parquet file size is 849 MB. It is UNION ALL. There is no sorting.
>
> Single parquet query -
> Total execution time - 6.6 sec
> Scan time - 0.152 sec
> Screen wait time - 5.3 sec
>
> Single JDBC query -
> Total execution time - 0.261 sec
> JDBC scan - 0.152 sec
> Screen wait - 0.004 sec
>
> Union all query -
> Execution time - 21.118 sec
> Screen wait time - 5.351 sec
> Parquet scan - 15.368 sec
> Unordered receiver wait time - 14.41 sec
>
> Thanks,
> Sreeparna Bhabani
>
> On Thu, 23 Apr 2020, 10:43 Paul Rogers, <[email protected]> wrote:
>
> Hi Sreeparna,
>
> The short answer is it *should* work: a UNION ALL is simply an append. (Be sure you are not using a plain UNION as that needs to do more work to remove duplicates.)
>
> Since you are seeing unexpected behavior, we may have some kind of issue to investigate and perhaps fix. Always hard to do over e-mail, but let's see what we can do.
>
> The first question is to understand the full query: are you doing more than a simple scan of two files and a UNION ALL? Are there sorts or joins involved?
>
> The best place to start to investigate performance issues is the query profile, which it looks like you are doing. What is the time for the scans if you run each of the two scans separately? You said that they take 8 and 1 seconds. Is that for the whole query or just the scan operators?
>
> Then, when you run the UNION ALL, again looking at the scan operators, is there any difference in run times? If the scans take longer, that is one thing to investigate. If the scans take the same amount of time, what other operator(s) are taking the rest of the time? Your note suggests that it is the scan taking the time. But, there should be two scan operators: one for each file. How is the time divided between them?
>
> How large are the data files? Using what storage system? How many Drillbits? How much memory?
>
> Thanks,
> - Paul
>
> On Wednesday, April 22, 2020, 11:32:24 AM PDT, sreeparna bhabani <[email protected]> wrote:
>
> Hi Team,
>
> I am reaching out to you about a specific problem regarding UNION ALL. There is one UNION ALL statement which combines 2 queries. The individual queries take 8 secs and 1 sec respectively, but the UNION ALL takes 30 secs. PARQUET_SCAN_ROW_GROUP takes the maximum time. The Apache Drill version is 1.17.
>
> Please help to suggest how to improve this UNION ALL performance. We are using Parquet files.
>
> Thanks,
> Sreeparna Bhabani

--
Thanks n Regards,
*Sreeparna Bhabani*
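
The profile numbers quoted in this thread can be lined up with a few lines of Python. This is only a rough sketch: the timings are copied from the messages above, while the dictionary layout and the function name `union_all_overhead` are made up for illustration and are not part of Drill. It assumes, as Paul's reply suggests, that a UNION ALL is a plain append, so the sum of the two standalone run times is treated as a loose upper bound on what the combined query should take.

```python
# Timings in seconds, copied from the query profiles quoted in this thread.
# The dictionary layout is hypothetical; Drill does not export profiles this way.
PROFILE_SECONDS = {
    "parquet_only": {"total": 6.6, "scan": 0.152, "screen_wait": 5.3},
    "jdbc_only": {"total": 0.261, "scan": 0.152, "screen_wait": 0.004},
    "union_all": {
        "total": 21.118,
        "parquet_scan": 15.368,
        "screen_wait": 5.351,
        "unordered_receiver_wait": 14.41,
    },
}


def union_all_overhead(profiles):
    """Return (expected, observed, overhead) in seconds.

    Assumption: UNION ALL only appends the two result sets, so running the
    branches back to back is a rough upper bound on the combined run time.
    """
    expected = profiles["parquet_only"]["total"] + profiles["jdbc_only"]["total"]
    observed = profiles["union_all"]["total"]
    return expected, observed, observed - expected


if __name__ == "__main__":
    expected, observed, overhead = union_all_overhead(PROFILE_SECONDS)
    print(f"sum of standalone queries: {expected:.3f} s")
    print(f"UNION ALL observed:        {observed:.3f} s")
    print(f"time not accounted for:    {overhead:.3f} s")

    # Compare the Parquet scan time inside the UNION ALL with the standalone scan.
    scan_ratio = (
        PROFILE_SECONDS["union_all"]["parquet_scan"]
        / PROFILE_SECONDS["parquet_only"]["scan"]
    )
    print(f"Parquet scan time ratio:   {scan_ratio:.1f}x")
```

With these figures the gap between the observed UNION ALL time and the sum of the standalone queries is a little over 14 seconds, which is close to the reported unordered receiver wait time of 14.41 sec. That may be a useful place to start when reading the profile, although nothing in the thread confirms a cause.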
