Hi Paul and Team,

Please check the observation described in the Jira below. We found that a UNION ALL query is not parallelized across multiple nodes when it combines two dataset types (Parquet and database), but it is parallelized when we query the Parquet files on their own.
Is there any way to enforce parallel execution in multiple nodes?

Thanks,
Sreeparna Bhabani

On Tue, 28 Apr 2020, 20:46 sreeparna bhabani, <[email protected]> wrote:

>
> Hi Paul and Team,
>
> As you suggested I have created a Jira ticket which is -
> https://issues.apache.org/jira/browse/DRILL-7720.
> I have mentioned details in the Jira you asked. Please have a look. As the
> data is sensitive, I am trying to create dummy dataset. Will provide once
> it is ready.
>
> Thanks,
> Sreeparna Bhabani
>
> On Fri, Apr 24, 2020 at 11:28 AM sreeparna bhabani <[email protected]> wrote:
>
>>
>> ---------- Forwarded message ---------
>> From: Paul Rogers <[email protected]>
>> Date: Thu, 23 Apr 2020, 23:59
>> Subject: Re: Suggestion needed for UNION ALL performance in Apache drill
>> To: <[email protected]>, sreeparna bhabani <[email protected]>
>> Cc: <[email protected]>, <[email protected]>
>>
>> Hi Sreeparna,
>>
>> As suggested in the earlier e-mail, we would not expect to see different
>> performance in UNION ALL than in a simple scan. Clearly you've found some
>> kind of issue. The next step is to investigate that issue, which is a bit
>> hard to do over e-mail.
>>
>> Please file a JIRA ticket to describe the issue and provide a
>> reproducible test case including query and data. If your data is sensitive,
>> please create a dummy data set, or use the provided TPC-H data set to
>> recreate the issue. We can then take a look to see what might be happening.
>>
>> Thanks,
>>
>> - Paul
>>
>> On Thursday, April 23, 2020, 10:18:13 AM PDT, sreeparna bhabani <[email protected]> wrote:
>>
>> Hi Team,
>>
>> In addition to the below mail I have another finding. Please consider
>> below scenarios. The first 2 scenarios are giving expected results in terms
>> of performance. But we are not getting expected performance for 3rd
>> scenario which is UNION ALL with 2 different types of datasets.
>>
>> *Scenario 1 - Parquet UNION ALL Parquet*
>> Individual execution time of 1st query - 5 secs
>> Individual execution time of 2nd query - 5 secs
>> UNION ALL of both queries execution time - 10 secs
>>
>> *Scenario 2 - DB query UNION ALL DB query*
>> Individual execution time of 1st query - 5 secs
>> Individual execution time of 2nd query - 5 secs
>> UNION ALL of both queries execution time - 10 secs
>>
>> *Scenario 3 - Parquet UNION ALL DB query*
>> Individual execution time of 1st query - 5 secs
>> Individual execution time of 2nd query - 1 sec
>> UNION ALL execution time - 20 secs
>> Ideally the execution time should not be more than 6 secs.
>>
>> May I request you to check whether the UNION ALL performance of the 3rd
>> scenario is expected with different dataset types.
>>
>> Please suggest if there is any specific way to bring down the execution
>> time of the 3rd scenario.
>>
>> Thanks in advance.
>>
>> Sreeparna Bhabani
>>
>> On Thu, 23 Apr 2020, 12:18 sreeparna bhabani, <[email protected]> wrote:
>>
>> Hi Team,
>>
>> Apart from the below issue I have another question.
>>
>> Is there any relation between the number of row groups and performance?
>>
>> In the below query the number of files is 13 and numRowGroups is 69. Does
>> UNION ALL take more time when the number of row groups is that high?
>>
>> Please note that the individual Parquet query takes 6 secs. But UNION ALL
>> takes 20 secs. Details are given in the trailing mail.
>>
>> Thanks,
>> Sreeparna Bhabani
>>
>> On Thu, 23 Apr 2020, 11:08 sreeparna bhabani, <[email protected]> wrote:
>>
>> Hi Paul,
>>
>> Please find the details below. We are using 2 drillbits. Heap memory 16 G,
>> Max direct memory 32 G. One query selects from Parquet. Another one
>> selects from JDBC. The parquet file size is 849 MB. It is UNION ALL. There
>> is no sorting.
>>
>> Single parquet query -
>> Total execution time - 6.6 sec
>> Scan time - 0.152 sec
>> Screen wait time - 5.3 sec
>>
>> Single JDBC query -
>> Total execution time - 0.261 sec
>> JDBC scan - 0.152 sec
>> Screen wait - 0.004 sec
>>
>> Union all query -
>> Execution time - 21.118 sec
>> Screen wait time - 5.351 sec
>> Parquet scan - 15.368 sec
>> Unordered receiver wait time - 14.41 sec
>>
>> Thanks,
>> Sreeparna Bhabani
>>
>> On Thu, 23 Apr 2020, 10:43 Paul Rogers, <[email protected]> wrote:
>>
>> Hi Sreeparna,
>>
>> The short answer is it *should* work: a UNION ALL is simply an append.
>> (Be sure you are not using a plain UNION as that needs to do more work to
>> remove duplicates.)
>>
>> Since you are seeing unexpected behavior, we may have some kind of issue
>> to investigate and perhaps fix. Always hard to do over e-mail, but let's
>> see what we can do.
>>
>> The first question is to understand the full query: are you doing more
>> than a simple scan of two files and a UNION ALL? Are there sorts or joins
>> involved?
>>
>> The best place to start to investigate performance issues is the query
>> profile, which it looks like you are doing. What is the time for the scans
>> if you run each of the two scans separately? You said that they take 8 and
>> 1 seconds. Is that for the whole query or just the scan operators?
>>
>> Then, when you run the UNION ALL, again looking at the scan operators, is
>> there any difference in run times? If the scans take longer, that is one
>> thing to investigate. If the scans take the same amount of time, what other
>> operator(s) are taking the rest of the time? Your note suggests that it is
>> the scan taking the time. But, there should be two scan operators: one for
>> each file. How is the time divided between them?
>>
>> How large are the data files? Using what storage system? How many
>> Drillbits? How much memory?
>>
>> Thanks,
>>
>> - Paul
>>
>> On Wednesday, April 22, 2020, 11:32:24 AM PDT, sreeparna bhabani <[email protected]> wrote:
>>
>> Hi Team,
>>
>> I am reaching out to you about a specific problem regarding UNION ALL.
>> There is one UNION ALL statement which combines 2 queries. The individual
>> queries are taking 8 secs and 1 sec respectively. But UNION ALL takes 30 secs.
>> PARQUET_SCAN_ROW_GROUP takes the maximum time. Apache Drill version is
>> 1.17.
>>
>> Please help to suggest how to improve this UNION ALL performance. We are
>> using Parquet files.
>>
>> Thanks,
>> Sreeparna Bhabani
>>
>
> --
>
> Thanks n Regards,
> *Sreeparna Bhabani*
>
