Re: Suggestion needed for UNION ALL performance in Apache drill

sreeparna bhabani Tue, 28 Apr 2020 08:31:11 -0700

Hi Paul and Team,

Please check the observation in the Jira ticket quoted below (DRILL-7720),
where we found that a UNION ALL query is not parallelized across multiple
nodes when it combines 2 types of dataset (Parquet and database). It is
parallelized, however, if we query the Parquet files on their own.


Is there any way to enforce parallel execution across multiple nodes?
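
For reference, a rough sketch (in Python, purely illustrative and not part of
Drill) of the baselines I am comparing against, using the Scenario 3 timings
reported further down in this thread. The function names are hypothetical
helpers introduced only for this arithmetic.

```python
# Timings in seconds, as reported for Scenario 3 later in this thread.
PARQUET_SECS = 5.0
DB_SECS = 1.0
OBSERVED_UNION_SECS = 20.0


def serial_baseline(branch_secs):
    """Expected time if the UNION ALL branches run one after another."""
    return sum(branch_secs)


def parallel_baseline(branch_secs):
    """Expected time if the branches run fully in parallel."""
    return max(branch_secs)


def unexplained_overhead(observed_secs, branch_secs):
    """Time not accounted for even by the serial (worst-case) baseline."""
    return observed_secs - serial_baseline(branch_secs)


branches = [PARQUET_SECS, DB_SECS]
print("serial baseline:  ", serial_baseline(branches))
print("parallel baseline:", parallel_baseline(branches))
print("unexplained:      ", unexplained_overhead(OBSERVED_UNION_SECS, branches))
```

With the reported numbers, even fully serial execution accounts for only
6 secs, so about 14 of the 20 secs observed remain unexplained.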

Thanks,
Sreeparna Bhabani


On Tue, 28 Apr 2020, 20:46 sreeparna bhabani, <[email protected]>
wrote:

>
> Hi Paul and Team,
>
> As you suggested, I have created a Jira ticket:
> https://issues.apache.org/jira/browse/DRILL-7720.
> I have included the details you asked for in the Jira. Please have a look.
> As the data is sensitive, I am creating a dummy dataset and will provide it
> once it is ready.
>
> Thanks,
> Sreeparna Bhabani
>
> On Fri, Apr 24, 2020 at 11:28 AM sreeparna bhabani <
> [email protected]> wrote:
>
>>
>> ---------- Forwarded message ---------
>> From: Paul Rogers <[email protected]>
>> Date: Thu, 23 Apr 2020, 23:59
>> Subject: Re: Suggestion needed for UNION ALL performance in Apache drill
>> To: <[email protected]>, sreeparna bhabani <
>> [email protected]>
>> Cc: <[email protected]>, <[email protected]>
>>
>>
>> Hi Sreeparna,
>>
>>
>> As suggested in the earlier e-mail, we would not expect to see different
>> performance in UNION ALL than in a simple scan. Clearly you've found some
>> kind of issue. The next step is to investigate that issue, which is a bit
>> hard to do over e-mail.
>>
>>
>> Please file a JIRA ticket to describe the issue and provide a
>> reproducible test case including query and data. If your data is sensitive,
>> please create a dummy data set, or use the provided TPC-H data set to
>> recreate the issue. We can then take a look to see what might be happening.
>>
>>
>> Thanks,
>>
>> - Paul
>>
>>
>>
>> On Thursday, April 23, 2020, 10:18:13 AM PDT, sreeparna bhabani <
>> [email protected]> wrote:
>>
>>
>> Hi Team,
>>
>> In addition to the mail below, I have another finding. Please consider
>> the scenarios below. The first 2 scenarios give the expected results in
>> terms of performance, but we are not getting the expected performance for
>> the 3rd scenario, which is a UNION ALL of 2 different types of datasets.
>>
>> *Scenario 1 - Parquet UNION ALL Parquet*
>> Individual execution time of 1st query - 5 secs
>> Individual execution time of 2nd query - 5 secs
>> UNION ALL of both queries execution time - 10 secs
>>
>> *Scenario 2 - DB query UNION ALL DB query*
>> Individual execution time of 1st query - 5 secs
>> Individual execution time of 2nd query - 5 secs
>> UNION ALL of both queries execution time - 10 secs
>>
>> *Scenario 3 - Parquet UNION ALL DB query*
>> Individual execution time of 1st query - 5 secs
>> Individual execution time of 2nd query - 1 sec
>> UNION ALL execution time - 20 secs
>> Ideally the execution time should not be more than 6 secs.
>>
>> May I request you to check whether the UNION ALL performance of the 3rd
>> scenario is expected with different dataset types?
>>
>> Please suggest if there is any specific way to bring down the execution
>> time of the 3rd scenario.
>>
>> Thanks in advance.
>>
>> Sreeparna Bhabani
>>
>>
>>
>> On Thu, 23 Apr 2020, 12:18 sreeparna bhabani, <
>> [email protected]> wrote:
>>
>> Hi Team,
>>
>> Apart from the issue below, I have another question.
>>
>> Is there any relation between the number of row groups and performance?
>>
>> In the query below, the number of files is 13 and numRowGroups is 69. Does
>> the UNION ALL take more time when the number of row groups is that high?
>>
>> Please note that the individual Parquet query takes 6 secs, but the UNION
>> ALL takes 20 secs. Details are given in the trailing mail.
>>
>> Thanks,
>> Sreeparna Bhabani
>>
>> On Thu, 23 Apr 2020, 11:08 sreeparna bhabani, <[email protected]>
>> wrote:
>>
>> Hi Paul,
>>
>> Please find the details below. We are using 2 Drillbits. Heap memory is
>> 16 GB and max direct memory is 32 GB. One query selects from Parquet; the
>> other selects from JDBC. The Parquet file size is 849 MB. The query is a
>> UNION ALL. There is no sorting.
>>
>> Single parquet query-
>> Total execution time - 6.6 sec
>> Scan time - 0.152 sec
>> Screen wait time - 5.3 sec
>>
>> Single JDBC query-
>> Total execution time - 0.261 sec
>> JDBC scan - 0.152 sec
>> Screen wait - 0.004 sec
>>
>>
>> Union all query -
>> Execution time - 21.118 sec
>> Screen wait time - 5.351 sec
>> Parquet scan - 15.368 sec
>> Unordered receiver wait time - 14.41 sec
>>
>> Thanks,
>> Sreeparna Bhabani
>>
>>
>> On Thu, 23 Apr 2020, 10:43 Paul Rogers, <[email protected]> wrote:
>>
>> Hi Sreeparna,
>>
>>
>> The short answer is it *should* work: a UNION ALL is simply an append.
>> (Be sure you are not using a plain UNION as that needs to do more work to
>> remove duplicates.)
>>
>>
>> Since you are seeing unexpected behavior, we may have some kind of issue
>> to investigate and perhaps fix. Always hard to do over e-mail, but let's
>> see what we can do.
>>
>>
>> The first question is to understand the full query: are you doing more
>> than a simple scan of two files and a UNION ALL? Are there sorts or joins
>> involved?
>>
>>
>> The best place to start to investigate performance issues is the query
>> profile, which it looks like you are doing. What is the time for the scans
>> if you run each of the two scans separately? You said that they take 8 and
>> 1 seconds. Is that for the whole query or just the scan operators?
>>
>>
>> Then, when you run the UNION ALL, again looking at the scan operators, is
>> there any difference in run times? If the scans take longer, that is one
>> thing to investigate. If the scans take the same amount of time, what other
>> operator(s) are taking the rest of the time? Your note suggests that it is
>> the scan taking the time. But, there should be two scan operators: one for
>> each file. How is the time divided between them?
>>
>>
>> How large are the data files? Using what storage system? How many
>> Drillbits? How much memory?
>>
>>
>> Thanks,
>>
>> - Paul
>>
>>
>>
>> On Wednesday, April 22, 2020, 11:32:24 AM PDT, sreeparna bhabani <
>> [email protected]> wrote:
>>
>>
>> Hi Team,
>>
>> I am reaching out about a specific problem regarding UNION ALL. There is
>> one UNION ALL statement which combines 2 queries. The individual queries
>> take 8 secs and 1 sec respectively, but the UNION ALL takes 30 secs.
>> PARQUET_SCAN_ROW_GROUP takes the most time. The Apache Drill version is
>> 1.17.
>>
>> Please suggest how to improve this UNION ALL performance. We are using
>> Parquet files.
>>
>> Thanks,
>> Sreeparna Bhabani
>>
>>
>
> --
>
> Thanks n Regards,
> *Sreeparna Bhabani*
>
