Re: Suggestion needed for UNION ALL performance in Apache drill

sreeparna bhabani Wed, 06 May 2020 04:13:37 -0700

Hi Paul,

Thanks for the analysis you did. The issue occurs when we UNION ALL between
Parquet and database. Here find below more details on the scenario-


The size of the folder-
849 MB
The total number of parquet files under this folder - 19
Total number of rows in this folder - 25245 K
Selected rows from Parquet after filter- 169 K
Selected rows from database - 50

I have tested the select query from database with same number of rows
(16402 K). But it is not parallelized for database query. In fact I tried
DB UNION ALL DB also. But it is not parallelized ( each of the DB query has
109 K rows).

I tried "Parquet JOIN database" for testing purpose to check whether it is
parallelized during joining. Although the JDBC_SCAN was not parallelized,
PARQUET_ROW_GROUP_SCAN parallelized here.

FYI - The planner.slice_target = 100000

I have raised one JIRA ticket. Please find the same -
<https://issues.apache.org/jira/browse/DRILL-7720>
https://issues.apache.org/jira/browse/DRILL-7720

The query we used-
SELECT column1, column2,...column25
FROM    dfs.`root`.`Parquet`
WHERE column1 = <> AND column2 = <> AND column3 = <>
UNION ALL
SELECT  column1, column2,...column25
FROM    db.`Database`
WHERE column1 = <> AND column2 = <> AND column3 = <>

Below is the Plan after redacted names-
00-00 Screen
00-01 Project(Column1=[$0],...,Column25=[$24])
00-02 UnionExchange
01-01 UnionAll(all=[true])
01-03 Project (Column1=[$0],...,Column25=[$24])
01-05 SelectionVectorRemover
01-06
Filter(condition=[AND(=$0,'<value>',),=($1,'<value>',=($2,'<value>'))])
01-07 Scan(table=[[]], groupscan=[ParquetGroupScan], selectionroot=<>,
numFiles=13, numRowGroups=69, usedMetadatafile=true, usedMetastore=false,
filter=booleanAnd(..), cacheFileRoot=.., columns=[.....] )
01-02  Project(Column1=[$0],...,Column25=[$24])
01-04 Jdbc()

json-
{
"head" : {
"version":1,
"generator" : {"type":"ExplainHandler","info":""},
"type" : "APACHE_DRILL_PHYSICAL","options" :
[],"queue":0,"hasResourcePlan":false,"resultMode":"EXEC"
},
"graph":[
{"pop":"jdbc-scan"},
{"pop":"project","initialAllocation""
1000000,"maxAllocation":10000000000,"cost":{memoryCost}:4194304,"outputRowCount":1},
{"pop":"parquet-scan",
"cost":{"memoryCost":4194304,"outputRowCount":1.2287038E7} },
{"pop":"filter" ,"initialAllocation""
1000000,"maxAllocation":10000000000,"cost":{"memoryCost":4194304,"outputRowCount":41468.75}
},
{"pop":"selection-vector-remover" ,"initialAllocation""
1000000,"maxAllocation":10000000000,"cost":{memoryCost}:4194304,"outputRowCount":41468.75
}  ,
{"pop":"project" ,"initialAllocation""
1000000,"maxAllocation":10000000000,"cost":{"memoryCost":4194304,"outputRowCount":41468.75}
}
{"pop":"union-all" ,"initialAllocation""
1000000,"maxAllocation":10000000000,"cost":{"memoryCost":4194304,"outputRowCount":41468.75}
}    ,
{"pop":"union-exchange" ,"initialAllocation""
1000000,"maxAllocation":10000000000,"cost":{"memoryCost":4194304,"outputRowCount":41468.75
} }    ,
{"pop":"project" ,"initialAllocation""
1000000,"maxAllocation":10000000000,"cost":{"memoryCost":4194304,"outputRowCount":41468.75},"outputProj":true
},
{"pop":"screen" ,"initialAllocation""
1000000,"maxAllocation":10000000000,"cost":{"memoryCost":4194304,"outputRowCount":41468.75}
}
]
}

Please let me know if any other information you need.

Thanks,
Sreeparna Bhabani

On Wed, May 6, 2020 at 2:05 PM sreeparna bhabani <
[email protected]> wrote:

>
> ---------- Forwarded message ---------
> From: Paul Rogers <[email protected]>
> Date: Tue, 5 May 2020, 13:21
> Subject: Re: Suggestion needed for UNION ALL performance in Apache drill
> To: <[email protected]>
> Cc: Sreeparna Bhabani <[email protected]>, <[email protected]>,
> <[email protected]>
>
>
> Hi Sreeparna,
>
>
> Thanks much for digging into the details. SQL is pretty complex and things
> don't always work as we might expect.
>
>
> The first question is: which plan is correct: the (Parquet UNION ALL DB)
> plan or the (Parquet UNION ALL Parquet) plan? I tried poking around, and
> got no definitive answer on whether UNION ALL implies ordering. On the
> one hand, the parts of the standards that I could find didn't seem to imply
> ordering. A UNION (no ALL) can't imply ordering since it essentially does
> an anti-join which may be hash-partitioned. But, a StackOverflow post
> suggested that there is an implied ordering in the case of an ORDER BY on
> the sub-queries:
>
>
> (SELECT ... ORDER BY ...)
>
> UNION AL
>
> (SELECT ... ORDER BY ...)
>
>
> That is, if we can sort each sub-query, then doing so only makes sense if
> all results of the first are returned before any results of the second.
> (Have not checked if the above is valid in Drill. Even if it was, the
> planner should handle the above as a special case.)
>
>
> So, let's assume that the (Parquet UNION ALL Parquet) case is the correct
> behavior. Then, we can speculate that the planner is getting confused
> somehow in the mixed case. Each data source (file for Parquet, JDBC for the
> DB) parallelizes independently. Each decides it needs just one fragment.
> Somehow the planner must be saying, "well, if they both only want one
> fragment, let's run the whole query in a single fragment."
>
>
> Perhaps the decision is based on row count and the planner somehow thinks
> the row counts will be small for one or both of the queries. In fact, what
> happens if you do a query with (DB UNION ALL DB)? Drill's ability to
> estimate row counts is poor, especially from JDBC. Perhaps the planner is
> guessing the tables are small and parallelizing is unnecessary.
>
>
> My advice is to file a JIRA ticket with as much detail as you can provide.
> Certainly the information from your-email. Ideally, also the query (with
> names redacted if necessary.) Also, the JSON query plan (obtained by using
> EXPLAIN PLAN FOR), again with names redacted if necessary.
>
>
> With that info, we can dig a bit deeper to determine why the two cases
> come out differently.
>
>
> Thanks,
>
> - Paul
>
>
>
> On Monday, May 4, 2020, 9:39:12 AM PDT, sreeparna bhabani <
> [email protected]> wrote:
>
>
> Hi Team,
>
> After further checking on this UNION ALL, I found that UNION ALL
> (between Parquet and database) behaves as expected with limited number of
> rows and columns. But for a larger Parquet file and higher number of
> selected rows and columns, the UNION ALL takes much higher time than sum of
> the same of individual Parquet and DB Query.
>
> As per the analysis, it looks like the source of this issue is-
> Although we are using distributed mode, the UNION ALL query is executed
> only on 1 NODE in case of Parquet UNION ALL DB. It is not distributed and
> parallelized in multiple nodes.
>
> Whereas, for individual query or UNION ALL between same type datasets
> (Parquet + Parquet) it is getting distributed in 2 NODES.
>
> Do you have any finding / idea on this ?
>
> Thanks,
> Sreeparna Bhabani
>
> On Tue, Apr 28, 2020 at 9:00 PM sreeparna bhabani <
> [email protected]> wrote:
>
> > Hi Paul Team,
> >
> > Please check the observation mentioned in the  below Jira where we found
> > that UNION ALL query is not parallelized between multiple nodes when
> there
> > are 2 types dataset (Parquet and Database). But it is parallelized if we
> > query individual Parquet file.
> >
> > Is there any way to enforce parallel execution in multiple nodes ?
> >
> > Thanks,
> > Sreeparna Bhabani
> >
> >
> > On Tue, 28 Apr 2020, 20:46 sreeparna bhabani, <
> [email protected]>
> > wrote:
> >
> >>
> >> Hi Paul and Team,
> >>
> >> As you suggested I have created a Jira ticket which is  -
> >> https://issues.apache.org/jira/browse/DRILL-7720.
> >> I have mentioned details in the Jira you asked. Please have a look. As
> >> the data is sensitive, I am trying to create dummy dataset. Will
> >> provide once it is ready.
> >>
> >> Thanks,
> >> Sreeparna Bhabani
> >>
> >> On Fri, Apr 24, 2020 at 11:28 AM sreeparna bhabani <
> >> [email protected]> wrote:
> >>
> >>>
> >>> ---------- Forwarded message ---------
> >>> From: Paul Rogers <[email protected]>
> >>> Date: Thu, 23 Apr 2020, 23:59
> >>> Subject: Re: Suggestion needed for UNION ALL performance in Apache
> drill
> >>> To: <[email protected]>, sreeparna bhabani <
> >>> [email protected]>
> >>> Cc: <[email protected]>, <[email protected]>
> >>>
> >>>
> >>> Hi Sreeparna,
> >>>
> >>>
> >>> As suggested in the earlier e-mail, we would not expect to see
> different
> >>> performance in UNION ALL than in a simple scan. Clearly you've found
> some
> >>> kind of issue. The next step is to investigate that issue, which is a
> bit
> >>> hard to do over e-mail.
> >>>
> >>>
> >>> Please file a JIRA ticket to describe the issue and provide a
> >>> reproducible test case including query and data. If your data is
> sensitive,
> >>> please create a dummy data set, or use the provided TPC-H data set to
> >>> recreate the issue. We can then take a look to see what might be
> happening.
> >>>
> >>>
> >>> Thanks,
> >>>
> >>> - Paul
> >>>
> >>>
> >>>
> >>> On Thursday, April 23, 2020, 10:18:13 AM PDT, sreeparna bhabani <
> >>> [email protected]> wrote:
> >>>
> >>>
> >>> Hi Team,
> >>>
> >>> In addition to the below mail I have another finding. Please consider
> >>> below scenarios. The first 2 scenarios are giving expected results in
> terms
> >>> of performance. But we are not getting expected performance for 3rd
> >>> scenario which is UNION ALL with 2 different types of datasets.
> >>>
> >>> *Scenario 1- Parquet UNION ALL Parquet*
> >>> Individual execution time of 1st query - 5 secs
> >>> Individual execution time of 2nd query - 5 secs
> >>> UNION ALL of both queries execution time - 10 secs
> >>>
> >>> *Scenario 2 - DB query UNION ALL DB* *query*
> >>> Individual execution time of 1st query - 5 secs
> >>> Individual execution time of 2nd query - 5 secs
> >>> UNION ALL of both queries execution time - 10 secs
> >>>
> >>> *Scenario 3 - Parquet UNION ALL DB query*
> >>> Individual execution time of 1st query - 5 secs
> >>> Individual execution time of 2nd query - 1 sec
> >>> UNION ALL execution time - 20 secs
> >>> Ideally the execution time should not be more than 6 secs.
> >>>
> >>> May I request you to check whether the UNION ALL performance of 3rd
> >>> scenario is expected with different dataset types.
> >>>
> >>> Please suggest if there is any specific way to bring down the execution
> >>> time of 3rd scenario.
> >>>
> >>> Thanks in advance.
> >>>
> >>> Sreeparna Bhabani
> >>>
> >>>
> >>>
> >>> On Thu, 23 Apr 2020, 12:18 sreeparna bhabani, <
> >>> [email protected]> wrote:
> >>>
> >>> Hi Team,
> >>>
> >>> Apart from the below issue I have another question.
> >>>
> >>> Is there any relation between number of row groups and performance ?
> >>>
> >>> In the below query the number of files is 13 and numRowGroups is 69. Is
> >>> the UNION ALL takes more time if the number of rowgroup is high like
> that.
> >>>
> >>> Please note that the individual Parquet query takes 6 secs. But UNION
> >>> ALL takes 20 secs. Details are given in trail mail.
> >>>
> >>> Thanks,
> >>> Sreeparna Bhabani
> >>>
> >>> On Thu, 23 Apr 2020, 11:08 sreeparna bhabani, <[email protected]>
> >>> wrote:
> >>>
> >>> Hi Paul,
> >>>
> >>> Please find the details below. We are using 2 drillbits. Heap memory 16
> >>> G, Max direct memory 32 G. One query selects from Parquet. Another one
> >>> selects fron JDBC. The parquet file size is 849 MB. It is UNION ALL.
> There
> >>> is not sorting.
> >>>
> >>> Single parquet query-
> >>> Total execution time - 6.6 sec
> >>> Scan time - 0.152 sec
> >>> Screen wait time - 5.3 sec
> >>>
> >>> Single JDBC query-
> >>> Total execution time - 0.261 sec
> >>> JDBC scan - 0.152 sec
> >>> Screen wait - 0.004 sec
> >>>
> >>>
> >>> Union all query -
> >>> Execution time - 21. 118 sec
> >>> Screen wait time - 5.351 sec
> >>> Parquet scan - 15.368 sec
> >>> Unordered receiver wait time - 14.41 sec
> >>>
> >>> Thanks,
> >>> Sreeparna Bhabani
> >>>
> >>>
> >>> On Thu, 23 Apr 2020, 10:43 Paul Rogers, <[email protected]> wrote:
> >>>
> >>> Hi Sreeparna,
> >>>
> >>>
> >>> The short answer is it *should* work: a UNION ALL is simply an append.
> >>> (Be sure you are not using a plain UNION as that needs to do more work
> to
> >>> remove duplicates.)
> >>>
> >>>
> >>> Since you are seeing unexpected behavior, we may have some kind of
> issue
> >>> to investigate and perhaps fix. Always hard to do over e-mail, but
> let's
> >>> see what we can do.
> >>>
> >>>
> >>> The first question is to understand the full query: are you doing more
> >>> than a simple scan of two files and a UNION ALL? Are there sorts or
> joins
> >>> involved?
> >>>
> >>>
> >>> The best place to start to investigate performance issues is the query
> >>> profile, which it looks like you are doing. What is the time for the
> scans
> >>> if you run each of the two scans separately? You said that they take 8
> and
> >>> 1 seconds. Is that for the whole query or just the scan operators?
> >>>
> >>>
> >>> Then, when you run the UNION ALL, again looking at the scan operators,
> >>> is there any difference in run times? If the scans take longer, that
> is one
> >>> thing to investigate. If the scans take the same amount of time, what
> other
> >>> operator(s) are taking the rest of the time? Your note suggests that
> it is
> >>> the scan taking the time. But, there should be two scan operators: one
> for
> >>> each file. How is the time divided between them?
> >>>
> >>>
> >>> How large are the data files? Using what storage system? How many
> >>> Drillbits? How much memory?
> >>>
> >>>
> >>> Thanks,
> >>>
> >>> - Paul
> >>>
> >>>
> >>>
> >>> On Wednesday, April 22, 2020, 11:32:24 AM PDT, sreeparna bhabani <
> >>> [email protected]> wrote:
> >>>
> >>>
> >>> Hi Team,
> >>>
> >>> I reach out to you for a specific problem regarding UNION ALL. There is
> >>> one
> >>> UNION ALL statement which combines 2 queries. The individual queries
> are
> >>> taking 8 secs and 1 sec respectively. But UNION ALL takes 30 secs.
> >>> PARQUET_SCAN_ROW_GROUP takes the maximum time. Apache drill version is
> >>> 1.17.
> >>>
> >>> Please help to suggest how to improve this UNION ALL performance. We
> are
> >>> using parquet file.
> >>>
> >>> Thanks,
> >>> Sreeparna Bhabani
> >>>
> >>>
> >>
> >> --
> >>
> >> Thanks n Regards,
>
> >> *Sreeparna Bhabani*
> >>
> >
>
> --
>
> Thanks n Regards,
> *Sreeparna Bhabani*
>


-- 

Thanks n Regards,
*Sreeparna Bhabani*

Re: Suggestion needed for UNION ALL performance in Apache drill

Reply via email to