I will write up a defect. The first test you suggested below - running the
query on just one of our Parquet files produces the same result (10-12
minute planning time). However, the second test - using
cp.`tpch/nation.parquet` - results in a planning time of only about a
minute. So, I'm not sure how to interpret that. What does that mean to you
all?

- Dave

On Tue, Feb 14, 2017 at 12:37 PM, Jinfeng Ni <j...@apache.org> wrote:

> Normally, the slow query planning could be caused by :
>
> 1. Some planner rule hit a bug when processing certain operators in
> the query, for instance join operator, distinct aggregate.  The query
> I tried on a small file seems to rule out this possibility.
> 2. The parquet metadata access time. According to the long, this does
> not seem to be the issue.
> 3. Something we are not aware of.
>
> To help get some clue, can you help do the following:
> 1. run the query over one single parquet files, in stead of 100
> parquet files? You can change using
> dfs.`parquet/transaction/OneSingleFile.parquet`. I'm wondering if the
> planning time is proportional to # of parquet files.
>
> 2. What if you try your query by replacing
> dfs.`parquet/transaction/OneSingleFile.parquet` with
> cp.`tpch/nation.parquet` which is a small tpch parquet file (you need
> re-enable the storage plugin 'cp')? Run EXPLAIN should be fine. This
> will tell us if the problem is caused by the parquet source, or the
> query itself.
>
> Yes, please create a defect in Drill JIRA.
>
> On Tue, Feb 14, 2017 at 5:02 AM, David Kincaid <kincaid.d...@gmail.com>
> wrote:
> > Thank you for the feedback. It seems there is nothing more I can do on my
> > end. What are my next steps? Shall I create a defect in the Drill Jira?
> >
> > - Dave
> >
> > On Mon, Feb 13, 2017 at 5:13 PM, Jinfeng Ni <j...@apache.org> wrote:
> >
> >> The size of parquet files will matter in terms of meta data access
> >> time, which is just 212 ms according to your log file. My
> >> understanding is it does not matter too much to the overall planning
> >> times. That's why it probably makes sense to try over such a small toy
> >> example.
> >>
> >> Normally the planning time for such simple query should be much
> >> shorter than 12 minutes.  It indicates it could be caused by a code
> >> bug, or something else that we are currently unaware of.
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Mon, Feb 13, 2017 at 2:47 PM, David Kincaid <kincaid.d...@gmail.com>
> >> wrote:
> >> > The example in DRILL-5183 is just a very small toy example to
> demonstrate
> >> > the bug with how Drill reads Parquet array fields. It doesn't have
> >> anything
> >> > to do with this planning issue (at least I don't think it does). Sorry
> >> if I
> >> > confused things with that reference.
> >> >
> >> > I just tried running our query directly against the table at
> >> > dfs.`parquet/transaction` and get the same result (12 minutes of
> planning
> >> > time). I disabled the cp and s3 storage plugins that were enabled so
> that
> >> > only the dfs storage plugin is enabled and the result is the same.
> >> >
> >> > Is this expected for Drill to take this long in the planning phase
> for a
> >> > query? Is there anything else I can try or information I could
> provide to
> >> > help identify the bug (seems like a bug to me)? I really appreciate
> you
> >> > guys helping out so quickly this afternoon.
> >> >
> >> > - Dave
> >> >
> >> > On Mon, Feb 13, 2017 at 4:13 PM, Jinfeng Ni <j...@apache.org> wrote:
> >> >
> >> >> I downloaded books.parquet from DRILL-5183, and created a view on top
> >> >> of this single parquet file. Then, run EXPLAIN for the query, and it
> >> >> completes within 1.2 seconds on Drill 1.8.0 release. (The # of
> parquet
> >> >> files would impact the time to fetch metadata. Since it's not the
> >> >> bottleneck in this case, it should not cause a big difference).
> >> >>
> >> >> Do you see the long planning time issue for this query only, or it
> >> >> happens for other queries as well? Besides the possibility of
> planning
> >> >> rule bugs, we once saw another possible cause of long planning issue.
> >> >> In your storage plugin configuration, if you enable some other
> storage
> >> >> plugin (for instance, hbase, or hive etc) which are slow to access,
> >> >> then those un-relevant storage plugin might impact your query as
> well.
> >> >> You may temporarily disable those storage plugins, and see if it's
> the
> >> >> cause of the problem.
> >> >>
> >> >> 0: jdbc:drill:zk=local> explain plan for
> >> >> . . . . . . . . . . . > select fltb1.sapId, yearmo,
> >> >> . . . . . . . . . . . > COUNT(*) as totalcnt,
> >> >> . . . . . . . . . . . > count(distinct(CASE
> >> >> . . . . . . . . . . . >                WHEN
> >> >> . . . . . . . . . . . >                (REPEATED_CONTAINS(fltb1.
> >> >> classLabels,
> >> >> . . . . . . . . . . . >
> >> >> 'Thing:Service:MedicalService:Diagnostic:Radiology:Ultrasound.*'))
> >> >> . . . . . . . . . . . >                THEN fltb1.invoiceId
> >> >> . . . . . . . . . . . >                END)) as ultracount,
> >> >> . . . . . . . . . . . > count(distinct (CASE
> >> >> . . . . . . . . . . . >                 WHEN
> >> >> . . . . . . . . . . . >                 (REPEATED_CONTAINS(fltb1.
> >> >> classLabels,
> >> >> . . . . . . . . . . . >
> >> >> 'Thing:Service:MedicalService:Diagnostic:LaboratoryTest.*'))
> >> >> . . . . . . . . . . . >                 THEN fltb1.invoiceId
> >> >> . . . . . . . . . . . >                 END)) as labcount
> >> >> . . . . . . . . . . . > from (
> >> >> . . . . . . . . . . . >   select sapid, invoiceId,
> >> >> . . . . . . . . . . . >         TO_CHAR(TO_TIMESTAMP(
> transactionDate,
> >> >> 'YYYY-MM-dd HH:mm:ss.SSSSSS'), 'yyyy-MM') yearmo,
> >> >> . . . . . . . . . . . >         classLabels
> >> >> . . . . . . . . . . . >       from dfs.tmp.transactionView) fltb1
> >> >> . . . . . . . . . . . > group by fltb1.sapId, yearmo;
> >> >> +------+------+
> >> >> | text | json |
> >> >> +------+------+
> >> >> | 00-00    Screen
> >> >> 00-01      Project(sapId=[$0], yearmo=[$1], totalcnt=[$2],
> >> >> ultracount=[$3], labcount=[$4])
> >> >> ....................................
> >> >> 00-09                SelectionVectorRemover
> >> >> 00-12                  Sort(sort0=[$0], sort1=[$1], dir0=[ASC],
> >> dir1=[ASC])
> >> >> 00-15                    HashAgg(group=[{0, 1}], totalcnt=[COUNT()])
> >> >> ................................
> >> >> 00-22                          Scan(groupscan=[ParquetGroupScan
> >> >> [entries=[ReadEntryWithPath [path=file:/tmp/parquet/transaction]],
> >> >> selectionRoot=file:/tmp/parquet/transaction, numFiles=1,
> >> >> usedMetadataFile=false, columns=[`sapId`, `invoiceId`,
> >> >> `transactionDate`, `classLabels`.`array`]]])
> >> >>
> >> >> 1 row selected (1.195 seconds)
> >> >>
> >> >>
> >> >> On Mon, Feb 13, 2017 at 1:51 PM, David Kincaid <
> kincaid.d...@gmail.com>
> >> >> wrote:
> >> >> > Here is the entire transactionView.view.drill file. As you can see
> the
> >> >> view
> >> >> > itself is very simple and is just wrapping a syntactic problem with
> >> the
> >> >> > array field. That's an issue I reported in Jira under DRILL-5183 (
> >> >> > https://issues.apache.org/jira/browse/DRILL-5183)
> >> >> >
> >> >> > {
> >> >> >   "name" : "transactionView",
> >> >> >   "sql" : "SELECT `transactionRowKey`, `sapId`, `practiceName`,
> >> >> > `practiceCity`, `practiceState`, `practicePostalCode`, `animalId`,
> >> >> > `dateOfBirth`, `species`, `breed`, `gender`, `status`, `ownerId`,
> >> >> > `itemType`, `classification`, `subclass`, `practiceDescription`,
> >> >> > `clientDescription`, `invoiceId`, `unitOfMeasure`, `vendorName`,
> >> >> `vaccine`,
> >> >> > `rabies`, `vaccineType`, `price`, `quantity`, `transactionDate`,
> >> >> > `visitReason`, `speciesCode`, `genderCode`,
> >> `t`.`classLabels`['array'] AS
> >> >> > `classLabels`\nFROM `dfs`.`/parquet/transaction` AS `t`",
> >> >> >   "fields" : [ {
> >> >> >     "name" : "transactionRowKey",
> >> >> >     "type" : "ANY",
> >> >> >     "isNullable" : true
> >> >> >   }, {
> >> >> >     "name" : "sapId",
> >> >> >     "type" : "ANY",
> >> >> >     "isNullable" : true
> >> >> >   }, {
> >> >> >     "name" : "practiceName",
> >> >> >     "type" : "ANY",
> >> >> >     "isNullable" : true
> >> >> >   }, {
> >> >> >     "name" : "practiceCity",
> >> >> >     "type" : "ANY",
> >> >> >     "isNullable" : true
> >> >> >   }, {
> >> >> >     "name" : "practiceState",
> >> >> >     "type" : "ANY",
> >> >> >     "isNullable" : true
> >> >> >   }, {
> >> >> >     "name" : "practicePostalCode",
> >> >> >     "type" : "ANY",
> >> >> >     "isNullable" : true
> >> >> >   }, {
> >> >> >     "name" : "animalId",
> >> >> >     "type" : "ANY",
> >> >> >     "isNullable" : true
> >> >> >   }, {
> >> >> >     "name" : "dateOfBirth",
> >> >> >     "type" : "ANY",
> >> >> >     "isNullable" : true
> >> >> >   }, {
> >> >> >     "name" : "species",
> >> >> >     "type" : "ANY",
> >> >> >     "isNullable" : true
> >> >> >   }, {
> >> >> >     "name" : "breed",
> >> >> >     "type" : "ANY",
> >> >> >     "isNullable" : true
> >> >> >   }, {
> >> >> >     "name" : "gender",
> >> >> >     "type" : "ANY",
> >> >> >     "isNullable" : true
> >> >> >   }, {
> >> >> >     "name" : "status",
> >> >> >     "type" : "ANY",
> >> >> >     "isNullable" : true
> >> >> >   }, {
> >> >> >     "name" : "ownerId",
> >> >> >     "type" : "ANY",
> >> >> >     "isNullable" : true
> >> >> >   }, {
> >> >> >     "name" : "itemType",
> >> >> >     "type" : "ANY",
> >> >> >     "isNullable" : true
> >> >> >   }, {
> >> >> >     "name" : "classification",
> >> >> >     "type" : "ANY",
> >> >> >     "isNullable" : true
> >> >> >   }, {
> >> >> >     "name" : "subclass",
> >> >> >     "type" : "ANY",
> >> >> >     "isNullable" : true
> >> >> >   }, {
> >> >> >     "name" : "practiceDescription",
> >> >> >     "type" : "ANY",
> >> >> >     "isNullable" : true
> >> >> >   }, {
> >> >> >     "name" : "clientDescription",
> >> >> >     "type" : "ANY",
> >> >> >     "isNullable" : true
> >> >> >   }, {
> >> >> >     "name" : "invoiceId",
> >> >> >     "type" : "ANY",
> >> >> >     "isNullable" : true
> >> >> >   }, {
> >> >> >     "name" : "unitOfMeasure",
> >> >> >     "type" : "ANY",
> >> >> >     "isNullable" : true
> >> >> >   }, {
> >> >> >     "name" : "vendorName",
> >> >> >     "type" : "ANY",
> >> >> >     "isNullable" : true
> >> >> >   }, {
> >> >> >     "name" : "vaccine",
> >> >> >     "type" : "ANY",
> >> >> >     "isNullable" : true
> >> >> >   }, {
> >> >> >     "name" : "rabies",
> >> >> >     "type" : "ANY",
> >> >> >     "isNullable" : true
> >> >> >   }, {
> >> >> >     "name" : "vaccineType",
> >> >> >     "type" : "ANY",
> >> >> >     "isNullable" : true
> >> >> >   }, {
> >> >> >     "name" : "price",
> >> >> >     "type" : "ANY",
> >> >> >     "isNullable" : true
> >> >> >   }, {
> >> >> >     "name" : "quantity",
> >> >> >     "type" : "ANY",
> >> >> >     "isNullable" : true
> >> >> >   }, {
> >> >> >     "name" : "transactionDate",
> >> >> >     "type" : "ANY",
> >> >> >     "isNullable" : true
> >> >> >   }, {
> >> >> >     "name" : "visitReason",
> >> >> >     "type" : "ANY",
> >> >> >     "isNullable" : true
> >> >> >   }, {
> >> >> >     "name" : "speciesCode",
> >> >> >     "type" : "ANY",
> >> >> >     "isNullable" : true
> >> >> >   }, {
> >> >> >     "name" : "genderCode",
> >> >> >     "type" : "ANY",
> >> >> >     "isNullable" : true
> >> >> >   }, {
> >> >> >     "name" : "classLabels",
> >> >> >     "type" : "ANY",
> >> >> >     "isNullable" : true
> >> >> >   } ],
> >> >> >   "workspaceSchemaPath" : [ ]
> >> >> > }
> >> >> >
> >> >> > On Mon, Feb 13, 2017 at 3:47 PM, Jinfeng Ni <j...@apache.org>
> wrote:
> >> >> >
> >> >> >> Yes, the log confirmed that the planning, especially physical
> >> >> >> planning, is the one that took most of the time.
> >> >> >>
> >> >> >> If the definition of view s3.cisexport.transactionView is not very
> >> >> >> complicated (involves large # of tables), then it's possible that
> >> some
> >> >> >> planner rules have a bug. (In the past, we once saw couple of
> planner
> >> >> >> rules would be fired in a loop).
> >> >> >>
> >> >> >> Is it possible that you can share the DDL of the view?  That may
> help
> >> >> >> us re-produce the problem and take a look at the trace of Calcite,
> >> >> >> which Drill uses as the query planner.
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >>
> >>
>

Reply via email to