[ https://issues.apache.org/jira/browse/DRILL-4982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15637501#comment-15637501 ]
ASF GitHub Bot commented on DRILL-4982:
---------------------------------------

Github user chunhui-shi commented on the issue:

    https://github.com/apache/drill/pull/638

    The performance degradation is worse with larger tables: it exceeds 300% on a 15GB table. After this fix, the degradation disappears.

    The fix also brings visible performance gains in cases where there was no degradation:
    For a simple query on an ORC/Parquet table through HiveReaders, I observed an improvement of about 10%-25%, averaging 17.6%.
    For TPCH tests on a 10 node cluster with Parquet tables + HiveReaders, the average improvement (of two runs) is 5.1%.


> Hive Queries degrade when queries switch between different formats
> ------------------------------------------------------------------
>
>                 Key: DRILL-4982
>                 URL: https://issues.apache.org/jira/browse/DRILL-4982
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Chunhui Shi
>            Assignee: Jinfeng Ni
>            Priority: Critical
>
> We have seen degraded performance by doing these steps:
> 1) Generate the repro data with the Python script repro.py below:
> import string
> import random
>
> for i in range(30000000):
>     x1 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
>     x2 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
>     x3 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
>     x4 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
>     x5 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
>     x6 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
>     print "{0}".format(x1),"{0}".format(x2),"{0}".format(x3),"{0}".format(x4),"{0}".format(x5),"{0}".format(x6)
>
> python repro.py > repro.csv
> 2) Put these files in a dfs directory, e.g. '/tmp/hiveworkspace/plain'. At the
> hive prompt, use the following SQL command to create an external table:
> CREATE EXTERNAL TABLE `hiveworkspace`.`plain` (`id1` string, `id2` string,
> `id3` string, `id4` string, `id5` string, `id6` string) ROW FORMAT SERDE
> 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS TEXTFILE LOCATION
> '/tmp/hiveworkspace/plain'
> 3) Create Hive tables in ORC and PARQUET format:
> CREATE TABLE `hiveworkspace`.`plainorc` STORED AS ORC AS SELECT
> id1,id2,id3,id4,id5,id6 from `hiveworkspace`.`plain`;
> CREATE TABLE `hiveworkspace`.`plainparquet` STORED AS PARQUET AS SELECT
> id1,id2,id3,id4,id5,id6 from `hiveworkspace`.`plain`;
> 4) Switch queries between these two tables; the query time on the same
> table then lengthens significantly. On my setup, for ORC, it went from
> 15 secs to 26 secs. Queries on tables of other formats also slow down
> significantly after a query against a different format is injected.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
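
For anyone rerunning step 1 of the reproduction on Python 3, a minimal sketch of an equivalent generator follows. It is not the reporter's script: the function names (random_field, make_row, write_rows) and the seed parameter are hypothetical, and joining the fields with commas is an assumption made here so that a comma-separated reader sees six columns per row (the script in the report prints the fields separated by spaces).

```python
import random
import string

# Same character set as the script in the report: A-Z plus 0-9.
ALPHABET = string.ascii_uppercase + string.digits


def random_field(rng):
    """Return one random string of 19 to 26 characters (randrange(19, 27))."""
    return "".join(rng.choice(ALPHABET) for _ in range(rng.randrange(19, 27)))


def make_row(rng, columns=6):
    """Return one row of `columns` random fields.

    Assumption: fields are comma-separated; the original script used spaces.
    """
    return ",".join(random_field(rng) for _ in range(columns))


def write_rows(out, count, seed=None):
    """Write `count` rows to the file-like object `out`.

    `seed` is a hypothetical convenience for repeatable data; the script in
    the report does not seed the generator.
    """
    rng = random.Random(seed)
    for _ in range(count):
        out.write(make_row(rng) + "\n")
```

Under these assumptions, calling write_rows with a file opened as 'repro.csv' and a count of 30000000 would produce a data set of the same shape as the one described in step 1.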