[ https://issues.apache.org/jira/browse/HIVE-10746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14552508#comment-14552508 ]
Greg Senia commented on HIVE-10746:
-----------------------------------

Seems that the single file, with a group by/order by, is generating 40040 splits. I think the map file is needed at this point to determine why this is happening, correct?

2015-05-19 16:20:32,462 INFO [AsyncDispatcher event handler] impl.VertexImpl: Num tasks is -1. Expecting VertexManager/InputInitializers/1-1 split to set #tasks for the vertex vertex_1426958683478_171530_1_00
2015-05-19 16:20:32,707 DEBUG [InputInitializer [Map 1] #0] security.UserGroupInformation: PrivilegedAction as:gss2002 (auth:SIMPLE) from:org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:239)
2015-05-19 16:20:32,708 INFO [InputInitializer [Map 1] #0] dag.RootInputInitializerManager: Starting InputInitializer for Input: crc_arsn on vertex vertex_1426958683478_171530_1_00 [Map 1]
2015-05-19 16:20:32,722 INFO [InputInitializer [Map 1] #0] log.PerfLogger: <PERFLOG method=getSplits from=org.apache.hadoop.hive.ql.io.HiveInputFormat>
2015-05-19 16:20:32,723 INFO [InputInitializer [Map 1] #0] exec.Utilities: PLAN PATH = hdfs://xhadnnm1p.example.com:8020/tmp/hive/gss2002/431ae2bc-ebc9-48e7-bbb3-f03144198009/hive_2015-05-19_16-20-28_783_5570914503219655045-1/gss2002/_tez_scratch_dir/9da6870e-7388-40b1-bab6-9d0f242b1702/map.xml
2015-05-19 16:20:32,723 DEBUG [InputInitializer [Map 1] #0] exec.Utilities: Found plan in cache for name: map.xml
2015-05-19 16:20:32,744 INFO [InputInitializer [Map 1] #0] exec.Utilities: Processing alias crc_arsn
2015-05-19 16:20:32,744 INFO [InputInitializer [Map 1] #0] exec.Utilities: Adding input file hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn
2015-05-19 16:20:32,747 INFO [InputInitializer [Map 1] #0] io.HiveInputFormat: hive.io.file.readcolumn.ids=
2015-05-19 16:20:32,747 INFO [InputInitializer [Map 1] #0] io.HiveInputFormat: hive.io.file.readcolumn.names=,arsn_cd,appl_user_id
2015-05-19 16:20:32,747 INFO [InputInitializer [Map 1] #0] io.HiveInputFormat: Generating splits
2015-05-19 16:20:32,780 DEBUG [InputInitializer [Map 1] #0] hdfs.BlockReaderLocal: dfs.client.use.legacy.blockreader.local = false
2015-05-19 16:20:32,780 DEBUG [InputInitializer [Map 1] #0] hdfs.BlockReaderLocal: dfs.client.read.shortcircuit = true
2015-05-19 16:20:32,780 DEBUG [InputInitializer [Map 1] #0] hdfs.BlockReaderLocal: dfs.client.domain.socket.data.traffic = false
2015-05-19 16:20:32,780 DEBUG [InputInitializer [Map 1] #0] hdfs.BlockReaderLocal: dfs.domain.socket.path = /var/lib/hadoop-hdfs/dn_socket
2015-05-19 16:20:32,781 DEBUG [InputInitializer [Map 1] #0] retry.RetryUtils: multipleLinearRandomRetry = null
2015-05-19 16:20:32,782 DEBUG [InputInitializer [Map 1] #0] ipc.Client: getting client out of cache: org.apache.hadoop.ipc.Client@7879a53d
2015-05-19 16:20:32,785 DEBUG [InputInitializer [Map 1] #0] hdfs.BlockReaderLocal: dfs.client.use.legacy.blockreader.local = false
2015-05-19 16:20:32,785 DEBUG [InputInitializer [Map 1] #0] hdfs.BlockReaderLocal: dfs.client.read.shortcircuit = true
2015-05-19 16:20:32,785 DEBUG [InputInitializer [Map 1] #0] hdfs.BlockReaderLocal: dfs.client.domain.socket.data.traffic = false
2015-05-19 16:20:32,785 DEBUG [InputInitializer [Map 1] #0] hdfs.BlockReaderLocal: dfs.domain.socket.path = /var/lib/hadoop-hdfs/dn_socket
2015-05-19 16:20:32,786 DEBUG [InputInitializer [Map 1] #0] retry.RetryUtils: multipleLinearRandomRetry = null
2015-05-19 16:20:32,786 DEBUG [InputInitializer [Map 1] #0] ipc.Client: getting client out of cache: org.apache.hadoop.ipc.Client@7879a53d
2015-05-19 16:20:32,876 DEBUG [InputInitializer [Map 1] #0] mapred.FileInputFormat: Time taken to get FileStatuses: 87
2015-05-19 16:20:32,876 INFO [InputInitializer [Map 1] #0] mapred.FileInputFormat: Total input paths to process : 1
2015-05-19 16:20:32,881 DEBUG [InputInitializer [Map 1] #0] hdfs.BlockReaderLocal: dfs.client.use.legacy.blockreader.local = false
2015-05-19 16:20:32,881 DEBUG [InputInitializer [Map 1] #0] hdfs.BlockReaderLocal: dfs.client.read.shortcircuit = true
2015-05-19 16:20:32,881 DEBUG [InputInitializer [Map 1] #0] hdfs.BlockReaderLocal: dfs.client.domain.socket.data.traffic = false
2015-05-19 16:20:32,881 DEBUG [InputInitializer [Map 1] #0] hdfs.BlockReaderLocal: dfs.domain.socket.path = /var/lib/hadoop-hdfs/dn_socket
2015-05-19 16:20:32,882 DEBUG [InputInitializer [Map 1] #0] retry.RetryUtils: multipleLinearRandomRetry = null
2015-05-19 16:20:32,883 DEBUG [InputInitializer [Map 1] #0] ipc.Client: getting client out of cache: org.apache.hadoop.ipc.Client@7879a53d
2015-05-19 16:20:32,907 DEBUG [InputInitializer [Map 1] #0] mapred.FileInputFormat: Total # of splits generated by getSplits: 40040, TimeTaken: 124
2015-05-19 16:20:32,916 INFO [InputInitializer [Map 1] #0] io.HiveInputFormat: number of splits 40040
2015-05-19 16:20:32,916 INFO [InputInitializer [Map 1] #0] log.PerfLogger: </PERFLOG method=getSplits start=1432066832722 end=1432066832916 duration=194 from=org.apache.hadoop.hive.ql.io.HiveInputFormat>
2015-05-19 16:20:32,917 INFO [InputInitializer [Map 1] #0] tez.HiveSplitGenerator: Number of input splits: 40040. 23542 available slots, 1.7 waves. Input format is: org.apache.hadoop.hive.ql.io.HiveInputFormat
2015-05-19 16:20:32,917 INFO [InputInitializer [Map 1] #0] exec.Utilities: PLAN PATH = hdfs://xhadnnm1p.example.com:8020/tmp/hive/gss2002/431ae2bc-ebc9-48e7-bbb3-f03144198009/hive_2015-05-19_16-20-28_783_5570914503219655045-1/gss2002/_tez_scratch_dir/9da6870e-7388-40b1-bab6-9d0f242b1702/map.xml
2015-05-19 16:20:32,918 INFO [InputInitializer [Map 1] #0] exec.Utilities: ***************non-local mode***************
2015-05-19 16:20:32,918 INFO [InputInitializer [Map 1] #0] exec.Utilities: local path = hdfs://xhadnnm1p.example.com:8020/tmp/hive/gss2002/431ae2bc-ebc9-48e7-bbb3-f03144198009/hive_2015-05-19_16-20-28_783_5570914503219655045-1/gss2002/_tez_scratch_dir/9da6870e-7388-40b1-bab6-9d0f242b1702/map.xml
2015-05-19 16:20:32,918 DEBUG [InputInitializer [Map 1] #0] exec.Utilities: Loading plan from string: /tmp/hive/gss2002/431ae2bc-ebc9-48e7-bbb3-f03144198009/hive_2015-05-19_16-20-28_783_5570914503219655045-1/gss2002/_tez_scratch_dir/9da6870e-7388-40b1-bab6-9d0f242b1702/map.xml
2015-05-19 16:20:32,919 INFO [InputInitializer [Map 1] #0] log.PerfLogger: <PERFLOG method=deserializePlan from=org.apache.hadoop.hive.ql.exec.Utilities>
2015-05-19 16:20:32,919 INFO [InputInitializer [Map 1] #0] exec.Utilities: Deserializing MapWork via kryo
2015-05-19 16:20:32,940 INFO [InputInitializer [Map 1] #0] log.PerfLogger: </PERFLOG method=deserializePlan start=1432066832919 end=1432066832940 duration=21 from=org.apache.hadoop.hive.ql.exec.Utilities>
2015-05-19 16:20:32,941 DEBUG [InputInitializer [Map 1] #0] tez.SplitGrouper: Adding split hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000 to src new group? true

> Hive 0.14.x and Hive 1.2.0 w/ Tez 0.5.3/Tez 0.6.0 Slow group by/order by
> ------------------------------------------------------------------------
>
>                 Key: HIVE-10746
>                 URL: https://issues.apache.org/jira/browse/HIVE-10746
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive, Tez
>    Affects Versions: 0.14.0, 0.14.1, 1.2.0, 1.1.0, 1.1.1
>            Reporter: Greg Senia
>            Priority: Critical
>        Attachments: slow_query_output.zip
>
>
> The following query: "SELECT appl_user_id, arsn_cd, COUNT(*) as RecordCount
> FROM adw.crc_arsn GROUP BY appl_user_id, arsn_cd ORDER BY appl_user_id;"
> runs consistently fast in Spark and MapReduce on Hive 1.2.0. When run with
> Tez as the execution engine, the same query consistently takes 300-500
> seconds, which seems extremely long. This is a basic external table,
> tab-delimited, with a single file in its folder. In Hive 0.13 this query
> runs fast with Tez. I tested with Hive 0.14, 0.14.1/1.0.0 and now Hive
> 1.2.0, and something is clearly going awry with Hive w/Tez as an execution
> engine on single-file or small-file tables. I can attach further logs if
> someone needs them for deeper analysis.
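For what it's worth, doing the arithmetic on the numbers in this report suggests where 40040 comes from. This is only a sketch under assumptions, not a claim about the actual code path: it assumes the split generator asks the underlying mapred-style input format for roughly availableSlots * waves splits (23542 slots and 1.7 waves per the initializer log), and that the classic FileInputFormat sizing formula splitSize = max(minSize, min(goalSize, blockSize)) applies, with goalSize = totalSize / requestedSplits; the minSize of 1 and the 128 MB block size are assumed defaults.

```python
import math

# Numbers taken from the report itself:
file_size = 3_883_880     # bytes in part-m-00000, per the `hadoop fs -ls` output
available_slots = 23_542  # from the HiveSplitGenerator log line
waves = 1.7               # ditto

# Assumption: the initializer requests roughly slots * waves splits.
requested_splits = int(available_slots * waves)   # ~40021

# Classic mapred FileInputFormat sizing (assumed defaults for min/block size):
goal_size = file_size // requested_splits         # bytes each split "should" get
min_size = 1                                      # assumed default lower bound
block_size = 128 * 1024 * 1024                    # assumed HDFS block size
split_size = max(min_size, min(goal_size, block_size))

num_splits = math.ceil(file_size / split_size)
print(requested_splits, split_size, num_splits)   # → 40021 97 40040
```

If that arithmetic holds, the 3.7 MB file is being carved into 97-byte splits, one per requested task, which would explain both the exact 40040 count and why a single small file fans out to tens of thousands of map tasks.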
> HDFS Output:
> hadoop fs -ls /example_dw/crc/arsn
> Found 2 items
> -rwxr-x---   6 loaduser hadoopusers          0 2015-05-17 20:03 /example_dw/crc/arsn/_SUCCESS
> -rwxr-x---   6 loaduser hadoopusers    3883880 2015-05-17 20:03 /example_dw/crc/arsn/part-m-00000
>
> Hive Table Describe:
> hive> describe formatted crc_arsn;
> OK
> # col_name              data_type       comment
>
> arsn_cd                 string
> clmlvl_cd               string
> arclss_cd               string
> arclssg_cd              string
> arsn_prcsr_rmk_ind      string
> arsn_mbr_rspns_ind      string
> savtyp_cd               string
> arsn_eff_dt             string
> arsn_exp_dt             string
> arsn_pstd_dts           string
> arsn_lstupd_dts         string
> arsn_updrsn_txt         string
> appl_user_id            string
> arsntyp_cd              string
> pre_d_indicator         string
> arsn_display_txt        string
> arstat_cd               string
> arsn_tracking_no        string
> arsn_cstspcfc_ind       string
> arsn_mstr_rcrd_ind      string
> state_specific_ind      string
> region_specific_in      string
> arsn_dpndnt_cd          string
> unit_adjustment_in      string
> arsn_mbr_only_ind       string
> arsn_qrmb_ind           string
>
> # Detailed Table Information
> Database:               adw
> Owner:                  loadu...@exa.example.com
> CreateTime:             Mon Apr 28 13:28:05 EDT 2014
> LastAccessTime:         UNKNOWN
> Protect Mode:           None
> Retention:              0
> Location:               hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn
> Table Type:             EXTERNAL_TABLE
> Table Parameters:
>         EXTERNAL                TRUE
>         transient_lastDdlTime   1398706085
>
> # Storage Information
> SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat:            org.apache.hadoop.mapred.TextInputFormat
> OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Compressed:             No
> Num Buckets:            -1
> Bucket Columns:         []
> Sort Columns:           []
> Storage Desc Params:
>         field.delim             \t
>         line.delim              \n
>         serialization.format    \t
> Time taken: 1.245 seconds, Fetched: 54 row(s)
>
> Explain Hive 1.2.0 w/Tez:
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-1
>     Tez
>       Edges:
>         Reducer 2 <- Map 1 (SIMPLE_EDGE)
>         Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
>
> Explain Hive 0.13 w/Tez:
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-1
>     Tez
>       Edges:
>         Reducer 2 <- Map 1 (SIMPLE_EDGE)
>         Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
>
> Results:
> Hive 1.2.0 w/Spark 1.3.1: Finished successfully in 7.09 seconds
> Hive 1.2.0 w/MapReduce: Stage 1: 32 seconds; Stage 2: 35 seconds
> Hive 1.2.0 w/Tez 0.5.3: Time taken: 565.025 seconds, Fetched: 11516 row(s)
> Hive 0.13 w/Tez 0.4.0: Time taken: 13.552 seconds, Fetched: 11516 row(s)
>
> And finally, looking at the DAG attempt that is stuck for roughly 500
> seconds in Tez, it appears to be running the same method over and over
> again:
>
> 8 duration=2561 from=org.apache.hadoop.hive.ql.exec.tez.RecordProcessor>
> 2015-05-18 19:58:41,719 INFO [TezChild] exec.Utilities: PLAN PATH = hdfs://xhadnnm1p.example.com:8020/tmp/hive/gss2002/dbc4b0b5-7859-4487-a56d-969440bc5e90/hive_2015-05-18_19-58-25_951_5497535752804149087-1/gss2002/_tez_scratch_dir/4e635121-c4cd-4e3f-b96b-9f08a6a7bf5d/map.xml
> 2015-05-18 19:58:41,822 INFO [TezChild] exec.MapOperator: MAP[4]: records read - 1
> 2015-05-18 19:58:41,835 INFO [TezChild] io.HiveContextAwareRecordReader: Processing file hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
> 2015-05-18 19:58:41,848 INFO [TezChild] io.HiveContextAwareRecordReader: Processing file hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
> ......
> 2015-05-18 20:07:46,560 INFO [TezChild] io.HiveContextAwareRecordReader: Processing file hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
> 2015-05-18 20:07:46,574 INFO [TezChild] io.HiveContextAwareRecordReader: Processing file hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
> 2015-05-18 20:07:46,587 INFO [TezChild] io.HiveContextAwareRecordReader: Processing file hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
> 2015-05-18 20:07:46,603 INFO [TezChild] io.HiveContextAwareRecordReader: Processing file hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
> 2015-05-18 20:07:46,603 INFO [TezChild] log.PerfLogger: </PERFLOG method=TezRunProcessor start=1431993518764 end=1431994066603 duration=547839 from=org.apache.hadoop.hive.ql.exec.tez.TezProcessor>
> 2015-05-18 20:07:46,603 INFO [TezChild] exec.MapOperator: 4 finished. closing...
> 2015-05-18 20:07:46,603 INFO [TezChild] exec.MapOperator: RECORDS_IN_Map_1:13440

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
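One more back-of-the-envelope check on the stuck task: the PerfLogger entry above reports duration=547839 ms for TezRunProcessor, and successive "Processing file" lines are roughly 13-16 ms apart. Assuming one re-open of part-m-00000 per split (an assumption, inferred only from the repeating log lines), the per-split overhead alone accounts for essentially the entire runtime:

```python
# PerfLogger start/end for TezRunProcessor, from the log above (epoch millis):
total_ms = 1_431_994_066_603 - 1_431_993_518_764
num_splits = 40_040  # split count reported by the input initializer

per_split_ms = total_ms / num_splits
print(total_ms, round(per_split_ms, 2))  # → 547839 13.68
```

About 13.7 ms of open/seek overhead per tiny split, times 40040 splits, is the whole ~548-second run; the task is spending nearly all its time re-opening the same file rather than reading data.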