Re: merge small orc files
Hi Gopal,

The table created is not a bucketed table, but a dynamic partitioned table. I took the test script from https://svn.apache.org/repos/asf/hive/trunk/ql/src/test/queries/clientpositive/orc_merge7.q:

create table orc_merge5 (userid bigint, string1 string, subtype double, decimal1 decimal, ts timestamp) stored as orc;

create table orc_merge5a (userid bigint, string1 string, subtype double, decimal1 decimal, ts timestamp) partitioned by (st double) stored as orc;

I sent you the desc formatted of the table and the application log. I just found out that there are some TezExceptions, which could be the cause of the problem. Please let me know how to fix it.

BR,
Patcharee

On 21. april 2015 13:10, Gopal Vijayaraghavan wrote:
> >alter table concatenate do not work? I have a dynamic partitioned table
> >(stored as orc). I tried to alter concatenate, but it did not work. See
> >my test result.
>
> ORC fast concatenate does work on partitioned tables, but it doesn't work
> on bucketed tables. Bucketed tables cannot merge files, since the file
> count is capped by the numBuckets parameter.
>
> >hive> dfs -ls ${hiveconf:hive.metastore.warehouse.dir}/orc_merge5a/st=0.8/;
> >Found 2 items
> >-rw-r--r--   3 patcharee hdfs  534 2015-04-21 12:33 /apps/hive/warehouse/orc_merge5a/st=0.8/00_0
> >-rw-r--r--   3 patcharee hdfs  533 2015-04-21 12:33 /apps/hive/warehouse/orc_merge5a/st=0.8/01_0
>
> Is this a bucketed table?
>
> When you look at it from the point of view of split generation & cluster
> parallelism, bucketing is an anti-pattern, since in most query schemas it
> significantly slows down the slowest task. Making the fastest task faster
> isn't often worth it, if the overall query time goes up.
>
> Also if you want to, you can send me the yarn logs -applicationId and the
> desc formatted of the table, which will help me understand what's
> happening better.
> Cheers,
> Gopal

Container: container_1424363133313_0082_01_03 on compute-test-1-2.testlocal_45454
===
LogType:stderr
Log Upload Time:21-Apr-2015 14:17:54
LogLength:0
Log Contents:

LogType:stdout
Log Upload Time:21-Apr-2015 14:17:54
LogLength:2124
Log Contents:
0.294: [GC [PSYoungGen: 3642K->490K(6656K)] 3642K->1308K(62976K), 0.0071100 secs] [Times: user=0.00 sys=0.00, real=0.01 secs]
0.600: [GC [PSYoungGen: 6110K->496K(12800K)] 6929K->1992K(69120K), 0.0058540 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
1.061: [GC [PSYoungGen: 10217K->496K(12800K)] 11714K->3626K(69120K), 0.0077230 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
1.477: [GC [PSYoungGen: 8914K->512K(25088K)] 12045K->5154K(81408K), 0.0095740 secs] [Times: user=0.01 sys=0.01, real=0.01 secs]
2.361: [GC [PSYoungGen: 14670K->512K(25088K)] 19313K->6827K(81408K), 0.0106680 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
3.476: [GC [PSYoungGen: 22967K->3059K(51712K)] 29282K->9958K(108032K), 0.0201770 secs] [Times: user=0.02 sys=0.00, real=0.02 secs]
5.538: [GC [PSYoungGen: 50438K->3568K(52224K)] 57336K->15383K(108544K), 0.0374340 secs] [Times: user=0.04 sys=0.01, real=0.04 secs]
6.811: [GC [PSYoungGen: 29358K->6331K(61440K)] 41173K->18282K(117760K), 0.0421300 secs] [Times: user=0.03 sys=0.01, real=0.04 secs]
7.689: [GC [PSYoungGen: 28530K->6401K(61440K)] 40482K->19476K(117760K), 0.0443730 secs] [Times: user=0.03 sys=0.00, real=0.05 secs]
Heap
 PSYoungGen      total 61440K, used 28333K [0xfbb8, 0x0001, 0x0001)
  eden space 54784K, 40% used [0xfbb8,0xfdf463f8,0xff10)
    lgrp 0 space 2K, 49% used [0xfbb8,0xfc95add8,0xfd7b6000)
    lgrp 1 space 25896K, 29% used [0xfd7b6000,0xfdf463f8,0xff10)
  from space 6656K, 96% used [0xff10,0xff740400,0xff78)
  to   space 8704K, 0% used [0xff78,0xff78,0x0001)
 ParOldGen       total 56320K, used 13075K [0xd9a0, 0xdd10, 0xfbb8)
  object space 56320K, 23% used [0xd9a0,0xda6c4e68,0xdd10)
 PSPermGen       total 28672K, used 28383K [0xd480, 0xd640, 0xd9a0)
  object space 28672K, 98% used [0xd480,0xd63b7f78,0xd640)

LogType:syslog
Log Upload Time:21-Apr-2015 14:17:54
LogLength:1355
Log Contents:
2015-04-21 14:17:40,208 INFO [main] task.TezChild: TezChild starting
2015-04-21 14:17:41,856 INFO [main] task.TezChild: PID, containerIdentifier: 15169, container_1424363133313_0082_01_03
2015-04-21 14:17:41,985 INFO [main] impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2015-04-21 14:17:42,146 INFO [main] impl.MetricsSystemImpl: Scheduled snapshot period at 60 second(s).
2015-04-21 14:17:42,146 INFO [main] impl.MetricsSystemImpl: TezTask metrics system started
2015-04-21 14:17:42,355 INFO [TezChild] task.ContainerReporter: Attempting to fe
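As context for the test Patcharee referenced, the dynamic-partition load in orc_merge7.q is roughly the following sketch (the statements are assumed from that test file and may differ in detail; verify against the .q file itself):

```sql
-- Enable dynamic partitioning so the partition column value is taken
-- from the SELECT output rather than declared statically:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Populate the partitioned table from the flat one; the last SELECT
-- column feeds the dynamic partition column st:
INSERT OVERWRITE TABLE orc_merge5a PARTITION (st)
SELECT userid, string1, subtype, decimal1, ts, subtype AS st
FROM orc_merge5;
```

This is the load pattern that produces multiple small ORC files per partition (one per writer task), which is what the concatenate step is then expected to merge.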
Re: merge small orc files
>alter table concatenate do not work? I have a dynamic
>partitioned table (stored as orc). I tried to alter concatenate, but it
>did not work. See my test result.

ORC fast concatenate does work on partitioned tables, but it doesn't work on bucketed tables. Bucketed tables cannot merge files, since the file count is capped by the numBuckets parameter.

>hive> dfs -ls
>${hiveconf:hive.metastore.warehouse.dir}/orc_merge5a/st=0.8/;
>Found 2 items
>-rw-r--r--   3 patcharee hdfs  534 2015-04-21 12:33
>/apps/hive/warehouse/orc_merge5a/st=0.8/00_0
>-rw-r--r--   3 patcharee hdfs  533 2015-04-21 12:33
>/apps/hive/warehouse/orc_merge5a/st=0.8/01_0

Is this a bucketed table?

When you look at it from the point of view of split generation & cluster parallelism, bucketing is an anti-pattern, since in most query schemas it significantly slows down the slowest task. Making the fastest task faster isn't often worth it, if the overall query time goes up.

Also if you want to, you can send me the yarn logs -applicationId and the desc formatted of the table, which will help me understand what's happening better.

Cheers,
Gopal
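To illustrate the numBuckets point above, here is a hedged sketch (the bucketed table is hypothetical, not from the thread): a CLUSTERED BY table pins its per-partition file count to the declared bucket count, so concatenate cannot reduce it further, while a plain partitioned ORC table has no such floor.

```sql
-- Hypothetical bucketed table: every partition always holds exactly
-- 2 files (numBuckets), so concatenate cannot merge below that:
CREATE TABLE bucketed_example (userid BIGINT, string1 STRING)
PARTITIONED BY (st DOUBLE)
CLUSTERED BY (userid) INTO 2 BUCKETS
STORED AS ORC;

-- Plain partitioned ORC table (as in the thread): fast concatenate
-- is free to merge all small files in the partition:
ALTER TABLE orc_merge5a PARTITION (st=0.8) CONCATENATE;
```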
Re: merge small orc files
Hi Gopal,

Thanks for your explanation. What could be the case that SET hive.merge.orcfile.stripe.level=true && alter table concatenate do not work? I have a dynamic partitioned table (stored as orc). I tried to alter concatenate, but it did not work. See my test result:

hive> SET hive.merge.orcfile.stripe.level=true;
hive> alter table orc_merge5a partition(st=0.8) concatenate;
Starting Job = job_1424363133313_0053, Tracking URL = http://service-test-1-2.testlocal:8088/proxy/application_1424363133313_0053/
Kill Command = /usr/hdp/2.2.0.0-2041/hadoop/bin/hadoop job -kill job_1424363133313_0053
Hadoop job information for null: number of mappers: 0; number of reducers: 0
2015-04-21 12:32:56,165 null map = 0%, reduce = 0%
2015-04-21 12:33:05,964 null map = 100%, reduce = 0%
Ended Job = job_1424363133313_0053
Loading data to table default.orc_merge5a partition (st=0.8)
Moved: 'hdfs://service-test-1-0.testlocal:8020/apps/hive/warehouse/orc_merge5a/st=0.8/00_0' to trash at: hdfs://service-test-1-0.testlocal:8020/user/patcharee/.Trash/Current
Moved: 'hdfs://service-test-1-0.testlocal:8020/apps/hive/warehouse/orc_merge5a/st=0.8/02_0' to trash at: hdfs://service-test-1-0.testlocal:8020/user/patcharee/.Trash/Current
Partition default.orc_merge5a{st=0.8} stats: [numFiles=2, numRows=0, totalSize=1067, rawDataSize=0]
MapReduce Jobs Launched:
Stage-null: HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 22.839 seconds
hive> dfs -ls ${hiveconf:hive.metastore.warehouse.dir}/orc_merge5a/st=0.8/;
Found 2 items
-rw-r--r--   3 patcharee hdfs  534 2015-04-21 12:33 /apps/hive/warehouse/orc_merge5a/st=0.8/00_0
-rw-r--r--   3 patcharee hdfs  533 2015-04-21 12:33 /apps/hive/warehouse/orc_merge5a/st=0.8/01_0

It seems nothing happened when I altered table concatenate. Any ideas?

BR,
Patcharee

On 21. april 2015 04:41, Gopal Vijayaraghavan wrote:
> Hi,
>
> >How to set the configuration hive-site.xml to automatically merge small
> >orc file (output from mapreduce job) in hive 0.14 ?
>
> Hive cannot add work-stages to a map-reduce job. Hive follows
> merge.mapfiles=true when Hive generates a plan, by adding more work to
> the plan as a conditional task.
>
> >-rwxr-xr-x 1 root hdfs 29072 2015-04-20 15:23
> >/apps/hive/warehouse/coordinate/zone=2/part-r-0
>
> This looks like it was written by an MRv2 Reducer and not by the Hive
> FileSinkOperator, & handled by the MR OutputCommitter instead of the
> Hive MoveTask.
>
> But 0.14 has an option which helps - "hive.merge.orcfile.stripe.level".
> If that is true (like your setting), then do "alter table concatenate",
> which effectively concatenates ORC blocks (without decompressing them),
> while maintaining metadata linkage of start/end offsets in the footer.
>
> Cheers,
> Gopal
Re: merge small orc files
Hi,

>How to set the configuration hive-site.xml to automatically merge small
>orc file (output from mapreduce job) in hive 0.14 ?

Hive cannot add work-stages to a map-reduce job. Hive follows merge.mapfiles=true when Hive generates a plan, by adding more work to the plan as a conditional task.

>-rwxr-xr-x 1 root hdfs 29072 2015-04-20 15:23
>/apps/hive/warehouse/coordinate/zone=2/part-r-0

This looks like it was written by an MRv2 Reducer and not by the Hive FileSinkOperator, & handled by the MR OutputCommitter instead of the Hive MoveTask.

But 0.14 has an option which helps - "hive.merge.orcfile.stripe.level". If that is true (like your setting), then do "alter table concatenate", which effectively concatenates ORC blocks (without decompressing them), while maintaining metadata linkage of start/end offsets in the footer.

Cheers,
Gopal
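Putting the two remedies above together, a sketch of how a user could fix externally-written small files (the column names are hypothetical, since the thread never shows the coordinate table's schema; verify against the real table before running):

```sql
-- Option 1: rewrite the partition through Hive, so the hive.merge.*
-- conditional merge task actually runs (it only applies to plans
-- Hive generates, not to files dropped in by an external MR job):
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
INSERT OVERWRITE TABLE coordinate PARTITION (zone=2)
SELECT id, x, y          -- hypothetical columns
FROM coordinate
WHERE zone = 2;

-- Option 2: stripe-level fast concatenate, which stitches the ORC
-- files together without decompressing the data:
SET hive.merge.orcfile.stripe.level=true;
ALTER TABLE coordinate PARTITION (zone=2) CONCATENATE;
```

Option 2 is the cheaper path when it works, since it rewrites only file metadata and offsets rather than the row data itself.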
Re: merge small orc files
Also check hive.merge.size.per.task and hive.merge.smallfiles.avgsize.

On Mon, Apr 20, 2015 at 8:29 AM, patcharee wrote:
> Hi,
>
> How to set the configuration hive-site.xml to automatically merge small
> orc file (output from mapreduce job) in hive 0.14 ?
>
> This is my current configuration:
>
> <property>
>   <name>hive.merge.mapfiles</name>
>   <value>true</value>
> </property>
>
> <property>
>   <name>hive.merge.mapredfiles</name>
>   <value>true</value>
> </property>
>
> <property>
>   <name>hive.merge.orcfile.stripe.level</name>
>   <value>true</value>
> </property>
>
> However the output from a mapreduce job, which is stored into an orc
> file, was not merged. This is the output:
>
> -rwxr-xr-x 1 root hdfs     0 2015-04-20 15:23 /apps/hive/warehouse/coordinate/zone=2/_SUCCESS
> -rwxr-xr-x 1 root hdfs 29072 2015-04-20 15:23 /apps/hive/warehouse/coordinate/zone=2/part-r-0
> -rwxr-xr-x 1 root hdfs 29049 2015-04-20 15:23 /apps/hive/warehouse/coordinate/zone=2/part-r-1
> -rwxr-xr-x 1 root hdfs 29075 2015-04-20 15:23 /apps/hive/warehouse/coordinate/zone=2/part-r-2
>
> Any ideas?
>
> BR,
> Patcharee
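A sketch of how the two suggested settings would look alongside the existing ones in hive-site.xml. The values shown are my recollection of the usual Hive defaults (256 MB target size, 16 MB average-size trigger) and should be treated as assumptions to verify against your release:

```xml
<!-- Target size of each merged file, in bytes (assumed default): -->
<property>
  <name>hive.merge.size.per.task</name>
  <value>256000000</value>
</property>

<!-- The merge step only fires when the average output file size in the
     job is below this threshold, in bytes (assumed default): -->
<property>
  <name>hive.merge.smallfiles.avgsize</name>
  <value>16000000</value>
</property>
```

Note that 29 KB files like the ones listed above are already well under a 16 MB average, so if the merge still does not fire, the cause is more likely that the files were written outside a Hive plan, as discussed earlier in the thread.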