Re: merge small orc files

2015-04-21 Thread patcharee

Hi Gopal,

The table created is not a bucketed table, but a dynamically partitioned
table. I took the script test from

- create table orc_merge5 (userid bigint, string1 string, subtype 
double, decimal1 decimal, ts timestamp) stored as orc;
- create table orc_merge5a (userid bigint, string1 string, subtype 
double, decimal1 decimal, ts timestamp) partitioned by (st double) 
stored as orc;

I sent you the desc formatted output of the table and the application log.
I just found out that there are some TezExceptions, which could be the
cause of the problem. Please let me know how to fix it.



On 21. april 2015 13:10, Gopal Vijayaraghavan wrote:

>> alter table  concatenate do not work? I have a dynamic
>> partitioned table (stored as orc). I tried to alter concatenate, but it
>> did not work. See my test result.
>
> ORC fast concatenate does work on partitioned tables, but it doesn't work
> on bucketed tables.
>
> Bucketed tables cannot merge files, since the file count is capped by the
> numBuckets parameter.
>
>> hive> dfs -ls
>> Found 2 items
>> -rw-r--r--   3 patcharee hdfs        534 2015-04-21 12:33
>> -rw-r--r--   3 patcharee hdfs        533 2015-04-21 12:33
>
> Is this a bucketed table?
>
> When you look at it from the point of view of split generation & cluster
> parallelism, bucketing is an anti-pattern, since in most query schemas it
> significantly slows down the slowest task.
>
> Making the fastest task faster isn't often worth it, if the overall query
> time goes up.
>
> Also if you want to, you can send me the yarn logs -applicationId 
> and the desc formatted output of the table, which will help me understand what's
> happening better.


Container: container_1424363133313_0082_01_03 on compute-test-1-2.testlocal_45454
Log Upload Time:21-Apr-2015 14:17:54
Log Contents:

Log Upload Time:21-Apr-2015 14:17:54
Log Contents:
0.294: [GC [PSYoungGen: 3642K->490K(6656K)] 3642K->1308K(62976K), 0.0071100 secs] [Times: user=0.00 sys=0.00, real=0.01 secs] 
0.600: [GC [PSYoungGen: 6110K->496K(12800K)] 6929K->1992K(69120K), 0.0058540 secs] [Times: user=0.01 sys=0.00, real=0.00 secs] 
1.061: [GC [PSYoungGen: 10217K->496K(12800K)] 11714K->3626K(69120K), 0.0077230 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
1.477: [GC [PSYoungGen: 8914K->512K(25088K)] 12045K->5154K(81408K), 0.0095740 secs] [Times: user=0.01 sys=0.01, real=0.01 secs] 
2.361: [GC [PSYoungGen: 14670K->512K(25088K)] 19313K->6827K(81408K), 0.0106680 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
3.476: [GC [PSYoungGen: 22967K->3059K(51712K)] 29282K->9958K(108032K), 0.0201770 secs] [Times: user=0.02 sys=0.00, real=0.02 secs] 
5.538: [GC [PSYoungGen: 50438K->3568K(52224K)] 57336K->15383K(108544K), 0.0374340 secs] [Times: user=0.04 sys=0.01, real=0.04 secs] 
6.811: [GC [PSYoungGen: 29358K->6331K(61440K)] 41173K->18282K(117760K), 0.0421300 secs] [Times: user=0.03 sys=0.01, real=0.04 secs] 
7.689: [GC [PSYoungGen: 28530K->6401K(61440K)] 40482K->19476K(117760K), 0.0443730 secs] [Times: user=0.03 sys=0.00, real=0.05 secs] 
 PSYoungGen  total 61440K, used 28333K [0xfbb8, 0x0001, 0x0001)
  eden space 54784K, 40% used [0xfbb8,0xfdf463f8,0xff10)
lgrp 0 space 2K, 49% used [0xfbb8,0xfc95add8,0xfd7b6000)
lgrp 1 space 25896K, 29% used [0xfd7b6000,0xfdf463f8,0xff10)
  from space 6656K, 96% used [0xff10,0xff740400,0xff78)
  to   space 8704K, 0% used [0xff78,0xff78,0x0001)
 ParOldGen   total 56320K, used 13075K [0xd9a0, 0xdd10, 0xfbb8)
  object space 56320K, 23% used [0xd9a0,0xda6c4e68,0xdd10)
 PSPermGen   total 28672K, used 28383K [0xd480, 0xd640, 0xd9a0)
  object space 28672K, 98% used [0xd480,0xd63b7f78,0xd640)

Log Upload Time:21-Apr-2015 14:17:54
Log Contents:
2015-04-21 14:17:40,208 INFO [main] task.TezChild: TezChild starting
2015-04-21 14:17:41,856 INFO [main] task.TezChild: PID, containerIdentifier:  15169, container_1424363133313_0082_01_03
2015-04-21 14:17:41,985 INFO [main] impl.MetricsConfig: loaded properties from
2015-04-21 14:17:42,146 INFO [main] impl.MetricsSystemImpl: Scheduled snapshot period at 60 second(s).
2015-04-21 14:17:42,146 INFO [main] impl.MetricsSystemImpl: TezTask metrics system started
2015-04-21 14:17:42,355 INFO [TezChild] task.ContainerReporter: Attempting to fe

Re: merge small orc files

2015-04-21 Thread Gopal Vijayaraghavan

>alter table  concatenate do not work? I have a dynamic
>partitioned table (stored as orc). I tried to alter concatenate, but it
>did not work. See my test result.

ORC fast concatenate does work on partitioned tables, but it doesn't work
on bucketed tables.

Bucketed tables cannot merge files, since the file count is capped by the
numBuckets parameter.

>hive> dfs -ls 
>Found 2 items
>-rw-r--r--   3 patcharee hdfs        534 2015-04-21 12:33
>-rw-r--r--   3 patcharee hdfs        533 2015-04-21 12:33

Is this a bucketed table?

When you look at it from the point of view of split generation & cluster
parallelism, bucketing is an anti-pattern, since in most query schemas it
significantly slows down the slowest task.

Making the fastest task faster isn't often worth it, if the overall query
time goes up.

Also if you want to, you can send me the yarn logs -applicationId 
and the desc formatted output of the table, which will help me understand what's
happening better.
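
A minimal sketch of how to gather that table information, assuming the
orc_merge5a table and the st=0.8 partition mentioned elsewhere in this
thread. As far as I know, a non-bucketed table shows "Num Buckets: -1" and
an empty "Bucket Columns" list in this output, but treat that as an
assumption to check against your Hive version.

```sql
-- Table-level metadata: check the "Num Buckets" and "Bucket Columns" rows
-- to see whether the table is bucketed.
DESCRIBE FORMATTED orc_merge5a;

-- Partition-level metadata for the partition that was concatenated
-- (partition value taken from the thread).
DESCRIBE FORMATTED orc_merge5a PARTITION (st=0.8);
```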


Re: merge small orc files

2015-04-21 Thread patcharee

Hi Gopal,

Thanks for your explanation.

What could be the reason that SET hive.merge.orcfile.stripe.level=true &&
alter table  concatenate does not work? I have a dynamically
partitioned table (stored as orc). I tried to alter concatenate, but it
did not work. See my test result.

hive> SET hive.merge.orcfile.stripe.level=true;
hive> alter table orc_merge5a partition(st=0.8) concatenate;
Starting Job = job_1424363133313_0053, Tracking URL = 
Kill Command = /usr/hdp/ job  -kill 

Hadoop job information for null: number of mappers: 0; number of reducers: 0
2015-04-21 12:32:56,165 null map = 0%,  reduce = 0%
2015-04-21 12:33:05,964 null map = 100%,  reduce = 0%
Ended Job = job_1424363133313_0053
Loading data to table default.orc_merge5a partition (st=0.8)
to trash at: 
to trash at: 
Partition default.orc_merge5a{st=0.8} stats: [numFiles=2, numRows=0, 
totalSize=1067, rawDataSize=0]

MapReduce Jobs Launched:
Stage-null:  HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
Time taken: 22.839 seconds
hive> dfs -ls ${hiveconf:hive.metastore.warehouse.dir}/orc_merge5a/st=0.8/;
Found 2 items
-rw-r--r--   3 patcharee hdfs        534 2015-04-21 12:33 
-rw-r--r--   3 patcharee hdfs        533 2015-04-21 12:33 

It seems nothing happened when I altered table concatenate. Any ideas?


On 21. april 2015 04:41, Gopal Vijayaraghavan wrote:


>> How to set the configuration hive-site.xml to automatically merge small
>> orc file (output from mapreduce job) in hive 0.14 ?
>
> Hive cannot add work-stages to a map-reduce job.
>
> Hive follows merge.mapfiles=true when Hive generates a plan, by adding
> more work to the plan as a conditional task.
>
>> -rwxr-xr-x   1 root hdfs  29072 2015-04-20 15:23
>
> This looks like it was written by an MRv2 Reducer and not by the Hive
> FileSinkOperator & handled by the MR outputcommitter instead of the Hive
>
> But 0.14 has an option which helps "hive.merge.orcfile.stripe.level". If
> that is true (like your setting), then do
>
> "alter table  concatenate"
>
> which effectively concatenates ORC blocks (without decompressing them),
> while maintaining metadata linkage of start/end offsets in the footer.


Re: merge small orc files

2015-04-20 Thread Gopal Vijayaraghavan

>How to set the configuration hive-site.xml to automatically merge small
>orc file (output from mapreduce job) in hive 0.14 ?

Hive cannot add work-stages to a map-reduce job.

Hive follows merge.mapfiles=true when Hive generates a plan, by adding
more work to the plan as a conditional task.
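
To illustrate the point above: the merge step is only attached to jobs
that Hive plans itself. Here is a sketch, assuming the orc_merge5 and
orc_merge5a tables defined elsewhere in this thread, of a write that goes
through Hive and can therefore pick up the conditional merge task. The
dynamic-partition settings and the hive.merge.tezfiles line are my
assumptions about a typical setup, not something stated in the thread.

```sql
-- Merge flags from the original question, set for this session.
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.orcfile.stripe.level=true;
-- Assumption: only relevant if the execution engine is Tez.
SET hive.merge.tezfiles=true;

-- Assumption: required for the dynamic-partition insert below.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Hive plans this job, so it can append the merge as a conditional task.
-- The last selected column feeds the dynamic partition column st.
INSERT OVERWRITE TABLE orc_merge5a PARTITION (st)
SELECT userid, string1, subtype, decimal1, ts, subtype AS st
FROM orc_merge5;
```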

>-rwxr-xr-x   1 root hdfs  29072 2015-04-20 15:23

This looks like it was written by an MRv2 Reducer and not by the Hive
FileSinkOperator & handled by the MR outputcommitter instead of the Hive

But 0.14 has an option which helps "hive.merge.orcfile.stripe.level". If
that is true (like your setting), then do

"alter table  concatenate"

which effectively concatenates ORC blocks (without decompressing them),
while maintaining metadata linkage of start/end offsets in the footer.
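
A minimal sketch of that sequence for the files in question. The table
name coordinate and the partition column zone are assumptions inferred
from the path /apps/hive/warehouse/coordinate/zone=2/ in the original
question; adjust them to the real table definition. It also assumes the
partition is already registered in the metastore.

```sql
-- Enable stripe-level (fast) ORC merge for this session.
SET hive.merge.orcfile.stripe.level=true;

-- Assumption: table "coordinate" partitioned by "zone", inferred from the
-- HDFS path in the question. Concatenates the ORC files of one partition.
ALTER TABLE coordinate PARTITION (zone=2) CONCATENATE;
```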


Re: merge small orc files

2015-04-20 Thread Xuefu Zhang
Also check hive.merge.size.per.task and hive.merge.smallfiles.avgsize.
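
For reference, a sketch of those two settings as session-level SET
statements. The values shown are what I understand to be the stock
defaults, so treat them as an assumption and check your distribution.
The conditional merge runs only when the average output file size falls
below hive.merge.smallfiles.avgsize. The three part-r files in the
listing below average about 29 KB, far under that threshold, so these
thresholds alone would not block a merge for files written by Hive itself.

```sql
-- Merge is triggered when the average output file size is below this
-- many bytes (assumed default: 16000000, about 16 MB).
SET hive.merge.smallfiles.avgsize=16000000;

-- Target size in bytes of each merged file (assumed default: 256000000,
-- about 256 MB).
SET hive.merge.size.per.task=256000000;
```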

On Mon, Apr 20, 2015 at 8:29 AM, patcharee wrote:

> Hi,
> How to set the configuration hive-site.xml to automatically merge small
> orc file (output from mapreduce job) in hive 0.14 ?
> This is my current configuration:
>   hive.merge.mapfiles = true
>   hive.merge.mapredfiles = true
>   hive.merge.orcfile.stripe.level = true
> However the output from a mapreduce job, which is stored into an orc file,
> was not merged. This is the output:
> -rwxr-xr-x   1 root hdfs      0 2015-04-20 15:23 /apps/hive/warehouse/coordinate/zone=2/_SUCCESS
> -rwxr-xr-x   1 root hdfs  29072 2015-04-20 15:23 /apps/hive/warehouse/coordinate/zone=2/part-r-0
> -rwxr-xr-x   1 root hdfs  29049 2015-04-20 15:23 /apps/hive/warehouse/coordinate/zone=2/part-r-1
> -rwxr-xr-x   1 root hdfs  29075 2015-04-20 15:23 /apps/hive/warehouse/coordinate/zone=2/part-r-2
> Any ideas?
> BR,
> Patcharee