Re: merge small orc files

2015-04-21 Thread patcharee

Hi Gopal,

Thanks for your explanation.

What could be the case where SET hive.merge.orcfile.stripe.level=true and
alter table <table> concatenate do not work? I have a dynamically
partitioned table (stored as ORC). I tried alter ... concatenate, but it
did not work. See my test result below.


hive> SET hive.merge.orcfile.stripe.level=true;
hive> alter table orc_merge5a partition(st=0.8) concatenate;
Starting Job = job_1424363133313_0053, Tracking URL = 
http://service-test-1-2.testlocal:8088/proxy/application_1424363133313_0053/
Kill Command = /usr/hdp/2.2.0.0-2041/hadoop/bin/hadoop job  -kill 
job_1424363133313_0053

Hadoop job information for null: number of mappers: 0; number of reducers: 0
2015-04-21 12:32:56,165 null map = 0%,  reduce = 0%
2015-04-21 12:33:05,964 null map = 100%,  reduce = 0%
Ended Job = job_1424363133313_0053
Loading data to table default.orc_merge5a partition (st=0.8)
Moved: 
'hdfs://service-test-1-0.testlocal:8020/apps/hive/warehouse/orc_merge5a/st=0.8/00_0' 
to trash at: 
hdfs://service-test-1-0.testlocal:8020/user/patcharee/.Trash/Current
Moved: 
'hdfs://service-test-1-0.testlocal:8020/apps/hive/warehouse/orc_merge5a/st=0.8/02_0' 
to trash at: 
hdfs://service-test-1-0.testlocal:8020/user/patcharee/.Trash/Current
Partition default.orc_merge5a{st=0.8} stats: [numFiles=2, numRows=0, 
totalSize=1067, rawDataSize=0]

MapReduce Jobs Launched:
Stage-null:  HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 22.839 seconds
hive> dfs -ls ${hiveconf:hive.metastore.warehouse.dir}/orc_merge5a/st=0.8/;
Found 2 items
-rw-r--r--   3 patcharee hdfs        534 2015-04-21 12:33 
/apps/hive/warehouse/orc_merge5a/st=0.8/00_0
-rw-r--r--   3 patcharee hdfs        533 2015-04-21 12:33 
/apps/hive/warehouse/orc_merge5a/st=0.8/01_0


It seems nothing happened when I ran alter table ... concatenate. Any ideas?

BR,
Patcharee

On 21 April 2015 at 04:41, Gopal Vijayaraghavan wrote:

Hi,


How to set the configuration hive-site.xml to automatically merge small
orc file (output from mapreduce job) in hive 0.14 ?

Hive cannot add work-stages to a map-reduce job it did not plan itself.

Hive follows hive.merge.mapfiles=true only when Hive generates the plan,
by adding the merge work to the plan as a conditional task.
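
(For reference, a minimal sketch of the merge knobs involved, as session
settings; the names are the standard Hive 0.14-era options and the sizes
are their usual defaults:)

SET hive.merge.mapfiles=true;                -- merge after map-only jobs
SET hive.merge.mapredfiles=true;             -- merge after map-reduce jobs
SET hive.merge.smallfiles.avgsize=16000000;  -- trigger when avg output file is below this
SET hive.merge.size.per.task=256000000;      -- target size of the merged files
SET hive.merge.orcfile.stripe.level=true;    -- stripe-level concatenation for ORC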


-rwxr-xr-x   1 root hdfs  29072 2015-04-20 15:23
/apps/hive/warehouse/coordinate/zone=2/part-r-0

This looks like it was written by an MRv2 Reducer and not by the Hive
FileSinkOperator & handled by the MR OutputCommitter instead of the Hive
MoveTask.

But 0.14 has an option which helps, "hive.merge.orcfile.stripe.level". If
that is true (like your setting), then do

"alter table <table> concatenate"

which effectively concatenates ORC blocks (without decompressing them),
while maintaining metadata linkage of start/end offsets in the footer.

Cheers,
Gopal






mapred.reduce.tasks

2015-04-21 Thread Shushant Arora
In a MapReduce job, how is the number of reduce tasks decided?
I haven't overridden the mapred.reduce.tasks property, and it's creating ~700
reduce tasks.

Thanks


Re: merge small orc files

2015-04-21 Thread patcharee

Hi Gopal,

The table created is not a bucketed table, but a dynamically partitioned 
table. I took the test script from 
https://svn.apache.org/repos/asf/hive/trunk/ql/src/test/queries/clientpositive/orc_merge7.q


- create table orc_merge5 (userid bigint, string1 string, subtype 
double, decimal1 decimal, ts timestamp) stored as orc;
- create table orc_merge5a (userid bigint, string1 string, subtype 
double, decimal1 decimal, ts timestamp) partitioned by (st double) 
stored as orc;
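
(The partition itself was populated with a dynamic-partition insert along
the following lines; this is a sketch paraphrased from the linked
orc_merge7.q, not copied verbatim from it:)

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table orc_merge5a partition (st)
  select userid, string1, subtype, decimal1, ts, subtype from orc_merge5;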


I sent you the desc formatted output of the table and the application 
log. I just found out that there are some TezExceptions, which could be 
the cause of the problem. Please let me know how to fix it.


BR,

Patcharee


On 21 April 2015 at 13:10, Gopal Vijayaraghavan wrote:



alter table <table> concatenate do not work? I have a dynamically
partitioned table (stored as ORC). I tried alter ... concatenate, but it
did not work. See my test result below.

ORC fast concatenate does work on partitioned tables, but it doesn't work
on bucketed tables.

Bucketed tables cannot merge files, since the file count is capped by the
numBuckets parameter.
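
(One quick way to check is desc formatted; a sketch, using the table name
from your test - a non-bucketed table reports Num Buckets: -1:)

hive> desc formatted orc_merge5a;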


hive> dfs -ls
${hiveconf:hive.metastore.warehouse.dir}/orc_merge5a/st=0.8/;
Found 2 items
-rw-r--r--   3 patcharee hdfs        534 2015-04-21 12:33
/apps/hive/warehouse/orc_merge5a/st=0.8/00_0
-rw-r--r--   3 patcharee hdfs        533 2015-04-21 12:33
/apps/hive/warehouse/orc_merge5a/st=0.8/01_0

Is this a bucketed table?

When you look at it from the point of view of split generation & cluster
parallelism, bucketing is an anti-pattern, since in most query schemas it
significantly slows down the slowest task.

Making the fastest task faster isn't often worth it, if the overall query
time goes up.

Also, if you want to, you can send me the yarn logs -applicationId <app-id>
output and the desc formatted of the table, which will help me better
understand what's happening.

Cheers,
Gopal






Container: container_1424363133313_0082_01_03 on compute-test-1-2.testlocal_45454
===
LogType:stderr
Log Upload Time:21-Apr-2015 14:17:54
LogLength:0
Log Contents:

LogType:stdout
Log Upload Time:21-Apr-2015 14:17:54
LogLength:2124
Log Contents:
0.294: [GC [PSYoungGen: 3642K->490K(6656K)] 3642K->1308K(62976K), 0.0071100 secs] [Times: user=0.00 sys=0.00, real=0.01 secs] 
0.600: [GC [PSYoungGen: 6110K->496K(12800K)] 6929K->1992K(69120K), 0.0058540 secs] [Times: user=0.01 sys=0.00, real=0.00 secs] 
1.061: [GC [PSYoungGen: 10217K->496K(12800K)] 11714K->3626K(69120K), 0.0077230 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
1.477: [GC [PSYoungGen: 8914K->512K(25088K)] 12045K->5154K(81408K), 0.0095740 secs] [Times: user=0.01 sys=0.01, real=0.01 secs] 
2.361: [GC [PSYoungGen: 14670K->512K(25088K)] 19313K->6827K(81408K), 0.0106680 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
3.476: [GC [PSYoungGen: 22967K->3059K(51712K)] 29282K->9958K(108032K), 0.0201770 secs] [Times: user=0.02 sys=0.00, real=0.02 secs] 
5.538: [GC [PSYoungGen: 50438K->3568K(52224K)] 57336K->15383K(108544K), 0.0374340 secs] [Times: user=0.04 sys=0.01, real=0.04 secs] 
6.811: [GC [PSYoungGen: 29358K->6331K(61440K)] 41173K->18282K(117760K), 0.0421300 secs] [Times: user=0.03 sys=0.01, real=0.04 secs] 
7.689: [GC [PSYoungGen: 28530K->6401K(61440K)] 40482K->19476K(117760K), 0.0443730 secs] [Times: user=0.03 sys=0.00, real=0.05 secs] 
Heap
 PSYoungGen  total 61440K, used 28333K [0xfbb8, 0x0001, 0x0001)
  eden space 54784K, 40% used [0xfbb8,0xfdf463f8,0xff10)
lgrp 0 space 2K, 49% used [0xfbb8,0xfc95add8,0xfd7b6000)
lgrp 1 space 25896K, 29% used [0xfd7b6000,0xfdf463f8,0xff10)
  from space 6656K, 96% used [0xff10,0xff740400,0xff78)
  to   space 8704K, 0% used [0xff78,0xff78,0x0001)
 ParOldGen   total 56320K, used 13075K [0xd9a0, 0xdd10, 0xfbb8)
  object space 56320K, 23% used [0xd9a0,0xda6c4e68,0xdd10)
 PSPermGen   total 28672K, used 28383K [0xd480, 0xd640, 0xd9a0)
  object space 28672K, 98% used [0xd480,0xd63b7f78,0xd640)

LogType:syslog
Log Upload Time:21-Apr-2015 14:17:54
LogLength:1355
Log Contents:
2015-04-21 14:17:40,208 INFO [main] task.TezChild: TezChild starting
2015-04-21 14:17:41,856 INFO [main] task.TezChild: PID, containerIdentifier:  15169, container_1424363133313_0082_01_03
2015-04-21 14:17:41,985 INFO [main] impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2015-04-21 14:17:42,146 INFO [main] impl.MetricsSystemImpl: Scheduled snapshot period at 60 second(s).
2015-04-21 14:17:42,146 INFO [main] impl.MetricsSystemImpl: TezTask metrics system started
2015-04-21 14:17:42,355 INFO [TezChild] task.ContainerReporter: Attempting to fetch new 

MapredContext not available when tez enabled

2015-04-21 Thread Frank Luo
We have a UDF to collect some counts during Hive execution. It had been working 
fine until Tez was enabled.

A bit of digging shows that the GenericUDF#configure method was not called. So 
in this case, is it possible to get counters through other means, or do we 
have to implement the Counter concept ourselves?

Thanks in advance


Re: MapredContext not available when tez enabled

2015-04-21 Thread Gopal Vijayaraghavan

 
 A bit of digging shows that the GenericUDF#configure method was not called.
So in this case, is it possible to get counters through other means, or do
we have to implement the Counter concept ourselves?

You should be getting a TezContext object there (which inherits from
MapredContext).

And the method should get called depending on a needConfigure() check - if
it is not getting called, that is very strange.
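
(A minimal sketch of what that looks like - a hypothetical UDF, not the
original poster's code, assuming only the standard GenericUDF API:)

import org.apache.hadoop.hive.ql.exec.MapredContext;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.Reporter;

public class CountingUDF extends GenericUDF {
  private transient Reporter reporter;

  // Invoked once per task when the needConfigure() check passes; on Tez
  // the argument is a TezContext, which still extends MapredContext.
  @Override
  public void configure(MapredContext context) {
    reporter = context.getReporter();
  }

  @Override
  public ObjectInspector initialize(ObjectInspector[] arguments) {
    return PrimitiveObjectInspectorFactory.writableLongObjectInspector;
  }

  @Override
  public Object evaluate(DeferredObject[] arguments) throws HiveException {
    if (reporter != null) {
      // Shows up in the job's counter groups.
      reporter.incrCounter("CountingUDF", "rows", 1L);
    }
    return new LongWritable(1L);
  }

  @Override
  public String getDisplayString(String[] children) {
    return "counting_udf()";
  }
}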

Cheers,
Gopal




RE: mapred.reduce.tasks

2015-04-21 Thread Rohith Sharma K S
Hi

In MapReduce, the number of reducers launched is set by the property 
“mapreduce.job.reduces”, or via the Java API Job#setNumReduceTasks(int tasks).

Somewhere in your MR job the number of reduce tasks is being set, either 
through the Java API or through the property. Maybe you can check the MR job 
code, the property file, or the configuration files which are loaded at the 
client.
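
(If the job in question is actually a Hive query, Hive estimates the count
from the input size; a sketch of the relevant knobs, with their usual
defaults:)

SET mapreduce.job.reduces=-1;  -- -1 lets Hive estimate; a positive value hard-codes it
SET hive.exec.reducers.bytes.per.reducer=256000000;  -- input bytes per reducer
SET hive.exec.reducers.max=1009;                     -- cap on the estimated count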

Thanks & Regards
Rohith Sharma K S

From: Shushant Arora [mailto:shushantaror...@gmail.com]
Sent: 21 April 2015 18:26
To: user@hive.apache.org
Subject: mapred.reduce.tasks

In a MapReduce job, how is the number of reduce tasks decided?
I haven't overridden the mapred.reduce.tasks property, and it's creating ~700 
reduce tasks.

Thanks


RE: UDF cannot be found when the query is submitted via templeton

2015-04-21 Thread Xiaoyong Zhu
What we do is to upload the script first into HDFS, then use the file option 
in WebHCat to submit the queries - so I think the REST call should not 
matter... right?
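
(Roughly like this - a sketch with placeholder host and paths, not the exact
call we make:)

curl -s -d user.name=xiaoyong \
     -d file=hdfs:///test/query.hql \
     -d statusdir=/tmp/webhcat.output \
     'http://namenode:50111/templeton/v1/hive'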

Xiaoyong

From: Eugene Koifman [mailto:ekoif...@hortonworks.com]
Sent: Monday, April 20, 2015 10:32 PM
To: user@hive.apache.org
Subject: Re: UDF cannot be found when the query is submitted via templeton

can you give the complete REST call you are making to submit the query?

From: Xiaoyong Zhu <xiaoy...@microsoft.com>
Reply-To: user@hive.apache.org
Date: Sunday, April 19, 2015 at 8:23 PM
To: user@hive.apache.org
Subject: RE: UDF cannot be found when the query is submitted via templeton

No, it doesn't...

What surprises me is that for HDInsight (Hadoop on Azure), which uses Azure 
BLOB storage, using

ADD JAR wasb:///test/HiveUDF.jar;
CREATE TEMPORARY FUNCTION FindPat AS 'HiveUDF.FindPattern';
select count(FindPat(columnname)) from table1;

would work. However, for my own cluster,

ADD JAR hdfs:///test/HiveUDF.jar;
CREATE TEMPORARY FUNCTION FindPat AS 'HiveUDF.FindPattern';
select count(FindPat(columnname)) from table1;

does not work...

Xiaoyong

From: Jason Dere [mailto:jd...@hortonworks.com]
Sent: Saturday, April 18, 2015 1:37 AM
To: user@hive.apache.org
Subject: Re: UDF cannot be found when the query is submitted via templeton

Does fully qualifying the function name (HiveUDF.FindPattern()) in the query 
help here?
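
(A sketch of a database-qualified call - note this qualifies by database
rather than by class, and assumes the function was created in the default
database:)

select count(default.FindPattern(s_sitename)) AS testcol from weblogs;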

On Apr 17, 2015, at 6:44 AM, Xiaoyong Zhu <xiaoy...@microsoft.com> wrote:



Hi experts

I am trying to use a UDF (I have already put it in the metastore using 
CREATE FUNCTION) as follows.

select count(FindPattern(s_sitename)) AS testcol from weblogs;

However, when I try to use the UDF from WebHCat (i.e. submit the above 
command via WebHCat), the job always fails, saying:

Added [/tmp/2cb22c27-72d3-4b41-aea0-655df1192872_resources/HiveUDF.jar] to 
class path
Added resources: [hdfs://PATHTOFOLDER/Portal-Queries/HiveUDF.jar]
FAILED: SemanticException [Error 10011]: Line 1:13 Invalid function FindPattern

If I execute this command through the Hive CLI (through hive -f file or in 
the interactive shell), the statement above works. From the log I can see the 
jar file is added, but it seems the function cannot be found. Can someone 
share some thoughts on this issue?

Btw, the create function statement is as follows (changing the hdfs URI to 
the full path does not work either):
CREATE FUNCTION FindPattern AS 'HiveUDF.FindPattern' USING 
JAR 'hdfs:///UDFFolder/HiveUDF.jar';

Thanks in advance!

Xiaoyong



