Re: Merging small files with dynamic partitions

2010-10-15 Thread Sammy Yu
Hi guys, Thanks for the response. I tried running without hive.mergejob.maponly with the same result. I've attached the explain extended output. I am running this query on EC2 boxes; however, it's not running on EMR. Hive is running on top of a Hadoop 0.20.2 setup. Thanks, Sammy On Fri, O

Re: Merging small files with dynamic partitions

2010-10-15 Thread Ning Zhang
The output file shows it only has 2 jobs (the MapReduce job and the move task). This indicates that the plan does not have merge enabled. Merge should consist of a ConditionalTask and 2 subtasks (an MR task and a move task). Can you send the plan of the query? One thing I noticed is that you

Re: Merging small files with dynamic partitions

2010-10-15 Thread Edward Capriolo
Sammy, This is not the exact remedy you were looking for, but my company open sourced our file crusher utility. http://www.jointhegrid.com/hadoop_filecrush/index.jsp We use it to good effect to turn many small files into one. It works with text files, sequence files, and custom writables. Edward On

Re: Help with last 30 day unique user query

2010-10-15 Thread Vijay
Thanks Alex! That is exactly what I thought was the limitation but wanted to make sure I'm not missing anything. On Fri, Oct 15, 2010 at 10:51 AM, Alex Boisvert wrote: > As far as I know, Hive has no built-in support for sliding-window > analytics. There is an enhancement request here: > https:

Re: Multiple insert statement and levels of aggregation

2010-10-15 Thread Alex Boisvert
Cool, I hadn't come across lateral views yet. I'll see if I can use that. thanks!! alex On Fri, Oct 15, 2010 at 11:17 AM, Ning Zhang wrote: > In the multi-insert statement, you cannot put another FROM clause. What you > can do is to put both UDTF in the FROM clause: > > FROM foo lateral view

UDAF modes

2010-10-15 Thread Alex Boisvert
Hi, I'm writing a UDAF and I'm a little unclear about the PARTIAL1, PARTIAL2, FINAL and COMPLETE modes. I've read the extent of the Javadoc ;) and looked at some of the built-in UDAFs in the Hive source tree and I'm still unclear about the properties of the input data in each aggregation step. C
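
For readers following this thread: as described in the GenericUDAFEvaluator.Mode Javadoc, each mode corresponds to a pair of evaluator calls. PARTIAL1 takes original rows to a partial aggregate (iterate, then terminatePartial), PARTIAL2 takes partials to a partial (merge, then terminatePartial), FINAL takes partials to the full result (merge, then terminate), and COMPLETE takes original rows straight to the full result (iterate, then terminate). The Python sketch below only simulates that call pattern for an average. The AvgEvaluator class and the run_mode helper are hypothetical names made up for illustration and are not part of Hive.

```python
class AvgEvaluator:
    """Toy stand-in for a UDAF evaluator computing an average (hypothetical, not Hive's API)."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def iterate(self, value):
        # Consume one original input row.
        self.total += value
        self.count += 1

    def terminate_partial(self):
        # Emit a partial aggregate as a (sum, count) pair.
        return (self.total, self.count)

    def merge(self, partial):
        # Consume one partial aggregate produced by terminate_partial.
        total, count = partial
        self.total += total
        self.count += count

    def terminate(self):
        # Emit the final result, or None when no rows were seen.
        return self.total / self.count if self.count else None


def run_mode(mode, inputs):
    """Apply the call pattern associated with a mode to a list of inputs."""
    ev = AvgEvaluator()
    # PARTIAL1 and COMPLETE read original rows; PARTIAL2 and FINAL read partials.
    consume = ev.iterate if mode in ("PARTIAL1", "COMPLETE") else ev.merge
    for item in inputs:
        consume(item)
    # PARTIAL1 and PARTIAL2 emit a partial; FINAL and COMPLETE emit the result.
    if mode in ("PARTIAL1", "PARTIAL2"):
        return ev.terminate_partial()
    return ev.terminate()


# Map side: two mappers each see original rows (PARTIAL1).
p1 = run_mode("PARTIAL1", [1.0, 2.0])
p2 = run_mode("PARTIAL1", [3.0, 6.0])
# Combiner: partials in, partial out (PARTIAL2).
combined = run_mode("PARTIAL2", [p1, p2])
# Reduce side: partials in, final result out (FINAL).
final = run_mode("FINAL", [combined])
# Single-stage aggregation: original rows in, final result out (COMPLETE).
complete = run_mode("COMPLETE", [1.0, 2.0, 3.0, 6.0])
```

The point of the sketch is that the partial representation, here a (sum, count) pair, can differ from both the input type and the final result type, which is why merge and iterate are separate methods.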

Re: Multiple insert statement and levels of aggregation

2010-10-15 Thread Ning Zhang
In the multi-insert statement, you cannot put another FROM clause. What you can do is to put both UDTFs in the FROM clause: FROM foo lateral view someUDTF(foo.a) as t1_a lateral view anotherUDTF(foo.a) as t2_a INSERT ... SELECT a,b,c,count(1), t1_a .. SELECT a,b,c,count(1), t2_a .. On Oct 15, 2

Multiple insert statement and levels of aggregation

2010-10-15 Thread Alex Boisvert
Hi, I'd like to write a multiple-insert select statement where I need to call different UDTFs and perform several levels of aggregation based on the result of the initial table, e.g., FROM (SELECT * from TABLE foo) foo INSERT OVERWRITE TABLE bar SELECT a, b, c, count(1) FROM (SELECT someUDTF(fo

Re: Help with last 30 day unique user query

2010-10-15 Thread Alex Boisvert
As far as I know, Hive has no built-in support for sliding-window analytics. There is an enhancement request here: https://issues.apache.org/jira/browse/HIVE-896 Without such support, the brute force way of doing things is, SELECT COUNT(DISTINCT us
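
To make the brute-force idea concrete outside of Hive, here is a minimal Python sketch of the computation being asked for: for each day in a range, count the distinct users seen in the 30 days ending on that day. It is not the query from this message (which is cut off in the archive). The function name trailing_unique_users and the sample events are hypothetical and invented purely for illustration.

```python
from datetime import date, timedelta


def trailing_unique_users(events, start, end, window_days=30):
    """Brute force: for each day in [start, end], count distinct users
    seen in the window_days days ending on that day (inclusive)."""
    results = {}
    day = start
    while day <= end:
        window_start = day - timedelta(days=window_days - 1)
        # Rescan every event for every day; simple but quadratic in the worst case.
        users = {user for when, user in events if window_start <= when <= day}
        results[day] = len(users)
        day += timedelta(days=1)
    return results


# Hypothetical (date, user_id) events, invented for illustration only.
events = [
    (date(2010, 9, 1), "u1"),
    (date(2010, 9, 20), "u2"),
    (date(2010, 10, 1), "u1"),
    (date(2010, 10, 15), "u3"),
]
counts = trailing_unique_users(events, date(2010, 10, 14), date(2010, 10, 15))
```

Each output day rescans its whole window, which is what makes the approach brute force: a user active on many days is counted again for every window that contains them.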

Re: Help with last 30 day unique user query

2010-10-15 Thread Ning Zhang
Sorry, I don't understand your question. I thought you were referring to the lack of a DATE type in Hive. HiveQL has syntax similar to SQL, such as count(distinct col). Your regular SQL query should work with the help of the UDFs I mentioned. On Oct 15, 2010, at 9:43 AM, Vijay wrote: Thank

Re: Help with last 30 day unique user query

2010-10-15 Thread Vijay
Thanks, Ning! Finding the date which is 30 days before/later was easy enough but my problem is beyond that. I need to find unique users based on these last 30 days for a range of days. Does that make sense? On Fri, Oct 15, 2010 at 12:10 AM, Ning Zhang wrote: > There are some UDFs that convert a

Need help to ignore corrupted gzipped files while doing a query

2010-10-15 Thread Parag Arora
Hello, I have a small question and need a little help with it. I have a Hive table which loads its data from files partitioned by timestamp (every 15 minutes) and placed there in gzipped format. Some of the gzip files may be corrupted (while transferring files, network error etc. may have r

Re: Got question after deploy hadoop-0.21.0

2010-10-15 Thread SingoWong
Hi, The first issue was solved, but the second warning message still exists... On Thu, Oct 14, 2010 at 5:03 PM, SingoWong wrote: > Hi, > > I have some questions after deploying hadoop-0.21.0 and need help. > This is a new deployment, not an update, and I executed start-hdfs.sh, > start-mapred.sh, got the messa

Re: Help with last 30 day unique user query

2010-10-15 Thread Ning Zhang
There are some UDFs that convert a string to epoch time and back to a string, e.g., select from_unixtime(unix_timestamp('2010-10-10', 'yyyy-MM-dd') + 60*60*24*30, 'yyyy-MM-dd') from src limit 1; will give you the date which is 30 days later than 2010-10-10. On Oct 14, 2010, at 11:36 PM, Vij
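
As a cross-check on the arithmetic, the same parse, shift, and format steps can be sketched in Python. The add_days helper is a hypothetical name used only for this illustration. Note one difference that is an assumption about the environment rather than something stated in the thread: this sketch shifts naive calendar dates, whereas adding seconds to an epoch value that is interpreted in a local time zone may land an hour off when the 30-day span crosses a daylight-saving change.

```python
from datetime import datetime, timedelta


def add_days(day_string, days, fmt="%Y-%m-%d"):
    """Parse a date string, shift it by whole calendar days, and format it back."""
    shifted = datetime.strptime(day_string, fmt) + timedelta(days=days)
    return shifted.strftime(fmt)


# 30 days after and 30 days before the date used in the message above.
later = add_days("2010-10-10", 30)
earlier = add_days("2010-10-10", -30)
```

Passing a negative day count gives the "30 days before" case, which is the direction needed for a trailing window.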