Multi-group-by with transform leads to incorrect optimization

2014-01-23 Thread Jan Dolinár
Hello, I've encountered an issue with Hive's predicate push-down optimization when multi-group-by is used together with transform. Here is a simple test case to illustrate my point: CREATE TABLE IF NOT EXISTS my_table ( id INT, property1 INT, property2 INT, count
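
Not the original test case, but a sketch of the general shape being discussed; the script name and the partition filter on ds are hypothetical:

  FROM (
    SELECT TRANSFORM (id, property1, property2)
    USING 'my_script.py'
    AS (id, property1, property2)
    FROM my_table
    WHERE ds = '2014-01-20'
  ) t
  INSERT OVERWRITE DIRECTORY '/tmp/out1' SELECT t.property1, count(*) GROUP BY t.property1
  INSERT OVERWRITE DIRECTORY '/tmp/out2' SELECT t.property2, count(*) GROUP BY t.property2;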

Bug when adding multiple partitions

2013-08-14 Thread Jan Dolinár
Hi everyone, Consider the following DDL: CREATE TABLE partition_test (a INT) PARTITIONED BY (b INT); ALTER TABLE partition_test ADD PARTITION (b=1) location '/tmp/test1' PARTITION (b=2) location '/tmp/test2'; Now let's have a look at what was created: DESCRIBE
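
To see where each partition actually ended up, something like this can be used (a sketch, reusing the names from the DDL above; compare the location reported for each partition with /tmp/test1 and /tmp/test2):

  SHOW PARTITIONS partition_test;
  DESCRIBE EXTENDED partition_test PARTITION (b=1);
  DESCRIBE EXTENDED partition_test PARTITION (b=2);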

Re: Calling same UDF multiple times in a SELECT query

2013-07-23 Thread Jan Dolinár
Hi, If you use the annotation, Hive should be able to optimize it to a single call: @UDFType(deterministic = true) The obvious condition is that it must always return the identical result when called with the same parameters. A little bit more on this can be found in Mark Grover's post at
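
For instance, with a hypothetical UDF my_udf marked deterministic, a query such as the following should only need to evaluate the function once per row even though it appears twice:

  SELECT my_udf(col1), my_udf(col1) + 1
  FROM my_table;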

Re: Run queries from external files as subqueries

2013-06-20 Thread Jan Dolinár
A quick and dirty way to do such a thing would be to use some kind of preprocessor. To avoid writing one, you could use e.g. the one from GCC, with just a little help from sed: gcc -E -x c query.hql -o- | sed '/#/d' > preprocessed.hql hive -f preprocessed.hql Where query.hql can contain for
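
As a purely illustrative example (file names hypothetical), query.hql could pull in a shared subquery via an ordinary #include, which gcc -E expands before Hive ever sees the file; sed then strips the remaining preprocessor linemarker lines:

  -- query.hql (hypothetical); subquery.hql holds a plain SELECT that #include pastes in
  SELECT t.uid, count(*)
  FROM (
  #include "subquery.hql"
  ) t
  GROUP BY t.uid;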

Re: Lookup objects in distributed cache

2013-04-04 Thread Jan Dolinár
Hello Vivek, GenericUDTF has method initialize() which is only called once per task. So if you read your files in this method and store the structures in memory then the overhead is relatively small (reading 15MB per mapper is negligible compared to several GB of processed data). Best regards,

Re: how to make data statistics efficiency in hive?

2013-03-27 Thread Jan Dolinár
Hi Andy, I'm not sure if I entirely understood your question, but I think you're looking for something like this: select concat(date,':',uid), sum(1) as total, sum(if(a=1,1,0)) AS ca, sum(if(b=1,1,0)) AS cb, sum(if(c=1,1,0)) AS cc from mytable group by uid, date; Query like

Re: Help in hive query

2012-10-10 Thread Jan Dolinár
Hi Manu, I believe the last group by q2.auth_count is wrong, because it causes the average to be computed only across lines with the same value of q2.auth_count, which is then of course equal to that value. Best regards, J. Dolinar On Wed, Oct 10, 2012 at 8:19 AM, Manu A hadoophi...@gmail.com wrote: Hi All, The
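
Roughly, the fix is to group only by the dimension you want the average over, not by the averaged value itself. A fully hypothetical sketch (the original tables are not shown in the thread):

  SELECT q2.user_type, AVG(q2.auth_count) AS avg_auth
  FROM (
    SELECT user_type, user_id, count(*) AS auth_count
    FROM auth_log
    GROUP BY user_type, user_id
  ) q2
  GROUP BY q2.user_type;  -- grouping by q2.auth_count as well would make the AVG trivial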

Re: Aggregate Multiple Columns

2012-08-16 Thread Jan Dolinár
Hi Richin, Do you mean summing the values from two columns into one number, or calculating the sum of each column in one query? Both are possible; the first can be done simply as SUM(col1 + col2), the second can be accomplished with two sums: sum(col1), sum(col2). Does that answer your
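
Both variants, sketched on a hypothetical table t:

  SELECT SUM(col1 + col2) FROM t;       -- one combined total
  SELECT SUM(col1), SUM(col2) FROM t;   -- two separate sums in a single query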

Re: load data into hive using shell script

2012-08-09 Thread Jan Dolinár
I'd suggest you read a bit about shell scripts... To make this work you have to pass the queries to hive via the -e option. The way you wrote it, the interactive CLI would be run and the shell would then try to execute the queries itself when the CLI ends. The correct script in your case should look
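
A minimal sketch of such a script; the file path, table name and queries are placeholders rather than the ones from the thread:

  #!/bin/bash
  hive -e "LOAD DATA LOCAL INPATH '/tmp/data.txt' INTO TABLE my_table;"
  hive -e "SELECT count(*) FROM my_table;"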

Re: Changing table types from managed to external

2012-08-07 Thread Jan Dolinár
Hi George, You can save yourself one copy. Just create a new external table with a different name, fill it with data (either by copying or with a query like INSERT OVERWRITE DIRECTORY '/new/table/path' SELECT * FROM oldtable), drop the old one and then rename the new one to the desired name: ALTER
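
Sketched out with hypothetical column names (one of several possible ways to do it):

  CREATE EXTERNAL TABLE newtable (a INT, b STRING) LOCATION '/new/table/path';
  INSERT OVERWRITE TABLE newtable SELECT * FROM oldtable;
  DROP TABLE oldtable;
  ALTER TABLE newtable RENAME TO oldtable;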

Re: Custom UserDefinedFunction in Hive

2012-08-07 Thread Jan Dolinár
? And it will not scan the whole table right? Raihan Jamal On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár dolik@gmail.com wrote: Hi Jamal, Check if the function really returns what it should and that your data are really in MMdd format. You can do this by simple query like

Re: Custom UserDefinedFunction in Hive

2012-08-07 Thread Jan Dolinár
Oops, sorry, I made a copy-paste mistake :) The annotation should read @UDFType(deterministic=true) Jan On Tue, Aug 7, 2012 at 7:37 PM, Jan Dolinár dolik@gmail.com wrote: I'm afraid that the query SELECT * FROM REALTIME where dt = yesterdaydate('MMdd') LIMIT 10; will scan the entire

Re: Custom UserDefinedFunction in Hive

2012-08-07 Thread Jan Dolinár
cannot use hiveconf here as I am working with Hive 0.6 Raihan Jamal On Tue, Aug 7, 2012 at 10:37 AM, Jan Dolinár dolik@gmail.com wrote: I'm afraid that the query SELECT * FROM REALTIME where dt = yesterdaydate('MMdd') LIMIT 10; will scan the entire table, because

Re: Custom UserDefinedFunction in Hive

2012-08-07 Thread Jan Dolinár
) LIMIT 10; Raihan Jamal On Tue, Aug 7, 2012 at 11:18 AM, Jan Dolinár dolik@gmail.com wrote: By the way, even without hiveconf, you can run hive from the shell like this to achieve what you want using shell capabilities: hive -e "CREATE TEMPORARY FUNCTION yesterdaydate
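
A sketch of one way to use shell capabilities here, without hiveconf; it assumes GNU date and reuses the REALTIME table and dt column mentioned in the thread, computing yesterday's date in the shell instead of in a UDF:

  d=$(date -d yesterday +%Y%m%d)
  hive -e "SELECT * FROM realtime WHERE dt = '$d' LIMIT 10;"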

Re: Running the HiveQL from the shell prompt.

2012-08-07 Thread Jan Dolinár
The shell will interpret the query in your command as SELECT ... explode(split(timestamps, #)) ... if you run it the way you wrote it, i.e. without the quotation. The way around this is to either escape the quotation marks or use single quotes: hive -e 'SELECT user_id, product_id
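
A sketch of the quoting only (the full query from the thread is not reproduced here; the table and column names are placeholders). Single quotes around the whole -e argument stop the shell from touching the inner double quotes:

  hive -e 'SELECT user_id, product_id, exp_ts FROM my_table LATERAL VIEW explode(split(timestamps, "#")) t AS exp_ts;'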

Re: Custom UserDefinedFunction in Hive

2012-08-06 Thread Jan Dolinár
Hi Jamal, Check if the function really returns what it should and that your data are really in MMdd format. You can do this with a simple query like this: SELECT dt, yesterdaydate('MMdd') FROM REALTIME LIMIT 1; I don't see anything wrong with the function itself; it works well for me

Re: Performance tuning a hive query

2012-07-19 Thread Jan Dolinár
There are many ways, but beware that some of them may result in worse performance when used inappropriately. Some of the settings we use to achieve faster queries: hive.map.aggr=true hive.exec.parallel=true hive.exec.compress.intermediate=true mapred.job.reuse.jvm.num.tasks=-1 Structuring the
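
Applied per session, the settings listed above look like this; whether they actually help depends heavily on the data and the queries:

  SET hive.map.aggr=true;
  SET hive.exec.parallel=true;
  SET hive.exec.compress.intermediate=true;
  SET mapred.job.reuse.jvm.num.tasks=-1;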

Re: separator within primitive fields

2012-06-28 Thread Jan Dolinár
Hi, You can use something like where instr(field1, 'test1')=0 It is simple but not very reliable. A better solution is probably where array_contains(split(field1, ' \| '),'test1') I have not tested these, so you might need to tweak them a bit, but it is well described in the manual:
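
A hedged sketch of the array_contains variant (also untested here; the table name is a placeholder, and the backslash usually needs doubling so that split() receives an escaped pipe in its regex argument):

  SELECT * FROM my_table WHERE array_contains(split(field1, '\\|'), 'test1');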

Re: join string in hive udf

2012-06-26 Thread Jan Dolinár
Hi, Check the hadoop logs of the failed task. My best guess is that there is an uncaught exception thrown somewhere in your code. The logs will tell where and what caused the problem. Best regards, Jan On Tue, Jun 26, 2012 at 4:20 PM, Yue Guan pipeha...@gmail.com wrote: Hi, hive users I

UDTF fails when used in LATERAL VIEW

2012-06-21 Thread Jan Dolinár
Hi, I've hit problems when writing a custom UDTF that should return string values. I couldn't find anywhere what type the values that get forward()ed to the collector should have. The only info I could dig out from Google was a few blogs with examples and 4 UDTFs that are among the Hive sources. From

Re: UDTF fails when used in LATERAL VIEW

2012-06-21 Thread Jan Dolinár
:-) Have you tried returning a Text value instead of String? At least in the case of UDFs, returning Text instead of Strings is possible and recommended too. I would think it would be the same case with UDTFs. Mark - Original Message - From: Jan Dolinár dolik@gmail.com To: user

Re: Quering RDBMS table in a Hive query

2012-06-15 Thread Jan Dolinár
On 6/15/12, Ruslan Al-Fakikh ruslan.al-fak...@jalent.ru wrote: I didn't know InputFormat and LineReader could help, though I didn't look at them closely. I was thinking about implementing a Table-Generating Function (UDTF) if there is no already implemented solution. Both are possible,

Re: Quering RDBMS table in a Hive query

2012-06-14 Thread Jan Dolinár
Hi Ruslan, I've been in a similar situation and solved it by writing a custom InputFormat and LineReader that loads the data from MySQL in the constructor. In my case I use it just to check value ranges and similar stuff. If you want to join the data with what's in your HDFS files, you can do that as

Re: Multi-group-by select always scans entire table

2012-06-07 Thread Jan Dolinár
On 6/7/12, Mark Grover grover.markgro...@gmail.com wrote: Can you please check if predicate push down enabled changes the explain plan on a simple inner join query like: select a.* from a inner join b on(a.key=b.key) where a.some_col=blah; No problem, I ran the following as you suggested (INNER
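
One way to run the comparison being asked for is to toggle the setting and diff the two plans; a sketch using the table and column names from the quoted query:

  SET hive.optimize.ppd=true;
  EXPLAIN SELECT a.* FROM a INNER JOIN b ON (a.key = b.key) WHERE a.some_col = 'blah';
  SET hive.optimize.ppd=false;
  EXPLAIN SELECT a.* FROM a INNER JOIN b ON (a.key = b.key) WHERE a.some_col = 'blah';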

Re: Multi-group-by select always scans entire table

2012-06-07 Thread Jan Dolinár
Thank you very much, Mark, for your investigation and explanations. I'm well aware of the fact that Hive 0.7.1 is quite old code and that a newer version might perform better - that is the main reason I discussed it here instead of reporting it as a bug. For now it doesn't bother me, as I have

Re: Multi-group-by select always scans entire table

2012-06-06 Thread Jan Dolinár
columns. If it doesn't work for both partition and non-partition columns, then obviously Hive-1056 is not working for you. We can take it up on the mailing list from there. Thanks for your input, Jan. Mark On Tue, Jun 5, 2012 at 1:19 AM, Jan Dolinár dolik@gmail.com wrote: On Mon

Re: Multi-GroupBy-Insert optimization

2012-06-05 Thread Jan Dolinár
or even make the calculation longer. Hope that helps... I can't think of anything else that could help you. Jan On 6/5/12, Jan Dolinár dolik@gmail.com wrote: On 6/4/12, shan s mysub...@gmail.com wrote: Thanks for the explanation Jan. If I understand correctly, the input will be read one

Re: Multi-GroupBy-Insert optimization

2012-06-04 Thread Jan Dolinár
On Fri, Jun 1, 2012 at 5:25 PM, shan s mysub...@gmail.com wrote: I am using Multi-GroupBy-Insert. I was expecting a single map-reduce job which would club the group-bys together. However it is scheduling n jobs where n = number of group bys.. Could you please explain this behaviour. No,
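
For reference, the construct under discussion is the multi-insert form where several GROUP BYs share one FROM clause; a sketch with hypothetical table names:

  FROM src
  INSERT OVERWRITE TABLE dest1 SELECT key, count(*) GROUP BY key
  INSERT OVERWRITE TABLE dest2 SELECT value, count(*) GROUP BY value;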

Re: Multi-group-by select always scans entire table

2012-06-04 Thread Jan Dolinár
On Mon, Jun 4, 2012 at 7:20 PM, Mark Grover mgro...@oanda.com wrote: Hi Jan, Glad you found something workable. What version of Hive are you using? Could you also please check what the value of the property hive.optimize.ppd is for you? Thanks, Mark Hi Mark, Thanks for reply. I'm using

Re: Multi-GroupBy-Insert optimization

2012-06-04 Thread Jan Dolinár
On 6/4/12, shan s mysub...@gmail.com wrote: Thanks for the explanation Jan. If I understand correctly, the input will be read one single time and will be preprocessed in some form, and this intermediate data is used for subsequent group-by.. Not sure if my scenario will help this single

Re: Multi-group-by select always scans entire table

2012-05-28 Thread Jan Dolinár
On Fri, May 25, 2012 at 12:03 PM, Jan Dolinár dolik@gmail.com wrote: -- see what happens when you try to perform multi-group-by query on one of the partitions EXPLAIN EXTENDED FROM partition_test LATERAL VIEW explode(col1) tmp AS exp_col1 INSERT OVERWRITE DIRECTORY '/test/1' SELECT

Multi-group-by select always scans entire table

2012-05-25 Thread Jan Dolinár
Hello, I've encountered a weird issue with Hive and I'm not sure if I'm doing something wrong or if it is a bug. I'm trying to do a multi-group-by select statement on a partitioned table. I want only data from one partition, therefore all the WHERE statements are exactly the same and contain
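
The shape of the query in question, as a rough sketch rather than the original (the ds partition column and its value are hypothetical; the output paths follow the related message above): every branch filters on the same single partition, yet the whole table gets scanned:

  FROM my_table
  INSERT OVERWRITE DIRECTORY '/test/1'
    SELECT property1, count(*) WHERE ds = '2012-05-24' GROUP BY property1
  INSERT OVERWRITE DIRECTORY '/test/2'
    SELECT property2, count(*) WHERE ds = '2012-05-24' GROUP BY property2;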