Hello,
I've encountered an issue with Hive's predicate pushdown optimization when
multi-group-by is used together with TRANSFORM. Here is a simple test case
to illustrate my point:
CREATE TABLE IF NOT EXISTS my_table (
  id INT,
  property1 INT,
  property2 INT,
  count
Hi everyone,
Consider the following DDL:
CREATE TABLE partition_test
(a INT)
PARTITIONED BY (b INT);
ALTER TABLE partition_test ADD
PARTITION (b=1) location '/tmp/test1'
PARTITION (b=2) location '/tmp/test2';
Now let's have a look at what was created:
DESCRIBE
Hi,
If you use the annotation, Hive should be able to optimize it to a single call:
@UDFType(deterministic = true)
The obvious condition is that it must always return the identical result
when called with the same parameters.
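For illustration, a minimal sketch of where the annotation goes (the class
and its body are hypothetical, just a date-formatting UDF):

import java.text.SimpleDateFormat;
import java.util.Calendar;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;

// deterministic = true tells Hive the function may be evaluated once
// and its result reused, instead of being called again for every row.
@UDFType(deterministic = true)
public class YesterdayDate extends UDF {
  public String evaluate(String format) {
    Calendar cal = Calendar.getInstance();
    cal.add(Calendar.DAY_OF_MONTH, -1);
    return new SimpleDateFormat(format).format(cal.getTime());
  }
}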
A little bit more on this can be found in Mark Grover's post at
A quick and dirty way to do such a thing would be to use some kind of
preprocessor. To avoid writing one, you could use e.g. the one from GCC,
with just a little help from sed:
gcc -E -x c query.hql | sed '/#/d' > preprocessed.hql
hive -f preprocessed.hql
where query.hql can contain, for example, C-style #include or #define directives.
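For instance, query.hql could be something like this (a hypothetical sketch;
gcc expands the macros before Hive ever sees the file):

#define TARGET_TABLE my_database.my_table
#define ROW_LIMIT 100

SELECT * FROM TARGET_TABLE LIMIT ROW_LIMIT;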
Hello Vivek,
GenericUDTF has a method initialize() which is called only once per task. So
if you read your files in this method and store the structures in memory,
the overhead is relatively small (reading 15 MB per mapper is
negligible compared to several GB of processed data).
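A minimal sketch of the idea (the class name and the returned column are
hypothetical; the actual file reading is elided). The point is that the
expensive read happens once, in initialize(), not per row:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class LookupUDTF extends GenericUDTF {
  private Map<String, String> lookup;

  @Override
  public StructObjectInspector initialize(ObjectInspector[] args)
      throws UDFArgumentException {
    // Called once per task: load the side file into memory here so that
    // the per-row cost in process() is just a hash map lookup.
    lookup = new HashMap<String, String>();
    // ... read the ~15 MB file and populate 'lookup' ...

    List<String> fieldNames = new ArrayList<String>();
    List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
    fieldNames.add("looked_up_value");
    fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
    return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
  }

  @Override
  public void process(Object[] args) throws HiveException {
    String key = String.valueOf(args[0]);
    forward(new Object[] { lookup.get(key) });
  }

  @Override
  public void close() throws HiveException {
    // nothing to release
  }
}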
Best regards,
Hi Andy,
I'm not sure if I entirely understood your question, but I think you're
looking for something like this:
select
  concat(date, ':', uid),
  sum(1) AS total,
  sum(if(a=1,1,0)) AS ca,
  sum(if(b=1,1,0)) AS cb,
  sum(if(c=1,1,0)) AS cc
from mytable
group by uid, date;
Query like
Hi Manu,
I believe the last GROUP BY q2.auth_count is wrong, because it
causes the average to be computed only across rows with the same value of
q2.auth_count, which is then of course equal to that value.
Best regards,
J. Dolinar
On Wed, Oct 10, 2012 at 8:19 AM, Manu A hadoophi...@gmail.com wrote:
Hi All,
The
Hi Richin,
Do you mean summing the values from two columns into one number, or
calculating the sums of both columns separately in one query? Both are
possible: the first can be done simply as SUM(col1 + col2), the second can
be accomplished with two sums: SUM(col1), SUM(col2). Does that answer your
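For instance, with a hypothetical table t holding numeric columns col1 and
col2, the two variants would look like this:

-- one number: the columns added together row by row, then summed
SELECT SUM(col1 + col2) FROM t;
-- two numbers: each column summed separately, still a single query
SELECT SUM(col1), SUM(col2) FROM t;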
I'd suggest you read a bit about shell scripts... To make this work you
have to pass the queries to Hive via the -e option. The way you wrote it,
the interactive CLI would be run and the queries would be attempted to be
executed by the shell when the CLI ends. The correct script in your case should
look
Hi George,
You can save yourself one copy. Just create a new external table with a
different name, fill it with data (either by copying, or with a query like
INSERT OVERWRITE DIRECTORY '/new/table/path' SELECT * FROM oldtable), drop
the old one and then rename the new one to the desired name:
ALTER
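A sketch of the whole sequence (table names, schema and path are hypothetical):

-- new external table at the new location
CREATE EXTERNAL TABLE newtable (id INT)
LOCATION '/new/table/path';
-- fill it with the old table's data
INSERT OVERWRITE DIRECTORY '/new/table/path'
SELECT * FROM oldtable;
-- drop the old table and take over its name
DROP TABLE oldtable;
ALTER TABLE newtable RENAME TO oldtable;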
? And it will not
scan the whole table, right?
Raihan Jamal
On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár dolik@gmail.com wrote:
Hi Jamal,
Check if the function really returns what it should and that your data
are really in MMdd format. You can do this by a simple query like
Oops, sorry, I made a copy-paste mistake :) The annotation should read
@UDFType(deterministic=true)
Jan
On Tue, Aug 7, 2012 at 7:37 PM, Jan Dolinár dolik@gmail.com wrote:
I'm afraid that the query
SELECT * FROM REALTIME where dt = yesterdaydate('MMdd') LIMIT 10;
will scan the entire
cannot use hiveconf here as I am working with Hive 0.6
Raihan Jamal
On Tue, Aug 7, 2012 at 10:37 AM, Jan Dolinár dolik@gmail.com wrote:
I'm afraid that the query
SELECT * FROM REALTIME where dt = yesterdaydate('MMdd') LIMIT 10;
will scan the entire table, because
) LIMIT 10;
Raihan Jamal
On Tue, Aug 7, 2012 at 11:18 AM, Jan Dolinár dolik@gmail.com wrote:
By the way, even without hiveconf, you can run Hive from the shell like this
to achieve what you want using shell capabilities:
hive -e CREATE TEMPORARY FUNCTION yesterdaydate
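For example, something along these lines (a sketch; REALTIME and dt are taken
from your query, but adjust the date format string to match your actual
partition values):

hive -e "SELECT * FROM REALTIME WHERE dt = '$(date --date=yesterday +%Y%m%d)' LIMIT 10;"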
The shell will interpret the query in your command as SELECT
... explode(split(timestamps, #)) ... if you run it the way you wrote
it, i.e. without the quotation. The way around this is to either escape the
quotation marks or use single quotes:
hive -e 'SELECT user_id
,product_id
Hi Jamal,
Check if the function really returns what it should and that your data are
really in MMdd format. You can do this by a simple query like this:
SELECT dt, yesterdaydate('MMdd') FROM REALTIME LIMIT 1;
I don't see anything wrong with the function itself, it works well for me
There are many ways, but beware that some of them may result in worse
performance when used inappropriately.
Some of the settings we use to achieve faster queries:
hive.map.aggr=true
hive.exec.parallel=true
hive.exec.compress.intermediate=true
mapred.job.reuse.jvm.num.tasks=-1
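For example, they can be applied per session, at the top of a script:

SET hive.map.aggr=true;
SET hive.exec.parallel=true;
SET hive.exec.compress.intermediate=true;
SET mapred.job.reuse.jvm.num.tasks=-1;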
Structuring the
Hi,
You can use something like
where instr(field1, 'test1') = 0
It is simple but not very reliable. A better solution is probably
where array_contains(split(field1, ' \| '), 'test1')
I have not tested these, so you might need to tweak them a bit, but it is
all well described in the manual:
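One caveat worth noting: split() takes a regular expression, so splitting on
a literal ' | ' may need the backslash itself escaped inside the string
literal. A sketch (mytable is hypothetical):

SELECT * FROM mytable
WHERE array_contains(split(field1, ' \\| '), 'test1');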
Hi,
Check the Hadoop logs of the failed task. My best guess is that there is an
uncaught exception thrown somewhere in your code. The logs will tell you where
and what caused the problem.
Best regards,
Jan
On Tue, Jun 26, 2012 at 4:20 PM, Yue Guan pipeha...@gmail.com wrote:
Hi, hive users
I
Hi,
I've hit problems when writing a custom UDTF that should return string
values. I couldn't find anywhere what type the values that get
forward()ed to the collector should have. The only info I could dig out from
Google was a few blogs with examples and the 4 UDTFs that are among the Hive
sources. From
:-)
Have you tried returning a Text value instead of String? At least in the case
of UDFs, returning Text instead of String is possible and recommended too.
I would think it would be the same case with UDTFs.
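For a plain UDF the pattern looks roughly like this (a hypothetical sketch;
reusing a single Text instance also avoids allocating a new object per row):

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class Upper extends UDF {
  private final Text result = new Text();

  public Text evaluate(Text input) {
    if (input == null) {
      return null;
    }
    // Reuse the same Text object across calls instead of
    // creating a new String for every row.
    result.set(input.toString().toUpperCase());
    return result;
  }
}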
Mark
- Original Message -
From: Jan Dolinár dolik@gmail.com
To: user
On 6/15/12, Ruslan Al-Fakikh ruslan.al-fak...@jalent.ru wrote:
I didn't know InputFormat and LineReader could help, though I didn't
look at them closely. I was thinking about implementing a
Table-Generating Function (UDTF) if there is no already implemented
solution.
Both are possible,
Hi Ruslan,
I've been in a similar situation and solved it by writing a custom
InputFormat and LineReader that load the data from MySQL in the
constructor. In my case I use it just to check value ranges and
similar stuff. If you want to join the data with what's in your HDFS
files, you can do that as
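A rough skeleton of that approach using the old mapred API (all class names
and the JDBC URL are hypothetical; the actual loading and checking are elided):

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

public class MySqlCheckedInputFormat extends TextInputFormat {

  @Override
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new CheckedLineReader(job, (FileSplit) split);
  }

  public static class CheckedLineReader extends LineRecordReader {
    private final Map<String, String> reference = new HashMap<String, String>();

    public CheckedLineReader(JobConf job, FileSplit split) throws IOException {
      super(job, split);
      // Load the reference data once, when the reader is constructed.
      try {
        Connection conn = DriverManager.getConnection("jdbc:mysql://..."); // hypothetical URL
        // ... query MySQL and populate 'reference' ...
        conn.close();
      } catch (SQLException e) {
        throw new IOException(e);
      }
    }

    @Override
    public boolean next(LongWritable key, Text value) throws IOException {
      boolean hasNext = super.next(key, value);
      // Each line can be checked or filtered against 'reference'
      // here before it is handed to Hive.
      return hasNext;
    }
  }
}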
On 6/7/12, Mark Grover grover.markgro...@gmail.com wrote:
Can you please check if predicate push down enabled changes the explain
plan on a simple inner join query like:
select a.* from a inner join b on(a.key=b.key) where a.some_col=blah;
No problem, I ran the following as you suggested (INNER
Thank you very much Mark for your investigation and explanations.
I'm well aware of the fact that Hive 0.7.1 is quite old code and
that a newer version might perform better - that is the main reason I
discussed it here instead of reporting it as a bug. For now it doesn't
bother me, as I have
columns.
If it doesn't work for both partition and non-partition columns, then
obviously HIVE-1056 is not working for you. We can take it up on the
mailing list from there.
Thanks for your input, Jan.
Mark
On Tue, Jun 5, 2012 at 1:19 AM, Jan Dolinár dolik@gmail.com wrote:
On Mon
or even make the calculation longer.
Hope that helps... I can't think of anything else that could help you.
Jan
On 6/5/12, Jan Dolinár dolik@gmail.com wrote:
On 6/4/12, shan s mysub...@gmail.com wrote:
Thanks for the explanation Jan.
If I understand correctly, the input will be read one
On Fri, Jun 1, 2012 at 5:25 PM, shan s mysub...@gmail.com wrote:
I am using Multi-GroupBy-Insert. I was expecting a single map-reduce job
which would club the group-bys together.
However, it is scheduling n jobs, where n = the number of group-bys.
Could you please explain this behaviour?
No,
On Mon, Jun 4, 2012 at 7:20 PM, Mark Grover mgro...@oanda.com wrote:
Hi Jan,
Glad you found something workable.
What version of Hive are you using? Could you also please check what the
value of the property hive.optimize.ppd is for you?
Thanks,
Mark
Hi Mark,
Thanks for the reply. I'm using
On 6/4/12, shan s mysub...@gmail.com wrote:
Thanks for the explanation Jan.
If I understand correctly, the input will be read one single time and will
be preprocessed in some form, and this intermediate data is used for the
subsequent group-bys.
Not sure if my scenario will help this single
On Fri, May 25, 2012 at 12:03 PM, Jan Dolinár dolik@gmail.com wrote:
-- see what happens when you try to perform a multi-group-by query on one of
the partitions
EXPLAIN EXTENDED
FROM partition_test
LATERAL VIEW explode(col1) tmp AS exp_col1
INSERT OVERWRITE DIRECTORY '/test/1'
SELECT
Hello,
I've encountered a weird issue with Hive and I'm not sure if I'm doing
something wrong or if it is a bug. I'm trying to do a multi-group-by select
statement on a partitioned table. I want only data from one partition,
therefore all the WHERE statements are exactly the same and contain