Re: how to avoid scan the same table multi times?

2012-01-13 Thread Xiaobin She
hi,

I use the multiple inserts method, and I write an sql like this:

from td
INSERT OVERWRITE  DIRECTORY '/tmp/total.out' select count(v1)
INSERT OVERWRITE  DIRECTORY '/tmp/totaldistinct.out' select count(distinct
v1)
INSERT OVERWRITE  DIRECTORY '/tmp/distinctuin.out' select distinct v1

INSERT OVERWRITE  DIRECTORY '/tmp/v4.out' select v4 , count(v1),
count(distinct v1) group by v4
INSERT OVERWRITE  DIRECTORY '/tmp/v3v4.out' select v3, v4 , count(v1),
count(distinct v1) group by v3, v4

INSERT OVERWRITE  DIRECTORY '/tmp/v426.out' select count(v1),
count(distinct v1)  where v4=2 or v4=6
INSERT OVERWRITE  DIRECTORY '/tmp/v3v426.out' select v3, count(v1),
count(distinct v1) where v4=2 or v4=6 group by v3

INSERT OVERWRITE  DIRECTORY '/tmp/v415.out' select count(v1),
count(distinct v1)  where v4=1 or v4=5
INSERT OVERWRITE  DIRECTORY '/tmp/v3v415.out' select v3, count(v1),
count(distinct v1) where v4=1 or v4=5 group by v3


it works, and the output result is what I want.

but there is one problem, hive generate 9 mapreduce jobs and run these jobs
one by one.

I run explain on this query, and I got the following message:


STAGE DEPENDENCIES:
  Stage-9 is a root stage
  Stage-0 depends on stages: Stage-9
  Stage-10 depends on stages: Stage-9
  Stage-1 depends on stages: Stage-10
  Stage-11 depends on stages: Stage-9
  Stage-2 depends on stages: Stage-11
  Stage-12 depends on stages: Stage-9
  Stage-3 depends on stages: Stage-12
  Stage-13 depends on stages: Stage-9
  Stage-4 depends on stages: Stage-13
  Stage-14 depends on stages: Stage-9
  Stage-5 depends on stages: Stage-14
  Stage-15 depends on stages: Stage-9
  Stage-6 depends on stages: Stage-15
  Stage-16 depends on stages: Stage-9
  Stage-7 depends on stages: Stage-16
  Stage-17 depends on stages: Stage-9
  Stage-8 depends on stages: Stage-17


it seems that stage 9-17 is corresponding to mapreduce job 0-8
but from the explain message above, stage 10-17 only depends on stage 9,
so I have an question, why job 1-8 can't run concurrently?

Or how can I make job 1-8 run concurrently?

thank you very much for your help again!

xiaobin



在 2012年1月13日 下午12:01,Xiaobin She xiaobin...@gmail.com写道:

 to Martin, Mark and Edward,

 thank you for your advices, I will try it out.

 And to Martin, by appropriate data format, do you mean something like 
 2012011202 ?

 thanks!

 xiaobin

 在 2012年1月12日 下午10:20,Martin Kuhn martin.k...@affinitas.de写道:

 Hi there,

  Select count(*), count(distinct u), type from t group by type where
 plat=1 and dt=”2012-1-12-02”
  Select count(*), count(distinct u), type from t where (type =2 or type
 =6) and dt=”2012-1-12-02” group by type;

  Is there a better way to do these queries?

 You could try something like this:

 SELECT
type
  , count(*)
  , count(DISTINCT u)
  , count(CASE WHEN plat=1 THEN u ELSE NULL)
  , count(DISTINCT CASE WHEN plat=1 THEN u ELSE NULL)
  , count(CASE WHEN (type=2 OR type=6) THEN u ELSE NULL)
  , count(DISTINCT CASE WHEN (type=2 OR type=6) THEN u ELSE NULL)
 FROM
t
 WHERE
dt in (2012-1-12-02, 2012-1-12-03)
 GROUP BY
type
 ORDER BY
type
 ;

 Good luck :)
 Martin Kuhn


 P.S.  You'ge got a strange date format there. For sorting purposes it
 would be more appropriate to use something like 2012-01-12-02.





Re: how to avoid scan the same table multi times?

2012-01-13 Thread Mark Grover
Xiaobin,
I would assume that's what Martin meant. You can always add dashes to make it 
more readable:-)

Mark Grover, Business Intelligence Analyst
OANDA Corporation

www: oanda.com www: fxtrade.com
e: mgro...@oanda.com

Best Trading Platform - World Finance's Forex Awards 2009.
The One to Watch - Treasury Today's Adam Smith Awards 2009.


- Original Message -
From: Xiaobin She xiaobin...@gmail.com
To: user@hive.apache.org, 佘晓彬 xiaobin...@gmail.com
Sent: Thursday, January 12, 2012 11:01:19 PM
Subject: Re: how to avoid scan the same table multi times?

to Martin, Mark and Edward,

thank you for your advices, I will try it out.

And to Martin, by appropriate data format, do you mean something like 
2012011202 ?

thanks!

xiaobin


在 2012年1月12日 下午10:20,Martin Kuhn  martin.k...@affinitas.de  写道:


Hi there,


 Select count(*), count(distinct u), type from t group by type where plat=1 
 and dt=”2012-1-12-02”

 Select count(*), count(distinct u), type from t where (type =2 or type =6) 
 and dt=”2012-1-12-02” group by type;


 Is there a better way to do these queries?

You could try something like this:

SELECT
type
, count(*)
, count(DISTINCT u)
, count(CASE WHEN plat=1 THEN u ELSE NULL)
, count(DISTINCT CASE WHEN plat=1 THEN u ELSE NULL)
, count(CASE WHEN (type=2 OR type=6) THEN u ELSE NULL)
, count(DISTINCT CASE WHEN (type=2 OR type=6) THEN u ELSE NULL)
FROM
t
WHERE
dt in (2012-1-12-02, 2012-1-12-03)
GROUP BY
type
ORDER BY
type
;

Good luck :)
Martin Kuhn


P.S. You'ge got a strange date format there. For sorting purposes it would be 
more appropriate to use something like  2012-01-12-02 .




Re: how to avoid scan the same table multi times?

2012-01-12 Thread Martin Kuhn
Hi there,

 Select count(*), count(distinct u), type from t group by type where plat=1 
 and dt=”2012-1-12-02”
 Select count(*), count(distinct u), type from t where (type =2 or type =6) 
 and dt=”2012-1-12-02” group by type;

 Is there a better way to do these queries?

You could try something like this:

SELECT
type
  , count(*)
  , count(DISTINCT u)
  , count(CASE WHEN plat=1 THEN u ELSE NULL)
  , count(DISTINCT CASE WHEN plat=1 THEN u ELSE NULL)
  , count(CASE WHEN (type=2 OR type=6) THEN u ELSE NULL)
  , count(DISTINCT CASE WHEN (type=2 OR type=6) THEN u ELSE NULL)
FROM
t
WHERE
dt in (2012-1-12-02, 2012-1-12-03)
GROUP BY
type
ORDER BY
type
;

Good luck :)
Martin Kuhn


P.S.  You'ge got a strange date format there. For sorting purposes it would be 
more appropriate to use something like 2012-01-12-02.



Re: how to avoid scan the same table multi times?

2012-01-12 Thread Mark Grover
Hi Xiaobin,
There is functionality like (from: 
https://cwiki.apache.org/Hive/languagemanual-dml.html):
Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...

This will do one pass through your table's partitions but group and filter them 
according to different criteria. As you become more proficient with Hive, you 
could consider creating an index on some columns or bucketing on those columns 
(for optimization in joins, for example) if you end up 
filtering/grouping/joining on them over and over again.

Good luck!

Mark Grover, Business Intelligence Analyst
OANDA Corporation

www: oanda.com www: fxtrade.com
e: mgro...@oanda.com

Best Trading Platform - World Finance's Forex Awards 2009.
The One to Watch - Treasury Today's Adam Smith Awards 2009.


- Original Message -
From: Martin Kuhn martin.k...@affinitas.de
To: user@hive.apache.org
Sent: Thursday, January 12, 2012 9:20:00 AM
Subject: Re: how to avoid scan the same table multi times?

Hi there,

 Select count(*), count(distinct u), type from t group by type where plat=1 
 and dt=”2012-1-12-02”
 Select count(*), count(distinct u), type from t where (type =2 or type =6) 
 and dt=”2012-1-12-02” group by type;

 Is there a better way to do these queries?

You could try something like this:

SELECT
type
  , count(*)
  , count(DISTINCT u)
  , count(CASE WHEN plat=1 THEN u ELSE NULL)
  , count(DISTINCT CASE WHEN plat=1 THEN u ELSE NULL)
  , count(CASE WHEN (type=2 OR type=6) THEN u ELSE NULL)
  , count(DISTINCT CASE WHEN (type=2 OR type=6) THEN u ELSE NULL)
FROM
t
WHERE
dt in (2012-1-12-02, 2012-1-12-03)
GROUP BY
type
ORDER BY
type
;

Good luck :)
Martin Kuhn


P.S.  You'ge got a strange date format there. For sorting purposes it would be 
more appropriate to use something like 2012-01-12-02.



Re: how to avoid scan the same table multi times?

2012-01-12 Thread Edward Capriolo
From:
https://cwiki.apache.org/Hive/languagemanual-dml.html#LanguageManualDML-InsertingdataintoHiveTablesfromqueries

Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1,
partcol2=val2 ...)] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ...] select_statement2]
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;
FROM from_statement
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2
...)] select_statement1
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2]
[INSERT OVERWRITE TABLE tablename2 [PARTITION ...] select_statement2] ...;



2012/1/12 Xiaobin She xiaobin...@gmail.com

 Hello, everyone,

 I'm new to hive, and I got some questions.

 I have a table like this:

 create table t(id int, time string, ip string, u bigint, ret int, plat
 int, type int, u2 bigint, ver int)  PARTITIONED BY(dt STRING)   ROW FORMAT
 DELIMITED FIELDS TERMINATED BY ','  lines TERMINATED BY '\n' ;

 and I will do lots of query on this table base on different value of the
 column, like:


 Select count(*), count(distinct u), type from t group by type where plat=1
 and dt=”2012-1-12-02”

 Select count(*), count(distinct u), type from t group by type where plat=2
 and dt=”2012-1-12-02”

 Select count(*), count(distinct u), type from t where (type =2 or type =6)
 and dt=”2012-1-12-02” group by type;

 Select count(*), count(distinct u), type from t where (type =1 or type =5)
 and dt=”2012-1-12-02” group by type;

 Select count(*), count(distinct u), type from t where (type =1 or type =5)
 and (dt=”2012-1-12-02” and dt=”2012-1-12-03”) group by type;

 but these queries seems not so effective, because they query on the same
 table for multiple times, and that meas it will scan the same files for
 many times.

 And my question is , how can I avoid this?
 Is there a better way to do these queries?

 Thank you very much for your help!




Re: how to avoid scan the same table multi times?

2012-01-12 Thread Xiaobin She
to Martin, Mark and Edward,

thank you for your advices, I will try it out.

And to Martin, by appropriate data format, do you mean something like
2012011202 ?

thanks!

xiaobin

在 2012年1月12日 下午10:20,Martin Kuhn martin.k...@affinitas.de写道:

 Hi there,

  Select count(*), count(distinct u), type from t group by type where
 plat=1 and dt=”2012-1-12-02”
  Select count(*), count(distinct u), type from t where (type =2 or type
 =6) and dt=”2012-1-12-02” group by type;

  Is there a better way to do these queries?

 You could try something like this:

 SELECT
type
  , count(*)
  , count(DISTINCT u)
  , count(CASE WHEN plat=1 THEN u ELSE NULL)
  , count(DISTINCT CASE WHEN plat=1 THEN u ELSE NULL)
  , count(CASE WHEN (type=2 OR type=6) THEN u ELSE NULL)
  , count(DISTINCT CASE WHEN (type=2 OR type=6) THEN u ELSE NULL)
 FROM
t
 WHERE
dt in (2012-1-12-02, 2012-1-12-03)
 GROUP BY
type
 ORDER BY
type
 ;

 Good luck :)
 Martin Kuhn


 P.S.  You'ge got a strange date format there. For sorting purposes it
 would be more appropriate to use something like 2012-01-12-02.