The query is correct but since u r creating a managed table , that is possibly 
creating some issue and the records are not all getting created

This is what I would propose

CHECKPOINT  1 : Is this query running at all ?
===================================
Use this option in BOLD and run the QUERY ONLY (without any table creation) to 
log errors and pipe to a log file by using nohup or some other way that u prefer
hive -hiveconf hive.root.logger=INFO,console -e

select a.evar23,sum(b.max_visit_page_num) from
(select distinct visid_high,visid_low,evar23 from web.omniture_web_data) a
JOIN
(select visid_high,visid_low,max(visit_page_num) as max_visit_page_num from 
omniture_web_data group by visid_high,visid_low) b
where a.visid_high=b.visid_high and a.visid_low=b.visid_low
group by a.evar23;


CHECKPOINT 2 : Run the query (using the CREATE TABLE option) with these 
additional options
===========================================================================
Required params:
------------------------
SET mapreduce.job.maps=500;
SET mapreduce.job.reduces=8;
SET mapreduce.tasktracker.map.tasks.maximum=12;
SET mapreduce.tasktracker.reduce.tasks.maximum=8;
SET 
mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapreduce.map.output.compress=true;


Optional params:
-----------------------
If u r using compression in output , use the following ; u can change the 
LzoCodec to whatever u r using for compression
SET hive.exec.compress.intermediate=true;
SET hive.exec.compress.output=true;
SET 
mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;
SET mapreduce.output.fileoutputformat.compress=true;


Thanks

Sanjay


From: Raj Hadoop <hadoop...@yahoo.com<mailto:hadoop...@yahoo.com>>
Reply-To: "user@hive.apache.org<mailto:user@hive.apache.org>" 
<user@hive.apache.org<mailto:user@hive.apache.org>>, Raj Hadoop 
<hadoop...@yahoo.com<mailto:hadoop...@yahoo.com>>
Date: Thursday, July 25, 2013 5:00 AM
To: Hive <user@hive.apache.org<mailto:user@hive.apache.org>>
Subject: Help in debugging Hive Query

All,

I am trying to determine visits for customer from omniture weblog file using 
Hive.

Table: omniture_web_data
Columns: visid_high,visid_low,evar23,visit_page_num

Sample Data:
visid_high,visid_low,evar23,visit_page_num
999,888,1003,10
999,888,1003,14
999,888,1003,6
999,777,1003,12
999,777,1003,20

I want to calculate for each Customer Number ( evar23 is  Customer Number ) , 
total visits. visid_high and visid_low determines a unique visit.
For each distinct visitor, calculate sum of maximum visit_page_num. In above 
example

14 + 20 = 34 should be the total visits for the customer 1003.

I am trying to run the following queries - Method 1 is almost the same as 
Method 2. Except in Method 1 I only choose a particualr customer number 1003. 
In method 2 , i generalized to all.

In Method 1 , I am getting the accurate result. In metnhod 2 , I am not getting 
the same result as Method 1.

Any suggestions on how to trouble shoot. ALso, any alternative approaches.

// Method 1
select a.evar23,sum(b.max_visit_page_num) from
(select distinct visid_high,visid_low,evar23 from web.omniture_web_data where 
evar23='1003') a
JOIN
(select visid_high,visid_low,max(visit_page_num) as max_visit_page_num from 
omniture_web_data where evar23='1003' group by visid_high,visid_low) b
where a.visid_high=b.visid_high and a.visid_low=b.visid_low
group by a.evar23;

///// Result of Method 1

1003    34

// Method 2

create table temp123 as
select a.evar23,sum(b.max_visit_page_num) from
(select distinct visid_high,visid_low,evar23 from web.omniture_web_data) a
JOIN
(select visid_high,visid_low,max(visit_page_num) as max_visit_page_num from 
omniture_web_data group by visid_high,visid_low) b
where a.visid_high=b.visid_high and a.visid_low=b.visid_low
group by a.evar23;

select * from temp123 where evar23='1003';

// The Result of Method 2 is not the same as Method 1. It is showing a 
different number.



Thanks,
Raj



CONFIDENTIALITY NOTICE
======================
This email message and any attachments are for the exclusive use of the 
intended recipient(s) and may contain confidential and privileged information. 
Any unauthorized review, use, disclosure or distribution is prohibited. If you 
are not the intended recipient, please contact the sender by reply email and 
destroy all copies of the original message along with any attachments, from 
your computer system. If you are the intended recipient, please be advised that 
the content of this message is subject to access, review and disclosure by the 
sender's Email System Administrator.

Reply via email to