Re: Help in debugging Hive Query
Hi Sanjay, Thanks for taking the time to write all the details. I did a silly mistake. The data type for visit_page_num, i created it as string. The string was causing issues when I am using the max function. A type cast to int in the query worked for me. Regards, Raj From: Sanjay Subramanian To: "user@hive.apache.org" Sent: Thursday, July 25, 2013 1:41 PM Subject: Re: Help in debugging Hive Query The query is correct but since u r creating a managed table , that is possibly creating some issue and the records are not all getting created This is what I would propose CHECKPOINT 1 : Is this query running at all ? === Use this option in BOLD and run the QUERY ONLY (without any table creation) to log errors and pipe to a log file by using nohup or some other way that u prefer hive -hiveconf hive.root.logger=INFO,console -e select a.evar23,sum(b.max_visit_page_num) from (select distinct visid_high,visid_low,evar23 from web.omniture_web_data) a JOIN (select visid_high,visid_low,max(visit_page_num) as max_visit_page_num from omniture_web_data group by visid_high,visid_low) b where a.visid_high=b.visid_high and a.visid_low=b.visid_low group by a.evar23; CHECKPOINT 2 : Run the query (using the CREATE TABLE option) with these additional options === Required params: SET mapreduce.job.maps=500; SET mapreduce.job.reduces=8; SET mapreduce.tasktracker.map.tasks.maximum=12; SET mapreduce.tasktracker.reduce.tasks.maximum=8; SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec; SET mapreduce.map.output.compress=true; Optional params: --- If u r using compression in output , use the following ; u can change the LzoCodec to whatever u r using for compression SET hive.exec.compress.intermediate=true; SET hive.exec.compress.output=true; SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec; SET mapreduce.output.fileoutputformat.compress=true; Thanks Sanjay From: Raj Hadoop Reply-To: "user@hive.apache.org" , Raj Hadoop Date: Thursday, July 25, 2013 5:00 AM To: Hive Subject: Help in debugging Hive Query All, I am trying to determine visits for customer from omniture weblog file using Hive. Table: omniture_web_data Columns: visid_high,visid_low,evar23,visit_page_num Sample Data: visid_high,visid_low,evar23,visit_page_num 999,888,1003,10 999,888,1003,14 999,888,1003,6 999,777,1003,12 999,777,1003,20 I want to calculate for each Customer Number ( evar23 is Customer Number ) , total visits. visid_high and visid_low determines a unique visit. For each distinct visitor, calculate sum of maximum visit_page_num. In above example 14 + 20 = 34 should be the total visits for the customer 1003. I am trying to run the following queries - Method 1 is almost the same as Method 2. Except in Method 1 I only choose a particualr customer number 1003. In method 2 , i generalized to all. In Method 1 , I am getting the accurate result. In metnhod 2 , I am not getting the same result as Method 1. Any suggestions on how to trouble shoot. ALso, any alternative approaches. // Method 1 select a.evar23,sum(b.max_visit_page_num) from (select distinct visid_high,visid_low,evar23 from web.omniture_web_data where evar23='1003') a JOIN (select visid_high,visid_low,max(visit_page_num) as max_visit_page_num from omniture_web_data where evar23='1003' group by visid_high,visid_low) b where a.visid_high=b.visid_high and a.visid_low=b.visid_low group by a.evar23; / Result of Method 1 100334 // Method 2 create table temp123 as select a.evar23,sum(b.max_visit_page_num) from (select distinct visid_high,visid_low,evar23 from web.omniture_web_data) a JOIN (select visid_high,visid_low,max(visit_page_num) as max_visit_page_num from omniture_web_data group by visid_high,visid_low) b where a.visid_high=b.visid_high and a.visid_low=b.visid_low group by a.evar23; select * from temp123 where evar23='1003'; // The Result of Method 2 is not the same as Method 1. It is showing a different number. Thanks, Raj CONFIDENTIALITY NOTICE == This email message and any attachments are for the exclusive use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message along with any attachments, from your computer system. If you are the intended recipient, please be advised that the content of this message is subject to access, review and disclosure by the sender's Email System Administrator.
Re: Help in debugging Hive Query
The query is correct but since u r creating a managed table , that is possibly creating some issue and the records are not all getting created This is what I would propose CHECKPOINT 1 : Is this query running at all ? === Use this option in BOLD and run the QUERY ONLY (without any table creation) to log errors and pipe to a log file by using nohup or some other way that u prefer hive -hiveconf hive.root.logger=INFO,console -e select a.evar23,sum(b.max_visit_page_num) from (select distinct visid_high,visid_low,evar23 from web.omniture_web_data) a JOIN (select visid_high,visid_low,max(visit_page_num) as max_visit_page_num from omniture_web_data group by visid_high,visid_low) b where a.visid_high=b.visid_high and a.visid_low=b.visid_low group by a.evar23; CHECKPOINT 2 : Run the query (using the CREATE TABLE option) with these additional options === Required params: SET mapreduce.job.maps=500; SET mapreduce.job.reduces=8; SET mapreduce.tasktracker.map.tasks.maximum=12; SET mapreduce.tasktracker.reduce.tasks.maximum=8; SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec; SET mapreduce.map.output.compress=true; Optional params: --- If u r using compression in output , use the following ; u can change the LzoCodec to whatever u r using for compression SET hive.exec.compress.intermediate=true; SET hive.exec.compress.output=true; SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec; SET mapreduce.output.fileoutputformat.compress=true; Thanks Sanjay From: Raj Hadoop mailto:hadoop...@yahoo.com>> Reply-To: "user@hive.apache.org<mailto:user@hive.apache.org>" mailto:user@hive.apache.org>>, Raj Hadoop mailto:hadoop...@yahoo.com>> Date: Thursday, July 25, 2013 5:00 AM To: Hive mailto:user@hive.apache.org>> Subject: Help in debugging Hive Query All, I am trying to determine visits for customer from omniture weblog file using Hive. Table: omniture_web_data Columns: visid_high,visid_low,evar23,visit_page_num Sample Data: visid_high,visid_low,evar23,visit_page_num 999,888,1003,10 999,888,1003,14 999,888,1003,6 999,777,1003,12 999,777,1003,20 I want to calculate for each Customer Number ( evar23 is Customer Number ) , total visits. visid_high and visid_low determines a unique visit. For each distinct visitor, calculate sum of maximum visit_page_num. In above example 14 + 20 = 34 should be the total visits for the customer 1003. I am trying to run the following queries - Method 1 is almost the same as Method 2. Except in Method 1 I only choose a particualr customer number 1003. In method 2 , i generalized to all. In Method 1 , I am getting the accurate result. In metnhod 2 , I am not getting the same result as Method 1. Any suggestions on how to trouble shoot. ALso, any alternative approaches. // Method 1 select a.evar23,sum(b.max_visit_page_num) from (select distinct visid_high,visid_low,evar23 from web.omniture_web_data where evar23='1003') a JOIN (select visid_high,visid_low,max(visit_page_num) as max_visit_page_num from omniture_web_data where evar23='1003' group by visid_high,visid_low) b where a.visid_high=b.visid_high and a.visid_low=b.visid_low group by a.evar23; / Result of Method 1 100334 // Method 2 create table temp123 as select a.evar23,sum(b.max_visit_page_num) from (select distinct visid_high,visid_low,evar23 from web.omniture_web_data) a JOIN (select visid_high,visid_low,max(visit_page_num) as max_visit_page_num from omniture_web_data group by visid_high,visid_low) b where a.visid_high=b.visid_high and a.visid_low=b.visid_low group by a.evar23; select * from temp123 where evar23='1003'; // The Result of Method 2 is not the same as Method 1. It is showing a different number. Thanks, Raj CONFIDENTIALITY NOTICE == This email message and any attachments are for the exclusive use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message along with any attachments, from your computer system. If you are the intended recipient, please be advised that the content of this message is subject to access, review and disclosure by the sender's Email System Administrator.
Help in debugging Hive Query
All, I am trying to determine visits for customer from omniture weblog file using Hive. Table: omniture_web_data Columns: visid_high,visid_low,evar23,visit_page_num Sample Data: visid_high,visid_low,evar23,visit_page_num 999,888,1003,10 999,888,1003,14 999,888,1003,6 999,777,1003,12 999,777,1003,20 I want to calculate for each Customer Number ( evar23 is Customer Number ) , total visits. visid_high and visid_low determines a unique visit. For each distinct visitor, calculate sum of maximum visit_page_num. In above example 14 + 20 = 34 should be the total visits for the customer 1003. I am trying to run the following queries - Method 1 is almost the same as Method 2. Except in Method 1 I only choose a particualr customer number 1003. In method 2 , i generalized to all. In Method 1 , I am getting the accurate result. In metnhod 2 , I am not getting the same result as Method 1. Any suggestions on how to trouble shoot. ALso, any alternative approaches. // Method 1 select a.evar23,sum(b.max_visit_page_num) from (select distinct visid_high,visid_low,evar23 from web.omniture_web_data where evar23='1003') a JOIN (select visid_high,visid_low,max(visit_page_num) as max_visit_page_num from omniture_web_data where evar23='1003' group by visid_high,visid_low) b where a.visid_high=b.visid_high and a.visid_low=b.visid_low group by a.evar23; / Result of Method 1 100334 // Method 2 create table temp123 as select a.evar23,sum(b.max_visit_page_num) from (select distinct visid_high,visid_low,evar23 from web.omniture_web_data) a JOIN (select visid_high,visid_low,max(visit_page_num) as max_visit_page_num from omniture_web_data group by visid_high,visid_low) b where a.visid_high=b.visid_high and a.visid_low=b.visid_low group by a.evar23; select * from temp123 where evar23='1003'; // The Result of Method 2 is not the same as Method 1. It is showing a different number. Thanks, Raj