Re: Map Join Problems

2013-05-27 Thread Ted Xu
Hi Peter,

There are other parameters that work with hive.auto.convert.join. Please
check "hive.hashtable.max.memory.usage"
and "hive.smalltable.filesize". They're quite self-explanatory, but if you
need more details you can check the auto mapjoin JIRA ticket:
https://issues.apache.org/jira/browse/HIVE-1642.
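For reference, these knobs can be set per session before running the query. This is only a sketch; the values shown are illustrative, not recommendations, and defaults vary by Hive version:

```sql
-- Let Hive convert eligible joins to map joins automatically
SET hive.auto.convert.join=true;
-- Only tables smaller than this many bytes are considered the "small" side
-- (value shown is illustrative; check your version's default)
SET hive.smalltable.filesize=25000000;
-- Abort the local hashtable task once it uses this fraction of the heap
SET hive.hashtable.max.memory.usage=0.9;
```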


On Tue, May 28, 2013 at 11:17 AM, Peter Chu  wrote:

> Using Hive 0.8.1 on an Amazon EMR Hadoop job.
>
> Some problems using mapjoin:
>
> 1) Exceeded memory; I got the errors below. I then removed mapjoin from
> the query and instead set hive.auto.convert.join=true, thinking that would
> let Hive decide when a mapjoin is suitable. The job ran much farther, but
> then hit a similar error towards the end.
>
> 2) Then I tried mapjoin in the same query as before and set
> hive.mapjoin.localtask.max.memory.usage=3; same exact error.
>
> My question is: are there any other settings I can use to increase the
> mapjoin memory or hashtable size? Or are there better options?
>
> 2013-05-25 11:37:39   Starting to launch local task to process map join;  
> maximum memory = 932118528
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/home/hadoop/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/home/hadoop/.versions/hive-0.8.1/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> 2013-05-25 11:37:57   Processing rows:20  Hashtable size: 19  
> Memory usage:   776687416   rate:   0.833
> 2013-05-25 11:38:00   Processing rows:215031  Hashtable size: 215031  
> Memory usage:   813018320   rate:   0.872
> 2013-05-25 11:38:00   Dump the hashtable into file: 
> file:/tmp/hadoop/hive_2013-05-25_23-37-37_320_2027014861824847272/-local-10006/HashTable-Stage-6/MapJoin-bu-21--.hashtable
> Execution failed with exit status: 2
> Obtaining error information
>
> Task failed!
> Task ID:
>   Stage-10
>
> Logs:
>
>
> 3) Please look at the four error examples below. My other question: in
> all the runs with that mapjoin error there is a pattern. The mapjoin is
> done on bu, in the job right before the error, and every error happens
> *just 4 rows shy of the bu table mapjoin row count*. I find this too much
> of a coincidence; can someone please offer some insight?
>
> #1:
>
> *215035* Rows loaded to 
> hdfs://10.190.182.26:9000/mnt/var/lib/hive_081/tmp/scratch/hive_2013-05-25_23-35-49_100_9059150281675034748/-ext-1
> MapReduce Jobs Launched:
> Job 0: Map: 18  Reduce: 13   Accumulative CPU: 139.31 sec   HDFS Read: 
> 1226414348 HDFS Write: 2179 SUCCESS
> Job 1: Map: 9   Accumulative CPU: 54.1 sec   HDFS Read: 687306237 HDFS Write: 
> 695722636 SUCCESS
> Job 2: Map: 16   Accumulative CPU: 89.09 sec   HDFS Read: 695838641 HDFS 
> Write: 703096594 SUCCESS
> Total MapReduce CPU Time Spent: 4 minutes 42 seconds 500 msec
> OK
> Time taken: 108.206 seconds
> OK
> Time taken: 0.013 seconds
> Total MapReduce jobs = 3
> Execution log at: 
> /tmp/hadoop/hadoop_20130525233737_8911fdca-6536-45bb-aac3-19b92b3de99c.log
> 2013-05-25 11:37:39   Starting to launch local task to process map join;  
> maximum memory = 932118528
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/home/hadoop/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/home/hadoop/.versions/hive-0.8.1/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> 2013-05-25 11:37:57   Processing rows:20  Hashtable size: 19  
> Memory usage:   776687416   rate:   0.833
> 2013-05-25 11:38:00   Processing rows: *215031*  Hashtable size: 215031  
> Memory usage:   813018320   rate:   0.872
> 2013-05-25 11:38:00   Dump the hashtable into file: 
> file:/tmp/hadoop/hive_2013-05-25_23-37-37_320_2027014861824847272/-local-10006/HashTable-Stage-6/MapJoin-bu-21--.hashtable
> Execution failed with exit status: 2
> Obtaining error information
>
> Task failed!
> Task ID:
>   Stage-10
>
> Logs:
>
> #2
>
> Table default.badurls stats: [num_partitions: 0, num_files: 18, num_rows: 0,
> total_size: 701922144, raw_data_size: 0]
> *214618* Rows loaded to
> hdfs://10.46.205.55:9000/mnt/var/lib/hive_081/tmp/scratch/hive_2013-05-25_23-12-54_513_8781300101638774300/-ext-1
> MapReduce Jobs Launched:
> Job 0: Map: 21  Reduce: 13   Accumulative CPU: 142.11 sec   HDFS Read: 
> 1225164183 HDFS Write: 2179 SUCCESS
> Job 1: Map: 9   Accumulative CPU: 53.25 sec   HDFS Read: 686157231 HDFS 
> Write: 694562725 SUCCESS
> Job 2: Map: 18   Accumulative CPU: 92.26 sec   HDFS Read: 694650326 HDFS 
> Write: 701922144 SUCCESS
> Total MapReduce CPU Time Spent: 4 minutes 47 seconds 620 msec
> OK
> Time taken: 104.744 seconds

Map Join Problems

2013-05-27 Thread Peter Chu
Using Hive 0.8.1 on an Amazon EMR Hadoop job.
Some problems using mapjoin:
1) Exceeded memory; I got the errors below. I then removed mapjoin from the
query and instead set hive.auto.convert.join=true, thinking that would let
Hive decide when a mapjoin is suitable. The job ran much farther, but then
hit a similar error towards the end.
2) Then I tried mapjoin in the same query as before and set
hive.mapjoin.localtask.max.memory.usage=3; same exact error.
My question is: are there any other settings I can use to increase the
mapjoin memory or hashtable size? Or are there better options?
2013-05-25 11:37:39 Starting to launch local task to process map join;  
maximum memory = 932118528
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/home/hadoop/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/home/hadoop/.versions/hive-0.8.1/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
2013-05-25 11:37:57 Processing rows:20  Hashtable size: 19  
Memory usage:   776687416   rate:   0.833
2013-05-25 11:38:00 Processing rows:215031  Hashtable size: 215031  
Memory usage:   813018320   rate:   0.872
2013-05-25 11:38:00 Dump the hashtable into file: 
file:/tmp/hadoop/hive_2013-05-25_23-37-37_320_2027014861824847272/-local-10006/HashTable-Stage-6/MapJoin-bu-21--.hashtable
Execution failed with exit status: 2
Obtaining error information

Task failed!
Task ID:
  Stage-10

Logs:
3) Please look at the four error examples below. My other question: in all
the runs with that mapjoin error there is a pattern. The mapjoin is done on
bu, in the job right before the error, and every error happens just 4 rows
shy of the bu table mapjoin row count. I find this too much of a
coincidence; can someone please offer some insight?

#1:

215035 Rows loaded to
hdfs://10.190.182.26:9000/mnt/var/lib/hive_081/tmp/scratch/hive_2013-05-25_23-35-49_100_9059150281675034748/-ext-1
MapReduce Jobs Launched: 
Job 0: Map: 18  Reduce: 13   Accumulative CPU: 139.31 sec   HDFS Read: 
1226414348 HDFS Write: 2179 SUCCESS
Job 1: Map: 9   Accumulative CPU: 54.1 sec   HDFS Read: 687306237 HDFS Write: 
695722636 SUCCESS
Job 2: Map: 16   Accumulative CPU: 89.09 sec   HDFS Read: 695838641 HDFS Write: 
703096594 SUCCESS
Total MapReduce CPU Time Spent: 4 minutes 42 seconds 500 msec
OK
Time taken: 108.206 seconds
OK
Time taken: 0.013 seconds
Total MapReduce jobs = 3
Execution log at: 
/tmp/hadoop/hadoop_20130525233737_8911fdca-6536-45bb-aac3-19b92b3de99c.log
2013-05-25 11:37:39 Starting to launch local task to process map join;  
maximum memory = 932118528
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/home/hadoop/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/home/hadoop/.versions/hive-0.8.1/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
2013-05-25 11:37:57 Processing rows:20  Hashtable size: 19  
Memory usage:   776687416   rate:   0.833
2013-05-25 11:38:00 Processing rows:215031  Hashtable size: 215031  
Memory usage:   813018320   rate:   0.872
2013-05-25 11:38:00 Dump the hashtable into file: 
file:/tmp/hadoop/hive_2013-05-25_23-37-37_320_2027014861824847272/-local-10006/HashTable-Stage-6/MapJoin-bu-21--.hashtable
Execution failed with exit status: 2
Obtaining error information

Task failed!
Task ID:
  Stage-10

Logs:

#2

Table default.badurls stats: [num_partitions: 0, num_files: 18,
num_rows: 0, total_size: 701922144, raw_data_size: 0]
214618 Rows loaded to 
hdfs://10.46.205.55:9000/mnt/var/lib/hive_081/tmp/scratch/hive_2013-05-25_23-12-54_513_8781300101638774300/-ext-1
MapReduce Jobs Launched: 
Job 0: Map: 21  Reduce: 13   Accumulative CPU: 142.11 sec   HDFS Read: 
1225164183 HDFS Write: 2179 SUCCESS
Job 1: Map: 9   Accumulative CPU: 53.25 sec   HDFS Read: 686157231 HDFS Write: 
694562725 SUCCESS
Job 2: Map: 18   Accumulative CPU: 92.26 sec   HDFS Read: 694650326 HDFS Write: 
701922144 SUCCESS
Total MapReduce CPU Time Spent: 4 minutes 47 seconds 620 msec
OK
Time taken: 104.744 seconds
OK
Time taken: 0.013 seconds
Total MapReduce jobs = 3
Execution log at: 
/tmp/hadoop/hadoop_20130525231414_3cc50fdd-7e7a-4bcf-baab-b465804c6e49.log
2013-05-25 11:14:41 Starting to launch local task to process map join;  
maximum memory = 932118528
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/home/hadoop/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/home/hadoop/.versions/hive-0.8.1/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder

Wrapping an analytical function

2013-05-27 Thread Frans Drijver
Hi there,


I am trying the following:

select case when ta_end_datetime_berekenen = 'Y'
            then lead(ta_update_datetime) over ( partition by dn_waarde_van,
                 dn_waarde_tot order by ta_update_datetime )
            else ea_end_datetime
       end ea_end_datetime,
       ta_insert_datetime,
       ta_update_datetime
from tmp_wtdh_bestedingsklasse_10_s2_stap2

However, when I try that, I get the following error:

NoViableAltException(86@[129:7: ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN 
identifier ( COMMA identifier )* RPAREN ) )?])

FAILED: ParseException line 1:175 missing KW_END at 'over' near ')' in 
selection target
line 1:254 cannot recognize input near 'else' 'ea_end_datetime' 'end' in 
selection target

Would I be correct in assuming that it's not possible to wrap an analytical 
function in another function?

This is with Hive 0.11.
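If it helps, a common workaround for this kind of parse error is to compute the windowed value in a subquery and apply the CASE in the outer query. This is only a sketch against the column and table names in the query above, untested here:

```sql
-- Compute lead() once in the inner query, then wrap it in CASE outside,
-- so the analytic function never appears inside another expression.
select case when ta_end_datetime_berekenen = 'Y'
            then next_ta_update_datetime
            else ea_end_datetime
       end as ea_end_datetime,
       ta_insert_datetime,
       ta_update_datetime
from (
  select t.*,
         lead(ta_update_datetime) over ( partition by dn_waarde_van,
              dn_waarde_tot order by ta_update_datetime )
           as next_ta_update_datetime
  from tmp_wtdh_bestedingsklasse_10_s2_stap2 t
) s
```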

Any feedback is appreciated!


Kind regards,


Frans Drijver

Re: Partitioning confusion

2013-05-27 Thread Nitin Pawar
If you have a 250GB file, how did it become 2.5TB?

If you cannot write a MapReduce job to write your data into specific
partitions, the other way around would be to load the entire data into a
temporary table, then load from it into the partitioned table, and then
drop the temporary table.


The approach would be:
1) Create a temporary table
CREATE TABLE temp_employees (name STRING, salary FLOAT, subordinates
ARRAY<STRING>, deductions MAP<STRING, FLOAT>, address STRUCT<street:STRING,
city:STRING, state:STRING, zip:INT, country:STRING>, country STRING,
state STRING);

2) Load data into this table
LOAD DATA LOCAL INPATH
'/home/satish/data/employees/input/employees-country.txt'
INTO TABLE temp_employees;

3) Create partitioned table
CREATE TABLE employees (name STRING, salary FLOAT, subordinates
ARRAY<STRING>, deductions MAP<STRING, FLOAT>, address STRUCT<street:STRING,
city:STRING, state:STRING, zip:INT, country:STRING> ) PARTITIONED BY
(country STRING, state STRING);

4) Load data from the temporary table into the partitioned table
   4a) if there are already existing partitions, be careful with a blanket
query or it will overwrite data in the existing partitions

insert overwrite table employees partition(country='USA',state='IL')
select name, salary, subordinates, deductions, address
from temp_employees where country='USA' and state='IL';
this you will need to do for each partition


   4b) if your partitioned table is currently empty, you can just enable
dynamic partitioning and Hive will take care of adding the correct data
to the individual partitions
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;

insert overwrite table employees partition (country, state)
select name, salary, subordinates, deductions, address, country, state
from temp_employees;

5) drop the temporary table

With this approach your data will temporarily take 2x the replicated size,
but once you drop the temporary table it will be back down to the normal
HDFS-replicated size.

It's been some time since I worked on Hive, so others may have better
approaches; wait for someone to correct me before going further
:)
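For reference, after a partitioned load the table's data lands in one HDFS directory per partition value pair, along the lines of the path mentioned later in this thread (exact warehouse path depends on your configuration):

```
/user/hive/warehouse/db1.db/employees/country=USA/state=IL/...
/user/hive/warehouse/db1.db/employees/country=<value>/state=<value>/...
```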


On Mon, May 27, 2013 at 2:13 PM, Sai Sai  wrote:

> Nitin
> I am still confused. From the data I have given below, should the file
> that sits in the folder country=USA/state=IL have only the rows where
> country=USA and state=IL, or will it have rows from other countries too?
> The reason I ask is that if we have a 250GB file and create 10 partitions,
> that would end up as 2.5 TB * 3 = 7.5TB. Is this expected?
> Thanks
> S
>
>   --
>  *From:* Nitin Pawar 
> *To:* user@hive.apache.org; Sai Sai 
> *Sent:* Monday, 27 May 2013 2:08 PM
> *Subject:* Re: Partitioning confusion
>
> when you specify the load data query with specific partition, it will put
> the entire data into that partition.
>
>
>
> On Mon, May 27, 2013 at 1:08 PM, Sai Sai  wrote:
>
>
> After creating a partition for a country (USA) and state (IL), when we
> go to the HDFS site to look at the partition in the browser, we are seeing
> all the records for all the countries and states rather than just those
> for the partition created for USA and IL given below. Is this correct
> behavior?
> 
> Here are my commands:
> 
>
> CREATE TABLE employees (name STRING, salary FLOAT, subordinates
> ARRAY<STRING>, deductions MAP<STRING, FLOAT>, address STRUCT<street:STRING,
> city:STRING, state:STRING, zip:INT, country:STRING> ) PARTITIONED BY
> (country STRING, state STRING);
>
> LOAD DATA LOCAL INPATH
> '/home/satish/data/employees/input/employees-country.txt' INTO TABLE
> employees PARTITION (country='USA',state='IL');
>
> 
> Here is my original data file, where I have data for a few countries such
> as USA, INDIA, UK, AUS:
> 
>
> John Doe10.0Mary SmithTodd JonesFederal Taxes.2State
> Taxes.05Insurance.11 Michigan Ave.ChicagoIL60600USA
> Mary Smith8.0Bill KingFederal Taxes.2State Taxes.05Insurance.1100
> Ontario St.ChicagoIL60601USA
> Todd Jones7.0Federal Taxes.15State Taxes.03Insurance.1200 Chicago
> Ave.Oak ParkIL60700USA
> Bill King6.0Federal Taxes.15State Taxes.03Insurance.1300 Obscure
> Dr.ObscuriaIL60100USA
> Boss Man20.0John DoeFred FinanceFederal Taxes.3State
> Taxes.07Insurance.051 Pretentious Drive.ChicagoIL60500USA
> Fred Finance15.0Stacy AccountantFederal Taxes.3State
> Taxes.07Insurance.052 Pretentious Drive.ChicagoIL60500USA
> Stacy Accountant6.0Federal Taxes.15State Taxes.03Insurance.1300 Main
> St.NapervilleIL60563USA
> John Doe 210.0Mary SmithTodd JonesFederal Taxes.2State
> Taxes.05Insurance.11 Michigan Ave.ChicagoIL60600INDIA
> Mary Smith 28.0Bill KingFederal Taxes.2State Taxes.05Insurance.1100
> Ontario St.ChicagoIL60601INDIA
> Todd Jones 27.0Federal Taxes.15State Taxes.03Insurance.1200 Chicago
> Ave.Oak ParkIL60700AUSTRALIA
> Bill King 26.0Federal Taxes.15State Taxes.03Insurance.1300 Obscure
> Dr.ObscuriaIL60100AUSTRALIA
> Boss Man2 20.0John DoeFred FinanceFederal Taxes.3State
> Taxes.07Insurance.051 Pretentious Drive.ChicagoIL60500UK
> Fred Finance 215.0Stacy AccountantFederal Taxes.3State
> Taxes.07Insurance.052 Pretentious Drive.ChicagoIL60500UK
> Stacy Accountant 26.0Federal Taxes.

Re: Partitioning confusion

2013-05-27 Thread Sai Sai
Nitin
I am still confused. From the data I have given below, should the file
that sits in the folder country=USA/state=IL have only the rows where
country=USA and state=IL, or will it have rows from other countries too?
The reason I ask is that if we have a 250GB file and create 10 partitions,
that would end up as 2.5 TB * 3 = 7.5TB. Is this expected?
Thanks
S



 From: Nitin Pawar 
To: user@hive.apache.org; Sai Sai  
Sent: Monday, 27 May 2013 2:08 PM
Subject: Re: Partitioning confusion
 


when you specify the load data query with specific partition, it will put the 
entire data into that partition. 




On Mon, May 27, 2013 at 1:08 PM, Sai Sai  wrote:


>
>After creating a partition for a country (USA) and state (IL), when we go
>to the HDFS site to look at the partition in the browser, we are seeing all
>the records for all the countries and states rather than just those for the
>partition created for USA and IL given below. Is this correct behavior?
>
>Here are my commands:
>
>
>
>
>CREATE TABLE employees (name STRING, salary FLOAT, subordinates ARRAY<STRING>,
>deductions MAP<STRING, FLOAT>, address STRUCT<street:STRING, city:STRING,
>state:STRING, zip:INT, country:STRING> ) PARTITIONED BY (country STRING, state
>STRING);
>
>
>LOAD DATA LOCAL INPATH 
>'/home/satish/data/employees/input/employees-country.txt' INTO TABLE employees 
>PARTITION (country='USA',state='IL');
>
>
>
>
>Here is my original data file, where I have data for a few countries such as
>USA, INDIA, UK, AUS:
>
>
>
>
>John Doe10.0Mary SmithTodd JonesFederal Taxes.2State Taxes.05Insurance.11 
>Michigan Ave.ChicagoIL60600USA
>Mary Smith8.0Bill KingFederal Taxes.2State Taxes.05Insurance.1100 Ontario 
>St.ChicagoIL60601USA
>Todd Jones7.0Federal Taxes.15State Taxes.03Insurance.1200 Chicago Ave.Oak 
>ParkIL60700USA
>Bill King6.0Federal Taxes.15State Taxes.03Insurance.1300 Obscure 
>Dr.ObscuriaIL60100USA
>Boss Man20.0John DoeFred FinanceFederal Taxes.3State Taxes.07Insurance.051 
>Pretentious Drive.ChicagoIL60500USA
>Fred Finance15.0Stacy AccountantFederal Taxes.3State Taxes.07Insurance.052 
>Pretentious Drive.ChicagoIL60500USA
>Stacy Accountant6.0Federal Taxes.15State Taxes.03Insurance.1300 Main 
>St.NapervilleIL60563USA
>John Doe 210.0Mary SmithTodd JonesFederal Taxes.2State 
>Taxes.05Insurance.11 Michigan Ave.ChicagoIL60600INDIA
>Mary Smith 28.0Bill KingFederal Taxes.2State Taxes.05Insurance.1100 
>Ontario St.ChicagoIL60601INDIA
>Todd Jones 27.0Federal Taxes.15State Taxes.03Insurance.1200 Chicago 
>Ave.Oak ParkIL60700AUSTRALIA
>Bill King 26.0Federal Taxes.15State Taxes.03Insurance.1300 Obscure 
>Dr.ObscuriaIL60100AUSTRALIA
>Boss Man2 20.0John DoeFred FinanceFederal Taxes.3State 
>Taxes.07Insurance.051 Pretentious Drive.ChicagoIL60500UK
>Fred Finance 215.0Stacy AccountantFederal Taxes.3State 
>Taxes.07Insurance.052 Pretentious Drive.ChicagoIL60500UK
>Stacy Accountant 26.0Federal Taxes.15State Taxes.03Insurance.1300 Main 
>St.NapervilleIL60563UK
>
>
>Now when I navigate to:
>Contents of directory 
>/user/hive/warehouse/db1.db/employees/country=USA/state=IL
>
>
>
>I see all the records and was wondering if it should have only USA & IL 
>records.
>Please help.


-- 
Nitin Pawar

Re: Partitioning confusion

2013-05-27 Thread Nitin Pawar
when you specify the load data query with specific partition, it will put
the entire data into that partition.



On Mon, May 27, 2013 at 1:08 PM, Sai Sai  wrote:

>
> After creating a partition for a country (USA) and state (IL), when we
> go to the HDFS site to look at the partition in the browser, we are seeing
> all the records for all the countries and states rather than just those
> for the partition created for USA and IL given below. Is this correct
> behavior?
> 
> Here are my commands:
> 
>
> CREATE TABLE employees (name STRING, salary FLOAT, subordinates
> ARRAY<STRING>, deductions MAP<STRING, FLOAT>, address STRUCT<street:STRING,
> city:STRING, state:STRING, zip:INT, country:STRING> ) PARTITIONED BY
> (country STRING, state STRING);
>
> LOAD DATA LOCAL INPATH
> '/home/satish/data/employees/input/employees-country.txt' INTO TABLE
> employees PARTITION (country='USA',state='IL');
>
> 
> Here is my original data file, where I have data for a few countries such
> as USA, INDIA, UK, AUS:
> 
>
> John Doe10.0Mary SmithTodd JonesFederal Taxes.2State
> Taxes.05Insurance.11 Michigan Ave.ChicagoIL60600USA
> Mary Smith8.0Bill KingFederal Taxes.2State Taxes.05Insurance.1100
> Ontario St.ChicagoIL60601USA
> Todd Jones7.0Federal Taxes.15State Taxes.03Insurance.1200 Chicago
> Ave.Oak ParkIL60700USA
> Bill King6.0Federal Taxes.15State Taxes.03Insurance.1300 Obscure
> Dr.ObscuriaIL60100USA
> Boss Man20.0John DoeFred FinanceFederal Taxes.3State
> Taxes.07Insurance.051 Pretentious Drive.ChicagoIL60500USA
> Fred Finance15.0Stacy AccountantFederal Taxes.3State
> Taxes.07Insurance.052 Pretentious Drive.ChicagoIL60500USA
> Stacy Accountant6.0Federal Taxes.15State Taxes.03Insurance.1300 Main
> St.NapervilleIL60563USA
> John Doe 210.0Mary SmithTodd JonesFederal Taxes.2State
> Taxes.05Insurance.11 Michigan Ave.ChicagoIL60600INDIA
> Mary Smith 28.0Bill KingFederal Taxes.2State Taxes.05Insurance.1100
> Ontario St.ChicagoIL60601INDIA
> Todd Jones 27.0Federal Taxes.15State Taxes.03Insurance.1200 Chicago
> Ave.Oak ParkIL60700AUSTRALIA
> Bill King 26.0Federal Taxes.15State Taxes.03Insurance.1300 Obscure
> Dr.ObscuriaIL60100AUSTRALIA
> Boss Man2 20.0John DoeFred FinanceFederal Taxes.3State
> Taxes.07Insurance.051 Pretentious Drive.ChicagoIL60500UK
> Fred Finance 215.0Stacy AccountantFederal Taxes.3State
> Taxes.07Insurance.052 Pretentious Drive.ChicagoIL60500UK
> Stacy Accountant 26.0Federal Taxes.15State Taxes.03Insurance.1300 Main
> St.NapervilleIL60563UK
> 
> Now when I navigate to:
> Contents of directory
> /user/hive/warehouse/db1.db/employees/country=USA/state=IL
> 
> I see all the records and was wondering if it should have only USA & IL
> records.
> Please help.
>



-- 
Nitin Pawar
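A quick way to see what actually landed in a partition is to compare the partition columns (which are derived from the directory name, not from the file contents) against the country stored inside the address struct in the data itself. This is only a sketch against the table definition quoted above:

```sql
-- The partition columns country/state read 'USA'/'IL' for every row in
-- this partition because they come from the directory country=USA/state=IL,
-- while address.country still shows whatever the raw file contained.
SELECT address.country AS raw_country, country, state
FROM employees
WHERE country = 'USA' AND state = 'IL'
LIMIT 10;
```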


Re:Partitioning confusion

2013-05-27 Thread Sai Sai


After creating a partition for a country (USA) and state (IL), when we go to
the HDFS site to look at the partition in the browser, we are seeing all the
records for all the countries and states rather than just those for the
partition created for USA and IL given below. Is this correct behavior?

Here are my commands:



CREATE TABLE employees (name STRING, salary FLOAT, subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>, address STRUCT<street:STRING, city:STRING,
state:STRING, zip:INT, country:STRING> ) PARTITIONED BY (country STRING, state
STRING);

LOAD DATA LOCAL INPATH 
'/home/satish/data/employees/input/employees-country.txt' INTO TABLE employees 
PARTITION (country='USA',state='IL');



Here is my original data file, where I have data for a few countries such as
USA, INDIA, UK, AUS:



John Doe10.0Mary SmithTodd JonesFederal Taxes.2State Taxes.05Insurance.11 
Michigan Ave.ChicagoIL60600USA
Mary Smith8.0Bill KingFederal Taxes.2State Taxes.05Insurance.1100 Ontario 
St.ChicagoIL60601USA
Todd Jones7.0Federal Taxes.15State Taxes.03Insurance.1200 Chicago Ave.Oak 
ParkIL60700USA
Bill King6.0Federal Taxes.15State Taxes.03Insurance.1300 Obscure 
Dr.ObscuriaIL60100USA
Boss Man20.0John DoeFred FinanceFederal Taxes.3State Taxes.07Insurance.051 
Pretentious Drive.ChicagoIL60500USA
Fred Finance15.0Stacy AccountantFederal Taxes.3State Taxes.07Insurance.052 
Pretentious Drive.ChicagoIL60500USA
Stacy Accountant6.0Federal Taxes.15State Taxes.03Insurance.1300 Main 
St.NapervilleIL60563USA
John Doe 210.0Mary SmithTodd JonesFederal Taxes.2State Taxes.05Insurance.11 
Michigan Ave.ChicagoIL60600INDIA
Mary Smith 28.0Bill KingFederal Taxes.2State Taxes.05Insurance.1100 Ontario 
St.ChicagoIL60601INDIA
Todd Jones 27.0Federal Taxes.15State Taxes.03Insurance.1200 Chicago Ave.Oak 
ParkIL60700AUSTRALIA
Bill King 26.0Federal Taxes.15State Taxes.03Insurance.1300 Obscure 
Dr.ObscuriaIL60100AUSTRALIA
Boss Man2 20.0John DoeFred FinanceFederal Taxes.3State 
Taxes.07Insurance.051 Pretentious Drive.ChicagoIL60500UK
Fred Finance 215.0Stacy AccountantFederal Taxes.3State 
Taxes.07Insurance.052 Pretentious Drive.ChicagoIL60500UK
Stacy Accountant 26.0Federal Taxes.15State Taxes.03Insurance.1300 Main 
St.NapervilleIL60563UK


Now when I navigate to:
Contents of directory /user/hive/warehouse/db1.db/employees/country=USA/state=IL



I see all the records and was wondering if it should have only USA & IL records.
Please help.

Re: Issue with Json tuple lateral view

2013-05-27 Thread Sai Sai
Thanks Navis



 From: Navis류승우 
To: user@hive.apache.org; Sai Sai  
Sent: Monday, 27 May 2013 12:15 PM
Subject: Re: Issue with Json tuple lateral view
 

Removing the last ',' in the second row will make the result come out as
you expected.

I can't tell whether it's a bug or not.
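In other words, the trailing comma before the closing brace makes the second line invalid JSON, so json_tuple cannot parse it and emits NULL for every requested key on that row. A corrected second line, sketched from the data quoted below, would be:

```json
{ "blogID" : "VSAUMDFGSD", "date" : "2012-04-01", "name" : "yhftrcx",
  "comment" : "another comment" }
```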

2013/5/27 Sai Sai :
>
> *
> Here is the json-data that i load:
> *
>
> { "blogID" : "FJY26J1333", "date" : "2012-04-01", "name" : "vpxnksu",
> "comment" : "good stuff", "contact" : { "email" : "vpxn...@gmail.com",
> "website" : "vpxnksu.wordpress.com" } }
> { "blogID" : "VSAUMDFGSD", "date" : "2012-04-01", "name" : "yhftrcx",
> "comment" : "another comment",}
>
> *
> Here is the hive commands :
> *
>
> CREATE  EXTERNAL  TABLE json_comments(value STRING) LOCATION
> '/user/json-comments';
>
> LOAD DATA LOCAL INPATH '/home/satish/data/inputSai/json-comments.txt'
> OVERWRITE INTO TABLE json_comments;
>
> SELECT b.blogID, c.email FROM json_comments a LATERAL VIEW
> json_tuple(a.value, 'blogID', 'contact') b AS blogID, contact LATERAL VIEW
> json_tuple(b.contact, 'email', 'website') c AS email, website;
>
> *
> Here r the results of  map reduce:
> *
>
> blogid email
> FJY26J1333 vpxn...@gmail.com
> NULL NULL
>
> *
> My question is: why does the 2nd row come up as NULL values? I was
> expecting the results to be like this:
>
> blogid email
> FJY26J1333 vpxn...@gmail.com
> VSAUMDFGSD NULL
>
> Any input is appreciated in understanding this.
> Thanks
> S