Moral of the story - don't google around too much before writing code.

-----Original Message-----
From: Raghu Murthy [mailto:ra...@facebook.com] 
Sent: Friday, January 23, 2009 4:01 PM
To: hive-user@hadoop.apache.org
Subject: Re: equijoin with multiple columns?

We could add trim to hive load, but unfortunately it has been patented by
IBM!

"Method of automatically removing leading and trailing space characters
from data being entered into a database system"

http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=7,475,086.PN.&OS=PN/7,475,086&RS=PN/7,475,086

"Retrieving data from a database system without leading and trailing
space characters"

http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=1&u=/netahtml/PTO/srchnum.html&r=1&f=G&l=50&s1="20070282820".PGNR.&OS=DN/20070282820&RS=DN/20070282820


On 1/23/09 3:55 PM, "Aaron Kimball" <aa...@cloudera.com> wrote:

> I solved the bug. As it turns out, there was trailing whitespace in some of my
> input files. So Hive wasn't matching the column's contents correctly, since
> Hive doesn't do an implicit trim() (my handwritten code did this). And of
> course, when I looked at the input files with less to verify their contents,
> the whitespace was invisible :P
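> 
> (For anyone who hits the same thing: one workaround is to trim in the join
> condition itself. This is just a sketch, and it assumes the join keys come
> through as string columns and that the trim() UDF is available in your Hive
> build:
> 
> SELECT b.z FROM a JOIN b
>   ON (trim(a.x) = trim(b.x) AND trim(a.y) = trim(b.y));
> 
> Cleaning the whitespace out of the files before loading is the better fix.)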
> 
> - Aaron
> 
> On Fri, Jan 23, 2009 at 3:48 AM, Ashish Thusoo <athu...@facebook.com> wrote:
>> Aaron
>> 
>> Can you send the plan for this query:
>> 
>> explain extended <your query>
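>> 
>> e.g., filled in with the join you posted:
>> 
>> explain extended
>> SELECT ip_locations.cityid, ip_locations.countryid
>> FROM ak_filtered_ips JOIN ip_locations
>>   ON (ak_filtered_ips.a = ip_locations.a
>>   AND ak_filtered_ips.b = ip_locations.b
>>   AND ak_filtered_ips.c = ip_locations.c);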
>> 
>> This is very unexpected.
>> 
>> Thanks,
>> Ashish
>> ________________________________________
>> From: Aaron Kimball [aa...@cloudera.com]
>> Sent: Friday, January 23, 2009 1:24 AM
>> To: hive-user@hadoop.apache.org
>> Subject: Re: equijoin with multiple columns?
>> 
>> Not only can I verify that the rows are present, I broke down and coded the
>> join in Java MapReduce by hand. Works.
>> - Aaron
>> 
>> On Thu, Jan 22, 2009 at 11:42 PM, Zheng Shao
>> <zsh...@gmail.com> wrote:
>> This is unexpected. We did try similar cases before (with multiple join
>> keys).
>> 
>> Can you verify by finding out the actual rows in each table that will produce
>> a join result row?
>> 
>> Zheng
>> 
>> 
>> On Thu, Jan 22, 2009 at 6:09 PM, Aaron Kimball
>> <aa...@cloudera.com> wrote:
>> Hi Zheng,
>> 
>> I was eliding some information to try to boil things down a bit -- but here
>> goes.
>> 
>> 
>> To expand the white lie into the fuller debug dump:
>> table A ~ ak_filtered_ips
>> table B ~ ip_locations
>> 
>> It's actually three fields ('a', 'b', and 'c') that I'm joining on, not two.
>> 
>> hive> describe extended ak_filtered_ips;
>> OK
>> a    int
>> b    int
>> c    int
>> Detailed Table Information:
>> Table(tableName:ak_filtered_ips,dbName:default,owner:aaron,createTime:1232661581,
>> lastAccessTime:0,retention:0,
>> sd:StorageDescriptor(cols:[FieldSchema(name:a,type:int,comment:null),
>> FieldSchema(name:b,type:int,comment:null),
>> FieldSchema(name:c,type:int,comment:null)],
>> location:hdfs://(server name redacted):9000/user/hive/warehouse/ak_filtered_ips,
>> inputFormat:org.apache.hadoop.mapred.TextInputFormat,
>> outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat,
>> compressed:false,numBuckets:-1,
>> serdeInfo:SerDeInfo(name:null,serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,
>> parameters:{serialization.format=1}),
>> bucketCols:[],sortCols:[],parameters:{}),partitionKeys:[],parameters:{})
>> 
>> hive> describe extended ip_locations;
>> OK
>> a    int
>> b    int
>> c    int
>> cityid    int
>> countryid    int
>> Detailed Table Information:
>> Table(tableName:ip_locations,dbName:default,owner:aaron,createTime:1232596116,
>> lastAccessTime:0,retention:0,
>> sd:StorageDescriptor(cols:[FieldSchema(name:a,type:int,comment:null),
>> FieldSchema(name:b,type:int,comment:null),
>> FieldSchema(name:c,type:int,comment:null),
>> FieldSchema(name:cityid,type:int,comment:null),
>> FieldSchema(name:countryid,type:int,comment:null)],
>> location:hdfs://(server name redacted):9000/user/hive/warehouse/ip_locations,
>> inputFormat:org.apache.hadoop.mapred.TextInputFormat,
>> outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat,
>> compressed:false,numBuckets:-1,
>> serdeInfo:SerDeInfo(name:null,serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,
>> parameters:{serialization.format=1}),
>> bucketCols:[],sortCols:[],parameters:{}),partitionKeys:[],parameters:{})
>> 
>> I actually just loaded in a super-small test case involving A, B, x, y, and z
>> as I initially described. And that statement *does* work.
>> 
>> The actual statement I was running into problems with is:
>> 
>> SELECT ip_locations.cityid, ip_locations.countryid FROM
>> ak_filtered_ips JOIN ip_locations ON (ak_filtered_ips.a = ip_locations.a AND
>> ak_filtered_ips.b = ip_locations.b AND ak_filtered_ips.c = ip_locations.c)
>> 
>> I would eventually like to add a "count(1)" to the field list, and a "GROUP
>> BY 
>> ip_locations.cityid, ip_locations.countryid" at the end.
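>> That is, something along these lines once the join itself works:
>> 
>> SELECT ip_locations.cityid, ip_locations.countryid, count(1)
>> FROM ak_filtered_ips JOIN ip_locations
>>   ON (ak_filtered_ips.a = ip_locations.a
>>   AND ak_filtered_ips.b = ip_locations.b
>>   AND ak_filtered_ips.c = ip_locations.c)
>> GROUP BY ip_locations.cityid, ip_locations.countryid;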
>> 
>> As it stands, the plain (un-aggregated) SELECT above returns no rows.
>> 
>> Thanks,
>> - Aaron
>> 
>> 
>> On Thu, Jan 22, 2009 at 4:27 PM, Zheng Shao
>> <zsh...@gmail.com> wrote:
>> Can you do "describe extended a;" and "describe extended b;" and pste
>> the result here?
>> 
>> Zheng
>> 
>> 
>> On 1/22/09, Aaron Kimball <aa...@cloudera.com>
>> wrote:
>>>> Hi all,
>>>> 
>>>> I have a perplexing problem with joins in Hive. I have two tables "A" and
>>>> "B".  A contains two columns x and y. B also contains x and y, and also z.
>>>> I
>>>> want to select the values of z from B, for every (x, y) I see in table A.
>>>> 
>>>> So I wrote this statement:
>>>> 
>>>> SELECT b.z FROM a JOIN b ON (a.x = b.x AND a.y = b.y);
>>>> 
>>>> This crunches away and then comes back with an empty result set. I exported
>>>> the same data into MySQL and ran the same query; it works just fine there.
>>>> Does Hive support multi-column join criteria?
>>>> 
>>>> Thanks,
>>>> - Aaron Kimball
>>>> 
>> 
>> --
>> Sent from Gmail for mobile | mobile.google.com
>> 
>> Yours,
>> Zheng
>> 
>> 
>> 
>> 
>> --
>> Yours,
>> Zheng
>> 
> 
