Re: Files does not exist error: concurrency control on hive queries...

2009-09-11 Thread Eva Tse
Doing versioning would work for this scenario. It essentially achieves the same thing. On 9/11/09 2:39 AM, "Ashish Thusoo" wrote: > Another option is to deal with this using versioning. Some ideas on this are > at > > https://issues.apache.org/jira/browse/HIVE-718 > > Ashish > ___

Re: Directing Hive to perform Hash Join for small inner tables

2009-09-11 Thread Sudipto Das
I am seeing very weird behavior. When the inner table is small (~17k rows, each with two columns (INT, DOUBLE)), the join executes in about 2 mins (the outer table is 500 million rows). But when the inner table is a bit bigger (~480k rows (INT, DOUBLE)), the join takes about 20 mins for the same oute

Re: Directing Hive to perform Hash Join for small inner tables

2009-09-11 Thread Sudipto Das
Thanks. I switched to branch 0.4, and the hash join is working, even though it is running much slower than I expected. I will try to figure out the reason. PhD Candidate CS @ UCSB Santa Barbara, CA 93106, USA http://www.cs.ucsb.edu/~sudipto On Fri, Sep 11, 2009 at 3:13 PM, Namit Jain wrote: >

Re: Directing Hive to perform Hash Join for small inner tables

2009-09-11 Thread Sudipto Das
I checked join26.q. That wasn't what I meant. My query looks something like this: insert overwrite table join_result select /*+ MAPJOIN(c,m)*/ c.cid, m.mid, c.cparam, m.mparam, r.rate, (r.rate - c.cparam*m.mparam) from mparam m JOIN data r ON (m.mid = r.mid) JOIN cparam c ON (c.cid = r.cid); mpar

Re: Strange behavior during Hive queries

2009-09-11 Thread Edward Capriolo
On Fri, Sep 11, 2009 at 5:28 PM, Todd Lipcon wrote: > Hrm... sorry, I didn't read your original query closely enough. > > I'm not sure what could be causing this. The map.tasks.maximum parameter > shouldn't affect it at all - it only affects the number of slots on the > trackers. > > By any chance

Re: Strange behavior during Hive queries

2009-09-11 Thread Todd Lipcon
Hrm... sorry, I didn't read your original query closely enough. I'm not sure what could be causing this. The map.tasks.maximum parameter shouldn't affect it at all - it only affects the number of slots on the trackers. By any chance do you have mapred.max.maps.per.node set? This is a configuratio

RE: Directing Hive to perform Hash Join for small inner tables

2009-09-11 Thread Namit Jain
That is not true - look at the unit test join26.q From: Sudipto Das [mailto:sudipt...@gmail.com] Sent: Friday, September 11, 2009 2:07 PM To: hive-user@hadoop.apache.org Subject: Re: Directing Hive to perform Hash Join for small inner tables But this creates a join where each join is performed i

Re: Strange behavior during Hive queries

2009-09-11 Thread Brad Heintz
Todd - Of course; it makes sense that it would be that way. But I'm still left wondering why, then, my Hive queries are only using 2 mappers per task tracker when other jobs use 7. I've gone so far as to diff the job.xml files from a regular job and a Hive query, and didn't turn up anything - th

Re: Strange behavior during Hive queries

2009-09-11 Thread Todd Lipcon
Hi Brad, mapred.tasktracker.map.tasks.maximum is a parameter read by the TaskTracker when it starts up. It cannot be changed per-job. Hope that helps -Todd On Fri, Sep 11, 2009 at 2:06 PM, Brad Heintz wrote: > TIA if anyone can point me in the right direction on this. > > I'm running a simple

Re: Directing Hive to perform Hash Join for small inner tables

2009-09-11 Thread Sudipto Das
But this creates a join where each join is performed in a single map-only MR join, which is as good as specifying the query as MAPJOIN(x) followed by another query as MAPJOIN(y) with the result of the previous join. Is there a way to make it pick just one MR job, where both the small inner tables

Strange behavior during Hive queries

2009-09-11 Thread Brad Heintz
TIA if anyone can point me in the right direction on this. I'm running a simple Hive query (a count on an external table comprising 436 files, each of ~2GB). The cluster's mapred-site.xml specifies mapred.tasktracker.map.tasks.maximum = 7 - that is, 7 mappers per worker node. When I run regular

Re: which thrift reversion do you use ?

2009-09-11 Thread Raghu Murthy
I think we used r790732 the last time we made changes to the hive service interface. raghu On 9/11/09 10:18 AM, "Bill Graham" wrote: > +1 > > I've been struggling with thrift versions as well, see: > https://issues.apache.org/jira/browse/HIVE-795?focusedCommentId=12754020&page= > com.atlassia

RE: Directing Hive to perform Hash Join for small inner tables

2009-09-11 Thread Namit Jain
Yes, you can specify the list of tables in the hint MAPJOIN(x,y,z) From: Sudipto Das [mailto:sudipt...@gmail.com] Sent: Friday, September 11, 2009 1:17 PM To: hive-user@hadoop.apache.org Subject: Re: Directing Hive to perform Hash Join for small inner tables Is there any performance differe
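The idea behind listing several small tables in one MAPJOIN hint can be illustrated with a conceptual sketch. This is not Hive's implementation; it only shows why a single pass over the large table is enough when every small table fits in memory. The function name and the row layouts are hypothetical, loosely modeled on the mparam/cparam/data query quoted elsewhere in this thread, and the sketch assumes the join keys are unique within each small table.

```python
def map_side_join(big_rows, small_m, small_c):
    """Join a streamed big table against two in-memory tables in one pass.

    big_rows: iterable of (cid, mid, rate)   -- hypothetical layout
    small_m:  iterable of (mid, mparam)      -- assumed unique on mid
    small_c:  iterable of (cid, cparam)      -- assumed unique on cid
    """
    # Build one hash table per small table, keyed on its join column.
    m_by_id = {mid: mparam for mid, mparam in small_m}
    c_by_id = {cid: cparam for cid, cparam in small_c}

    for cid, mid, rate in big_rows:
        # Inner-join semantics: drop rows without a match in both small tables.
        if mid in m_by_id and cid in c_by_id:
            cparam, mparam = c_by_id[cid], m_by_id[mid]
            yield (cid, mid, cparam, mparam, rate, rate - cparam * mparam)


rows = list(map_side_join([(1, 10, 7.0), (2, 10, 1.0)], [(10, 2.0)], [(1, 3.0)]))
print(rows)
```

Because both lookups happen while the big table is streamed once, no intermediate join result has to be written out and re-read, which is the difference from running two separate single-table map joins back to back.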

Re: General design/schema question

2009-09-11 Thread Prasad Chakka
A partitioned table has a set of partition keys. Check the wiki on how to create a partitioned table. In your case you can use one partition key named 'ds' (datestamp). You can choose any format for the values, but a commonly chosen one is 'YYYY-MM-DD'. You can specify the partition while loading data by ' PART

Re: Directing Hive to perform Hash Join for small inner tables

2009-09-11 Thread Sudipto Das
Is there any performance difference between 0.4 and trunk? I can then temporarily switch to 0.4 while the problem is being fixed. Also, can I specify two tables in the MAPJOIN hint when doing a 3-way join, where two tables can fit in memory? I tried some intuitive alternatives but they did not work. Th

Re: General design/schema question

2009-09-11 Thread Mayuran Yogarajah
Prasad Chakka wrote: You should create a daily partition table. So you just need to create a new partition which is automatic if you use 'LOAD DATA... INTO TABLE ... PARTITION (ds='2009-09-01')' Prasad Just wanted to clarify, I still need to do LOAD DATA .. INTO TABLE .. PARTITION (day='hdfs

Re: General design/schema question

2009-09-11 Thread Edward Capriolo
On Fri, Sep 11, 2009 at 3:26 PM, Prasad Chakka wrote: > You should create a daily partition table. So you just need to create a new > partition which is automatic if you use 'LOAD DATA... INTO TABLE ... > PARTITION (ds='2009-09-01')' > > Prasad > > > > From: Mayura

Re: General design/schema question

2009-09-11 Thread Prasad Chakka
You should create a daily partition table. So you just need to create a new partition which is automatic if you use 'LOAD DATA... INTO TABLE ... PARTITION (ds='2009-09-01')' Prasad From: Mayuran Yogarajah Reply-To: Date: Fri, 11 Sep 2009 12:20:25 -0700 To: S
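For the per-day directory layout discussed in this thread, the daily loads could be scripted along these lines. This is only a sketch: the table name daily_logs and the base directory /data are hypothetical, and it assumes the table was created with a single string partition key named ds. It merely builds the LOAD DATA statements as text; running them through the Hive CLI is left out.

```python
from datetime import date, timedelta


def load_statements(table, base_dir, first_day, num_days):
    """Build one LOAD DATA statement per daily directory, each into its own ds partition."""
    statements = []
    for offset in range(num_days):
        # isoformat() yields the same YYYY-MM-DD form as the directory names.
        ds = (first_day + timedelta(days=offset)).isoformat()
        statements.append(
            "LOAD DATA INPATH '%s/%s' INTO TABLE %s PARTITION (ds='%s');"
            % (base_dir, ds, table, ds)
        )
    return statements


# Hypothetical table name and HDFS base directory, for illustration only.
for stmt in load_statements("daily_logs", "/data", date(2009, 9, 1), 3):
    print(stmt)
```

Each statement targets a different ds value, so one table accumulates a new partition per day instead of requiring a new table per day.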

General design/schema question

2009-09-11 Thread Mayuran Yogarajah
We have our files in HDFS laid out by day like this: 2009-09-01/files 2009-09-02/files 2009-09-03/files Loading this data into Hive would mean creating a new table per day! I'm thinking this might be a common issue though, since others most likely do batch processing on a daily/nightly basis.

RE: Directing Hive to perform Hash Join for small inner tables

2009-09-11 Thread Namit Jain
This is some problem in trunk - it runs fine in 0.4 - I will take a look From: Sudipto Das [mailto:sudipt...@gmail.com] Sent: Thursday, September 10, 2009 6:16 PM To: hive-user@hadoop.apache.org Subject: Re: Directing Hive to perform Hash Join for small inner tables Hi Namit, The join column is

Re: Multiple rows from a single row

2009-09-11 Thread Sudipto Das
Thanks... I'll look into it and get back if I have any more questions. PhD Candidate CS @ UCSB Santa Barbara, CA 93106, USA http://www.cs.ucsb.edu/~sudipto On Fri, Sep 11, 2009 at 11:17 AM, Raghu Murthy wrote: > You will need to write a custom mapper script using a TRANSFORM clause. See > http

Re: Multiple rows from a single row

2009-09-11 Thread Raghu Murthy
You will need to write a custom mapper script using a TRANSFORM clause. See http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform On 9/11/09 11:04 AM, "Sudipto Das" wrote: > Hi, > > I have a scenario where in a query, I need to split up a single row and output > 3 different rows. Is this
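A custom mapper for the TRANSFORM approach might look like the sketch below. It assumes each input row has a key followed by three value columns and that every value should become its own output row; the column layout, the script name split_rows.py and the table name in the comment are all hypothetical. Hive passes rows to the script as tab-separated lines on stdin and reads tab-separated rows back from stdout.

```python
import sys

# Hypothetical invocation from Hive (names are illustrative, not from the thread):
#   ADD FILE split_rows.py;
#   SELECT TRANSFORM (key, a, b, c) USING 'python split_rows.py' AS (key, value)
#   FROM src_table;


def split_row(line):
    """Turn one tab-separated row (key, a, b, c) into three 'key<TAB>value' rows."""
    fields = line.rstrip("\n").split("\t")
    key, values = fields[0], fields[1:4]
    return ["%s\t%s" % (key, value) for value in values]


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Stream rows from stdin, writing three output rows per input row."""
    for line in stdin:
        for out_row in split_row(line):
            stdout.write(out_row + "\n")
```

To use this as a standalone script, add a call to main() at the bottom of the file; it is left out here so the functions can be imported and tried on sample rows first.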

Multiple rows from a single row

2009-09-11 Thread Sudipto Das
Hi, I have a scenario where in a query, I need to split up a single row and output 3 different rows. Is this supported by Hive, and if yes, can someone point me to the syntax? Thanks Sudipto PhD Candidate CS @ UCSB Santa Barbara, CA 93106, USA http://www.cs.ucsb.edu/~sudipto

Re: which thrift reversion do you use ?

2009-09-11 Thread Bill Graham
+1 I've been struggling with thrift versions as well, see: https://issues.apache.org/jira/browse/HIVE-795?focusedCommentId=12754020&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12754020 Any insight into which version of thrift the Hive trunk is using would be hel

RE: Files does not exist error: concurrency control on hive queries...

2009-09-11 Thread Ashish Thusoo
Another option is to deal with this using versioning. Some ideas on this are at https://issues.apache.org/jira/browse/HIVE-718 Ashish From: Eva Tse [e...@netflix.com] Sent: Wednesday, September 09, 2009 10:45 PM To: hive-user@hadoop.apache.org Subject: Re:

which thrift reversion do you use ?

2009-09-11 Thread Min Zhou
Hi all, we've tried the newest one from trunk and r760184; neither of them produces the same code as Hive trunk. Which Thrift revision do you use? Thanks, Min -- My research interests are distributed systems, parallel computing and bytecode based virtual machine. My profile: http://www.linke

RE: enforcing query with partition column

2009-09-11 Thread Ashish Thusoo
What does EXPLAIN show? An easy workaround for now is to push the partition predicate into a subquery on the table. Ashish From: Abhijit Pol [a...@rocketfuelinc.com] Sent: Thursday, September 10, 2009 10:40 PM To: hive-user@hadoop.apache.org Subject: Re: