Re: A simple performance benchmark for Hadoop, Hive and Pig

2009-06-19 Thread Ashish Thusoo
These numbers are definitely preliminary, and the reason that we sent them out was to involve the community from the get-go and have them critique this work. The mistake though was sending this out on the users list as opposed to the dev lists. Regarding better than map/reduce I think that the nu

RE: Linking against Hive in Hadoop development tree

2009-05-20 Thread Ashish Thusoo
You could either do what Owen suggested and put the plugin in hive contrib, or you could just put the whole thing in hive contrib as then you would have access to all the lower level api (core, hdfs, hive etc.). Owen's approach makes a lot of sense if you think that the hive dependency is a loos

RE: PIG and Hive

2009-05-07 Thread Ashish Thusoo
Ok that explains a lot of that. When we started off Hive our immediate use case was to do group bys on data with a lot of skew on the grouping keys. In that scenario it is better to do this in 2 map/reduce jobs, using the first one to randomly distribute data and generate the partial sums follow
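The two-job approach to skewed group bys mentioned above can be sketched in plain Python. This is only an illustration, not Hive's implementation: the function names and the in-memory "reducers" are hypothetical, and because the snippet is cut off before it describes the second job, the second stage here (summing the partial sums per key) is an assumption.

```python
import random
from collections import defaultdict


def stage_one(rows, num_reducers, rng):
    """First job: send each row to a random reducer and build partial sums there."""
    buckets = [defaultdict(int) for _ in range(num_reducers)]
    for key, value in rows:
        # Random placement spreads a heavily skewed key over all reducers.
        buckets[rng.randrange(num_reducers)][key] += value
    # Each reducer emits at most one (key, partial_sum) pair per key it saw.
    return [pair for bucket in buckets for pair in bucket.items()]


def stage_two(partials):
    """Second job (assumed): group the partial sums by key and add them up."""
    totals = defaultdict(int)
    for key, partial in partials:
        totals[key] += partial
    return dict(totals)


# A skewed input: one key dominates the data.
rows = [("hot", 1)] * 1000 + [("cold", 1)] * 3
partials = stage_one(rows, 4, random.Random(0))
totals = stage_two(partials)
```

With four reducers in the first stage, the second stage sees at most four partial sums per key, however skewed the original input was.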

RE: PIG and Hive

2009-05-07 Thread Ashish Thusoo
Scott, Namit is actually correct. If you do an explain on the query that he sent out, you actually get only 2 map/reduce jobs and not 5 with Hive. We have verified that and that is consistent with what we should expect in this case. We would be very interested to know the exact query that you us

RE: PIG and Hive

2009-05-06 Thread Ashish Thusoo
Ricky, For your particular example Hive allows you to plug in a user-defined map and reduce script (in the language of your choice) within Hive QL (there are some minor extensions to SQL to support such a use case). So for your case you could do the following: FROM (FROM lines MAP line US

RE: ApacheCon US 2008

2008-10-31 Thread Ashish Thusoo
Owen, Just wanted to mention that there is a talk on Hive as well on Friday 9:30AM... Ashish -Original Message- From: Owen O'Malley [mailto:[EMAIL PROTECTED] Sent: Friday, October 31, 2008 10:47 AM To: [EMAIL PROTECTED] Cc: core-user@hadoop.apache.org Subject: ApacheCon US 2008 Just a r

RE: [hive-users] Hive Roadmap (Some information)

2008-10-27 Thread Ashish Thusoo
we could get out of byte code generation. What kind of performance speedups have you seen with byte code generation in data processing applications? Ashish -Original Message- From: Ben Maurer [mailto:[EMAIL PROTECTED] Sent: Monday, October 27, 2008 1:08 PM To: Ashish Thusoo Cc: [EMAIL

Hive Roadmap (Some information)

2008-10-27 Thread Ashish Thusoo
Folks, Here are some of the things that we are working on internally at Facebook. We thought it would be a good idea to let everyone know what is going on with Hive development. We will put this up on the wiki as well. 1. Integrating Dynamic SerDe with the DDL. (Zheng/Pete) - This allows the us

RE: Hive questions about the meta db

2008-10-02 Thread Ashish Thusoo
Hi Edward, Can you send us the contents of /tmp//hive.log? Also let's open a JIRA for this and carry out the discussion there - even if this is not a bug (which it may turn out to be), NullPointerException is not the most useful user visible message, so at least that we should fix... Thanks, Ash

RE: Hive questions about the meta db

2008-10-01 Thread Ashish Thusoo
Hi Edward, You can have multiple instances of hive by pointing the hive cli to different configs (This is very similar to the hadoop model). Take a look at hive-default.xml in your hive instance. You can create different copies of this file and change the following properties: hive.metastore.wa

RE: Reading and writing Thrift data from MapReduce

2008-09-04 Thread Ashish Thusoo
Hi Juho, Hive can support your partitioning scheme. Just use the partition by clause in the create table statement to identify the partitioning columns with the top level partitioning column being the first one in the list. Ashish -Original Message- From: Jeff Hammerbacher [mailto:[EM

RE: FW: [jira] Updated: (HADOOP-3601) Hive as a contrib project

2008-07-09 Thread Ashish Thusoo
2008 at 9:47 AM, Ashish Thusoo <[EMAIL PROTECTED]> wrote: > Hi Folks, > > We recently opened up a JIRA in order to bring Hive into the open > source fold with the aim of contributing back to hadoop - which has > really made large scale data processing so much easier for us at

FW: [jira] Updated: (HADOOP-3601) Hive as a contrib project

2008-07-09 Thread Ashish Thusoo
- From: Ashish Thusoo (JIRA) [mailto:[EMAIL PROTECTED] Sent: Tuesday, July 08, 2008 4:15 PM To: Ashish Thusoo Subject: [jira] Updated: (HADOOP-3601) Hive as a contrib project [ https://issues.apache.org/jira/browse/HADOOP-3601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

RE: Difference between joining and reducing

2008-07-03 Thread Ashish Thusoo
Hi Stuart, Join is a higher level logical operation while map/reduce is a technique that could be used to implement it. Specifically, in relational algebra, the join construct specifies how to form a single output row from 2 rows arising from two input streams. There are very many ways of implemen
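To illustrate the point that a join is a logical operation with more than one possible implementation, here is a small Python sketch. The two functions and the sample data are hypothetical and are not taken from the thread. Both pair up rows from two inputs that share a key, one by comparing every pair of rows and one by building a lookup table first.

```python
from collections import defaultdict


def nested_loop_join(left, right):
    """Compare every left row with every right row; keep pairs with equal keys."""
    return [(k1, a, b) for k1, a in left for k2, b in right if k1 == k2]


def hash_join(left, right):
    """Index the right input by key, then probe the index once per left row."""
    index = defaultdict(list)
    for key, b in right:
        index[key].append(b)
    return [(key, a, b) for key, a in left for b in index[key]]


# Hypothetical sample inputs: (key, payload) rows from two streams.
users = [(1, "ann"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "cup")]

joined_nested = nested_loop_join(users, orders)
joined_hash = hash_join(users, orders)
```

Both functions return the same rows; they differ only in how much work they do to find the matches.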

RE: hadoop benchmarked, too slow to use

2008-06-11 Thread Ashish Thusoo
, and there were > no timeouts. > > 332/214 = 55% more time with 5/7 = 71% servers. > > so our conclusion is that more servers will make the cluster faster. > > > > Ashish Thusoo wrote: >> Try by first just reducing the number of files and increasing the >>

RE: hadoop benchmarked, too slow to use

2008-06-10 Thread Ashish Thusoo
r the smaller set > > i uploaded the same data 10 times in different directories ( so more > files, same size ) > > > Ashish Thusoo wrote: >> Apart from the setup times, the fact that you have 3500 files means >> that you are going after around 220GB of data as each file

RE: hadoop benchmarked, too slow to use

2008-06-10 Thread Ashish Thusoo
Apart from the setup times, the fact that you have 3500 files means that you are going after around 220GB of data as each file would have at least one chunk (this calculation is assuming a chunk size of 64MB and this assumes that each file has at least some data). Mappers would probably need to read
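The estimate in the message above can be reproduced with a few lines of arithmetic. This is only a check of the stated figure under the message's own assumptions (3500 files, one 64MB chunk each); the variable names are illustrative.

```python
FILES = 3500      # number of files, from the message
CHUNK_MB = 64     # assumed chunk size in MB, from the message

total_mb = FILES * CHUNK_MB    # one chunk per file
total_gb = total_mb / 1024     # convert MB to GB using 1024 MB per GB

print(f"{total_mb} MB is about {total_gb:.2f} GB")
```

The result is 224000 MB, or 218.75 GB when dividing by 1024, which matches the "around 220GB" quoted in the message.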

RE: Simple question: call collect multiple times?

2008-06-09 Thread Ashish Thusoo
If you are asking whether you can call collect many times for every row being processed, the answer is yes. MR does not put any restrictions on how many output key, value pairs you can produce for every input key value pair. Ashish -Original Message- From: Colin Freas [mailto:[EMAIL PROTE
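A minimal Python sketch of the answer above: one input record producing several output pairs. The ListCollector class is a hypothetical stand-in that only records what it is given; it is not the Hadoop API, and word_map is an illustrative map function.

```python
class ListCollector:
    """Hypothetical stand-in for an output collector; it just records pairs."""

    def __init__(self):
        self.pairs = []

    def collect(self, key, value):
        self.pairs.append((key, value))


def word_map(_key, line, output):
    """Map function that calls collect once per word in the input line."""
    for word in line.split():
        output.collect(word, 1)


out = ListCollector()
word_map(0, "to be or not to be", out)  # one input record, six collect calls
```

Here a single input line leads to six calls to collect, one per word.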

RE: File Per Column in Hadoop

2008-03-11 Thread Ashish Thusoo
This is very interesting and very useful. There was some work done in the database community to look at different block organizations that boost cache and I/O performance, and essentially they also proposed a scheme similar to what you are talking about (although at a database block level). Link is

RE: Is anyone looking at this LeaseExpiration issue??

2008-02-20 Thread Ashish Thusoo
Hi Raghu, I am rerunning some of the things that failed to see if it happens again. So far it happened once, and basically caused a long running job to fail at the very end. I will also get the info from the namenode log that you mentioned in the JIRA and put it up there. Thanks, Ashish -O

RE: Speculative execution and output directory

2008-02-11 Thread Ashish Thusoo
and output directory On Feb 9, 2008, at 3:52 PM, Ashish Thusoo wrote: > Hi Hadoop users, > > > > We have intermittently hit issues with speculative execution and > hadoop > streaming where we see a directory of the form > > _task_200__m_..._. > It'

Speculative execution and output directory

2008-02-09 Thread Ashish Thusoo
Hi Hadoop users, We have intermittently hit issues with speculative execution and hadoop streaming where we see a directory of the form _task_200__m_..._. formed in the output directory. Has anyone out there hit similar issues or knows what might be happening here? We did scan th