These numbers are definitely preliminary, and the reason we sent them out
was to involve the community from the get-go and have them critique this work.
The mistake, though, was sending this out on the users list as opposed to the dev
lists.
Regarding better than map/reduce, I think that the nu
You could either do what Owen suggested and put the plugin in hive contrib, or
you could just put the whole thing in hive contrib, as then you would have
access to all the lower-level APIs (core, hdfs, hive, etc.). Owen's approach
makes a lot of sense if you think that the hive dependency is a loos
OK, that explains a lot. When we started off on Hive, our immediate use case
was to do group-bys on data with a lot of skew in the grouping keys. In that
scenario it is better to do this in 2 map/reduce jobs, using the first one to
randomly distribute the data and generate the partial sums, follow
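A rough sketch of that two-job pattern written by hand in Hive QL, assuming a
hypothetical table src(key) and a count aggregation; the random bucket in the
inner query is what spreads a hot key's rows across reducers:

-- Job 1: partial counts per (key, random bucket), so a skewed key's
-- rows are distributed across many reducers.
-- Job 2: merge the partial counts into the final count per key.
SELECT t.key, SUM(t.partial_cnt) AS cnt
FROM (
  SELECT s.key, s.bucket, COUNT(1) AS partial_cnt
  FROM (SELECT key, CAST(rand() * 32 AS INT) AS bucket FROM src) s
  GROUP BY s.key, s.bucket
) t
GROUP BY t.key;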
Scott,
Namit is actually correct. If you do an explain on the query that he sent out,
you actually get only 2 map/reduce jobs with Hive, not 5. We have verified
that, and it is consistent with what we should expect in this case. We would
be very interested to know the exact query that you us
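For reference, checking the plan is just a matter of prefixing the query with
EXPLAIN; a minimal sketch with a made-up table name:

-- Each "Stage" in the EXPLAIN output corresponds to a job in the plan,
-- so you can count the map/reduce jobs directly.
EXPLAIN
SELECT key, COUNT(1)
FROM src
GROUP BY key;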
Ricky,
For your particular example, Hive allows you to plug in user-defined map and
reduce scripts (in the language of your choice) within Hive QL (there are some
minor extensions to SQL to support such a use case). So for your case you could
do the following:
FROM (FROM lines
MAP line US
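The statement above is cut off; a sketch of how the full form might look for a
word-count job, with hypothetical user scripts mapper.py and reducer.py and a
hypothetical output table word_counts:

FROM (
  FROM lines
  MAP line
  USING 'python mapper.py'   -- user script: emits (word, 1) pairs
  AS word, cnt
  CLUSTER BY word            -- send equal words to the same reducer
) map_output
INSERT OVERWRITE TABLE word_counts
REDUCE map_output.word, map_output.cnt
USING 'python reducer.py'    -- user script: sums the counts per word
AS word, cnt;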
Owen,
Just wanted to mention that there is a talk on Hive as well, on Friday at 9:30 AM...
Ashish
-Original Message-
From: Owen O'Malley [mailto:[EMAIL PROTECTED]
Sent: Friday, October 31, 2008 10:47 AM
To: [EMAIL PROTECTED]
Cc: core-user@hadoop.apache.org
Subject: ApacheCon US 2008
Just a r
we could
get out of bytecode generation. What kind of performance speedups have you
seen with bytecode generation in data processing applications?
Ashish
-Original Message-
From: Ben Maurer [mailto:[EMAIL PROTECTED]
Sent: Monday, October 27, 2008 1:08 PM
To: Ashish Thusoo
Cc: [EMAIL
Folks,
Here are some of the things that we are working on internally at Facebook. We
thought it would be a good idea to let everyone know what is going on with Hive
development. We will put this up on the wiki as well.
1. Integrating Dynamic SerDe with the DDL. (Zheng/Pete) - This allows the us
Hi Edward,
Can you send us the contents of /tmp//hive.log. Also, let's open a JIRA
for this and carry out the discussion there - even if this is not a bug (which
it may turn out to be), NullPointerException is not the most useful
user-visible message, so at least that much we should fix...
Thanks,
Ash
Hi Edward,
You can have multiple instances of Hive by pointing the Hive CLI to different
configs (this is very similar to the Hadoop model). Take a look at
hive-default.xml in your Hive instance. You can create different copies of this
file and change the following properties:
hive.metastore.wa
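Properties can also be overridden per CLI session with set; a sketch, where the
value is a placeholder and hive.metastore.warehouse.dir is offered only as an
example Hive property, not necessarily the one truncated above:

-- point this session's warehouse at a different directory
set hive.metastore.warehouse.dir=/user/hive/warehouse_instance2;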
Hi Juho,
Hive can support your partitioning scheme. Just use the PARTITIONED BY clause
in the CREATE TABLE statement to identify the partitioning columns, with the
top-level partitioning column being the first one in the list.
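A minimal sketch of such a statement, with made-up table and column names;
ds comes first in the list, so it is the top-level partition:

CREATE TABLE logs (
  line STRING
)
PARTITIONED BY (ds STRING, hr STRING);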
Ashish
-Original Message-
From: Jeff Hammerbacher [mailto:[EM
2008 at 9:47 AM, Ashish Thusoo <[EMAIL PROTECTED]>
wrote:
> Hi Folks,
>
> We recently opened up a JIRA in order to bring Hive into the open
> source fold with the aim of contributing back to hadoop - which has
> really made large scale data processing so much easier for us at
-
From: Ashish Thusoo (JIRA) [mailto:[EMAIL PROTECTED]
Sent: Tuesday, July 08, 2008 4:15 PM
To: Ashish Thusoo
Subject: [jira] Updated: (HADOOP-3601) Hive as a contrib project
[
https://issues.apache.org/jira/browse/HADOOP-3601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
Hi Stuart,
Join is a higher-level logical operation, while map/reduce is a technique that
can be used to implement it. Specifically, in relational algebra, the join
construct specifies how to form a single output row from 2 rows arising from
two input streams. There are very many ways of implemen
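In Hive QL terms, the user states only the logical join and the system picks
the physical strategy; a sketch with hypothetical tables:

-- The ON clause says how two input rows form one output row; whether
-- the matching rows are brought together by a reduce-side sort-merge
-- or a map-side hash join is an implementation choice.
SELECT u.name, p.url
FROM users u JOIN pageviews p ON (u.userid = p.userid);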
, and there were
> no timeouts.
>
> 332/214 = 55% more time with 5/7 = 71% servers.
>
> so our conclusion is that more servers will make the cluster faster.
>
>
>
> Ashish Thusoo wrote:
>> Try by first just reducing the number of files and increasing the
>>
r the smaller set
>
> i uploaded the same data 10 times in different directories ( so more
> files, same size )
>
>
> Ashish Thusoo wrote:
>> Apart from the setup times, the fact that you have 3500 files means
>> that you are going after around 220GB of data as each file
Apart from the setup times, the fact that you have 3500 files means that
you are going after around 220GB of data, as each file would have at least
one chunk: 3,500 files x 64 MB per chunk = 224,000 MB, or roughly 220 GB
(this assumes a chunk size of 64MB and that each file has at least some
data). Mappers would probably
need to read
If you are asking whether you can call collect many times for every row
being processed, the answer is yes. MR does not put any restrictions on
how many output key/value pairs you can produce for each input key/value
pair.
Ashish
-Original Message-
From: Colin Freas [mailto:[EMAIL PROTE
This is very interesting and very useful.
There was some work done in the database community looking at different
block organizations that boost cache and I/O performance, and they
essentially proposed a scheme similar to what you are talking about
(although at the database block level)
Link is
Hi Raghu,
I am rerunning some of the things that failed to see if it happens
again. So far it has happened once, and it basically caused a long-running job
to fail at the very end.
I will also get the info from the namenode log that you mentioned in the
JIRA and put it up there.
Thanks,
Ashish
-O
and output directory
On Feb 9, 2008, at 3:52 PM, Ashish Thusoo wrote:
> Hi Hadoop users,
>
>
>
> We have intermittently hit issues with speculative execution and
> hadoop
> streaming where we see a directory of the form
>
> _task_200__m_..._.
>
It'
Hi Hadoop users,
We have intermittently hit issues with speculative execution and Hadoop
streaming where we see a directory of the form
_task_200__m_..._.
formed in the output directory. Has anyone out there hit similar issues
or know what might be happening here? We did scan th