Hive Reduce error

2011-09-02 Thread Ayon Sinha
Hi,
I'm pretty sure I've seen this error before on a regular Hadoop job but I don't 
know how to fix it. Can anyone hint at what might be causing this? I'm 
running Brisk Hive, but I think this is a more generic Hadoop error caused by 
some setting I have wrong.

java.lang.RuntimeException: Hive Runtime Error while closing operators: java.io.IOException: TimedOutException()
Unable to rename output to: cfs://null/tmp/hive-root/hive_2011-09-03_01-02-52_114_2999445482253169407/_tmp.-ext-10001/00_0
    at org.apache.hadoop.hive.ql.exec.ExecReducer.close(ExecReducer.java:311)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:528)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:419)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.Child.main(Child.java:253)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: TimedOutException()
Unable to rename output to: cfs://null/tmp/hive-root/hive_2011-09-03_01-02-52_114_2999445482253169407/_tmp.-ext-10001/00_0
    at org.apache.hadoop.hive.ql.exec.FileSinkOperator$FSPaths.commit(FileSinkOperator.java:186)
    at org.apache.hadoop.hive.ql.exec.FileSinkOperator$FSPaths.access$200(FileSinkOperator.java:98)
    at org.apache.hadoop.hive.ql.exec.FileSinkOperator.closeOp(FileSinkOperator.java:644)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:557)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:566)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:566)
    at org.apache.hadoop.hive.ql.exec.ExecReducer.close(ExecReducer.java:303)
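
One thing I plan to try on my end (just a guess on my part, not a confirmed
fix): the rename is happening on cfs://, so the TimedOutException() presumably
comes from the Cassandra side rather than from Hive itself. Raising the RPC
timeout in cassandra.yaml might help (the value below is arbitrary):

# cassandra.yaml (10000 ms is the default)
rpc_timeout_in_ms: 30000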
 
-Ayon
See My Photos on Flickr
Also check out my Blog for answers to commonly asked questions.


Re: Hive + Cassandra?

2011-09-02 Thread Joe Key
@Ayon
Are you certain that Brisk is a commercial version?  Their datasheet states
that "Datastax' Brisk is an enhanced open-source distribution".  While I
have no idea what "enhanced" means (perhaps they offer expertise for a fee),
the project's public GitHub repo proudly displays an Apache License.

SOURCES:
https://github.com/riptano/brisk/blob/beta2/LICENSE.txt
http://www.datastax.com/wp-content/uploads/2011/03/WP-Brisk.pdf


On Fri, Sep 2, 2011 at 7:37 AM, Edward Capriolo wrote:

>
>
> On Fri, Sep 2, 2011 at 12:58 AM, Ayon Sinha  wrote:
>
>> Hi,
>> I'm looking for the status of the Open source Apache project that is
>> integrating Hive & Cassandra. I was under the impression that Datastax'
>> Brisk is a commercial version of that, but I'm looking for the original. BTW,
>> the Brisk Beta 2 release was pain-free to install and run, but it doesn't return
>> correct results: it either returns the entire table or an empty set. But
>> that's not relevant in this forum. Just looking for the Apache project.
>>
>> -Ayon
>> See My Photos on Flickr 
>> Also check out my Blog for answers to commonly asked 
>> questions.
>>
>
> Ayon,
>
> The original issue to follow was
> https://issues.apache.org/jira/browse/HIVE-1434 . Hive-Cassandra never
> made it into hive-trunk due to many constraints in the testing environment
> that made the process hard to evolve. The datastax crew was hungry to hack
> at it so that code lives with them now. In the future it would be great if
> we can bring all the cool things Brisk has back into the Hive mainline. You
> can get some help here or ask me on #hive IRC, but your best bet is
> #datastax-brisk on IRC or datastax support.
>
> For reference, it is possible to take the cassandra handler jars from brisk
> and drop them into a hive release. This allows you to use the cassandra
> handler without using the other parts of brisk.
>
>
> https://github.com/riptano/hive/wiki/Cassandra-Handler-usage-in-Hive-0.7-with-Cassandra-0.7
>
> Edward
>



-- 
Joe Andrew Key (Andy)


Re: Google Protocol Buffers and Hive

2011-09-02 Thread valentina kroshilina
You can still partition the data. You'll have to run queries to add
partitions to the table, otherwise your table won't see a new partition, but
you'll have to do that regardless of what type of table you use.
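
For example, adding a day's partition looks something like this (the partition
column and path are made up for illustration):

ALTER TABLE TABLE_NAME ADD PARTITION (dt = '2011-09-02')
LOCATION '/PATH/dt=2011-09-02';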

We have a big cluster so I don't really see any change in performance, Hive
for this type of data is relatively fast.

For some cases GPB has advantages over plain text, so it depends...

On Fri, Sep 2, 2011 at 2:57 PM, Matias Silva wrote:

> Hi Valentina, thanks for your response.  Do you think that, using external
> tables, I can still partition the data?  I do like
> the external table idea because it will save us from having to do an
> additional import of the data into Hive; we can just load it
> into HDFS.   Plus it will save on space.
>
> How is the performance using GPB/Hive?
>
> Another thing I think we can do is use Pig/Elephant-Bird to read the
> GPB files, write them out in a tab-delimited, plain-text format,
> and import the data into Hive.  This would be a copy of the data, but it
> would be cleaner.
>
> Thanks,
> Matt
>
>
> On Sep 2, 2011, at 9:43 AM, valentina kroshilina wrote:
>
> > I use MR to generate tables using Elephant-Bird's OutputFormat. Hive
> > can read from EXTERNAL tables using ProtobufHiveSerde and
> > ProtobufBlockInputFormat generated by Elephant-Bird. Create table
> > statement looks like the following:
> >
> > CREATE EXTERNAL TABLE IF NOT EXISTS TABLE_NAME
> > (
> > ...
> > )
> > ROW FORMAT SERDE 'elephantbird.proto.hive.serde.LzoXXXProtobufHiveSerde'
> > STORED AS
> > INPUTFORMAT 'elephantbird.proto.mapred.input.DeprecatedLzoXXXProtobufBlockInputFormat'
> > OUTPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileOutputFormat'
> > LOCATION '/PATH';
> >
> > So the solution is to use external tables.
> >
> > Let me know if it helps.
> >
> > On Thu, Sep 1, 2011 at 8:45 PM, Matias Silva  wrote:
> >> Hi Everyone, is there any documentation regarding importing
> >> GoogleProtocolBuffer files into Hive.  I'm scouring the internet and the
> >> closest thing I've come across is
> >> http://search-hadoop.com/m/9zF4MEW5Od1/v=plain
> >> I saw something from Elephant-Bird where I can load the GPB file using pig
> >> and then store it in a plain text format and then load it
> >> into Hive.  It would be great if I could just load GPB files directly into
> >> Hive.
> >> Any pointers?
> >> Thanks for your time and knowledge,
> >> Matt
> >>
> >>
>
>
> Matias Silva   [Sr. Data Warehouse Developer]
> p 949.861. x1420  f 949.861.8990
> specificmedia.com
>
>
>
>


Best practices for storing data on Hive

2011-09-02 Thread Mark Grover
Hello folks,
I am fairly new to Hive and am wondering if you could share some of the best 
practices for storing/querying data with Hive.

Here is an example of the problem I am trying to solve.

The traffic to our website is logged in files that contain information about 
clicks from various users.
Simplified, the log file looks like:
t_1, ip_1, userid_1
t_2, ip_2, userid_2
t_3, ip_3, userid_3
...

where t_i represents time of the click, ip_i represents ip address where the 
click originated from, and userid_i represents the user ID of the user.

Since the clicks are logged on an ongoing basis, partitioning our Hive table by 
day seemed like the obvious choice. Every night we upload the data from the 
previous day into a new partition.

However, we would also want the capability to find all log lines corresponding 
to a particular user. With our present partitioning scheme, all day partitions 
are searched for that user ID, but this takes a long time. I am looking for 
ideas/suggestions/thoughts/comments on how to reduce this time.

As a solution, I am thinking that perhaps we could have 2 independent tables, 
one which stores data partitioned by day and the other partitioned by userId. 
With the second table partitioned by userId, I will have to find some way of 
maintaining the partitions, since Hive doesn't support appending to files. Also, 
this seems suboptimal, since we would be doubling the amount of data that we 
store. What do you folks think of this idea?
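
For concreteness, here is roughly what I have in mind (a sketch only; table and
column names are made up):

CREATE TABLE clicks_by_day (click_time STRING, ip STRING, userid STRING)
PARTITIONED BY (dt STRING);

-- the same data again, partitioned by user for the per-user lookups
CREATE TABLE clicks_by_user (click_time STRING, ip STRING)
PARTITIONED BY (userid STRING);

(Or perhaps the second table should be bucketed instead, e.g. CLUSTERED BY
(userid) INTO 64 BUCKETS, to avoid creating one tiny partition per user. I'd
welcome thoughts on that too.)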

Do you have any other suggestions on how we can approach this problem?

What have other people in similar situations done? Please share.

Thank you in advance!
Mark


Re: Google Protocol Buffers and Hive

2011-09-02 Thread valentina kroshilina
I use MR to generate tables using Elephant-Bird's OutputFormat. Hive
can read from EXTERNAL tables using ProtobufHiveSerde and
ProtobufBlockInputFormat generated by Elephant-Bird. Create table
statement looks like the following:

CREATE EXTERNAL TABLE IF NOT EXISTS TABLE_NAME
(
...
)
ROW FORMAT SERDE 'elephantbird.proto.hive.serde.LzoXXXProtobufHiveSerde'
STORED AS
INPUTFORMAT 'elephantbird.proto.mapred.input.DeprecatedLzoXXXProtobufBlockInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileOutputFormat'
LOCATION '/PATH';

So the solution is to use external tables.
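
One more caveat (the jar paths below are made up; point them at wherever your
builds actually live): the Elephant-Bird and protobuf jars have to be on
Hive's classpath before you create or query the table, e.g. from the Hive CLI:

ADD JAR /path/to/elephant-bird.jar;
ADD JAR /path/to/protobuf-java.jar;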

Let me know if it helps.

On Thu, Sep 1, 2011 at 8:45 PM, Matias Silva  wrote:
> Hi Everyone, is there any documentation regarding importing
> GoogleProtocolBuffer files into Hive.  I'm scouring the internet and the
> closest thing I've come across is
> http://search-hadoop.com/m/9zF4MEW5Od1/v=plain
> I saw something from Elephant-Bird where I can load the GPB file using pig
> and then store it in a plain text format and then load it
> into Hive.  It would be great if I could just load GPB files directly into
> Hive.
> Any pointers?
> Thanks for your time and knowledge,
> Matt
>
>


Re: Hive + Cassandra?

2011-09-02 Thread Edward Capriolo
On Fri, Sep 2, 2011 at 12:58 AM, Ayon Sinha  wrote:

> Hi,
> I'm looking for the status of the Open source Apache project that is
> integrating Hive & Cassandra. I was under the impression that Datastax'
> Brisk is a commercial version of that, but I'm looking for the original. BTW,
> the Brisk Beta 2 release was pain-free to install and run, but it doesn't return
> correct results: it either returns the entire table or an empty set. But
> that's not relevant in this forum. Just looking for the Apache project.
>
> -Ayon
> See My Photos on Flickr 
> Also check out my Blog for answers to commonly asked 
> questions.
>

Ayon,

The original issue to follow was
https://issues.apache.org/jira/browse/HIVE-1434 . Hive-Cassandra never made
it into hive-trunk due to many constraints in the testing environment that
made the process hard to evolve. The datastax crew was hungry to hack at it
so that code lives with them now. In the future it would be great if we can
bring all the cool things Brisk has back into the Hive mainline. You can get some
help here or ask me on #hive IRC, but your best bet is #datastax-brisk on
IRC or datastax support.

For reference, it is possible to take the cassandra handler jars from brisk
and drop them into a hive release. This allows you to use the cassandra
handler without using the other parts of brisk.
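
A rough sketch of what that looks like once the jars are dropped in (the class
and property names here are from memory, so double-check them against the wiki
below):

ADD JAR /path/to/hive-cassandra-handler.jar;

CREATE EXTERNAL TABLE my_column_family (key string, value string)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES ("cassandra.columns.mapping" = ":key,value",
                      "cassandra.ks.name" = "my_keyspace");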

https://github.com/riptano/hive/wiki/Cassandra-Handler-usage-in-Hive-0.7-with-Cassandra-0.7

Edward