[ 
https://issues.apache.org/jira/browse/CASSANDRA-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096811#comment-13096811
 ] 

Brandyn White commented on CASSANDRA-3134:
------------------------------------------

So the only requirement is that it has TypedBytes support. I personally use 
CDH, but I believe TypedBytes was accepted upstream in 
[Hadoop 0.21|http://hadoop.apache.org/common/docs/r0.21.0/changes.html], so this 
should work on vanilla 0.21 and CDH 2/3.
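
For readers unfamiliar with TypedBytes, here is a minimal Python sketch of the 
wire format, assuming the type codes documented for the Hadoop typedbytes 
package (0 = bytes, 3 = int, 4 = long, 7 = string). It handles only those four 
types; the function names encode and decode are illustrative and are not 
Hadoopy's or Dumbo's API.

```python
import struct

# Type codes as documented for the Hadoop typedbytes package.
TYPE_BYTES, TYPE_INT, TYPE_LONG, TYPE_STRING = 0, 3, 4, 7


def encode(value):
    """Serialize one value: a 1-byte type code followed by a big-endian payload."""
    if isinstance(value, bool):
        raise TypeError("bool is not handled in this sketch")
    if isinstance(value, bytes):
        # 4-byte length, then the raw bytes.
        return struct.pack(">Bi", TYPE_BYTES, len(value)) + value
    if isinstance(value, int):
        if -2**31 <= value < 2**31:
            return struct.pack(">Bi", TYPE_INT, value)
        return struct.pack(">Bq", TYPE_LONG, value)
    if isinstance(value, str):
        data = value.encode("utf-8")
        # 4-byte length, then the UTF-8 bytes.
        return struct.pack(">Bi", TYPE_STRING, len(data)) + data
    raise TypeError("unsupported type: %r" % type(value))


def decode(buf, offset=0):
    """Read one value starting at offset; return (value, next_offset)."""
    code = buf[offset]
    offset += 1
    if code == TYPE_INT:
        return struct.unpack_from(">i", buf, offset)[0], offset + 4
    if code == TYPE_LONG:
        return struct.unpack_from(">q", buf, offset)[0], offset + 8
    if code in (TYPE_BYTES, TYPE_STRING):
        (length,) = struct.unpack_from(">i", buf, offset)
        start = offset + 4
        data = bytes(buf[start:start + length])
        value = data if code == TYPE_BYTES else data.decode("utf-8")
        return value, start + length
    raise ValueError("unsupported type code: %d" % code)
```

A streaming program would read alternating key and value records from stdin 
with decode and write its output with encode; vectors, lists, and maps are 
left out here for brevity.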

> Patch Hadoop Streaming Source to Support Cassandra IO
> -----------------------------------------------------
>
>                 Key: CASSANDRA-3134
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3134
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Hadoop
>            Reporter: Brandyn White
>            Priority: Minor
>              Labels: hadoop, hadoop_examples_streaming
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> (text is a repost from 
> [CASSANDRA-1497|https://issues.apache.org/jira/browse/CASSANDRA-1497])
> I'm the author of the Hadoopy (http://bwhite.github.com/hadoopy/) Python 
> library and I'm interested in taking another stab at streaming support. 
> Hadoopy and Dumbo both use the TypedBytes format that is in CDH for 
> communication with the streaming jar. A simple way to get this to work is to 
> modify the streaming code (make hadoop-cassandra-streaming.jar) so that it 
> uses the same TypedBytes communication with streaming programs, but the 
> actual job IO uses the Cassandra IO. The user would have the exact same 
> streaming interface, but would specify the keyspace, etc. using 
> environment variables.
> The benefits of this are
> 1. Easy implementation: Take the Cloudera-patched version of streaming, 
> change the IO, and add environment variable reading.
> 2. Client side only: As the streaming jar is included in the job, no 
> server-side changes are required.
> 3. Simple maintenance: If the Hadoop Cassandra interface changes, then this 
> would require the same simple fixup as any other Hadoop job.
> 4. The TypedBytes format supports all of the necessary Cassandra types 
> (https://issues.apache.org/jira/browse/HADOOP-5450)
> 5. Compatible with existing streaming libraries: Hadoopy and Dumbo would only 
> need to know the path of this new streaming jar
> 6. No need for Avro
> The negatives of this are
> 1. Duplicative code: This would be a duplicate and patch of the streaming 
> jar, although the change itself can be stored as a patch.
> 2. I'd have to check, but this solution should work on stock Hadoop (cluster 
> side); it requires TypedBytes (client side), which can be included in the 
> jar.
> I can code this up, but I wanted to get some feedback from the community first.
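
The description above proposes passing the keyspace and related settings 
through environment variables. A rough sketch of what that lookup could look 
like follows. It is written in Python purely for illustration (in the proposal 
the reading would happen inside the patched streaming jar), and every variable 
name and default below is a hypothetical placeholder, since the issue does not 
define any.

```python
import os

# Hypothetical variable names; the proposal does not specify them.
REQUIRED = ("CASSANDRA_KEYSPACE", "CASSANDRA_COLUMN_FAMILY")
# Assumed defaults for illustration only.
OPTIONAL = {"CASSANDRA_HOST": "localhost", "CASSANDRA_RPC_PORT": "9160"}


def read_cassandra_config(env=None):
    """Collect Cassandra job settings from the environment (or a given mapping)."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise KeyError(
            "missing required environment variables: " + ", ".join(missing)
        )
    config = {name: env[name] for name in REQUIRED}
    for name, default in OPTIONAL.items():
        config[name] = env.get(name, default)
    return config
```

Failing early on a missing keyspace keeps a misconfigured job from being 
submitted; passing a plain dict instead of os.environ makes the function easy 
to exercise without touching the real environment.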

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
