[ https://issues.apache.org/jira/browse/CASSANDRA-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096809#comment-13096809 ]
Jeremy Hanna commented on CASSANDRA-3134:
-----------------------------------------

FWIW: it might be simpler, but I'm not sure that you necessarily need CDH's streaming jar. Could HADOOP-1722 be backported to 0.20.203 by itself? That would allow it to be seamlessly integrated into Brisk as well.

BTW, this sounds great: both the streaming support and the seamless support in hadoopy and dumbo.

> Patch Hadoop Streaming Source to Support Cassandra IO
> -----------------------------------------------------
>
>                 Key: CASSANDRA-3134
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3134
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Hadoop
>            Reporter: Brandyn White
>            Priority: Minor
>              Labels: hadoop, hadoop_examples_streaming
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> (text is a repost from
> [CASSANDRA-1497|https://issues.apache.org/jira/browse/CASSANDRA-1497])
> I'm the author of the Hadoopy (http://bwhite.github.com/hadoopy/) Python
> library and I'm interested in taking another stab at streaming support.
> Hadoopy and Dumbo both use the TypedBytes format that is in CDH for
> communication with the streaming jar. A simple way to get this to work is
> to modify the streaming code (making a hadoop-cassandra-streaming.jar) so
> that it uses the same TypedBytes communication with streaming programs,
> while the actual job IO uses the Cassandra IO. The user would have the
> exact same streaming interface, but would specify the keyspace, etc.
> using environment variables.
> The benefits of this are
> 1. Easy implementation: Take the Cloudera-patched version of streaming and
> change the IO, and add environment variable reading.
> 2. Client side only: As the streaming jar is included in the job, no
> server-side changes are required.
> 3. Simple maintenance: If the Hadoop Cassandra interface changes, this
> would require the same simple fixup as any other Hadoop job.
> 4. The TypedBytes format supports all of the necessary Cassandra types
> (https://issues.apache.org/jira/browse/HADOOP-5450)
> 5. Compatible with existing streaming libraries: Hadoopy and dumbo would
> only need to know the path of this new streaming jar.
> 6. No need for Avro.
> The negatives of this are
> 1. Duplicative code: This would be a dupe and patch of the streaming jar.
> It can itself be stored as a patch.
> 2. I'd have to check, but this solution should work on a stock Hadoop
> (cluster side); it requires TypedBytes (client side), which can be
> included in the jar.
> I can code this up but I wanted to get some feedback from the community first.
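
For readers unfamiliar with TypedBytes, the following is a minimal sketch of how a streaming program might serialize one record in that format. The type codes and big-endian layout follow the Hadoop typedbytes package documentation. Everything else is an assumption made for illustration: the `encode` helper, the row layout (row key as bytes, columns as a name-to-value map), and the sample values are hypothetical and are not taken from the proposed patch, hadoopy, or dumbo.

```python
import struct

# Type codes as documented for the Hadoop typedbytes package.
TYPE_BYTES, TYPE_INT, TYPE_LONG, TYPE_STRING, TYPE_MAP = 0, 3, 4, 7, 10


def encode(value):
    """Serialize a value as TypedBytes: a 1-byte type code, then a big-endian payload."""
    if isinstance(value, bytes):
        # BYTES: 32-bit length, then the raw bytes.
        return struct.pack(">Bi", TYPE_BYTES, len(value)) + value
    if isinstance(value, bool):
        # bool is a subclass of int in Python; it is left out of this sketch.
        raise TypeError("bool is not handled in this sketch")
    if isinstance(value, int):
        if -2**31 <= value < 2**31:
            return struct.pack(">Bi", TYPE_INT, value)
        return struct.pack(">Bq", TYPE_LONG, value)
    if isinstance(value, str):
        # STRING: 32-bit length, then UTF-8 bytes.
        data = value.encode("utf-8")
        return struct.pack(">Bi", TYPE_STRING, len(data)) + data
    if isinstance(value, dict):
        # MAP: 32-bit pair count, then each key and value as typed bytes.
        out = struct.pack(">Bi", TYPE_MAP, len(value))
        for k, v in value.items():
            out += encode(k) + encode(v)
        return out
    raise TypeError("unsupported type: %r" % type(value))


# Hypothetical row layout: row key as bytes, columns as a name -> value map.
row_key = b"user42"
columns = {b"name": b"alice", b"visits": 7}
record = encode(row_key) + encode(columns)
```

A key/value pair is simply the two encoded values written back to back, which is why the same streaming interface could carry Cassandra rows without the streaming program knowing where the data came from.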