[ https://issues.apache.org/jira/browse/CASSANDRA-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13126852#comment-13126852 ]
Brandon Williams commented on CASSANDRA-3134:
---------------------------------------------

It seems like TypedBytes is the Hadoop Way, so I think I'm ok with going with that instead of using AbstractBytes.

> Patch Hadoop Streaming Source to Support Cassandra IO
> -----------------------------------------------------
>
>                 Key: CASSANDRA-3134
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3134
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Hadoop
>            Reporter: Brandyn White
>            Priority: Minor
>              Labels: hadoop, hadoop_examples_streaming
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> (text is a repost from
> [CASSANDRA-1497|https://issues.apache.org/jira/browse/CASSANDRA-1497])
> I'm the author of the Hadoopy (http://bwhite.github.com/hadoopy/) Python
> library and I'm interested in taking another stab at streaming support.
> Hadoopy and Dumbo both use the TypedBytes format that is in CDH for
> communication with the streaming jar. A simple way to get this to work is
> to modify the streaming code (make hadoop-cassandra-streaming.jar) so that it
> uses the same TypedBytes communication with streaming programs, but the
> actual job IO uses the Cassandra IO. The user would have the exact same
> streaming interface, but would specify the keyspace, etc. using
> environment variables.
> The benefits of this are:
> 1. Easy implementation: Take the Cloudera-patched version of streaming and
> change the IO, add environment variable reading.
> 2. Only client side: As the streaming jar is included in the job, no server
> side changes are required.
> 3. Simple maintenance: If the Hadoop Cassandra interface changes, then this
> would require the same simple fixup as any other Hadoop job.
> 4. The TypedBytes format supports all of the necessary Cassandra types
> (https://issues.apache.org/jira/browse/HADOOP-5450)
> 5. Compatible with existing streaming libraries: Hadoopy and Dumbo would only
> need to know the path of this new streaming jar.
> 6. No need for Avro.
> The negatives of this are:
> 1. Duplicative code: This would be a dupe and patch of the streaming jar.
> This can be stored itself as a patch.
> 2. I'd have to check, but this solution should work on a stock Hadoop (cluster
> side); it requires TypedBytes (client side), which can be included in the
> jar.
> I can code this up but I wanted to get some feedback from the community first.
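
For illustration only, the client side of the proposal (TypedBytes key/value pairs exchanged with the streaming jar, plus job settings read from environment variables) might look roughly like the Python sketch below. It is not code from the patch or from Hadoopy. The type codes follow the documented layout of Hadoop's org.apache.hadoop.typedbytes package (a one-byte type code followed by big-endian data), and only four of the types are handled. The variable names CASSANDRA_KEYSPACE and CASSANDRA_COLUMN_FAMILY and the helper names are hypothetical, since the issue does not say what the variables would be called.

```python
import io
import os
import struct

# Type codes as documented for Hadoop's typedbytes package (subset only).
TYPE_BYTES = 0
TYPE_BOOL = 2
TYPE_INT = 3
TYPE_STRING = 7


def write_typed(out, value):
    """Write one value as a 1-byte type code followed by big-endian data."""
    if isinstance(value, bytes):
        out.write(struct.pack(">Bi", TYPE_BYTES, len(value)) + value)
    elif isinstance(value, bool):  # checked before int: bool subclasses int
        out.write(struct.pack(">B?", TYPE_BOOL, value))
    elif isinstance(value, int):  # 32-bit signed; larger values raise struct.error
        out.write(struct.pack(">Bi", TYPE_INT, value))
    elif isinstance(value, str):
        data = value.encode("utf-8")
        out.write(struct.pack(">Bi", TYPE_STRING, len(data)) + data)
    else:
        raise TypeError("unsupported type: %r" % type(value))


def read_typed(inp):
    """Read one value written by write_typed; raise EOFError at end of stream."""
    header = inp.read(1)
    if not header:
        raise EOFError("end of typed bytes stream")
    code = header[0]
    if code == TYPE_BYTES:
        (length,) = struct.unpack(">i", inp.read(4))
        return inp.read(length)
    if code == TYPE_BOOL:
        return struct.unpack(">?", inp.read(1))[0]
    if code == TYPE_INT:
        return struct.unpack(">i", inp.read(4))[0]
    if code == TYPE_STRING:
        (length,) = struct.unpack(">i", inp.read(4))
        return inp.read(length).decode("utf-8")
    raise ValueError("unsupported type code: %d" % code)


def read_pair(inp):
    """A streaming record is a key followed by a value."""
    key = read_typed(inp)
    return key, read_typed(inp)


def read_job_config(env=os.environ):
    """Hypothetical variable names; the issue only says 'keyspace, etc.'."""
    return {
        "keyspace": env["CASSANDRA_KEYSPACE"],
        "column_family": env.get("CASSANDRA_COLUMN_FAMILY"),
    }
```

A round trip through an in-memory buffer, e.g. writing b"row1" and "value" to an io.BytesIO and then calling read_pair on it, returns the same pair. In a real streaming task the same functions would be pointed at the binary stdin and stdout of the process.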