[ https://issues.apache.org/jira/browse/CASSANDRA-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096908#comment-13096908 ]

Brandyn White commented on CASSANDRA-3134:
------------------------------------------

{quote}
Could HADOOP-1722 be backported to 0.20.203 by itself? That would allow it to 
be seamlessly integrated into Brisk as well.
{quote}
Yes, it could be.  TypedBytes support is basically the only thing that 
Hadoopy/Dumbo require to communicate with streaming.

{quote}
Brandyn, to Jonathan's point though - is it possible instead of using 
TypedBytes, to use Cassandra's serialization methods via AbstractBytes to do 
this.
{quote}
So I'll give a brief overview of how I think this whole thing would come 
together (please correct me if something isn't right).  The data is always 
stored in a native binary representation in Cassandra (no type is encoded with 
the data itself); instead, the comparator (column names) and validator (row key 
and column value) classes define how to interpret the bytes.  As long as those 
classes can do the SerDe of the raw data, Cassandra doesn't care what it 
represents.  The types themselves are encoded in the column family metadata, 
and all row keys, column names, and column values must have a single type per 
column family (3 Java classes uniquely define a column family's types, 4 for 
super CFs).
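To make the validator's role concrete, here is a minimal client-side sketch (in the spirit of the pycassa conversion code linked further down) of unpacking raw bytes given a validator class name.  The function name and the set of handled types are my own illustration, not an existing API:

```python
import struct

# Sketch only: map a validator class name to a Python-side decode of the
# raw bytes Cassandra hands back.  LongType is an 8-byte big-endian long,
# UTF8Type is UTF-8 text, and BytesType is left opaque.
def unpack_value(raw, validator):
    if validator == 'org.apache.cassandra.db.marshal.LongType':
        return struct.unpack('>q', raw)[0]
    if validator == 'org.apache.cassandra.db.marshal.UTF8Type':
        return raw.decode('utf-8')
    return raw  # BytesType and anything unrecognized: opaque bytes
```

The point is that the same 8 bytes are meaningless without the column family metadata telling the client which class to apply.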

In the Hadoop Streaming code, the InputFormat would give us back values as 
ByteBuffer types (see 
[example|https://github.com/apache/cassandra/blob/trunk/examples/hadoop_word_count/src/WordCount.java#L78]).
  Is there a way to get the type classes inside of Hadoop?  For instance, are 
they passed in through a JobConf or a method call?  If so, then the simplest 
thing is to deserialize the values to standard Writables, then use the Java 
TypedBytes serialization code from HADOOP-1722 to send them to the client 
streaming program.

This is where the need for TypedBytes comes up: if we just send the raw data, 
the streaming program has no way to know how to convert it.  So, assuming there 
is a way to get the type classes for the column family in Hadoop, the 
conversion can happen either in the Hadoop Streaming code or in the client's 
code.  The problem is that the data stream is just a concatenation of the 
output bytes.  TypedBytes are self-delimiting, which makes it possible to 
figure out key/value pair data boundaries; by contrast, if you just output raw 
binary key/value pairs and somehow communicate the types out of band, the 
client streaming program would be unable to parse variable-length data (e.g., 
strings).  Moreover, the backwards compatibility you get by using TypedBytes is 
a big win compared to making the AbstractTypes self-delimiting.  If speed 
becomes an issue, we could make a direct binary-data-to-TypedBytes conversion 
that would basically just prepend a typecode and, for variable-length data, a 
size (see the TypedBytes format below).
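To illustrate the self-delimiting property, here is a rough Python sketch of the TypedBytes framing for three of the typecodes (0 = bytes, 4 = long, 7 = string, per the format doc linked below).  This is an illustration of the wire format, not Hadoopy's actual reader:

```python
import struct

# Each TypedBytes value is a 1-byte typecode followed by a payload whose
# length is either fixed (long) or carried in a 4-byte big-endian count
# (bytes, string), so a reader can always find the next value boundary.
def tb_encode(obj):
    if isinstance(obj, bytes):
        return struct.pack('>Bi', 0, len(obj)) + obj            # 0: bytes
    if isinstance(obj, int):
        return struct.pack('>Bq', 4, obj)                       # 4: long
    if isinstance(obj, str):
        data = obj.encode('utf-8')
        return struct.pack('>Bi', 7, len(data)) + data          # 7: string
    raise TypeError(obj)

def tb_decode(buf, pos=0):
    code = buf[pos]
    pos += 1
    if code == 0:
        (n,) = struct.unpack_from('>i', buf, pos)
        pos += 4
        return buf[pos:pos + n], pos + n
    if code == 4:
        (v,) = struct.unpack_from('>q', buf, pos)
        return v, pos + 8
    if code == 7:
        (n,) = struct.unpack_from('>i', buf, pos)
        pos += 4
        return buf[pos:pos + n].decode('utf-8'), pos + n
    raise ValueError(code)
```

A concatenated stream like `tb_encode('row1') + tb_encode(7)` parses back unambiguously, which is exactly what a raw concatenation of AbstractType output bytes cannot do.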

One minor issue is that while there is support for Maps/Dictionaries in 
TypedBytes, the result isn't necessarily an OrderedDict.  We have a few options 
here:
1.) Just use a dict.
2.) Since the client will have to know somewhat that it is using Cassandra (to 
provide keyspace, CF name, etc.), we can easily use an OrderedDict instead of a 
dict.
3.) Use a list of tuples to encode the data.
4.) Add a new custom typecode for OrderedDict.
Of these I think #2 is the best client side, and it is simple to do in the 
Hadoop Streaming code (#1 and #2 only differ client side, so unaware clients 
would simply interpret it as a dictionary).

So that we're all on the same page I put together a few links.

AbstractType Resources
CASSANDRA-2530
[AbstractType Class 
Def|http://javasourcecode.org/html/open-source/cassandra/cassandra-0.8.1/org/apache/cassandra/db/marshal/AbstractType.html]
[Pycassa Type Conversion 
Code|https://github.com/pycassa/pycassa/blob/0ce77bbda2917039ef82561d70b4a063c1f66224/pycassa/util.py#L137]
[Python Struct Syntax (to interpret the above 
code)|http://docs.python.org/library/struct.html#format-characters]

TypedBytes resources
[Data 
Format|http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/typedbytes/package-summary.html]
[Hadoopy Cython code to read TypedBytes (uses C 
IO)|https://github.com/bwhite/hadoopy/blob/master/hadoopy/_typedbytes.pyx#L53]


> Patch Hadoop Streaming Source to Support Cassandra IO
> -----------------------------------------------------
>
>                 Key: CASSANDRA-3134
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3134
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Hadoop
>            Reporter: Brandyn White
>            Priority: Minor
>              Labels: hadoop, hadoop_examples_streaming
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> (text is a repost from 
> [CASSANDRA-1497|https://issues.apache.org/jira/browse/CASSANDRA-1497])
> I'm the author of the Hadoopy http://bwhite.github.com/hadoopy/ python 
> library and I'm interested in taking another stab at streaming support. 
> Hadoopy and Dumbo both use the TypedBytes format that is in CDH for 
> communication with the streaming jar. A simple way to get this to work is 
> modify the streaming code (make hadoop-cassandra-streaming.jar) so that it 
> uses the same TypedBytes communication with streaming programs, but the 
> actual job IO is using the Cassandra IO. The user would have the exact same 
> streaming interface, but the user would specify the keyspace, etc using 
> environmental variables.
> The benefits of this are
> 1. Easy implementation: Take the cloudera-patched version of streaming and 
> change the IO, add environmental variable reading.
> 2. Only Client side: As the streaming jar is included in the job, no server 
> side changes are required.
> 3. Simple maintenance: If the Hadoop Cassandra interface changes, then this 
> would require the same simple fixup as any other Hadoop job.
> 4. The TypedBytes format supports all of the necessary Cassandra types 
> (https://issues.apache.org/jira/browse/HADOOP-5450)
> 5. Compatible with existing streaming libraries: Hadoopy and dumbo would only 
> need to know the path of this new streaming jar
> 6. No need for avro
> The negatives of this are
> 1. Duplicative code: This would be a dupe and patch of the streaming jar. 
> This can be stored itself as a patch.
> 2. I'd have to check but this solution should work on a stock hadoop (cluster 
> side) but it requires TypedBytes (client side) which can be included in the 
> jar.
> I can code this up but I wanted to get some feedback from the community first.


