[ 
https://issues.apache.org/jira/browse/CASSANDRA-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13276166#comment-13276166
 ] 

Ilya Maykov commented on CASSANDRA-2527:
----------------------------------------

We wrote a Hadoop InputFormat class that reads SSTable files directly, 
completely bypassing the Cassandra server. This is not hard to do, as the 
SSTable file format is pretty simple. We then exported the snapshot 
directories over NFS to our Hadoop workers and ran the MR job that way. 
Obviously, this is only useful if you want to iterate through all of the data 
in your Cassandra cluster. It also has a lot of overhead: the job reads stale 
versions of data that haven't been compacted away yet, and it reads RF 
replicas of each row. Exposing snapshots in special snapshot keyspaces, so 
they could be mapped using stock Hadoop mappers, may be a better way to go.
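
To illustrate the overhead mentioned above: a job that reads raw SSTables sees 
every uncompacted version of a column, once per replica, so it has to collapse 
them itself. Below is a minimal sketch of that reconciliation step in plain 
Python, not the InputFormat we wrote and not any Hadoop or Cassandra API. The 
function name `reconcile` and the (row_key, column, value, timestamp) tuple 
layout are assumptions made for the example. It keeps the cell with the highest 
timestamp and ignores tombstones and timestamp ties, which a real job would 
need to handle.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical cell layout: (row_key, column_name, value, write_timestamp).
Cell = Tuple[str, str, str, int]


def reconcile(cells: Iterable[Cell]) -> Dict[Tuple[str, str], Tuple[str, int]]:
    """Collapse duplicate cells, keeping the newest version of each column.

    Duplicates come from two sources: stale versions that compaction has not
    removed yet, and the same cell appearing once per replica.
    """
    latest: Dict[Tuple[str, str], Tuple[str, int]] = {}
    for row_key, column, value, timestamp in cells:
        key = (row_key, column)
        current = latest.get(key)
        # Strictly newer timestamps win; on a tie the first cell seen is kept
        # (a simplification made for this sketch).
        if current is None or timestamp > current[1]:
            latest[key] = (value, timestamp)
    return latest


if __name__ == "__main__":
    cells = [
        ("user1", "email", "old@example.com", 100),  # stale version
        ("user1", "email", "new@example.com", 200),  # current version
        ("user1", "email", "new@example.com", 200),  # same cell, other replica
        ("user2", "email", "b@example.com", 150),
    ]
    for (row_key, column), (value, timestamp) in sorted(reconcile(cells).items()):
        print(row_key, column, value, timestamp)
```

In an MR job this logic would typically live in a reducer (or combiner) keyed 
by row, so the map phase still pays the cost of reading every duplicate.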
                
> Add ability to snapshot data as input to hadoop jobs
> ----------------------------------------------------
>
>                 Key: CASSANDRA-2527
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2527
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jeremy Hanna
>              Labels: hadoop
>
> It is desirable to have immutable inputs to hadoop jobs for the duration of 
> the job.  That way re-execution of individual tasks does not alter the output.  
> One way to accomplish this would be to snapshot the data that is used as 
> input to a job.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
