[jira] [Commented] (CASSANDRA-2527) Add ability to snapshot data as input to hadoop jobs
[ https://issues.apache.org/jira/browse/CASSANDRA-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13276166#comment-13276166 ]

Ilya Maykov commented on CASSANDRA-2527:

We wrote a Hadoop InputFormat class that could read SSTable files directly, completely bypassing the Cassandra server. That is not hard to do, as the SSTable file format is pretty simple. Then we exported the snapshot directories over NFS to our Hadoop workers and ran the MR job that way.

Obviously this is only useful if you want to iterate through all of the data in your Cassandra cluster. It also has a lot of overhead: this approach reads through stale versions of data that haven't been compacted away yet, and reads RF replicas of each row ... exposing snapshots in special snapshot keyspaces so they could be mapped using stock Hadoop mappers may be a better way to go.

Add ability to snapshot data as input to hadoop jobs

Key: CASSANDRA-2527
URL: https://issues.apache.org/jira/browse/CASSANDRA-2527
Project: Cassandra
Issue Type: Improvement
Reporter: Jeremy Hanna
Labels: hadoop

It is desirable to have immutable inputs to Hadoop jobs for the duration of the job. That way, re-execution of individual tasks does not alter the output. One way to accomplish this would be to snapshot the data that is used as input to a job.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
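The overhead mentioned in the CASSANDRA-2527 comment above (stale, not-yet-compacted versions plus RF replicas of each row) means a job reading raw SSTables sees the same column several times. A minimal sketch of how such duplicates might be collapsed, assuming last-write-wins by column timestamp and ignoring tombstones entirely; `ReplicaReconciler` and `ColumnVersion` are hypothetical names for illustration, not classes from Cassandra or from the InputFormat described in the comment:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReplicaReconciler {

    // Hypothetical stand-in for one column version as read from one SSTable.
    static final class ColumnVersion {
        final String rowKey;
        final String name;
        final String value;
        final long timestamp;

        ColumnVersion(String rowKey, String name, String value, long timestamp) {
            this.rowKey = rowKey;
            this.name = name;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // Keeps only the newest version of each (rowKey, name) pair.
    // Assumption: the highest timestamp wins; deletions are not modeled.
    static Map<List<String>, ColumnVersion> reconcile(Iterable<ColumnVersion> versions) {
        Map<List<String>, ColumnVersion> latest = new HashMap<>();
        for (ColumnVersion v : versions) {
            List<String> id = Arrays.asList(v.rowKey, v.name);
            ColumnVersion current = latest.get(id);
            if (current == null || v.timestamp > current.timestamp) {
                latest.put(id, v);
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        List<ColumnVersion> seen = Arrays.asList(
            new ColumnVersion("row1", "c1", "stale", 1L),  // old version, not yet compacted
            new ColumnVersion("row1", "c1", "fresh", 5L),  // newest version
            new ColumnVersion("row1", "c1", "fresh", 5L)); // same data from another replica
        Map<List<String>, ColumnVersion> result = reconcile(seen);
        System.out.println(result.get(Arrays.asList("row1", "c1")).value);
    }
}
```

In a real MR job this step would more likely live in a reducer keyed by row, since the duplicate versions are spread across files from different nodes.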
[jira] Commented: (CASSANDRA-1867) sstable2json runs out of memory when trying to export huge rows
[ https://issues.apache.org/jira/browse/CASSANDRA-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972723#action_12972723 ]

Ilya Maykov commented on CASSANDRA-1867:

No problem, I'll see if there is a similar problem in the json2sstable path this weekend. Oh, and it's Maykov, not Mykov :)

sstable2json runs out of memory when trying to export huge rows

Key: CASSANDRA-1867
URL: https://issues.apache.org/jira/browse/CASSANDRA-1867
Project: Cassandra
Issue Type: Improvement
Components: Tools
Reporter: Ilya Maykov
Assignee: Ilya Maykov
Priority: Minor
Fix For: 0.6.9, 0.7.0
Attachments: cassandra-1867-2.txt
Original Estimate: 1h
Remaining Estimate: 1h

Currently, sstable2json can run out of memory if it encounters a huge row. The problem is that it creates an in-memory String for each row. Proposed solution is to pass the output PrintStream to the serializeRow() and serializeColumns() methods and write to the stream incrementally.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Created: (CASSANDRA-1867) sstable2json runs out of memory when trying to export huge rows
sstable2json runs out of memory when trying to export huge rows

Key: CASSANDRA-1867
URL: https://issues.apache.org/jira/browse/CASSANDRA-1867
Project: Cassandra
Issue Type: Improvement
Components: Tools
Affects Versions: 0.6.8
Reporter: Ilya Maykov
Priority: Minor

Currently, sstable2json can run out of memory if it encounters a huge row. The problem is that it creates an in-memory String for each row. Proposed solution is to pass the output PrintStream to the serializeRow() and serializeColumns() methods and write to the stream incrementally.
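The fix proposed in the CASSANDRA-1867 description above can be sketched roughly as follows. This is only an illustration of the idea, not the attached patch: the method names `serializeRow()` and `serializeColumns()` come from the issue text, but their signatures, the `Column` class, and the output layout here are assumptions, and JSON escaping is omitted for brevity.

```java
import java.io.PrintStream;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class StreamingRowExport {

    // Hypothetical stand-in for a column; not Cassandra's actual class.
    static final class Column {
        final String name;
        final String value;
        final long timestamp;

        Column(String name, String value, long timestamp) {
            this.name = name;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // Writes each column to the stream as soon as it is read, so only one
    // column's text is built in memory at a time instead of the whole row.
    static void serializeColumns(Iterator<Column> columns, PrintStream out) {
        out.print("[");
        while (columns.hasNext()) {
            Column c = columns.next();
            out.print("[\"" + c.name + "\", \"" + c.value + "\", " + c.timestamp + "]");
            if (columns.hasNext()) {
                out.print(", ");
            }
        }
        out.print("]");
    }

    // Writes the row key, then streams the columns to the same PrintStream.
    static void serializeRow(String key, Iterator<Column> columns, PrintStream out) {
        out.print("\"" + key + "\": ");
        serializeColumns(columns, out);
    }

    public static void main(String[] args) {
        List<Column> cols = Arrays.asList(
            new Column("c1", "v1", 1L),
            new Column("c2", "v2", 2L));
        serializeRow("row1", cols.iterator(), System.out);
        System.out.println();
    }
}
```

Passing an `Iterator` rather than a fully materialized collection is what keeps memory use independent of row size in this sketch; the in-memory list in `main` is only there to make the example runnable.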
[jira] Updated: (CASSANDRA-1867) sstable2json runs out of memory when trying to export huge rows
[ https://issues.apache.org/jira/browse/CASSANDRA-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ilya Maykov updated CASSANDRA-1867:

Attachment: cassandra-1867.txt

Patch against 0.6 branch attached.

sstable2json runs out of memory when trying to export huge rows

Key: CASSANDRA-1867
URL: https://issues.apache.org/jira/browse/CASSANDRA-1867
Project: Cassandra
Issue Type: Improvement
Components: Tools
Affects Versions: 0.6.8
Reporter: Ilya Maykov
Priority: Minor
Attachments: cassandra-1867.txt
Original Estimate: 1h
Remaining Estimate: 1h
[jira] Commented: (CASSANDRA-1867) sstable2json runs out of memory when trying to export huge rows
[ https://issues.apache.org/jira/browse/CASSANDRA-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12971948#action_12971948 ]

Ilya Maykov commented on CASSANDRA-1867:

Oops, the first patch is inserting an extra newline. Will fix momentarily.

sstable2json runs out of memory when trying to export huge rows

Key: CASSANDRA-1867
URL: https://issues.apache.org/jira/browse/CASSANDRA-1867
Project: Cassandra
Issue Type: Improvement
Components: Tools
Affects Versions: 0.6.8
Reporter: Ilya Maykov
Priority: Minor
Attachments: cassandra-1867.txt
Original Estimate: 1h
Remaining Estimate: 1h
[jira] Updated: (CASSANDRA-1867) sstable2json runs out of memory when trying to export huge rows
[ https://issues.apache.org/jira/browse/CASSANDRA-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ilya Maykov updated CASSANDRA-1867:

Attachment: (was: cassandra-1867.txt)

sstable2json runs out of memory when trying to export huge rows

Key: CASSANDRA-1867
URL: https://issues.apache.org/jira/browse/CASSANDRA-1867
Project: Cassandra
Issue Type: Improvement
Components: Tools
Affects Versions: 0.6.8
Reporter: Ilya Maykov
Priority: Minor
Attachments: cassandra-1867-2.txt
Original Estimate: 1h
Remaining Estimate: 1h
[jira] Updated: (CASSANDRA-1867) sstable2json runs out of memory when trying to export huge rows
[ https://issues.apache.org/jira/browse/CASSANDRA-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ilya Maykov updated CASSANDRA-1867:

Comment: was deleted (was: Fixed patch without the extra newline.)

sstable2json runs out of memory when trying to export huge rows

Key: CASSANDRA-1867
URL: https://issues.apache.org/jira/browse/CASSANDRA-1867
Project: Cassandra
Issue Type: Improvement
Components: Tools
Affects Versions: 0.6.8
Reporter: Ilya Maykov
Priority: Minor
Attachments: cassandra-1867-2.txt
Original Estimate: 1h
Remaining Estimate: 1h
[jira] Updated: (CASSANDRA-1867) sstable2json runs out of memory when trying to export huge rows
[ https://issues.apache.org/jira/browse/CASSANDRA-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ilya Maykov updated CASSANDRA-1867:

Attachment: cassandra-1867-2.txt

Fixed patch without the extra newline.

sstable2json runs out of memory when trying to export huge rows

Key: CASSANDRA-1867
URL: https://issues.apache.org/jira/browse/CASSANDRA-1867
Project: Cassandra
Issue Type: Improvement
Components: Tools
Affects Versions: 0.6.8
Reporter: Ilya Maykov
Priority: Minor
Attachments: cassandra-1867-2.txt
Original Estimate: 1h
Remaining Estimate: 1h