[jira] [Commented] (CASSANDRA-2527) Add ability to snapshot data as input to hadoop jobs

2012-05-15 Thread Ilya Maykov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13276166#comment-13276166
 ] 

Ilya Maykov commented on CASSANDRA-2527:


We wrote a Hadoop InputFormat class that reads SSTable files directly, 
completely bypassing the Cassandra server - not that hard to do, as the SSTable 
file format is pretty simple. Then we exported the snapshot directories over 
NFS to our Hadoop workers and ran the MR job that way. This is obviously only 
useful if you want to iterate through all of the data in your Cassandra 
cluster. It also has a lot of overhead: this approach reads through stale 
versions of data that haven't been compacted away yet, and it reads all RF 
replicas of each row. Exposing snapshots in special snapshot keyspaces so they 
could be mapped using stock Hadoop mappers may be a better way to go.
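
For illustration, a rough skeleton of such an InputFormat on the stock Hadoop 
mapreduce API might look like the sketch below. The class and helper names 
(SSTableInputFormat, SSTableRecordReader, SSTableParser, RawRow) are 
assumptions made for this sketch, not the code described in the comment, and 
the actual SSTable parsing is deliberately elided.

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class SSTableInputFormat extends FileInputFormat<BytesWritable, BytesWritable> {

        // Read each *-Data.db file with a single reader; don't split the file.
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;
        }

        @Override
        public RecordReader<BytesWritable, BytesWritable> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            return new SSTableRecordReader();
        }

        /** Emits one (row key, serialized row body) pair per row in the data file. */
        static class SSTableRecordReader extends RecordReader<BytesWritable, BytesWritable> {
            private Iterator<RawRow> rows;
            private RawRow current;
            private long length;
            private long bytesRead;

            @Override
            public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
                FileSplit fileSplit = (FileSplit) split;
                length = fileSplit.getLength();
                // Assumption: a sequential parser over the on-disk row format.
                rows = SSTableParser.open(fileSplit.getPath(), context.getConfiguration());
            }

            @Override
            public boolean nextKeyValue() {
                if (!rows.hasNext()) {
                    return false;
                }
                current = rows.next();
                bytesRead += current.sizeOnDisk;
                return true;
            }

            @Override public BytesWritable getCurrentKey()   { return new BytesWritable(current.key); }
            @Override public BytesWritable getCurrentValue() { return new BytesWritable(current.body); }
            @Override public float getProgress() { return length == 0 ? 1.0f : (float) bytesRead / length; }
            @Override public void close() { }
        }

        /** Stand-in row record; the parser from the comment above is not public code. */
        static class RawRow {
            byte[] key;
            byte[] body;
            long sizeOnDisk;
        }

        /** Stand-in for the SSTable parsing itself, which this sketch elides. */
        interface SSTableParser {
            static Iterator<RawRow> open(Path path, Configuration conf) throws IOException {
                throw new UnsupportedOperationException("SSTable parsing elided in this sketch");
            }
        }
    }

A job would then add the NFS-mounted snapshot directories as input paths via 
FileInputFormat.addInputPath(), and the mappers would receive raw row keys and 
row bodies to decode.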

 Add ability to snapshot data as input to hadoop jobs
 

 Key: CASSANDRA-2527
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2527
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jeremy Hanna
  Labels: hadoop

 It is desirable to have immutable inputs to Hadoop jobs for the duration of 
 the job.  That way, re-execution of individual tasks does not alter the 
 output.  One way to accomplish this would be to snapshot the data that is 
 used as input to a job.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (CASSANDRA-1867) sstable2json runs out of memory when trying to export huge rows

2010-12-17 Thread Ilya Maykov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12972723#action_12972723
 ] 

Ilya Maykov commented on CASSANDRA-1867:


No problem, I'll see if there is a similar problem in the json2sstable path 
this weekend.

Oh, and it's Maykov, not Mykov :)

 sstable2json runs out of memory when trying to export huge rows
 ---

 Key: CASSANDRA-1867
 URL: https://issues.apache.org/jira/browse/CASSANDRA-1867
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Ilya Maykov
Assignee: Ilya Maykov
Priority: Minor
 Fix For: 0.6.9, 0.7.0

 Attachments: cassandra-1867-2.txt

   Original Estimate: 1h
  Remaining Estimate: 1h

 Currently, sstable2json can run out of memory if it encounters a huge row. 
 The problem is that it creates an in-memory String for each row. Proposed 
 solution is to pass the output PrintStream to the serializeRow() and 
 serializeColumns() methods and write to the stream incrementally.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (CASSANDRA-1867) sstable2json runs out of memory when trying to export huge rows

2010-12-15 Thread Ilya Maykov (JIRA)
sstable2json runs out of memory when trying to export huge rows
---

 Key: CASSANDRA-1867
 URL: https://issues.apache.org/jira/browse/CASSANDRA-1867
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Affects Versions: 0.6.8
Reporter: Ilya Maykov
Priority: Minor


Currently, sstable2json can run out of memory if it encounters a huge row. The 
problem is that it creates an in-memory String for each row. Proposed solution 
is to pass the output PrintStream to the serializeRow() and serializeColumns() 
methods and write to the stream incrementally.
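
For illustration, the streaming pattern described above, writing each row and 
column to the PrintStream as it is produced instead of accumulating a per-row 
String, might look roughly like the sketch below. The names and the JSON-ish 
layout are simplified stand-ins, not the actual SSTableExport code or the 
attached patch.

    import java.io.PrintStream;
    import java.util.Iterator;

    final class StreamingRowWriter {

        /** Minimal stand-in for a column; the real exporter reads these from an SSTable scanner. */
        static final class Column {
            final byte[] name, value;
            Column(byte[] name, byte[] value) { this.name = name; this.value = value; }
        }

        /** Writes one row directly to the stream instead of building a per-row String. */
        static void serializeRow(byte[] rowKey, Iterator<Column> columns, PrintStream out) {
            out.print(quote(hex(rowKey)));
            out.print(": ");
            serializeColumns(columns, out);
            out.println(",");
        }

        /** Emits columns one at a time so memory use stays bounded regardless of row width. */
        static void serializeColumns(Iterator<Column> columns, PrintStream out) {
            out.print("[");
            while (columns.hasNext()) {
                Column c = columns.next();
                out.print("[" + quote(hex(c.name)) + ", " + quote(hex(c.value)) + "]");
                if (columns.hasNext()) out.print(", ");
            }
            out.print("]");
        }

        static String quote(String s) { return "\"" + s + "\""; }

        static String hex(byte[] b) {
            StringBuilder sb = new StringBuilder(b.length * 2);
            for (byte x : b) sb.append(String.format("%02x", x));
            return sb.toString();
        }
    }

The point is simply that nothing row-sized is ever buffered: each column is 
formatted and written to the stream before the next one is read.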

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (CASSANDRA-1867) sstable2json runs out of memory when trying to export huge rows

2010-12-15 Thread Ilya Maykov (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Maykov updated CASSANDRA-1867:
---

Attachment: cassandra-1867.txt

Patch against 0.6 branch attached.

 sstable2json runs out of memory when trying to export huge rows
 ---

 Key: CASSANDRA-1867
 URL: https://issues.apache.org/jira/browse/CASSANDRA-1867
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Affects Versions: 0.6.8
Reporter: Ilya Maykov
Priority: Minor
 Attachments: cassandra-1867.txt

   Original Estimate: 1h
  Remaining Estimate: 1h

 Currently, sstable2json can run out of memory if it encounters a huge row. 
 The problem is that it creates an in-memory String for each row. Proposed 
 solution is to pass the output PrintStream to the serializeRow() and 
 serializeColumns() methods and write to the stream incrementally.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (CASSANDRA-1867) sstable2json runs out of memory when trying to export huge rows

2010-12-15 Thread Ilya Maykov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971948#action_12971948
 ] 

Ilya Maykov commented on CASSANDRA-1867:


Oops, the first patch is inserting an extra newline. Will fix momentarily.

 sstable2json runs out of memory when trying to export huge rows
 ---

 Key: CASSANDRA-1867
 URL: https://issues.apache.org/jira/browse/CASSANDRA-1867
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Affects Versions: 0.6.8
Reporter: Ilya Maykov
Priority: Minor
 Attachments: cassandra-1867.txt

   Original Estimate: 1h
  Remaining Estimate: 1h

 Currently, sstable2json can run out of memory if it encounters a huge row. 
 The problem is that it creates an in-memory String for each row. Proposed 
 solution is to pass the output PrintStream to the serializeRow() and 
 serializeColumns() methods and write to the stream incrementally.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (CASSANDRA-1867) sstable2json runs out of memory when trying to export huge rows

2010-12-15 Thread Ilya Maykov (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Maykov updated CASSANDRA-1867:
---

Attachment: (was: cassandra-1867.txt)

 sstable2json runs out of memory when trying to export huge rows
 ---

 Key: CASSANDRA-1867
 URL: https://issues.apache.org/jira/browse/CASSANDRA-1867
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Affects Versions: 0.6.8
Reporter: Ilya Maykov
Priority: Minor
 Attachments: cassandra-1867-2.txt

   Original Estimate: 1h
  Remaining Estimate: 1h

 Currently, sstable2json can run out of memory if it encounters a huge row. 
 The problem is that it creates an in-memory String for each row. Proposed 
 solution is to pass the output PrintStream to the serializeRow() and 
 serializeColumns() methods and write to the stream incrementally.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (CASSANDRA-1867) sstable2json runs out of memory when trying to export huge rows

2010-12-15 Thread Ilya Maykov (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Maykov updated CASSANDRA-1867:
---

Comment: was deleted

(was: Fixed patch without the extra newline.)

 sstable2json runs out of memory when trying to export huge rows
 ---

 Key: CASSANDRA-1867
 URL: https://issues.apache.org/jira/browse/CASSANDRA-1867
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Affects Versions: 0.6.8
Reporter: Ilya Maykov
Priority: Minor
 Attachments: cassandra-1867-2.txt

   Original Estimate: 1h
  Remaining Estimate: 1h

 Currently, sstable2json can run out of memory if it encounters a huge row. 
 The problem is that it creates an in-memory String for each row. Proposed 
 solution is to pass the output PrintStream to the serializeRow() and 
 serializeColumns() methods and write to the stream incrementally.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (CASSANDRA-1867) sstable2json runs out of memory when trying to export huge rows

2010-12-15 Thread Ilya Maykov (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Maykov updated CASSANDRA-1867:
---

Attachment: cassandra-1867-2.txt

Fixed patch without the extra newline.

 sstable2json runs out of memory when trying to export huge rows
 ---

 Key: CASSANDRA-1867
 URL: https://issues.apache.org/jira/browse/CASSANDRA-1867
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Affects Versions: 0.6.8
Reporter: Ilya Maykov
Priority: Minor
 Attachments: cassandra-1867-2.txt

   Original Estimate: 1h
  Remaining Estimate: 1h

 Currently, sstable2json can run out of memory if it encounters a huge row. 
 The problem is that it creates an in-memory String for each row. Proposed 
 solution is to pass the output PrintStream to the serializeRow() and 
 serializeColumns() methods and write to the stream incrementally.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.