[jira] [Updated] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robbie Strickland updated CASSANDRA-4208: - Attachment: trunk-4208-v3.txt I've attached the new patch (v3) rebased against trunk. ColumnFamilyOutputFormat should support writing to multiple column families --- Key: CASSANDRA-4208 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208 Project: Cassandra Issue Type: Improvement Components: Hadoop Affects Versions: 1.1.0 Reporter: Robbie Strickland Attachments: cassandra-1.1-4208.txt, cassandra-1.1-4208-v2.txt, cassandra-1.1-4208-v3.txt, cassandra-1.1-4208-v4.txt, trunk-4208.txt, trunk-4208-v2.txt, trunk-4208-v3.txt It is not currently possible to output records to more than one column family in a single reducer. Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive. I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robbie Strickland updated CASSANDRA-4208: - Attachment: cassandra-1.1-4208-v4.txt I've attached a new patch that removes the check for a null output CF on BulkOutputFormat. This allows BOF to use the MultipleOutputs API. ColumnFamilyOutputFormat should support writing to multiple column families --- Key: CASSANDRA-4208 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208 Project: Cassandra Issue Type: Improvement Components: Hadoop Affects Versions: 1.1.0 Reporter: Robbie Strickland Attachments: cassandra-1.1-4208.txt, cassandra-1.1-4208-v2.txt, cassandra-1.1-4208-v3.txt, cassandra-1.1-4208-v4.txt, trunk-4208.txt, trunk-4208-v2.txt It is not currently possible to output records to more than one column family in a single reducer. Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive. I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robbie Strickland updated CASSANDRA-4208: - Attachment: cassandra-1.1-4208-v2.txt I've attached a patch that adds a setOutputColumnFamily() overload that takes in both keyspace and CF. The one outstanding issue that I've commented on in CFOF is that checkOutputSpecs() cannot currently ensure that a CF has been specified either through setOutputColumnFamily() or MultipleOutputs. Unfortunately MultipleOutputs.getNamedOutputsList()--which would be the right way to do this--is currently private. So we either don't do the check and let it throw an NPE at runtime, or we duplicate the code in MultipleOutputs to grab the values from config ourselves. Not sure which is the lesser of two evils. ColumnFamilyOutputFormat should support writing to multiple column families --- Key: CASSANDRA-4208 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208 Project: Cassandra Issue Type: Improvement Components: Hadoop Affects Versions: 1.1.0 Reporter: Robbie Strickland Attachments: cassandra-1.1-4208-v2.txt, cassandra-1.1-4208.txt, trunk-4208-v2.txt, trunk-4208.txt It is not currently possible to output records to more than one column family in a single reducer. Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive. I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robbie Strickland updated CASSANDRA-4208: - Attachment: cassandra-1.1-4208.txt It appears I was mistaken about the MultipleOutputs issue being resolved only in trunk. It's resolved in the mapred package in trunk, but the new version in mapreduce dates at least back to 1.0.1. It still references FileOutputFormat, but the attached patch gets around this by using the same config key. I have attached a new patch based against Cassandra 1.1 and Hadoop 1.0.2. Changes are actually minimal. Let me know your thoughts... ColumnFamilyOutputFormat should support writing to multiple column families --- Key: CASSANDRA-4208 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208 Project: Cassandra Issue Type: Improvement Components: Hadoop Affects Versions: 1.1.0 Reporter: Robbie Strickland Attachments: cassandra-1.1-4208.txt, trunk-4208-v2.txt, trunk-4208.txt It is not currently possible to output records to more than one column family in a single reducer. Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive. I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robbie Strickland updated CASSANDRA-4208: - Attachment: trunk-4208-v2.txt I've added a patch to allow support for MultipleOutputs. Hadoop trunk now contains a new version of MultipleOutputs that should support this out of the box, although I am submitting a patch to deal with an inconsistency that could cause future issues with non-file formats. The basic solution involves changing the config key for output CF to match the basename key being written by MultipleOutputs. I had to make related changes to CassandraStorage and TestRingCache, as well as some minor changes to ColumnFamilyInputFormat to account for some interface changes in Hadoop trunk. So the bottom line is this will work if people use Hadoop and Cassandra trunk with both patches applied. The original patch can be used as a temporary solution if needed. ColumnFamilyOutputFormat should support writing to multiple column families --- Key: CASSANDRA-4208 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208 Project: Cassandra Issue Type: Improvement Components: Hadoop Affects Versions: 1.1.0 Reporter: Robbie Strickland Attachments: trunk-4208-v2.txt, trunk-4208.txt It is not currently possible to output records to more than one column family in a single reducer. Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive. I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robbie Strickland updated CASSANDRA-4208: - Attachment: trunk-4208.txt ColumnFamilyOutputFormat should support writing to multiple column families --- Key: CASSANDRA-4208 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208 Project: Cassandra Issue Type: Improvement Components: Hadoop Affects Versions: 1.1.0 Reporter: Robbie Strickland Attachments: trunk-4208.txt It is not currently possible to output records to more than one column family in a single reducer. Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive. I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robbie Strickland updated CASSANDRA-4208: - Comment: was deleted (was: I created an issue regarding the specificity of MultipleOutputs to FileOutputFormat. Linked here as an FYI.) ColumnFamilyOutputFormat should support writing to multiple column families --- Key: CASSANDRA-4208 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208 Project: Cassandra Issue Type: Improvement Components: Hadoop Affects Versions: 1.1.0 Reporter: Robbie Strickland Attachments: trunk-4208.txt It is not currently possible to output records to more than one column family in a single reducer. Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive. I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira