[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489225#comment-13489225 ] Jonathan Ellis commented on CASSANDRA-4208: --- Reverted the BOF change in 78d6f64f33c592890051c690ddf5d26b7b2af027 > ColumnFamilyOutputFormat should support writing to multiple column families > --- > > Key: CASSANDRA-4208 > URL: https://issues.apache.org/jira/browse/CASSANDRA-4208 > Project: Cassandra > Issue Type: Improvement > Components: Hadoop >Affects Versions: 1.1.0 >Reporter: Robbie Strickland >Assignee: Robbie Strickland > Fix For: 1.2.0 > > Attachments: cassandra-1.1-4208.txt, cassandra-1.1-4208-v2.txt, > cassandra-1.1-4208-v3.txt, cassandra-1.1-4208-v4.txt, trunk-4208.txt, > trunk-4208-v2.txt, trunk-4208-v3.txt > > > It is not currently possible to output records to more than one column family > in a single reducer. Considering that writing values to Cassandra often > involves multiple column families (i.e. updating your index when you insert a > new value), this seems overly restrictive. I am submitting a patch that > moves the specification of column family from the job configuration to the > write() call in ColumnFamilyRecordWriter. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488417#comment-13488417 ] Michael Kjellman commented on CASSANDRA-4208: - Another question: why are we targeting 1.0.2 instead of 1.0.3 in build.xml?
[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488000#comment-13488000 ] Michael Kjellman commented on CASSANDRA-4208: - So, are we going to revert commit e05a5fc12648f315002c9939a2a0748d74525589 and recommit it minus the changes in the patch for BOF?
[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481523#comment-13481523 ] Michael Kjellman commented on CASSANDRA-4208: - Robbie - I'm okay with that, but then I'm not sure the BOF patch you provided should stay applied if it doesn't work. I'm still debugging exactly why it doesn't stream, but getting an environment set up to debug the whole process has been difficult. If anything, maybe we should revert the change to BOF, keep the other changes, and then open another BOF bug for multiple output support?
[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481501#comment-13481501 ] Robbie Strickland commented on CASSANDRA-4208: -- [~mkjellman] - I think the BOF support should be in a separate issue, since CFOF and BOF don't depend on each other for the MultipleOutputs functionality, and because this issue specifically addresses CFOF.
[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13479381#comment-13479381 ] Michael Kjellman commented on CASSANDRA-4208: - Jake or Robbie -- have you tested this with BOF? I've confirmed that it appears to stream only one of the two named multiple outputs.
[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13462802#comment-13462802 ] Robbie Strickland commented on CASSANDRA-4208: -- Not a problem. I'll do so when I get back from Strange Loop... :)
[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13462791#comment-13462791 ] T Jake Luciani commented on CASSANDRA-4208: --- Hi Robbie, I'm ready to commit this, but the issue is that we don't want to change Hadoop versions on the stable 1.1 branch. Could you rebase your patch for trunk? 1.2 should be out soon.
[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13460807#comment-13460807 ] Robbie Strickland commented on CASSANDRA-4208: -- You don't need the Hadoop patch to make this work. I think I'm confused as to whether you're having trouble getting this to work at all, or just with BOF. As I mentioned I have not tested this with BOF, but it is working against 1.1.x & Hadoop 1.0.2 using CFOF. Look here for an example that works with CFOF: https://gist.github.com/3763728.
[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13460800#comment-13460800 ] Michael Kjellman commented on CASSANDRA-4208: - I applied the patch to Hadoop 1.0.3 as well. Are you suggesting then that for now this patch assumes those methods are still private?
[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13460797#comment-13460797 ] Robbie Strickland commented on CASSANDRA-4208: -- [~mkjellman] Your usage is correct. What this patch actually does is change ConfigHelper so that set/getColumnFamily() operate on the mapreduce.output.basename key that MultipleOutputs (and FileInput/OutputFormat) uses when it's looking for outputs. This is a bit hacky but unavoidable, since the methods to alter this through the Hadoop API are inaccessible. I have a related ticket on the Hadoop side to change this and make it more generic, but until then this will have to do.
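The idea in the comment above, storing the column family under the key that MultipleOutputs already reads, can be sketched roughly as follows. This is a minimal illustration and not the patch itself: a plain Map stands in for the Hadoop job configuration, and the class and method names (OutputConfigSketch, setOutputColumnFamily, getOutputColumnFamily) are hypothetical. Only the key name mapreduce.output.basename comes from the comment.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: a Map stands in for the Hadoop job configuration.
final class OutputConfigSketch {
    // Key named in the comment; MultipleOutputs is described as reading it.
    static final String OUTPUT_BASENAME_KEY = "mapreduce.output.basename";

    // Hypothetical setter: record the column family under the basename key.
    static void setOutputColumnFamily(Map<String, String> conf, String columnFamily) {
        conf.put(OUTPUT_BASENAME_KEY, columnFamily);
    }

    // Hypothetical getter: fail with a clear message if nothing was set.
    static String getOutputColumnFamily(Map<String, String> conf) {
        String columnFamily = conf.get(OUTPUT_BASENAME_KEY);
        if (columnFamily == null) {
            throw new IllegalStateException("output column family is not set");
        }
        return columnFamily;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        setOutputColumnFamily(conf, "Users");
        System.out.println(getOutputColumnFamily(conf));
    }
}
```

Because the value lives under a key that the named-output machinery overwrites per output, each named output would see its own column family, which appears to be the behaviour the patch relies on.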
[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13460798#comment-13460798 ] Michael Kjellman commented on CASSANDRA-4208: - I had already done what your patch contains. Only one SSTable gets created. Have you tested that patch? Am I missing something obvious with the job config requirements?
[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13460770#comment-13460770 ] Michael Kjellman commented on CASSANDRA-4208: - Both ColumnFamilyOutputFormat and BulkOutputFormat. addNamedOutput never seems to set the column family. I would assume:

    ConfigHelper.setOutputKeyspace(job.getConfiguration(), KEYSPACE);
    MultipleOutputs.addNamedOutput(job, OUTPUT_COLUMN_FAMILY1, ColumnFamilyOutputFormat.class, ByteBuffer.class, List.class);
    MultipleOutputs.addNamedOutput(job, OUTPUT_COLUMN_FAMILY2, ColumnFamilyOutputFormat.class, ByteBuffer.class, List.class);

is all that is needed. If I don't set up the job with job.setOutputFormatClass(ColumnFamilyOutputFormat.class), FileOutputFormat throws an exception:

    Exception in thread "main" org.apache.hadoop.mapred.InvalidJobConfException: Output directory not set.
        at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:127)

If I do specify that at the job level, the job name never seems to set the column family name on that job. Additionally, using the job name as the column family name is slightly inconvenient, as we use '_' in our column family names, which is not a valid character in MultipleOutputs; it looks like _# is the way they internally keep track of counters if that is enabled.
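The naming restriction mentioned in the comment above can be checked before the job is submitted. The sketch below assumes that named outputs may contain only ASCII letters and digits; the comment itself only confirms that '_' is rejected, so treat the exact rule as an assumption. The class and method names are hypothetical and no Hadoop classes are used.

```java
// Sketch only: a pre-flight check for names passed to addNamedOutput.
final class NamedOutputNameSketch {
    // Assumed rule: non-empty, ASCII letters and digits only.
    static boolean isValidNamedOutput(String name) {
        if (name == null || name.isEmpty()) {
            return false;
        }
        for (char c : name.toCharArray()) {
            boolean letter = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
            boolean digit = c >= '0' && c <= '9';
            if (!letter && !digit) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A column family name with an underscore fails the check.
        System.out.println(isValidNamedOutput("UserIndex"));
        System.out.println(isValidNamedOutput("user_index"));
    }
}
```

Running such a check at job-setup time would surface the underscore problem as a clear message in the driver rather than as an exception from inside the output machinery.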
[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13460763#comment-13460763 ] Robbie Strickland commented on CASSANDRA-4208: -- You mean BulkOutputFormat isn't working, or MO isn't working at all?
[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13460745#comment-13460745 ] Michael Kjellman commented on CASSANDRA-4208: - So I've been working on this for a few days. As far as I can tell, this is not working with 1.1.5 and 1.0.3. I've gone through the history with svn blame, and it doesn't look like anything exciting has really changed in the mapreduce code. Robbie, have you tested this on the current GA versions?
[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458247#comment-13458247 ] Michael Kjellman commented on CASSANDRA-4208: - Yes, we have it working as well, but so far we have been unsuccessful in getting it to work with BulkOutputFormat. I'm going to work on debugging that today.
[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457343#comment-13457343 ] Robbie Strickland commented on CASSANDRA-4208: -- The attached patch works and we have it running in production. I'm not sure why I haven't received any response since May on whether this will be included in some future release. I presume everyone is busy on other features.
[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457310#comment-13457310 ] Michael Kjellman commented on CASSANDRA-4208: - Any additional updates on this? Robbie -- what direction did you decide to pursue?
[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13411459#comment-13411459 ] Robbie Strickland commented on CASSANDRA-4208: -- I'd like to know if this is going to be included or if another direction is preferred. Any update?
[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13276874#comment-13276874 ] Robbie Strickland commented on CASSANDRA-4208: -- Any word on this?
[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13272370#comment-13272370 ] T Jake Luciani commented on CASSANDRA-4208: --- Well, there is always http://tutorials.jenkov.com/java-reflection/private-fields-and-methods.html#methods We use something like this in FBUtilities for accessing protected fields. I don't know how much of a worry an NPE should be; you could just add a log message if the column family isn't set, so people can see it before the NPE and realize they did something wrong.
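For illustration, the reflection approach from the linked tutorial might look roughly like the sketch below. `Target` and `hiddenKey` are hypothetical stand-ins invented for this example; they are not members of MultipleOutputs or FBUtilities. Only the `java.lang.reflect` calls are real JDK API.

```java
import java.lang.reflect.Method;

public class PrivateAccessSketch {
    /** Hypothetical class with a private static helper we cannot call directly. */
    public static class Target {
        private static String hiddenKey(String name) {
            return "sketch.namedOutput." + name; // made-up key format
        }
    }

    /** Looks up the private method by name and parameter types, then invokes it. */
    public static String callHiddenKey(String name) {
        try {
            Method m = Target.class.getDeclaredMethod("hiddenKey", String.class);
            m.setAccessible(true);               // bypass the private modifier
            return (String) m.invoke(null, name); // null receiver: the method is static
        } catch (ReflectiveOperationException e) {
            // Fails at runtime if the method is renamed or its signature changes.
            throw new IllegalStateException("reflection lookup failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(callHiddenKey("users"));
    }
}
```

The trade-off is that the lookup is checked only at runtime, so a rename in the target class surfaces as an exception rather than a compile error.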
[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271761#comment-13271761 ] T Jake Luciani commented on CASSANDRA-4208: --- I'm ok with this now that it works with MultipleOutputs (nice find), though I'm not sure if it should be in 1.1 since it would break existing scripts. Would you be able to make it backwards compatible by adding the old constructor back and using the setColumnFamily() in there?
[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271747#comment-13271747 ] Robbie Strickland commented on CASSANDRA-4208: -- Any word on whether this solution is getting the thumbs up? I personally need this functionality and would like to proceed in a manner that will ultimately be accepted by the community.
[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13267524#comment-13267524 ] Robbie Strickland commented on CASSANDRA-4208: -- @Jonathan: Yes, that is the patch, although the Hadoop patch is not required as long as you have the latest in trunk. The Hadoop patch just moves the call to set the base name out of FileOutputFormat and into OutputFormat--as a matter of principle and to avoid potential future issues. @Jake: Yes, it is different. I examined prior branches to see where the changes were made, and it's only in trunk--which is why I didn't see it until checking out trunk to make the changes. It probably makes sense to do a patch against Hadoop 1.0.2 and Cassandra 1.1 so people can use a release version. This is definitely doable without significant effort.
[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13267507#comment-13267507 ] T Jake Luciani commented on CASSANDRA-4208: --- @Robbie: is the version in Hadoop trunk different from the version included in MAPREDUCE-3607?
[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13267498#comment-13267498 ] Jonathan Ellis commented on CASSANDRA-4208: --- "I am submitting a patch to deal with an inconsistency that could cause future issues with non-file formats" On MAPREDUCE-4216 or elsewhere?
[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266690#comment-13266690 ] T Jake Luciani commented on CASSANDRA-4208: --- @Robbie: can you post your code analysis on the Hadoop ticket?
[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266685#comment-13266685 ] T Jake Luciani commented on CASSANDRA-4208: --- I would think the Hadoop community would go for it since they already do so much to decouple MR from HDFS. Let's ping them and see what they think, otherwise we could go with the less portable solution.
[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266626#comment-13266626 ] Robbie Strickland commented on CASSANDRA-4208: -- I spent a good bit of time analyzing the changes needed to make this work using MultipleOutputs, and it would involve:
1. Removing hard-coded references to WritableComparable and Writable in MultipleOutputs.getNamedOutputKeyClass() and getNamedOutputValueClass().
2. Removing the hard-coded call to FileOutputFormat.setOutputName() in getRecordWriter().
3. Adding an abstract setOutputName() to OutputFormat so the call in #2 can be made generic. An alternative is a default no-op implementation so it doesn't break existing output formats that don't care about this.
4. Implementing setOutputName() in ColumnFamilyOutputFormat, which would set the config property for the CF (where the "name" corresponds to the CF).
5. Separating CFOF.setColumnFamily() and setKeyspace(), where setColumnFamily() is just a pass-through to setOutputName() (or vice versa).
This solution would allow MultipleOutputs support in conformance with the existing API, and it should not break any existing reducer code. I don't personally love the boilerplate it adds to my reducer, and I think it's much less obvious than handling it at the write() call, but I can get over that if I have to. :) I am willing to do the work on both sides if this is where the consensus is, though I don't know what the response will be in the Hadoop community. Thoughts?
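A minimal sketch of steps 3 to 5 from the analysis above, using stand-in classes rather than the real Hadoop or Cassandra types. `SketchOutputFormat`, `LegacyFormat`, `ColumnFamilySketchFormat`, `CF_KEY` and `configFor` are all hypothetical names made up for this illustration, and a plain `Map` stands in for the job configuration. It shows the no-op default variant, not the abstract-method variant.

```java
import java.util.HashMap;
import java.util.Map;

public class OutputNameSketch {
    /** Stand-in for OutputFormat with the proposed hook; default is a no-op (step 3). */
    public abstract static class SketchOutputFormat {
        public void setOutputName(Map<String, String> conf, String name) {
            // Formats that do not care about named outputs inherit this and ignore the name.
        }
    }

    /** An existing format that is unaware of the hook and keeps working unchanged. */
    public static class LegacyFormat extends SketchOutputFormat {
    }

    /** Stand-in for a column-family format: the output name becomes the CF (step 4). */
    public static class ColumnFamilySketchFormat extends SketchOutputFormat {
        public static final String CF_KEY = "sketch.output.columnfamily"; // made-up config key

        @Override
        public void setOutputName(Map<String, String> conf, String name) {
            conf.put(CF_KEY, name);
        }

        /** Pass-through so existing callers can keep saying "column family" (step 5). */
        public void setColumnFamily(Map<String, String> conf, String columnFamily) {
            setOutputName(conf, columnFamily);
        }
    }

    /** What a MultipleOutputs-like helper would do: call the generic hook, not a file-specific one. */
    public static Map<String, String> configFor(SketchOutputFormat format, String namedOutput) {
        Map<String, String> conf = new HashMap<>();
        format.setOutputName(conf, namedOutput);
        return conf;
    }

    public static void main(String[] args) {
        System.out.println(configFor(new ColumnFamilySketchFormat(), "users_index"));
        System.out.println(configFor(new LegacyFormat(), "users_index"));
    }
}
```

With the no-op default, a format that never overrides the hook ends up with an untouched configuration, which is the "doesn't break existing output formats" property mentioned in step 3.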
[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266082#comment-13266082 ] T Jake Luciani commented on CASSANDRA-4208: --- My bad. I didn't notice the linked issue.
[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266032#comment-13266032 ] Robbie Strickland commented on CASSANDRA-4208: -- @Jake: MultipleOutputs is the class we've been referring to in the above posts, and it was around pre-1.0. Did you mean to refer to something else?
[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266018#comment-13266018 ] T Jake Luciani commented on CASSANDRA-4208: --- Could this be accomplished using http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html? It was recently added to Hadoop 1.0.2.
[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265989#comment-13265989 ] Robbie Strickland commented on CASSANDRA-4208: -- Looking a bit closer at the MultipleOutputs class, it seems pretty tied to FileOutputFormat. So if we go this route we're probably looking at a separate CassandraMultipleOutputs with little re-use from MultipleOutputs. We could re-use the config keys, but we'd have to duplicate the strings since they're private. Am I missing something that makes this more straightforward?
[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265963#comment-13265963 ] Robbie Strickland commented on CASSANDRA-4208: -- We could use MultipleOutputs if you think that's better, though the implementation is certainly less trivial than what I've done here. Upside is of course sticking with the convention. I'm not really sure it gets us any more than that, and personally I think it adds unnecessary complexity to an already convoluted API. Passing in a CF at the call level is more intuitive and will be more familiar to Cassandra users, IMHO. But I'm happy to work on the MultipleOutputs version if that's the consensus.
[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265931#comment-13265931 ] Jonathan Ellis commented on CASSANDRA-4208: --- Are you familiar with the Hadoop MultipleOutputs api? Seems like that's the "right" way to do this.
[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265914#comment-13265914 ] Robbie Strickland commented on CASSANDRA-4208: -- I should note it would be easy to make this work with previous releases if desired. I think that was your real question... :)
[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
[ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265911#comment-13265911 ] Robbie Strickland commented on CASSANDRA-4208: -- There is an API change: when you call context.write(), the key is now a Pair instead of just a ByteBuffer. I also changed ConfigHelper.setOutputColumnFamily() to setOutputKeyspace() and removed the CF-related checks and config keys. It broke my existing reducers, but it's an easy fix and adds tremendous value IMHO.
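To make the shape of that API change concrete, here is a rough, self-contained sketch. It assumes the Pair carries the column family name together with the row key, which the comment above does not spell out. `CfKey` and `RoutingWriter` are hypothetical stand-ins written for this example; they are not the classes from the attached patch and do not talk to Cassandra.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PairKeySketch {
    /** Hypothetical key type: column family name plus row key (assumed Pair contents). */
    public static final class CfKey {
        final String columnFamily;
        final ByteBuffer rowKey;

        public CfKey(String columnFamily, ByteBuffer rowKey) {
            this.columnFamily = columnFamily;
            this.rowKey = rowKey;
        }
    }

    /** Hypothetical in-memory writer that groups rows by the column family named in the key. */
    public static final class RoutingWriter {
        private final Map<String, List<String>> rowsByCf = new HashMap<>();

        public void write(CfKey key, String value) {
            // duplicate() so decoding does not move the caller's buffer position
            String row = StandardCharsets.UTF_8.decode(key.rowKey.duplicate()).toString();
            rowsByCf.computeIfAbsent(key.columnFamily, cf -> new ArrayList<>()).add(row + "=" + value);
        }

        public List<String> rows(String columnFamily) {
            return rowsByCf.getOrDefault(columnFamily, Collections.<String>emptyList());
        }
    }

    public static ByteBuffer bytes(String s) {
        return ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        RoutingWriter writer = new RoutingWriter();
        // One "reducer" writing a value and its index entry to two column families.
        writer.write(new CfKey("users", bytes("u1")), "alice");
        writer.write(new CfKey("users_by_name", bytes("alice")), "u1");
        System.out.println(writer.rows("users"));
        System.out.println(writer.rows("users_by_name"));
    }
}
```

Because the target column family travels with each write, a single writer can serve the index-update case from the issue description without any per-job CF setting.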