
[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

2012-11-01 Thread Jonathan Ellis (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489225#comment-13489225
 ] 

Jonathan Ellis commented on CASSANDRA-4208:
---

Reverted the BOF change in 78d6f64f33c592890051c690ddf5d26b7b2af027

> ColumnFamilyOutputFormat should support writing to multiple column families
> ---
>
> Key: CASSANDRA-4208
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Hadoop
>Affects Versions: 1.1.0
>Reporter: Robbie Strickland
>Assignee: Robbie Strickland
> Fix For: 1.2.0
>
> Attachments: cassandra-1.1-4208.txt, cassandra-1.1-4208-v2.txt, 
> cassandra-1.1-4208-v3.txt, cassandra-1.1-4208-v4.txt, trunk-4208.txt, 
> trunk-4208-v2.txt, trunk-4208-v3.txt
>
>
> It is not currently possible to output records to more than one column family 
> in a single reducer.  Considering that writing values to Cassandra often 
> involves multiple column families (i.e. updating your index when you insert a 
> new value), this seems overly restrictive.  I am submitting a patch that 
> moves the specification of column family from the job configuration to the 
> write() call in ColumnFamilyRecordWriter.
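
A minimal sketch of the idea in the description, where the target column family is chosen per write() call rather than once per job. The class below is a hypothetical in-memory stand-in, not Cassandra's ColumnFamilyRecordWriter; its name, method signatures, and the "Users"/"UsersByName" column families are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a record writer; NOT Cassandra's ColumnFamilyRecordWriter.
class PerWriteFamilySketch {
    // Buffered rows, grouped by the column family named in each write() call.
    private final Map<String, List<String>> rowsByFamily = new LinkedHashMap<>();

    // The column family is an argument of write(), not a job-wide setting.
    public void write(String columnFamily, String rowKey, String value) {
        rowsByFamily.computeIfAbsent(columnFamily, cf -> new ArrayList<>())
                    .add(rowKey + "=" + value);
    }

    // Returns the rows buffered for one column family (empty if none were written).
    public List<String> rowsFor(String columnFamily) {
        return rowsByFamily.getOrDefault(columnFamily, new ArrayList<>());
    }

    public static void main(String[] args) {
        PerWriteFamilySketch writer = new PerWriteFamilySketch();
        // One reducer-like caller updates a value and its index in the same pass.
        writer.write("Users", "u1", "alice");
        writer.write("UsersByName", "alice", "u1");
        System.out.println(writer.rowsFor("Users"));
        System.out.println(writer.rowsFor("UsersByName"));
    }
}
```

The point of the sketch is only the shape of the call: a single writer instance can serve a data column family and its index without a second job.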

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

2012-10-31 Thread Michael Kjellman (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488417#comment-13488417
 ] 

Michael Kjellman commented on CASSANDRA-4208:
-

Another question: why are we targeting Hadoop 1.0.2 instead of 1.0.3 in build.xml?


[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

2012-10-31 Thread Michael Kjellman (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488000#comment-13488000
 ] 

Michael Kjellman commented on CASSANDRA-4208:
-

So are we going to revert commit e05a5fc12648f315002c9939a2a0748d74525589 and 
recommit without the BOF changes from the patch?


[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

2012-10-22 Thread Michael Kjellman (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481523#comment-13481523
 ] 

Michael Kjellman commented on CASSANDRA-4208:
-

Robbie - I'm okay with that, but then I'm not sure we should keep the BOF patch 
you provided applied if it doesn't work. I'm still debugging exactly why it 
doesn't stream, but getting an environment set up to debug the whole process 
has been difficult.

If anything, maybe we should revert the change to BOF, keep the other changes, 
and then open another BOF bug for multiple output support?


[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

2012-10-22 Thread Robbie Strickland (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481501#comment-13481501
 ] 

Robbie Strickland commented on CASSANDRA-4208:
--

[~mkjellman] - I think the BOF support should be in a separate issue, since 
CFOF and BOF don't depend on each other for the MultipleOutputs 
functionality--and because this issue specifically addresses CFOF.


[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

2012-10-18 Thread Michael Kjellman (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13479381#comment-13479381
 ] 

Michael Kjellman commented on CASSANDRA-4208:
-

Jake or Robbie -- have you tested this with BOF? From what I can confirm, it 
only streams one of the two named multiple outputs.


[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

2012-09-25 Thread Robbie Strickland (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13462802#comment-13462802
 ] 

Robbie Strickland commented on CASSANDRA-4208:
--

Not a problem.  I'll do so when I get back from Strange Loop... :)


[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

2012-09-25 Thread T Jake Luciani (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13462791#comment-13462791
 ] 

T Jake Luciani commented on CASSANDRA-4208:
---

Hi Robbie, I'm ready to commit this, but the issue is that we don't want to 
change Hadoop versions on the stable 1.1 branch.

Could you rebase your patch for trunk?  1.2 should be out soon.




[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

2012-09-21 Thread Robbie Strickland (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13460807#comment-13460807
 ] 

Robbie Strickland commented on CASSANDRA-4208:
--

You don't need the Hadoop patch to make this work.  I think I'm confused as to 
whether you're having trouble getting this to work at all, or just with BOF.  
As I mentioned I have not tested this with BOF, but it is working against 1.1.x 
& Hadoop 1.0.2 using CFOF.  Look here for an example that works with CFOF: 
https://gist.github.com/3763728.


[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

2012-09-21 Thread Michael Kjellman (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13460800#comment-13460800
 ] 

Michael Kjellman commented on CASSANDRA-4208:
-

I applied the patch to Hadoop 1.0.3 as well. Are you suggesting then that for 
now this patch assumes those methods are still private?


[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

2012-09-21 Thread Robbie Strickland (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13460797#comment-13460797
 ] 

Robbie Strickland commented on CASSANDRA-4208:
--

[~mkjellman] your usage is correct.  What this patch does is actually change 
the ConfigHelper so set/getColumnFamily() operates on the 
mapreduce.output.basename key that MultipleOutputs (and FileInput/OutputFormat) 
uses when it's looking for outputs.  This is a bit hacky but unavoidable since 
methods to alter this through the Hadoop API are inaccessible.  I have a 
related ticket on the Hadoop side to change this and make it more generic, but 
until then this will have to do. 
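
A minimal sketch of the approach described above, assuming a plain Map as a stand-in for Hadoop's Configuration. The class and method names below are hypothetical and are not the actual ConfigHelper API; only the mapreduce.output.basename key comes from the comment itself.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for ConfigHelper; a Map replaces Hadoop's Configuration here.
class BasenameKeySketch {
    // The key named in the comment above; the column family is stored under it.
    static final String OUTPUT_BASENAME_KEY = "mapreduce.output.basename";

    // Stores the column family under the basename key instead of a Cassandra-specific key.
    static void setColumnFamily(Map<String, String> conf, String columnFamily) {
        conf.put(OUTPUT_BASENAME_KEY, columnFamily);
    }

    // Reads the column family back from the same key, so whatever sets the
    // basename for an output also determines the column family.
    static String getColumnFamily(Map<String, String> conf) {
        return conf.get(OUTPUT_BASENAME_KEY);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        setColumnFamily(conf, "Users");
        System.out.println(getColumnFamily(conf));
    }
}
```

Because both sides agree on one key, no extra Hadoop API is needed, which is the "hacky but unavoidable" trade-off mentioned in the comment.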


[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

2012-09-21 Thread Michael Kjellman (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13460798#comment-13460798
 ] 

Michael Kjellman commented on CASSANDRA-4208:
-

I had already done what your patch contains. Only one SSTable gets created. 
Have you tested that patch? Am I missing something obvious with the job config 
requirements?


[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

2012-09-21 Thread Michael Kjellman (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13460770#comment-13460770
 ] 

Michael Kjellman commented on CASSANDRA-4208:
-

Both ColumnFamilyOutputFormat and BulkOutputFormat. addNamedOutput() never 
seems to set the column family.

I would assume:

ConfigHelper.setOutputKeyspace(job.getConfiguration(), KEYSPACE);
MultipleOutputs.addNamedOutput(job, OUTPUT_COLUMN_FAMILY1,
    ColumnFamilyOutputFormat.class, ByteBuffer.class, List.class);
MultipleOutputs.addNamedOutput(job, OUTPUT_COLUMN_FAMILY2,
    ColumnFamilyOutputFormat.class, ByteBuffer.class, List.class);

is all that is needed. If I don't set up the job with 
job.setOutputFormatClass(ColumnFamilyOutputFormat.class), FileOutputFormat 
throws an exception:

Exception in thread "main" org.apache.hadoop.mapred.InvalidJobConfException: Output directory not set.
    at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:127)

If I do specify that at the job level, the job name never seems to set the 
column family name on that job.

Additionally, using the job name as the column family name is slightly 
inconvenient: we use '_' in our column family names, which is not a valid 
character in MultipleOutputs. It looks like _# is how they internally keep 
track of counters if that is enabled.
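
A small sketch of a pre-flight check for that naming restriction, assuming MultipleOutputs accepts only ASCII letters and digits in a named output (which is consistent with the '_' rejection described above). This is a standalone, hypothetical helper, not Hadoop code, and the sample names are made up.

```java
// Hypothetical helper; it mirrors the assumed letters-and-digits rule, it is not Hadoop's code.
class NamedOutputCheckSketch {
    // Returns true when the name is non-empty and made only of ASCII letters and digits.
    static boolean isAlphanumeric(String name) {
        if (name == null || name.isEmpty()) {
            return false;
        }
        for (char ch : name.toCharArray()) {
            boolean letter = (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z');
            boolean digit = ch >= '0' && ch <= '9';
            if (!letter && !digit) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A column family name with '_' would be flagged before the job is configured.
        System.out.println(isAlphanumeric("UserIndex"));
        System.out.println(isAlphanumeric("user_index"));
    }
}
```

Running such a check at job-setup time would surface an unusable column family name early instead of at addNamedOutput().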


[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

2012-09-21 Thread Robbie Strickland (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13460763#comment-13460763
 ] 

Robbie Strickland commented on CASSANDRA-4208:
--

You mean BulkOutputFormat isn't working, or MO isn't working at all?


[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

2012-09-21 Thread Michael Kjellman (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13460745#comment-13460745
 ] 

Michael Kjellman commented on CASSANDRA-4208:
-

So I've been working on this for a few days. As far as I can tell, this is not 
working with Cassandra 1.1.5 and Hadoop 1.0.3. I've gone through svn blame and 
it doesn't look like anything significant has changed in the mapreduce code. 
Robbie, have you tested this on the current GA versions?


[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

2012-09-18 Thread Michael Kjellman (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458247#comment-13458247
 ] 

Michael Kjellman commented on CASSANDRA-4208:
-

Yes, we have it working as well, but so far we have been unsuccessful in 
getting it to work with BulkOutputFormat. I'm going to work on debugging that 
today.


[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

2012-09-17 Thread Robbie Strickland (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457343#comment-13457343
 ] 

Robbie Strickland commented on CASSANDRA-4208:
--

The attached patch works and we have it running in production.  I'm not sure 
why I haven't received any response since May on whether this will be included 
in some future release.  I presume everyone is busy on other features.


[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

2012-09-17 Thread Michael Kjellman (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457310#comment-13457310
 ] 

Michael Kjellman commented on CASSANDRA-4208:
-

Any additional updates on this? Robbie -- what direction did you decide to 
pursue?


[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

2012-07-11 Thread Robbie Strickland (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13411459#comment-13411459
 ] 

Robbie Strickland commented on CASSANDRA-4208:
--

I'd like to know if this is going to be included or if another direction is 
preferred.  Any update?

> ColumnFamilyOutputFormat should support writing to multiple column families
> ---
>
> Key: CASSANDRA-4208
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Hadoop
>Affects Versions: 1.1.0
>Reporter: Robbie Strickland
> Attachments: cassandra-1.1-4208-v2.txt, cassandra-1.1-4208-v3.txt, 
> cassandra-1.1-4208.txt, trunk-4208-v2.txt, trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family 
> in a single reducer.  Considering that writing values to Cassandra often 
> involves multiple column families (i.e. updating your index when you insert a 
> new value), this seems overly restrictive.  I am submitting a patch that 
> moves the specification of column family from the job configuration to the 
> write() call in ColumnFamilyRecordWriter.

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

2012-05-16 Thread Robbie Strickland (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13276874#comment-13276874
 ] 

Robbie Strickland commented on CASSANDRA-4208:
--

Any word on this?

> ColumnFamilyOutputFormat should support writing to multiple column families
> ---
>
> Key: CASSANDRA-4208
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Hadoop
>Affects Versions: 1.1.0
>Reporter: Robbie Strickland
> Attachments: cassandra-1.1-4208-v2.txt, cassandra-1.1-4208-v3.txt, 
> cassandra-1.1-4208.txt, trunk-4208-v2.txt, trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family 
> in a single reducer.  Considering that writing values to Cassandra often 
> involves multiple column families (i.e. updating your index when you insert a 
> new value), this seems overly restrictive.  I am submitting a patch that 
> moves the specification of column family from the job configuration to the 
> write() call in ColumnFamilyRecordWriter.

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

2012-05-10 Thread T Jake Luciani (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13272370#comment-13272370
 ] 

T Jake Luciani commented on CASSANDRA-4208:
---

Well, there is always 
http://tutorials.jenkov.com/java-reflection/private-fields-and-methods.html#methods

We use something like this in FBUtilities for accessing protected fields.  

I'm not sure how much of a worry an NPE should be. You could just log a message 
when the column family isn't set, so people see it before the NPE and realize 
they did something wrong.
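
For readers unfamiliar with the two ideas above, here is a minimal sketch in plain Java. It is not the FBUtilities code and does not touch any Hadoop or Cassandra class; SketchTarget, ReflectionSketch, callPrivate() and warnIfMissing() are hypothetical names made up for illustration. It only shows the standard java.lang.reflect calls for invoking a private method, plus a log message emitted when a column family is missing.

```java
import java.lang.reflect.Method;
import java.util.logging.Logger;

// Hypothetical class with a private method we want to reach from outside.
final class SketchTarget {
    private String greet(String who) {
        return "hello " + who;
    }
}

final class ReflectionSketch {
    private static final Logger LOG =
            Logger.getLogger(ReflectionSketch.class.getName());

    // Looks up a private method taking one String argument, makes it
    // accessible, and invokes it on the given target object.
    static Object callPrivate(Object target, String methodName, String arg) {
        try {
            Method m = target.getClass().getDeclaredMethod(methodName, String.class);
            m.setAccessible(true); // lifts the 'private' access check for this Method
            return m.invoke(target, arg);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("reflective call failed: " + methodName, e);
        }
    }

    // Logs a hint when no column family is set, so the cause is visible
    // in the logs before any later NullPointerException. Returns true when set.
    static boolean warnIfMissing(String columnFamily) {
        if (columnFamily == null) {
            LOG.warning("Output column family is not set; writes are likely to fail");
            return false;
        }
        return true;
    }
}

class ReflectionSketchDemo {
    public static void main(String[] args) {
        System.out.println(ReflectionSketch.callPrivate(new SketchTarget(), "greet", "world"));
        ReflectionSketch.warnIfMissing(null);
    }
}
```

Reflection of this kind trades compile-time safety for reach, so a renamed method only fails at run time; that is one reason to treat it as a fallback.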

> ColumnFamilyOutputFormat should support writing to multiple column families
> ---
>
> Key: CASSANDRA-4208
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Hadoop
>Affects Versions: 1.1.0
>Reporter: Robbie Strickland
> Attachments: cassandra-1.1-4208-v2.txt, cassandra-1.1-4208.txt, 
> trunk-4208-v2.txt, trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family 
> in a single reducer.  Considering that writing values to Cassandra often 
> involves multiple column families (i.e. updating your index when you insert a 
> new value), this seems overly restrictive.  I am submitting a patch that 
> moves the specification of column family from the job configuration to the 
> write() call in ColumnFamilyRecordWriter.

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

2012-05-09 Thread T Jake Luciani (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271761#comment-13271761
 ] 

T Jake Luciani commented on CASSANDRA-4208:
---

I'm OK with this now that it works with MultipleOutputs (nice find), though I'm 
not sure it should go into 1.1, since it would break existing scripts.  Would 
you be able to make it backwards compatible by adding the old constructor back 
and calling setColumnFamily() from it?
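
A rough sketch of what "adding the old constructor back" could look like, in plain Java. The thread does not say which class the constructor belongs to, so SketchRecordWriter and its members are hypothetical and only illustrate the delegation pattern: the old signature stays available and simply forwards to the setter.

```java
// Hypothetical writer class; not the actual Cassandra class.
final class SketchRecordWriter {
    // May stay null until setColumnFamily() is called.
    private String columnFamily;

    // Newer style: no column family up front, it is chosen later.
    SketchRecordWriter() {
    }

    // Older style, kept so existing callers still compile:
    // it delegates to the setter instead of duplicating logic.
    SketchRecordWriter(String columnFamily) {
        this();
        setColumnFamily(columnFamily);
    }

    void setColumnFamily(String columnFamily) {
        this.columnFamily = columnFamily;
    }

    String getColumnFamily() {
        return columnFamily;
    }
}

class SketchRecordWriterDemo {
    public static void main(String[] args) {
        SketchRecordWriter oldStyle = new SketchRecordWriter("Users");
        SketchRecordWriter newStyle = new SketchRecordWriter();
        newStyle.setColumnFamily("UsersByEmail");
        System.out.println(oldStyle.getColumnFamily());
        System.out.println(newStyle.getColumnFamily());
    }
}
```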



> ColumnFamilyOutputFormat should support writing to multiple column families
> ---
>
> Key: CASSANDRA-4208
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Hadoop
>Affects Versions: 1.1.0
>Reporter: Robbie Strickland
> Attachments: cassandra-1.1-4208.txt, trunk-4208-v2.txt, trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family 
> in a single reducer.  Considering that writing values to Cassandra often 
> involves multiple column families (i.e. updating your index when you insert a 
> new value), this seems overly restrictive.  I am submitting a patch that 
> moves the specification of column family from the job configuration to the 
> write() call in ColumnFamilyRecordWriter.

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

2012-05-09 Thread Robbie Strickland (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271747#comment-13271747
 ] 

Robbie Strickland commented on CASSANDRA-4208:
--

Any word on whether this solution is getting the thumbs up?  I personally need 
this functionality and would like to proceed in a manner that will ultimately 
be accepted by the community.

> ColumnFamilyOutputFormat should support writing to multiple column families
> ---
>
> Key: CASSANDRA-4208
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Hadoop
>Affects Versions: 1.1.0
>Reporter: Robbie Strickland
> Attachments: cassandra-1.1-4208.txt, trunk-4208-v2.txt, trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family 
> in a single reducer.  Considering that writing values to Cassandra often 
> involves multiple column families (i.e. updating your index when you insert a 
> new value), this seems overly restrictive.  I am submitting a patch that 
> moves the specification of column family from the job configuration to the 
> write() call in ColumnFamilyRecordWriter.

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

2012-05-03 Thread Robbie Strickland (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13267524#comment-13267524
 ] 

Robbie Strickland commented on CASSANDRA-4208:
--

@Jonathan: Yes that is the patch, although the Hadoop patch is not required as 
long as you have the latest in trunk.  The Hadoop patch just moves the call to 
set the base name out of FileOutputFormat and into OutputFormat--as a matter of 
principle and to avoid potential future issues.

@Jake: Yes it is different. I examined prior branches to see where the changes 
were made, and it's only in trunk--which is why I didn't see it until checking 
out trunk to make the changes.  

It probably makes sense to do a patch against Hadoop 1.0.2 and Cassandra 1.1 so 
people can use a release version.  This is definitely doable without 
significant effort.

> ColumnFamilyOutputFormat should support writing to multiple column families
> ---
>
> Key: CASSANDRA-4208
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Hadoop
>Affects Versions: 1.1.0
>Reporter: Robbie Strickland
> Attachments: trunk-4208-v2.txt, trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family 
> in a single reducer.  Considering that writing values to Cassandra often 
> involves multiple column families (i.e. updating your index when you insert a 
> new value), this seems overly restrictive.  I am submitting a patch that 
> moves the specification of column family from the job configuration to the 
> write() call in ColumnFamilyRecordWriter.

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

2012-05-03 Thread T Jake Luciani (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13267507#comment-13267507
 ] 

T Jake Luciani commented on CASSANDRA-4208:
---

@Robbie is the version in hadoop trunk different than the version included in 
MAPREDUCE-3607?

> ColumnFamilyOutputFormat should support writing to multiple column families
> ---
>
> Key: CASSANDRA-4208
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Hadoop
>Affects Versions: 1.1.0
>Reporter: Robbie Strickland
> Attachments: trunk-4208-v2.txt, trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family 
> in a single reducer.  Considering that writing values to Cassandra often 
> involves multiple column families (i.e. updating your index when you insert a 
> new value), this seems overly restrictive.  I am submitting a patch that 
> moves the specification of column family from the job configuration to the 
> write() call in ColumnFamilyRecordWriter.

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

2012-05-03 Thread Jonathan Ellis (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13267498#comment-13267498
 ] 

Jonathan Ellis commented on CASSANDRA-4208:
---

bq. I am submitting a patch to deal with an inconsistency that could cause 
future issues with non-file formats

On MAPREDUCE-4216 or elsewhere?

> ColumnFamilyOutputFormat should support writing to multiple column families
> ---
>
> Key: CASSANDRA-4208
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Hadoop
>Affects Versions: 1.1.0
>Reporter: Robbie Strickland
> Attachments: trunk-4208-v2.txt, trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family 
> in a single reducer.  Considering that writing values to Cassandra often 
> involves multiple column families (i.e. updating your index when you insert a 
> new value), this seems overly restrictive.  I am submitting a patch that 
> moves the specification of column family from the job configuration to the 
> write() call in ColumnFamilyRecordWriter.

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

2012-05-02 Thread T Jake Luciani (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266690#comment-13266690
 ] 

T Jake Luciani commented on CASSANDRA-4208:
---

@Robbie can you post your code analysis on the hadoop ticket?

> ColumnFamilyOutputFormat should support writing to multiple column families
> ---
>
> Key: CASSANDRA-4208
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Hadoop
>Affects Versions: 1.1.0
>Reporter: Robbie Strickland
> Attachments: trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family 
> in a single reducer.  Considering that writing values to Cassandra often 
> involves multiple column families (i.e. updating your index when you insert a 
> new value), this seems overly restrictive.  I am submitting a patch that 
> moves the specification of column family from the job configuration to the 
> write() call in ColumnFamilyRecordWriter.

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

2012-05-02 Thread T Jake Luciani (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266685#comment-13266685
 ] 

T Jake Luciani commented on CASSANDRA-4208:
---

I would think the Hadoop community would go for it since they already do so 
much to decouple MR from HDFS.

Let's ping them and see what they think, otherwise we could go with the less 
portable solution.

> ColumnFamilyOutputFormat should support writing to multiple column families
> ---
>
> Key: CASSANDRA-4208
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Hadoop
>Affects Versions: 1.1.0
>Reporter: Robbie Strickland
> Attachments: trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family 
> in a single reducer.  Considering that writing values to Cassandra often 
> involves multiple column families (i.e. updating your index when you insert a 
> new value), this seems overly restrictive.  I am submitting a patch that 
> moves the specification of column family from the job configuration to the 
> write() call in ColumnFamilyRecordWriter.

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

2012-05-02 Thread Robbie Strickland (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266626#comment-13266626
 ] 

Robbie Strickland commented on CASSANDRA-4208:
--

I spent a good bit of time analyzing the changes needed to make this work using 
MultipleOutputs, and it would involve:

1. Removing hard-coded references to WritableComparable and Writable in 
MultipleOutputs.getNamedOutputKeyClass() and getNamedOutputValueClass().
2. Removing the hard-coded call to FileOutputFormat.setOutputName() in 
getRecordWriter().
3. Adding an abstract setOutputName() to OutputFormat so the call in #2 can be 
made generic. An alternative is a default no-op implementation so it doesn't 
break existing output formats that don't care about this.
4. Implementing setOutputName() in ColumnFamilyOutputFormat, which would set 
the config property for the CF (where the "name" corresponds to CF).
5. Separating CFOF.setColumnFamily() and setKeyspace(), where setColumnFamily() 
is just a pass-through to setOutputName() (or vice versa).

This solution would allow MultipleOutputs support in conformance with the 
existing API, and it should not break any existing reducer code.  I don't 
personally love the boilerplate it adds to my reducer, and I think it's much 
less obvious than handling it at the write() call, but I can get over that if I 
have to. :)  I am willing to do the work on both sides if this is where the 
consensus is, though I don't know what the response will be in the Hadoop 
community.

Thoughts?
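
Steps 3 to 5 above can be sketched roughly as follows. This is a self-contained illustration in plain Java, not the Hadoop or Cassandra API: SketchConf, SketchOutputFormat, SketchColumnFamilyOutputFormat and the two config key strings are hypothetical, and the methods are instance methods here purely for brevity. It uses the "default no-op" alternative from step 3.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for a job configuration; not the Hadoop Configuration class.
final class SketchConf {
    private final Map<String, String> props = new HashMap<>();

    void set(String key, String value) {
        props.put(key, value);
    }

    String get(String key) {
        return props.get(key);
    }
}

// Step 3: the base format gets setOutputName() with a no-op default,
// so formats that do not care about named outputs keep working unchanged.
abstract class SketchOutputFormat {
    void setOutputName(SketchConf conf, String name) {
        // intentionally empty
    }
}

// A format that ignores output names and inherits the no-op.
final class SketchPlainOutputFormat extends SketchOutputFormat {
}

// Steps 4 and 5: the column family format maps the output "name" to the
// column family property, and keeps keyspace as a separate setting.
final class SketchColumnFamilyOutputFormat extends SketchOutputFormat {
    static final String KEYSPACE_KEY = "sketch.output.keyspace";          // hypothetical key
    static final String COLUMN_FAMILY_KEY = "sketch.output.columnfamily"; // hypothetical key

    @Override
    void setOutputName(SketchConf conf, String name) {
        conf.set(COLUMN_FAMILY_KEY, name);
    }

    void setKeyspace(SketchConf conf, String keyspace) {
        conf.set(KEYSPACE_KEY, keyspace);
    }

    // Pass-through, as described in step 5.
    void setColumnFamily(SketchConf conf, String columnFamily) {
        setOutputName(conf, columnFamily);
    }
}

class SketchOutputFormatDemo {
    public static void main(String[] args) {
        SketchConf conf = new SketchConf();
        SketchColumnFamilyOutputFormat format = new SketchColumnFamilyOutputFormat();
        format.setKeyspace(conf, "MyKeyspace");
        format.setColumnFamily(conf, "Users");
        System.out.println(conf.get(SketchColumnFamilyOutputFormat.COLUMN_FAMILY_KEY));
    }
}
```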

> ColumnFamilyOutputFormat should support writing to multiple column families
> ---
>
> Key: CASSANDRA-4208
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Hadoop
>Affects Versions: 1.1.0
>Reporter: Robbie Strickland
> Attachments: trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family 
> in a single reducer.  Considering that writing values to Cassandra often 
> involves multiple column families (i.e. updating your index when you insert a 
> new value), this seems overly restrictive.  I am submitting a patch that 
> moves the specification of column family from the job configuration to the 
> write() call in ColumnFamilyRecordWriter.

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

2012-05-01 Thread T Jake Luciani (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266082#comment-13266082
 ] 

T Jake Luciani commented on CASSANDRA-4208:
---

My bad. I didn't notice the linked issue.

> ColumnFamilyOutputFormat should support writing to multiple column families
> ---
>
> Key: CASSANDRA-4208
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Hadoop
>Affects Versions: 1.1.0
>Reporter: Robbie Strickland
> Attachments: trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family 
> in a single reducer.  Considering that writing values to Cassandra often 
> involves multiple column families (i.e. updating your index when you insert a 
> new value), this seems overly restrictive.  I am submitting a patch that 
> moves the specification of column family from the job configuration to the 
> write() call in ColumnFamilyRecordWriter.

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

2012-05-01 Thread Robbie Strickland (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266032#comment-13266032
 ] 

Robbie Strickland commented on CASSANDRA-4208:
--

@Jake: MultipleOutputs is the class we've been referring to in the above posts, 
and it was around pre-1.0. Did you mean to refer to something else?

> ColumnFamilyOutputFormat should support writing to multiple column families
> ---
>
> Key: CASSANDRA-4208
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Hadoop
>Affects Versions: 1.1.0
>Reporter: Robbie Strickland
> Attachments: trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family 
> in a single reducer.  Considering that writing values to Cassandra often 
> involves multiple column families (i.e. updating your index when you insert a 
> new value), this seems overly restrictive.  I am submitting a patch that 
> moves the specification of column family from the job configuration to the 
> write() call in ColumnFamilyRecordWriter.

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

2012-05-01 Thread T Jake Luciani (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266018#comment-13266018
 ] 

T Jake Luciani commented on CASSANDRA-4208:
---

Could this be accomplished using 
http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html?

It was recently added to Hadoop 1.0.2

> ColumnFamilyOutputFormat should support writing to multiple column families
> ---
>
> Key: CASSANDRA-4208
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Hadoop
>Affects Versions: 1.1.0
>Reporter: Robbie Strickland
> Attachments: trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family 
> in a single reducer.  Considering that writing values to Cassandra often 
> involves multiple column families (i.e. updating your index when you insert a 
> new value), this seems overly restrictive.  I am submitting a patch that 
> moves the specification of column family from the job configuration to the 
> write() call in ColumnFamilyRecordWriter.

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

2012-05-01 Thread Robbie Strickland (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265989#comment-13265989
 ] 

Robbie Strickland commented on CASSANDRA-4208:
--

Looking a bit closer at the MultipleOutputs class, it seems pretty tied to 
FileOutputFormat. So if we go this route we're probably looking at a separate 
CassandraMultipleOutputs with little re-use from MultipleOutputs. We could 
re-use the config keys, but we'd have to duplicate the strings since they're 
private. Am I missing something that makes this more straightforward?

> ColumnFamilyOutputFormat should support writing to multiple column families
> ---
>
> Key: CASSANDRA-4208
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Hadoop
>Affects Versions: 1.1.0
>Reporter: Robbie Strickland
> Attachments: trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family 
> in a single reducer.  Considering that writing values to Cassandra often 
> involves multiple column families (i.e. updating your index when you insert a 
> new value), this seems overly restrictive.  I am submitting a patch that 
> moves the specification of column family from the job configuration to the 
> write() call in ColumnFamilyRecordWriter.

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

2012-05-01 Thread Robbie Strickland (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265963#comment-13265963
 ] 

Robbie Strickland commented on CASSANDRA-4208:
--

We could use MultipleOutputs if you think that's better, though the 
implementation is certainly less trivial than what I've done here. Upside is of 
course sticking with the convention. I'm not really sure it gets us any more 
than that, and personally I think it adds unnecessary complexity to an already 
convoluted API. Passing in a CF at the call level is more intuitive and will be 
more familiar to Cassandra users, IMHO. But I'm happy to work on the 
MultipleOutputs version if that's the consensus.

> ColumnFamilyOutputFormat should support writing to multiple column families
> ---
>
> Key: CASSANDRA-4208
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Hadoop
>Affects Versions: 1.1.0
>Reporter: Robbie Strickland
> Attachments: trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family 
> in a single reducer.  Considering that writing values to Cassandra often 
> involves multiple column families (i.e. updating your index when you insert a 
> new value), this seems overly restrictive.  I am submitting a patch that 
> moves the specification of column family from the job configuration to the 
> write() call in ColumnFamilyRecordWriter.

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

2012-05-01 Thread Jonathan Ellis (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265931#comment-13265931
 ] 

Jonathan Ellis commented on CASSANDRA-4208:
---

Are you familiar with the Hadoop MultipleOutputs api?  Seems like that's the 
"right" way to do this.

> ColumnFamilyOutputFormat should support writing to multiple column families
> ---
>
> Key: CASSANDRA-4208
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Hadoop
>Affects Versions: 1.1.0
>Reporter: Robbie Strickland
> Attachments: trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family 
> in a single reducer.  Considering that writing values to Cassandra often 
> involves multiple column families (i.e. updating your index when you insert a 
> new value), this seems overly restrictive.  I am submitting a patch that 
> moves the specification of column family from the job configuration to the 
> write() call in ColumnFamilyRecordWriter.

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

2012-05-01 Thread Robbie Strickland (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265914#comment-13265914
 ] 

Robbie Strickland commented on CASSANDRA-4208:
--

I should note it would be easy to make this work with previous releases if 
desired.  I think that was your real question... :)

> ColumnFamilyOutputFormat should support writing to multiple column families
> ---
>
> Key: CASSANDRA-4208
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Hadoop
>Affects Versions: 1.1.0
>Reporter: Robbie Strickland
> Attachments: trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family 
> in a single reducer.  Considering that writing values to Cassandra often 
> involves multiple column families (i.e. updating your index when you insert a 
> new value), this seems overly restrictive.  I am submitting a patch that 
> moves the specification of column family from the job configuration to the 
> write() call in ColumnFamilyRecordWriter.

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

2012-05-01 Thread Robbie Strickland (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265911#comment-13265911
 ] 

Robbie Strickland commented on CASSANDRA-4208:
--

There is an API change, so when you do a context.write(), the signature now 
takes in a Pair instead of just a ByteBuffer.  I also 
changed ConfigHelper.setOutputColumnFamily() to setOutputKeyspace() and removed 
CF-related checks and config keys.  It broke my existing reducers, but it's 
also an easy fix and adds tremendous value IMHO.
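
To make the shape of that change concrete, here is a minimal sketch in plain Java. It does not use the Hadoop or Cassandra classes, and it assumes the pair carries a column family name alongside the row key, which the comment above does not spell out. CfKey, RoutingWriter and the column family names are hypothetical stand-ins.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical key type: pairs the target column family with the row key.
final class CfKey {
    final String columnFamily;
    final ByteBuffer rowKey;

    CfKey(String columnFamily, ByteBuffer rowKey) {
        this.columnFamily = columnFamily;
        this.rowKey = rowKey;
    }
}

// Hypothetical writer: the column family comes from each write() call
// rather than from a single job-wide configuration value.
final class RoutingWriter {
    private final Map<String, List<ByteBuffer>> rowsByCf = new LinkedHashMap<>();

    void write(CfKey key) {
        if (key.columnFamily == null) {
            throw new IllegalArgumentException("column family must be set on every write");
        }
        // Group row keys under the column family named in the key.
        rowsByCf.computeIfAbsent(key.columnFamily, cf -> new ArrayList<>()).add(key.rowKey);
    }

    List<ByteBuffer> rowsFor(String columnFamily) {
        return rowsByCf.getOrDefault(columnFamily, Collections.emptyList());
    }
}

class RoutingWriterDemo {
    public static void main(String[] args) {
        RoutingWriter writer = new RoutingWriter();
        ByteBuffer rowKey = ByteBuffer.wrap("user42".getBytes(StandardCharsets.UTF_8));
        // One reducer call can now target both the data and the index column family.
        writer.write(new CfKey("Users", rowKey));
        writer.write(new CfKey("UsersByEmail", rowKey));
        System.out.println(writer.rowsFor("Users").size());
        System.out.println(writer.rowsFor("UsersByEmail").size());
    }
}
```

The point of the design is that an existing reducer only has to wrap its key; the routing decision moves from job setup to each individual write.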

> ColumnFamilyOutputFormat should support writing to multiple column families
> ---
>
> Key: CASSANDRA-4208
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Hadoop
>Affects Versions: 1.1.0
>Reporter: Robbie Strickland
> Attachments: trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family 
> in a single reducer.  Considering that writing values to Cassandra often 
> involves multiple column families (i.e. updating your index when you insert a 
> new value), this seems overly restrictive.  I am submitting a patch that 
> moves the specification of column family from the job configuration to the 
> write() call in ColumnFamilyRecordWriter.
