[jira] [Commented] (CASSANDRA-4912) BulkOutputFormat should support Hadoop MultipleOutput

2012-11-15 Thread Michael Kjellman (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13498315#comment-13498315
 ] 

Michael Kjellman commented on CASSANDRA-4912:
-

[~brandon.williams] did everything compile okay for you?

 BulkOutputFormat should support Hadoop MultipleOutput
 -

 Key: CASSANDRA-4912
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4912
 Project: Cassandra
  Issue Type: New Feature
  Components: Hadoop
Affects Versions: 1.2.0 beta 1, 1.2.0 beta 2
Reporter: Michael Kjellman
 Attachments: 4912.txt, App.java, loaddata.pl, pom.xml


 Much like CASSANDRA-4208 BOF should support outputting to Multiple Column 
 Families. The current approach taken in the patch for COF results in only one 
 stream being sent and an exception being thrown when Hadoop is run in local 
 mode due to the call to ConfigHelper when a new BulkRecordWriter is created.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (CASSANDRA-4912) BulkOutputFormat should support Hadoop MultipleOutput

2012-11-13 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13496379#comment-13496379
 ] 

Brandon Williams commented on CASSANDRA-4912:
-

Do you have an Example.java that contains all the imports?

 BulkOutputFormat should support Hadoop MultipleOutput
 -

 Key: CASSANDRA-4912
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4912
 Project: Cassandra
  Issue Type: New Feature
  Components: Hadoop
Affects Versions: 1.2.0 beta 1, 1.2.0 beta 2
Reporter: Michael Kjellman
 Attachments: 4912.txt, Example.java


 Much like CASSANDRA-4208 BOF should support outputting to Multiple Column 
 Families. The current approach taken in the patch for COF results in only 
 one stream being sent.



[jira] [Commented] (CASSANDRA-4912) BulkOutputFormat should support Hadoop MultipleOutput

2012-11-13 Thread Michael Kjellman (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13496382#comment-13496382
 ] 

Michael Kjellman commented on CASSANDRA-4912:
-

Updated example with imports.

 BulkOutputFormat should support Hadoop MultipleOutput
 -

 Key: CASSANDRA-4912
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4912
 Project: Cassandra
  Issue Type: New Feature
  Components: Hadoop
Affects Versions: 1.2.0 beta 1, 1.2.0 beta 2
Reporter: Michael Kjellman
 Attachments: 4912.txt, Example.java


 Much like CASSANDRA-4208 BOF should support outputting to Multiple Column 
 Families. The current approach taken in the patch for COF results in only 
 one stream being sent.



[jira] [Commented] (CASSANDRA-4912) BulkOutputFormat should support Hadoop MultipleOutput

2012-11-13 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13496392#comment-13496392
 ] 

Brandon Williams commented on CASSANDRA-4912:
-

I still get a slew of errors trying to compile this. An obvious one is in 
ReducerToCassandra.reduce where 'val' is never defined, but there are many 
others.

 BulkOutputFormat should support Hadoop MultipleOutput
 -

 Key: CASSANDRA-4912
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4912
 Project: Cassandra
  Issue Type: New Feature
  Components: Hadoop
Affects Versions: 1.2.0 beta 1, 1.2.0 beta 2
Reporter: Michael Kjellman
 Attachments: 4912.txt, Example.java


 Much like CASSANDRA-4208 BOF should support outputting to Multiple Column 
 Families. The current approach taken in the patch for COF results in only 
 one stream being sent.



[jira] [Commented] (CASSANDRA-4912) BulkOutputFormat should support Hadoop MultipleOutput

2012-11-13 Thread Michael Kjellman (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13496397#comment-13496397
 ] 

Michael Kjellman commented on CASSANDRA-4912:
-

Yeah, sorry, it wasn't originally intended as a functional example. I'll create
one that does something now.

 BulkOutputFormat should support Hadoop MultipleOutput
 -

 Key: CASSANDRA-4912
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4912
 Project: Cassandra
  Issue Type: New Feature
  Components: Hadoop
Affects Versions: 1.2.0 beta 1, 1.2.0 beta 2
Reporter: Michael Kjellman
 Attachments: 4912.txt, Example.java


 Much like CASSANDRA-4208 BOF should support outputting to Multiple Column 
 Families. The current approach taken in the patch for COF results in only 
 one stream being sent.



[jira] [Commented] (CASSANDRA-4912) BulkOutputFormat should support Hadoop MultipleOutput

2012-11-13 Thread Michael Kjellman (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13496589#comment-13496589
 ] 

Michael Kjellman commented on CASSANDRA-4912:
-

Okay, attached a script to load data (really simple but wanted you to see what 
kind of data I was using to test that the Example job runs) and App.java which 
will allow you to output to multiple column families with BOF.

 BulkOutputFormat should support Hadoop MultipleOutput
 -

 Key: CASSANDRA-4912
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4912
 Project: Cassandra
  Issue Type: New Feature
  Components: Hadoop
Affects Versions: 1.2.0 beta 1, 1.2.0 beta 2
Reporter: Michael Kjellman
 Attachments: 4912.txt, App.java, loaddata.pl


 Much like CASSANDRA-4208 BOF should support outputting to Multiple Column 
 Families. The current approach taken in the patch for COF results in only 
 one stream being sent.



[jira] [Commented] (CASSANDRA-4912) BulkOutputFormat should support Hadoop MultipleOutput

2012-11-12 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13495591#comment-13495591
 ] 

Brandon Williams commented on CASSANDRA-4912:
-

There is no particular reason that I recall, it was just a convenient place at 
the time.

 BulkOutputFormat should support Hadoop MultipleOutput
 -

 Key: CASSANDRA-4912
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4912
 Project: Cassandra
  Issue Type: New Feature
  Components: Hadoop
Affects Versions: 1.2.0 beta 1, 1.2.0 beta 2
Reporter: Michael Kjellman
 Attachments: 4912.txt, Example.java


 Much like CASSANDRA-4208 BOF should support outputting to Multiple Column 
 Families. The current approach taken in the patch for COF results in only 
 one stream being sent.



[jira] [Commented] (CASSANDRA-4912) BulkOutputFormat should support Hadoop MultipleOutput

2012-11-09 Thread Michael Kjellman (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13494471#comment-13494471
 ] 

Michael Kjellman commented on CASSANDRA-4912:
-

[~brandon.williams] If I patch BulkOutputFormat.java in a similar manner to
CASSANDRA-4208 (line 40), the initial check of the config passes, but the job
fails when the reducer is created. I'm still not sure why the behavior is
different.

 BulkOutputFormat should support Hadoop MultipleOutput
 -

 Key: CASSANDRA-4912
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4912
 Project: Cassandra
  Issue Type: New Feature
  Components: Hadoop
Affects Versions: 1.2.0 beta 1
Reporter: Michael Kjellman
 Attachments: Example.java


 Much like CASSANDRA-4208 BOF should support outputting to Multiple Column 
 Families. The current approach taken in the patch for COF results in only 
 one stream being sent.



[jira] [Commented] (CASSANDRA-4912) BulkOutputFormat should support Hadoop MultipleOutput

2012-11-09 Thread Michael Kjellman (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13494500#comment-13494500
 ] 

Michael Kjellman commented on CASSANDRA-4912:
-

looks like OUTPUT_COLUMNFAMILY_CONFIG never gets set in ConfigHelper when a
new BulkRecordWriter is created. It is difficult to figure out exactly what
code should be setting mapreduce.output.basename in the job config, and where.
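
The lookup described here can be modelled without Hadoop or Cassandra on the
classpath. The sketch below is only an illustration: a plain Map stands in
for Hadoop's Configuration, the class name is hypothetical, and it assumes
(as this comment suggests) that OUTPUT_COLUMNFAMILY_CONFIG maps to the key
mapreduce.output.basename.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-alone model; a Map stands in for Hadoop's Configuration (assumption).
final class OutputColumnFamilyLookup {
    // Assumed mapping between the constant and the Hadoop key named in the thread.
    static final String OUTPUT_COLUMNFAMILY_CONFIG = "mapreduce.output.basename";

    // Returns the configured column family, or throws when the key was never set.
    static String getOutputColumnFamily(Map<String, String> conf) {
        String columnFamily = conf.get(OUTPUT_COLUMNFAMILY_CONFIG);
        if (columnFamily == null) {
            throw new UnsupportedOperationException(
                "output column family is not set: " + OUTPUT_COLUMNFAMILY_CONFIG + " is null");
        }
        return columnFamily;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        try {
            getOutputColumnFamily(conf);   // nothing has set the key yet
        } catch (UnsupportedOperationException e) {
            System.out.println("unset: " + e.getMessage());
        }
        conf.put(OUTPUT_COLUMNFAMILY_CONFIG, "cf1");
        System.out.println("set: " + getOutputColumnFamily(conf));
    }
}
```

In this model the exception depends only on whether something wrote the key
before the lookup ran, which is the ordering question raised in the comment.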

 BulkOutputFormat should support Hadoop MultipleOutput
 -

 Key: CASSANDRA-4912
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4912
 Project: Cassandra
  Issue Type: New Feature
  Components: Hadoop
Affects Versions: 1.2.0 beta 1
Reporter: Michael Kjellman
 Attachments: Example.java


 Much like CASSANDRA-4208 BOF should support outputting to Multiple Column 
 Families. The current approach taken in the patch for COF results in only 
 one stream being sent.



[jira] [Commented] (CASSANDRA-4912) BulkOutputFormat should support Hadoop MultipleOutput

2012-11-09 Thread Michael Kjellman (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13494513#comment-13494513
 ] 

Michael Kjellman commented on CASSANDRA-4912:
-

I think another difference in behavior between CFOF and BOF is that when a
new BulkRecordWriter(Configuration conf) is created, it creates the directory
for the sstables. It calls ConfigHelper here to get the name of the column
family so it can create the directory. In CFOF, the only call to
getOutputColumnFamily is in RangeClient.

Normally, without MultipleOutputs the job config would include a 
setOutputColumnFamily(). I don't understand what calls setOutputColumnFamily 
when you add a new named MultipleOutput. I presume this is where the problem is.
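
One possible reading of the question above, sketched as a stand-alone model.
It assumes that MultipleOutputs stores the named output's name under
mapreduce.output.basename in a per-output copy of the configuration shortly
before it asks the output format for a record writer; that behaviour is an
assumption here, not something the thread confirms. The class names and the
Map used in place of Hadoop's Configuration are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class NamedOutputConfigModel {
    static final String BASE_OUTPUT_NAME = "mapreduce.output.basename";

    // Assumed behaviour: each named output gets its own copy of the job
    // configuration, with the output's name stored under BASE_OUTPUT_NAME.
    static Map<String, String> configFor(Map<String, String> jobConf, String namedOutput) {
        Map<String, String> copy = new HashMap<>(jobConf);
        copy.put(BASE_OUTPUT_NAME, namedOutput);
        return copy;
    }

    // Hypothetical writer that resolves its column family in the constructor,
    // the way the comment above describes BulkRecordWriter.
    static final class EagerWriter {
        final String columnFamily;

        EagerWriter(Map<String, String> conf) {
            columnFamily = conf.get(BASE_OUTPUT_NAME);
            if (columnFamily == null) {
                throw new UnsupportedOperationException("output column family is not set");
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> jobConf = new HashMap<>();
        // A writer built from the per-output copy sees the name...
        System.out.println(new EagerWriter(configFor(jobConf, "cf1")).columnFamily);
        // ...while the job-level configuration itself is left untouched.
        System.out.println(jobConf.containsKey(BASE_OUTPUT_NAME));
    }
}
```

Under that assumption, a writer constructed from the job-level configuration
would fail in its constructor, while one constructed from the per-output copy
would not.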

 BulkOutputFormat should support Hadoop MultipleOutput
 -

 Key: CASSANDRA-4912
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4912
 Project: Cassandra
  Issue Type: New Feature
  Components: Hadoop
Affects Versions: 1.2.0 beta 1
Reporter: Michael Kjellman
 Attachments: Example.java


 Much like CASSANDRA-4208 BOF should support outputting to Multiple Column 
 Families. The current approach taken in the patch for COF results in only 
 one stream being sent.



[jira] [Commented] (CASSANDRA-4912) BulkOutputFormat should support Hadoop MultipleOutput

2012-11-09 Thread Michael Kjellman (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13494542#comment-13494542
 ] 

Michael Kjellman commented on CASSANDRA-4912:
-

okay so it looks like setting outputdir in the creation of the object is 
causing the problem. I moved setting outputdir into prepareWriter() and it 
looks like both sstables are created and streamed.

[~brandon.williams] any reason the outputdir is created when the 
BulkRecordWriter object is created?
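
The change described above, deferring the output directory setup from the
constructor to prepareWriter(), can be sketched in isolation. This is a
minimal model, not the patch itself: LazyOutputDirWriter, its fields, and the
Map standing in for Hadoop's Configuration are all hypothetical, and the real
BulkRecordWriter does considerably more.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;

final class LazyOutputDirWriter {
    static final String BASE_OUTPUT_NAME = "mapreduce.output.basename";

    private final Map<String, String> conf;   // shared reference, read lazily
    private final File baseDir;
    private File outputDir;                   // stays null until the first write

    // The constructor no longer looks up the column family or touches the disk.
    LazyOutputDirWriter(Map<String, String> conf, File baseDir) {
        this.conf = conf;
        this.baseDir = baseDir;
    }

    // Resolves the column family and creates its directory on first use.
    private void prepareWriter() {
        if (outputDir != null) {
            return;
        }
        String columnFamily = conf.get(BASE_OUTPUT_NAME);
        if (columnFamily == null) {
            throw new UnsupportedOperationException("output column family is not set");
        }
        File dir = new File(baseDir, columnFamily);
        if (!dir.isDirectory() && !dir.mkdirs()) {
            throw new IllegalStateException("could not create " + dir);
        }
        outputDir = dir;
    }

    void write(String row) {
        prepareWriter();
        // a real writer would append the row to an sstable here
    }

    File getOutputDir() {
        return outputDir;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        File base = new File(System.getProperty("java.io.tmpdir"),
                             "lazy-writer-" + System.nanoTime());
        LazyOutputDirWriter writer = new LazyOutputDirWriter(conf, base); // no lookup yet
        conf.put(BASE_OUTPUT_NAME, "cf1");   // the name arrives after construction
        writer.write("row1");                // directory is created here
        System.out.println(writer.getOutputDir());
    }
}
```

The point of the design is ordering: the writer can be constructed before the
column family name is known, as long as the name is in place by the first
write.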

 BulkOutputFormat should support Hadoop MultipleOutput
 -

 Key: CASSANDRA-4912
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4912
 Project: Cassandra
  Issue Type: New Feature
  Components: Hadoop
Affects Versions: 1.2.0 beta 1
Reporter: Michael Kjellman
 Attachments: Example.java


 Much like CASSANDRA-4208 BOF should support outputting to Multiple Column 
 Families. The current approach taken in the patch for COF results in only 
 one stream being sent.



[jira] [Commented] (CASSANDRA-4912) BulkOutputFormat should support Hadoop MultipleOutput

2012-11-08 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13493129#comment-13493129
 ] 

Brandon Williams commented on CASSANDRA-4912:
-

Hmm, normally I find the opposite: local mode works, and then everything breaks 
in distributed mode :)  Can you post everything needed to test?

 BulkOutputFormat should support Hadoop MultipleOutput
 -

 Key: CASSANDRA-4912
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4912
 Project: Cassandra
  Issue Type: New Feature
  Components: Hadoop
Affects Versions: 1.2.0 beta 1
Reporter: Michael Kjellman

 Much like CASSANDRA-4208 BOF should support outputting to Multiple Column 
 Families. The current approach taken in the patch for COF results in only 
 one stream being sent.



[jira] [Commented] (CASSANDRA-4912) BulkOutputFormat should support Hadoop MultipleOutput

2012-11-08 Thread Michael Kjellman (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13493577#comment-13493577
 ] 

Michael Kjellman commented on CASSANDRA-4912:
-

So when ConfigHelper calls checkOutputSpecs() in local mode when the job is
set up, we don't throw any exceptions. When a reducer is created, however,
org.apache.cassandra.hadoop.ConfigHelper.getOutputColumnFamily throws an
UnsupportedOperationException saying that the output column family isn't set
up. It looks like mapreduce.output.basename is null.

Job Config is something along the lines of

public int run(String[] args) throws Exception
{
    Job job = new Job(getConf(), "Nashoba");

    job.setJarByClass(Nashoba.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(ReducerToCassandra.class);
    job.setInputFormatClass(ColumnFamilyInputFormat.class);

    // setup 3 reducers
    job.setNumReduceTasks(3);

    // thrift input job settings
    ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
    ConfigHelper.setInputInitialAddress(job.getConfiguration(), "127.0.0.1");
    ConfigHelper.setInputPartitioner(job.getConfiguration(), "RandomPartitioner");

    // thrift output job settings
    ConfigHelper.setOutputRpcPort(job.getConfiguration(), "9160");
    ConfigHelper.setOutputInitialAddress(job.getConfiguration(), "127.0.0.1");
    ConfigHelper.setOutputPartitioner(job.getConfiguration(), "RandomPartitioner");

    // set timeout to 1 hour for testing (the value is in milliseconds)
    job.getConfiguration().set("mapreduce.task.timeout", "3600000");
    job.getConfiguration().set("mapred.task.timeout", "3600000");

    job.getConfiguration().set("mapreduce.output.bulkoutputformat.buffersize", "64");
    job.setOutputFormatClass(BulkOutputFormat.class);

    // set on the job's own configuration; Job works on a copy of getConf()
    ConfigHelper.setRangeBatchSize(job.getConfiguration(), 99);

    // let ConfigHelper know what Column Family to get data from and where to output it
    ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, INPUT_COLUMN_FAMILY);

    ConfigHelper.setOutputKeyspace(job.getConfiguration(), KEYSPACE);
    MultipleOutputs.addNamedOutput(job, OUTPUT_COLUMN_FAMILY1, BulkOutputFormat.class, ByteBuffer.class, List.class);
    MultipleOutputs.addNamedOutput(job, OUTPUT_COLUMN_FAMILY2, BulkOutputFormat.class, ByteBuffer.class, List.class);

    // what classes the mapper will write and what the consumer should expect to receive
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(MapWritable.class);
    job.setOutputKeyClass(ByteBuffer.class);
    job.setOutputValueClass(List.class);

    SliceRange sliceRange = new SliceRange();
    sliceRange.setStart(new byte[0]);
    sliceRange.setFinish(new byte[0]);
    SlicePredicate predicate = new SlicePredicate();
    predicate.setSlice_range(sliceRange);
    ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);

    job.waitForCompletion(true);
    return 0;
}

public static class ReducerToCassandra extends Reducer<Text, MapWritable, ByteBuffer, List<Mutation>>
{
    private MultipleOutputs<ByteBuffer, List<Mutation>> output;

    @Override
    public void setup(Context context) {
        output = new MultipleOutputs<ByteBuffer, List<Mutation>>(context);
    }

    public void reduce(Text word, Iterable<MapWritable> values, Context context) throws IOException, InterruptedException
    {
        // do stuff in reducer... (elided; 'key' and 'val' are not defined in this snippet)

        // write out our result to Hadoop
        context.progress();
        // for writing to 2 column families
        output.write(OUTPUT_COLUMN_FAMILY1, key, Collections.singletonList(getMutation1(word, val)));
        output.write(OUTPUT_COLUMN_FAMILY2, key, Collections.singletonList(getMutation2(word, val)));
    }

    public void cleanup(Context context) throws IOException, InterruptedException {
        output.close(); // closes all of the opened outputs
    }
}

 BulkOutputFormat should support Hadoop MultipleOutput
 

[jira] [Commented] (CASSANDRA-4912) BulkOutputFormat should support Hadoop MultipleOutput

2012-11-07 Thread Michael Kjellman (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13492993#comment-13492993
 ] 

Michael Kjellman commented on CASSANDRA-4912:
-

so obviously this is due to the handling in the close() function in
BulkRecordWriter. So far I've been unable to get BOF to work in local mode
through Eclipse with MultipleOutputs. ConfigHelper is happy on the first
check, but when the reducer is created the column family output names don't
seem to be set. close() is pretty simple: it looks like the sstable is first
closed and then streamed to the nodes. I'm guessing that close() is only being
called on one of the sstables (I do see in a fully distributed cluster that
the sstables get created for multiple column families), but maybe we don't
close the rest, so they never stream to the nodes?
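
The close-then-stream behaviour guessed at above can be illustrated with a
small model. It is a sketch under stated assumptions, not the actual
BulkRecordWriter or MultipleOutputs code: the class names are hypothetical,
"streaming" is reduced to appending the column family name to a log, and the
only point made is that every per-output writer has to be closed for every
sstable to be streamed.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class NamedOutputWriters {
    // Hypothetical per-column-family writer: close() finishes the sstable,
    // then streams it (modelled as a log entry).
    static final class Writer {
        private final String columnFamily;
        private final List<String> streamLog;
        private boolean closed;

        Writer(String columnFamily, List<String> streamLog) {
            this.columnFamily = columnFamily;
            this.streamLog = streamLog;
        }

        void write(String row) {
            if (closed) {
                throw new IllegalStateException("writer already closed");
            }
            // a real writer would buffer the row into an sstable here
        }

        void close() {
            if (closed) {
                return;
            }
            closed = true;                 // step 1: close the sstable
            streamLog.add(columnFamily);   // step 2: stream it to the nodes
        }
    }

    private final Map<String, Writer> writers = new LinkedHashMap<>();
    private final List<String> streamLog = new ArrayList<>();

    // One writer per named output, created on first use.
    void write(String namedOutput, String row) {
        writers.computeIfAbsent(namedOutput, cf -> new Writer(cf, streamLog)).write(row);
    }

    // Closes every writer; skipping one would leave its sstable unstreamed.
    void close() {
        for (Writer writer : writers.values()) {
            writer.close();
        }
    }

    List<String> streamed() {
        return streamLog;
    }

    public static void main(String[] args) {
        NamedOutputWriters output = new NamedOutputWriters();
        output.write("cf1", "row-a");
        output.write("cf2", "row-b");
        output.close();
        System.out.println(output.streamed()); // prints [cf1, cf2]
    }
}
```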

 BulkOutputFormat should support Hadoop MultipleOutput
 -

 Key: CASSANDRA-4912
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4912
 Project: Cassandra
  Issue Type: New Feature
  Components: Hadoop
Affects Versions: 1.2.0 beta 1
Reporter: Michael Kjellman

 Much like CASSANDRA-4208 BOF should support outputting to Multiple Column 
 Families. The current approach taken in the patch for COF results in only 
 one stream being sent.
