[jira] [Commented] (CASSANDRA-4912) BulkOutputFormat should support Hadoop MultipleOutput
[ https://issues.apache.org/jira/browse/CASSANDRA-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13498315#comment-13498315 ]

Michael Kjellman commented on CASSANDRA-4912:

[~brandon.williams] did everything compile okay for you?

BulkOutputFormat should support Hadoop MultipleOutput

Key: CASSANDRA-4912
URL: https://issues.apache.org/jira/browse/CASSANDRA-4912
Project: Cassandra
Issue Type: New Feature
Components: Hadoop
Affects Versions: 1.2.0 beta 1, 1.2.0 beta 2
Reporter: Michael Kjellman
Attachments: 4912.txt, App.java, loaddata.pl, pom.xml

Much like CASSANDRA-4208 BOF should support outputting to Multiple Column Families. The current approach taken in the patch for COF results in only one stream being sent and an exception being thrown when Hadoop is run in local mode due to the call to ConfigHelper when a new BulkRecordWriter is created.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-4912) BulkOutputFormat should support Hadoop MultipleOutput
[ https://issues.apache.org/jira/browse/CASSANDRA-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13496379#comment-13496379 ]

Brandon Williams commented on CASSANDRA-4912:

Do you have an Example.java that contains all the imports?
[jira] [Commented] (CASSANDRA-4912) BulkOutputFormat should support Hadoop MultipleOutput
[ https://issues.apache.org/jira/browse/CASSANDRA-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13496382#comment-13496382 ]

Michael Kjellman commented on CASSANDRA-4912:

Updated example with imports.
[jira] [Commented] (CASSANDRA-4912) BulkOutputFormat should support Hadoop MultipleOutput
[ https://issues.apache.org/jira/browse/CASSANDRA-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13496392#comment-13496392 ]

Brandon Williams commented on CASSANDRA-4912:

I still get a slew of errors trying to compile this. An obvious one is in ReducerToCassandra.reduce where 'val' is never defined, but there are many others.
[jira] [Commented] (CASSANDRA-4912) BulkOutputFormat should support Hadoop MultipleOutput
[ https://issues.apache.org/jira/browse/CASSANDRA-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13496397#comment-13496397 ]

Michael Kjellman commented on CASSANDRA-4912:

Yeah, sorry, it wasn't originally intended as a functional example. I'll create one that does something now.
[jira] [Commented] (CASSANDRA-4912) BulkOutputFormat should support Hadoop MultipleOutput
[ https://issues.apache.org/jira/browse/CASSANDRA-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13496589#comment-13496589 ]

Michael Kjellman commented on CASSANDRA-4912:

Okay, I attached a script to load data (really simple, but I wanted you to see what kind of data I was using to test that the Example job runs) and App.java, which will allow you to output to multiple column families with BOF.
[jira] [Commented] (CASSANDRA-4912) BulkOutputFormat should support Hadoop MultipleOutput
[ https://issues.apache.org/jira/browse/CASSANDRA-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13495591#comment-13495591 ]

Brandon Williams commented on CASSANDRA-4912:

There is no particular reason that I recall; it was just a convenient place at the time.
[jira] [Commented] (CASSANDRA-4912) BulkOutputFormat should support Hadoop MultipleOutput
[ https://issues.apache.org/jira/browse/CASSANDRA-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494471#comment-13494471 ]

Michael Kjellman commented on CASSANDRA-4912:

[~brandon.williams] If I patch BulkOutputFormat.java in a similar manner to CASSANDRA-4208 (line 40), the initial check of the config passes, but the job still fails when the reducer is created. I'm still not sure why the behavior is different.
[jira] [Commented] (CASSANDRA-4912) BulkOutputFormat should support Hadoop MultipleOutput
[ https://issues.apache.org/jira/browse/CASSANDRA-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494500#comment-13494500 ]

Michael Kjellman commented on CASSANDRA-4912:

It looks like OUTPUT_COLUMNFAMILY_CONFIG never gets set in ConfigHelper when a new BulkRecordWriter is created. It's difficult to figure out exactly what code should be setting mapreduce.output.basename in the job config, or where.
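To make the lookup under discussion concrete, here is a minimal sketch in plain Java of one way a column family lookup could fall back to mapreduce.output.basename when no explicit column family is configured. This is not the ConfigHelper code: the class ColumnFamilyLookupSketch, the key sketch.output.columnfamily, and the use of a Map in place of the Hadoop job Configuration are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: a Map stands in for the Hadoop job Configuration.
class ColumnFamilyLookupSketch {
    // Hypothetical key for an explicitly configured output column family.
    static final String EXPLICIT_KEY = "sketch.output.columnfamily";
    // Key mentioned in this thread.
    static final String BASENAME_KEY = "mapreduce.output.basename";

    // Prefer the explicit setting, fall back to the basename, fail if neither is set.
    static String outputColumnFamily(Map<String, String> conf) {
        String cf = conf.get(EXPLICIT_KEY);
        if (cf == null) {
            cf = conf.get(BASENAME_KEY);
        }
        if (cf == null) {
            throw new UnsupportedOperationException("output column family is not set");
        }
        return cf;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(BASENAME_KEY, "cf1"); // as if a named output had supplied the name
        System.out.println(outputColumnFamily(conf));
    }
}
```

With neither key present the lookup throws, which matches the symptom described here: whichever component is expected to supply the name has not run by the time the writer asks for it.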
[jira] [Commented] (CASSANDRA-4912) BulkOutputFormat should support Hadoop MultipleOutput
[ https://issues.apache.org/jira/browse/CASSANDRA-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494513#comment-13494513 ]

Michael Kjellman commented on CASSANDRA-4912:

I think another difference in behavior between CFOF and BOF is that when a new BulkRecordWriter(Configuration conf) is created, it creates the directory for the sstables. It calls ConfigHelper there to get the name of the column family so it can create the directory. In CFOF, the only call to getOutputColumnFamily is in RangeClient. Normally, without MultipleOutputs, the job config would include a setOutputColumnFamily() call. I don't understand what calls setOutputColumnFamily when you add a new named MultipleOutput. I presume this is where the problem is.
[jira] [Commented] (CASSANDRA-4912) BulkOutputFormat should support Hadoop MultipleOutput
[ https://issues.apache.org/jira/browse/CASSANDRA-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494542#comment-13494542 ]

Michael Kjellman commented on CASSANDRA-4912:

Okay, so it looks like setting outputdir when the object is created is causing the problem. I moved setting outputdir into prepareWriter(), and it looks like both sstables are now created and streamed. [~brandon.williams] is there any reason the outputdir is created when the BulkRecordWriter object is created?
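The change described in this comment amounts to deferring a config lookup from the constructor to the first write. A minimal sketch of that pattern in plain Java follows. It is not the BulkRecordWriter source: LazyDirWriterSketch, the key sketch.output.columnfamily, the /tmp/sketch path, and the Map standing in for the Hadoop Configuration are all hypothetical; only the method name prepareWriter() is taken from the comment.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a record writer; not the Cassandra class.
class LazyDirWriterSketch {
    // Hypothetical config key holding the output column family name.
    static final String COLUMN_FAMILY_KEY = "sketch.output.columnfamily";

    private final Map<String, String> conf;
    private String outputDir; // resolved on first write, not in the constructor

    LazyDirWriterSketch(Map<String, String> conf) {
        this.conf = conf; // no config lookup here, so construction cannot fail
    }

    // Resolve the output directory once, the first time a record is written.
    private void prepareWriter() {
        if (outputDir != null) {
            return;
        }
        String cf = conf.get(COLUMN_FAMILY_KEY);
        if (cf == null) {
            throw new UnsupportedOperationException("output column family is not set");
        }
        outputDir = "/tmp/sketch/" + cf;
    }

    // Returns the path the record would be written under.
    String write(String key) {
        prepareWriter();
        return outputDir + "/" + key;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        LazyDirWriterSketch writer = new LazyDirWriterSketch(conf); // column family not set yet
        conf.put(COLUMN_FAMILY_KEY, "cf1"); // supplied later, before the first write
        System.out.println(writer.write("row1"));
    }
}
```

The design point is only the ordering: the writer can be constructed before the column family name is known, and the lookup fails only if the name is still missing when a record is actually written.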
[jira] [Commented] (CASSANDRA-4912) BulkOutputFormat should support Hadoop MultipleOutput
[ https://issues.apache.org/jira/browse/CASSANDRA-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493129#comment-13493129 ]

Brandon Williams commented on CASSANDRA-4912:

Hmm, normally I find the opposite: local mode works, and then everything breaks in distributed mode :) Can you post everything needed to test?
[jira] [Commented] (CASSANDRA-4912) BulkOutputFormat should support Hadoop MultipleOutput
[ https://issues.apache.org/jira/browse/CASSANDRA-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493577#comment-13493577 ]

Michael Kjellman commented on CASSANDRA-4912:

So when ConfigHelper calls checkOutputSpecs() in local mode while the job is being set up, we don't throw any exceptions. When a reducer is created, however, org.apache.cassandra.hadoop.ConfigHelper.getOutputColumnFamily throws an UnsupportedOperationException saying that the output column family isn't set up. It looks like mapreduce.output.basename is null.

Job config is something along the lines of:

    public int run(String[] args) throws Exception
    {
        Job job = new Job(getConf(), "Nashoba");
        job.setJarByClass(Nashoba.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(ReducerToCassandra.class);
        job.setInputFormatClass(ColumnFamilyInputFormat.class);

        // setup 3 reducers
        job.setNumReduceTasks(3);

        // thrift input job settings
        ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
        ConfigHelper.setInputInitialAddress(job.getConfiguration(), "127.0.0.1");
        ConfigHelper.setInputPartitioner(job.getConfiguration(), "RandomPartitioner");

        // thrift output job settings
        ConfigHelper.setOutputRpcPort(job.getConfiguration(), "9160");
        ConfigHelper.setOutputInitialAddress(job.getConfiguration(), "127.0.0.1");
        ConfigHelper.setOutputPartitioner(job.getConfiguration(), "RandomPartitioner");

        //set timeout to 1 hour for testing
        job.getConfiguration().set("mapreduce.task.timeout", "360");
        job.getConfiguration().set("mapred.task.timeout", "360");
        job.getConfiguration().set("mapreduce.output.bulkoutputformat.buffersize", "64");

        job.setOutputFormatClass(BulkOutputFormat.class);
        ConfigHelper.setRangeBatchSize(getConf(), 99);

        // let ConfigHelper know what Column Family to get data from and where to output it
        ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, INPUT_COLUMN_FAMILY);
        ConfigHelper.setOutputKeyspace(job.getConfiguration(), KEYSPACE);
        MultipleOutputs.addNamedOutput(job, OUTPUT_COLUMN_FAMILY1, BulkOutputFormat.class, ByteBuffer.class, List.class);
        MultipleOutputs.addNamedOutput(job, OUTPUT_COLUMN_FAMILY2, BulkOutputFormat.class, ByteBuffer.class, List.class);

        //what classes the mapper will write and what the consumer should expect to receive
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(MapWritable.class);
        job.setOutputKeyClass(ByteBuffer.class);
        job.setOutputValueClass(List.class);

        SliceRange sliceRange = new SliceRange();
        sliceRange.setStart(new byte[0]);
        sliceRange.setFinish(new byte[0]);
        SlicePredicate predicate = new SlicePredicate();
        predicate.setSlice_range(sliceRange);
        ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);

        job.waitForCompletion(true);
        return 0;
    }

    public static class ReducerToCassandra extends Reducer<Text, MapWritable, ByteBuffer, List<Mutation>>
    {
        private MultipleOutputs<ByteBuffer, List<Mutation>> output;

        @Override
        public void setup(Context context)
        {
            output = new MultipleOutputs<ByteBuffer, List<Mutation>>(context);
        }

        public void reduce(Text word, Iterable<MapWritable> values, Context context) throws IOException, InterruptedException
        {
            // do stuff in reducer...

            //write out our result to Hadoop
            context.progress();

            //for writing to 2 column families
            output.write(OUTPUT_COLUMN_FAMILY1, key, Collections.singletonList(getMutation1(word, val)));
            output.write(OUTPUT_COLUMN_FAMILY2, key, Collections.singletonList(getMutation2(word, val)));
        }

        public void cleanup(Context context) throws IOException, InterruptedException
        {
            output.close(); //closes all of the opened outputs
        }
    }
[jira] [Commented] (CASSANDRA-4912) BulkOutputFormat should support Hadoop MultipleOutput
[ https://issues.apache.org/jira/browse/CASSANDRA-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492993#comment-13492993 ]

Michael Kjellman commented on CASSANDRA-4912:

So obviously this is due to the handling in the close() function in BulkRecordWriter. So far I've been unable to get BOF to work in local mode through Eclipse with MultipleOutputs. ConfigHelper is happy on the first check, but when the reducer is created the column family output names don't seem to be set. close() is pretty simple: it looks like the sstable is first closed and then streamed to the nodes. I'm guessing that close() is only being called on one of the sstables (in a fully distributed cluster I do see the sstables get created for multiple column families), so maybe we don't close the other one and thus it never streams to the nodes?
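The guess in this comment, that only one of several writers gets closed and therefore only one table is streamed, can be illustrated with a small plain-Java sketch. Nothing here is Cassandra or Hadoop code: MultiWriterCloseSketch, TableWriter, writerFor, and closeAll are hypothetical names, and "streaming" is simulated by recording the column family name in a list.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: one writer per named output, each streamed when closed.
class MultiWriterCloseSketch {
    // Hypothetical writer whose close() finishes the local table, then "streams" it.
    static class TableWriter {
        final String columnFamily;
        final List<String> streamed;
        boolean closed;

        TableWriter(String columnFamily, List<String> streamed) {
            this.columnFamily = columnFamily;
            this.streamed = streamed;
        }

        void close() {
            if (closed) {
                return; // closing twice must not stream twice
            }
            closed = true;              // step 1: finish the local table
            streamed.add(columnFamily); // step 2: stream it (recorded here, not sent)
        }
    }

    final Map<String, TableWriter> writers = new LinkedHashMap<>();
    final List<String> streamed = new ArrayList<>();

    // Return the writer for a column family, creating it on first use.
    TableWriter writerFor(String columnFamily) {
        return writers.computeIfAbsent(columnFamily, name -> new TableWriter(name, streamed));
    }

    // Every writer has to be closed, otherwise its table is never streamed.
    void closeAll() {
        for (TableWriter writer : writers.values()) {
            writer.close();
        }
    }

    public static void main(String[] args) {
        MultiWriterCloseSketch out = new MultiWriterCloseSketch();
        out.writerFor("cf1");
        out.writerFor("cf2");
        out.closeAll();
        System.out.println(out.streamed);
    }
}
```

If only a single writer's close() were called in place of closeAll(), the second column family would be written locally but never appear in the streamed list, which is the behavior being hypothesized above.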