How to write a mapreduce-style program in Spark using Java on a user-defined JavaPairRDD?
Hi, everyone!

I've got key/value pairs in the form of <LongWritable, Text>, which I load with the following code:

    SparkConf conf = new SparkConf().setAppName("MapReduceFileInput");
    JavaSparkContext sc = new JavaSparkContext(conf);
    Configuration confHadoop = new Configuration();
    JavaPairRDD<LongWritable, Text> sourceFile = sc.newAPIHadoopFile(
        "hdfs://cMaster:9000/wcinput/data.txt",
        DataInputFormat.class, LongWritable.class, Text.class, confHadoop);

Now I want to transform this JavaPairRDD from <LongWritable, Text> into another <LongWritable, Text>, where the Text content is different. After that, I want to write the Text values to HDFS, ordered by the LongWritable key. But I don't know how to write the map and reduce functions in Spark using Java. Can someone help me?

Sincerely,
Missie
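A minimal sketch of the transform-then-sort pipeline asked about above, assuming a `sourceFile` RDD loaded as in the question; the `rewrite` helper is a hypothetical placeholder for whatever should change the Text content. Hadoop Writables are not Java-serializable, so the sketch copies keys and values into plain Java types before the shuffle that `sortByKey` triggers, then converts back to Writables for output:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// ... given sourceFile as loaded above ...

// 1. Copy out of the reused Writable objects (Hadoop's RecordReader reuses
//    the same instances) into serializable Java types, applying the transform.
JavaPairRDD<Long, String> transformed = sourceFile.mapToPair(pair ->
    new Tuple2<>(pair._1().get(), rewrite(pair._2().toString())));

// 2. Sort by the numeric key; this triggers a shuffle.
JavaPairRDD<Long, String> sorted = transformed.sortByKey(true);

// 3. Convert back to Writables and write to HDFS via the new Hadoop API.
sorted.mapToPair(pair ->
        new Tuple2<>(new LongWritable(pair._1()), new Text(pair._2())))
      .saveAsNewAPIHadoopFile(
          "hdfs://cMaster:9000/wcoutput",   // example output path
          LongWritable.class, Text.class,
          TextOutputFormat.class, new Configuration());

// Hypothetical transform; replace with the real Text-rewriting logic.
static String rewrite(String line) { return line.toUpperCase(); }
```

The key-to-`Long` detour also sidesteps the Writable-reuse pitfall the Spark docs warn about when sorting or aggregating Hadoop objects directly.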
Re: Programming with java on spark
Hi, Akhil. Thank you for your reply. I tried what you suggested, but I got the following error. The source code is:

    JavaPairRDD<LongWritable, Text> distFile = sc.hadoopFile(
        "hdfs://cMaster:9000/wcinput/data.txt",
        DataInputFormat.class, LongWritable.class, Text.class);

while the DataInputFormat class is defined like this:

    class DataInputFormat extends FileInputFormat<LongWritable, Text> {
        @Override
        public RecordReader<LongWritable, Text> createRecordReader(InputSplit arg0,
                TaskAttemptContext arg1) throws IOException, InterruptedException {
            return new DataRecordReader();
        }
    }

The DataRecordReader class is derived from RecordReader<LongWritable, Text>. Then I got this error: [inline image 1]

Then I changed the source code to the following, and it seems to work. Thank you again!

    Configuration confHadoop = new Configuration();
    JavaPairRDD<LongWritable, Text> distFile = sc.newAPIHadoopFile(
        "hdfs://cMaster:9000/wcinput/data.txt",
        DataInputFormat.class, LongWritable.class, Text.class, confHadoop);

2015-06-23 15:40 GMT+08:00 Akhil Das ak...@sigmoidanalytics.com:

Did you happen to try this?

    JavaPairRDD<Integer, String> hadoopFile = sc.hadoopFile(
        "/sigmoid", DataInputFormat.class, LongWritable.class, Text.class);

Thanks
Best Regards

On Tue, Jun 23, 2015 at 6:58 AM, 付雅丹 yadanfu1...@gmail.com wrote:

Hello, everyone! I'm new to Spark. I have already written programs in Hadoop 2.5.2, where I defined my own InputFormat and OutputFormat. Now I want to move my code to Spark using Java. The first problem I encountered is how to transform a big text file in local storage into an RDD that is compatible with my program written in Hadoop. I found functions in SparkContext which may be helpful, but I don't know how to use them. E.g.:
    public <K, V, F extends org.apache.hadoop.mapreduce.InputFormat<K, V>>
    RDD<scala.Tuple2<K, V>> newAPIHadoopFile(String path, Class<F> fClass,
        Class<K> kClass, Class<V> vClass,
        org.apache.hadoop.conf.Configuration conf)

(http://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/rdd/RDD.html)

Get an RDD for a given Hadoop file with an arbitrary new API InputFormat and extra configuration options to pass to the input format. Note: Because Hadoop's RecordReader class re-uses the same Writable object for each record, directly caching the returned RDD or directly passing it to an aggregation or shuffle operation will create many references to the same object. If you plan to directly cache, sort, or aggregate Hadoop writable objects, you should first copy them using a map function.

In Java, the following is wrong:

    // option one
    Configuration confHadoop = new Configuration();
    JavaPairRDD<LongWritable, Text> distFile = sc.newAPIHadoopFile(
        "hdfs://cMaster:9000/wcinput/data.txt",
        DataInputFormat, LongWritable, Text, confHadoop);

    // option two
    Configuration confHadoop = new Configuration();
    DataInputFormat input = new DataInputFormat();
    LongWritable longType = new LongWritable();
    Text text = new Text();
    JavaPairRDD<LongWritable, Text> distFile = sc.newAPIHadoopFile(
        "hdfs://cMaster:9000/wcinput/data.txt",
        input, longType, text, confHadoop);

Can anyone help me? Thank you so much.
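The resolution in the reply above is consistent with Hadoop having two distinct InputFormat APIs: `sc.hadoopFile` expects an old-API `org.apache.hadoop.mapred.InputFormat`, while `sc.newAPIHadoopFile` expects a new-API `org.apache.hadoop.mapreduce.InputFormat`. Since the thread's `DataInputFormat` extends the new-API `FileInputFormat`, only `newAPIHadoopFile` matches. A minimal sketch of such a new-API InputFormat; delegating to `LineRecordReader` is an assumption for illustration, since the poster's `DataRecordReader` holds the real parsing logic:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Minimal new-API (org.apache.hadoop.mapreduce) InputFormat in the shape
// described in the thread. LineRecordReader stands in for the poster's
// custom DataRecordReader and yields <byte offset, line> pairs.
public class DataInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        return new LineRecordReader();
    }
}
```

Because this class extends the new-API `FileInputFormat`, it must be paired with `newAPIHadoopFile`; passing it to `hadoopFile` will not type-check, which would explain the compile error in the first attempt.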
Programming with java on spark
Hello, everyone! I'm new to Spark. I have already written programs in Hadoop 2.5.2, where I defined my own InputFormat and OutputFormat. Now I want to move my code to Spark using Java. The first problem I encountered is how to transform a big text file in local storage into an RDD that is compatible with my program written in Hadoop. I found functions in SparkContext which may be helpful, but I don't know how to use them. E.g.:

    public <K, V, F extends org.apache.hadoop.mapreduce.InputFormat<K, V>>
    RDD<scala.Tuple2<K, V>> newAPIHadoopFile(String path, Class<F> fClass,
        Class<K> kClass, Class<V> vClass,
        org.apache.hadoop.conf.Configuration conf)

(http://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/rdd/RDD.html)

Get an RDD for a given Hadoop file with an arbitrary new API InputFormat and extra configuration options to pass to the input format. Note: Because Hadoop's RecordReader class re-uses the same Writable object for each record, directly caching the returned RDD or directly passing it to an aggregation or shuffle operation will create many references to the same object. If you plan to directly cache, sort, or aggregate Hadoop writable objects, you should first copy them using a map function.

In Java, the following is wrong:

    // option one
    Configuration confHadoop = new Configuration();
    JavaPairRDD<LongWritable, Text> distFile = sc.newAPIHadoopFile(
        "hdfs://cMaster:9000/wcinput/data.txt",
        DataInputFormat, LongWritable, Text, confHadoop);

    // option two
    Configuration confHadoop = new Configuration();
    DataInputFormat input = new DataInputFormat();
    LongWritable longType = new LongWritable();
    Text text = new Text();
    JavaPairRDD<LongWritable, Text> distFile = sc.newAPIHadoopFile(
        "hdfs://cMaster:9000/wcinput/data.txt",
        input, longType, text, confHadoop);

Can anyone help me? Thank you so much.
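For what it's worth, the signature quoted above shows why both options fail: `newAPIHadoopFile` takes `Class` objects, not bare type names (option one, which does not parse) or instances (option two, which does not type-check). A minimal corrected call, assuming `DataInputFormat` is the user-defined new-API `InputFormat<LongWritable, Text>` and `sc` is the `JavaSparkContext`:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;

Configuration confHadoop = new Configuration();
JavaPairRDD<LongWritable, Text> distFile = sc.newAPIHadoopFile(
    "hdfs://cMaster:9000/wcinput/data.txt",  // path must be a quoted String
    DataInputFormat.class,   // Class<F> fClass -- class literal, not an instance
    LongWritable.class,      // Class<K> kClass
    Text.class,              // Class<V> vClass
    confHadoop);             // extra Hadoop configuration
```

Spark instantiates the InputFormat itself from the class literal, which is why the method wants `Class` objects rather than pre-built instances.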