How to write a MapReduce-style program in Spark using Java on a user-defined JavaPairRDD?

2015-07-07 Thread
Hi, everyone!

I've got <key, value> pairs in the form <LongWritable, Text>, which I read
using the following code:

SparkConf conf = new SparkConf().setAppName("MapReduceFileInput");
JavaSparkContext sc = new JavaSparkContext(conf);
Configuration confHadoop = new Configuration();

JavaPairRDD<LongWritable, Text> sourceFile = sc.newAPIHadoopFile(
    "hdfs://cMaster:9000/wcinput/data.txt",
    DataInputFormat.class, LongWritable.class, Text.class, confHadoop);

Now I want to transform the JavaPairRDD from <LongWritable, Text> to another
<LongWritable, Text>, where the Text content is different. After that, I want
to write the Text values to HDFS ordered by the LongWritable key. But I don't
know how to write the map and reduce functions in Spark using Java. Can
someone help me?


Sincerely,
Missie.
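(A minimal sketch of one way to do the transformation and ordered write asked
about above, assuming sourceFile is the JavaPairRDD read in the question; the
value rewrite and the output path are placeholders:)

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Copy each record into plain Java types: Hadoop re-uses the same Writable
// objects for every record, and Writables do not shuffle cleanly in Spark.
JavaPairRDD<Long, String> transformed = sourceFile.mapToPair(pair ->
        new Tuple2<>(pair._1().get(),
                     "modified: " + pair._2().toString())); // placeholder rewrite

// Sort by the original LongWritable key, then write only the text values to HDFS.
transformed.sortByKey(true)
           .values()
           .saveAsTextFile("hdfs://cMaster:9000/wcoutput"); // hypothetical output path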


Re: Programming with Java on Spark

2015-06-29 Thread
Hi, Akhil. Thank you for your reply. I tried what you suggested, but I got
the following error.

The source code is:

JavaPairRDD<LongWritable, Text> distFile = sc.hadoopFile(
    "hdfs://cMaster:9000/wcinput/data.txt",
    DataInputFormat.class, LongWritable.class, Text.class);

while the DataInputFormat class is defined like this:

class DataInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException {
        // Delegate each input split to the custom record reader.
        return new DataRecordReader();
    }
}

The DataRecordReader class is derived from RecordReader<LongWritable, Text>.
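(The thread does not include DataRecordReader itself; a minimal hypothetical
version, which simply delegates to Hadoop's built-in LineRecordReader, might
look like this:)

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

class DataRecordReader extends RecordReader<LongWritable, Text> {
    // Re-use Hadoop's line reader; a real implementation would parse records here.
    private final LineRecordReader lineReader = new LineRecordReader();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        lineReader.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        return lineReader.nextKeyValue();
    }

    @Override
    public LongWritable getCurrentKey() throws IOException, InterruptedException {
        return lineReader.getCurrentKey();
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return lineReader.getCurrentValue();
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return lineReader.getProgress();
    }

    @Override
    public void close() throws IOException {
        lineReader.close();
    }
}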

Then I got the following error:

[image: inline image 1]


Then I changed the source code to the following and it seems to work
(presumably because sc.hadoopFile expects an old-API
org.apache.hadoop.mapred.InputFormat, whereas DataInputFormat extends the
new-API FileInputFormat, so sc.newAPIHadoopFile is the matching call). Thank
you again!

Configuration confHadoop = new Configuration();
JavaPairRDD<LongWritable, Text> distFile = sc.newAPIHadoopFile(
    "hdfs://cMaster:9000/wcinput/data.txt",
    DataInputFormat.class, LongWritable.class, Text.class, confHadoop);

2015-06-23 15:40 GMT+08:00 Akhil Das ak...@sigmoidanalytics.com:

 Did you happen to try this?


 JavaPairRDD<LongWritable, Text> hadoopFile = sc.hadoopFile(
     "/sigmoid", DataInputFormat.class, LongWritable.class,
     Text.class);



 Thanks
 Best Regards

 On Tue, Jun 23, 2015 at 6:58 AM, 付雅丹 yadanfu1...@gmail.com wrote:

 Hello, everyone! I'm new to Spark. I have already written programs in
 Hadoop 2.5.2, where I defined my own InputFormat and OutputFormat. Now I
 want to port my code to Spark using Java. The first problem I encountered is
 how to turn a big text file in local storage into an RDD that is compatible
 with my program written in Hadoop. I found functions in SparkContext that
 may be helpful, but I don't know how to use them.
 E.g.:

 public <K,V,F extends org.apache.hadoop.mapreduce.InputFormat<K,V>>
 RDD<scala.Tuple2<K,V>> newAPIHadoopFile(String path,
     Class<F> fClass,
     Class<K> kClass,
     Class<V> vClass,
     org.apache.hadoop.conf.Configuration conf)
 (RDD: http://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/rdd/RDD.html)

 Get an RDD for a given Hadoop file with an arbitrary new API InputFormat
 and extra configuration options to pass to the input format.

 Note: Because Hadoop's RecordReader class re-uses the same Writable
 object for each record, directly caching the returned RDD or directly
 passing it to an aggregation or shuffle operation will create many
 references to the same object. If you plan to directly cache, sort, or
 aggregate Hadoop writable objects, you should first copy them using a map function.
 In Java, the following is wrong.

 // option one
 Configuration confHadoop = new Configuration();
 JavaPairRDD<LongWritable, Text> distFile = sc.newAPIHadoopFile(
     "hdfs://cMaster:9000/wcinput/data.txt",
     DataInputFormat, LongWritable, Text, confHadoop); // raw type names, not class literals

 // option two
 Configuration confHadoop = new Configuration();
 DataInputFormat input = new DataInputFormat();
 LongWritable longType = new LongWritable();
 Text text = new Text();
 JavaPairRDD<LongWritable, Text> distFile = sc.newAPIHadoopFile(
     "hdfs://cMaster:9000/wcinput/data.txt",
     input, longType, text, confHadoop); // instances, not class literals

 Can anyone help me? Thank you so much.





Programming with Java on Spark

2015-06-22 Thread
Hello, everyone! I'm new to Spark. I have already written programs in
Hadoop 2.5.2, where I defined my own InputFormat and OutputFormat. Now I
want to port my code to Spark using Java. The first problem I encountered is
how to turn a big text file in local storage into an RDD that is compatible
with my program written in Hadoop. I found functions in SparkContext that
may be helpful, but I don't know how to use them.
E.g.:

public <K,V,F extends org.apache.hadoop.mapreduce.InputFormat<K,V>>
RDD<scala.Tuple2<K,V>> newAPIHadoopFile(String path,
    Class<F> fClass,
    Class<K> kClass,
    Class<V> vClass,
    org.apache.hadoop.conf.Configuration conf)
(RDD: http://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/rdd/RDD.html)

Get an RDD for a given Hadoop file with an arbitrary new API InputFormat
and extra configuration options to pass to the input format.

Note: Because Hadoop's RecordReader class re-uses the same Writable
object for each record, directly caching the returned RDD or directly
passing it to an aggregation or shuffle operation will create many
references to the same object. If you plan to directly cache, sort, or
aggregate Hadoop writable objects, you should first copy them using a map function.
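(A minimal sketch of such a defensive copy, assuming pairs is a
JavaPairRDD<LongWritable, Text> obtained from newAPIHadoopFile; the variable
names are placeholders:)

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Clone every record before caching, sorting, or aggregating, because the
// RecordReader hands back the same LongWritable and Text instances each time.
JavaPairRDD<LongWritable, Text> copied = pairs.mapToPair(pair ->
        new Tuple2<>(new LongWritable(pair._1().get()),
                     new Text(pair._2())));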
In Java, the following is wrong.

// option one
Configuration confHadoop = new Configuration();
JavaPairRDD<LongWritable, Text> distFile = sc.newAPIHadoopFile(
    "hdfs://cMaster:9000/wcinput/data.txt",
    DataInputFormat, LongWritable, Text, confHadoop); // raw type names, not class literals

// option two
Configuration confHadoop = new Configuration();
DataInputFormat input = new DataInputFormat();
LongWritable longType = new LongWritable();
Text text = new Text();
JavaPairRDD<LongWritable, Text> distFile = sc.newAPIHadoopFile(
    "hdfs://cMaster:9000/wcinput/data.txt",
    input, longType, text, confHadoop); // instances, not class literals

Can anyone help me? Thank you so much.
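
(For reference, the call that eventually worked in the reply above passes
class literals rather than raw type names or instances:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;

// newAPIHadoopFile expects Class objects for the InputFormat, key, and value types.
Configuration confHadoop = new Configuration();
JavaPairRDD<LongWritable, Text> distFile = sc.newAPIHadoopFile(
        "hdfs://cMaster:9000/wcinput/data.txt",
        DataInputFormat.class, LongWritable.class, Text.class, confHadoop);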