Thanks Paul. I will give it a try.
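For reference, below is a rough sketch of what I plan to run based on your suggestion. The HDFS path, app name, and the 128 MB cap are just placeholders for my setup, and the textFile variant is only the "probably" you mentioned, not something I have verified:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}
import org.apache.spark.{SparkConf, SparkContext}

object SplitSizeTest {
  def main(args: Array[String]): Unit = {
    // local[*] is only for testing; the real master comes from spark-submit
    val sc = new SparkContext(
      new SparkConf().setAppName("split-size-test").setMaster("local[*]"))

    // Approach 1: build a Hadoop Job just to carry the input-format settings,
    // cap the split size at 128 MB, and pass job.getConfiguration to the
    // new-API reader. (Job.getInstance() is the non-deprecated form on Hadoop 2.)
    val job = new Job()
    FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024)

    val rdd = sc.newAPIHadoopFile(
      "hdfs:///tmp/input",          // placeholder path
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text],
      job.getConfiguration)
    println(s"newAPIHadoopFile partitions: ${rdd.partitions.length}")

    // Approach 2 (untested, per your "probably"): set the property on the
    // SparkContext's Hadoop configuration so convenience methods pick it up.
    sc.hadoopConfiguration.set(
      "mapreduce.input.fileinputformat.split.maxsize",
      (128L * 1024 * 1024).toString)
    val lines = sc.textFile("hdfs:///tmp/input")   // placeholder path
    println(s"textFile partitions: ${lines.partitions.length}")

    sc.stop()
  }
}

I will report back how many partitions each approach actually produces.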
On Mon, Aug 11, 2014 at 1:11 PM, Paul Hamilton <paul.hamil...@datalogix.com> wrote:
> Hi Chen,
>
> You need to set the max input split size so that the underlying Hadoop
> libraries will calculate the splits appropriately. I have done the
> following successfully:
>
> val job = new Job()
> FileInputFormat.setMaxInputSplitSize(job, 128000000L)
>
> And then use job.getConfiguration when creating a NewHadoopRDD.
>
> I am sure there is some way to use it with convenience methods like
> SparkContext.textFile; you could probably set the system property
> "mapreduce.input.fileinputformat.split.maxsize".
>
> Regards,
> Paul Hamilton
>
> From: Chen Song <chen.song...@gmail.com>
> Date: Friday, August 8, 2014 at 9:13 PM
> To: "user@spark.apache.org" <user@spark.apache.org>
> Subject: increase parallelism of reading from hdfs
>
>
> In Spark Streaming, StreamingContext.fileStream gives a FileInputDStream.
> Within each batch interval, it launches map tasks for the new files
> detected during that interval. It appears that Spark computes the number
> of map tasks based on the block size of the files.
>
> Below is the relevant quote from the Spark documentation:
>
> Spark automatically sets the number of "map" tasks to run on each file
> according to its size (though you can control it through optional
> parameters to SparkContext.textFile, etc.)
>
> In my testing, when files are loaded as 512M blocks, each map task seems
> to process a 512M chunk of data, no matter what value I set dfs.blocksize
> to on the driver/executor. I am wondering if there is a way to increase
> parallelism, say, have each map task read 128M of data and thereby
> increase the number of map tasks.
>
>
> --
> Chen Song
>

--
Chen Song