Hi Chen,

You need to set the max input split size so that the underlying Hadoop
libraries will calculate the splits appropriately.  I have done the
following successfully:

import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

val job = new Job()
FileInputFormat.setMaxInputSplitSize(job, 128000000L)  // cap each split at ~128 MB

And then use job.getConfiguration when creating a NewHadoopRDD.
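
For example, something along these lines (a rough sketch, untested here; sc is
your SparkContext and the input path is just a placeholder).
SparkContext.newAPIHadoopRDD constructs a NewHadoopRDD under the hood:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Reuse the job from above so the 128 MB max split size is applied.
FileInputFormat.addInputPath(job, new Path("hdfs:///path/to/input"))
val rdd = sc.newAPIHadoopRDD(
  job.getConfiguration,
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text])
val lines = rdd.map(_._2.toString)  // each task reads at most ~128 MB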

I am sure there is some way to use it with convenience methods like
SparkContext.textFile as well; you could probably set the Hadoop configuration
property "mapreduce.input.fileinputformat.split.maxsize".

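Something like the following might do it (an untested sketch; the path is a
placeholder and I have not verified how textFile picks up this setting):

// Set the max split size on the SparkContext's Hadoop configuration
// before calling textFile, so the same limit should apply there.
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize", "128000000")
val lines = sc.textFile("hdfs:///path/to/input")
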
Regards,
Paul Hamilton

From:  Chen Song <chen.song...@gmail.com>
Date:  Friday, August 8, 2014 at 9:13 PM
To:  "user@spark.apache.org" <user@spark.apache.org>
Subject:  increase parallelism of reading from hdfs


In Spark Streaming, StreamingContext.fileStream gives a FileInputDStream.
Within each batch interval, it launches map tasks for the new files detected
during that interval. It appears that the way Spark computes the number of map
tasks is based on the block size of the files.

Below is a quote from the Spark documentation.

 Spark automatically sets the number of "map" tasks to run on each file
 according to its size (though you can control it through optional parameters
 to SparkContext.textFile, etc.)
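
For reference, with SparkContext.textFile that optional parameter is the
minimum number of partitions; the path below is just a placeholder:

// Ask Spark to use at least 16 partitions when reading these files.
val lines = sc.textFile("hdfs:///path/to/files", 16)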

In my testing, if the files are stored as 512M blocks, each map task seems to
process a 512M chunk of data, no matter what value I set dfs.blocksize to on
the driver/executors. I am wondering if there is a way to increase parallelism,
say, have each map task read 128M of data and thereby increase the number of
map tasks?


-- 
Chen Song

