Thanks Paul. I will give it a try.

On Mon, Aug 11, 2014 at 1:11 PM, Paul Hamilton <paul.hamil...@datalogix.com>
wrote:

> Hi Chen,
>
> You need to set the max input split size so that the underlying Hadoop
> libraries will calculate the splits appropriately.  I have done the
> following successfully:
>
> import org.apache.hadoop.mapreduce.Job
> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
>
> val job = new Job()
> // Cap each input split at ~128 MB so large blocks get split further
> FileInputFormat.setMaxInputSplitSize(job, 128000000L)
>
> And then use job.getConfiguration when creating a NewHadoopRDD.
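> For example (an untested sketch, continuing from the job above and
> assuming plain text input; the path is a placeholder):
>
> import org.apache.hadoop.io.{LongWritable, Text}
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
>
> // newAPIHadoopFile builds a NewHadoopRDD using the job's configuration
> val lines = sc.newAPIHadoopFile(
>   "hdfs:///path/to/input",
>   classOf[TextInputFormat],
>   classOf[LongWritable],
>   classOf[Text],
>   job.getConfiguration)
>   .map(_._2.toString)   // drop the byte-offset keys, keep the line text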
>
> I am sure there is some way to use it with convenience methods like
> SparkContext.textFile as well; you could probably set the Hadoop
> configuration property "mapreduce.input.fileinputformat.split.maxsize".
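> For example, something like the following (an untested sketch; the path
> and partition count are placeholders). Note that textFile reads through
> the older mapred input format, so the optional minPartitions argument to
> textFile may be the more reliable knob there:
>
> // Applies to reads that pick up the SparkContext's Hadoop configuration
> sc.hadoopConfiguration.setLong(
>   "mapreduce.input.fileinputformat.split.maxsize", 128000000L)
>
> // textFile also accepts a minimum-number-of-partitions hint directly
> val lines = sc.textFile("hdfs:///path/to/input", 100)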
>
> Regards,
> Paul Hamilton
>
> From:  Chen Song <chen.song...@gmail.com>
> Date:  Friday, August 8, 2014 at 9:13 PM
> To:  "user@spark.apache.org" <user@spark.apache.org>
> Subject:  increase parallelism of reading from hdfs
>
>
> In Spark Streaming, StreamingContext.fileStream gives a FileInputDStream.
> Within each batch interval, it launches map tasks for the new files
> detected during that interval. It appears that Spark computes the number
> of map tasks based on the block size of the files.
>
> Below is the relevant quote from the Spark documentation.
>
>  Spark automatically sets the number of "map" tasks to run on each file
>  according to its size (though you can control it through optional
>  parameters to SparkContext.textFile, etc)
>
> In my testing, if the files are stored as 512M blocks, each map task seems
> to process a 512M chunk of data, no matter what value I set for
> dfs.blocksize on the driver/executors. I am wondering if there is a way to
> increase parallelism, say, have each map task read 128M of data and
> thereby increase the number of map tasks?
>
>
> --
> Chen Song
>
>


-- 
Chen Song
