I have a relatively small data set, but it is split into many small JSON
files, each between roughly 4 KB and 400 KB.
This is probably a very common issue for anyone using Spark Streaming. My
streaming app works fine, but my batch application takes several hours to
run.

All I am doing is calling count(). Currently I am reading the files from
S3. When I look at the app UI, it looks like Spark is blocked, probably on
I/O. Adding additional workers and memory does not improve performance.
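
For reference, the batch job boils down to something like the sketch below
(the bucket path is made up, and I am assuming the 1.x SQLContext API):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("count-json"))
    val sqlContext = new SQLContext(sc)

    // read.json() globs every small file under the prefix, so Spark
    // schedules roughly one task per file -- ~200K tasks in my case
    val df = sqlContext.read.json("s3n://my-bucket/events/*.json")
    println(df.count())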

I am able to copy the files from S3 to a worker relatively quickly, so I do
not think S3 read time is the problem.

In the past, when I had similar data sets stored on HDFS, I was able to use
coalesce() to reduce the number of partitions from 200K to 30, which made a
big improvement in processing time. However, when I read from S3,
coalesce() does not improve performance.
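
On HDFS that looked roughly like this (paths and the partition count are
illustrative):

    // coalesce(30) merges the ~200K read partitions into 30 without a
    // shuffle, so each task works through many small files in sequence
    val df = sqlContext.read.json("hdfs:///events/*.json").coalesce(30)
    println(df.count())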

I tried copying the files to a normal file system and then using 'hadoop fs
-put' to copy them to HDFS; however, this has already taken several hours
and is nowhere near completion. It appears HDFS does not deal with small
files well either.

I am considering copying the files from S3 to a normal file system on one
of my workers, concatenating them into a few much larger files, and then
using 'hadoop fs -put' to move those to HDFS. Do you think this would
improve the Spark count() performance?
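
For what it's worth, I assume the same end state could also be produced
inside Spark itself, something like the sketch below (this assumes the
files are newline-delimited JSON; paths are placeholders):

    // read the many small files once, then write them back to HDFS as
    // ~30 larger files, instead of cat + 'hadoop fs -put'
    sc.textFile("s3n://my-bucket/events/*.json")
      .coalesce(30)
      .saveAsTextFile("hdfs:///events-compacted/")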

Does anyone know of heuristics for choosing the number or size of the
concatenated files?

Thanks in advance

Andy


