I have a relatively small data set; however, it is split into many small JSON files, each between roughly 4K and 400K. This is probably a very common issue for anyone using Spark Streaming. My streaming app works fine; however, my batch application takes several hours to run.
All I am doing is calling count(). Currently I am trying to read the files from S3. When I look at the app UI, it looks like Spark is blocked, probably on I/O. Adding additional workers and memory does not improve performance.

I am able to copy the files from S3 to a worker relatively quickly, so I do not think S3 read time is the problem. In the past, when I had similar data sets stored on HDFS, I was able to use coalesce() to reduce the number of partitions from 200K to 30, which made a big improvement in processing time. However, when I read from S3, coalesce() does not improve performance.

I tried copying the files to a normal file system and then using hadoop fs -put to copy the files to HDFS; however, this takes several hours and is nowhere near completion. It appears HDFS does not deal well with small files.

I am considering copying the files from S3 to a normal file system on one of my workers, concatenating the files into a few much larger files, and then using hadoop fs -put to move them to HDFS. Do you think this would improve the Spark count() performance issue? Does anyone know of heuristics for determining the number or size of the concatenated files?

Thanks in advance

Andy
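P.S. Here is a rough, untested sketch of the kind of consolidation I have in mind, in case it helps clarify the question. It assumes the Scala API; the bucket path, output path, and partition count are just placeholders. The idea is to let Spark itself do the concatenation instead of hadoop fs -put, since sc.wholeTextFiles() reads each small file as a (path, content) pair:

    // Read the many small JSON files in one pass.
    // The second argument is only a hint for the minimum partition count.
    val raw = sc.wholeTextFiles("s3n://my-bucket/json-input/", 30)

    // Drop the file paths and pack the contents into a handful of
    // partitions, so each output part file is large instead of tiny.
    val consolidated = raw.map { case (_, content) => content }.coalesce(30)

    // Write ~30 large part files to HDFS.
    consolidated.saveAsTextFile("hdfs:///consolidated-json/")

    // Subsequent batch jobs would then count against the consolidated copy.
    val n = sc.textFile("hdfs:///consolidated-json/").count()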
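P.P.S. On the sizing heuristic, here is my rough arithmetic, so someone can tell me if the reasoning is off: if the common advice of targeting roughly one HDFS block (typically 64MB or 128MB) per file holds, and my files average somewhere around 50K (just a guess within the 4K-400K range), then 200K files would be about 10GB total, which works out to on the order of 10GB / 128MB, i.e. around 80 concatenated files. Does that sound about right?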