Oops sorry, wrong mail list

From: Andrew Davidson <a...@santacruzintegration.com>
Date: Wednesday, July 27, 2016 at 7:10 PM
To: "users@kafka.apache.org" <users@kafka.apache.org>
Subject: performance problem when reading lots of small files created by spark streaming
> I have a relatively small data set; however, it is split into many small JSON
> files. Each file is between maybe 4K and 400K.
> This is probably a very common issue for anyone using Spark Streaming. My
> streaming app works fine; however, my batch application takes several hours to
> run.
>
> All I am doing is calling count(). Currently I am trying to read the files
> from S3. When I look at the app UI, it looks like Spark is blocked, probably on
> IO. Adding additional workers and memory does not improve performance.
>
> I am able to copy the files from S3 to a worker relatively quickly, so I do
> not think S3 read time is the problem.
>
> In the past, when I had similar data sets stored on HDFS, I was able to use
> coalesce() to reduce the number of partitions from 200K to 30. This made a big
> improvement in processing time. However, when I read from S3, coalesce() does
> not improve performance.
>
> I tried copying the files to a normal file system and then using 'hadoop fs
> -put' to copy the files to HDFS; however, this takes several hours and is
> nowhere near completion. It appears HDFS does not deal with small files well.
>
> I am considering copying the files from S3 to a normal file system on one of
> my workers, concatenating the files into a few much larger files, and then
> using 'hadoop fs -put' to move them to HDFS. Do you think this would improve
> the Spark count() performance?
>
> Does anyone know of heuristics for determining the number or size of the
> concatenated files?
>
> Thanks in advance
>
> Andy
>
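For reference, a common workaround for the small-files problem is to pay the expensive read once, consolidate into a handful of large files, and have every later batch job read the consolidated copy. Below is a minimal PySpark sketch of that idea; the s3a paths, the app name, and the target of 32 partitions are made-up placeholders, not anything from Andy's setup.

    # Hypothetical one-off consolidation job.
    # Paths and the partition count (32) are illustrative guesses.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("consolidate-small-json").getOrCreate()

    # Read the many small JSON files once (slow, but done only one time).
    df = spark.read.json("s3a://my-bucket/streaming-output/")

    # coalesce() shrinks the partition count without a full shuffle;
    # repartition() is an alternative if the input partitioning is badly skewed.
    df.coalesce(32).write.mode("overwrite").json("s3a://my-bucket/consolidated/")

    # Later jobs read the consolidated copy, so count() only has to open a few files.
    print(spark.read.json("s3a://my-bucket/consolidated/").count())

On the heuristics question: a rough rule of thumb is to aim for output files around one HDFS block (128 MB by default in Hadoop 2.x) each, so the target partition count would be roughly the total data size divided by 128 MB.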