Oops, sorry, wrong mailing list.

From:  Andrew Davidson <a...@santacruzintegration.com>
Date:  Wednesday, July 27, 2016 at 7:10 PM
To:  "users@kafka.apache.org" <users@kafka.apache.org>
Subject:  performance problem when reading lots of small files created by
spark streaming.

> I have a relatively small data set, but it is split into many small JSON
> files. Each file is between maybe 4K and 400K.
> This is probably a very common issue for anyone using Spark Streaming. My
> streaming app works fine, however my batch application takes several hours to
> run.
> 
> All I am doing is calling count(). Currently I am trying to read the files
> from S3. When I look at the app UI, it looks like Spark is blocked, probably on
> IO? Adding additional workers and memory does not improve performance.
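> 
> A minimal sketch of what the batch job boils down to, run from spark-shell (the
> bucket name and path are just placeholders; on Spark 1.x the same read goes
> through sqlContext instead of spark):
> 
>     // read the many small JSON files straight from S3 and count the records
>     val df = spark.read.json("s3a://my-bucket/streaming-output/")
>     println(df.count())   // this is the step that takes hours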
> 
> I am able to copy the files from S3 to a worker relatively quickly, so I do
> not think S3 read time is the problem.
> 
> In the past, when I had similar data sets stored on HDFS, I was able to use
> coalesce() to reduce the number of partitions from 200K to 30. This made a big
> improvement in processing time. However, when I read from S3, coalesce() does
> not improve performance.
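> 
> The HDFS version that worked looked roughly like this (the path is a
> placeholder; the 200K and 30 partition counts are the ones from that run):
> 
>     // collapse ~200K input partitions down to 30 before doing any work
>     val df = spark.read.json("hdfs:///data/streaming-output/").coalesce(30)
>     println(df.count())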
> 
> I tried copying the files to a normal file system and then using 'hadoop fs
> -put' to copy the files to HDFS, however this takes several hours and is
> nowhere near completion. It appears HDFS does not deal with small files well.
> 
> I am considering copying the files from S3 to a normal file system on one of
> my workers, concatenating them into a few much larger files, and then using
> 'hadoop fs -put' to move them to HDFS. Do you think this would improve the
> Spark count() performance issue?
> 
> Does anyone know of heuristics for determining the number or size of the
> concatenated files?
> 
> Thanks in advance
> 
> Andy
> 

