Re: use big files and read from HDFS was: performance problem when reading lots of small files created by spark streaming.

2016-07-30 Thread Andy Davidson
To: Andy Davidson <a...@santacruzintegration.com>, Pedro Rodriguez <ski.rodrig...@gmail.com> Cc: "user @spark" <user@spark.apache.org>
Hi Pedro, I did some experi…

use big files and read from HDFS was: performance problem when reading lots of small files created by spark streaming.

2016-07-29 Thread Andy Davidson
To: Pedro Rodriguez <ski.rodrig...@gmail.com> Cc: "user @spark" <user@spark.apache.org>
Hi Pedro, Thanks for the explanation. I started watching your repo. In the short term I thi…

Re: performance problem when reading lots of small files created by spark streaming.

2016-07-28 Thread Gourav Sengupta
    …
    SaveData(DataFrame df, String path) {
        this.df = df;
        this.path = path;
    }
}

static class SaveWorker implements Runnable {
    SaveData data;

    public SaveWorker(SaveData data) {
        this.data = data;
        …

Re: performance problem when reading lots of small files created by spark streaming.

2016-07-28 Thread Andy Davidson
    …
    public SaveWorker(SaveData data) {
        this.data = data;
    }

    @Override
    public void run() {
        if (data.df.count() >= 1) {
            data.df.write().json(data.path);
        }
    }
}

From: Pedro Rodriguez <ski.rodrig...@gmail.com> Date: Wednesday, July 27, 2016 at 8:40 PM To: Andrew Davidson <a...@santacruzintegration.com>…
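Stripped of the Spark-specific DataFrame, the worker pattern in the fragment above — bundle the data to save with its destination path, then hand the bundle to a Runnable — can be sketched in plain Java. This is an illustrative reconstruction, not the thread's actual code: the String payload stands in for the DataFrame, and the non-empty guard mirrors the fragment's count() >= 1 check.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SaveWorkerSketch {
    // Bundles a payload with its destination, like the SaveData in the thread.
    static class SaveData {
        final String json;  // stands in for the Spark DataFrame
        final Path path;
        SaveData(String json, Path path) { this.json = json; this.path = path; }
    }

    // Runnable so saves can be pushed onto a thread pool instead of
    // blocking the main loop on I/O.
    static class SaveWorker implements Runnable {
        final SaveData data;
        SaveWorker(SaveData data) { this.data = data; }

        @Override
        public void run() {
            // Mirror the fragment's guard: only write non-empty data.
            if (!data.json.isEmpty()) {
                try {
                    Files.write(data.path, List.of(data.json));
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Path out = Files.createTempFile("batch-", ".json");
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.submit(new SaveWorker(new SaveData("{\"id\":1}", out)));
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("saved: " + Files.readAllLines(out).get(0));
    }
}
```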

Re: performance problem when reading lots of small files created by spark streaming.

2016-07-27 Thread Pedro Rodriguez
There are a few blog posts that detail one possible/likely issue, for example: http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219 TLDR: The Hadoop libraries Spark uses assume that their input comes from a file system (which works with HDFS); S3, however, is a key-value store, not a…
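The workaround that linked post describes is to avoid having Hadoop list the S3 "directory" at all and instead hand Spark an explicit list of keys. A minimal sketch of generating such a key list in plain Java — the bucket name and part-file layout here are hypothetical, and it relies on sc.textFile() accepting a comma-separated list of paths:

```java
import java.util.ArrayList;
import java.util.List;

public class S3KeyList {
    // Build explicit s3n:// paths for a known number of part files so no
    // slow S3 LIST call is needed; the comma-joined result can be passed
    // to sc.textFile(...), which accepts a comma-separated path list.
    static String partPaths(String bucket, String prefix, int numParts) {
        List<String> paths = new ArrayList<>();
        for (int i = 0; i < numParts; i++) {
            paths.add(String.format("s3n://%s/%s/part-%05d", bucket, prefix, i));
        }
        return String.join(",", paths);
    }

    public static void main(String[] args) {
        System.out.println(partPaths("my-bucket", "2016/07/27", 3));
    }
}
```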

performance problem when reading lots of small files created by spark streaming.

2016-07-27 Thread Andy Davidson
I have a relatively small data set, however it is split into many small JSON files, each between maybe 4K and 400K. This is probably a very common issue for anyone using Spark Streaming. My streaming app works fine, however my batch application takes several hours to run. All I am doing…
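The remedy the later subject line in this thread points at ("use big files and read from HDFS") is to compact the many small JSON files into a few large ones before batch processing, since per-file open overhead dominates when files are only a few KB. A minimal sketch of such a compaction step in plain java.nio (no Spark dependency; assumes line-delimited JSON and local paths):

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class CompactSmallFiles {
    // Concatenate every *.json file under inputDir into one large output
    // file, one JSON record per line, and return the record count.
    static long compact(Path inputDir, Path outputFile) throws IOException {
        long records = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(inputDir, "*.json");
             BufferedWriter out = Files.newBufferedWriter(outputFile)) {
            for (Path f : files) {
                for (String line : Files.readAllLines(f)) {
                    out.write(line);
                    out.newLine();
                    records++;
                }
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // Tiny demo: two small files compacted into one.
        Path dir = Files.createTempDirectory("small-json");
        Files.write(dir.resolve("a.json"), List.of("{\"id\":1}"));
        Files.write(dir.resolve("b.json"), List.of("{\"id\":2}"));
        Path big = dir.resolve("merged.jsonl");
        System.out.println("merged records: " + compact(dir, big));
    }
}
```

A batch job then reads the one large file instead of thousands of small ones.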