To: Andrew Davidson <a...@santacruzintegration.com>, Pedro Rodriguez
<ski.rodrig...@gmail.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: use big files and read from HDFS was: performance problem when
reading lots of small files created by spark streaming.
> Hi Pedro
>
> I did some experiments …
To: Pedro Rodriguez <ski.rodrig...@gmail.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: performance problem when reading lots of small files created
by spark streaming.
> Hi Pedro
>
> Thanks for the explanation. I started watching your repo. In the short term I
> think …
>
> static class SaveData {
>
>     DataFrame df;
>     String path;
>
>     SaveData(DataFrame df, String path) {
>         this.df = df;
>         this.path = path;
>     }
> }
>
> static class SaveWorker implements Runnable {
>
>     SaveData data;
>
>     public SaveWorker(SaveData data) {
>         this.data = data;
>     }
>
>     @Override
>     public void run() {
>         if (data.df.count() >= 1) {
>             data.df.write().json(data.path);
>         }
>     }
> }
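[Editor's note: the subject line suggests merging the many small per-micro-batch outputs into big files. The idea can be sketched with plain Python; the file names, counts, and record shape below are made up for illustration, and Spark itself is not involved.]

```python
import json
import os
import tempfile

def merge_json_lines(src_paths, dest_path):
    """Concatenate many small JSON-lines files into one large file."""
    with open(dest_path, "w") as out:
        for path in src_paths:
            with open(path) as f:
                for line in f:
                    line = line.strip()
                    if line:
                        out.write(line + "\n")

# demo: 100 tiny single-record files, a hypothetical stand-in for
# the per-micro-batch output of a streaming job
tmp = tempfile.mkdtemp()
srcs = []
for i in range(100):
    p = os.path.join(tmp, "part-%05d.json" % i)
    with open(p, "w") as f:
        f.write(json.dumps({"id": i}) + "\n")
    srcs.append(p)

merged = os.path.join(tmp, "merged.json")
merge_json_lines(srcs, merged)
with open(merged) as f:
    records = [json.loads(line) for line in f]
print(len(records))  # 100
```

A downstream batch job then opens one large file instead of one file per micro-batch.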
From: Pedro Rodriguez <ski.rodrig...@gmail.com>
Date: Wednesday, July 27, 2016 at 8:40 PM
To: Andrew Davidson <a...@santacruzintegration.com>
There are a few blog posts that detail one possible/likely issue, for
example:
http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219
TLDR: The Hadoop libraries Spark uses assume that their input comes from a
file system (which is true for HDFS), but S3 is a key-value store, not a
file system.
I have a relatively small data set; however, it is split into many small
JSON files. Each file is between maybe 4K and 400K.
This is probably a very common issue for anyone using spark streaming. My
streaming app works fine; however, my batch application takes several hours
to run.
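[Editor's note: the reason many tiny inputs make the batch job slow is that a fixed cost (listing, open/close, and in Spark a task per input split) is paid per file regardless of its size. A stdlib-only stand-in, with made-up file counts and no Spark involved:]

```python
import json
import os
import tempfile

# hypothetical stand-in for a directory of streaming output:
# 500 tiny JSON files, each holding a single record
tmp = tempfile.mkdtemp()
for i in range(500):
    with open(os.path.join(tmp, "f%04d.json" % i), "w") as f:
        f.write(json.dumps({"id": i}))

# the batch read pays one list entry plus one open/close per file,
# no matter how little data each file holds
opens = 0
records = []
for name in sorted(os.listdir(tmp)):
    opens += 1
    with open(os.path.join(tmp, name)) as f:
        records.append(json.load(f))

print(opens, len(records))  # 500 500
```

With one merged file, the same record count would cost a single open; on S3 the per-file overhead is worse still, since each listing and open is a remote request.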
All I am doing