In my experience with streaming I was getting tens of thousands of empty
files created in HDFS. This was crushing my system's performance when my
batch jobs ran over the data sets. There is a lot of overhead in opening
and closing empty files.

I think creating empty files or keeping empty partitions around is
probably a bug; however, I never filed a bug report. Please file one, and
please copy me on the Jira.

There is also a related performance issue. I use repartition() to ensure
the CSV files have a maximum number of rows (it's a product requirement to
make the CSV files more user friendly). In my experience, if I do not
repartition, a partition with a single row of data causes a separate
part-* file to be created, and I wound up with a large number of very
small files. I have always wondered how to configure partitions to get
better performance. I would think we are better off with a few very large
partitions in most cases, i.e. keep more data in memory with less
overhead. I was really hoping Spark would handle this for me automatically.
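
For what it's worth, here is a minimal sketch of the sizing arithmetic
(the helper name capRowsPerFile is just illustrative; maxRowsPerFile
stands in for the 200-row limit in my code below):

    import org.apache.spark.api.java.JavaRDD;

    // Sketch: pick a partition count so each part-* file holds at most
    // maxRowsPerFile rows. coalesce() shrinks the partition count without
    // the full shuffle that repartition() always performs.
    static JavaRDD<String> capRowsPerFile(JavaRDD<String> rdd, int maxRowsPerFile) {
        long count = rdd.count(); // note: an extra pass over the data
        int numPartitions = (int) ((count + maxRowsPerFile - 1) / maxRowsPerFile); // ceiling
        return rdd.coalesce(Math.max(numPartitions, 1));
    }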

Andy

From:  Mich Talebzadeh <mich.talebza...@gmail.com>
Date:  Tuesday, April 5, 2016 at 3:49 PM
To:  Andrew Davidson <a...@santacruzintegration.com>
Cc:  "user @spark" <user@spark.apache.org>
Subject:  Re: Saving Spark streaming RDD with saveAsTextFiles ends up
creating empty files on HDFS

> Thanks Andy.
> 
> Do we know if this is a known bug, or simply expected behaviour, given
> that on the face of it Spark cannot save RDD output to a text file?
> 
> 
> 
> Dr Mich Talebzadeh
> 
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> 
> http://talebzadehmich.wordpress.com
> 
> 
> On 5 April 2016 at 23:35, Andy Davidson <a...@santacruzintegration.com> wrote:
>> Hi Mich
>> 
>> Yup, I was surprised to find empty files. It's easy to work around. Note
>> I should probably use coalesce() and not repartition().
>> 
>> In general I found I almost always need to repartition. I was getting
>> thousands of empty partitions, and it was really slowing my system down.
>> 
>>    // imports needed: org.apache.spark.streaming.api.java.JavaDStream,
>>    // org.apache.spark.api.java.JavaRDD, org.apache.spark.streaming.Time,
>>    // org.apache.spark.api.java.function.VoidFunction2, java.io.File
>>    private static void save(JavaDStream<String> json, String outputURIBase) {
>> 
>>        /*
>>        Using saveAsTextFiles() directly creates lots of empty directories,
>>        one per batch interval, even when there is no data:
>> 
>>            DStream<String> data = json.dstream();
>>            data.saveAsTextFiles(outputURI, null);
>>        */
>> 
>>        json.foreachRDD(new VoidFunction2<JavaRDD<String>, Time>() {
>>            private static final long serialVersionUID = 1L;
>> 
>>            @Override
>>            public void call(JavaRDD<String> rdd, Time time) throws Exception {
>>                long count = rdd.count();
>>                // only write output when the batch actually contains data
>>                if (count > 0) {
>>                    rdd = repartition(rdd, (int) count);
>>                    long milliSeconds = time.milliseconds();
>>                    String date = Utils.convertMillisecondsToDateStr(milliSeconds);
>>                    String dirPath = outputURIBase
>>                            + File.separator + date
>>                            + File.separator + "tweet-" + milliSeconds;
>>                    rdd.saveAsTextFile(dirPath);
>>                }
>>            }
>> 
>>            // product requirement: keep the CSV files small and user friendly
>>            final int maxNumRowsPerFile = 200;
>> 
>>            JavaRDD<String> repartition(JavaRDD<String> rdd, int count) {
>>                int numPartitions = count / maxNumRowsPerFile + 1;
>>                return rdd.repartition(numPartitions);
>>            }
>>        });
>>    }
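>> 
>> As noted above, swapping repartition() for coalesce() in the helper would
>> look like this and should produce the same file layout without the full
>> shuffle (a hedged suggestion, not something I have benchmarked):
>> 
>>            JavaRDD<String> repartition(JavaRDD<String> rdd, int count) {
>>                int numPartitions = count / maxNumRowsPerFile + 1;
>>                // coalesce() only merges existing partitions, so it is
>>                // cheaper when the partition count is being reduced
>>                return rdd.coalesce(numPartitions);
>>            }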
>> 
>> 
>> 
>> 
>> From:  Mich Talebzadeh <mich.talebza...@gmail.com>
>> Date:  Tuesday, April 5, 2016 at 3:06 PM
>> To:  "user @spark" <user@spark.apache.org>
>> Subject:  Saving Spark streaming RDD with saveAsTextFiles ends up creating
>> empty files on HDFS
>> 
>>> Spark 1.6.1
>>> 
>>> The following creates empty files. It prints lines OK with println
>>> 
>>> val result = lines.filter(_.contains("ASE 15"))
>>>   .flatMap(line => line.split("\n,"))
>>>   .map(word => (word, 1))
>>>   .reduceByKey(_ + _)
>>> result.saveAsTextFiles("/tmp/rdd_stuff")
>>> 
>>> I am getting zero-length files:
>>> 
>>> drwxr-xr-x   - hduser supergroup          0 2016-04-05 23:19
>>> /tmp/rdd_stuff-1459894755000
>>> drwxr-xr-x   - hduser supergroup          0 2016-04-05 23:20
>>> /tmp/rdd_stuff-1459894810000
>>> 
>>> Any ideas?
>>> 
>>> Thanks,
>>> 
>>> Dr Mich Talebzadeh
>>> 
>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> 
>>> http://talebzadehmich.wordpress.com
> 

