Thanks Andy. Do we know whether this is a known bug, or is it simply behaviour by design, meaning that on the face of it Spark cannot save streaming RDD output to a text file?
Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 5 April 2016 at 23:35, Andy Davidson <a...@santacruzintegration.com> wrote:

> Hi Mich
>
> Yup, I was surprised to find empty files. It's easy to work around. Note I
> should probably use coalesce() and not repartition().
>
> In general I have found I almost always need to repartition. I was getting
> thousands of empty partitions, and it was really slowing my system down.
>
> private static void save(JavaDStream<String> json, String outputURIBase) {
>
>     /*
>      * Using saveAsTextFiles() causes lots of empty directories to be created:
>      *
>      *     DStream<String> data = json.dstream();
>      *     data.saveAsTextFiles(outputURI, null);
>      */
>
>     json.foreachRDD(new VoidFunction2<JavaRDD<String>, Time>() {
>
>         private static final long serialVersionUID = 1L;
>
>         final int maxNumRowsPerFile = 200;
>
>         @Override
>         public void call(JavaRDD<String> rdd, Time time) throws Exception {
>             long count = rdd.count();
>             // if (!rdd.isEmpty()) {
>             if (count > 0) {
>                 rdd = repartition(rdd, (int) count);
>                 long milliSeconds = time.milliseconds();
>                 String date = Utils.convertMillisecondsToDateStr(milliSeconds);
>                 String dirPath = outputURIBase
>                         + File.separator + date
>                         + File.separator + "tweet-" + milliSeconds;
>                 rdd.saveAsTextFile(dirPath);
>             }
>         }
>
>         JavaRDD<String> repartition(JavaRDD<String> rdd, int count) {
>             // one partition per maxNumRowsPerFile rows, rounded up
>             int numPartitions = count / maxNumRowsPerFile + 1;
>             return rdd.repartition(numPartitions);
>         }
>     });
> }
>
> From: Mich Talebzadeh <mich.talebza...@gmail.com>
> Date: Tuesday, April 5, 2016 at 3:06 PM
> To: "user @spark" <user@spark.apache.org>
> Subject: Saving Spark streaming RDD with saveAsTextFiles ends up creating
> empty files on HDFS
>
> Spark 1.6.1
>
> The following creates empty files, although it prints the lines OK with
> println:
>
> val result = lines.filter(_.contains("ASE 15")).flatMap(line =>
>     line.split("\n,")).map(word => (word, 1)).reduceByKey(_ + _)
> result.saveAsTextFiles("/tmp/rdd_stuff")
>
> I am getting zero-length files:
>
> drwxr-xr-x - hduser supergroup 0 2016-04-05 23:19 /tmp/rdd_stuff-1459894755000
> drwxr-xr-x - hduser supergroup 0 2016-04-05 23:20 /tmp/rdd_stuff-1459894810000
>
> Any ideas?
>
> Thanks,
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
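For what it's worth, the same guard against empty batches can be sketched in Scala directly against the snippet above. This is a minimal sketch, not a tested solution: it assumes lines is the same DStream[String] as in the original code, and it uses foreachRDD with isEmpty() plus coalesce(1), per Andy's note that coalesce() is preferable to repartition() when only shrinking the partition count:

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time

// Same word count as in the original snippet.
val result = lines.filter(_.contains("ASE 15"))
  .flatMap(line => line.split("\n,"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Save each batch explicitly, skipping empty RDDs, so that no
// empty output directories are created on HDFS.
result.foreachRDD { (rdd: RDD[(String, Int)], time: Time) =>
  if (!rdd.isEmpty()) {
    // coalesce() avoids the full shuffle that repartition() would trigger
    rdd.coalesce(1).saveAsTextFile(s"/tmp/rdd_stuff-${time.milliseconds}")
  }
}

Note that saveAsTextFiles() itself writes one directory per batch interval regardless of content, so whether those directories end up empty comes down to whether the filter actually matches anything in that interval.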