Can you increase the number of partitions and try again? Also, I don't think
you need to cache the DataFrames before saving them; you can do away with that
as well.
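
If it helps to pick a number, here is a rough sketch in plain Java that estimates a partition count from the input size. The class name, the method name, and the 64 MB-per-partition target are all my own assumptions for illustration. None of them is a Spark API or a Spark rule.

```java
// Hypothetical helper (not part of Spark): estimates how many partitions
// to ask for, given the input size and an assumed target partition size.
final class PartitionEstimate {

    private PartitionEstimate() {
        // static helper only, no instances
    }

    /**
     * Returns ceil(inputBytes / targetPartitionBytes), but never less than 1.
     *
     * @param inputBytes           size of the input data in bytes (must be >= 0)
     * @param targetPartitionBytes assumed desired size per partition (must be > 0)
     */
    static int partitionsFor(long inputBytes, long targetPartitionBytes) {
        if (inputBytes < 0) {
            throw new IllegalArgumentException("inputBytes must be >= 0");
        }
        if (targetPartitionBytes <= 0) {
            throw new IllegalArgumentException("targetPartitionBytes must be > 0");
        }
        // Integer ceiling division without going through floating point.
        long partitions = (inputBytes + targetPartitionBytes - 1) / targetPartitionBytes;
        // Always keep at least one partition, and stay within int range.
        return (int) Math.min(Integer.MAX_VALUE, Math.max(1L, partitions));
    }
}
```

For your 931 MB input and a 64 MB target this gives 15, compared with the 5 tasks your stage shows. You could then try something like `ratingDF.repartition(n).write().format("parquet").save(path)` and drop the two `cache()` calls.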

Raghav
On Oct 23, 2015 7:45 AM, "Ram VISWANADHA" <ram.viswana...@dailymotion.com>
wrote:

> Hi ,
> I am trying to load a 931 MB file into an RDD, then create a DataFrame and
> store the data in a Parquet file. The Parquet save call hangs. I have set
> the timeout to 1800, but the system still fails to respond. I can’t spot
> any errors in my code. Can someone help me?
> Thanks in advance.
>
> Environment
>
>    1. OS X 10.10.5 with 8G RAM
>    2. JDK 1.8.0_60
>
> Code
>
> final SQLContext sqlContext = new SQLContext(jsc);
> //convert user viewing history to ratings (hash user_id to int)
> JavaRDD<Rating> ratingJavaRDD = createMappedRatingsRDD(jsc);
> //for testing with 2d_full.txt data
> //JavaRDD<Rating> ratingJavaRDD = createMappedRatingRDDFromFile(jsc);
> JavaRDD<Row> ratingRowsRDD = ratingJavaRDD.map(new GenericRowFromRating());
> ratingRowsRDD.cache();
>
> //This line saves the files correctly
>
> ratingJavaRDD.saveAsTextFile("file:///Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings_rdd");
>
> final DataFrame ratingDF = sqlContext.createDataFrame(ratingRowsRDD,
> getStructTypeForRating());
> ratingDF.registerTempTable("rating_db");
> ratingDF.show();
> ratingDF.cache();
>
> //this line hangs
>
> ratingDF.write().format("parquet").save(
> "file:///Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings.parquet"
> );
>
>
> wks-195:rec-spark-java-poc r.viswanadha$ ls -lah
> /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings_rdd/part-0000*
>
> -rw-r--r--  1 r.viswanadha  staff   785K Oct 22 18:55
> /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings_rdd/part-00000
>
> -rw-r--r--  1 r.viswanadha  staff   790K Oct 22 18:55
> /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings_rdd/part-00001
>
> -rw-r--r--  1 r.viswanadha  staff   786K Oct 22 18:55
> /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings_rdd/part-00002
>
> -rw-r--r--  1 r.viswanadha  staff   796K Oct 22 18:55
> /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings_rdd/part-00003
>
> -rw-r--r--  1 r.viswanadha  staff   791K Oct 22 18:55
> /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings_rdd/part-00004
>
> wks-195:rec-spark-java-poc r.viswanadha$ ls -lah
> /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings.parquet/_temporary/0/
>
> The only thing that is saved is the temporary part file
>
> wks-195:rec-spark-java-poc r.viswanadha$ ls -lah
> /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings.parquet/_temporary/0/task_201510221857_0007_m_000000/
>
> total 336
>
> drwxr-xr-x  4 r.viswanadha  staff   136B Oct 22 18:57 .
>
> drwxr-xr-x  4 r.viswanadha  staff   136B Oct 22 18:57 ..
>
> -rw-r--r--  1 r.viswanadha  staff   1.3K Oct 22 18:57
> .part-r-00000-65562f67-357c-4645-8075-13b733a71ee5.gz.parquet.crc
>
> -rw-r--r--  1 r.viswanadha  staff   163K Oct 22 18:57
> part-r-00000-65562f67-357c-4645-8075-13b733a71ee5.gz.parquet
>
>
> Active Stages (1)
>
> Stage Id: 7
> Description: save at Recommender.java:549
> <http://localhost:4040/stages/stage?id=7&attempt=0>
> Submitted: 2015/10/22 18:57:15
> Duration: 17 min
> Tasks (Succeeded/Total): 1/5
> Input, Output, Shuffle Read, Shuffle Write: 9.4 MB (only one value shown)
>
> Best Regards,
> Ram
>
>
