Can you increase the number of partitions and try again? Also, I don't think you need to cache the DataFrames before saving them. You can do away with that as well.
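To put a number on "more partitions", here is a rough sketch of one way to pick a count from the input size. The class name `PartitionEstimate`, the method `partitionsFor`, and the 64 MB target per partition are all assumptions made up for illustration; they are not Spark defaults and not something taken from your code.

```java
// Hypothetical helper: estimates how many partitions to ask for so that
// each one holds roughly targetBytesPerPartition of input.
final class PartitionEstimate {
    private PartitionEstimate() {}

    static int partitionsFor(long inputBytes, long targetBytesPerPartition, int minPartitions) {
        if (inputBytes < 0 || targetBytesPerPartition <= 0 || minPartitions < 1) {
            throw new IllegalArgumentException("sizes must be positive and minPartitions >= 1");
        }
        // Ceiling division without risking overflow on the addition.
        long n = inputBytes / targetBytesPerPartition
                + (inputBytes % targetBytesPerPartition == 0 ? 0 : 1);
        // Never go below the current partition count, and stay within int range.
        return (int) Math.max(minPartitions, Math.min(n, Integer.MAX_VALUE));
    }
}
```

For the 931MB input and an assumed 64 MB target, `PartitionEstimate.partitionsFor(931L * 1024 * 1024, 64L * 1024 * 1024, 5)` gives 15, compared with the 5 tasks the stage shows now. The result could then be passed to `ratingRowsRDD.repartition(n)` before `createDataFrame`, with the `ratingRowsRDD.cache()` and `ratingDF.cache()` calls removed. Whether that is enough to stop the hang is something you would have to try.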
Raghav

On Oct 23, 2015 7:45 AM, "Ram VISWANADHA" <ram.viswana...@dailymotion.com> wrote:

> Hi,
> I am trying to load 931MB file into an RDD, then create a DataFrame and
> store the data in a Parquet file. The save method of Parquet file is
> hanging. I have set the timeout to 1800 but still the system fails to
> respond and hangs. I can’t spot any errors in my code. Can someone help me?
> Thanks in advance.
>
> Environment
>
> 1. OS X 10.10.5 with 8G RAM
> 2. JDK 1.8.0_60
>
> Code
>
> final SQLContext sqlContext = new SQLContext(jsc);
> //convert user viewing history to ratings (hash user_id to int)
> JavaRDD<Rating> ratingJavaRDD = createMappedRatingsRDD(jsc);
> //for testing with 2d_full.txt data
> //JavaRDD<Rating> ratingJavaRDD = createMappedRatingRDDFromFile(jsc);
> JavaRDD<Row> ratingRowsRDD = ratingJavaRDD.map(new GenericRowFromRating());
> ratingRowsRDD.cache();
>
> //This line saves the files correctly
>
> ratingJavaRDD.saveAsTextFile("file:///Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings_rdd");
>
> final DataFrame ratingDF = sqlContext.createDataFrame(ratingRowsRDD,
>     getStructTypeForRating());
> ratingDF.registerTempTable("rating_db");
> ratingDF.show();
> ratingDF.cache();
>
> //this line hangs
>
> ratingDF.write().format("parquet").save(
>     "file:///Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings.parquet"
> );
>
> wks-195:rec-spark-java-poc r.viswanadha$ ls -lah /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings_rdd/part-0000*
> -rw-r--r-- 1 r.viswanadha staff 785K Oct 22 18:55 /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings_rdd/part-00000
> -rw-r--r-- 1 r.viswanadha staff 790K Oct 22 18:55 /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings_rdd/part-00001
> -rw-r--r-- 1 r.viswanadha staff 786K Oct 22 18:55 /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings_rdd/part-00002
> -rw-r--r-- 1 r.viswanadha staff 796K Oct 22 18:55 /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings_rdd/part-00003
> -rw-r--r-- 1 r.viswanadha staff 791K Oct 22 18:55 /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings_rdd/part-00004
>
> wks-195:rec-spark-java-poc r.viswanadha$ ls -lah /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings.parquet/_temporary/0/
>
> The only thing that is saved is the temporary part file
>
> wks-195:rec-spark-java-poc r.viswanadha$ ls -lah /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings.parquet/_temporary/0/task_201510221857_0007_m_000000/
> total 336
> drwxr-xr-x 4 r.viswanadha staff 136B Oct 22 18:57 .
> drwxr-xr-x 4 r.viswanadha staff 136B Oct 22 18:57 ..
> -rw-r--r-- 1 r.viswanadha staff 1.3K Oct 22 18:57 .part-r-00000-65562f67-357c-4645-8075-13b733a71ee5.gz.parquet.crc
> -rw-r--r-- 1 r.viswanadha staff 163K Oct 22 18:57 part-r-00000-65562f67-357c-4645-8075-13b733a71ee5.gz.parquet
>
> Active Stages (1)
>
> Stage Id  Description  Submitted  Duration  Tasks: Succeeded/Total  Input  Output  Shuffle Read  Shuffle Write
>
> 7  (kill) <http://localhost:4040/stages/stage/kill/?id=7&terminate=true>
>    save at Recommender.java:549 <http://localhost:4040/stages/stage?id=7&attempt=0>
>    +details <http://localhost:4040/storage/rdd?id=15>
>    2015/10/22 18:57:15  17 min  1/5  9.4 MB
>
> Best Regards,
> Ram