Thanks. Repartitioning to a smaller number of partitions seems to fix my issue, but I'll keep broadcasting in mind (droprows is an integer array with about 4 million entries).
On Wed, Aug 5, 2015 at 12:34 PM, Philip Weaver <philip.wea...@gmail.com> wrote: > How big is droprows? > > Try explicitly broadcasting it like this: > > val broadcastDropRows = sc.broadcast(dropRows) > > val valsrows = ... > .filter(x => !broadcastDropRows.value.contains(x._1)) > > - Philip > > > On Wed, Aug 5, 2015 at 11:54 AM, AlexG <swift...@gmail.com> wrote: > >> I'm trying to load a 1 Tb file whose lines i,j,v represent the values of a >> matrix given as A_{ij} = v so I can convert it to a Parquet file. Only >> some >> of the rows of A are relevant, so the following code first loads the >> triplets are text, splits them into Tuple3[Int, Int, Double], drops >> triplets >> whose rows should be skipped, then forms a Tuple2[Int, List[Tuple2[Int, >> Double]]] for each row (if I'm judging datatypes correctly). >> >> val valsrows = sc.textFile(valsinpath).map(_.split(",")). >> map(x => (x(1).toInt, (x(0).toInt, >> x(2).toDouble))). >> filter(x => !droprows.contains(x._1)). >> groupByKey. >> map(x => (x._1, x._2.toSeq.sortBy(_._1))) >> >> Spark hangs during a broadcast that occurs during the filter step >> (according >> to the Spark UI). The last two lines in the log before it pauses are: >> >> 5/08/05 18:50:10 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 >> in >> memory on 172.31.49.149:37643 (size: 4.6 KB, free: 113.8 GB) >> 15/08/05 18:50:10 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 >> in >> memory on 172.31.49.159:41846 (size: 4.6 KB, free: 113.8 GB) >> >> I've left Spark running for up to 17 minutes one time, and it never >> continues past this point. I'm using a cluster of 30 r3.8xlarge EC2 >> instances (244Gb, 32 cores) with spark in standalone mode with 220G >> executor >> and driver memory, and using the kyroserializer. >> >> Any ideas on what could be causing this hang? >> >> >> >> -- >> View this message in context: >> http://apache-spark-user-list.1001560.n3.nabble.com/spark-hangs-at-broadcasting-during-a-filter-tp24143.html >> Sent from the Apache Spark User List mailing list archive at Nabble.com. >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >> For additional commands, e-mail: user-h...@spark.apache.org >> >> >