Re: distinct on huge dataset
Preferably, increase the ulimit on your machines. Spark needs to access a lot of small shuffle files at once, so the number of open file handles is hard to control.

— Sent from Mailbox

On Fri, Apr 18, 2014 at 3:59 AM, Ryan Compton wrote:
> Btw, I've got System.setProperty("spark.shuffle.consolidate.files", "true") and use ext3 (CentOS...)
Re: distinct on huge dataset
Btw, I've got System.setProperty("spark.shuffle.consolidate.files", "true") set and I'm on ext3 (CentOS...).

On Thu, Apr 17, 2014 at 3:20 PM, Ryan Compton wrote:
> Does this continue in newer versions? (I'm on 0.8.0 now)
Re: distinct on huge dataset
Does this continue in newer versions? (I'm on 0.8.0 now)

When I use .distinct() on moderately large datasets (224GB, 8.5B rows, I'm guessing about 500M are distinct) my jobs fail with:

14/04/17 15:04:02 INFO cluster.ClusterTaskSetManager: Loss was due to java.io.FileNotFoundException
java.io.FileNotFoundException: /tmp/spark-local-20140417145643-a055/3c/shuffle_1_218_1157 (Too many open files)

ulimit -n tells me I can open 32000 files. Here's a plot of lsof on a worker node during a failed .distinct(): http://i.imgur.com/wyBHmzz.png ; you can see tasks fail when Spark tries to open 32000 files.

I never ran into this in 0.7.3. Is there a parameter I can set to tell Spark to use fewer than 32000 files?

On Mon, Mar 24, 2014 at 10:23 AM, Aaron Davidson wrote:
> Look up setting ulimit, though note the distinction between soft and hard limits.
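For intuition about where a number like 32000 can come from: with the pre-consolidation hash shuffle, each map task writes one file per reduce partition, so a worker has roughly (concurrently running tasks) x (reduce partitions) files open while writing shuffle output. A back-of-the-envelope sketch in Python; every figure below is a hypothetical illustration, not a measurement from Ryan's cluster:

```python
# Rough model of hash-shuffle file counts. All numbers are made-up
# examples chosen to show how the open-file count can approach a
# 32000 ulimit; they are not taken from this thread's cluster.

dataset_gb = 224
block_mb = 128                                # assumed input block size
map_tasks = dataset_gb * 1024 // block_mb     # ~one map task per block

# distinct() with no argument keeps the parent's partition count,
# so assume as many reduce partitions as map tasks:
reduce_partitions = map_tasks

# Total shuffle files written across the whole job (no consolidation):
total_files = map_tasks * reduce_partitions

# Files open at once on one worker while its tasks write shuffle output:
concurrent_tasks_per_node = 16                # assumed cores per worker
open_files_per_node = concurrent_tasks_per_node * reduce_partitions

print(map_tasks)            # 1792
print(total_files)          # 3211264
print(open_files_per_node)  # 28672 -- uncomfortably close to 32000
```

Under this model, either consolidating shuffle files or passing a smaller partition count to distinct(n) directly shrinks the per-node open-file count.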
Re: distinct on huge dataset
Look up setting ulimit, though note the distinction between soft and hard limits, and that updating your hard limit may require changing /etc/security/limits.conf and restarting each worker.

On Mon, Mar 24, 2014 at 1:39 AM, Kane wrote:
> Got a bit further, I think the out of memory error was caused by setting spark.spill to false.
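One way to check which nofile limits a worker process actually inherited is to read them from inside a process, which mirrors what `ulimit -Sn` / `ulimit -Hn` report. A small Python sketch; the limits.conf values in the comment are placeholder examples, not recommendations:

```python
# Inspect the open-file (nofile) limits the current process inherited.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit:", soft)
print("hard limit:", hard)

# A process may raise its own soft limit up to the hard limit without
# root privileges:
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

# Raising the *hard* limit cluster-wide means editing
# /etc/security/limits.conf on each node (placeholder user and values
# below) and restarting each worker:
#   sparkuser  soft  nofile  131072
#   sparkuser  hard  nofile  131072
```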
Re: distinct on huge dataset
Got a bit further, I think the out of memory error was caused by setting spark.spill to false. Now I have this error; is there an easy way to increase the file limit for Spark, cluster-wide?

java.io.FileNotFoundException: /tmp/spark-local-20140324074221-b8f1/01/temp_1ab674f9-4556-4239-9f21-688dfc9f17d2 (Too many open files)
        at java.io.FileOutputStream.openAppend(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:192)
        at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:113)
        at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:174)
        at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:191)
        at org.apache.spark.util.collection.ExternalAppendOnlyMap.insert(ExternalAppendOnlyMap.scala:141)
        at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:59)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:94)
        at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471)
        at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
        at org.apache.spark.scheduler.Task.run(Task.scala:53)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
        at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3084.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: distinct on huge dataset
Ah, interesting. count() without distinct is streaming and does not require that a single partition fit in memory, for instance. That said, the behavior may change if you increase the number of partitions in your input RDD by using RDD.repartition().

On Sun, Mar 23, 2014 at 11:47 AM, Kane wrote:
> Yes, there was an error in the data; after fixing it, count fails with an Out of Memory Error.
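The repartition suggestion is about shrinking how much state any single reduce task must hold. A quick arithmetic sketch in Python, using the row counts Ryan reported earlier in the thread (the partition counts themselves are hypothetical):

```python
# Why more partitions can help a memory-bound shuffle: the state each
# reduce task must hold scales roughly as 1/N. The partition counts
# below are illustrative, not from this thread.

rows = 8_500_000_000            # ~8.5B input rows (Ryan's figure)
distinct_rows = 500_000_000     # his ~500M distinct estimate

def rows_per_partition(total, n_partitions):
    """Approximate rows a single task must handle."""
    return total // n_partitions

print(rows_per_partition(rows, 2_000))             # 4250000 per map task
print(rows_per_partition(distinct_rows, 2_000))    # 250000 per reducer
print(rows_per_partition(distinct_rows, 20_000))   # 25000 per reducer
```

Note the trade-off against the earlier "Too many open files" problem: more reduce partitions shrink per-task memory but multiply the number of shuffle files.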
Re: distinct on huge dataset
Yes, there was an error in the data; after fixing it, count fails with an Out of Memory Error.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3051.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: distinct on huge dataset
Andrew, this should be fixed in 0.9.1, assuming it is the same hash collision error we found there. Kane, is it possible your bigger data is corrupt, such that any operations on it fail?

On Sat, Mar 22, 2014 at 10:39 PM, Andrew Ash wrote:
> FWIW I've seen correctness errors with spark.shuffle.spill on 0.9.0 and have it disabled now.
Re: distinct on huge dataset
FWIW, I've seen correctness errors with spark.shuffle.spill on 0.9.0 and have it disabled now. The specific error behavior was that a join would consistently return one count of rows with spill enabled and another count with it disabled.

Sent from my mobile phone

On Mar 22, 2014 1:52 PM, "Kane" wrote:
> But I was wrong: map also fails on the big file, and setting spark.shuffle.spill doesn't help.
Re: distinct on huge dataset
But I was wrong: map also fails on the big file, and setting spark.shuffle.spill doesn't help; map fails with the same error.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3039.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: distinct on huge dataset
Yes, that helped; at least it was able to advance a bit further.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3038.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: distinct on huge dataset
This could be related to the hash collision bug in ExternalAppendOnlyMap in 0.9.0: https://spark-project.atlassian.net/browse/SPARK-1045

You might try setting spark.shuffle.spill to false and see if that runs any longer (turning off shuffle spill is dangerous, though, as it may cause Spark to OOM if your reduce partitions are too large).

On Sat, Mar 22, 2014 at 10:00 AM, Kane wrote:
> I mean everything works with the small file. With the huge file, only count and map work; distinct doesn't.
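For intuition about what the spill machinery being discussed does: values are combined in an in-memory map, and when it grows past a threshold the map is written out to a temp file and cleared, with the spill files merged at the end. Each spill costs a file handle (hence "Too many open files"), while disabling spill means the whole partition's map must fit in memory (hence OOM). A toy Python model only; Spark's ExternalAppendOnlyMap is far more involved (sorted spills, hash-based merging, and so on):

```python
# Toy model of a spilling append-only combine map. Illustration only,
# not Spark's implementation.
import json
import os
import tempfile
from collections import defaultdict

class SpillingMap:
    def __init__(self, max_in_memory=1000):
        self.max_in_memory = max_in_memory
        self.current = defaultdict(int)
        self.spill_files = []

    def insert(self, key, value):
        self.current[key] += value
        if len(self.current) >= self.max_in_memory:
            self._spill()

    def _spill(self):
        # Each spill writes (and later reopens) one more file: with a
        # low ulimit and frequent spills, handles run out quickly.
        fd, path = tempfile.mkstemp(suffix=".spill")
        with os.fdopen(fd, "w") as f:
            json.dump(self.current, f)
        self.spill_files.append(path)
        self.current = defaultdict(int)

    def result(self):
        # Merge in-memory contents with every spill file.
        merged = defaultdict(int, self.current)
        for path in self.spill_files:
            with open(path) as f:
                for k, v in json.load(f).items():
                    merged[int(k)] += v      # JSON keys come back as str
            os.remove(path)
        return dict(merged)

m = SpillingMap(max_in_memory=10)
for i in range(100):
    m.insert(i % 25, 1)        # 25 distinct keys, 100 inserts total
counts = m.result()
print(len(counts))             # 25
print(counts[0])               # 4
```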
Re: distinct on huge dataset
I mean everything works with the small file. With the huge file, only count and map work; distinct doesn't.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3034.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: distinct on huge dataset
Yes, it works with the smaller file; it can count and map, but not distinct.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3033.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: distinct on huge dataset
Does it work on a smaller file?

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>

On Sat, Mar 22, 2014 at 4:50 AM, Ryan Compton wrote:
> Does it work without .distinct()?
Re: distinct on huge dataset
Does it work without .distinct()?

Possibly related issue I ran into: https://mail-archives.apache.org/mod_mbox/spark-user/201401.mbox/%3CCAMgYSQ-3YNwD=veb1ct9jro_jetj40rj5ce_8exgsrhm7jb...@mail.gmail.com%3E

On Sat, Mar 22, 2014 at 12:45 AM, Kane wrote:
> It's 0.9.0
Re: distinct on huge dataset
It's 0.9.0

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3027.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: distinct on huge dataset
Which version of Spark are you running?

On Fri, Mar 21, 2014 at 10:45 PM, Kane wrote:
> I have a huge 2.5TB file. When I run:
> val t = sc.textFile("/user/hdfs/dump.csv")
> t.distinct.count
> It fails right away with a lot of java.lang.ArrayIndexOutOfBoundsException errors.
distinct on huge dataset
I have a huge 2.5TB file. When I run:

val t = sc.textFile("/user/hdfs/dump.csv")
t.distinct.count

It fails right away with a lot of:

Loss was due to java.lang.ArrayIndexOutOfBoundsException
java.lang.ArrayIndexOutOfBoundsException: 1
        at $line59.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:18)
        at $line59.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:16)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:94)
        at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471)
        at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
        at org.apache.spark.scheduler.Task.run(Task.scala:53)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:46)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:45)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
        at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
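An ArrayIndexOutOfBoundsException inside `$anonfun$1.apply` points at a user closure indexing past the end of a split line, which fits Kane's later finding that there was an error in the data. A defensive-parse sketch of the usual fix, written in Python for illustration; the 2-field schema and comma delimiter are assumptions, not details from dump.csv:

```python
# Skip malformed rows instead of letting one bad line kill the job.
# Field count and delimiter below are made-up illustration values.

def parse_row(line, expected_fields=2, sep=","):
    """Return the parsed fields, or None for rows that don't fit."""
    fields = line.split(sep)
    if len(fields) != expected_fields:
        return None                 # malformed: wrong field count
    return tuple(fields)

lines = ["a,1", "b,2", "brokenrow", "a,1"]
parsed = [p for p in (parse_row(l) for l in lines) if p is not None]
distinct = set(parsed)

print(len(parsed))    # 3
print(len(distinct))  # 2
```

In Spark terms, this is a flatMap that yields zero or one record per input line before calling distinct, rather than indexing the result of split directly.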