Re: distinct on huge dataset

2014-04-17 Thread Mayur Rustagi
scala:161)
>>>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
>>>> at org.apache.spark.scheduler.Task.run(Task.scala:53)
>>>> at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
>>>> at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
>>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>> at java.lang.Thread.run(Thread.java:662)
>>>>
>>>> --
>>>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3084.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: distinct on huge dataset

2014-04-17 Thread Ryan Compton
ask.run(Task.scala:53)
>>> at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
>>> at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
>>> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>> at java.lang.Thread.run(Thread.java:662)

Re: distinct on huge dataset

2014-04-17 Thread Ryan Compton
tor.scala:213)
>> at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecu

Re: distinct on huge dataset

2014-03-24 Thread Aaron Davidson
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.T

Re: distinct on huge dataset

2014-03-24 Thread Kane
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3084.html

Re: distinct on huge dataset

2014-03-23 Thread Aaron Davidson
:
> Yes, there was an error in data, after fixing it - count fails with Out of
> Memory Error.

Re: distinct on huge dataset

2014-03-23 Thread Kane
Yes, there was an error in the data; after fixing it, count fails with an Out of Memory Error.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3051.html

Re: distinct on huge dataset

2014-03-23 Thread Aaron Davidson
ote:
>
>> But i was wrong - map also fails on big file and setting spark.shuffle.spill
>> doesn't help. Map fails with the same error.

Re: distinct on huge dataset

2014-03-22 Thread Andrew Ash
"Kane" wrote:
> But i was wrong - map also fails on big file and setting spark.shuffle.spill
> doesn't help. Map fails with the same error.

Re: distinct on huge dataset

2014-03-22 Thread Kane
But I was wrong: map also fails on the big file, and setting spark.shuffle.spill doesn't help. Map fails with the same error.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3039.html

Re: distinct on huge dataset

2014-03-22 Thread Kane
Yes, that helped; at least it was able to advance a bit further.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3038.html

Re: distinct on huge dataset

2014-03-22 Thread Aaron Davidson

Re: distinct on huge dataset

2014-03-22 Thread Kane
I mean everything works with the small file. With the huge file, only count and map work; distinct doesn't.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3034.html

Re: distinct on huge dataset

2014-03-22 Thread Kane
Yes, it works with the smaller file: it can count and map, but not distinct.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3033.html

Re: distinct on huge dataset

2014-03-22 Thread Mayur Rustagi

Re: distinct on huge dataset

2014-03-22 Thread Ryan Compton

Re: distinct on huge dataset

2014-03-22 Thread Kane
It's 0.9.0

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3027.html

Re: distinct on huge dataset

2014-03-21 Thread Aaron Davidson
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025.html

distinct on huge dataset

2014-03-21 Thread Kane
olExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-hu