Re: distinct on huge dataset

2014-03-22 Thread Andrew Ash
FWIW, I've seen correctness errors with spark.shuffle.spill on 0.9.0 and
have it disabled now. The specific error behavior was that a join would
consistently return one row count with spill enabled and a different count
with it disabled.

Sent from my mobile phone
On Mar 22, 2014 1:52 PM, "Kane"  wrote:

> But I was wrong - map also fails on the big file, and setting
> spark.shuffle.spill doesn't help. Map fails with the same error.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3039.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Spark Job stuck

2014-03-22 Thread Usman Ghani
Were you able to figure out what this was?

You can try setting spark.akka.askTimeout to a larger value. That might
help.
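
For reference, here is a minimal sketch (mine, not from this thread) of
raising those timeouts in a 0.9.x driver before the SparkContext is created.
The property names are the ones I believe govern the ask timeout and the
"no recent heart beats ... exceeds 45000ms" threshold quoted below, but
please verify them against the docs for your version; the master URL, app
name, and values are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://master-host:7077")                     // placeholder master URL
  .setAppName("my-job")                                      // placeholder app name
  .set("spark.akka.askTimeout", "120")                       // in seconds
  .set("spark.storage.blockManagerSlaveTimeoutMs", "300000") // heartbeat threshold, in ms
val sc = new SparkContext(conf)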


On Thu, Mar 20, 2014 at 10:24 PM, mohit.goyal wrote:

> Hi,
>
> I ran a Spark application to process ~14GB of input data with 10GB of
> executor memory. The job got stuck with the message below:
>
> 14/03/21 05:02:07 WARN storage.BlockManagerMasterActor: Removing
> BlockManager BlockManagerId(0, guavus-0102bf, 49347, 0) with no recent
> heart beats: 85563ms exceeds 45000ms
>
> But the job completed successfully if I increased executor memory to 40GB.
>
> Any ideas?
>
> Thanks
> Mohit Goyal
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Sprak-Job-stuck-tp2979.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Largest input data set observed for Spark.

2014-03-22 Thread Usman Ghani
I am having similar issues with much smaller data sets. I am using the Spark
EC2 scripts to launch clusters, but I almost always end up with straggling
executors that take over a node's CPU and memory and never finish.



On Thu, Mar 20, 2014 at 1:54 PM, Soila Pertet Kavulya wrote:

> Hi Reynold,
>
> Nice! What Spark configuration parameters did you use to get your job to
> run successfully on a large dataset? My job is failing on 1TB of input data
> (uncompressed) on a 4-node cluster (64GB memory per node). No OutOfMemory
> errors, just lost executors.
>
> Thanks,
>
> Soila
> On Mar 20, 2014 11:29 AM, "Reynold Xin"  wrote:
>
>> I'm not really at liberty to discuss details of the job. It involves some
>> expensive aggregated statistics, and took 10 hours to complete (mostly
>> bottlenecked by network & io).
>>
>>
>>
>>
>>
>> On Thu, Mar 20, 2014 at 11:12 AM, Surendranauth Hiraman <
>> suren.hira...@velos.io> wrote:
>>
>>> Reynold,
>>>
>>> How complex was that job (I guess in terms of number of transforms and
>>> actions) and how long did that take to process?
>>>
>>> -Suren
>>>
>>>
>>>
>>> On Thu, Mar 20, 2014 at 2:08 PM, Reynold Xin 
>>> wrote:
>>>
>>> > Actually we just ran a job with 70TB+ compressed data on 28 worker
>>> > nodes - I didn't count the size of the uncompressed data, but I am
>>> > guessing it is somewhere between 200TB and 700TB.
>>> >
>>> >
>>> >
>>> > On Thu, Mar 20, 2014 at 12:23 AM, Usman Ghani 
>>> wrote:
>>> >
>>> > > All,
>>> > > What is the largest input data set y'all have come across that has
>>> > > been successfully processed in production using Spark? Ballpark?
>>> > >
>>> >
>>>
>>>
>>>
>>> --
>>>
>>> SUREN HIRAMAN, VP TECHNOLOGY
>>> Velos
>>> Accelerating Machine Learning
>>>
>>> 440 NINTH AVENUE, 11TH FLOOR
>>> NEW YORK, NY 10001
>>> O: (917) 525-2466 ext. 105
>>> F: 646.349.4063
>>> E: suren.hiraman@velos.io
>>> W: www.velos.io
>>>
>>
>>


Re: Shark Table for >22 columns

2014-03-22 Thread Ali Ghodsi
Subacini, the short answer is that we don't really support that yet, but
the good news is that I can show you how to work around it.

The good thing is that internally we now convert the tuples to Seqs, so we
can leverage that. The bad thing is that before converting tuples to
sequences we extract the static types of the individual tuple fields. We
need those types when we create the table, in order to set up the schema
automatically during saveAsTable().

The way around it is to call the underlying API and supply the types of the
elements of the sequence (beware, this API might change in the future):

// assume "rdd" is of type RDD[Seq[Any]], where the Seq actually consists
of two elements, one Int and one String

val tableObject = new RDDTableFunctions(rdd, Seq(implicitly[ClassTag[Int]],
implicitly[ClassTag[String]]))
tableObject.saveAsTable("mySeqTable", Seq("my_int", "my_string"))
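
For a wider table, the same pattern just scales up. Here is a hedged sketch
(my own illustration, not tested against your data) that assumes an existing
SparkContext "sc"; the column names and row values are made up, and you would
list one ClassTag per column, in column order:

import scala.reflect.ClassTag
import shark.api.RDDTableFunctions

// Each row is a Seq[Any]; extend rows, colTags and colNames to 60 columns.
val rows = sc.parallelize(Seq(
  Seq[Any](1, "value B1", "value C1"),
  Seq[Any](2, "value B2", "value C2")))

val colTags: Seq[ClassTag[_]] = Seq(
  implicitly[ClassTag[Int]],
  implicitly[ClassTag[String]],
  implicitly[ClassTag[String]])
val colNames = Seq("col_a", "col_b", "col_c")

new RDDTableFunctions(rows, colTags).saveAsTable("wide_table", colNames)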

Hope that helps,
Best,
Ali



On Fri, Mar 21, 2014 at 4:53 PM, subacini Arunkumar wrote:

> Hi,
>
> I am able to successfully create a Shark table with 3 columns and 2 rows.
>
>
>  val recList = List((" value A1", "value B1","value C1"),
>  ("value A2", "value B2","value c2"));
>val dbFields =List ("Col A", "Col B","Col C")
> val rdd = sc.parallelize(recList)
> RDDTable(rdd).saveAsTable("table_1", dbFields)
>
>
> I have a scenario where the table will have 60 columns. How can I achieve
> this using RDDTable?
>
> I tried creating a List[(Seq[String],Seq[String])], but it throws the
> exception below. Any help/pointers would be appreciated.
>
> Exception in thread "main" shark.api.DataTypes$UnknownDataTypeException:
> scala.collection.Seq
> at shark.api.DataTypes.fromClassTag(DataTypes.java:133)
> at shark.util.HiveUtils$$anonfun$1.apply(HiveUtils.scala:106)
> at shark.util.HiveUtils$$anonfun$1.apply(HiveUtils.scala:105)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at shark.util.HiveUtils$.createTableInHive(HiveUtils.scala:105)
> at shark.api.RDDTableFunctions.saveAsTable(RDDTableFunctions.scala:63)
>
> Thanks
> Subacini
>


Re: worker keeps getting disassociated upon a failed job spark version 0.90

2014-03-22 Thread sam
I have this problem too.  Eventually the job fails (on the UI) and hangs the
terminal until I Ctrl+C.  (Logs below.)

Now the Spark docs explain that the heartbeat configuration can be tweaked
to handle GC hangs.  I'm wondering if this is symptomatic of pushing the
cluster a little too hard (we were also running an HDFS balance, which died
of an OOM).

What sort of values should I try increasing the configurables to?

14/03/22 21:45:47 ERROR scheduler.TaskSchedulerImpl: Lost executor 6 on
ip-172-31-0-126.ec2.internal: remote Akka client disassociated
14/03/22 21:45:47 INFO scheduler.TaskSetManager: Re-queueing tasks for 6
from TaskSet 1.0
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 720 (task 1.0:248)
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 722 (task 1.0:250)
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 686 (task 1.0:214)
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 698 (task 1.0:226)
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 707 (task 1.0:235)
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 709 (task 1.0:237)
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 694 (task 1.0:222)
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 688 (task 1.0:216)
14/03/22 21:45:47 INFO scheduler.DAGScheduler: Executor lost: 6 (epoch 5)
14/03/22 21:45:47 INFO storage.BlockManagerMasterActor: Trying to remove
executor 6 from BlockManagerMaster.
14/03/22 21:45:47 INFO client.AppClient$ClientActor: Executor updated:
app-20140322213226-0044/6 is now FAILED (Command exited with code 137)
14/03/22 21:45:47 INFO cluster.SparkDeploySchedulerBackend: Executor
app-20140322213226-0044/6 removed: Command exited with code 137
14/03/22 21:45:47 INFO client.AppClient$ClientActor: Executor added:
app-20140322213226-0044/9 on
worker-20140321161205-ip-172-31-0-126.ec2.internal-50034
(ip-172-31-0-126.ec2.internal:50034) with 8 cores
14/03/22 21:45:47 INFO cluster.SparkDeploySchedulerBackend: Granted executor
ID app-20140322213226-0044/9 on hostPort ip-172-31-0-126.ec2.internal:50034
with 8 cores, 13.5 GB RAM
14/03/22 21:45:47 INFO storage.BlockManagerMaster: Removed 6 successfully in
removeExecutor
14/03/22 21:45:47 INFO client.AppClient$ClientActor: Executor updated:
app-20140322213226-0044/9 is now RUNNING
14/03/22 21:45:49 INFO storage.BlockManagerMasterActor$BlockManagerInfo:
Added rdd_6_236 in memory on ec2-54-84-166-37.compute-1.amazonaws.com:56804
(size: 72.5 MB, free: 3.4 GB)
14/03/22 21:45:49 INFO scheduler.TaskSetManager: Starting task 1.0:216 as
TID 729 on executor 8: ec2-54-84-166-37.compute-1.amazonaws.com
(PROCESS_LOCAL)

 more stuff happens 

14/03/22 21:52:09 ERROR scheduler.TaskSchedulerImpl: Lost executor 12 on
ip-172-31-8-63.ec2.internal: remote Akka client disassociated
14/03/22 21:52:09 INFO scheduler.TaskSetManager: Re-queueing tasks for 12
from TaskSet 1.0
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 828 (task 1.0:339)
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 830 (task 1.0:305)
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 824 (task 1.0:302)
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 827 (task 1.0:313)
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 826 (task 1.0:338)
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 829 (task 1.0:311)
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 823 (task 1.0:314)
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 825 (task 1.0:312)
14/03/22 21:52:09 INFO scheduler.DAGScheduler: Executor lost: 12 (epoch 10)
14/03/22 21:52:09 INFO storage.BlockManagerMasterActor: Trying to remove
executor 12 from BlockManagerMaster.
14/03/22 21:52:09 INFO storage.BlockManagerMaster: Removed 12 successfully
in removeExecutor
14/03/22 21:52:10 INFO cluster.SparkDeploySchedulerBackend: Executor 11
disconnected, so removing it
14/03/22 21:52:10 ERROR scheduler.TaskSchedulerImpl: Lost executor 11 on
ec2-54-84-151-18.compute-1.amazonaws.com: remote Akka client disassociated
14/03/22 21:52:10 INFO scheduler.TaskSetManager: Re-queueing tasks for 11
from TaskSet 1.0
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 837 (task 1.0:331)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 831 (task 1.0:341)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 839 (task 1.0:347)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 836 (task 1.0:284)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 794 (task 1.0:271)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 838 (task 1.0:273)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 841 (task 1.0:296)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 840 (task 1.0:276)
14/03/22 21:52:10 INFO scheduler.DAGScheduler: Executor lost: 11 (epoch 11)
14/03/22 21:52:10 INFO storage.BlockManagerMasterActor: Trying to remove
executor 11 from BlockManagerMaster.
14/03/22 21:52:10 INFO storage.BlockManagerMaster: Removed

Re: distinct on huge dataset

2014-03-22 Thread Kane
But I was wrong - map also fails on the big file, and setting
spark.shuffle.spill doesn't help. Map fails with the same error.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3039.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: distinct on huge dataset

2014-03-22 Thread Kane
Yes, that helped, at least it was able to advance a bit further.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3038.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: distinct on huge dataset

2014-03-22 Thread Aaron Davidson
This could be related to the hash collision bug in ExternalAppendOnlyMap in
0.9.0: https://spark-project.atlassian.net/browse/SPARK-1045

You might try setting spark.shuffle.spill to false and see if that runs any
longer (turning off shuffle spill is dangerous, though, as it may cause
Spark to OOM if your reduce partitions are too large).
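
If it is useful, here is a minimal sketch (an assumption on my part, not
something from this thread) of turning spill off in a 0.9.x application
before the SparkContext is created; the master URL and app name are
placeholders:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://master-host:7077")  // placeholder master URL
  .setAppName("distinct-on-huge-dataset") // placeholder app name
  .set("spark.shuffle.spill", "false")    // works around SPARK-1045, but can OOM
                                          // if reduce partitions are too large
val sc = new SparkContext(conf)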



On Sat, Mar 22, 2014 at 10:00 AM, Kane  wrote:

> I mean everything works with the small file. With the huge file only count
> and map work; distinct doesn't.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3034.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Reply: unable to build spark - sbt/sbt: line 50: killed

2014-03-22 Thread Koert Kuipers
I have found that I am unable to build/test Spark with sbt and Java 6, but
using Java 7 works (and it compiles with Java target version 1.6, so the
binaries are usable from Java 6).


On Sat, Mar 22, 2014 at 3:11 PM, Bharath Bhushan wrote:

> Thanks for the reply. It turns out that my ubuntu-in-vagrant had only 512MB
> of RAM. Increasing it to 1024MB allowed the assembly to finish
> successfully. Peak usage was around 780MB.
>
> --
> To: user@spark.apache.org
> From: vboylin1...@gmail.com
> Subject: Reply: unable to build spark - sbt/sbt: line 50: killed
> Date: Sat, 22 Mar 2014 20:03:28 +0800
>
>
> A lot of memory is needed to build Spark; I think you should make Xmx
> larger, 2g for example.
> --
> From: Bharath Bhushan
> Sent: 2014/3/22 12:50
> To: user@spark.apache.org
> Subject: unable to build spark - sbt/sbt: line 50: killed
>
> I am getting the following error when trying to build Spark. I tried
> various sizes for -Xmx and other memory-related arguments to the java
> command line, but the assembly command still fails.
>
> $ sbt/sbt assembly
> ...
> [info] Compiling 298 Scala sources and 17 Java sources to
> /vagrant/spark-0.9.0-incubating-bin-hadoop2/core/target/scala-2.10/classes...
> sbt/sbt: line 50: 10202 Killed  java -Xmx1900m
> -XX:MaxPermSize=1000m -XX:ReservedCodeCacheSize=256m -jar ${JAR} "$@"
>
> Versions of software:
> Spark: 0.9.0 (hadoop2 binary)
> Scala: 2.10.3
> Ubuntu: Ubuntu 12.04.4 LTS - Linux vagrant-ubuntu-precise-64
> 3.2.0-54-generic
> Java: 1.6.0_45 (oracle java 6)
>
> I can still use the binaries in bin/ but I was just trying to check if
> "sbt/sbt assembly" works fine.
>
> -- Thanks
>


RE: Reply: unable to build spark - sbt/sbt: line 50: killed

2014-03-22 Thread Bharath Bhushan
Thanks for the reply. It turns out that my ubuntu-in-vagrant had only 512MB of
RAM. Increasing it to 1024MB allowed the assembly to finish successfully. Peak
usage was around 780MB.

To: user@spark.apache.org
From: vboylin1...@gmail.com
Subject: Reply: unable to build spark - sbt/sbt: line 50: killed
Date: Sat, 22 Mar 2014 20:03:28 +0800

A lot of memory is needed to build Spark; I think you should make Xmx larger,
2g for example.

From: Bharath Bhushan
Sent: 2014/3/22 12:50
To: user@spark.apache.org
Subject: unable to build spark - sbt/sbt: line 50: killed


I am getting the following error when trying to build Spark. I tried various
sizes for -Xmx and other memory-related arguments to the java command line,
but the assembly command still fails.

$ sbt/sbt assembly
...
[info] Compiling 298 Scala sources and 17 Java sources to 
/vagrant/spark-0.9.0-incubating-bin-hadoop2/core/target/scala-2.10/classes...
sbt/sbt: line 50: 10202 Killed  java -Xmx1900m 
-XX:MaxPermSize=1000m -XX:ReservedCodeCacheSize=256m -jar ${JAR} "$@"

Versions of software:
Spark: 0.9.0 (hadoop2 binary)
Scala: 2.10.3
Ubuntu: Ubuntu 12.04.4 LTS - Linux vagrant-ubuntu-precise-64 3.2.0-54-generic
Java: 1.6.0_45 (oracle java 6)

I can still use the binaries in bin/ but I was just trying to check if "sbt/sbt 
assembly" works fine.

-- Thanks
  

Re: distinct on huge dataset

2014-03-22 Thread Kane
I mean everything works with the small file. With the huge file only count
and map work; distinct doesn't.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3034.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: distinct on huge dataset

2014-03-22 Thread Kane
Yes, it works with the smaller file; it can count and map, but not distinct.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3033.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Configuring shuffle write directory

2014-03-22 Thread Tsai Li Ming
Hi,

Each of my worker nodes has its own unique spark.local.dir.

However, when I run spark-shell, the shuffle writes are always written to /tmp,
despite spark.local.dir being set when the worker node is started.

Specifying spark.local.dir in the driver program seems to override the
executors' setting. Is there a way to define it properly on the worker nodes?

Thanks!

Re: distinct on huge dataset

2014-03-22 Thread Mayur Rustagi
Does it work on a smaller file?

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 



On Sat, Mar 22, 2014 at 4:50 AM, Ryan Compton wrote:

> Does it work without .distinct() ?
>
> Possibly related issue I ran into:
>
> https://mail-archives.apache.org/mod_mbox/spark-user/201401.mbox/%3CCAMgYSQ-3YNwD=veb1ct9jro_jetj40rj5ce_8exgsrhm7jb...@mail.gmail.com%3E
>
> On Sat, Mar 22, 2014 at 12:45 AM, Kane  wrote:
> > It's 0.9.0
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3027.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Reply: unable to build spark - sbt/sbt: line 50: killed

2014-03-22 Thread 林武康
A lot of memory is needed to build Spark; I think you should make Xmx larger,
2g for example.

-Original Message-
From: "Bharath Bhushan" 
Sent: 2014/3/22 12:50
To: "user@spark.apache.org" 
Subject: unable to build spark - sbt/sbt: line 50: killed

I am getting the following error when trying to build Spark. I tried various
sizes for -Xmx and other memory-related arguments to the java command line,
but the assembly command still fails.

$ sbt/sbt assembly
...
[info] Compiling 298 Scala sources and 17 Java sources to 
/vagrant/spark-0.9.0-incubating-bin-hadoop2/core/target/scala-2.10/classes...
sbt/sbt: line 50: 10202 Killed  java -Xmx1900m 
-XX:MaxPermSize=1000m -XX:ReservedCodeCacheSize=256m -jar ${JAR} "$@"

Versions of software:
Spark: 0.9.0 (hadoop2 binary)
Scala: 2.10.3
Ubuntu: Ubuntu 12.04.4 LTS - Linux vagrant-ubuntu-precise-64 3.2.0-54-generic
Java: 1.6.0_45 (oracle java 6)

I can still use the binaries in bin/ but I was just trying to check if "sbt/sbt 
assembly" works fine.

-- Thanks

Yet another question on saving RDD into files

2014-03-22 Thread Jaonary Rabarisoa
Dear all,

As a Spark newbie, I need some help understanding how saving an RDD to a file
behaves. After reading the post on saving single files efficiently

http://apache-spark-user-list.1001560.n3.nabble.com/How-to-save-as-a-single-file-efficiently-td3014.html

I understand that each partition of the RDD is saved into a separate file,
is that right? And in order to get one single file, one should call
coalesce(1, shuffle = true), right?

The other use case I have is appending an RDD to an existing file. Is that
possible with Spark? More precisely, I have a map transformation whose results
vary over time, like a big time series: I need to store the results for
further analysis, but if I store the RDD in a different file each time I run
the computation I may end up with many little files. Pseudo code of my
process is as follows:

every timestamp do
RDD[Array[Double]].map -> RDD[(timestamp, Double)].save to the same file

What should be the best solution for this?

Best
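
A hedged sketch of one workaround (my own assumption, not from the thread):
since saveAsTextFile writes a directory of part files and cannot append to an
existing path, write each run into its own timestamped sub-directory instead.
The master, app name, and output path below are placeholders:

import org.apache.spark.SparkContext

val sc = new SparkContext("local", "append-example")  // placeholder master/app name
val values = sc.parallelize(Seq(1.0, 2.0, 3.0))       // stand-in for the real RDD[Double]

val ts = System.currentTimeMillis()                   // this run's timestamp
values.map(v => (ts, v))                              // RDD[(Long, Double)]
      .coalesce(1, shuffle = true)                    // one part file per run
      .saveAsTextFile("/tmp/results/run-" + ts)       // a new directory for each run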


Re: distinct on huge dataset

2014-03-22 Thread Ryan Compton
Does it work without .distinct() ?

Possibly related issue I ran into:
https://mail-archives.apache.org/mod_mbox/spark-user/201401.mbox/%3CCAMgYSQ-3YNwD=veb1ct9jro_jetj40rj5ce_8exgsrhm7jb...@mail.gmail.com%3E

On Sat, Mar 22, 2014 at 12:45 AM, Kane  wrote:
> It's 0.9.0
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3027.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: distinct on huge dataset

2014-03-22 Thread Kane
It's 0.9.0



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3027.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.