Need help about how hadoop works.

2014-04-23 Thread Carter
Hi, I am a beginner with Hadoop and Spark, and I would like some help in
understanding how Hadoop works.

Suppose we have a cluster of 5 computers and install Spark on the cluster
WITHOUT Hadoop, and then we run this code on one computer:
val doc = sc.textFile("/home/scalatest.txt", 5)
doc.count
Can the count task be distributed to all 5 computers? Or is it only run by 5
parallel threads on the current computer?

On the other hand, if we install Hadoop on the cluster and upload the data
into HDFS, when running the same code will this count task be done by 25
threads?

Thank you very much for your help. 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Need-help-about-how-hadoop-works-tp4638.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: two calls of saveAsTextFile() have different results on the same RDD

2014-04-23 Thread Cheng Lian
Without caching, an RDD will be evaluated multiple times if referenced
multiple times by other RDDs. A silly example:

val text = sc.textFile("input.log")
val r1 = text.filter(_ startsWith "ERROR")
val r2 = text.map(_ split " ")
val r3 = (r1 ++ r2).collect()

Here the input file will be scanned twice unless you call .cache() on text.
So if your computation involves nondeterminism (e.g. random number), you
may get different results.
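
To see this concretely, here is a minimal sketch (assuming any SparkContext sc
and writable output paths; "out1"/"out2" are placeholders):

val r = sc.parallelize(1 to 5).map(_ => scala.util.Random.nextInt())
// Without cache(), each action re-evaluates the map, so the random numbers
// are drawn again and the two output directories can differ.
r.saveAsTextFile("out1")   // first evaluation
r.saveAsTextFile("out2")   // second evaluation -> possibly different values
// Calling r.cache() (and forcing it with, e.g., r.count()) before the saves
// would make both outputs identical.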


On Tue, Apr 22, 2014 at 11:30 AM, randylu randyl...@gmail.com wrote:

 it's ok when i call doc_topic_dist.cache() firstly.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/two-calls-of-saveAsTextFile-have-different-results-on-the-same-RDD-tp4578p4580.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: two calls of saveAsTextFile() have different results on the same RDD

2014-04-23 Thread Mayur Rustagi
Shouldn't the DAG optimizer optimize these routines? Sorry if it's a dumb
question :)


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi



On Wed, Apr 23, 2014 at 12:29 PM, Cheng Lian lian.cs@gmail.com wrote:

 Without caching, an RDD will be evaluated multiple times if referenced
 multiple times by other RDDs. A silly example:

 val text = sc.textFile("input.log")
 val r1 = text.filter(_ startsWith "ERROR")
 val r2 = text.map(_ split " ")
 val r3 = (r1 ++ r2).collect()

 Here the input file will be scanned twice unless you call .cache() on text.
 So if your computation involves nondeterminism (e.g. random number), you
 may get different results.


 On Tue, Apr 22, 2014 at 11:30 AM, randylu randyl...@gmail.com wrote:

 it's ok when i call doc_topic_dist.cache() firstly.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/two-calls-of-saveAsTextFile-have-different-results-on-the-same-RDD-tp4578p4580.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





Re: Need help about how hadoop works.

2014-04-23 Thread Mayur Rustagi
As long as the path is present and available on all machines, you should be
able to leverage distribution. HDFS is one way to make that happen; NFS is
another, and simple replication is another.
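
For example, a minimal sketch (the master URL is an assumption; the file must
exist at the same path on every node, via a shared mount, identical copies, or
an hdfs:// URI):

import org.apache.spark.SparkContext

val sc = new SparkContext("spark://master:7077", "CountExample")  // assumed URL
// Ask for 5 partitions; each partition becomes a task that can run on any
// worker, provided every worker can open the path locally.
val doc = sc.textFile("/home/scalatest.txt", 5)
println(doc.partitions.size)  // how many tasks the count is split into
println(doc.count())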


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi



On Wed, Apr 23, 2014 at 12:12 PM, Carter gyz...@hotmail.com wrote:

 Hi, I am a beginner of Hadoop and Spark, and want some help in
 understanding
 how hadoop works.

 If we have a cluster of 5 computers, and install Spark on the cluster
 WITHOUT Hadoop. And then we run the code on one computer:
 val doc = sc.textFile("/home/scalatest.txt", 5)
 doc.count
 Can the count task be distributed to all the 5 computers? Or it is only
 run by 5 parallel threads of the current computer?

 On the other hand, if we install Hadoop on the cluster and upload the data
 into HDFS, when running the same code will this count task be done by 25
 threads?

 Thank you very much for your help.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Need-help-about-how-hadoop-works-tp4638.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Spark runs applications in an inconsistent way

2014-04-23 Thread Mayur Rustagi
Very abstract.
EC2 is an unlikely culprit.
What are you trying to do? Spark is typically not inconsistent like that,
but huge intermediate data or reduce-size issues could be involved; it is hard
to help without some more detail of what you are trying to achieve.

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi



On Tue, Apr 22, 2014 at 7:30 PM, Aureliano Buendia buendia...@gmail.comwrote:

 Hi,

 Sometimes running the very same spark application binary, behaves
 differently with every execution.

- The Ganglia profile is different with every execution: sometimes it
takes 0.5 TB of memory, the next time it takes 1 TB of memory, the next
time it is 0.75 TB...
- Spark UI shows number of succeeded tasks is more than total number
of tasks, eg: 3500/3000. There are no failed tasks. At this stage the
computation keeps carrying on for a long time without returning an answer.
- The only way to get an answer from an application is to hopelessly
keep running that application multiple times, until by some luck it gets
converged.

 I was not able to regenerate this by a minimal code, as it seems some
 random factors affect this behavior. I have a suspicion, but I'm not sure,
 that use of one or more groupByKey() calls intensifies this problem.

 Another source of suspicion is the unpredicted performance of ec2 clusters
 with latency and io.

 Is this a known issue with spark?



Re: two calls of saveAsTextFile() have different results on the same RDD

2014-04-23 Thread Cheng Lian
Good question :)

Although RDD DAG is lazy evaluated, it’s not exactly the same as Scala lazy
val. For Scala lazy val, evaluated value is automatically cached, while
evaluated RDD elements are not cached unless you call .cache() explicitly,
because materializing an RDD can often be expensive. Take local file
reading as an analogy:

val v0 = sc.textFile("input.log").cache()

is similar to a lazy val

lazy val u0 = Source.fromFile("input.log").mkString

while

val v1 = sc.textFile("input.log")

is similar to a function

def u0 = Source.fromFile("input.log").mkString

Think it this way: if you want to “reuse” the evaluated elements, you have
to cache those elements somewhere. Without caching, you have to re-evaluate
the RDD, and the semantics of an uncached RDD simply downgrades to a
function rather than a lazy val.
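
The same distinction can be seen in plain Scala, as a minimal sketch (no Spark
needed):

lazy val a = { println("evaluated a"); 42 }  // evaluated once, result cached
def b = { println("evaluated b"); 42 }       // evaluated on every call,
                                             // like an uncached RDD
a + a   // prints "evaluated a" once
b + b   // prints "evaluated b" twice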


On Wed, Apr 23, 2014 at 4:00 PM, Mayur Rustagi mayur.rust...@gmail.comwrote:

 Shouldnt the dag optimizer optimize these routines. Sorry if its a dumb
 question :)


 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Wed, Apr 23, 2014 at 12:29 PM, Cheng Lian lian.cs@gmail.comwrote:

 Without caching, an RDD will be evaluated multiple times if referenced
 multiple times by other RDDs. A silly example:

 val text = sc.textFile("input.log")
 val r1 = text.filter(_ startsWith "ERROR")
 val r2 = text.map(_ split " ")
 val r3 = (r1 ++ r2).collect()

 Here the input file will be scanned twice unless you call .cache() on
 text. So if your computation involves nondeterminism (e.g. random
 number), you may get different results.


 On Tue, Apr 22, 2014 at 11:30 AM, randylu randyl...@gmail.com wrote:

 it's ok when i call doc_topic_dist.cache() firstly.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/two-calls-of-saveAsTextFile-have-different-results-on-the-same-RDD-tp4578p4580.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.






Re: two calls of saveAsTextFile() have different results on the same RDD

2014-04-23 Thread Cheng Lian
To experiment, try this in the Spark shell:

val r0 = sc.makeRDD(1 to 3, 1)
val r1 = r0.map { x =>
  println(x)
  x
}
val r2 = r1.map(_ * 2)
val r3 = r1.map(_ * 2 + 1)
(r2 ++ r3).collect()

You’ll see elements in r1 are printed (thus evaluated) twice. By adding
.cache() to r1, you’ll see those elements are printed only once.


On Wed, Apr 23, 2014 at 4:35 PM, Cheng Lian lian.cs@gmail.com wrote:

 Good question :)

 Although RDD DAG is lazy evaluated, it’s not exactly the same as Scala
 lazy val. For Scala lazy val, evaluated value is automatically cached,
 while evaluated RDD elements are not cached unless you call .cache()
 explicitly, because materializing an RDD can often be expensive. Take local
 file reading as an analogy:

 val v0 = sc.textFile("input.log").cache()

 is similar to a lazy val

 lazy val u0 = Source.fromFile("input.log").mkString

 while

 val v1 = sc.textFile("input.log")

 is similar to a function

 def u0 = Source.fromFile("input.log").mkString

 Think it this way: if you want to “reuse” the evaluated elements, you have
 to cache those elements somewhere. Without caching, you have to re-evaluate
 the RDD, and the semantics of an uncached RDD simply downgrades to a
 function rather than a lazy val.


 On Wed, Apr 23, 2014 at 4:00 PM, Mayur Rustagi mayur.rust...@gmail.comwrote:

 Shouldnt the dag optimizer optimize these routines. Sorry if its a dumb
 question :)


 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Wed, Apr 23, 2014 at 12:29 PM, Cheng Lian lian.cs@gmail.comwrote:

 Without caching, an RDD will be evaluated multiple times if referenced
 multiple times by other RDDs. A silly example:

 val text = sc.textFile("input.log")
 val r1 = text.filter(_ startsWith "ERROR")
 val r2 = text.map(_ split " ")
 val r3 = (r1 ++ r2).collect()

 Here the input file will be scanned twice unless you call .cache() on
 text. So if your computation involves nondeterminism (e.g. random
 number), you may get different results.


 On Tue, Apr 22, 2014 at 11:30 AM, randylu randyl...@gmail.com wrote:

 it's ok when i call doc_topic_dist.cache() firstly.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/two-calls-of-saveAsTextFile-have-different-results-on-the-same-RDD-tp4578p4580.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.







Re: ERROR TaskSchedulerImpl: Lost an executor

2014-04-23 Thread wxhsdp
i have a similar question.
i'm testing in standalone mode on only one PC.
i use ./sbin/start-master.sh to start a master and
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://ubuntu:7077
to connect to the master.

from the web ui, i can see the local worker registered 
http://apache-spark-user-list.1001560.n3.nabble.com/file/n4647/Screenshot_from_2014-04-23_16%5E%2551%5E%2553.png
 

but when i run any application (for example the SimpleApp.scala from the
quick start, with an input file of 4.5KB), it fails with
14/04/23 16:39:13 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
14/04/23 16:39:13 WARN scheduler.TaskSetManager: Loss was due to
java.lang.OutOfMemoryError
java.lang.OutOfMemoryError: Java heap space
at
org.apache.hadoop.io.WritableUtils.readCompressedStringArray(WritableUtils.java:183)
at 
org.apache.hadoop.conf.Configuration.readFields(Configuration.java:2378)
at 
org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:285)
at 
org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:77)
at
org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:39)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
at 
org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:165)
at
org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)

i use sbt to run the application
SBT_OPTS="-Xms512M -Xmx1536M -Xss1M -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=256M"
java $SBT_OPTS -jar `dirname $0`/sbt-launch.jar "$@"





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/ERROR-TaskSchedulerImpl-Lost-an-executor-tp4566p4647.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Accessing Hdfs from Spark gives TokenCache error Can't get Master Kerberos principal for use as renewer

2014-04-23 Thread Spyros Gasteratos

Hello everyone,
I'm a newbie in both hadoop and spark so please forgive any obvious
mistakes, I'm posting because my google-fu has failed me.

I'm trying to run a test Spark script in order to connect Spark to hadoop.
The script is the following

 from pyspark import SparkContext

 sc = SparkContext("local", "Simple App")
 file = sc.textFile("hdfs://hadoop_node.place:9000/errs.txt")
 errors = file.filter(lambda line: "ERROR" in line)
 errors.count()

When I run it with pyspark I get

py4j.protocol.Py4JJavaError: An error occurred while calling o21.collect. :
java.io.IOException: Can't get Master Kerberos principal for use as renewer
at

org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116)
at
org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
at
org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:187)
at
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:251)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:140) at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207) at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at
scala.Option.getOrElse(Option.scala:120) at
org.apache.spark.rdd.RDD.partitions(RDD.scala:205) at
org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207) at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at
scala.Option.getOrElse(Option.scala:120) at
org.apache.spark.rdd.RDD.partitions(RDD.scala:205) at
org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:46) at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207) at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at
scala.Option.getOrElse(Option.scala:120) at
org.apache.spark.rdd.RDD.partitions(RDD.scala:205) at
org.apache.spark.SparkContext.runJob(SparkContext.scala:898) at
org.apache.spark.rdd.RDD.collect(RDD.scala:608) at
org.apache.spark.api.java.JavaRDDLike$class.collect(JavaRDDLike.scala:243)
at org.apache.spark.api.java.JavaRDD.collect(JavaRDD.scala:27) at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606) at
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at
py4j.Gateway.invoke(Gateway.java:259) at
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at
py4j.commands.CallCommand.execute(CallCommand.java:79) at
py4j.GatewayConnection.run(GatewayConnection.java:207) at
java.lang.Thread.run(Thread.java:744)

This happens despite the facts that

   - I've done a kinit and a klist shows I have the correct tokens
   - when I issue a ./bin/hadoop fs -ls
hdfs://hadoop_node.place:9000/errs.txt it shows the file
   - Both the local hadoop client and spark have the same configuration file

The core-site.xml in the spark/conf and hadoop/conf folders is the following
(got it from one of the hadoop nodes)

<configuration>
  <property>
    <name>hadoop.security.auth_to_local</name>
    <value>
    RULE:[1:$1](.*@place)s/@place//
    RULE:[2:$1/$2@$0](.*/node1.place@place)s/^([a-zA-Z]*).*/$1/
    RULE:[2:$1/$2@$0](.*/node2.place@place)s/^([a-zA-Z]*).*/$1/
    RULE:[2:$1/$2@$0](.*/node3.place@place)s/^([a-zA-Z]*).*/$1/
    RULE:[2:$1/$2@$0](.*/node4.place@place)s/^([a-zA-Z]*).*/$1/
    RULE:[2:$1/$2@$0](.*/node5.place@place)s/^([a-zA-Z]*).*/$1/
    RULE:[2:$1/$2@$0](.*/node6.place@place)s/^([a-zA-Z]*).*/$1/
    RULE:[2:$1/$2@$0](.*/node7.place@place)s/^([a-zA-Z]*).*/$1/
    RULE:[2:nobody]
    DEFAULT
    </value>
  </property>
  <property>
    <name>net.topology.node.switch.mapping.impl</name>
    <value>org.apache.hadoop.net.TableMapping</value>
  </property>
  <property>
    <name>net.topology.table.file.name</name>
    <value>/etc/hadoop/conf/topology.table.file</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://server.place:9000/</value>
  </property>
  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
  </property>

  <property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
  </property>

  <property>
    <name>hadoop.proxyuser.hive.hosts</name>
    <value>*</value>
  </property>

  <property>
    <name>hadoop.proxyuser.hive.groups</name>
    <value>*</value>
  </property>

</configuration>

Can someone point out what I am missing?



Re: ERROR TaskSchedulerImpl: Lost an executor

2014-04-23 Thread Daniel Darabos
With the right program you can always exhaust any amount of memory :).
There is no silver bullet. You have to figure out what is happening in your
code that causes a high memory use and address that. I spent all of last
week doing this for a simple program of my own. Lessons I learned that may
or may not apply to your case:

 - If you don't cache (persist) an RDD, it is not stored. This can save
memory at the cost of possibly repeating computation. (I read around a TB
of files twice, for example, rather than cache them.)
 - Use combineByKey instead of groupByKey if you can process values one by
one, so they do not all need to be stored at once (see the sketch after this
list).
 - If you have a lot of keys per partition, set mapSideCombine=false for
combineByKey. This avoids creating a large map per partition.
 - If you have a key with a disproportionate number of values (like the
empty string for a missing name), discard it before the computation.
 - Read https://spark.apache.org/docs/latest/tuning.html for more (and more
accurate) information.
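
A minimal sketch of the combineByKey point (assuming (String, Int) pairs whose
values can simply be summed one by one):

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._   // PairRDDFunctions

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
// Values are folded into a running sum as they stream by, so no per-key
// collection is ever materialized the way groupByKey would.
val sums = pairs.combineByKey[Int](
  (v: Int) => v,                  // createCombiner: first value seen for a key
  (acc: Int, v: Int) => acc + v,  // mergeValue: fold in one more value
  (a: Int, b: Int) => a + b,      // mergeCombiners: merge partial sums
  new HashPartitioner(4),
  mapSideCombine = false          // skip the per-partition map if keys are many
)
sums.collect()                    // Array((a,4), (b,2)), in some order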

Good luck.


On Wed, Apr 23, 2014 at 1:25 AM, jaeholee jho...@lbl.gov wrote:

 Ok. I tried setting the partition number to 128 and numbers greater than
 128,
 and now I get another error message about Java heap space. Is it possible
 that there is something wrong with the setup of my Spark cluster to begin
 with? Or is it still an issue with partitioning my data? Or do I just need
 more worker nodes?


 ERROR TaskSetManager: Task 194.0:14 failed 4 times; aborting job
 org.apache.spark.SparkException: Job aborted: Task 194.0:14 failed 4 times
 (most recent failure: Exception failure: java.lang.OutOfMemoryError: Java
 heap space)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
 at

 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
 at scala.Option.foreach(Option.scala:236)
 at

 org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at

 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at

 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at

 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/ERROR-TaskSchedulerImpl-Lost-an-executor-tp4566p4623.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: GraphX: .edges.distinct().count() is 10?

2014-04-23 Thread Daniel Darabos
This is caused by https://issues.apache.org/jira/browse/SPARK-1188. I think
the fix will be in the next release. But until then, do:

g.edges.map(_.copy()).distinct.count



On Wed, Apr 23, 2014 at 2:26 AM, Ryan Compton compton.r...@gmail.comwrote:

 Try this: https://www.dropbox.com/s/xf34l0ta496bdsn/.txt

 This code:

 println(g.numEdges)
 println(g.numVertices)
 println(g.edges.distinct().count())

 gave me

 1
 9294
 2



 On Tue, Apr 22, 2014 at 5:14 PM, Ankur Dave ankurd...@gmail.com wrote:
  I wasn't able to reproduce this with a small test file, but I did change
 the
  file parsing to use x(1).toLong instead of x(2).toLong. Did you mean to
 take
  the third column rather than the second?
 
  If so, would you mind posting a larger sample of the file, or even the
 whole
  file if possible?
 
  Here's the test that succeeded:
 
  test("graph.edges.distinct.count") {
    withSpark { sc =>
      val edgeFullStrRDD: RDD[String] = sc.parallelize(List(
        "394365859\t136153151", "589404147\t1361045425"))
      val edgeTupRDD = edgeFullStrRDD.map(x => x.split("\t"))
        .map(x => (x(0).toLong, x(1).toLong))
      val g = Graph.fromEdgeTuples(edgeTupRDD, defaultValue = 123,
        uniqueEdges = Option(CanonicalRandomVertexCut))
      assert(edgeTupRDD.distinct.count() === 2)
      assert(g.numEdges === 2)
      assert(g.edges.distinct.count() === 2)
    }
  }
 
  Ankur



Re: two calls of saveAsTextFile() have different results on the same RDD

2014-04-23 Thread randylu
i got it, thanks very much :)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/two-calls-of-saveAsTextFile-have-different-results-on-the-same-RDD-tp4578p4655.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark runs applications in an inconsistent way

2014-04-23 Thread Andras Barjak

- Spark UI shows number of succeeded tasks is more than total number
of tasks, eg: 3500/3000. There are no failed tasks. At this stage the
computation keeps carrying on for a long time without returning an answer.

 No sign of resubmitted tasks in the command line logs either?
You might want to get more information on what is going on in the JVM?
I don't know what others use but jvmtop is easy to install on ec2 and you
can monitor some processes.


- The only way to get an answer from an application is to hopelessly
keep running that application multiple times, until by some luck it gets
converged.

 I was not able to regenerate this by a minimal code, as it seems some
 random factors affect this behavior. I have a suspicion, but I'm not sure,
 that use of one or more groupByKey() calls intensifies this problem.

Is this related to the amount of data you are processing? Is it more likely
to happen on large data?
My experience on ec2 is that whenever the memory/partitioning/timeout
settings are reasonable,
the output is quite consistent, even if I stop and restart the cluster on
another day.


about rdd.filter()

2014-04-23 Thread randylu
  my code is like:
rdd2 = rdd1.filter(_._2.length > 1)
rdd2.collect()
  it works well, but if i use a variable num instead of 1:
var num = 1
rdd2 = rdd1.filter(_._2.length > num)
rdd2.collect()
  it fails at rdd2.collect()
  so strange?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/about-rdd-filter-tp4657.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


SparkException: env SPARK_YARN_APP_JAR is not set

2014-04-23 Thread ????
I have a small program, which I can launch successfully via the yarn client in
yarn-standalone mode.

the commands look like this:
(javac -classpath .:jars/spark-assembly-0.9.1-hadoop2.2.0.jar LoadTest.java)
(jar cvf loadtest.jar LoadTest.class)
SPARK_JAR=assembly/target/scala-2.10/spark-assembly-0.9.1-hadoop2.2.0.jar
./bin/spark-class org.apache.spark.deploy.yarn.Client --jar
/opt/mytest/loadtest.jar --class LoadTest --args yarn-standalone --num-workers
2 --master-memory 2g --worker-memory 2g --worker-cores 1

the program LoadTest.java:
public class LoadTest {
    static final String USER = "root";
    public static void main(String[] args) {
        System.setProperty("user.name", USER);
        System.setProperty("HADOOP_USER_NAME", USER);
        System.setProperty("spark.executor.memory", "7g");
        JavaSparkContext sc = new JavaSparkContext(args[0], "LoadTest",
            System.getenv("SPARK_HOME"), JavaSparkContext.jarOfClass(LoadTest.class));
        String file = "file:/opt/mytest/123.data";
        JavaRDD<String> data1 = sc.textFile(file, 2);
        long c1 = data1.count();
        System.out.println(1 + c1);
    }
}

BUT due to my other program's needs, I must run it with the java command. So I
added an "environment" parameter to JavaSparkContext(). The following is the
ERROR I get:
Exception in thread "main" org.apache.spark.SparkException: env
SPARK_YARN_APP_JAR is not set
at
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:49)
at
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:125)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:200)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:100)
at
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:93)
at LoadTest.main(LoadTest.java:37)

the program LoadTest.java:
public class LoadTest {

    static final String USER = "root";
    public static void main(String[] args) {
        System.setProperty("user.name", USER);
        System.setProperty("HADOOP_USER_NAME", USER);
        System.setProperty("spark.executor.memory", "7g");

        Map<String, String> env = new HashMap<String, String>();
        env.put("SPARK_YARN_APP_JAR", "file:/opt/mytest/loadtest.jar");
        env.put("SPARK_WORKER_INSTANCES", "2");
        env.put("SPARK_WORKER_CORES", "1");
        env.put("SPARK_WORKER_MEMORY", "2G");
        env.put("SPARK_MASTER_MEMORY", "2G");
        env.put("SPARK_YARN_APP_NAME", "LoadTest");
        env.put("SPARK_YARN_DIST_ARCHIVES",
            "file:/opt/test/spark-0.9.1-bin-hadoop1/assembly/target/scala-2.10/spark-assembly-0.9.1-hadoop2.2.0.jar");
        JavaSparkContext sc = new JavaSparkContext("yarn-client", "LoadTest",
            System.getenv("SPARK_HOME"), JavaSparkContext.jarOfClass(LoadTest.class), env);
        String file = "file:/opt/mytest/123.dna";
        JavaRDD<String> data1 = sc.textFile(file, 2); //.cache();

        long c1 = data1.count();
        System.out.println(1 + c1);
    }
}

the commands:
javac -classpath .:jars/spark-assembly-0.9.1-hadoop2.2.0.jar LoadTest.java
jar cvf loadtest.jar LoadTest.class
nohup java -classpath .:jars/spark-assembly-0.9.1-hadoop2.2.0.jar LoadTest > loadTest.log 2>&1 &

What did I miss? Or did I do it the wrong way?

Re: ERROR TaskSchedulerImpl: Lost an executor

2014-04-23 Thread Parviz Deyhim
You need to set SPARK_MEM or SPARK_EXECUTOR_MEMORY (for Spark 1.0) to the
amount of memory your application needs to consume on each node. Try
setting those variables (example: export SPARK_MEM=10g) or set it via
SparkConf.set as suggested by jholee.
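
For the SparkConf route, a minimal sketch (assuming a Spark 0.9/1.0-style
build; the application name and the 10g figure are only examples):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")                   // assumed application name
  .set("spark.executor.memory", "10g")   // per-executor heap
val sc = new SparkContext(conf)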


On Tue, Apr 22, 2014 at 4:25 PM, jaeholee jho...@lbl.gov wrote:

 Ok. I tried setting the partition number to 128 and numbers greater than
 128,
 and now I get another error message about Java heap space. Is it possible
 that there is something wrong with the setup of my Spark cluster to begin
 with? Or is it still an issue with partitioning my data? Or do I just need
 more worker nodes?


 ERROR TaskSetManager: Task 194.0:14 failed 4 times; aborting job
 org.apache.spark.SparkException: Job aborted: Task 194.0:14 failed 4 times
 (most recent failure: Exception failure: java.lang.OutOfMemoryError: Java
 heap space)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
 at

 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
 at scala.Option.foreach(Option.scala:236)
 at

 org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at

 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at

 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at

 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/ERROR-TaskSchedulerImpl-Lost-an-executor-tp4566p4623.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Comparing RDD Items

2014-04-23 Thread Jared Rodriguez
Hi there,

I am new to Spark and new to scala, although have lots of experience on the
Java side.  I am experimenting with Spark for a new project where it seems
like it could be a good fit.  As I go through the examples, there is one
case scenario that I am trying to figure out, comparing the contents of an
RDD to itself to result in a new RDD.

In an overly simple example, I have:

JavaSparkContext sc = new JavaSparkContext ...
JavaRDD<String> data = sc.parallelize(buildData());

I then want to compare each entry in data to other entries and end up with:

JavaPairRDD<String, List<String>> mapped = data.???

Is this something easily handled by Spark?  My apologies if this is a
stupid question, I have spent less than 10 hours tinkering with Spark and
am trying to come up to speed.


-- 
Jared Rodriguez


Re: Spark runs applications in an inconsistent way

2014-04-23 Thread Aureliano Buendia
Yes, things get more unstable with larger data. But, that's the whole point
of my question:

Why should spark get unstable when data gets larger?

When data gets larger, spark should get *slower*, not more unstable. Lack
of stability makes parameter tuning very difficult, time-consuming and a
painful experience.

Also, it is a mystery to me why spark gets unstable in a non-deterministic
fashion. Why should it use twice, or half, the memory it used in the
previous run of exactly the same code?



On Wed, Apr 23, 2014 at 10:43 AM, Andras Barjak 
andras.bar...@lynxanalytics.com wrote:



- Spark UI shows number of succeeded tasks is more than total number
of tasks, eg: 3500/3000. There are no failed tasks. At this stage the
computation keeps carrying on for a long time without returning an answer.

 No sign of resubmitted tasks in the command line logs either?
 You might want to get more information on what is going on in the JVM?
 I don't know what others use but jvmtop is easy to install on ec2 and you
 can monitor some processes.


- The only way to get an answer from an application is to hopelessly
keep running that application multiple times, until by some luck it gets
converged.

 I was not able to regenerate this by a minimal code, as it seems some
 random factors affect this behavior. I have a suspicion, but I'm not sure,
 that use of one or more groupByKey() calls intensifies this problem.

 Is this related to the amount of data you are processing? Is it more
 likely to happen on large data?
 My experience on ec2 is whenever the the memory/partitioning/timout
 settings are reasonable
 the output is quite consistent. Even if I stop and restart the cluster the
 other day.



Re: about rdd.filter()

2014-04-23 Thread Sourav Chandra
This could happen if the variable is defined in such a way that it pulls its
own class reference into the closure. Serialization then tries to serialize
the whole outer class reference, which is not serializable, so the whole
thing fails.
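
A minimal sketch of the usual workaround (the enclosing class here is
hypothetical): copy the field into a local val so the closure captures only
that value, not the outer instance.

import org.apache.spark.rdd.RDD

class Analysis {                          // not Serializable
  var num = 1

  def run(rdd1: RDD[(String, String)]) = {
    val threshold = num                   // local copy; only this is captured
    val rdd2 = rdd1.filter(_._2.length > threshold)
    rdd2.collect()
  }
}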



On Wed, Apr 23, 2014 at 3:15 PM, randylu randyl...@gmail.com wrote:

   my code is like:
 rdd2 = rdd1.filter(_._2.length > 1)
 rdd2.collect()
   it works well, but if i use a variable num instead of 1:
 var num = 1
 rdd2 = rdd1.filter(_._2.length > num)
 rdd2.collect()
   it fails at rdd2.collect()
   so strange?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/about-rdd-filter-tp4657.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.




-- 

Sourav Chandra

Senior Software Engineer

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

sourav.chan...@livestream.com

o: +91 80 4121 8723

m: +91 988 699 3746

skype: sourav.chandra

Livestream

Ajmera Summit, First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd
Block, Koramangala Industrial Area,

Bangalore 560034

www.livestream.com


Re: Pig on Spark

2014-04-23 Thread lalit1303
Hi,

We got spork working on spark 0.9.0
Repository available at:
https://github.com/sigmoidanalytics/pig/tree/spork-hadoopasm-fix

Please suggest your feedback.



-
Lalit Yadav
la...@sigmoidanalytics.com
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Pig-on-Spark-tp2367p4668.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


skip lines in spark

2014-04-23 Thread Chengi Liu
Hi,
  What is the easiest way to skip the first n lines in an RDD?
I am not able to figure this one out.
Thanks


Spark hangs when i call parallelize + count on a ArrayListbyte[] having 40k elements

2014-04-23 Thread amit karmakar
Spark hangs after i perform the following operations


ArrayList<byte[]> bytesList = new ArrayList<byte[]>();
/*
   add 40k entries to bytesList
*/

JavaRDD<byte[]> rdd = sparkContext.parallelize(bytesList);
System.out.println("Count= " + rdd.count());


If i add just one entry it works.

It works if i modify
JavaRDD<byte[]> rdd = sparkContext.parallelize(bytesList)
to
JavaRDD<byte[]> rdd = sparkContext.parallelize(bytesList, 20);

There is nothing in the logs that can help understand the reason.

What could be reason for this ?


Regards,
Amit Kumar Karmakar


Re: skip lines in spark

2014-04-23 Thread Andre Bois-Crettez

Good question, I am wondering too how it is possible to add a line
number to distributed data.

I thought it was a job for mapPartitionsWithIndex, but it seems difficult.
Something similar here :
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-and-Partition-td991.html#a995

Maybe the file reader, knowing it works on the first HDFS block, could
count line numbers or something?

André

On 2014-04-23 18:18, Chengi Liu wrote:

Hi,
  What is the easiest way to skip first n lines in rdd??
I am not able to figure this one out?
Thanks



--
André Bois-Crettez

Software Architect
Big Data Developer
http://www.kelkoo.com/




error in mllib lr example code

2014-04-23 Thread Mohit Jaggi
sorry...added a subject now

On Wed, Apr 23, 2014 at 9:32 AM, Mohit Jaggi mohitja...@gmail.com wrote:

 I am trying to run the example linear regression code from

  http://spark.apache.org/docs/latest/mllib-guide.html

 But I am getting the following error...am I missing an import?

 code

 import org.apache.spark._
 import org.apache.spark.mllib.regression.LinearRegressionWithSGD
 import org.apache.spark.mllib.regression.LabeledPoint

 object ModelLR {
   def main(args: Array[String]) {
     val sc = new SparkContext(args(0), "SparkLR",
       System.getenv("SPARK_HOME"), SparkContext.jarOfClass(this.getClass).toSeq)

     // Load and parse the data
     val data = sc.textFile("mllib/data/ridge-data/lpsa.data")
     val parsedData = data.map { line =>
       val parts = line.split(',')
       LabeledPoint(parts(0).toDouble,
         parts(1).split(' ').map(x => x.toDouble).toArray)
     }
     ...snip...
   }
 }

 error

 - polymorphic expression cannot be instantiated to expected type; found :
 [U >: Double]Array[U] required:

  org.apache.spark.mllib.linalg.Vector

 - polymorphic expression cannot be instantiated to expected type; found :
 [U >: Double]Array[U] required:

  org.apache.spark.mllib.linalg.Vector



Re: skip lines in spark

2014-04-23 Thread Xiangrui Meng
If the first partition doesn't have enough records, then it may not
drop enough lines. Try

rddData.zipWithIndex().filter(_._2 >= 10L).map(_._1)

It might trigger a job.

Best,
Xiangrui

On Wed, Apr 23, 2014 at 9:46 AM, DB Tsai dbt...@stanford.edu wrote:
 Hi Chengi,

 If you just want to skip first n lines in RDD, you can do

 rddData.mapPartitionsWithIndex((partitionIdx: Int, lines: Iterator[String]) => {
   if (partitionIdx == 0) {
     lines.drop(n)
   }
   lines
 })


 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Wed, Apr 23, 2014 at 9:18 AM, Chengi Liu chengi.liu...@gmail.com wrote:

 Hi,
   What is the easiest way to skip first n lines in rdd??
 I am not able to figure this one out?
 Thanks




Re: Spark hangs when i call parallelize + count on a ArrayListbyte[] having 40k elements

2014-04-23 Thread Xiangrui Meng
How big is each entry, and how much memory do you have on each
executor? You generated all data on driver and
sc.parallelize(bytesList) will send the entire dataset to a single
executor. You may run into I/O or memory issues. If the entries are
generated, you should create a simple RDD sc.parallelize(0 until 20,
20) and call mapPartitions to generate them in parallel. -Xiangrui
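
A minimal sketch of that approach (the 20-way split and the makeEntry helper
are assumptions for illustration):

// hypothetical generator for one byte[] entry
def makeEntry(i: Int): Array[Byte] = Array.fill(1024)(i.toByte)

val perPartition = 2000                   // 20 x 2000 = 40k entries
val rdd = sc.parallelize(0 until 20, 20).mapPartitionsWithIndex { (idx, _) =>
  // build this partition's share on the executor instead of on the driver
  (0 until perPartition).iterator.map(i => makeEntry(idx * perPartition + i))
}
println("Count= " + rdd.count())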

On Wed, Apr 23, 2014 at 9:23 AM, amit karmakar
amit.codenam...@gmail.com wrote:
 Spark hangs after i perform the following operations


 ArrayList<byte[]> bytesList = new ArrayList<byte[]>();
 /*
    add 40k entries to bytesList
 */

 JavaRDD<byte[]> rdd = sparkContext.parallelize(bytesList);
 System.out.println("Count= " + rdd.count());


 If i add just one entry it works.

 It works if i modify
 JavaRDD<byte[]> rdd = sparkContext.parallelize(bytesList)
 to
 JavaRDD<byte[]> rdd = sparkContext.parallelize(bytesList, 20);

 There is nothing in the logs that can help understand the reason.

 What could be reason for this ?


 Regards,
 Amit Kumar Karmakar


Re: skip lines in spark

2014-04-23 Thread Xiangrui Meng
Sorry, I didn't realize that zipWithIndex() is not in v0.9.1. It is in
the master branch and will be included in v1.0. It first counts number
of records per partition and then assigns indices starting from 0.
-Xiangrui
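
A minimal sketch of that mechanism for releases without zipWithIndex (count
each partition first, then offset the local indices; rddData is the RDD from
earlier in the thread):

// Pass 1: size of every partition, collected to the driver.
val counts = rddData
  .mapPartitionsWithIndex { (i, it) => Iterator((i, it.size)) }
  .collect().sortBy(_._1).map(_._2)
val offsets = counts.scanLeft(0L)(_ + _)   // global start index per partition

// Pass 2: attach a global index, then drop the first 10 records.
val withIndex = rddData.mapPartitionsWithIndex { (i, it) =>
  it.zipWithIndex.map { case (x, j) => (x, offsets(i) + j) }
}
val skipped = withIndex.filter(_._2 >= 10L).map(_._1)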

On Wed, Apr 23, 2014 at 9:56 AM, Chengi Liu chengi.liu...@gmail.com wrote:
 Also, zipWithIndex() is not valid. Did you mean zipPartitions?


 On Wed, Apr 23, 2014 at 9:55 AM, Chengi Liu chengi.liu...@gmail.com wrote:

 Xiangrui,
   So, is it that full code suggestion is :
 val trigger = rddData.zipWithIndex().filter(
 _._2 >= 10L).map(_._1)

 and then what DB Tsai recommended
 trigger.mapPartitionsWithIndex((partitionIdx: Int, lines:
  Iterator[String]) => {
   if (partitionIdx == 0) {
 lines.drop(n)
   }
   lines
 })

 Is that the full operation..

 What happens, if I have to drop so many records that the number exceeds
 partition 0.. ??
 How do i handle that case?




 On Wed, Apr 23, 2014 at 9:51 AM, Xiangrui Meng men...@gmail.com wrote:

 If the first partition doesn't have enough records, then it may not
 drop enough lines. Try

  rddData.zipWithIndex().filter(_._2 >= 10L).map(_._1)

 It might trigger a job.

 Best,
 Xiangrui

 On Wed, Apr 23, 2014 at 9:46 AM, DB Tsai dbt...@stanford.edu wrote:
  Hi Chengi,
 
  If you just want to skip first n lines in RDD, you can do
 
  rddData.mapPartitionsWithIndex((partitionIdx: Int, lines: Iterator[String]) => {
    if (partitionIdx == 0) {
      lines.drop(n)
    }
    lines
  })
 
 
  Sincerely,
 
  DB Tsai
  ---
  My Blog: https://www.dbtsai.com
  LinkedIn: https://www.linkedin.com/in/dbtsai
 
 
  On Wed, Apr 23, 2014 at 9:18 AM, Chengi Liu chengi.liu...@gmail.com
  wrote:
 
  Hi,
What is the easiest way to skip first n lines in rdd??
  I am not able to figure this one out?
  Thanks
 
 





Re: skip lines in spark

2014-04-23 Thread DB Tsai
What I suggested will not work if the number of records you want to drop is
more than the data in the first partition. In my use case, I only drop the
first couple of lines, so I don't have this issue.


Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Wed, Apr 23, 2014 at 9:55 AM, Chengi Liu chengi.liu...@gmail.com wrote:

 Xiangrui,
   So, is it that full code suggestion is :
 val trigger = rddData.zipWithIndex().filter(
 _._2 >= 10L).map(_._1)

 and then what DB Tsai recommended
 trigger.mapPartitionsWithIndex((partitionIdx: Int, lines:
  Iterator[String]) => {
   if (partitionIdx == 0) {
 lines.drop(n)
   }
   lines
 })

 Is that the full operation..

 What happens, if I have to drop so many records that the number exceeds
 partition 0.. ??
 How do i handle that case?




 On Wed, Apr 23, 2014 at 9:51 AM, Xiangrui Meng men...@gmail.com wrote:

 If the first partition doesn't have enough records, then it may not
 drop enough lines. Try

  rddData.zipWithIndex().filter(_._2 >= 10L).map(_._1)

 It might trigger a job.

 Best,
 Xiangrui

 On Wed, Apr 23, 2014 at 9:46 AM, DB Tsai dbt...@stanford.edu wrote:
  Hi Chengi,
 
  If you just want to skip first n lines in RDD, you can do
 
  rddData.mapPartitionsWithIndex((partitionIdx: Int, lines: Iterator[String]) => {
    if (partitionIdx == 0) {
      lines.drop(n)
    }
    lines
  })
 
 
  Sincerely,
 
  DB Tsai
  ---
  My Blog: https://www.dbtsai.com
  LinkedIn: https://www.linkedin.com/in/dbtsai
 
 
  On Wed, Apr 23, 2014 at 9:18 AM, Chengi Liu chengi.liu...@gmail.com
 wrote:
 
  Hi,
What is the easiest way to skip first n lines in rdd??
  I am not able to figure this one out?
  Thanks
 
 





Re: Hadoop—streaming

2014-04-23 Thread Xiangrui Meng
PipedRDD is an RDD[String]. If you know how to parse each result line
into (key, value) pairs, then you can call reduce after.

piped.map(x => (key, value)).reduceByKey((v1, v2) => v)
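
A fuller sketch along those lines (the input path, the ./my_mapper script, and
the tab-separated key/value output format are assumptions):

import org.apache.spark.SparkContext._    // PairRDDFunctions

val piped = sc.textFile("hdfs:///input/data.txt").pipe("./my_mapper")
val reduced = piped
  .map { line =>
    val Array(k, v) = line.split("\t", 2) // parse "key<TAB>value"
    (k, v.toLong)
  }
  .reduceByKey(_ + _)                     // the "reduce" step, per key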

-Xiangrui

On Wed, Apr 23, 2014 at 2:09 AM, zhxfl 291221...@qq.com wrote:
 Hello, we know Hadoop-streaming is used to let Hadoop run native programs.
 Hadoop-streaming supports Map and Reduce logic. Reduce logic means Hadoop
 collects all values with the same key and feeds the stream to the native
 application.
 Spark has PipedRDD too, but PipedRDD doesn't support Reduce logic. So it's
 difficult for us to port our application from Hadoop to Spark.
 Can anyone give me advice? Thanks!


Re: error in mllib lr example code

2014-04-23 Thread Matei Zaharia
See http://people.csail.mit.edu/matei/spark-unified-docs/ for a more recent 
build of the docs; if you spot any problems in those, let us know.

Matei

On Apr 23, 2014, at 9:49 AM, Xiangrui Meng men...@gmail.com wrote:

 The doc is for 0.9.1. You are running a later snapshot, which added
 sparse vectors. Try LabeledPoint(parts(0).toDouble,
 Vectors.dense(parts(1).split(' ').map(x => x.toDouble))). The examples
 are updated in the master branch. You can also check the examples
 there. -Xiangrui
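
Spelled out, a minimal sketch of the updated parsing (assuming the
master-branch API with org.apache.spark.mllib.linalg.Vectors):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val data = sc.textFile("mllib/data/ridge-data/lpsa.data")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble,
    Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}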
 
 On Wed, Apr 23, 2014 at 9:34 AM, Mohit Jaggi mohitja...@gmail.com wrote:
 
 sorry...added a subject now
 
 On Wed, Apr 23, 2014 at 9:32 AM, Mohit Jaggi mohitja...@gmail.com wrote:
 
 I am trying to run the example linear regression code from
 
 http://spark.apache.org/docs/latest/mllib-guide.html
 
 But I am getting the following error...am I missing an import?
 
 code
 
 import org.apache.spark._
 
 import org.apache.spark.mllib.regression.LinearRegressionWithSGD
 
 import org.apache.spark.mllib.regression.LabeledPoint
 
 
 object ModelLR {
 
  def main(args: Array[String]) {
 
     val sc = new SparkContext(args(0), "SparkLR",
       System.getenv("SPARK_HOME"), SparkContext.jarOfClass(this.getClass).toSeq)

     // Load and parse the data
     val data = sc.textFile("mllib/data/ridge-data/lpsa.data")
     val parsedData = data.map { line =>
       val parts = line.split(',')
       LabeledPoint(parts(0).toDouble,
         parts(1).split(' ').map(x => x.toDouble).toArray)
     }
     ...snip...
   }
 }
 
 error
 
 - polymorphic expression cannot be instantiated to expected type; found :
 [U >: Double]Array[U] required:

 org.apache.spark.mllib.linalg.Vector

 - polymorphic expression cannot be instantiated to expected type; found :
 [U >: Double]Array[U] required:

 org.apache.spark.mllib.linalg.Vector
 
 



Re: SPARK_YARN_APP_JAR, SPARK_CLASSPATH and ADD_JARS in a spark-shell on YARN

2014-04-23 Thread Sandy Ryza
Ah, you're right about SPARK_CLASSPATH and ADD_JARS.  My bad.

SPARK_YARN_APP_JAR is going away entirely -
https://issues.apache.org/jira/browse/SPARK-1053


On Wed, Apr 23, 2014 at 8:07 AM, Christophe Préaud 
christophe.pre...@kelkoo.com wrote:

  Hi Sandy,

 Thanks for your reply !

 I thought adding the jars in both SPARK_CLASSPATH and ADD_JARS was only
 required as a temporary workaround in spark 0.9.0 (see
 https://issues.apache.org/jira/browse/SPARK-1089), and that it was not
 necessary anymore in 0.9.1

 As for SPARK_YARN_APP_JAR, is it really useful, or is it planned to be
 removed in future versions of Spark? I personally always set it to
 /dev/null when launching a spark-shell in yarn-client mode.

 Thanks again for your time!
 Christophe.


 On 21/04/2014 19:16, Sandy Ryza wrote:

 Hi Christophe,

  Adding the jars to both SPARK_CLASSPATH and ADD_JARS is required.  The
 former makes them available to the spark-shell driver process, and the
 latter tells Spark to make them available to the executor processes running
 on the cluster.

  -Sandy


 On Wed, Apr 16, 2014 at 9:27 AM, Christophe Préaud 
 christophe.pre...@kelkoo.com wrote:

 Hi,

 I am running Spark 0.9.1 on a YARN cluster, and I am wondering which is
 the
 correct way to add external jars when running a spark shell on a YARN
 cluster.

 Packaging all this dependencies in an assembly which path is then set in
 SPARK_YARN_APP_JAR (as written in the doc:
 http://spark.apache.org/docs/latest/running-on-yarn.html) does not work
 in my
 case: it pushes the jar on HDFS in .sparkStaging/application_XXX, but the
 spark-shell is still unable to find it (unless ADD_JARS and/or
 SPARK_CLASSPATH
 is defined)

 Defining all the dependencies (either in an assembly, or separately) in
 ADD_JARS
 or SPARK_CLASSPATH works (even if SPARK_YARN_APP_JAR is set to
 /dev/null), but
 defining some dependencies in ADD_JARS and the rest in SPARK_CLASSPATH
 does not!

 Hence I'm still wondering which are the differences between ADD_JARS and
 SPARK_CLASSPATH, and the purpose of SPARK_YARN_APP_JAR.

 Thanks for any insights!
 Christophe.










Is Spark a good choice for geospatial/GIS applications? Is a community volunteer needed in this area?

2014-04-23 Thread neveroutgunned
Greetings Spark users/devs! I'm interested in using Spark to process
large volumes of data with a geospatial component, and I haven't been
able to find much information on Spark's ability to handle this kind
of operation. I don't need anything too complex; just distance between
two points, point-in-polygon and the like.

Does Spark (or possibly Shark) support this kind of query? Has anyone
written a plugin/extension along these lines?

If there isn't anything like this so far, then it seems like I have
two options. I can either abandon Spark and fall back on Hadoop and
Hive with the ESRI Tools extension, or I can stick with Spark and try
to write/port a GIS toolkit. Which option do you think I should
pursue? How hard is it for someone that's new to the Spark codebase to
write an extension? Is there anyone else in the community that would
be interested in having geospatial capability in Spark?

Thanks for your help!




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-Spark-a-good-choice-for-geospatial-GIS-applications-Is-a-community-volunteer-needed-in-this-area-tp4685.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Pig on Spark

2014-04-23 Thread Mayur Rustagi
There are two benefits I get as of now:
1. Most of the time a lot of customers don't want the full power; they
want something dead simple with which they can do a DSL. They end up using
Hive for a lot of ETL just because it's SQL and they understand it. Pig is
close, and it wraps a lot of framework-level semantics away from the user and
lets him focus on data flow.
2. Some have codebases in Pig already and are just looking to run them
faster. I have yet to benchmark that with Pig on Spark.

I agree that Pig on Spark cannot solve a lot of problems, but it can solve
some without forcing the end customer to do anything even close to coding; I
believe there is quite some value in making Spark accessible to a larger
group of people.
At the end of the day, to each his own :)

Regards
Mayur


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi



On Thu, Apr 24, 2014 at 1:24 AM, Bharath Mundlapudi mundlap...@gmail.comwrote:

 This seems like an interesting question.

 I love Apache Pig. It is so natural and the language flows with nice
 syntax.

 While I was at Yahoo! in core Hadoop Engineering, I used Pig a lot
 for analytics and provided feedback to the Pig team to add much more
 functionality when it was at version 0.7. Lots of new functionality is
 offered now.
 At the end of the day, Pig is a DSL for data flows. There will always be gaps
 and enhancements. I often wondered: is a DSL the right way to solve data flow
 problems? Maybe not; we need a complete language construct. We may have
 found the answer - Scala. With Scala's dynamic compilation, we can write
 much more powerful constructs than any DSL can provide.

 If I am a new organization and beginning to choose, I would go with Scala.

 Here is the example:

 #!/bin/sh
 exec scala "$0" "$@"
 !#
 YOUR DSL GOES HERE BUT IN SCALA!

 You have DSL like scripting, functional and complete language power! If we
 can improve first 3 lines, here you go, you have most powerful DSL to solve
 data problems.

 -Bharath





 On Mon, Mar 10, 2014 at 11:00 PM, Xiangrui Meng men...@gmail.com wrote:

 Hi Sameer,

 Lin (cc'ed) could also give you some updates about Pig on Spark
 development on her side.

 Best,
 Xiangrui

 On Mon, Mar 10, 2014 at 12:52 PM, Sameer Tilak ssti...@live.com wrote:
  Hi Mayur,
  We are planning to upgrade our distribution MR1 MR2 (YARN) and the
 goal is
  to get SPROK set up next month. I will keep you posted. Can you please
 keep
  me informed about your progress as well.
 
  
  From: mayur.rust...@gmail.com
  Date: Mon, 10 Mar 2014 11:47:56 -0700
 
  Subject: Re: Pig on Spark
  To: user@spark.apache.org
 
 
  Hi Sameer,
  Did you make any progress on this. My team is also trying it out would
 love
  to know some detail so progress.
 
  Mayur Rustagi
  Ph: +1 (760) 203 3257
  http://www.sigmoidanalytics.com
  @mayur_rustagi
 
 
 
  On Thu, Mar 6, 2014 at 2:20 PM, Sameer Tilak ssti...@live.com wrote:
 
  Hi Aniket,
  Many thanks! I will check this out.
 
  
  Date: Thu, 6 Mar 2014 13:46:50 -0800
  Subject: Re: Pig on Spark
  From: aniket...@gmail.com
  To: user@spark.apache.org; tgraves...@yahoo.com
 
 
  There is some work to make this work on yarn at
  https://github.com/aniket486/pig. (So, compile pig with ant
  -Dhadoopversion=23)
 
  You can look at https://github.com/aniket486/pig/blob/spork/pig-spark to
  find out what sort of env variables you need (sorry, I haven't been
 able to
  clean this up- in-progress). There are few known issues with this, I
 will
  work on fixing them soon.
 
  Known issues-
  1. Limit does not work (spork-fix)
  2. Foreach requires to turn off schema-tuple-backend (should be a
 pig-jira)
  3. Algebraic udfs dont work (spork-fix in-progress)
  4. Group by rework (to avoid OOMs)
  5. UDF Classloader issue (requires SPARK-1053, then you can put
  pig-withouthadoop.jar as SPARK_JARS in SparkContext along with udf jars)
 
  ~Aniket
 
 
 
 
  On Thu, Mar 6, 2014 at 1:36 PM, Tom Graves tgraves...@yahoo.com
 wrote:
 
  I had asked a similar question on the dev mailing list a while back (Jan
  22nd).
 
  See the archives:
  http://mail-archives.apache.org/mod_mbox/spark-dev/201401.mbox/browser-
  look for spork.
 
  Basically Matei said:
 
  Yup, that was it, though I believe people at Twitter picked it up again
  recently. I'd suggest
  asking Dmitriy if you know him. I've seen interest in this from several
  other groups, and
  if there's enough of it, maybe we can start another open source repo to
  track it. The work
  in that repo you pointed to was done over one week, and already had
 most of
  Pig's operators
  working. (I helped out with this prototype over Twitter's hack week.)
 That
  work also calls
  the Scala API directly, because it was done before we had a Java API; it
  should be easier
  with the Java one.
 
 
  Tom
 
 
 
  On Thursday, March 6, 2014 3:11 PM, Sameer Tilak ssti...@live.com
 wrote:
  Hi everyone,
 
  We are using to Pig 

How do I access the SPARK SQL

2014-04-23 Thread diplomatic Guru
Hello Team,

I'm new to Spark and just came across Spark SQL, which appears to be
interesting, but I'm not sure how I could get it.

I know it's an alpha version, but I'm not sure if it's available to the
community yet.

Many thanks.

Raj.


Re: Pig on Spark

2014-04-23 Thread suman bharadwaj
Are all the features available in PIG working in SPORK ?? Like for eg: UDFs
?

Thanks.


On Thu, Apr 24, 2014 at 1:54 AM, Mayur Rustagi mayur.rust...@gmail.comwrote:

 Thr are two benefits I get as of now
 1. Most of the time a lot of customers dont want the full power but they
 want something dead simple with which they can do dsl. They end up using
 Hive for a lot of ETL just cause its SQL  they understand it. Pig is close
  wraps up a lot of framework level semantics away from the user  lets him
 focus on data flow
 2. Some have codebases in Pig already  are just looking to do it faster.
 I am yet to benchmark that on Pig on spark.

 I agree that pig on spark cannot solve a lot problems but it can solve
 some without forcing the end customer to do anything even close to coding,
 I believe thr is quite some value in making Spark accessible to larger
 group of audience.
 End of the day to each his own :)

 Regards
 Mayur


 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Thu, Apr 24, 2014 at 1:24 AM, Bharath Mundlapudi 
 mundlap...@gmail.comwrote:

 This seems like an interesting question.

 I love Apache Pig. It is so natural and the language flows with nice
 syntax.

  While I was at Yahoo! in core Hadoop engineering, I used Pig a lot for
  analytics and gave the Pig team feedback to add much more functionality when
  it was at version 0.7. Lots of new functionality is offered now.
  At the end of the day, Pig is a DSL for data flows. There will always be gaps
  and enhancements. I often wondered: is a DSL the right way to solve data-flow
  problems? Maybe not; we may need a complete language. We may have
  found the answer - Scala. With Scala's dynamic compilation, we can write
  much more powerful constructs than any DSL can provide.

  If I am a new organization and beginning to choose, I would go with Scala.

  Here is the example:

  #!/bin/sh
  exec scala "$0" "$@"
  !#
  YOUR DSL GOES HERE BUT IN SCALA!

  You get DSL-like scripting, functional style and the power of a complete language! If
  we can improve the first 3 lines, there you go: you have the most powerful DSL to
  solve data problems.

 -Bharath






Re: Pig on Spark

2014-04-23 Thread Mayur Rustagi
UDF, Generate & many many more are not working :)

Several of them work: joins, filters, group by, etc.
I am translating the ones we need, and would be happy to get help on others.
Will host a JIRA to track them if you are interested.


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi



On Thu, Apr 24, 2014 at 2:10 AM, suman bharadwaj suman@gmail.comwrote:

 Are all the features available in PIG working in SPORK ?? Like for eg:
 UDFs ?

 Thanks.


 On Thu, Apr 24, 2014 at 1:54 AM, Mayur Rustagi mayur.rust...@gmail.comwrote:

 Thr are two benefits I get as of now
 1. Most of the time a lot of customers dont want the full power but they
 want something dead simple with which they can do dsl. They end up using
 Hive for a lot of ETL just cause its SQL  they understand it. Pig is close
  wraps up a lot of framework level semantics away from the user  lets him
 focus on data flow
 2. Some have codebases in Pig already  are just looking to do it faster.
 I am yet to benchmark that on Pig on spark.

 I agree that pig on spark cannot solve a lot problems but it can solve
 some without forcing the end customer to do anything even close to coding,
 I believe thr is quite some value in making Spark accessible to larger
 group of audience.
 End of the day to each his own :)

 Regards
 Mayur


 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi




Re: ERROR TaskSchedulerImpl: Lost an executor

2014-04-23 Thread jaeholee
After doing that, I ran my code once with a smaller example, and it worked.
But ever since then, I get the No space left on device message for the
same sample, even if I re-start the master...

ERROR TaskSetManager: Task 29.0:20 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted: Task 29.0:20 failed 4 times (most recent failure: Exception failure: java.io.IOException: No space left on device)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/ERROR-TaskSchedulerImpl-Lost-an-executor-tp4566p4699.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
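
For context: executors keep shuffle and spill files under spark.local.dir, which defaults to /tmp, so this error usually means that scratch directory has filled up rather than the input being too large. A minimal sketch of pointing it at a volume with more room (the path below is hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

// point Spark's scratch space at a disk with free room;
// a comma-separated list of directories also works
val conf = new SparkConf()
  .setAppName("my-app")
  .set("spark.local.dir", "/data/spark-tmp")
val sc = new SparkContext(conf)

Cleaning out old spark-local-* directories under /tmp on the workers between runs also helps.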


Re: How do I access the SPARK SQL

2014-04-23 Thread Matei Zaharia
It’s currently in the master branch, on https://github.com/apache/spark. You 
can check that out from git, build it with sbt/sbt assembly, and then try it 
out. We’re also going to post some release candidates soon that will be 
pre-built.

Matei
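
For anyone trying the alpha out, a rough sketch of what the API looks like from the Scala shell once the assembly is built; the case class, file and table name below are placeholders, and method names may still change before the release:

// inside ./bin/spark-shell (sc is provided by the shell)
case class Person(name: String, age: Int)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._   // brings in the implicit conversion from RDDs of case classes

val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

people.registerAsTable("people")
val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.collect().foreach(println)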

On Apr 23, 2014, at 1:30 PM, diplomatic Guru diplomaticg...@gmail.com wrote:

 Hello Team,
 
 I'm new to SPARK and just came across SPARK SQL, which appears to be 
 interesting but not sure how I could get it.
 
 I know it's an Alpha version but not sure if its available for community yet.
 
 Many thanks.
 
 Raj.



Re: Pig on Spark

2014-04-23 Thread suman bharadwaj
We currently are in the process of converting PIG and Java map reduce jobs
to SPARK jobs. And we have written couple of PIG UDFs as well. Hence was
checking if we can leverage SPORK without converting to SPARK jobs.

And is there any way I can port my existing Java MR jobs to SPARK ?
I know this thread has a different subject, let me know if need to ask this
question in separate thread.

Thanks in advance.


On Thu, Apr 24, 2014 at 2:13 AM, Mayur Rustagi mayur.rust...@gmail.comwrote:

 UDF
 Generate
  many many more are not working :)

 Several of them work. Joins, filters, group by etc.
 I am translating the ones we need, would be happy to get help on others.
 Will host a jira to track them if you are intersted.


 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Thu, Apr 24, 2014 at 2:10 AM, suman bharadwaj suman@gmail.comwrote:

 Are all the features available in PIG working in SPORK ?? Like for eg:
 UDFs ?

 Thanks.


 On Thu, Apr 24, 2014 at 1:54 AM, Mayur Rustagi 
 mayur.rust...@gmail.comwrote:

 Thr are two benefits I get as of now
 1. Most of the time a lot of customers dont want the full power but they
 want something dead simple with which they can do dsl. They end up using
 Hive for a lot of ETL just cause its SQL  they understand it. Pig is close
  wraps up a lot of framework level semantics away from the user  lets him
 focus on data flow
 2. Some have codebases in Pig already  are just looking to do it
 faster. I am yet to benchmark that on Pig on spark.

 I agree that pig on spark cannot solve a lot problems but it can solve
 some without forcing the end customer to do anything even close to coding,
 I believe thr is quite some value in making Spark accessible to larger
 group of audience.
 End of the day to each his own :)

 Regards
 Mayur


 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi




Re: Pig on Spark

2014-04-23 Thread Mayur Rustagi
Right now UDFs are not working. They're at the top of the list though; you should be
able to use them soon :)
Are there any other Pig features you use often, apart from the usual
suspects?

Existing Java MR jobs would be an easier move. Are these cascading jobs or
single map reduce jobs? If single, then you should be able to write a
Scala wrapper to call the map & reduce functions with some magic & let
your core code be. It would be interesting to see an actual example & get
it to work.
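
To make the wrapper idea concrete, here is a minimal sketch for a single-stage job; WordCountLogic stands in for whatever existing map and reduce logic you already have (all names and paths below are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD functions such as reduceByKey

// existing MR logic factored into plain functions (hypothetical stand-in)
object WordCountLogic {
  def mapFn(line: String): Seq[(String, Int)] = line.split("\\s+").map(w => (w, 1)).toSeq
  def reduceFn(a: Int, b: Int): Int = a + b
}

object MrToSpark {
  def main(args: Array[String]): Unit = {
    // in a real cluster the master would come from the launcher instead of local[2]
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("mr-to-spark"))
    sc.textFile(args(0))
      .flatMap(WordCountLogic.mapFn)         // plays the role of Mapper.map
      .reduceByKey(WordCountLogic.reduceFn)  // plays the role of Reducer.reduce (associative reducers only)
      .saveAsTextFile(args(1))
    sc.stop()
  }
}

The core functions stay unchanged; only the driver wiring is new.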

Regards
Mayur


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi



On Thu, Apr 24, 2014 at 2:46 AM, suman bharadwaj suman@gmail.comwrote:

 We currently are in the process of converting PIG and Java map reduce jobs
 to SPARK jobs. And we have written couple of PIG UDFs as well. Hence was
 checking if we can leverage SPORK without converting to SPARK jobs.

 And is there any way I can port my existing Java MR jobs to SPARK ?
 I know this thread has a different subject, let me know if need to ask
 this question in separate thread.

 Thanks in advance.


 On Thu, Apr 24, 2014 at 2:13 AM, Mayur Rustagi mayur.rust...@gmail.comwrote:

 UDF
 Generate
  many many more are not working :)

 Several of them work. Joins, filters, group by etc.
 I am translating the ones we need, would be happy to get help on others.
 Will host a jira to track them if you are intersted.


 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Thu, Apr 24, 2014 at 2:10 AM, suman bharadwaj suman@gmail.comwrote:

 Are all the features available in PIG working in SPORK ?? Like for eg:
 UDFs ?

 Thanks.


 On Thu, Apr 24, 2014 at 1:54 AM, Mayur Rustagi 
 mayur.rust...@gmail.comwrote:

 Thr are two benefits I get as of now
 1. Most of the time a lot of customers dont want the full power but
 they want something dead simple with which they can do dsl. They end up
 using Hive for a lot of ETL just cause its SQL  they understand it. Pig is
 close  wraps up a lot of framework level semantics away from the user 
 lets him focus on data flow
 2. Some have codebases in Pig already  are just looking to do it
 faster. I am yet to benchmark that on Pig on spark.

 I agree that pig on spark cannot solve a lot problems but it can solve
 some without forcing the end customer to do anything even close to coding,
 I believe thr is quite some value in making Spark accessible to larger
 group of audience.
 End of the day to each his own :)

 Regards
 Mayur


 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi




Re: AmpCamp exercise in a local environment

2014-04-23 Thread Nabeel Memon
Thanks a lot Arpit. It's really helpful.


On Fri, Apr 18, 2014 at 4:24 AM, Arpit Tak arpit.sparku...@gmail.comwrote:

 Download Cloudera VM from here.


 https://drive.google.com/file/d/0B7zn-Mmft-XcdTZPLXltUjJyeUE/edit?usp=sharing

 Regards,
 Arpit Tak


 On Fri, Apr 18, 2014 at 1:20 PM, Arpit Tak arpit.sparku...@gmail.comwrote:

 HI Nabeel,

  I have a Cloudera VM that has both Spark and Shark installed in it.
  You can download it and play around with it. I also have some sample data in
  HDFS and some tables.

  You can try out those examples. Instructions on how to use it are in the
  docs.


 https://drive.google.com/file/d/0B0Q4Le4DZj5iSndIcFBfQlcxM1NlV3RNN3YzU1dOT1ZjZHJJ/edit?usp=sharing

  But for the AmpCamp exercises, you only need EC2 to get the wiki data onto
  your HDFS. For that I have uploaded the file (50 MB). Just download it, put
  it on HDFS, and you can work through these exercises...


 https://drive.google.com/a/mobipulse.in/uc?id=0B0Q4Le4DZj5iNUdSZXpFTUJEU0Eexport=download

 You will love it...

 Regards,
 Arpit Tak


 On Tue, Apr 15, 2014 at 4:28 AM, Nabeel Memon nm3...@gmail.com wrote:

 Hi. I found AmpCamp exercises as a nice way to get started with spark.
 However they require amazon ec2 access. Has anyone put together any VM or
 docker scripts to have the same environment locally to work out those labs?

 It'll be really helpful. Thanks.






Failed to run count?

2014-04-23 Thread Ian Ferreira
I am getting this cryptic error running LinearRegressionWithSGD.

Data sample
LabeledPoint(39.0, [144.0, 1521.0, 20736.0, 59319.0, 2985984.0])

14/04/23 15:15:34 INFO SparkContext: Starting job: first at
GeneralizedLinearAlgorithm.scala:121
14/04/23 15:15:34 INFO DAGScheduler: Got job 2 (first at
GeneralizedLinearAlgorithm.scala:121) with 1 output partitions
(allowLocal=true)
14/04/23 15:15:34 INFO DAGScheduler: Final stage: Stage 2 (first at
GeneralizedLinearAlgorithm.scala:121)
14/04/23 15:15:34 INFO DAGScheduler: Parents of final stage: List()
14/04/23 15:15:34 INFO DAGScheduler: Missing parents: List()
14/04/23 15:15:34 INFO DAGScheduler: Computing the requested partition
locally
14/04/23 15:15:34 INFO HadoopRDD: Input split:
file:/Users/iferreira/data/test.csv:0+104
14/04/23 15:15:34 INFO SparkContext: Job finished: first at
GeneralizedLinearAlgorithm.scala:121, took 0.030158 s
14/04/23 15:15:34 INFO SparkContext: Starting job: count at
GradientDescent.scala:137
14/04/23 15:15:34 INFO DAGScheduler: Got job 3 (count at
GradientDescent.scala:137) with 2 output partitions (allowLocal=false)
14/04/23 15:15:34 INFO DAGScheduler: Final stage: Stage 3 (count at
GradientDescent.scala:137)
14/04/23 15:15:34 INFO DAGScheduler: Parents of final stage: List()
14/04/23 15:15:34 INFO DAGScheduler: Missing parents: List()
14/04/23 15:15:34 INFO DAGScheduler: Submitting Stage 3 (MappedRDD[7] at map
at GeneralizedLinearAlgorithm.scala:139), which has no missing parents
14/04/23 15:15:35 INFO DAGScheduler: Failed to run count at
GradientDescent.scala:137

Any clues what may trigger this error, overflow?
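
For reference, a minimal sketch of the usual way this is wired up from a CSV like the one above; the parsing, the iteration count and the vector-based LabeledPoint are assumptions (current master API), not Ian's actual code:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// sc is the usual SparkContext; rows are "label,f1,f2,..."
val data = sc.textFile("/Users/iferreira/data/test.csv").map { line =>
  val parts = line.split(",").map(_.trim.toDouble)
  LabeledPoint(parts.head, Vectors.dense(parts.tail))
}.cache()

val model = LinearRegressionWithSGD.train(data, 100)  // 100 iterations

With features spanning 144 to roughly 3e6 it is generally worth scaling them before SGD, though that is separate from the failed count itself.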






RE:

2014-04-23 Thread Buttler, David
This sounds like a configuration issue.  Either you have not set the MASTER 
correctly, or possibly another process is using up all of the cores
Dave
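
(Worth noting: the worker log quoted below registers with only 64.0 MB RAM, which is less than the 512 MB an executor asks for by default, and that alone produces the "Initial job has not accepted any resources" warning.) A minimal sketch of pinning the relevant settings down explicitly; the values are illustrative only:

// Scala equivalent of the settings involved; for the pyspark script the master
// can also be supplied as MASTER=spark://hadoop-pg-5.cluster:7077 ./bin/pyspark ...
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://hadoop-pg-5.cluster:7077")
  .setAppName("pi")
  .set("spark.executor.memory", "512m")   // must fit inside SPARK_WORKER_MEMORY on each worker
val sc = new SparkContext(conf)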

From: ge ko [mailto:koenig@gmail.com]
Sent: Sunday, April 13, 2014 12:51 PM
To: user@spark.apache.org
Subject:


Hi,

I'm just getting started with Spark and have installed the parcels in our
CDH5 GA cluster.



Master: hadoop-pg-5.cluster, Worker: hadoop-pg-7.cluster

Since the advice I found was to use FQDNs, the settings above seem reasonable
to me.



Both daemons are running, Master-Web-UI shows the connected worker, and the log 
entries show:

master:

2014-04-13 21:26:40,641 INFO Remoting: Starting remoting
2014-04-13 21:26:40,930 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://sparkMaster@hadoop-pg-5.cluster:7077]
2014-04-13 21:26:41,356 INFO org.apache.spark.deploy.master.Master: Starting 
Spark master at spark://hadoop-pg-5.cluster:7077
...

2014-04-13 21:26:41,439 INFO org.eclipse.jetty.server.AbstractConnector: 
Started 
SelectChannelConnector@0.0.0.0:18080http://SelectChannelConnector@0.0.0.0:18080
2014-04-13 21:26:41,441 INFO org.apache.spark.deploy.master.ui.MasterWebUI: 
Started Master web UI at http://hadoop-pg-5.cluster:18080
2014-04-13 21:26:41,476 INFO org.apache.spark.deploy.master.Master: I have been 
elected leader! New state: ALIVE

2014-04-13 21:27:40,319 INFO org.apache.spark.deploy.master.Master: Registering 
worker hadoop-pg-5.cluster:7078 with 2 cores, 64.0 MB RAM



worker:

2014-04-13 21:27:39,037 INFO akka.event.slf4j.Slf4jLogger: Slf4jLogger started
2014-04-13 21:27:39,136 INFO Remoting: Starting remoting
2014-04-13 21:27:39,413 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://sparkWorker@hadoop-pg-7.cluster:7078]
2014-04-13 21:27:39,706 INFO org.apache.spark.deploy.worker.Worker: Starting 
Spark worker hadoop-pg-7.cluster:7078 with 2 cores, 64.0 MB RAM
2014-04-13 21:27:39,708 INFO org.apache.spark.deploy.worker.Worker: Spark home: 
/opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/spark
...

2014-04-13 21:27:39,888 INFO org.eclipse.jetty.server.AbstractConnector: 
Started 
SelectChannelConnector@0.0.0.0:18081http://SelectChannelConnector@0.0.0.0:18081
2014-04-13 21:27:39,889 INFO org.apache.spark.deploy.worker.ui.WorkerWebUI: 
Started Worker web UI at http://hadoop-pg-7.cluster:18081
2014-04-13 21:27:39,890 INFO org.apache.spark.deploy.worker.Worker: Connecting 
to master spark://hadoop-pg-5.cluster:7077...
2014-04-13 21:27:40,360 INFO org.apache.spark.deploy.worker.Worker: 
Successfully registered with master spark://hadoop-pg-5.cluster:7077



Looks good, so far.



Now I want to execute the python pi example by executing (on the worker):

cd /opt/cloudera/parcels/CDH/lib/spark && ./bin/pyspark ./python/examples/pi.py 
spark://hadoop-pg-5.cluster:7077



Here the strange thing happens: the script doesn't get executed; it hangs 
(repeating this output forever) at:



14/04/13 21:31:03 WARN TaskSchedulerImpl: Initial job has not accepted any 
resources; check your cluster UI to ensure that workers are registered and have 
sufficient memory
14/04/13 21:31:18 WARN TaskSchedulerImpl: Initial job has not accepted any 
resources; check your cluster UI to ensure that workers are registered and have 
sufficient memory



The whole log is:





14/04/13 21:30:44 INFO Slf4jLogger: Slf4jLogger started
14/04/13 21:30:45 INFO Remoting: Starting remoting
14/04/13 21:30:45 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://spark@hadoop-pg-7.cluster:50601]
14/04/13 21:30:45 INFO Remoting: Remoting now listens on addresses: 
[akka.tcp://spark@hadoop-pg-7.cluster:50601]
14/04/13 21:30:45 INFO SparkEnv: Registering BlockManagerMaster
14/04/13 21:30:45 INFO DiskBlockManager: Created local directory at 
/tmp/spark-local-20140413213045-acec
14/04/13 21:30:45 INFO MemoryStore: MemoryStore started with capacity 294.9 MB.
14/04/13 21:30:45 INFO ConnectionManager: Bound socket to port 57506 with id = 
ConnectionManagerId(hadoop-pg-7.cluster,57506)
14/04/13 21:30:45 INFO BlockManagerMaster: Trying to register BlockManager
14/04/13 21:30:45 INFO BlockManagerMasterActor$BlockManagerInfo: Registering 
block manager hadoop-pg-7.cluster:57506 with 294.9 MB RAM
14/04/13 21:30:45 INFO BlockManagerMaster: Registered BlockManager
14/04/13 21:30:45 INFO HttpServer: Starting HTTP Server
14/04/13 21:30:45 INFO HttpBroadcast: Broadcast server started at 
http://10.147.210.7:51224
14/04/13 21:30:45 INFO SparkEnv: Registering MapOutputTracker
14/04/13 21:30:45 INFO HttpFileServer: HTTP File server directory is 
/tmp/spark-f9ab98c8-2adf-460a-9099-6dc07c7dc89f
14/04/13 21:30:45 INFO HttpServer: Starting HTTP Server
14/04/13 21:30:46 INFO SparkUI: Started Spark Web UI at 
http://hadoop-pg-7.cluster:4040
14/04/13 21:30:46 INFO AppClient$ClientActor: Connecting to master 
spark://hadoop-pg-5.cluster:7077...
14/04/13 21:30:47 INFO SparkDeploySchedulerBackend: Connected to Spark cluster 
with app ID 

Re: GraphX: Help understanding the limitations of Pregel

2014-04-23 Thread Tom Vacek
Here are some out-of-the-box ideas:  If the elements lie in a fairly small
range and/or you're willing to work with limited precision, you could use
counting sort.  Moreover, you could iteratively find the median using
bisection, which would be associative and commutative.  It's easy to think
of improvements that would make this approach give a reasonable answer in a
few iterations.  I have no idea about mixing algorithmic iterations with
median-finding iterations.
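
To make the bisection idea concrete, a small sketch over a plain RDD[Double]; each pass only needs a count of values below the midpoint, which is exactly the kind of associative and commutative quantity a combiner can merge (names below are made up):

import org.apache.spark.rdd.RDD

// approximate median by bisecting on the value range; `iters` bounds the error
def approxMedian(xs: RDD[Double], iters: Int = 20): Double = {
  var lo = xs.reduce(math.min)
  var hi = xs.reduce(math.max)
  val half = xs.count() / 2.0
  for (_ <- 1 to iters) {
    val mid = (lo + hi) / 2.0
    // associative & commutative step: count how many values fall below mid
    if (xs.filter(_ < mid).count() < half) lo = mid else hi = mid
  }
  (lo + hi) / 2.0
}

In the graph setting the same counting step could be folded into mergeMsg, at the cost of one Pregel round per bisection step.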


On Wed, Apr 23, 2014 at 8:20 PM, Ryan Compton compton.r...@gmail.comwrote:

 I'm trying shoehorn a label propagation-ish algorithm into GraphX. I
 need to update each vertex with the median value of their neighbors.
 Unlike PageRank, which updates each vertex with the mean of their
 neighbors, I don't have a simple commutative and associative function
 to use for mergeMsg.

 What are my options? It looks like I can choose between:

 1. a hacky mergeMsg (i.e. combine a,b -> Array(a,b) and then do the
 median in vprog)
 2. collectNeighbors and then median
 3. ignore GraphX and just do the whole thing with joins (which I
 actually got working, but its slow)

 Is there another possibility that I'm missing?



Re: GraphX: Help understanding the limitations of Pregel

2014-04-23 Thread Ryan Compton
Whoops, I should have mentioned that it's a multivariate median (cf
http://www.pnas.org/content/97/4/1423.full.pdf ). It's easy to compute
when all the values are accessible at once. I'm not sure it's possible
with a combiner. So, I guess the question should be: Can I use
GraphX's Pregel without a combiner?

On Wed, Apr 23, 2014 at 7:01 PM, Tom Vacek minnesota...@gmail.com wrote:
 Here are some out-of-the-box ideas:  If the elements lie in a fairly small
 range and/or you're willing to work with limited precision, you could use
 counting sort.  Moreover, you could iteratively find the median using
 bisection, which would be associative and commutative.  It's easy to think
 of improvements that would make this approach give a reasonable answer in a
 few iterations.  I have no idea about mixing algorithmic iterations with
 median-finding iterations.


 On Wed, Apr 23, 2014 at 8:20 PM, Ryan Compton compton.r...@gmail.com
 wrote:

 I'm trying shoehorn a label propagation-ish algorithm into GraphX. I
 need to update each vertex with the median value of their neighbors.
 Unlike PageRank, which updates each vertex with the mean of their
 neighbors, I don't have a simple commutative and associative function
 to use for mergeMsg.

 What are my options? It looks like I can choose between:

 1. a hacky mergeMsg (i.e. combine a,b - Array(a,b) and then do the
 median in vprog)
 2. collectNeighbors and then median
 3. ignore GraphX and just do the whole thing with joins (which I
 actually got working, but its slow)

 Is there another possibility that I'm missing?




Re: GraphX: Help understanding the limitations of Pregel

2014-04-23 Thread Ankur Dave
If you need access to all message values in vprog, there's nothing wrong
with building up an array in mergeMsg (option #1). This is what
org.apache.spark.graphx.lib.TriangleCount does, though with sets instead of
arrays. There will be a performance penalty because of the communication,
but it sounds like that's unavoidable here.

Ankur http://www.ankurdave.com/

On Wed, Apr 23, 2014 at 8:20 PM, Ryan Compton compton.r...@gmail.com
 wrote:

 1. a hacky mergeMsg (i.e. combine a,b - Array(a,b) and then do the
 median in vprog)
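
To make that concrete, a small self-contained sketch of option #1: concatenate in mergeMsg and take the median in vprog. For brevity the vertex values here are scalars and the median helper is the ordinary one; a multivariate median would swap out the helper but keep the same wiring. The toy graph and attribute choice are made up:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

object MedianPregelSketch {
  def median(xs: Array[Double]): Double = {
    val s = xs.sorted
    if (s.length % 2 == 1) s(s.length / 2) else (s(s.length / 2 - 1) + s(s.length / 2)) / 2.0
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("median-pregel"))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
    val graph = Graph.fromEdges(edges, 0.0).mapVertices((id, _) => id.toDouble)

    val result = Pregel(graph, Array.empty[Double], maxIterations = 5)(
      vprog = (id, attr, msgs) => if (msgs.isEmpty) attr else median(msgs),
      // send the value both ways so every vertex hears from all its neighbors
      sendMsg = t => Iterator((t.dstId, Array(t.srcAttr)), (t.srcId, Array(t.dstAttr))),
      mergeMsg = (a, b) => a ++ b   // just concatenate; the median happens in vprog
    )
    result.vertices.collect().foreach(println)
    sc.stop()
  }
}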



Re: about rdd.filter()

2014-04-23 Thread randylu

14/04/23 17:17:40 INFO DAGScheduler: Failed to run collect at
SparkListDocByTopic.scala:407
Exception in thread main java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:40)
at
org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: SparkListDocByTopic$EnvParameter
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1013)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1002)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1000)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1000)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:772)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$16$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:892)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$16$$anonfun$apply$1.apply(DAGScheduler.scala:889)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$16$$anonfun$apply$1.apply(DAGScheduler.scala:889)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$16.apply(DAGScheduler.scala:889)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$16.apply(DAGScheduler.scala:888)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:888)
at
org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:592)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:143)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/about-rdd-filter-tp4657p4717.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: about rdd.filter()

2014-04-23 Thread randylu
@Cheng Lian-2, Sourav Chandra, thanks very much.
   You are right! The situation is just like what you said. So nice!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/about-rdd-filter-tp4657p4718.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


how to set spark.executor.memory and heap size

2014-04-23 Thread wxhsdp
hi,
i'm testing SimpleApp.scala in standalone mode with only one PC, so I have
one master and one local worker on the same machine.

with a rather small input file (4.5 KB), I get a
java.lang.OutOfMemoryError: Java heap space error

here's my settings:
spark-env.sh:
export SPARK_MASTER_IP=127.0.0.1
export SPARK_WORKER_CORES=1
export SPARK_WORKER_MEMORY=2g
export SPARK_JAVA_OPTS+=" -Xms512m -Xmx512m"  //(1)

SimpleApp.scala:
val conf = new SparkConf()
  .setMaster("spark://127.0.0.1:7077")
  .setAppName("Simple App")
  .set("spark.executor.memory", "1g")  //(2)
val sc = new SparkContext(conf)

sbt:
SBT_OPTS="-Xms512M -Xmx512M" //(3)
java $SBT_OPTS -jar `dirname $0`/sbt-launch.jar "$@"

i'm confused by the above (1)(2)(3) settings; I've tried several different
options, but all failed with java.lang.OutOfMemoryError :(

what's the difference between the JVM heap size and spark.executor.memory, and
how do I set them?

i've read some docs and still cannot fully understand:

spark.executor.memory: Amount of memory to use per executor process, in the
same format as JVM memory strings (e.g. 512m, 2g).

spark.storage.memoryFraction: Fraction of Java heap to use for Spark's
memory cache.

so the memory cache = spark.storage.memoryFraction (0.6) * spark.executor.memory

does that mean spark.executor.memory = JVM heap size?

here's the logs:
[info] Running SimpleApp 
14/04/24 10:59:41 WARN util.Utils: Your hostname, ubuntu resolves to a
loopback address: 127.0.1.1; using 192.168.0.113 instead (on interface eth0)
14/04/24 10:59:41 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to
another address
14/04/24 10:59:42 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/04/24 10:59:42 INFO Remoting: Starting remoting
14/04/24 10:59:42 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://spark@ubuntu.local:46864]
14/04/24 10:59:42 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://spark@ubuntu.local:46864]
14/04/24 10:59:42 INFO spark.SparkEnv: Registering BlockManagerMaster
14/04/24 10:59:42 INFO storage.DiskBlockManager: Created local directory at
/tmp/spark-local-20140424105942-362c
14/04/24 10:59:42 INFO storage.MemoryStore: MemoryStore started with
capacity 297.0 MB.
14/04/24 10:59:42 INFO network.ConnectionManager: Bound socket to port 34146
with id = ConnectionManagerId(ubuntu.local,34146)
14/04/24 10:59:42 INFO storage.BlockManagerMaster: Trying to register
BlockManager
14/04/24 10:59:42 INFO storage.BlockManagerMasterActor$BlockManagerInfo:
Registering block manager ubuntu.local:34146 with 297.0 MB RAM
14/04/24 10:59:42 INFO storage.BlockManagerMaster: Registered BlockManager
14/04/24 10:59:43 INFO spark.HttpServer: Starting HTTP Server
14/04/24 10:59:43 INFO server.Server: jetty-7.6.8.v20121106
14/04/24 10:59:43 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:58936
14/04/24 10:59:43 INFO broadcast.HttpBroadcast: Broadcast server started at
http://192.168.0.113:58936
14/04/24 10:59:43 INFO spark.SparkEnv: Registering MapOutputTracker
14/04/24 10:59:43 INFO spark.HttpFileServer: HTTP File server directory is
/tmp/spark-ce78fc2c-097d-4053-991d-b6bf140d6c33
14/04/24 10:59:43 INFO spark.HttpServer: Starting HTTP Server
14/04/24 10:59:43 INFO server.Server: jetty-7.6.8.v20121106
14/04/24 10:59:43 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:56414
14/04/24 10:59:43 INFO server.Server: jetty-7.6.8.v20121106
14/04/24 10:59:43 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/storage/rdd,null}
14/04/24 10:59:43 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/storage,null}
14/04/24 10:59:43 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/stages/stage,null}
14/04/24 10:59:43 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/stages/pool,null}
14/04/24 10:59:43 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/stages,null}
14/04/24 10:59:43 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/environment,null}
14/04/24 10:59:43 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/executors,null}
14/04/24 10:59:43 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/metrics/json,null}
14/04/24 10:59:43 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/static,null}
14/04/24 10:59:43 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/,null}
14/04/24 10:59:43 INFO server.AbstractConnector: Started
SelectChannelConnector@0.0.0.0:4040
14/04/24 10:59:43 INFO ui.SparkUI: Started Spark Web UI at
http://ubuntu.local:4040
14/04/24 10:59:43 INFO client.AppClient$ClientActor: Connecting to master
spark://127.0.0.1:7077...
14/04/24 10:59:44 INFO cluster.SparkDeploySchedulerBackend: Connected to
Spark cluster with app ID app-20140424105944-0001
14/04/24 10:59:44 INFO client.AppClient$ClientActor: Executor added:
app-20140424105944-0001/0 on worker-20140424105022-ubuntu.local-40058
(ubuntu.local:40058) with 1 cores
14/04/24 10:59:44 INFO cluster.SparkDeploySchedulerBackend: Granted 

Re: how to set spark.executor.memory and heap size

2014-04-23 Thread wxhsdp
by the way, the code runs ok in the spark shell



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-spark-executor-memory-and-heap-size-tp4719p4720.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: how to set spark.executor.memory and heap size

2014-04-23 Thread Adnan Yaqoob
When I was testing Spark, I faced this issue. It is not related to a
memory shortage; it happens because your configuration is not correct. Try
passing your current jar to the SparkContext with SparkConf's setJars
function and try again.
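
For example, a rough sketch of the driver setup with both the jar and the executor memory spelled out; the jar path is whatever `sbt package` produced for your project (hypothetical below). Note that spark.executor.memory is the heap of each executor JVM on the workers (and what spark.storage.memoryFraction is taken from), while SBT_OPTS only sizes the JVM that runs sbt and your driver:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://127.0.0.1:7077")
  .setAppName("Simple App")
  .set("spark.executor.memory", "1g")   // executor (worker-side) heap; keep it <= SPARK_WORKER_MEMORY
  .setJars(Seq("target/scala-2.10/simple-app_2.10-1.0.jar"))  // hypothetical path from `sbt package`
val sc = new SparkContext(conf)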

On Thu, Apr 24, 2014 at 8:38 AM, wxhsdp wxh...@gmail.com wrote:

 by the way, codes run ok in spark shell



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-spark-executor-memory-and-heap-size-tp4719p4720.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Access Last Element of RDD

2014-04-23 Thread Sai Prasanna
Hi all, some help!
RDD.first or RDD.take(1) gives the first item; is there a straightforward
way to access the last element in a similar way?

I couldn't find a tail/last method for RDD!


Re: Access Last Element of RDD

2014-04-23 Thread Adnan Yaqoob
You can use following code:

RDD.take(RDD.count())


On Thu, Apr 24, 2014 at 9:51 AM, Sai Prasanna ansaiprasa...@gmail.comwrote:

 Hi All, Some help !
 RDD.first or RDD.take(1) gives the first item, is there a straight forward
 way to access the last element in a similar way ?

 I coudnt fine a tail/last method for RDD. !!



Re: Access Last Element of RDD

2014-04-23 Thread Sai Prasanna
Oh ya, Thanks Adnan.


On Thu, Apr 24, 2014 at 10:30 AM, Adnan Yaqoob nsyaq...@gmail.com wrote:

 You can use following code:

 RDD.take(RDD.count())


 On Thu, Apr 24, 2014 at 9:51 AM, Sai Prasanna ansaiprasa...@gmail.comwrote:

 Hi All, Some help !
 RDD.first or RDD.take(1) gives the first item, is there a straight
 forward way to access the last element in a similar way ?

 I coudnt fine a tail/last method for RDD. !!





Re: Access Last Element of RDD

2014-04-23 Thread Sai Prasanna
Adnan, but RDD.take(RDD.count()) returns all the elements of the RDD.

I want only to access the last element.


On Thu, Apr 24, 2014 at 10:33 AM, Sai Prasanna ansaiprasa...@gmail.comwrote:

 Oh ya, Thanks Adnan.


 On Thu, Apr 24, 2014 at 10:30 AM, Adnan Yaqoob nsyaq...@gmail.com wrote:

 You can use following code:

 RDD.take(RDD.count())


 On Thu, Apr 24, 2014 at 9:51 AM, Sai Prasanna ansaiprasa...@gmail.comwrote:

 Hi All, Some help !
 RDD.first or RDD.take(1) gives the first item, is there a straight
 forward way to access the last element in a similar way ?

 I coudnt fine a tail/last method for RDD. !!






Re: Access Last Element of RDD

2014-04-23 Thread Adnan Yaqoob
This call returns a Scala collection (an Array); you can use its last
function to get the last element.

For example:

RDD.take(RDD.count()).last


On Thu, Apr 24, 2014 at 10:28 AM, Sai Prasanna ansaiprasa...@gmail.comwrote:

 Adnan, but RDD.take(RDD.count()) returns all the elements of the RDD.

 I want only to access the last element.


 On Thu, Apr 24, 2014 at 10:33 AM, Sai Prasanna ansaiprasa...@gmail.comwrote:

 Oh ya, Thanks Adnan.


 On Thu, Apr 24, 2014 at 10:30 AM, Adnan Yaqoob nsyaq...@gmail.comwrote:

 You can use following code:

 RDD.take(RDD.count())


 On Thu, Apr 24, 2014 at 9:51 AM, Sai Prasanna 
 ansaiprasa...@gmail.comwrote:

 Hi All, Some help !
 RDD.first or RDD.take(1) gives the first item, is there a straight
 forward way to access the last element in a similar way ?

 I coudnt fine a tail/last method for RDD. !!







Re: Access Last Element of RDD

2014-04-23 Thread Sai Prasanna
What I observe is that this way of computing is very inefficient. It returns
all the elements of the RDD to a List, which takes a considerable amount of
time, and only then picks the last element.

I have a file of size 3 GB on which I ran a lot of aggregate operations, and
none of them took as long as this take(RDD.count) did.

Is there an efficient way? My guess is there should be one, since it's a
basic operation.
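
One cheaper pattern: instead of shipping the whole RDD to the driver (and note that take expects an Int, so the snippet above would also need RDD.count().toInt), return just one candidate per partition and pick the last of those, since collect preserves partition order. A rough sketch, not a built-in API:

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// last element per partition, then the last of those; returns None for an empty RDD
def lastElement[T: ClassTag](rdd: RDD[T]): Option[T] =
  rdd.mapPartitions { iter =>
    if (iter.hasNext) Iterator(iter.reduceLeft((_, b) => b)) else Iterator.empty
  }.collect().lastOption

This moves at most one element per partition to the driver instead of the entire data set.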


On Thu, Apr 24, 2014 at 11:14 AM, Adnan Yaqoob nsyaq...@gmail.com wrote:

 This function will return scala List, you can use List's last function to
 get the last element.

 For example:

 RDD.take(RDD.count()).last


 On Thu, Apr 24, 2014 at 10:28 AM, Sai Prasanna ansaiprasa...@gmail.comwrote:

 Adnan, but RDD.take(RDD.count()) returns all the elements of the RDD.

 I want only to access the last element.


 On Thu, Apr 24, 2014 at 10:33 AM, Sai Prasanna 
 ansaiprasa...@gmail.comwrote:

 Oh ya, Thanks Adnan.


 On Thu, Apr 24, 2014 at 10:30 AM, Adnan Yaqoob nsyaq...@gmail.comwrote:

 You can use following code:

 RDD.take(RDD.count())


 On Thu, Apr 24, 2014 at 9:51 AM, Sai Prasanna 
 ansaiprasa...@gmail.comwrote:

 Hi All, Some help !
 RDD.first or RDD.take(1) gives the first item, is there a straight
 forward way to access the last element in a similar way ?

 I coudnt fine a tail/last method for RDD. !!