Re: Setting queue for spark job on yarn

2014-05-21 Thread Ron Gonzalez
Btw, I'm on 0.9.1. Will setting a queue programmatically be available in 1.0?

Thanks,
Ron

Sent from my iPad

 On May 20, 2014, at 6:27 PM, Ron Gonzalez zlgonza...@yahoo.com wrote:
 
 Hi Sandy,
   Is there a programmatic way? We're building a platform as a service and 
 need to assign it to different queues that can provide different scheduler 
 approaches.
 
 Thanks,
 Ron
 
 Sent from my iPhone
 
 On May 20, 2014, at 1:30 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
 
 Hi Ron,
 
 What version are you using?  For 0.9, you need to set it outside your code 
 with the SPARK_YARN_QUEUE environment variable.
 
 -Sandy
 
 
 On Mon, May 19, 2014 at 9:29 PM, Ron Gonzalez zlgonza...@yahoo.com wrote:
 Hi,
   How does one submit a spark job to yarn and specify a queue?
   The code that successfully submits to yarn is:
 
val conf = new SparkConf()
val sc = new SparkContext("yarn-client", "Simple App", conf)
 
Where do I need to specify the queue?
 
   Thanks in advance for any help on this...
 
 Thanks,
 Ron
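
For reference, a minimal sketch of the 0.9 approach Sandy describes above; the
queue name and the launch command are illustrative, not from the original thread:

export SPARK_YARN_QUEUE=analytics
# then launch the driver that creates the yarn-client SparkContext as usual, e.g.
# sbt "run-main com.example.SimpleApp"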
 


MLlib ALS-- Errors communicating with MapOutputTracker

2014-05-21 Thread Sue Cai
Hello,

I am currently using MLlib ALS to process a large volume of data, about 1.2
billion Rating(userId, productId, rates) triples. The dataset was separated
into 4000 partitions for parallelized computation on our YARN clusters.

I encountered the error "Error communicating with MapOutputTracker"
when trying to get the prediction rates [model.predict(usersProducts)] after
the iterations.

val predictions = model.predict(usersProducts).map {
  case Rating(user, product, rate) => ((user, product), rate)
}

I tried to separate the iteration process from the process of calculating the
prediction rate values by storing the two feature matrices into the file system
first and then loading them for prediction. This time, the error occurred at
the stage of loading userFeatures.

userfData: userId:[0.3,0.5,0.002,.]

val userfTuple = userfData.map {
  case (line) => {
    val arr = line.split(splitmark_1)
    val featureArr = arr(1).split(splitmark_2)
    (arr(0), featureArr)
  }
}

Here is part of the log:


14-05-21 14:37:17 WARN [Result resolver thread-0] TaskSetManager: Loss was
due to org.apache.spark.SparkException
org.apache.spark.SparkException: Error communicating with MapOutputTracker
at
org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:79)
at
org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:126)
at
org.apache.spark.BlockStoreShuffleFetcher.fetch(BlockStoreShuffleFetcher.scala:43)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:61)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:244)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:235)
at
org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:90)
at
org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:89)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:727)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:723)
at
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:884)
at
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:884)
at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
at org.apache.spark.scheduler.Task.run(Task.scala:53)
at
org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:220)
at
org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:184)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)


---

I have tried several methods to solve this problem: one was to decrease
the number of partitions (from 4000 to 3000), another was to increase the
memory of the masters. Both worked, but it is still important to track down
the underlying cause, right?

Could anyone help me check this problem? Thanks a lot.
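
A hedged sketch of the partition-count workaround mentioned above; rawRatings,
rank, numIterations and lambda are placeholders for the values actually used:

import org.apache.spark.mllib.recommendation.ALS

val ratings = rawRatings.repartition(3000)   // fewer, larger partitions than the original 4000
val model = ALS.train(ratings, rank, numIterations, lambda)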

Sue Cai



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-ALS-Errors-communicating-with-MapOutputTracker-tp6160.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: any way to control memory usage when streaming input's speed is faster than the speed of handled by spark streaming ?

2014-05-21 Thread Tathagata Das
Unfortunately, there is no API support for this right now. You could
implement it yourself by implementing your own receiver and controlling the
rate at which objects are received. If you are using any of the standard
receivers (Flume, Kafka, etc.), I recommend looking at the source code of
the corresponding receiver and making your own version of

Alternatively, there is an open JIRA
(https://issues.apache.org/jira/browse/SPARK-1341) about implementing
this functionality. You could give it a shot at implementing this in a
generic way so that it can be used for all receivers ;)




On Tue, May 20, 2014 at 8:31 PM, Francis.Hu francis...@reachjunction.com wrote:

  sparkers,



 Is there a better way to control memory usage when the streaming input's speed
 is faster than the speed at which it can be handled by Spark Streaming?



 Thanks,

 Francis.Hu



Re: any way to control memory usage when streaming input's speed is faster than the speed of handled by spark streaming ?

2014-05-21 Thread Tathagata Das
Apologies for the premature send.

Unfortunately, there is no API support for this right now. You could
implement it yourself by implementing your own receiver and controlling the
rate at which objects are received. If you are using any of the standard
receivers (Flume, Kafka, etc.), I recommend looking at the source code of
the corresponding receiver and making your own version of the Flume receiver /
Kafka receiver.

Alternatively, there is an open JIRA
(https://issues.apache.org/jira/browse/SPARK-1341) about implementing
this functionality. You could give it a shot at implementing this in a
generic way so that it can be used for all receivers ;)

TD
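
For illustration, a rough sketch of what a rate-limited custom receiver could
look like, assuming the Spark 1.0 receiver API
(org.apache.spark.streaming.receiver.Receiver); fetchNextRecord() is a
placeholder for pulling one record from your actual source:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class ThrottledReceiver(maxPerSecond: Int)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK) {

  override def onStart(): Unit = {
    new Thread("throttled-receiver") { override def run(): Unit = receive() }.start()
  }

  override def onStop(): Unit = {}

  private def receive(): Unit = {
    while (!isStopped()) {
      val start = System.currentTimeMillis()
      var sent = 0
      while (sent < maxPerSecond && !isStopped()) {
        store(fetchNextRecord())   // store() hands the record over to Spark Streaming
        sent += 1
      }
      val elapsed = System.currentTimeMillis() - start
      if (elapsed < 1000) Thread.sleep(1000 - elapsed)   // wait out the rest of the second
    }
  }

  private def fetchNextRecord(): String = ???   // hypothetical: read one record from the source
}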

On Wed, May 21, 2014 at 12:57 AM, Tathagata Das tathagata.das1...@gmail.com
 wrote:

 Unfortunately, there is no API support for this right now. You could
 implement it yourself by implementing your own receiver and controlling the
 rate at which objects are received. If you are using any of the standard
 receivers (Flume, Kafka, etc.), I recommended looking at the source code of
 the corresponding receiver and making your own version of

  Alternatively, there is an open JIRA
  (https://issues.apache.org/jira/browse/SPARK-1341) about implementing
  this functionality. You could give it a shot at
  implementing this in a generic way so that it can be used for all receivers ;)




 On Tue, May 20, 2014 at 8:31 PM, Francis.Hu 
 francis...@reachjunction.com wrote:

  sparkers,



 Is there a better way to control memory usage when the streaming input's
 speed is faster than the speed at which it can be handled by Spark Streaming?



 Thanks,

 Francis.Hu





Re: advice on maintaining a production spark cluster?

2014-05-21 Thread sagi
If you saw an exception message like the one mentioned in the JIRA
https://issues.apache.org/jira/browse/SPARK-1886 in the worker's log
file, you are welcome to give https://github.com/apache/spark/pull/827 a try.




On Wed, May 21, 2014 at 11:21 AM, Josh Marcus jmar...@meetup.com wrote:

 Aaron:

 I see this in the Master's logs:

 14/05/20 01:17:37 INFO Master: Attempted to re-register worker at same
 address: akka.tcp://sparkwor...@hdn3.int.meetup.com:50038
 14/05/20 01:17:37 WARN Master: Got heartbeat from unregistered worker
 worker-20140520011737-hdn3.int.meetup.com-50038

 There was an executor that launched that did fail, such as:
 14/05/20 01:16:05 INFO Master: Launching executor
 app-20140520011605-0001/2 on worker
 worker-20140519155427-hdn3.int.meetup.com-50
 038
 14/05/20 01:17:37 INFO Master: Removing executor app-20140520011605-0001/2
 because it is FAILED

 ... but other executors on other machines also failed without permanently
 disassociating.

 There are these messages which I don't know if they are related:
 14/05/20 01:17:38 INFO LocalActorRef: Message
 [akka.remote.transport.AssociationHandle$Disassociated] from
 Actor[akka://sparkMaste
 r/deadLetters] to
 Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.
 6.19%3A47252-18#1027788678] was not delivered. [3] dead letters
 encountered. This logging can be turned off or adjusted with confi
 guration settings 'akka.log-dead-letters' and
 'akka.log-dead-letters-during-shutdown'.
 14/05/20 01:17:38 INFO LocalActorRef: Message
 [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
 Actor[akka
 ://sparkMaster/deadLetters] to
 Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkM
 aster%4010.3.6.19%3A47252-18#1027788678] was not delivered. [4] dead
 letters encountered. This logging can be turned off or adjust
 ed with configuration settings 'akka.log-dead-letters' and
 'akka.log-dead-letters-during-shutdown'.




 On Tue, May 20, 2014 at 10:13 PM, Aaron Davidson ilike...@gmail.com wrote:

 Unfortunately, those errors are actually due to an Executor that exited,
 such that the connection between the Worker and Executor failed. This is
 not a fatal issue, unless there are analogous messages from the Worker to
 the Master (which should be present, if they exist, at around the same
 point in time).

 Do you happen to have the logs from the Master that indicate that the
 Worker terminated? Is it just an Akka disassociation, or some exception?


 On Tue, May 20, 2014 at 12:53 PM, Sean Owen so...@cloudera.com wrote:

 This isn't helpful of me to say, but, I see the same sorts of problem
 and messages semi-regularly on CDH5 + 0.9.0. I don't have any insight
 into when it happens, but usually after heavy use and after running
 for a long time. I had figured I'd see if the changes since 0.9.0
 addressed it and revisit later.

 On Tue, May 20, 2014 at 8:37 PM, Josh Marcus jmar...@meetup.com wrote:
  So, for example, I have two disassociated worker machines at the
 moment.
  The last messages in the spark logs are akka association error
 messages,
  like the following:
 
  14/05/20 01:22:54 ERROR EndpointWriter: AssociationError
  [akka.tcp://sparkwor...@hdn3.int.meetup.com:50038] -
  [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]: Error
 [Association
  failed with [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]] [
  akka.remote.EndpointAssociationException: Association failed with
  [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]
  Caused by:
 
 akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
  Connection refused: hdn3.int.meetup.com/10.3.6.23:46288
  ]
 
  On the master side, there are lots and lots of messages of the form:
 
  14/05/20 15:36:58 WARN Master: Got heartbeat from unregistered worker
  worker-20140520011737-hdn3.int.meetup.com-50038
 
  --j
 
 






-- 
-
Best Regards


ClassNotFoundException with Spark/Mesos (spark-shell works fine)

2014-05-21 Thread Tobias Pfeiffer
Hi,

I have set up a cluster with Mesos (backed by Zookeeper) with three
master and three slave instances. I set up Spark (git HEAD) for use
with Mesos according to this manual:
http://people.apache.org/~pwendell/catalyst-docs/running-on-mesos.html

Using the spark-shell, I can connect to this cluster and do simple RDD
operations, but the same code in a Scala class and executed via sbt
run-main works only partially. (That is, count() works, count() after
flatMap() does not.)

Here is my code: https://gist.github.com/tgpfeiffer/7d20a4d59ee6e0088f91
The file SparkExamplesScript.scala, when pasted into spark-shell,
outputs the correct count() for the parallelized list comprehension,
as well as for the flatMapped RDD.

The file SparkExamplesMinimal.scala contains exactly the same code,
and also the MASTER configuration and the Spark Executor are the same.
However, while the count() for the parallelized list is displayed
correctly, I receive the following error when asking for the count()
of the flatMapped RDD:

-

14/05/21 09:47:49 INFO scheduler.DAGScheduler: Submitting Stage 1
(FlatMappedRDD[1] at flatMap at SparkExamplesMinimal.scala:34), which
has no missing parents
14/05/21 09:47:49 INFO scheduler.DAGScheduler: Submitting 8 missing
tasks from Stage 1 (FlatMappedRDD[1] at flatMap at
SparkExamplesMinimal.scala:34)
14/05/21 09:47:49 INFO scheduler.TaskSchedulerImpl: Adding task set
1.0 with 8 tasks
14/05/21 09:47:49 INFO scheduler.TaskSetManager: Starting task 1.0:0
as TID 8 on executor 20140520-102159-2154735808-5050-1108-1: mesos9-1
(PROCESS_LOCAL)
14/05/21 09:47:49 INFO scheduler.TaskSetManager: Serialized task 1.0:0
as 1779147 bytes in 37 ms
14/05/21 09:47:49 WARN scheduler.TaskSetManager: Lost TID 8 (task 1.0:0)
14/05/21 09:47:49 WARN scheduler.TaskSetManager: Loss was due to
java.lang.ClassNotFoundException
java.lang.ClassNotFoundException: spark.SparkExamplesMinimal$$anonfun$2
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at 
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:60)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:169)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

-

Can anyone explain to me where this comes from or how I might further
track the problem down?

Thanks,
Tobias


Re: Ignoring S3 0 files exception

2014-05-21 Thread Laurent T
No one has any idea? It's really troublesome; it seems like I have no way to
catch errors while an action is being processed and just ignore them. Here's
a bit more detail on what I'm doing:

JavaRDD a = sc.textFile("s3n://" + missingFilenamePattern);
JavaRDD b = sc.textFile("s3n://" + existingFilenamePattern);
JavaRDD aPlusB = a.union(b);
aPlusB.reduceByKey(MyReducer); // <-- This throws the error

I'd like to ignore the exception caused by a and process b without
trouble. Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Ignoring-S3-0-files-exception-tp6101p6163.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

RDD union of a window in Dstream

2014-05-21 Thread Laeeq Ahmed
Hi,

I want to do a union of all RDDs in each window of a DStream. I found DStream.union
and haven't seen anything like DStream.windowRDDUnion.

Is there any way around it?

I want to find the mean and SD of all values which come under each sliding window,
for which I need to union all the RDDs in each window. This is not a running
mean and SD.

Regards,
Laeeq
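
One way to get per-window statistics without a special union operation: window()
already produces, at each slide, a single RDD covering the whole window, so the
mean and SD can be read off rdd.stats(). A hedged sketch, assuming a DStream[Double]
named values and illustrative window/slide durations:

import org.apache.spark.SparkContext._          // DoubleRDDFunctions implicits (pre-1.3 style)
import org.apache.spark.streaming.Seconds

val windowed = values.window(Seconds(30), Seconds(10))   // one RDD per slide, covering the whole window
windowed.foreachRDD { rdd =>
  if (rdd.count() > 0) {
    val s = rdd.stats()                                   // StatCounter: count, mean, stdev, ...
    println("window mean = " + s.mean + ", stdev = " + s.stdev)
  }
}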

Log analysis

2014-05-21 Thread Shubhabrata
I am new to Spark and we are developing a data science pipeline based on
Spark on EC2. So far we have been using Python on a Spark standalone cluster.
However, being a newbie, I would like to know more about how I can do
program-level debugging from the Spark logs (is it stderr?). I find it a bit
difficult to debug since Spark itself writes many messages there. Any ideas or
suggestions regarding configuration changes to facilitate this would be highly
appreciated!
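
One configuration tweak that often helps (a hedged example, not specific to this
pipeline): copy conf/log4j.properties.template to conf/log4j.properties and raise
Spark's own log level so application output in the executors' stderr stands out.
The package name on the last line is a made-up placeholder for your own code:

log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# quiet Spark internals, keep your own code verbose
log4j.logger.org.apache.spark=WARN
log4j.logger.com.example.pipeline=DEBUG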



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Log-analysis-tp6168.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-21 Thread Madhu
Can you identify a specific file that fails?
There might be a real bug here, but I have found gzip to be reliable.
Every time I have run into a bad header error with gzip, I had a non-gzip
file with the wrong extension for whatever reason.
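
A quick way to hunt for a mislabeled file, assuming the files can be read from a
local path (for S3/HDFS you would copy a suspect batch down first): a real gzip
stream starts with the magic bytes 0x1f 0x8b. Paths here are illustrative.

import java.io.{File, FileInputStream}

def looksLikeGzip(f: File): Boolean = {
  val in = new FileInputStream(f)
  try { in.read() == 0x1f && in.read() == 0x8b } finally in.close()
}

new File("/data/logs").listFiles()
  .filter(_.getName.endsWith(".gz"))
  .filterNot(looksLikeGzip)
  .foreach(f => println("not actually gzip: " + f.getPath))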




-
Madhu
https://www.linkedin.com/in/msiddalingaiah
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/count-ing-gz-files-gives-java-io-IOException-incorrect-header-check-tp5768p6169.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


pyspark.rdd.ResultIterable?

2014-05-21 Thread T.J. Alumbaugh
Hi,

I'm noticing a difference between two installations of Spark. I'm pretty
sure both are version 0.9.1. One is able to import
pyspark.rdd.ResultIterable and the other isn't. Is this an environment
problem or do we actually have two different versions of Spark?  To be
clear, on one box, one can do:



In [1]: import pyspark

In [2]: pyspark.rdd.ResultIterable
Out[2]: pyspark.resultiterable.ResultIterable


while on the other pyspark.rdd.ResultIterable is not found. Any ideas?

-T.J.


Re: ClassNotFoundException with Spark/Mesos (spark-shell works fine)

2014-05-21 Thread Gerard Maas
Hi Tobias,

Regarding my comment on closure serialization:

I was discussing it with my fellow Sparkers here and I totally overlooked
the fact that you need the class files to de-serialize the closures (or
whatever) on the workers, so you always need the jar file delivered to the
workers in order for it to work.

The Spark REPL works differently. It uses some dark magic to send the
working session to the workers.

-kr, Gerard.





On Wed, May 21, 2014 at 2:47 PM, Gerard Maas gerard.m...@gmail.com wrote:

 Hi Tobias,

 I was curious about this issue and tried to run your example on my local
 Mesos. I was able to reproduce your issue using your current config:

 [error] (run-main-0) org.apache.spark.SparkException: Job aborted: Task
 1.0:4 failed 4 times (most recent failure: Exception failure:
 java.lang.ClassNotFoundException: spark.SparkExamplesMinimal$$anonfun$2)
 org.apache.spark.SparkException: Job aborted: Task 1.0:4 failed 4 times
 (most recent failure: Exception failure: java.lang.ClassNotFoundException:
 spark.SparkExamplesMinimal$$anonfun$2)
  at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)

 Creating a simple jar from the job and providing it through the
 configuration seems to solve it:

 val conf = new SparkConf()
   .setMaster("mesos://my_ip:5050/")
   .setJars(Seq("/sparkexample/target/scala-2.10/sparkexample_2.10-0.1.jar"))
   .setAppName("SparkExamplesMinimal")

 Resulting in:
  14/05/21 12:03:45 INFO scheduler.DAGScheduler: Completed ResultTask(1, 1)
 14/05/21 12:03:45 INFO scheduler.DAGScheduler: Stage 1 (count at
 SparkExamplesMinimal.scala:50) finished in 1.120 s
 14/05/21 12:03:45 INFO spark.SparkContext: Job finished: count at
 SparkExamplesMinimal.scala:50, took 1.177091435 s
 count: 100

 Why the closure serialization does not work with Mesos is beyond my
 current knowledge.
 Would be great to hear from the experts (cross-posting to dev for that)

 -kr, Gerard.













 On Wed, May 21, 2014 at 11:51 AM, Tobias Pfeiffer t...@preferred.jp wrote:

 Hi,

 I have set up a cluster with Mesos (backed by Zookeeper) with three
 master and three slave instances. I set up Spark (git HEAD) for use
 with Mesos according to this manual:
 http://people.apache.org/~pwendell/catalyst-docs/running-on-mesos.html

 Using the spark-shell, I can connect to this cluster and do simple RDD
 operations, but the same code in a Scala class and executed via sbt
 run-main works only partially. (That is, count() works, count() after
 flatMap() does not.)

 Here is my code: https://gist.github.com/tgpfeiffer/7d20a4d59ee6e0088f91
 The file SparkExamplesScript.scala, when pasted into spark-shell,
 outputs the correct count() for the parallelized list comprehension,
 as well as for the flatMapped RDD.

 The file SparkExamplesMinimal.scala contains exactly the same code,
 and also the MASTER configuration and the Spark Executor are the same.
 However, while the count() for the parallelized list is displayed
 correctly, I receive the following error when asking for the count()
 of the flatMapped RDD:

 -

 14/05/21 09:47:49 INFO scheduler.DAGScheduler: Submitting Stage 1
 (FlatMappedRDD[1] at flatMap at SparkExamplesMinimal.scala:34), which
 has no missing parents
 14/05/21 09:47:49 INFO scheduler.DAGScheduler: Submitting 8 missing
 tasks from Stage 1 (FlatMappedRDD[1] at flatMap at
 SparkExamplesMinimal.scala:34)
 14/05/21 09:47:49 INFO scheduler.TaskSchedulerImpl: Adding task set
 1.0 with 8 tasks
 14/05/21 09:47:49 INFO scheduler.TaskSetManager: Starting task 1.0:0
 as TID 8 on executor 20140520-102159-2154735808-5050-1108-1: mesos9-1
 (PROCESS_LOCAL)
 14/05/21 09:47:49 INFO scheduler.TaskSetManager: Serialized task 1.0:0
 as 1779147 bytes in 37 ms
 14/05/21 09:47:49 WARN scheduler.TaskSetManager: Lost TID 8 (task 1.0:0)
 14/05/21 09:47:49 WARN scheduler.TaskSetManager: Loss was due to
 java.lang.ClassNotFoundException
 java.lang.ClassNotFoundException: spark.SparkExamplesMinimal$$anonfun$2
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:270)
 at
 org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:60)
 at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
 at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
 at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 at
 

Re: Python, Spark and HBase

2014-05-21 Thread twizansk
Thanks Nick and Matei.   I'll take a look at the patch and keep you updated.

Tommer



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Python-Spark-and-HBase-tp6142p6176.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: advice on maintaining a production spark cluster?

2014-05-21 Thread Mark Hamstra
After the several fixes that we have made to exception handling in Spark
1.0.0, I expect that this behavior will be quite different from 0.9.1.
 Executors should be far more likely to shutdown cleanly in the event of
errors, allowing easier restarts.  But I expect that there will be more
bugs to fix in the next couple of maintenance releases.


On Wed, May 21, 2014 at 8:58 AM, Han JU ju.han.fe...@gmail.com wrote:

 I've also seen worker loss, and that's why I asked a question about worker
 re-spawn.

 My typical case is that some job got an OOM exception. Then on the master
 UI some worker's state becomes DEAD.
 In the master's log, there's error like:

 ```
 14/05/21 15:38:02 ERROR remote.EndpointWriter: AssociationError
 [akka.tcp://sparkmas...@ec2-23-20-189-111.compute-1.amazonaws.com:7077]
 - [akka.tcp://sparkWorker@ip-10-186-156-22.ec2.internal:38572]: Error
 [Association failed with
 [akka.tcp://sparkWorker@ip-10-186-156-22.ec2.internal:38572]] [
 akka.remote.EndpointAssociationException: Association failed with
 [akka.tcp://sparkWorker@ip-10-186-156-22.ec2.internal:38572]
 Caused by:
 akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
 Connection refused: ip-10-186-156-22.ec2.internal/10.186.156.22:38572
 ]
 14/05/21 15:38:02 INFO master.Master:
 akka.tcp://sparkWorker@ip-10-186-156-22.ec2.internal:38572 got
 disassociated, removing it.
 ```

 On the `DEAD` worker machine, there are 2 Spark processes, the worker and the
 executor backend:
   16280 org.apache.spark.deploy.worker.Worker
   25989 org.apache.spark.executor.CoarseGrainedExecutorBackend

 The bad thing is that in this case, sbin/stop-all.sh and
 sbin/start-all.sh cannot bring back the DEAD worker since the worker
 process cannot be terminated (maybe due to the executor backend). I have to
 log in and kill -9 both the worker process and the executor backend.

 I'm on 0.9.1 and using ec2-script.



 2014-05-21 11:42 GMT+02:00 sagi zhpeng...@gmail.com:

 if you saw some exception message like the JIRA
 https://issues.apache.org/jira/browse/SPARK-1886  mentioned in work's
 log file, you are welcome to have a try
 https://github.com/apache/spark/pull/827




 On Wed, May 21, 2014 at 11:21 AM, Josh Marcus jmar...@meetup.com wrote:

 Aaron:

 I see this in the Master's logs:

 14/05/20 01:17:37 INFO Master: Attempted to re-register worker at same
 address: akka.tcp://sparkwor...@hdn3.int.meetup.com:50038
 14/05/20 01:17:37 WARN Master: Got heartbeat from unregistered worker
 worker-20140520011737-hdn3.int.meetup.com-50038

 There was an executor that launched that did fail, such as:
 14/05/20 01:16:05 INFO Master: Launching executor
 app-20140520011605-0001/2 on worker
 worker-20140519155427-hdn3.int.meetup.com-50
 038
 14/05/20 01:17:37 INFO Master: Removing executor
 app-20140520011605-0001/2 because it is FAILED

 ... but other executors on other machines also failed without
 permanently disassociating.

 There are these messages which I don't know if they are related:
  14/05/20 01:17:38 INFO LocalActorRef: Message
 [akka.remote.transport.AssociationHandle$Disassociated] from
 Actor[akka://sparkMaste
 r/deadLetters] to
 Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.
 6.19%3A47252-18#1027788678] was not delivered. [3] dead letters
 encountered. This logging can be turned off or adjusted with confi
 guration settings 'akka.log-dead-letters' and
 'akka.log-dead-letters-during-shutdown'.
 14/05/20 01:17:38 INFO LocalActorRef: Message
 [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
 Actor[akka
 ://sparkMaster/deadLetters] to
 Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkM
 aster%4010.3.6.19%3A47252-18#1027788678] was not delivered. [4] dead
 letters encountered. This logging can be turned off or adjust
 ed with configuration settings 'akka.log-dead-letters' and
 'akka.log-dead-letters-during-shutdown'.




 On Tue, May 20, 2014 at 10:13 PM, Aaron Davidson ilike...@gmail.com wrote:

 Unfortunately, those errors are actually due to an Executor that
 exited, such that the connection between the Worker and Executor failed.
 This is not a fatal issue, unless there are analogous messages from the
 Worker to the Master (which should be present, if they exist, at around the
 same point in time).

 Do you happen to have the logs from the Master that indicate that the
 Worker terminated? Is it just an Akka disassociation, or some exception?


 On Tue, May 20, 2014 at 12:53 PM, Sean Owen so...@cloudera.com wrote:

 This isn't helpful of me to say, but, I see the same sorts of problem
 and messages semi-regularly on CDH5 + 0.9.0. I don't have any insight
 into when it happens, but usually after heavy use and after running
 for a long time. I had figured I'd see if the changes since 0.9.0
 addressed it and revisit later.

 On Tue, May 20, 2014 at 8:37 PM, Josh Marcus jmar...@meetup.com
 wrote:
  So, for 

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-21 Thread Michael Cutler
Hi Nick,

Which version of Hadoop are you using with Spark?  I spotted an issue with
the built-in GzipDecompressor while doing something similar with Hadoop
1.0.4; all my Gzip files were valid and tested, yet certain files blew up
in Hadoop/Spark.

The following JIRA ticket goes into more detail
https://issues.apache.org/jira/browse/HADOOP-8900 and it affects all Hadoop
releases prior to 1.2.X

MC




*Michael Cutler*
Founder, CTO


Mobile: +44 789 990 7847  Email: mich...@tumra.com  Web: tumra.com
Visit us at our offices in Chiswick Park: http://goo.gl/maps/abBxq
Registered in England & Wales, 07916412. VAT No. 130595328


This email and any files transmitted with it are confidential and may also
be privileged. It is intended only for the person to whom it is addressed.
If you have received this email in error, please inform the sender immediately.
If you are not the intended recipient you must not use, disclose, copy,
print, distribute or rely on this email.


On 21 May 2014 14:26, Madhu ma...@madhu.com wrote:

 Can you identify a specific file that fails?
 There might be a real bug here, but I have found gzip to be reliable.
 Every time I have run into a bad header error with gzip, I had a non-gzip
 file with the wrong extension for whatever reason.




 -
 Madhu
 https://www.linkedin.com/in/msiddalingaiah
 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/count-ing-gz-files-gives-java-io-IOException-incorrect-header-check-tp5768p6169.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: File list read into single RDD

2014-05-21 Thread Pat Ferrel
Thanks this really helps. 

As long as I stick to HDFS paths, and files I’m good. I do know that code a bit 
but have never used it to say take input from one cluster via 
“hdfs://server:port/path” and output to another via 
“hdfs://another-server:another-port/path”. This seems to be supported by Spark 
so I’ll have to go back and look at how to do this in the HDFS api.

Specifically I’ll need to examine the directory/file structure on one cluster 
then check some things on what is potentially another cluster before output. I 
have usually assumed only one HDFS instance so it may just be a matter of me 
being more careful and preserving full URIs. In the past I may have made 
assumptions that output is to the same dir tree as the input. Maybe it’s a 
matter of being more scrupulous about that assumption.

It’s a bit hard to test this case since I have never really had access to two 
clusters so I’ll have to develop some new habits at least.

On May 18, 2014, at 11:13 AM, Andrew Ash and...@andrewash.com wrote:

Spark's sc.textFile() method delegates to sc.hadoopFile(), which uses Hadoop's 
FileInputFormat.setInputPaths() call.  There is no alternate storage system, 
Spark just delegates to Hadoop for the .textFile() call.

Hadoop can also support multiple URI schemes, not just hdfs:/// paths, so you 
can use Spark on data in S3 using s3:/// just the same as you would with HDFS.  
See Apache's documentation on S3 for more details.

As far as interacting with a FileSystem (HDFS or other) to list files, delete 
files, navigate paths, etc. from your driver program, you should be able to 
just instantiate a FileSystem object and use the normal Hadoop APIs from there. 
 The Apache getting started docs on reading/writing from Hadoop DFS should work 
the same for non-HDFS examples too.
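
A minimal sketch of that approach (the URI, paths and the regex are illustrative):
list the files with the Hadoop FileSystem API, filter them, and hand the matches
to sc.textFile() as a comma-separated string:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new URI("hdfs://server:port/"), new Configuration())
val matching = fs.listStatus(new Path("/path/to/dir"))
  .map(_.getPath.toString)
  .filter(_.matches(""".*part-\d+.*"""))         // whatever regex you need
val rdd = sc.textFile(matching.mkString(","))    // textFile accepts a comma-separated list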

I do think we could use a little recipe in our documentation to make 
interacting with HDFS a bit more straightforward.

Pat, if you get something that covers your case that you don't mind sharing, we 
can format it for including in future Spark docs.

Cheers!
Andrew


On Sun, May 18, 2014 at 9:13 AM, Pat Ferrel pat.fer...@gmail.com wrote:
Doesn’t using an HDFS path pattern then restrict the URI to an HDFS URI. Since 
Spark supports several FS schemes I’m unclear about how much to assume about 
using the hadoop file systems APIs and conventions. Concretely if I pass a 
pattern in with a HTTPS file system, will the pattern work? 

How does Spark implement its storage system? This seems to be an abstraction 
level beyond what is available in HDFS. In order to preserve that flexibility 
what APIs should I be using? It would be easy to say, HDFS only and use HDFS 
APIs but that would seem to limit things. Especially where you would like to 
read from one cluster and write to another. This is not so easy to do inside 
the HDFS APIs, or is advanced beyond my knowledge.

If I can stick to passing URIs to sc.textFile() I’m ok but if I need to examine 
the structure of the file system, I’m unclear how I should do it without 
sacrificing Spark’s flexibility.
 
On Apr 29, 2014, at 12:55 AM, Christophe Préaud christophe.pre...@kelkoo.com 
wrote:

Hi,

You can also use any path pattern as defined here: 
http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29

e.g.:
sc.textFile('{/path/to/file1,/path/to/file2}')
Christophe.

On 29/04/2014 05:07, Nicholas Chammas wrote:
 Not that I know of. We were discussing it on another thread and it came up. 
 
 I think if you look up the Hadoop FileInputFormat API (which Spark uses) 
 you'll see it mentioned there in the docs. 
 
 http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/FileInputFormat.html
 
 But that's not obvious.
 
 Nick
 
 On Monday, April 28, 2014, Pat Ferrel pat.fer...@gmail.com wrote:
 Perfect. 
 
 BTW just so I know where to look next time, was that in some docs?
 
 On Apr 28, 2014, at 7:04 PM, Nicholas Chammas nicholas.cham...@gmail.com 
 wrote:
 
 Yep, as I just found out, you can also provide 
 sc.textFile() with a comma-delimited string of all the files you want to load.
 
 For example:
 
 sc.textFile('/path/to/file1,/path/to/file2')
 So once you have your list of files, concatenate their paths like that and 
 pass the single string to 
 textFile().
 
 Nick
 
 
 
 On Mon, Apr 28, 2014 at 7:23 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 sc.textFile(URI) supports reading multiple files in parallel but only with a 
 wildcard. I need to walk a dir tree, match a regex to create a list of files, 
 then I’d like to read them into a single RDD in parallel. I understand these 
 could go into separate RDDs then a union RDD can be created. Is there a way 
 to create a single RDD from a URI list?
 
 


Kelkoo SAS
Société par Actions Simplifiée (simplified joint-stock company)
Share capital: €4,168,964.30
Registered office: 8, rue du Sentier, 75002 Paris
425 093 069 RCS Paris

This message and its attachments are confidential and intended for

Is spark 1.0.0 spark-shell --master=yarn running in yarn-cluster mode or yarn-client mode?

2014-05-21 Thread Andrew Lee
Does anyone know if:
./bin/spark-shell --master yarn 
is running yarn-cluster or yarn-client by default?
Based on the source code in
./core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala:

if (args.deployMode == "cluster" && args.master.startsWith("yarn")) {
  args.master = "yarn-cluster"
}
if (args.deployMode != "cluster" && args.master.startsWith("yarn")) {
  args.master = "yarn-client"
}
It looks like the answer is yarn-cluster mode.
I want to confirm this with the community, thanks.  
  

Job Processing Large Data Set Got Stuck

2014-05-21 Thread yxzhao
I ran the PageRank example processing a large data set, 5GB in size, using 48
machines. The job got stuck at the time point 14/05/20 21:32:17, as the
attached log shows. It was stuck there for more than 10 hours and then I
finally killed it. But I did not find any information explaining why it was
stuck. Any suggestions? Thanks.


Spark_OK_48_pagerank.log
http://apache-spark-user-list.1001560.n3.nabble.com/file/n6185/Spark_OK_48_pagerank.log
  



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Job-Processing-Large-Data-Set-Got-Stuck-tp6185.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Job Processing Large Data Set Got Stuck

2014-05-21 Thread Xiangrui Meng
Many OutOfMemoryErrors in the log. Is your data distributed evenly? -Xiangrui

On Wed, May 21, 2014 at 11:23 AM, yxzhao yxz...@ualr.edu wrote:
 I run the pagerank example processing a large data set, 5GB in size, using 48
 machines. The job got stuck at the time point: 14/05/20 21:32:17, as the
 attached log shows. It was stuck there for more than 10 hours and then I
 killed it at last. But I did not find any information explaining why it was
 stuck. Any suggestions? Thanks.


 Spark_OK_48_pagerank.log
 http://apache-spark-user-list.1001560.n3.nabble.com/file/n6185/Spark_OK_48_pagerank.log



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Job-Processing-Large-Data-Set-Got-Stuck-tp6185.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: ClassNotFoundException with Spark/Mesos (spark-shell works fine)

2014-05-21 Thread Gerard Maas
Hi Tobias,

On Wed, May 21, 2014 at 5:45 PM, Tobias Pfeiffer t...@preferred.jp wrote:
first, thanks for your explanations regarding the jar files!
No prob :-)


 On Thu, May 22, 2014 at 12:32 AM, Gerard Maas gerard.m...@gmail.com
 wrote:
  I was discussing it with my fellow Sparkers here and I totally overlooked
  the fact that you need the class files to de-serialize the closures (or
  whatever) on the workers, so you always need the jar file delivered to
 the
  workers in order for it to work.

 So the closure as a function is serialized, sent across the wire,
 deserialized there, and *still* you need the class files? (I am not
 sure I understand what is actually sent over the network then. Does
 that serialization only contain the values that I close over?)


I also had that mental lapse. Serialization refers to converting object
(not class) state (current values) into a byte stream, and de-serialization
restores the bytes from the wire into a seemingly identical object at the
receiving side (except for transient variables). For that, it requires the
class definition of that object to know what it needs to instantiate, so
yes, the compiled classes need to be given to the Spark driver and it will
take care of dispatching them to the workers (much better than in the old
RMI days ;-)


 If I understand correctly what you are saying, then the documentation
 at https://people.apache.org/~pwendell/catalyst-docs/running-on-mesos.html
 (list item 8) needs to be extended quite a bit, right?


The mesos docs have been recently updated here:
https://github.com/apache/spark/pull/756/files
Don't know where the latest version from master is built/available.

-kr, Gerard.


Re: Job Processing Large Data Set Got Stuck

2014-05-21 Thread yxzhao
Thanks Xiangrui. How do I check and make sure the data is distributed
evenly? Thanks again.
On Wed, May 21, 2014 at 2:17 PM, Xiangrui Meng [via Apache Spark User
List] ml-node+s1001560n6187...@n3.nabble.com wrote:
 Many OutOfMemoryErrors in the log. Is your data distributed evenly?
 -Xiangrui

 On Wed, May 21, 2014 at 11:23 AM, yxzhao [hidden email] wrote:

 I run the pagerank example processing a large data set, 5GB in size, using
 48
 machines. The job got stuck at the time point: 14/05/20 21:32:17, as the
 attached log shows. It was stuck there for more than 10 hours and then I
 killed it at last. But I did not find any information explaining why it
 was
 stuck. Any suggestions? Thanks.


 Spark_OK_48_pagerank.log

 http://apache-spark-user-list.1001560.n3.nabble.com/file/n6185/Spark_OK_48_pagerank.log



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Job-Processing-Large-Data-Set-Got-Stuck-tp6185.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.


 




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Job-Processing-Large-Data-Set-Got-Stuck-tp6185p6189.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Job Processing Large Data Set Got Stuck

2014-05-21 Thread Xiangrui Meng
If the RDD is cached, you can check its storage information in the
Storage tab of the Web UI.
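
Another way to check, independent of caching (a small sketch; rdd stands in for
whatever RDD feeds the slow stage): count the records in each partition and look
for skew. Note that this runs a job over the data.

val sizes = rdd.mapPartitionsWithIndex { (i, iter) => Iterator((i, iter.size)) }.collect()
sizes.sortBy(-_._2).take(10).foreach { case (i, n) => println("partition " + i + ": " + n + " records") }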

On Wed, May 21, 2014 at 12:31 PM, yxzhao yxz...@ualr.edu wrote:
 Thanks Xiangrui, How to check and make sure the data is distributed
 evenly? Thanks again.
 On Wed, May 21, 2014 at 2:17 PM, Xiangrui Meng [via Apache Spark User
 List] [hidden email] wrote:

 Many OutOfMemoryErrors in the log. Is your data distributed evenly?
 -Xiangrui

 On Wed, May 21, 2014 at 11:23 AM, yxzhao [hidden email] wrote:

 I run the pagerank example processing a large data set, 5GB in size,
 using
 48
 machines. The job got stuck at the time point: 14/05/20 21:32:17, as the
 attached log shows. It was stuck there for more than 10 hours and then I
 killed it at last. But I did not find any information explaining why it
 was
 stuck. Any suggestions? Thanks.


 Spark_OK_48_pagerank.log


 http://apache-spark-user-list.1001560.n3.nabble.com/file/n6185/Spark_OK_48_pagerank.log



 --
 View this message in context:

 http://apache-spark-user-list.1001560.n3.nabble.com/Job-Processing-Large-Data-Set-Got-Stuck-tp6185.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.


 

 
 View this message in context: Re: Job Processing Large Data Set Got Stuck

 Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: ClassNotFoundException with Spark/Mesos (spark-shell works fine)

2014-05-21 Thread Andrew Ash
Here's the 1.0.0rc9 version of the docs:
https://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/running-on-mesos.html
I refreshed them with the goal of steering users more towards prebuilt
packages than relying on compiling from source plus improving overall
formatting and clarity, but not otherwise modifying the content. I don't
expect any changes for rc10.

It does seem like an issue though that classpath issues are preventing that
from running.  Just to check, have you given the exact same jar a shot when
running against a standalone cluster?  If it works in standalone, I think
that's good evidence that there's an issue with the Mesos classloaders in
master.

I'm running into a similar issue with classpaths failing on Mesos but
working in standalone, but I haven't coherently written up my observations
yet so haven't gotten that to this list.

I'd almost gotten to the point where I thought that my custom code needed
to be included in the SPARK_EXECUTOR_URI but that can't possibly be
correct.  The Spark workers that are launched on Mesos slaves should start
with the Spark core jars and then transparently get classes from custom
 code over the network, or at least that's how I thought it should work.
 For those who have been using Mesos in previous releases, you've never had
to do that before have you?




On Wed, May 21, 2014 at 3:30 PM, Gerard Maas gerard.m...@gmail.com wrote:

 Hi Tobias,

 On Wed, May 21, 2014 at 5:45 PM, Tobias Pfeiffer t...@preferred.jp wrote:
 first, thanks for your explanations regarding the jar files!
 No prob :-)


 On Thu, May 22, 2014 at 12:32 AM, Gerard Maas gerard.m...@gmail.com
 wrote:
  I was discussing it with my fellow Sparkers here and I totally
 overlooked
  the fact that you need the class files to de-serialize the closures (or
  whatever) on the workers, so you always need the jar file delivered to
 the
  workers in order for it to work.

 So the closure as a function is serialized, sent across the wire,
 deserialized there, and *still* you need the class files? (I am not
 sure I understand what is actually sent over the network then. Does
 that serialization only contain the values that I close over?)


 I also had that mental lapse. Serialization refers to converting object
 (not class) state (current values)  into a byte stream and de-serialization
 restores the bytes from the wire into an seemingly identical object at the
 receiving side (except for transient variables), for that, it requires the
 class definition of that object to know what it needs to instantiate, so
 yes, the compiled classes need to be given to the Spark driver and it will
 take care of dispatching them to the workers (much better than in the old
 RMI days ;-)


 If I understand correctly what you are saying, then the documentation
 at
 https://people.apache.org/~pwendell/catalyst-docs/running-on-mesos.html
 (list item 8) needs to be extended quite a bit, right?


 The mesos docs have been recently updated here:
 https://github.com/apache/spark/pull/756/files
 Don't know where the latest version from master is built/available.

 -kr, Gerard.



Re: unsubscribe

2014-05-21 Thread Shangyu Luo
Does any one know how to configure the digest mailing list? For example, I
want to receive daily digest, not every 10 messages.
Thanks!


On Mon, May 19, 2014 at 4:29 PM, Shangyu Luo lsy...@gmail.com wrote:

 Hi Andrew and Madhu,
 Thank you for your help here! Will unsubscribe through another address and
 may subscribe digest instead!

 Best,
 Shangyu


 On Sun, May 18, 2014 at 3:49 PM, Andrew Ash and...@andrewash.com wrote:

 Hi Shangyu (and everyone else looking to unsubscribe!),

 If you'd like to get off this mailing list, please send an email to user
 *-unsubscribe*@spark.apache.org, not the regular user@spark.apache.org
 list.

 How to use the Apache mailing list infrastructure is documented here:
 https://www.apache.org/foundation/mailinglists.html
 And the Spark User list specifically can be found here:
 http://mail-archives.apache.org/mod_mbox/spark-user/

 Thanks!
 Andrew


 On Sun, May 18, 2014 at 12:39 PM, Shangyu Luo lsy...@gmail.com wrote:

 Thanks!





 --
 --

 Shangyu, Luo
 Department of Computer Science
 Rice University

 --
 Not Just Think About It, But Do It!
 --
 Success is never final.
 --
 Losers always whine about their best




-- 
--

Shangyu, Luo
Department of Computer Science
Rice University

--
Not Just Think About It, But Do It!
--
Success is never final.
--
Losers always whine about their best


Re: ClassNotFoundException with Spark/Mesos (spark-shell works fine)

2014-05-21 Thread Gerard Maas
Hi Andrew,

Thanks for the current doc.

I'd almost gotten to the point where I thought that my custom code needed
 to be included in the SPARK_EXECUTOR_URI but that can't possibly be
 correct.  The Spark workers that are launched on Mesos slaves should start
 with the Spark core jars and then transparently get classes from custom
 code over the network, or at least that's who I thought it should work.
  For those who have been using Mesos in previous releases, you've never had
 to do that before have you?


Regarding the delivery of the custom job code to Mesos, we have been using
'ADD_JARS' (in the command line) or 'SparkConfig.setJars(Seq[String]) with
a fat jar packing all dependencies.
That works as well on the Spark 'standalone' cluster, but we deploy mostly
on Mesos, so I couldn't say about classloading difference between the two.

-greetz, Gerard.


Re: Using Spark to analyze complex JSON

2014-05-21 Thread Nicholas Chammas
Looking forward to that update!

Given a table of JSON objects like this one:

{
    "name": "Nick",
    "location": {
        "x": 241.6,
        "y": -22.5
    },
    "likes": ["ice cream", "dogs", "Vanilla Ice"]
}

It would be SUPER COOL if we could query that table in a way that is as
natural as follows:

SELECT DISTINCT name
FROM json_table;

SELECT MAX(location.x)
FROM json_table;

SELECT likes[2] -- Ice Ice Baby
FROM json_table
WHERE name = 'Nick';

Of course, this is just a hand-wavy suggestion of how I’d like to be able
to query JSON (particularly that last example) using SQL. I’m interested in
seeing what y’all come up with.

A large part of what my team does is make it easy for analysts to explore
and query JSON data using SQL. We have a fairly complex home-grown process
to do that and are looking to replace it with something more out of the
box. So if you’d like more input on how users might use this feature, I’d
be glad to chime in.

Nick
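
For reference, a hedged illustration of the Hive-UDF route Michael mentions below:
get_json_object is a standard Hive UDF, and the queries above could be approximated
against a table whose raw JSON sits in a string column (here called json; the table
layout is an assumption):

SELECT DISTINCT get_json_object(json, '$.name') FROM json_table;
SELECT MAX(CAST(get_json_object(json, '$.location.x') AS DOUBLE)) FROM json_table;
SELECT get_json_object(json, '$.likes[2]') FROM json_table
WHERE get_json_object(json, '$.name') = 'Nick';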
​


On Wed, May 21, 2014 at 11:21 AM, Michael Armbrust
mich...@databricks.com wrote:

 You can already extract fields from json data using Hive UDFs.  We have an
 intern working on better native support this summer.  We will be sure to
 post updates once there is a working prototype.

 Michael


 On Tue, May 20, 2014 at 6:46 PM, Nick Chammas 
 nicholas.cham...@gmail.com wrote:

 The Apache Drill http://incubator.apache.org/drill/ home page has an
 interesting heading: Liberate Nested Data.

 Is there any current or planned functionality in Spark SQL or Shark to
 enable SQL-like querying of complex JSON?

 Nick


 --
 View this message in context: Using Spark to analyze complex JSON
 http://apache-spark-user-list.1001560.n3.nabble.com/Using-Spark-to-analyze-complex-JSON-tp6146.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





tests that run locally fail when run through bamboo

2014-05-21 Thread Adrian Mocanu
I have a few test cases for Spark which extend TestSuiteBase from
org.apache.spark.streaming.
The tests run fine on my machine, but when I commit to the repo and the tests
run automatically in Bamboo, the test cases fail with these errors.

How do I fix this?


21-May-2014 16:33:09

[info] StreamingZigZagSpec:

21-May-2014 16:33:09

[info] - compute zigzag indicator in stream *** FAILED ***

21-May-2014 16:33:09

[info]   org.apache.spark.SparkException: Job aborted: Task 1.0:1 failed 1 
times (most recent failure: Exception failure: 
java.io.StreamCorruptedException: invalid type code: AC)

21-May-2014 16:33:09

[info]   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)

21-May-2014 16:33:09

[info]   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)

21-May-2014 16:33:09

[info]   at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

21-May-2014 16:33:09

[info]   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

21-May-2014 16:33:09

[info]   at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)

21-May-2014 16:33:09

[info]   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)

21-May-2014 16:33:09

[info]   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)

21-May-2014 16:33:09

[info]   at scala.Option.foreach(Option.scala:236)

21-May-2014 16:33:09

[info]   at 
org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)

21-May-2014 16:33:09

[info]   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)

21-May-2014 16:33:09

[info]   ...

21-May-2014 16:33:09

[info] - compute zigzag indicator in stream with intermittent empty RDDs *** 
FAILED ***

21-May-2014 16:33:09

[info]   Operation timed out after 10042 ms (TestSuiteBase.scala:283)

21-May-2014 16:33:09

[info] - compute zigzag indicator in stream with 3 empty RDDs *** FAILED ***

21-May-2014 16:33:09

[info]   org.apache.spark.SparkException: Job aborted: Task 1.0:1 failed 1 
times (most recent failure: Exception failure: java.io.FileNotFoundException: 
/tmp/spark-local-20140521163241-1707/0f/shuffle_1_1_1 (No such file or 
directory))

21-May-2014 16:33:09

[info]   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)

21-May-2014 16:33:09

[info]   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)

21-May-2014 16:33:09

[info]   at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

21-May-2014 16:33:09

[info]   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

21-May-2014 16:33:09

[info]   at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)

21-May-2014 16:33:09

[info]   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)

21-May-2014 16:33:09

[info]   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)

21-May-2014 16:33:09

[info]   at scala.Option.foreach(Option.scala:236)

21-May-2014 16:33:09

[info]   at 
org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)

21-May-2014 16:33:09

[info]   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)

21-May-2014 16:33:09

[info]   ...

21-May-2014 16:33:09

[info] - compute zigzag indicator in stream w notification for each change  *** 
FAILED ***

21-May-2014 16:33:09

[info]   org.apache.spark.SparkException: Job aborted: Task 141.0:0 failed 1 
times (most recent failure: Exception failure: java.io.FileNotFoundException: 
http://10.10.1.9:62793/broadcast_1)

21-May-2014 16:33:09

[info]   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)

21-May-2014 16:33:09

[info]   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)

21-May-2014 16:33:09

[info]   at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

21-May-2014 16:33:09

[info]   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

21-May-2014 16:33:09

[info]   at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)

21-May-2014 16:33:09

[info]   at 

Inconsistent RDD Sample size

2014-05-21 Thread glxc
I have a graph and am trying to take a random sample of vertices without
replacement, using the RDD.sample() method

verts are the vertices in the graph

  val verts = graph.vertices

and executing this multiple times in a row 

  verts.sample(false, 1.toDouble/v1.count.toDouble,
 System.currentTimeMillis).count

yields different results roughly each time (albeit +/- a small % of the
target)

Why does this happen? I looked at PartitionwiseSampledRDD but can't figure it
out.

Also, is there another method/technique to yield the same result each time? 
My understanding is that grabbing random indices may not be the best use of
the RDD model



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Inconsistent-RDD-Sample-size-tp6197.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


RE: tests that run locally fail when run through bamboo

2014-05-21 Thread Adrian Mocanu
Just found this at the top of the log:

17:14:41.124 [pool-7-thread-3-ScalaTest-running-StreamingSpikeSpec] WARN  
o.e.j.u.component.AbstractLifeCycle - FAILED 
SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already in 
use
build   21-May-2014 17:14:41   java.net.BindException: Address already in use

Is there a way to set these connections up so that they don't all start on the 
same port? (That's my guess for the root cause of the issue.)

From: Adrian Mocanu [mailto:amoc...@verticalscope.com]
Sent: May-21-14 4:58 PM
To: u...@spark.incubator.apache.org; user@spark.apache.org
Subject: tests that run locally fail when run through bamboo

I have a few test cases for Spark which extend TestSuiteBase from 
org.apache.spark.streaming.
The tests run fine on my machine, but when I commit to the repo and run the tests 
automatically with Bamboo, the test cases fail with these errors.

How to fix?


21-May-2014 16:33:09

[info] StreamingZigZagSpec:

21-May-2014 16:33:09

[info] - compute zigzag indicator in stream *** FAILED ***

21-May-2014 16:33:09

[info]   org.apache.spark.SparkException: Job aborted: Task 1.0:1 failed 1 
times (most recent failure: Exception failure: 
java.io.StreamCorruptedException: invalid type code: AC)

21-May-2014 16:33:09

[info]   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)

21-May-2014 16:33:09

[info]   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)

21-May-2014 16:33:09

[info]   at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

21-May-2014 16:33:09

[info]   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

21-May-2014 16:33:09

[info]   at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)

21-May-2014 16:33:09

[info]   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)

21-May-2014 16:33:09

[info]   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)

21-May-2014 16:33:09

[info]   at scala.Option.foreach(Option.scala:236)

21-May-2014 16:33:09

[info]   at 
org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)

21-May-2014 16:33:09

[info]   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)

21-May-2014 16:33:09

[info]   ...

21-May-2014 16:33:09

[info] - compute zigzag indicator in stream with intermittent empty RDDs *** 
FAILED ***

21-May-2014 16:33:09

[info]   Operation timed out after 10042 ms (TestSuiteBase.scala:283)

21-May-2014 16:33:09

[info] - compute zigzag indicator in stream with 3 empty RDDs *** FAILED ***

21-May-2014 16:33:09

[info]   org.apache.spark.SparkException: Job aborted: Task 1.0:1 failed 1 
times (most recent failure: Exception failure: java.io.FileNotFoundException: 
/tmp/spark-local-20140521163241-1707/0f/shuffle_1_1_1 (No such file or 
directory))

21-May-2014 16:33:09

[info]   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)

21-May-2014 16:33:09

[info]   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)

21-May-2014 16:33:09

[info]   at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

21-May-2014 16:33:09

[info]   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

21-May-2014 16:33:09

[info]   at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)

21-May-2014 16:33:09

[info]   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)

21-May-2014 16:33:09

[info]   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)

21-May-2014 16:33:09

[info]   at scala.Option.foreach(Option.scala:236)

21-May-2014 16:33:09

[info]   at 
org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)

21-May-2014 16:33:09

[info]   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)

21-May-2014 16:33:09

[info]   ...

21-May-2014 16:33:09

[info] - compute zigzag indicator in stream w notification for each change  *** 
FAILED ***

21-May-2014 16:33:09

[info]   org.apache.spark.SparkException: Job aborted: Task 141.0:0 failed 1 
times (most recent failure: Exception failure: java.io.FileNotFoundException: 
http://10.10.1.9:62793/broadcast_1)

21-May-2014 16:33:09

[info]   at 

Re: Inconsistent RDD Sample size

2014-05-21 Thread Xiangrui Meng
It doesn't guarantee the exact sample size. If you fix the random
seed, it would return the same result every time. -Xiangrui
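A minimal sketch of that point (the fraction and the seed value 42 are just
placeholders): with a fixed seed the sample is reproducible, although its size
is still only approximately fraction * count.

  val fraction = 1.0 / verts.count.toDouble
  val s1 = verts.sample(false, fraction, 42).count   // fixed seed
  val s2 = verts.sample(false, fraction, 42).count   // same seed, same sample
  // s1 == s2; seeding with System.currentTimeMillis gives a new sample each run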

On Wed, May 21, 2014 at 2:05 PM, glxc r.ryan.mcc...@gmail.com wrote:
 I have a graph and am trying to take a random sample of vertices without
 replacement, using the RDD.sample() method

 verts are the vertices in the graph

  val verts = graph.vertices

 and executing this multiple times in a row

  verts.sample(false, 1.toDouble/v1.count.toDouble,
 System.currentTimeMillis).count

 yields different results roughly each time (albeit +/- a small % of the
 target)

 why does this happen? Looked at PartionwiseSampledRDD but can't figure it
 out

 Also, is there another method/technique to yield the same result each time?
 My understanding is that grabbing random indices may not be the best use of
 the RDD model



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Inconsistent-RDD-Sample-size-tp6197.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Problem with loading files: Loss was due to java.io.EOFException java.io.EOFException

2014-05-21 Thread hakanilter
The problem was solved after the hadoop-core dependency was added. But I think
there is a misunderstanding about local files. I found this:

Note that if you've connected to a Spark master, it's possible that it will
attempt to load the file on one of the different machines in the cluster, so
make sure it's available on all the cluster machines. In general, in future
you will want to put your data in HDFS, S3, or similar file systems to avoid
this problem.

http://docs.sigmoidanalytics.com/index.php/Using_the_Spark_Shell

This means that you can't use local files with Spark. I don't understand
why, because after calling addFile() or textFile(), the file can be
downloaded by every node on the cluster and become accessible.
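For reference, a small sketch of the addFile pattern mentioned above (the path
and file name are made up): the file is shipped to every node, and
SparkFiles.get resolves the per-node local copy.

  import org.apache.spark.SparkFiles

  sc.addFile("/home/user/data/lookup.csv")       // distributed to all nodes
  val localPath = SparkFiles.get("lookup.csv")   // path of the local copy on each node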

Anyway, if you get "Loss was due to java.io.EOFException", you have to make
sure that the Hadoop libs are available.

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>0.9.1</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>2.0.0-mr1-cdh4.6.0</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>2.0.0-cdh4.6.0</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.0.0-cdh4.6.0</version>
</dependency>

Cheers!




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Problem-with-loading-files-Loss-was-due-to-java-io-EOFException-java-io-EOFException-tp6090p6201.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


RE: Is spark 1.0.0 spark-shell --master=yarn running in yarn-cluster mode or yarn-client mode?

2014-05-21 Thread Andrew Lee
Ah, forgot the --verbose option. Thanks Andrew. That is very helpful. 

Date: Wed, 21 May 2014 11:07:55 -0700
Subject: Re: Is spark 1.0.0 spark-shell --master=yarn running in yarn-cluster 
mode or yarn-client mode?
From: and...@databricks.com
To: user@spark.apache.org

The answer is actually yarn-client. A quick way to find out:
$ bin/spark-shell --master yarn --verbose
From the system properties you can see spark.master is set to "yarn-client". 
From the code, this is because args.deployMode is null, so it's not equal 
to "cluster" and it falls into the second if case you mentioned:

if (args.deployMode != "cluster" && args.master.startsWith("yarn")) {
  args.master = "yarn-client"
}

2014-05-21 10:57 GMT-07:00 Andrew Lee alee...@hotmail.com:




Does anyone know if:
./bin/spark-shell --master yarn 
is running yarn-cluster or yarn-client by default?

Based on the source code:

./core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala

if (args.deployMode == "cluster" && args.master.startsWith("yarn")) {
  args.master = "yarn-cluster"
}
if (args.deployMode != "cluster" && args.master.startsWith("yarn")) {
  args.master = "yarn-client"
}

It looks like the answer is yarn-cluster mode.
I want to confirm this with the community, thanks.  
  


  

Re: tests that run locally fail when run through bamboo

2014-05-21 Thread Tathagata Das
This does happen sometimes, but it is only a warning because Spark is designed
to try successive ports until it succeeds. So unless a crazy number of
successive ports are blocked (runaway processes?? insufficient clearing of
ports by the OS??), these errors should not be a problem for the tests passing.
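If you still want each suite to bind its UI to a known port instead of relying
on the retry behaviour described above, a rough sketch (the port value and the
way your suite constructs its context are assumptions; spark.ui.port is the
standard setting):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setMaster("local[2]")
    .setAppName("StreamingZigZagSpec")
    .set("spark.ui.port", "4041")   // hypothetical: give each build agent its own port
  val sc = new SparkContext(conf)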


On Wed, May 21, 2014 at 2:31 PM, Adrian Mocanu amoc...@verticalscope.comwrote:

  Just found this at the top of the log:



 17:14:41.124 [pool-7-thread-3-ScalaTest-running-StreamingSpikeSpec] WARN
 o.e.j.u.component.AbstractLifeCycle - FAILED
 SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address
 already in use

 build   21-May-2014 17:14:41   java.net.BindException: Address already in
 use



 Is there a way to set these connection up so that they don’t all start on
 the same port (that’s my guess for the root cause of the issue)



 *From:* Adrian Mocanu [mailto:amoc...@verticalscope.com]
 *Sent:* May-21-14 4:58 PM
 *To:* u...@spark.incubator.apache.org; user@spark.apache.org
 *Subject:* tests that run locally fail when run through bamboo



 I have a few test cases for Spark which extend TestSuiteBase from
 org.apache.spark.streaming.

 The tests run fine on my machine but when I commit to repo and run the
 tests automatically with bamboo the test cases fail with these errors.



 How to fix?





 21-May-2014 16:33:09

 [info] StreamingZigZagSpec:

 21-May-2014 16:33:09

 [info] - compute zigzag indicator in stream *** FAILED ***

 21-May-2014 16:33:09

 [info]   org.apache.spark.SparkException: Job aborted: Task 1.0:1 failed 1
 times (most recent failure: Exception failure:
 java.io.StreamCorruptedException: invalid type code: AC)

 21-May-2014 16:33:09

 [info]   at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)

 21-May-2014 16:33:09

 [info]   at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)

 21-May-2014 16:33:09

 [info]   at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

 21-May-2014 16:33:09

 [info]   at
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

 21-May-2014 16:33:09

 [info]   at org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)

 21-May-2014 16:33:09

 [info]   at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)

 21-May-2014 16:33:09

 [info]   at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)

 21-May-2014 16:33:09

 [info]   at scala.Option.foreach(Option.scala:236)

 21-May-2014 16:33:09

 [info]   at
 org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)

 21-May-2014 16:33:09

 [info]   at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)

 21-May-2014 16:33:09

 [info]   ...

 21-May-2014 16:33:09

 [info] - compute zigzag indicator in stream with intermittent empty RDDs
 *** FAILED ***

 21-May-2014 16:33:09

 [info]   Operation timed out after 10042 ms (TestSuiteBase.scala:283)

 21-May-2014 16:33:09

 [info] - compute zigzag indicator in stream with 3 empty RDDs *** FAILED
 ***

 21-May-2014 16:33:09

 [info]   org.apache.spark.SparkException: Job aborted: Task 1.0:1 failed 1
 times (most recent failure: Exception failure:
 java.io.FileNotFoundException:
 /tmp/spark-local-20140521163241-1707/0f/shuffle_1_1_1 (No such file or
 directory))

 21-May-2014 16:33:09

 [info]   at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)

 21-May-2014 16:33:09

 [info]   at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)

 21-May-2014 16:33:09

 [info]   at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

 21-May-2014 16:33:09

 [info]   at
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

 21-May-2014 16:33:09

 [info]   at org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)

 21-May-2014 16:33:09

 [info]   at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)

 21-May-2014 16:33:09

 [info]   at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)

 21-May-2014 16:33:09

 [info]   at scala.Option.foreach(Option.scala:236)

 21-May-2014 16:33:09

 [info]   at
 org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)

 21-May-2014 16:33:09

 [info]   at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)

 21-May-2014 16:33:09

 [info]   ...

 

Failed RC-10 yarn-cluster job for FS closed error when cleaning up staging directory

2014-05-21 Thread Kevin Markey

  
  
I tested an application on RC-10 and Hadoop 2.3.0 in yarn-cluster
mode that had run successfully with Spark-0.9.1 and Hadoop 2.3 or
2.2. The application successfully ran to conclusion but it
ultimately failed. 

There were 2 anomalies...

1. ASM reported only that the application was "ACCEPTED". It never
indicated that the application was "RUNNING."
14/05/21 16:06:12 INFO yarn.Client: Application
report from ASM:
   application identifier: application_1400696988985_0007
   appId: 7
   clientToAMToken: null
   appDiagnostics:
   appMasterHost: N/A
   appQueue: default
   appMasterRpcPort: -1
   appStartTime: 1400709970857
   yarnAppState: ACCEPTED
   distributedFinalState: UNDEFINED
   appTrackingUrl:
http://Sleepycat:8088/proxy/application_1400696988985_0007/
   appUser: hduser

Furthermore, it started a second container, running two
partly overlapping drivers, when it appeared that the
application never started. Each container ran to conclusion as
explained above, taking twice as long as usual for both to
complete. Both instances had the same concluding failure.

2. Each instance failed as indicated by the stderr log, finding that
the filesystem was closed when trying to clean up the
staging directories. 

14/05/21 16:08:24 INFO Executor: Serialized size of result for
  1453 is 863
14/05/21 16:08:24 INFO Executor: Sending result for 1453
  directly to driver
14/05/21 16:08:24 INFO Executor: Finished task ID 1453
14/05/21 16:08:24 INFO TaskSetManager: Finished TID 1453 in
  202 ms on localhost (progress: 2/2)
14/05/21 16:08:24 INFO DAGScheduler: Completed
  ResultTask(1507, 1)
14/05/21 16:08:24 INFO TaskSchedulerImpl: Removed TaskSet
  1507.0, whose tasks have all completed, from pool
14/05/21 16:08:24 INFO DAGScheduler: Stage 1507 (count at
  KEval.scala:32) finished in 0.417 s
14/05/21 16:08:24 INFO SparkContext: Job finished: count at
  KEval.scala:32, took 1.532789283 s
14/05/21 16:08:24 INFO SparkUI: Stopped Spark web UI at
http://dhcp-brm-bl1-215-1e-east-10-135-123-92.usdhcp.oraclecorp.com:42250
14/05/21 16:08:24 INFO DAGScheduler: Stopping DAGScheduler
14/05/21 16:08:25 INFO MapOutputTrackerMasterActor:
  MapOutputTrackerActor stopped!
14/05/21 16:08:25 INFO ConnectionManager: Selector thread
  was interrupted!
14/05/21 16:08:25 INFO ConnectionManager: ConnectionManager
  stopped
14/05/21 16:08:25 INFO MemoryStore: MemoryStore cleared
14/05/21 16:08:25 INFO BlockManager: BlockManager stopped
14/05/21 16:08:25 INFO BlockManagerMasterActor: Stopping
  BlockManagerMaster
14/05/21 16:08:25 INFO BlockManagerMaster:
  BlockManagerMaster stopped
14/05/21 16:08:25 INFO SparkContext: Successfully stopped
  SparkContext
14/05/21 16:08:25 INFO
  RemoteActorRefProvider$RemotingTerminator: Shutting down remote
  daemon.
14/05/21 16:08:25 INFO ApplicationMaster: finishApplicationMaster
with SUCCEEDED
14/05/21 16:08:25 INFO ApplicationMaster: AppMaster
  received a signal.
14/05/21 16:08:25 INFO ApplicationMaster: Deleting staging
  directory .sparkStaging/application_1400696988985_0007
14/05/21 16:08:25 INFO
  RemoteActorRefProvider$RemotingTerminator: Remote daemon shut
  down; proceeding with flushing remote transports.
14/05/21 16:08:25 ERROR ApplicationMaster: Failed to
cleanup staging dir .sparkStaging/application_1400696988985_0007
java.io.IOException: Filesystem closed
 at
  org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:689)
 at
  org.apache.hadoop.hdfs.DFSClient.delete(DFSClient.java:1685)
 at
org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:591)
 at
org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:587)
 at
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
 at
org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:587)
 at
org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$cleanupStagingDir(ApplicationMaster.scala:371)
 at
org.apache.spark.deploy.yarn.ApplicationMaster$AppMasterShutdownHook.run(ApplicationMaster.scala:386)
 at
org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)

There is nothing about the staging directory themselves that looks
suspicious... 

drwx-- - hduser supergroup 0 2014-05-21 16:06
  /user/hduser/.sparkStaging/application_1400696988985_0007
-rw-r--r-- 3 hduser supergroup 92881278 2014-05-21
  16:06
  /user/hduser/.sparkStaging/application_1400696988985_0007/app.jar
-rw-r--r-- 3 hduser supergroup 118900783 2014-05-21
  16:06

Re: I want to filter a stream by a subclass.

2014-05-21 Thread Tathagata Das
You could do

records.filter { _.isInstanceOf[Orange] } .map { _.asInstanceOf[Orange] }



On Wed, May 21, 2014 at 3:28 PM, Ian Holsman i...@holsman.com.au wrote:

 Hi.
 Firstly I'm a newb (to both Scala & Spark).

 I have a stream, that contains multiple types of records, and I would like
 to create multiple streams based on that

 currently I have it set up as

 class ALL
 class Orange extends ALL
 class Apple extends ALL

 now I can easily add a filter ala
 val records:DStream[ALL] = ...mapper to build the classes off the wire...
 val orangeRecords = records.filter {_.isInstanceOf[Orange]}

 but I would like to have the line be a DStream[Orange] instead of a
 DStream[ALL]
 (So I can access the unique fields in the subclass).

 I'm using 0.9.1 if it matters.

 TIA
 Ian

 --
 Ian Holsman
 i...@holsman.com.au
 PH: + 61-3-9028 8133 / +1-(425) 998-7083



Run Apache Spark on Mini Cluster

2014-05-21 Thread Upender Nimbekar
Hi,
I would like to set up the Apache platform on a mini cluster. Is there any
recommendation for the hardware that I can buy to set it up? I am thinking
about processing a significant amount of data, in the range of a few
terabytes.

Thanks
Upender


Re: Run Apache Spark on Mini Cluster

2014-05-21 Thread Soumya Simanta
Suggestion - try to get an idea of your hardware requirements by running a
sample on Amazon's EC2 or Google compute engine. It's relatively easy (and
cheap) to get started on the cloud before you invest in your own hardware
IMO.




On Wed, May 21, 2014 at 8:14 PM, Upender Nimbekar upent...@gmail.comwrote:

 Hi,
 I would like to setup apache platform on a mini cluster. Is there any
 recommendation for the hardware that I can buy to set it up. I am thinking
 about processing significant amount of data like in the range of few
 terabytes.

 Thanks
 Upender



Re: A new resource for getting examples of Spark RDD API calls

2014-05-21 Thread zhen
Great, thanks for that tip. I will update the documents!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/A-new-resource-for-getting-examples-of-Spark-RDD-API-calls-tp5529p6210.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: I want to filter a stream by a subclass.

2014-05-21 Thread Tobias Pfeiffer
On Thu, May 22, 2014 at 8:07 AM, Tathagata Das
tathagata.das1...@gmail.com wrote:
 records.filter { _.isInstanceOf[Orange] } .map { _.asInstanceOf[Orange] }

I think a Scala-ish way would be

records.flatMap(_ match {
  case i: Int =>
    Some(i)
  case _ =>
    None
})
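Applied to the Orange example from this thread, the same pattern yields a
correctly typed stream (a sketch, assuming the classes defined earlier):

  val orangeRecords: DStream[Orange] = records.flatMap {
    case o: Orange => Some(o)
    case _         => None
  }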


Re: I want to filter a stream by a subclass.

2014-05-21 Thread Ian Holsman
Thanks Tobias & Tathagata.
these are great.


On Wed, May 21, 2014 at 8:02 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 On Thu, May 22, 2014 at 8:07 AM, Tathagata Das
 tathagata.das1...@gmail.com wrote:
  records.filter { _.isInstanceOf[Orange] } .map { _.asInstanceOf[Orange] }

 I think a Scala-ish way would be

 records.flatMap(_ match {
   case i: Int =>
     Some(i)
   case _ =>
     None
 })




-- 
Ian Holsman
i...@holsman.com.au
PH: + 61-3-9028 8133 / +1-(425) 998-7083


yarn-client mode question

2014-05-21 Thread Sophia
In yarn-client mode, will Spark be deployed on the YARN nodes? If it is
deployed only on the client, can Spark still submit the job to YARN?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/yarn-client-mode-question-tp6213.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: yarn-client mode question

2014-05-21 Thread Andrew Or
Hi Sophia,

In yarn-client mode, the node that submits the application can either be
inside or outside of the cluster. This node also hosts the driver
(SparkContext) of the application. All the executors, however, will be
launched on nodes inside the YARN cluster.

Andrew


2014-05-21 18:17 GMT-07:00 Sophia sln-1...@163.com:

 As the yarn-client mode,will spark be deployed in the node of yarn? If it
 is
 deployed only in the client,can spark submit the job to yarn?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/yarn-client-mode-question-tp6213.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: yarn-client mode question

2014-05-21 Thread Sophia
But I don't understand this point: is it necessary to deploy Spark slave nodes
on the YARN nodes? 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/yarn-client-mode-question-tp6213p6216.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: RDD union of a window in Dstream

2014-05-21 Thread Tobias Pfeiffer
Hi,

On Wed, May 21, 2014 at 9:42 PM, Laeeq Ahmed laeeqsp...@yahoo.com wrote:
 I want to do union of all RDDs in each window of DStream.

A window *is* a union of all RDDs in the respective time interval.

The documentation says a DStream is represented as a sequence of
RDDs. However, data from a certain time interval will always be
contained in *one* RDD, not a sequence of RDDs (AFAIK).

Regards
Tobias
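A short sketch of that point (window length and slide interval are arbitrary
here, and dstream is whatever DStream you start from): each RDD handed to
foreachRDD already holds every record of the window, so no extra union is
needed.

  import org.apache.spark.streaming.Seconds

  val windowed = dstream.window(Seconds(30), Seconds(10))
  windowed.foreachRDD { rdd =>
    println("records in this window: " + rdd.count())
  }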


Re: Using Spark to analyze complex JSON

2014-05-21 Thread Tobias Pfeiffer
Hi,

as far as I understand, if you create an RDD with a relational
structure from your JSON, you should be able to do much of that
already today. For example, take lift-json's deserializer and do
something like

  val json_table: RDD[MyCaseClass] = json_data.flatMap(json =>
    json.extractOpt[MyCaseClass])

then I guess you can use Spark SQL on that. (Something like your
likes[2] query won't work, though, I guess.)

Regards
Tobias
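A slightly more concrete sketch of that idea (the case class, its fields, and
the use of lift-json's parse/extractOpt are assumptions, not code from the
thread; json_data is an RDD[String]):

  import net.liftweb.json._

  case class MyCaseClass(name: String, likes: List[String])

  implicit val formats = DefaultFormats
  val json_table: RDD[MyCaseClass] =
    json_data.flatMap(line => parse(line).extractOpt[MyCaseClass])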


On Thu, May 22, 2014 at 5:32 AM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
 Looking forward to that update!

 Given a table of JSON objects like this one:

 {
   "name": "Nick",
   "location": {
     "x": 241.6,
     "y": -22.5
   },
   "likes": ["ice cream", "dogs", "Vanilla Ice"]
 }

 It would be SUPER COOL if we could query that table in a way that is as
 natural as follows:

 SELECT DISTINCT name
 FROM json_table;

 SELECT MAX(location.x)
 FROM json_table;

 SELECT likes[2] -- "Ice Ice Baby"
 FROM json_table
 WHERE name = "Nick";

 Of course, this is just a hand-wavy suggestion of how I’d like to be able to
 query JSON (particularly that last example) using SQL. I’m interested in
 seeing what y’all come up with.

 A large part of what my team does is make it easy for analysts to explore
 and query JSON data using SQL. We have a fairly complex home-grown process
 to do that and are looking to replace it with something more out of the box.
 So if you’d like more input on how users might use this feature, I’d be glad
 to chime in.

 Nick



 On Wed, May 21, 2014 at 11:21 AM, Michael Armbrust mich...@databricks.com
 wrote:

 You can already extract fields from json data using Hive UDFs.  We have an
 intern working on on better native support this summer.  We will be sure to
 post updates once there is a working prototype.

 Michael


 On Tue, May 20, 2014 at 6:46 PM, Nick Chammas nicholas.cham...@gmail.com
 wrote:

 The Apache Drill home page has an interesting heading: Liberate Nested
 Data.

 Is there any current or planned functionality in Spark SQL or Shark to
 enable SQL-like querying of complex JSON?

 Nick


 
 View this message in context: Using Spark to analyze complex JSON
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





RE: yarn-client mode question

2014-05-21 Thread Liu, Raymond
It seems you are asking whether the Spark-related jars need to be deployed to the 
YARN cluster manually before you launch the application?
Then no, you don't, just like any other YARN application. And it doesn't matter 
whether it is yarn-client or yarn-cluster mode.


Best Regards,
Raymond Liu

-Original Message-
From: Sophia [mailto:sln-1...@163.com] 
Sent: Thursday, May 22, 2014 10:55 AM
To: u...@spark.incubator.apache.org
Subject: Re: yarn-client mode question

But,I don't understand this point,is it necessary to deploy slave node of spark 
in the yarn node? 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/yarn-client-mode-question-tp6213p6216.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Failed RC-10 yarn-cluster job for FS closed error when cleaning up staging directory

2014-05-21 Thread Tom Graves
It sounds like something is closing the hdfs filesystem before everyone is 
really done with it. The filesystem gets cached and is shared so if someone 
closes it while other threads are still using it you run into this error.   Is 
your application closing the filesystem?     Are you using the event logging 
feature?   Could you share the options you are running with?

Yarn will retry the application depending on how the Application Master attempt 
fails (this is a configurable setting as to how many times it retries).  That 
is probably the second driver you are referring to.  But they shouldn't have 
overlapped as far as both being up at the same time. Is that the case you are 
seeing?  Generally you want to look at why the first application attempt fails.

Tom
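A small sketch of the failure mode described above (not code from the
application in question): FileSystem.get() returns a cached instance shared
across the JVM, so closing it also closes it for Spark's shutdown hook, while
FileSystem.newInstance() returns a private instance that is safe to close.

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.FileSystem

  val shared = FileSystem.get(new Configuration())
  // shared.close()              // avoid: closes the cached instance for everyone

  val mine = FileSystem.newInstance(new Configuration())
  // ... use mine ...
  mine.close()                   // fine: private, uncached instance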




On Wednesday, May 21, 2014 6:10 PM, Kevin Markey kevin.mar...@oracle.com 
wrote:
 


I tested an application on RC-10 and Hadoop 2.3.0 in yarn-cluster mode that had 
run successfully with Spark-0.9.1 and Hadoop 2.3 or 2.2.  The application 
successfully ran to conclusion but it ultimately failed.  

There were 2 anomalies...

1. ASM reported only that the application was ACCEPTED.  It never
indicated that the application was RUNNING.

14/05/21 16:06:12 INFO yarn.Client: Application report from ASM:
 application identifier: application_1400696988985_0007
 appId: 7
 clientToAMToken: null
 appDiagnostics:
 appMasterHost: N/A
 appQueue: default
 appMasterRpcPort: -1
 appStartTime: 1400709970857
 yarnAppState: ACCEPTED
 distributedFinalState: UNDEFINED
 appTrackingUrl: 
http://Sleepycat:8088/proxy/application_1400696988985_0007/
 appUser: hduser

Furthermore, it started a second container, running two partly overlapping 
drivers, when it appeared that the application never started.  Each container 
ran to conclusion as explained above, taking twice as long as usual for both to 
complete.  Both instances had the same concluding failure.

2. Each instance failed as indicated by the stderr log, finding that
the filesystem was closed when trying to clean up the staging directories.  

14/05/21 16:08:24 INFO Executor: Serialized size of result for 1453 is 863
14/05/21 16:08:24 INFO Executor: Sending result for 1453 directly to driver
14/05/21 16:08:24 INFO Executor: Finished task ID 1453
14/05/21 16:08:24 INFO TaskSetManager: Finished TID 1453 in 202 ms on localhost 
(progress: 2/2)
14/05/21 16:08:24 INFO DAGScheduler: Completed ResultTask(1507, 1)
14/05/21 16:08:24 INFO TaskSchedulerImpl: Removed TaskSet 1507.0, whose tasks 
have all completed, from pool
14/05/21 16:08:24 INFO DAGScheduler: Stage 1507 (count at KEval.scala:32) 
finished in 0.417 s
14/05/21 16:08:24 INFO SparkContext: Job finished: count at KEval.scala:32, 
took 1.532789283 s
14/05/21 16:08:24 INFO SparkUI: Stopped Spark web UI at 
http://dhcp-brm-bl1-215-1e-east-10-135-123-92.usdhcp.oraclecorp.com:42250
14/05/21 16:08:24 INFO DAGScheduler: Stopping DAGScheduler
14/05/21 16:08:25 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor 
stopped!
14/05/21 16:08:25 INFO ConnectionManager: Selector thread was interrupted!
14/05/21 16:08:25 INFO ConnectionManager: ConnectionManager stopped
14/05/21 16:08:25 INFO MemoryStore: MemoryStore cleared
14/05/21 16:08:25 INFO BlockManager: BlockManager stopped
14/05/21 16:08:25 INFO BlockManagerMasterActor: Stopping BlockManagerMaster
14/05/21 16:08:25 INFO BlockManagerMaster: BlockManagerMaster stopped
14/05/21 16:08:25 INFO SparkContext: Successfully stopped SparkContext
14/05/21 16:08:25 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down 
remote daemon.
14/05/21 16:08:25 INFO ApplicationMaster: finishApplicationMaster with SUCCEEDED
14/05/21 16:08:25 INFO ApplicationMaster: AppMaster received a signal.
14/05/21 16:08:25 INFO ApplicationMaster: Deleting staging directory 
.sparkStaging/application_1400696988985_0007
14/05/21 16:08:25 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon 
shut down; proceeding with flushing remote transports.
14/05/21 16:08:25 ERROR ApplicationMaster: Failed to cleanup staging dir 
.sparkStaging/application_1400696988985_0007
java.io.IOException: Filesystem closed
    at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:689)
    at org.apache.hadoop.hdfs.DFSClient.delete(DFSClient.java:1685)
    at
org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:591)
    at
org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:587)
    at
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at
org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:587)
    at
org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$cleanupStagingDir(ApplicationMaster.scala:371)
    at
org.apache.spark.deploy.yarn.ApplicationMaster$AppMasterShutdownHook.run(ApplicationMaster.scala:386)
    at

Re: ExternalAppendOnlyMap: Spilling in-memory map

2014-05-21 Thread Andrew Ash
Hi Mohit,

The log line about the ExternalAppendOnlyMap is more of a symptom of
slowness than causing slowness itself.  The ExternalAppendOnlyMap is used
when a shuffle is causing too much data to be held in memory.  Rather than
OOM'ing, Spark writes the data out to disk in a sorted order and reads it
back from disk later on when it's needed.  That's the job of the
ExternalAppendOnlyMap.

I wouldn't normally expect a conversion from Date to a Joda DateTime to
take significantly more memory.  But since you're using Kryo and classes
should be registered with it, you may have forgotten to register DateTime
with Kryo.  If you don't register a class, it writes the class name at the
beginning of every serialized instance, which, for DateTime objects roughly
the size of one long, is a ton of extra space and very inefficient.

Can you confirm that DateTime is registered with Kryo?

http://spark.apache.org/docs/latest/tuning.html#data-serialization
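A minimal sketch of registering Joda's DateTime the way that page describes
(the registrator class name is made up; register whatever types your job
actually serializes):

  import com.esotericsoftware.kryo.Kryo
  import org.apache.spark.serializer.KryoRegistrator

  class MyKryoRegistrator extends KryoRegistrator {
    override def registerClasses(kryo: Kryo) {
      kryo.register(classOf[org.joda.time.DateTime])
    }
  }

  // and in the SparkConf:
  // conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // conf.set("spark.kryo.registrator", "your.package.MyKryoRegistrator")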


On Wed, May 21, 2014 at 2:35 PM, Mohit Jaggi mohitja...@gmail.com wrote:

 Hi,

 I changed my application to use Joda time instead of java.util.Date and I
 started getting this:

 WARN ExternalAppendOnlyMap: Spilling in-memory map of 484 MB to disk (1
 time so far)

 What does this mean? How can I fix this? Due to this a small job takes
 forever.

 Mohit.


 P.S.: I am using kyro serialization, have played around with several
 values of sparkRddMemFraction



Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-21 Thread Andrew Ash
One thing you can try is to pull each file out of S3 and decompress with
gzip -d to see if it works.  I'm guessing there's a corrupted .gz file
somewhere in your path glob.

Andrew
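If pulling every file out of S3 is impractical, a rough variant of the same
idea from the Spark shell (bucket and key names are hypothetical) is to count
each file separately so the corrupt one identifies itself:

  val paths = Seq(
    "s3n://my-bucket/logs/part-0001.gz",
    "s3n://my-bucket/logs/part-0002.gz")
  for (p <- paths) {
    try {
      println(p + " -> " + sc.textFile(p).count() + " lines")
    } catch {
      case e: Exception => println(p + " FAILED: " + e)
    }
  }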


On Wed, May 21, 2014 at 12:40 PM, Michael Cutler mich...@tumra.com wrote:

 Hi Nick,

 Which version of Hadoop are you using with Spark?  I spotted an issue with
 the built-in GzipDecompressor while doing something similar with Hadoop
 1.0.4; all my Gzip files were valid and tested, yet certain files blew up
 from Hadoop/Spark.

 The following JIRA ticket goes into more detail
 https://issues.apache.org/jira/browse/HADOOP-8900 and it affects all
 Hadoop releases prior to 1.2.X

 MC




  *Michael Cutler*
 Founder, CTO


 * Mobile: +44 789 990 7847 Email:   mich...@tumra.com mich...@tumra.com
 Web: tumra.com
 http://tumra.com/?utm_source=signatureutm_medium=email *
 *Visit us at our offices in Chiswick Park http://goo.gl/maps/abBxq*
 *Registered in England & Wales, 07916412. VAT No. 130595328*




 On 21 May 2014 14:26, Madhu ma...@madhu.com wrote:

 Can you identify a specific file that fails?
 There might be a real bug here, but I have found gzip to be reliable.
 Every time I have run into a bad header error with gzip, I had a
 non-gzip
 file with the wrong extension for whatever reason.




 -
 Madhu
 https://www.linkedin.com/in/msiddalingaiah
 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/count-ing-gz-files-gives-java-io-IOException-incorrect-header-check-tp5768p6169.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





Re: Using Spark to analyze complex JSON

2014-05-21 Thread Nicholas Chammas
That's a good idea. So you're saying create a SchemaRDD by applying a
function that deserializes the JSON and transforms it into a relational
structure, right?

The end goal for my team would be to expose some JDBC endpoint for analysts
to query from, so once Shark is updated to use Spark SQL that would become
possible without having to resort to using Hive at all.
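Roughly what that could look like once the JSON is mapped onto a case class
(a sketch against the 1.0-era Spark SQL API; the class, the hypothetical
parseToPerson deserializer, and the table name are assumptions, and the nested
and array queries from the earlier mail would still not work):

  import org.apache.spark.sql.SQLContext

  case class Person(name: String, x: Double, y: Double)

  val sqlContext = new SQLContext(sc)
  import sqlContext.createSchemaRDD

  val people = json_data.flatMap(parseToPerson)   // parseToPerson: String => Option[Person]
  people.registerAsTable("json_table")
  val names = sqlContext.sql("SELECT DISTINCT name FROM json_table")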


On Wed, May 21, 2014 at 11:11 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Hi,

 as far as I understand, if you create an RDD with a relational
 structure from your JSON, you should be able to do much of that
 already today. For example, take lift-json's deserializer and do
 something like

   val json_table: RDD[MyCaseClass] = json_data.flatMap(json =>
     json.extractOpt[MyCaseClass])

 then I guess you can use Spark SQL on that. (Something like your
 likes[2] query won't work, though, I guess.)

 Regards
 Tobias


 On Thu, May 22, 2014 at 5:32 AM, Nicholas Chammas
 nicholas.cham...@gmail.com wrote:
  Looking forward to that update!
 
  Given a table of JSON objects like this one:
 
  {
    "name": "Nick",
    "location": {
      "x": 241.6,
      "y": -22.5
    },
    "likes": ["ice cream", "dogs", "Vanilla Ice"]
  }
 
  It would be SUPER COOL if we could query that table in a way that is as
  natural as follows:
 
  SELECT DISTINCT name
  FROM json_table;
 
  SELECT MAX(location.x)
  FROM json_table;
 
  SELECT likes[2] -- "Ice Ice Baby"
  FROM json_table
  WHERE name = "Nick";
 
  Of course, this is just a hand-wavy suggestion of how I’d like to be
 able to
  query JSON (particularly that last example) using SQL. I’m interested in
  seeing what y’all come up with.
 
  A large part of what my team does is make it easy for analysts to explore
  and query JSON data using SQL. We have a fairly complex home-grown
 process
  to do that and are looking to replace it with something more out of the
 box.
  So if you’d like more input on how users might use this feature, I’d be
 glad
  to chime in.
 
  Nick
 
 
 
  On Wed, May 21, 2014 at 11:21 AM, Michael Armbrust 
 mich...@databricks.com
  wrote:
 
  You can already extract fields from json data using Hive UDFs.  We have
 an
  intern working on on better native support this summer.  We will be
 sure to
  post updates once there is a working prototype.
 
  Michael
 
 
  On Tue, May 20, 2014 at 6:46 PM, Nick Chammas 
 nicholas.cham...@gmail.com
  wrote:
 
  The Apache Drill home page has an interesting heading: Liberate Nested
  Data.
 
  Is there any current or planned functionality in Spark SQL or Shark to
  enable SQL-like querying of complex JSON?
 
  Nick
 
 
  
  View this message in context: Using Spark to analyze complex JSON
  Sent from the Apache Spark User List mailing list archive at
 Nabble.com.
 
 
 



RE: yarn-client mode question

2014-05-21 Thread Sophia
Thank you 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/yarn-client-mode-question-tp6213p6224.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-21 Thread Nicholas Chammas
Thanks for the suggestions, people. I will try to hone in on which specific
gzipped files, if any, are actually corrupt.

Michael,

I’m using Hadoop 1.0.4, which I believe is the default version that gets
deployed by spark-ec2. The JIRA issue I linked to earlier,
HADOOP-5281https://issues.apache.org/jira/browse/HADOOP-5281,
affects Hadoop 0.18.0 and is fixed in 0.20.0 and is also related to gzip
compression. I know there is some funkiness in how Hadoop is versioned, so
I’m not sure if this issue is relevant to 1.0.4.

Were you able to resolve your issue by changing your version of Hadoop? How
did you do that?

Nick


On Wed, May 21, 2014 at 11:38 PM, Andrew Ash and...@andrewash.com wrote:

 One thing you can try is to pull each file out of S3 and decompress with
 gzip -d to see if it works.  I'm guessing there's a corrupted .gz file
 somewhere in your path glob.

 Andrew


 On Wed, May 21, 2014 at 12:40 PM, Michael Cutler mich...@tumra.comwrote:

 Hi Nick,

 Which version of Hadoop are you using with Spark?  I spotted an issue
 with the built-in GzipDecompressor while doing something similar with
 Hadoop 1.0.4, all my Gzip files were valid and tested yet certain files
 blew up from Hadoop/Spark.

 The following JIRA ticket goes into more detail
 https://issues.apache.org/jira/browse/HADOOP-8900 and it affects all
 Hadoop releases prior to 1.2.X

 MC




  *Michael Cutler*
 Founder, CTO


 * Mobile: +44 789 990 7847 Email:   mich...@tumra.com mich...@tumra.com
 Web: tumra.com
 http://tumra.com/?utm_source=signatureutm_medium=email *
 *Visit us at our offices in Chiswick Park http://goo.gl/maps/abBxq*
 *Registered in England & Wales, 07916412. VAT No. 130595328*




 On 21 May 2014 14:26, Madhu ma...@madhu.com wrote:

 Can you identify a specific file that fails?
 There might be a real bug here, but I have found gzip to be reliable.
 Every time I have run into a bad header error with gzip, I had a
 non-gzip
 file with the wrong extension for whatever reason.




 -
 Madhu
 https://www.linkedin.com/in/msiddalingaiah
 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/count-ing-gz-files-gives-java-io-IOException-incorrect-header-check-tp5768p6169.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.






Re: Run Apache Spark on Mini Cluster

2014-05-21 Thread Krishna Sankar
It depends on what stack you want to run. A quick cut:

   - Worker machines (DataNode, HBase Region Servers, Spark Worker nodes)
      - Dual 6-core CPU
      - 64 to 128 GB RAM
      - 3 x 3 TB disk (JBOD)
   - Master node (NameNode, HBase Master, Spark Master)
      - Dual 6-core CPU
      - 64 to 128 GB RAM
      - 2 x 3 TB disk (RAID 1+0)
   - Start with a 5-node setup and scale out as needed
   - If your load is MapReduce over HDFS, then run YARN
   - If your load is HBase over HDFS, scale depending on the computational
     and storage needs
   - If you are running Spark over HDFS, scale appropriately - you might
     need more memory in the worker nodes
   - In any case, have a topology and the processes that they would run. As
     Soumya suggests, you can prototype at an appropriate scale using AWS.

Cheers
k/.


On Wed, May 21, 2014 at 5:14 PM, Upender Nimbekar upent...@gmail.comwrote:

 Hi,
 I would like to setup apache platform on a mini cluster. Is there any
 recommendation for the hardware that I can buy to set it up. I am thinking
 about processing significant amount of data like in the range of few
 terabytes.

 Thanks
 Upender



Best way to deploy a jar to spark cluster?

2014-05-21 Thread Min Li
Hi,

I'm quite new and recently started to try Spark. I've set up a single-node
Spark cluster and followed the tutorials in Quick Start. But I've come
across some issues.

The thing I was trying to do is to try the Java API and run it on the
single-node cluster. I followed the Quick Start / A Standalone App in Java
guide and successfully ran it using Maven. But when I was trying to use
./bin/spark-class org.apache.spark.deploy.Client launch to submit the jar,
I found there were both a driver and an app running on the cluster. When
running with Maven directly, I only saw the app running.

So I was thinking I could build a jar with all the dependencies in order
to distribute and run it using just java -cp my.jar MainClass Arguments. But
I came across the "Exception in thread main
com.typesafe.config.ConfigException$Missing: No configuration setting found
for key 'akka.version'" issue. Then I tried to mark org.apache.spark
as provided in the pom.xml. I can build the jar, but when executing with
java -cp my.jar, it just reports that it cannot find the Spark dependencies. And
using the ./bin/spark-class org.apache.spark.deploy.Client launch method
just goes back to having a driver and an app at the same time.

So I'm wondering: what's the best way to generate a jar with dependencies
and submit it to the Spark cluster as a single app? Could somebody give me
some advice on this? Thank you!

Best Regards,
Min Li
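For what it's worth, the 'akka.version' error usually comes from the fat-jar
build discarding Akka's reference.conf. A rough sketch of the common fix with
sbt-assembly (plugin version and exact syntax are assumptions; maven-shade has
an equivalent resource transformer for concatenating reference.conf):

  import AssemblyKeys._

  assemblySettings

  mergeStrategy in assembly <<= (mergeStrategy in assembly) { old =>
    {
      case "reference.conf" => MergeStrategy.concat   // concatenate, don't drop
      case x                => old(x)
    }
  }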


Re: Failed RC-10 yarn-cluster job for FS closed error when cleaning up staging directory

2014-05-21 Thread Tathagata Das
Are you running a vanilla Hadoop 2.3.0 or the one that comes with CDH5 /
HDP(?) ? We may be able to reproduce this in that case.

TD

On Wed, May 21, 2014 at 8:35 PM, Tom Graves tgraves...@yahoo.com wrote:

 It sounds like something is closing the hdfs filesystem before everyone is
 really done with it. The filesystem gets cached and is shared so if someone
 closes it while other threads are still using it you run into this error.   Is
 your application closing the filesystem? Are you using the event
 logging feature?   Could you share the options you are running with?

 Yarn will retry the application depending on how the Application Master
 attempt fails (this is a configurable setting as to how many times it
 retries).  That is probably the second driver you are referring to.  But
 they shouldn't have overlapped as far as both being up at the same time. Is
 that the case you are seeing?  Generally you want to look at why the first
 application attempt fails.

 Tom



   On Wednesday, May 21, 2014 6:10 PM, Kevin Markey 
 kevin.mar...@oracle.com wrote:


  I tested an application on RC-10 and Hadoop 2.3.0 in yarn-cluster mode
 that had run successfully with Spark-0.9.1 and Hadoop 2.3 or 2.2.  The
 application successfully ran to conclusion but it ultimately failed.

 There were 2 anomalies...

 1. ASM reported only that the application was ACCEPTED.  It never
 indicated that the application was RUNNING.

 14/05/21 16:06:12 INFO yarn.Client: Application report from ASM:
  application identifier: application_1400696988985_0007
  appId: 7
  clientToAMToken: null
  appDiagnostics:
  appMasterHost: N/A
  appQueue: default
  appMasterRpcPort: -1
  appStartTime: 1400709970857
  yarnAppState: ACCEPTED
  distributedFinalState: UNDEFINED
  appTrackingUrl:
 http://Sleepycat:8088/proxy/application_1400696988985_0007/http://sleepycat:8088/proxy/application_1400696988985_0007/
  appUser: hduser

 Furthermore, it *started a second container*, running two partly
 *overlapping* drivers, when it appeared that the application never
 started.  Each container ran to conclusion as explained above, taking twice
 as long as usual for both to complete.  Both instances had the same
 concluding failure.

 2. Each instance failed as indicated by the stderr log, finding that the 
 *filesystem
 was closed* when trying to clean up the staging directories.

 14/05/21 16:08:24 INFO Executor: Serialized size of result for 1453 is 863
 14/05/21 16:08:24 INFO Executor: Sending result for 1453 directly to driver
 14/05/21 16:08:24 INFO Executor: Finished task ID 1453
 14/05/21 16:08:24 INFO TaskSetManager: Finished TID 1453 in 202 ms on
 localhost (progress: 2/2)
 14/05/21 16:08:24 INFO DAGScheduler: Completed ResultTask(1507, 1)
 14/05/21 16:08:24 INFO TaskSchedulerImpl: Removed TaskSet 1507.0, whose
 tasks have all completed, from pool
 14/05/21 16:08:24 INFO DAGScheduler: Stage 1507 (count at KEval.scala:32)
 finished in 0.417 s
 14/05/21 16:08:24 INFO SparkContext: Job finished: count at
 KEval.scala:32, took 1.532789283 s
 14/05/21 16:08:24 INFO SparkUI: Stopped Spark web UI at
 http://dhcp-brm-bl1-215-1e-east-10-135-123-92.usdhcp.oraclecorp.com:42250
 14/05/21 16:08:24 INFO DAGScheduler: Stopping DAGScheduler
 14/05/21 16:08:25 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor
 stopped!
 14/05/21 16:08:25 INFO ConnectionManager: Selector thread was interrupted!
 14/05/21 16:08:25 INFO ConnectionManager: ConnectionManager stopped
 14/05/21 16:08:25 INFO MemoryStore: MemoryStore cleared
 14/05/21 16:08:25 INFO BlockManager: BlockManager stopped
 14/05/21 16:08:25 INFO BlockManagerMasterActor: Stopping BlockManagerMaster
 14/05/21 16:08:25 INFO BlockManagerMaster: BlockManagerMaster stopped
 14/05/21 16:08:25 INFO SparkContext: Successfully stopped SparkContext
 14/05/21 16:08:25 INFO RemoteActorRefProvider$RemotingTerminator: Shutting
 down remote daemon.
 14/05/21 16:08:25 INFO ApplicationMaster: *finishApplicationMaster with
 SUCCEEDED*
 14/05/21 16:08:25 INFO ApplicationMaster: AppMaster received a signal.
 14/05/21 16:08:25 INFO ApplicationMaster: Deleting staging directory
 .sparkStaging/application_1400696988985_0007
 14/05/21 16:08:25 INFO RemoteActorRefProvider$RemotingTerminator: Remote
 daemon shut down; proceeding with flushing remote transports.
 14/05/21 16:08:25 ERROR *ApplicationMaster: Failed to cleanup staging dir
 .sparkStaging/application_1400696988985_0007*
 *java.io.IOException: Filesystem closed*
 at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:689)
 at org.apache.hadoop.hdfs.DFSClient.delete(DFSClient.java:1685)
 at
 org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:591)
 at
 org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:587)
 at
 org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
 at