Re: Setting queue for spark job on yarn

2014-05-21 Thread Ron Gonzalez
Btw, I'm on 0.9.1. Will setting a queue programmatically be available in 1.0? Thanks, Ron Sent from my iPad On May 20, 2014, at 6:27 PM, Ron Gonzalez zlgonza...@yahoo.com wrote: Hi Sandy, Is there a programmatic way? We're building a platform as a service and need to assign it to
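(For reference, and as an assumption about what landed rather than a quote from this thread: Spark 1.0's spark-submit accepts a --queue option for YARN, e.g. ./bin/spark-submit --master yarn --queue myqueue ..., so a platform layer can pick the queue per job when it builds the submit command; "myqueue" here is a placeholder.)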

MLlib ALS-- Errors communicating with MapOutputTracker

2014-05-21 Thread Sue Cai
Hello, I am currently using MLlib ALS to process a large volume of data, about 1.2 billion Rating(userId, productId, rates) triples. The dataset was separated into 4000 partitions for parallelized computation on our YARN clusters. I encountered this error: Errors communicating with
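For context, a minimal sketch of the call shape in question (assuming MLlib 0.9/1.0's ALS.train overload that takes a block count, and that sc is the SparkContext; the path, rank, iterations, and lambda are illustrative):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // Parse "userId,productId,rate" lines into Rating triples.
    val ratings = sc.textFile("hdfs:///ratings.csv").map { line =>
      val Array(u, p, r) = line.split(",")
      Rating(u.toInt, p.toInt, r.toDouble)
    }
    // `blocks` controls how ALS partitions the computation, analogous to
    // the 4000 partitions mentioned above.
    val model = ALS.train(ratings, 10 /* rank */, 10 /* iterations */, 0.01 /* lambda */, 4000 /* blocks */)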

Re: any way to control memory usage when streaming input's speed is faster than the speed of handled by spark streaming ?

2014-05-21 Thread Tathagata Das
Unfortunately, there is no API support for this right now. You could implement it yourself by implementing your own receiver and controlling the rate at which objects are received. If you are using any of the standard receivers (Flume, Kafka, etc.), I recommend looking at the source code of the

Re: any way to control memory usage when streaming input's speed is faster than the speed of handled by spark streaming ?

2014-05-21 Thread Tathagata Das
Apologies for the premature send. Unfortunately, there is no API support for this right now. You could implement it yourself by implementing your own receiver and controlling the rate at which objects are received. If you are using any of the standard receivers (Flume, Kafka, etc.), I
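To make the suggestion concrete, a minimal sketch of a rate-limited receiver (assuming Spark 1.0's Receiver API; fetchRecord() is a hypothetical stand-in for reading one record from the actual source):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class ThrottledReceiver(maxPerSec: Int)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      def onStart(): Unit = {
        new Thread("throttled-receiver") {
          override def run(): Unit = receive()
        }.start()
      }

      def onStop(): Unit = {} // the receive loop exits via isStopped()

      private def receive(): Unit = {
        while (!isStopped()) {
          val start = System.currentTimeMillis()
          var n = 0
          while (n < maxPerSec && !isStopped()) {
            store(fetchRecord()) // hypothetical read from the real source
            n += 1
          }
          // Sleep out the rest of the one-second window before reading more.
          val elapsed = System.currentTimeMillis() - start
          if (elapsed < 1000) Thread.sleep(1000 - elapsed)
        }
      }

      private def fetchRecord(): String = ??? // hypothetical source read
    }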

Re: advice on maintaining a production spark cluster?

2014-05-21 Thread sagi
If you saw an exception message like the one mentioned in the JIRA https://issues.apache.org/jira/browse/SPARK-1886 in the worker's log file, you are welcome to try https://github.com/apache/spark/pull/827 On Wed, May 21, 2014 at 11:21 AM, Josh Marcus jmar...@meetup.com wrote: Aaron: I see this

ClassNotFoundException with Spark/Mesos (spark-shell works fine)

2014-05-21 Thread Tobias Pfeiffer
Hi, I have set up a cluster with Mesos (backed by Zookeeper) with three master and three slave instances. I set up Spark (git HEAD) for use with Mesos according to this manual: http://people.apache.org/~pwendell/catalyst-docs/running-on-mesos.html Using the spark-shell, I can connect to this

Re: Ignoring S3 0 files exception

2014-05-21 Thread Laurent T
No one has any idea? It's really troublesome; it seems like I have no way to catch errors while an action is being processed and just ignore them. Here's a bit more detail on what I'm doing: JavaRDD a = sc.textFile("s3n://" + missingFilenamePattern) JavaRDD b =
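One workaround, sketched under the assumption that checking the glob up front is acceptable (uses the Hadoop FileSystem API; the bucket and pattern are placeholders, and s3n credential configuration is omitted):

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val pattern = "s3n://my-bucket/logs-*" // placeholder pattern
    val fs = FileSystem.get(new URI(pattern), new Configuration())
    val matches = fs.globStatus(new Path(pattern))
    // globStatus returns null (or an empty array) when nothing matches,
    // so the action on the RDD is never attempted for a missing pattern.
    if (matches != null && matches.nonEmpty) {
      val a = sc.textFile(pattern)
      // ... run the action on a
    }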

RDD union of a window in Dstream

2014-05-21 Thread Laeeq Ahmed
Hi, I want to do a union of all RDDs in each window of a DStream. I found DStream.union but haven't seen anything like DStream.windowRDDUnion. Is there any way around it? I want to find the mean and SD of all values which come under each sliding window, for which I need to union all the RDDs in each

Log analysis

2014-05-21 Thread Shubhabrata
I am new to Spark and we are developing a data science pipeline based on Spark on EC2. So far we have been using Python on a Spark standalone cluster. However, being a newbie, I would like to know more about how I can do debugging (program level) from the Spark logs (is it stderr?). I find it a bit

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-21 Thread Madhu
Can you identify a specific file that fails? There might be a real bug here, but I have found gzip to be reliable. Every time I have run into a bad header error with gzip, I had a non-gzip file with the wrong extension for whatever reason. - Madhu https://www.linkedin.com/in/msiddalingaiah

pyspark.rdd.ResultIterable?

2014-05-21 Thread T.J. Alumbaugh
Hi, I'm noticing a difference between two installations of Spark. I'm pretty sure both are version 0.9.1. One is able to import pyspark.rdd.ResultIterable and the other isn't. Is this an environment problem or do we actually have two different versions of Spark? To be clear, on one box, one can

Re: ClassNotFoundException with Spark/Mesos (spark-shell works fine)

2014-05-21 Thread Gerard Maas
Hi Tobias, Regarding my comment on closure serialization: I was discussing it with my fellow Sparkers here and I totally overlooked the fact that you need the class files to de-serialize the closures (or whatever) on the workers, so you always need the jar file delivered to the workers in order
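In code, delivering the application jar to the workers looks roughly like this (a sketch, assuming the SparkConf API; the ZooKeeper hosts and jar path are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("mesos://zk://zk1:2181,zk2:2181,zk3:2181/mesos") // placeholder quorum
      .setAppName("MyApp")
      .setJars(Seq("/path/to/my-app.jar")) // shipped to workers so closures can deserialize
    val sc = new SparkContext(conf)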

Re: Python, Spark and HBase

2014-05-21 Thread twizansk
Thanks Nick and Matei. I'll take a look at the patch and keep you updated. Tommer

Re: advice on maintaining a production spark cluster?

2014-05-21 Thread Mark Hamstra
After the several fixes that we have made to exception handling in Spark 1.0.0, I expect that this behavior will be quite different from 0.9.1. Executors should be far more likely to shutdown cleanly in the event of errors, allowing easier restarts. But I expect that there will be more bugs to

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-21 Thread Michael Cutler
Hi Nick, Which version of Hadoop are you using with Spark? I spotted an issue with the built-in GzipDecompressor while doing something similar with Hadoop 1.0.4, all my Gzip files were valid and tested yet certain files blew up from Hadoop/Spark. The following JIRA ticket goes into more detail

Re: File list read into single RDD

2014-05-21 Thread Pat Ferrel
Thanks this really helps. As long as I stick to HDFS paths, and files I’m good. I do know that code a bit but have never used it to say take input from one cluster via “hdfs://server:port/path” and output to another via “hdfs://another-server:another-port/path”. This seems to be supported by
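The cross-cluster pattern under discussion is just full URIs on both ends, e.g. (a sketch; the host names and NameNode port are placeholders):

    val in = sc.textFile("hdfs://server:8020/path/input")
    in.saveAsTextFile("hdfs://another-server:8020/another-path/output")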

Is spark 1.0.0 spark-shell --master=yarn running in yarn-cluster mode or yarn-client mode?

2014-05-21 Thread Andrew Lee
Does anyone know if ./bin/spark-shell --master yarn runs in yarn-cluster or yarn-client mode by default? Based on the source code (./core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala): if (args.deployMode == "cluster" && args.master.startsWith("yarn")) { args.master = "yarn-cluster"

Job Processing Large Data Set Got Stuck

2014-05-21 Thread yxzhao
I ran the pagerank example processing a large data set, 5GB in size, using 48 machines. The job got stuck at the time point 14/05/20 21:32:17, as the attached log shows. It was stuck there for more than 10 hours and then I killed it at last. But I did not find any information explaining why it

Re: Job Processing Large Data Set Got Stuck

2014-05-21 Thread Xiangrui Meng
Many OutOfMemoryErrors in the log. Is your data distributed evenly? -Xiangrui On Wed, May 21, 2014 at 11:23 AM, yxzhao yxz...@ualr.edu wrote: I run the pagerank example processing a large data set, 5GB in size, using 48 machines. The job got stuck at the time point: 14/05/20 21:32:17, as the

Re: ClassNotFoundException with Spark/Mesos (spark-shell works fine)

2014-05-21 Thread Gerard Maas
Hi Tobias, On Wed, May 21, 2014 at 5:45 PM, Tobias Pfeiffer t...@preferred.jp wrote: first, thanks for your explanations regarding the jar files! No prob :-) On Thu, May 22, 2014 at 12:32 AM, Gerard Maas gerard.m...@gmail.com wrote: I was discussing it with my fellow Sparkers here and I

Re: Job Processing Large Data Set Got Stuck

2014-05-21 Thread yxzhao
Thanks Xiangrui, how can I check and make sure the data is distributed evenly? Thanks again. On Wed, May 21, 2014 at 2:17 PM, Xiangrui Meng [via Apache Spark User List] ml-node+s1001560n6187...@n3.nabble.com wrote: Many OutOfMemoryErrors in the log. Is your data distributed evenly? -Xiangrui On

Re: Job Processing Large Data Set Got Stuck

2014-05-21 Thread Xiangrui Meng
If the RDD is cached, you can check its storage information in the Storage tab of the Web UI. On Wed, May 21, 2014 at 12:31 PM, yxzhao yxz...@ualr.edu wrote: Thanks Xiangrui, How to check and make sure the data is distributed evenly? Thanks again. On Wed, May 21, 2014 at 2:17 PM, Xiangrui Meng
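A quick programmatic alternative (a sketch, assuming rdd is the dataset in question): count records per partition to spot skew without caching anything:

    // One Int per partition; small enough to collect to the driver.
    val sizes = rdd.mapPartitions(it => Iterator(it.size)).collect()
    println(sizes.sorted.mkString(", "))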

Re: ClassNotFoundException with Spark/Mesos (spark-shell works fine)

2014-05-21 Thread Andrew Ash
Here's the 1.0.0rc9 version of the docs: https://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/running-on-mesos.html I refreshed them with the goal of steering users more towards prebuilt packages than relying on compiling from source plus improving overall formatting and clarity, but not

Re: unsubscribe

2014-05-21 Thread Shangyu Luo
Does anyone know how to configure the digest mailing list? For example, I want to receive a daily digest, not one every 10 messages. Thanks! On Mon, May 19, 2014 at 4:29 PM, Shangyu Luo lsy...@gmail.com wrote: Hi Andrew and Madhu, Thank you for your help here! Will unsubscribe through another

Re: ClassNotFoundException with Spark/Mesos (spark-shell works fine)

2014-05-21 Thread Gerard Maas
Hi Andrew, Thanks for the current doc. I'd almost gotten to the point where I thought that my custom code needed to be included in the SPARK_EXECUTOR_URI but that can't possibly be correct. The Spark workers that are launched on Mesos slaves should start with the Spark core jars and then

Re: Using Spark to analyze complex JSON

2014-05-21 Thread Nicholas Chammas
Looking forward to that update! Given a table of JSON objects like this one: { "name": "Nick", "location": { "x": 241.6, "y": -22.5 }, "likes": ["ice cream", "dogs", "Vanilla Ice"] } It would be SUPER COOL if we could query that table in a way that is as natural as follows: SELECT

tests that run locally fail when run through bamboo

2014-05-21 Thread Adrian Mocanu
I have a few test cases for Spark which extend TestSuiteBase from org.apache.spark.streaming. The tests run fine on my machine but when I commit to repo and run the tests automatically with bamboo the test cases fail with these errors. How to fix? 21-May-2014 16:33:09 [info]

Inconsistent RDD Sample size

2014-05-21 Thread glxc
I have a graph and am trying to take a random sample of vertices without replacement, using the RDD.sample() method. verts are the vertices in the graph (val verts = graph.vertices), and executing this multiple times in a row: verts.sample(false, 1.toDouble/v1.count.toDouble,

RE: tests that run locally fail when run through bamboo

2014-05-21 Thread Adrian Mocanu
Just found this at the top of the log: 17:14:41.124 [pool-7-thread-3-ScalaTest-running-StreamingSpikeSpec] WARN o.e.j.u.component.AbstractLifeCycle - FAILED SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already in use build 21-May-2014 17:14:41

Re: Inconsistent RDD Sample size

2014-05-21 Thread Xiangrui Meng
It doesn't guarantee the exact sample size. If you fix the random seed, it would return the same result every time. -Xiangrui On Wed, May 21, 2014 at 2:05 PM, glxc r.ryan.mcc...@gmail.com wrote: I have a graph and am trying to take a random sample of vertices without replacement, using the
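If an exact sample size is required, takeSample does guarantee it (a sketch; note the result is collected to the driver, so keep the requested size small):

    // Exactly 1000 vertices, sampled without replacement; fixing the seed
    // makes the result repeatable across runs.
    val exact = verts.takeSample(false, 1000, 42)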

Re: Problem with loading files: Loss was due to java.io.EOFException java.io.EOFException

2014-05-21 Thread hakanilter
The problem is solved after adding the hadoop-core dependency. But I think there is a misunderstanding about local files. I found this one: Note that if you've connected to a Spark master, it's possible that it will attempt to load the file on one of the different machines in the cluster, so make sure
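One way to sidestep the identical-path requirement (a sketch, assuming SparkContext.addFile and SparkFiles; the path is a placeholder): ship the file with the job instead of relying on it existing locally on every node:

    import org.apache.spark.SparkFiles
    import scala.io.Source

    sc.addFile("/local/path/data.txt") // distributed to every node with the job
    val firstLines = sc.parallelize(Seq(0)).map { _ =>
      // Resolve the per-executor local copy inside the task.
      Source.fromFile(SparkFiles.get("data.txt")).getLines().next()
    }.collect()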

RE: Is spark 1.0.0 spark-shell --master=yarn running in yarn-cluster mode or yarn-client mode?

2014-05-21 Thread Andrew Lee
Ah, forgot the -verbose option. Thanks Andrew. That is very helpful. Date: Wed, 21 May 2014 11:07:55 -0700 Subject: Re: Is spark 1.0.0 spark-shell --master=yarn running in yarn-cluster mode or yarn-client mode? From: and...@databricks.com To: user@spark.apache.org The answer is actually

Re: tests that run locally fail when run through bamboo

2014-05-21 Thread Tathagata Das
This does happen sometimes, but it is a warning, because Spark is designed to try successive ports until it succeeds. So unless a crazy number of successive ports are blocked (runaway processes?? insufficient clearing of ports by the OS??), these errors should not be a problem for tests passing. On
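If parallel CI builds keep colliding on ports anyway, one option (an assumption on my part, not from the thread) is to pin each build to its own UI port via spark.ui.port when constructing the test context:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("StreamingSpikeSpec")
      .set("spark.ui.port", "4050") // avoid the default 4040 held by another build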

Failed RC-10 yarn-cluster job for FS closed error when cleaning up staging directory

2014-05-21 Thread Kevin Markey
I tested an application on RC-10 and Hadoop 2.3.0 in yarn-cluster mode that had run successfully with Spark-0.9.1 and Hadoop 2.3 or 2.2. The application successfully ran to conclusion but it ultimately failed. There were 2 anomalies... 1. ASM reported only

Re: I want to filter a stream by a subclass.

2014-05-21 Thread Tathagata Das
You could do records.filter { _.isInstanceOf[Orange] }.map { _.asInstanceOf[Orange] } On Wed, May 21, 2014 at 3:28 PM, Ian Holsman i...@holsman.com.au wrote: Hi. Firstly, I'm a newb (to both Scala & Spark). I have a stream that contains multiple types of records, and I would like to

Run Apache Spark on Mini Cluster

2014-05-21 Thread Upender Nimbekar
Hi, I would like to set up the Apache Spark platform on a mini cluster. Is there any recommendation for the hardware I can buy to set it up? I am thinking about processing a significant amount of data, in the range of a few terabytes. Thanks Upender

Re: Run Apache Spark on Mini Cluster

2014-05-21 Thread Soumya Simanta
Suggestion - try to get an idea of your hardware requirements by running a sample on Amazon's EC2 or Google compute engine. It's relatively easy (and cheap) to get started on the cloud before you invest in your own hardware IMO. On Wed, May 21, 2014 at 8:14 PM, Upender Nimbekar

Re: A new resource for getting examples of Spark RDD API calls

2014-05-21 Thread zhen
Great, thanks for that tip. I will update the documents!

Re: I want to filter a stream by a subclass.

2014-05-21 Thread Tobias Pfeiffer
On Thu, May 22, 2014 at 8:07 AM, Tathagata Das tathagata.das1...@gmail.com wrote: records.filter { _.isInstanceOf[Orange] }.map { _.asInstanceOf[Orange] } I think a Scala-ish way would be records.flatMap(_ match { case i: Int => Some(i) case _ => None })
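Applied to the original question (Orange being the subclass the poster wants to keep), that reads:

    val oranges = records.flatMap {
      case o: Orange => Some(o) // keep, now statically typed as Orange
      case _         => None    // drop everything else
    }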

Re: I want to filter a stream by a subclass.

2014-05-21 Thread Ian Holsman
Thanks Tobias & Tathagata, these are great. On Wed, May 21, 2014 at 8:02 PM, Tobias Pfeiffer t...@preferred.jp wrote: On Thu, May 22, 2014 at 8:07 AM, Tathagata Das tathagata.das1...@gmail.com wrote: records.filter { _.isInstanceOf[Orange] }.map { _.asInstanceOf[Orange] } I think a

yarn-client mode question

2014-05-21 Thread Sophia
In yarn-client mode, will Spark be deployed on the YARN nodes? If it is deployed only on the client, can Spark still submit the job to YARN?

Re: yarn-client mode question

2014-05-21 Thread Andrew Or
Hi Sophia, In yarn-client mode, the node that submits the application can either be inside or outside of the cluster. This node also hosts the driver (SparkContext) of the application. All the executors, however, will be launched on nodes inside the YARN cluster. Andrew 2014-05-21 18:17

Re: yarn-client mode question

2014-05-21 Thread Sophia
But I don't understand this point: is it necessary to deploy Spark slave nodes on the YARN nodes?

Re: RDD union of a window in Dstream

2014-05-21 Thread Tobias Pfeiffer
Hi, On Wed, May 21, 2014 at 9:42 PM, Laeeq Ahmed laeeqsp...@yahoo.com wrote: I want to do union of all RDDs in each window of DStream. A window *is* a union of all RDDs in the respective time interval. The documentation says a DStream is represented as a sequence of RDDs. However, data from a
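So the per-window statistics can be computed directly on the windowed stream, e.g. (a sketch, assuming values is a DStream[Double]; rdd.stats() returns a StatCounter via Spark's implicit DoubleRDDFunctions):

    import org.apache.spark.streaming.Seconds

    val windowed = values.window(Seconds(30), Seconds(10))
    windowed.foreachRDD { rdd =>
      if (rdd.count() > 0) {
        val s = rdd.stats() // count, mean, stdev over the whole window
        println("window mean=" + s.mean + " stdev=" + s.stdev)
      }
    }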

Re: Using Spark to analyze complex JSON

2014-05-21 Thread Tobias Pfeiffer
Hi, as far as I understand, if you create an RDD with a relational structure from your JSON, you should be able to do much of that already today. For example, take lift-json's deserializer and do something like val json_table: RDD[MyCaseClass] = json_data.flatMap(json =>
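Fleshed out, that might look like the following (a sketch, assuming lift-json on the classpath, json_data being an RDD[String], and case classes matching the JSON shown earlier in the thread):

    import net.liftweb.json._
    import org.apache.spark.rdd.RDD

    case class Location(x: Double, y: Double)
    case class Person(name: String, location: Location, likes: List[String])

    implicit val formats = DefaultFormats

    val json_table: RDD[Person] = json_data.flatMap { json =>
      // parseOpt/extractOpt silently drop records that fail to parse
      parseOpt(json).flatMap(_.extractOpt[Person])
    }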

RE: yarn-client mode question

2014-05-21 Thread Liu, Raymond
Seems you are asking whether Spark-related jars need to be deployed to the YARN cluster manually before you launch the application? Then no, you don't, just like any other YARN application. And it doesn't matter whether it is yarn-client or yarn-cluster mode. Best Regards, Raymond Liu -Original

Re: Failed RC-10 yarn-cluster job for FS closed error when cleaning up staging directory

2014-05-21 Thread Tom Graves
It sounds like something is closing the HDFS filesystem before everyone is really done with it. The filesystem gets cached and is shared, so if someone closes it while other threads are still using it you run into this error. Is your application closing the filesystem? Are you using the
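One common workaround when the stray close() can't be found (an assumption on my part, not something confirmed in this thread) is to opt out of Hadoop's shared FileSystem cache:

    // Each FileSystem.get() then returns a fresh instance, so a close()
    // elsewhere cannot invalidate the one the staging-dir cleanup is using.
    val hadoopConf = new org.apache.hadoop.conf.Configuration()
    hadoopConf.setBoolean("fs.hdfs.impl.disable.cache", true)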

Re: ExternalAppendOnlyMap: Spilling in-memory map

2014-05-21 Thread Andrew Ash
Hi Mohit, The log line about the ExternalAppendOnlyMap is more of a symptom of slowness than causing slowness itself. The ExternalAppendOnlyMap is used when a shuffle is causing too much data to be held in memory. Rather than OOM'ing, Spark writes the data out to disk in a sorted order and
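For reference, the knobs involved look roughly like this (assuming Spark 1.0 property names; the values are illustrative):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.shuffle.spill", "true")         // spill to disk instead of OOM (the default)
      .set("spark.shuffle.memoryFraction", "0.4") // more memory before spilling (default 0.3)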

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-21 Thread Andrew Ash
One thing you can try is to pull each file out of S3 and decompress with gzip -d to see if it works. I'm guessing there's a corrupted .gz file somewhere in your path glob. Andrew On Wed, May 21, 2014 at 12:40 PM, Michael Cutler mich...@tumra.com wrote: Hi Nick, Which version of Hadoop are
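The same check can be scripted from the JVM once the files are pulled down locally (a sketch; it mirrors gzip -t by decompressing to EOF):

    import java.io.{FileInputStream, IOException}
    import java.util.zip.GZIPInputStream

    def isValidGzip(path: String): Boolean = {
      val in = new GZIPInputStream(new FileInputStream(path))
      try {
        val buf = new Array[Byte](8192)
        while (in.read(buf) != -1) {} // reading to EOF forces full decompression
        true
      } catch {
        case _: IOException => false // corrupt header or truncated stream
      } finally {
        in.close()
      }
    }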

Re: Using Spark to analyze complex JSON

2014-05-21 Thread Nicholas Chammas
That's a good idea. So you're saying create a SchemaRDD by applying a function that deserializes the JSON and transforms it into a relational structure, right? The end goal for my team would be to expose some JDBC endpoint for analysts to query from, so once Shark is updated to use Spark SQL that

RE: yarn-client mode question

2014-05-21 Thread Sophia
Thank you

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-21 Thread Nicholas Chammas
Thanks for the suggestions, people. I will try to hone in on which specific gzipped files, if any, are actually corrupt. Michael, I’m using Hadoop 1.0.4, which I believe is the default version that gets deployed by spark-ec2. The JIRA issue I linked to earlier,

Re: Run Apache Spark on Mini Cluster

2014-05-21 Thread Krishna Sankar
It depends on what stack you want to run. A quick cut:
- Worker Machines (DataNode, HBase Region Servers, Spark Worker Nodes): Dual 6-core CPU, 64 to 128 GB RAM, 3 x 3TB disk (JBOD)
- Master Node (NameNode, HBase Master, Spark Master): Dual 6-core CPU, 64

Best way to deploy a jar to spark cluster?

2014-05-21 Thread Min Li
Hi, I'm quite new and recently started to try Spark. I've set up a single-node Spark cluster and followed the tutorials in the Quick Start, but I've come across some issues. The thing I was trying to do is to try the Java API and run it on the single-node cluster. I followed the Quick Start/A

Re: Failed RC-10 yarn-cluster job for FS closed error when cleaning up staging directory

2014-05-21 Thread Tathagata Das
Are you running vanilla Hadoop 2.3.0 or the one that comes with CDH5 / HDP? We may be able to reproduce this in that case. TD On Wed, May 21, 2014 at 8:35 PM, Tom Graves tgraves...@yahoo.com wrote: It sounds like something is closing the hdfs filesystem before everyone is really done