Status stays at ACCEPTED

2014-05-20 Thread Jan Holmberg
Hi, I’m new to Spark and trying to test my first Spark program. I’m running SparkPi successfully in yarn-client mode, but when running the same in yarn-cluster mode the app gets stuck in the ACCEPTED state. I’ve spent hours trying to hunt down the reason, but the outcome is always the same. Any hints on what to look for next?

Re: combinebykey throw classcastexception

2014-05-20 Thread Sean Owen
You asked off-list, and provided a more detailed example there: val random = new Random() val testdata = (1 to 1).map(_ => (random.nextInt(), random.nextInt())) sc.parallelize(testdata).combineByKey[ArrayBuffer[Int]]( (instant: Int) => { new ArrayBuffer[Int]() },
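
For reference, a minimal, self-contained sketch of the same combineByKey call with the arrows restored and all three functions filled in (the data, key range, and ArrayBuffer accumulator are illustrative, not the original poster's exact code; assumes a shell SparkContext sc):

    import scala.collection.mutable.ArrayBuffer
    import scala.util.Random

    val random = new Random()
    val testdata = (1 to 100).map(_ => (random.nextInt(10), random.nextInt()))
    val combined = sc.parallelize(testdata).combineByKey[ArrayBuffer[Int]](
      // createCombiner: start a buffer from the first value seen for a key
      (v: Int) => ArrayBuffer(v),
      // mergeValue: fold another value for the same key into the buffer
      (buf: ArrayBuffer[Int], v: Int) => buf += v,
      // mergeCombiners: merge buffers built on different partitions
      (b1: ArrayBuffer[Int], b2: ArrayBuffer[Int]) => b1 ++= b2
    )
    combined.collect()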

Re: Status stays at ACCEPTED

2014-05-20 Thread sandy . ryza
Hi Jan, How much memory capacity is configured for each node? If you go to the ResourceManager web UI, does it indicate any containers are running? -Sandy On May 19, 2014, at 11:43 PM, Jan Holmberg jan.holmb...@perigeum.fi wrote: Hi, I’m new to Spark and trying to test first Spark prog.

Re: combinebykey throw classcastexception

2014-05-20 Thread xiemeilong
It turned out just now that this issue was caused by a version mismatch between the driver (0.9.1) and the server (0.9.0-cdh5.0.1). Every other function worked fine before, but combineByKey did not. Thank you very much for your reply.

Re: Status stays at ACCEPTED

2014-05-20 Thread Jan Holmberg
Hi, each node has 4 GB of memory. After a total reboot and re-run of SparkPi, the resource manager shows no running containers and 1 pending container. -jan On 20 May 2014, at 10:24, sandy.r...@cloudera.com sandy.r...@cloudera.com wrote: Hi Jan, How much memory capacity is configured for each

Re: Worker re-spawn and dynamic node joining

2014-05-20 Thread Han JU
Thank you guys for the detailed answers. Akhil, yes, I would like to try your tool. Is it open-sourced? 2014-05-17 17:55 GMT+02:00 Mayur Rustagi mayur.rust...@gmail.com: A better way would be to use Mesos (and quite possibly YARN in 1.0.0). That will allow you to add nodes on the fly

question about the license of akka and Spark

2014-05-20 Thread YouPeng Yang
Hi, I just learned that Akka is under a commercial license; however, Spark is under the Apache license. Is there any problem? Regards

Re: question about the license of akka and Spark

2014-05-20 Thread Tathagata Das
Akka is under the Apache 2 license too. http://doc.akka.io/docs/akka/snapshot/project/licenses.html On Tue, May 20, 2014 at 2:16 AM, YouPeng Yang yypvsxf19870...@gmail.com wrote: Hi, I just learned that Akka is under a commercial license; however, Spark is under the Apache license. Is there any problem?

Re: question about the license of akka and Spark

2014-05-20 Thread Sean Owen
The page says Akka is "Open Source and available under the Apache 2 License." It may also be available under another license, but that does not change the fact that it may be used by adhering to the terms of the AL2. The section is referring to commercial support that Typesafe sells. I am not even

Re: Yarn configuration file doesn't work when run with yarn-client mode

2014-05-20 Thread gaurav.dasgupta
A few more details I would like to provide (sorry, I should have included these in the previous post): - Spark Version = 0.9.1 (using pre-built spark-0.9.1-bin-hadoop2) - Hadoop Version = 2.4.0 (Hortonworks) - I am trying to execute a Spark Streaming program. Because I am using Hortonworks

Re: Status stays at ACCEPTED

2014-05-20 Thread Jan Holmberg
Still the same. I increased the memory of the node holding the resource manager to 5 GB. I also spotted an HDFS alert about a replication factor of 3, which I have now dropped to match the number of data nodes. I also shut down all services not in use. Still the issue remains. I have noticed the following two events
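
For anyone hitting the same wall: an app stuck in ACCEPTED with a pending container often means YARN cannot satisfy the ApplicationMaster's memory request. As a hedged illustration only (the values below are examples, not this cluster's actual settings), the relevant yarn-site.xml properties to check are:

    yarn.nodemanager.resource.memory-mb = 3072    # memory YARN may allocate on each node
    yarn.scheduler.maximum-allocation-mb = 2048   # largest single container YARN will grant

If the AM asks for more than either limit allows, the app waits in ACCEPTED indefinitely.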

Re: life if an executor

2014-05-20 Thread Koert Kuipers
if they are tied to the spark context, then why can the subprocess not be started up with the extra jars (sc.addJars) already on class path? this way a switch like user-jars-first would be a simple rearranging of the class path for the subprocess, and the messing with classloaders that is

Re: life if an executor

2014-05-20 Thread Koert Kuipers
just for my clarification: off heap cannot be java objects, correct? so we are always talking about serialized off-heap storage? On May 20, 2014 1:27 AM, Tathagata Das tathagata.das1...@gmail.com wrote: That's one the main motivation in using Tachyon ;) http://tachyon-project.org/ It gives

Ignoring S3 0 files exception

2014-05-20 Thread Laurent T
Hi, I'm trying to get data from S3 using sc.textFile("s3n://" + filenamePattern). It seems that if a pattern gives no result I get an exception like so: org.apache.hadoop.mapred.InvalidInputException: Input Pattern s3n://bucket/20140512/* matches 0 files at
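
One hedged workaround is to test each pattern with the Hadoop FileSystem glob API before handing it to Spark, so sc.textFile never sees a pattern matching 0 files. A sketch only (it assumes the s3n filesystem is configured and a shell SparkContext sc; bucket names are made up):

    import java.net.URI
    import org.apache.hadoop.fs.{FileSystem, Path}

    // True if the glob pattern matches at least one file.
    def matchesSomething(pattern: String): Boolean = {
      val fs = FileSystem.get(URI.create(pattern), sc.hadoopConfiguration)
      val statuses = fs.globStatus(new Path(pattern))
      statuses != null && statuses.nonEmpty
    }

    val patterns = Seq("s3n://bucket/20140512/*", "s3n://bucket/20140513/*")
    val usable = patterns.filter(matchesSomething)
    val data = sc.union(usable.map(p => sc.textFile(p)))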

Re: Advanced log processing

2014-05-20 Thread Laurent T
Thanks for the advice. I think you're right. I'm not sure we're going to use HBase, but starting by partitioning data into multiple buckets will be a first step. I'll see how it performs on large datasets. My original question, though, was more like: is there a Spark trick I don't know about?

reading large XML files

2014-05-20 Thread Nathan Kronenfeld
We are trying to read some large GraphML files to use in spark. Is there an easy way to read XML-based files like this that accounts for partition boundaries and the like? Thanks, Nathan -- Nathan Kronenfeld Senior Visualization Developer Oculus Info Inc 2 Berkeley

Re: issue with Scala, Spark and Akka

2014-05-20 Thread Gerard Maas
That error message says it can't find the config for the Akka subsystem, which is typically included in the Spark assembly. First, you need to build your Spark distro by running sbt/sbt assembly in the SPARK_HOME dir. Then use SPARK_HOME (through env or configuration) to point to your

Re: filling missing values in a sequence

2014-05-20 Thread Mohit Jaggi
Xiangrui, Thanks for the pointer. I think it should work... for now I cooked up my own, which is similar but built on top of the Spark core APIs. I would suggest moving the sliding window RDD to the core Spark library. It seems quite general to me, and a cursory look at the code indicates nothing specific to

Re: life if an executor

2014-05-20 Thread Aaron Davidson
One issue is that new jars can be added during the lifetime of a SparkContext, which can mean after executors are already started. Off-heap storage is always serialized, correct. On Tue, May 20, 2014 at 6:48 AM, Koert Kuipers ko...@tresata.com wrote: just for my clarification: off heap cannot

Re: reading large XML files

2014-05-20 Thread Xiangrui Meng
Try sc.wholeTextFiles(). It reads the entire file into a string record. -Xiangrui On Tue, May 20, 2014 at 8:25 AM, Nathan Kronenfeld nkronenf...@oculusinfo.com wrote: We are trying to read some large GraphML files to use in spark. Is there an easy way to read XML-based files like this that
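
For illustration, a minimal sketch of that call (the path is hypothetical; each record is a whole file, so this suits many moderately sized files rather than one giant one; assumes a shell SparkContext sc):

    // RDD of (filePath, fileContents), one record per file
    val files = sc.wholeTextFiles("hdfs:///data/graphml/")
    files.mapValues(_.length).collect()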

Evaluating Spark just for Cluster Computing

2014-05-20 Thread pcutil
Hi - We have a use case for batch processing for which we are trying to figure out if Apache Spark would be a good fit or not. We have a universe of identifiers sitting in an RDBMS; for each we need to fetch input data from the RDBMS and then pass that input to analytical models that generate some

Re: reading large XML files

2014-05-20 Thread Nathan Kronenfeld
Unfortunately, I don't have a bunch of moderately big xml files; I have one, really big file - big enough that reading it into memory as a single string is not feasible. On Tue, May 20, 2014 at 1:24 PM, Xiangrui Meng men...@gmail.com wrote: Try sc.wholeTextFiles(). It reads the entire file

Spark and Hadoop

2014-05-20 Thread pcutil
I'm a first-time user and need to try just the hello-world kind of program in Spark. Now, on the downloads page, I see the following 3 options for pre-built packages that I can download: - Hadoop 1 (HDP1, CDH3) - CDH4 - Hadoop 2 (HDP2, CDH5) I'm confused about which one I need to download. I need to try

Re: Spark and Hadoop

2014-05-20 Thread Andrew Ash
Hi Puneet, If you're not going to read/write data in HDFS from your Spark cluster, then it doesn't matter which one you download. Just go with Hadoop 2, as it uses the newer APIs and is more likely to connect to an HDFS cluster in the future if you ever do decide to use HDFS. Cheers, Andrew

Re: Spark and Hadoop

2014-05-20 Thread Andras Barjak
You can download any of them, I would go with the latest versions, or just download the source and build it yourself. For experimenting with basic things you can just launch the REPL and start right away in spark local mode not using any hadoop stuff. 2014-05-20 19:43 GMT+02:00 pcutil

Re: Local Dev Env with Mesos + Spark Streaming on Docker: Can't submit jobs.

2014-05-20 Thread Jacob Eisinger
Howdy Gerard, Yeah, the Docker link feature seems to work well for client-server interaction, but peer-to-peer architectures need more for service discovery. As for your addressing requirements, I don't completely understand what you are asking for... you may also want to check out xip.io.

Re: Yarn configuration file doesn't work when run with yarn-client mode

2014-05-20 Thread Arun Ahuja
I was actually able to get this to work. I was NOT setting the classpath properly originally. Simply running java -cp /etc/hadoop/conf/:yarn, hadoop jars com.domain.JobClass and setting yarn-client as the spark master worked for me. Originally I had not put the configuration on the classpath.

Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Arun Ahuja
Hi Matei, Unfortunately, I don't have more detailed information, but we have seen the loss of workers in standalone mode as well. If a job is killed through CTRL-C, we will often see the number of workers and cores decrease on the Spark Master page. They are still alive and well in the Cloudera

Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Aaron Davidson
I'd just like to point out that, along with Matei, I have not seen workers drop even under the most exotic job failures. We're running pretty close to master, though; perhaps it is related to an uncaught exception in the Worker from a prior version of Spark. On Tue, May 20, 2014 at 11:36 AM,

Re: Yarn configuration file doesn't work when run with yarn-client mode

2014-05-20 Thread Andrew Or
I'm assuming you're running Spark 0.9.x, because in the latest version of Spark you shouldn't have to add the HADOOP_CONF_DIR to the java class path manually. I tested this out on my own YARN cluster and was able to confirm that. In Spark 1.0, SPARK_MEM is deprecated and should not be used.
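
For reference, the Spark 1.0 replacement for SPARK_MEM at the application level is the spark.executor.memory property; a hedged sketch of setting it in code (app name and value are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    // Per-application executor memory, replacing the deprecated SPARK_MEM
    val conf = new SparkConf()
      .setAppName("MyApp")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)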

Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Matei Zaharia
Are you guys both using Cloudera Manager? Maybe there’s also an issue with the integration with that. Matei On May 20, 2014, at 11:44 AM, Aaron Davidson ilike...@gmail.com wrote: I'd just like to point out that, along with Matei, I have not seen workers drop even under the most exotic job

Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Josh Marcus
We're using spark 0.9.0, and we're using it out of the box -- not using Cloudera Manager or anything similar. There are warnings from the master that there continue to be heartbeats from the unregistered workers. I will see if there are particular telltale errors on the worker side. We've had

Re: life if an executor

2014-05-20 Thread Koert Kuipers
interesting, so it sounds to me like spark is forced to choose between the ability to add jars during its lifetime and the ability to run tasks with the user classpath first (which is important for the ability to run jobs on spark clusters not under your control, and so for the viability of 3rd party spark apps)

Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Josh Marcus
So, for example, I have two disassociated worker machines at the moment. The last messages in the spark logs are akka association error messages, like the following: 14/05/20 01:22:54 ERROR EndpointWriter: AssociationError [akka.tcp://sparkwor...@hdn3.int.meetup.com:50038] - [akka.tcp://

Spark Streaming using Flume body size limitation

2014-05-20 Thread lemieud
Hi, I am trying to send events to Spark Streaming via Flume. It's working fine up to a certain point: I have problems when the size of the body is over 1020 characters. Basically, up to 1020 it works. From 1021 through 1024, the event will be accepted and there is no exception, but the channel seems

Re: Yarn configuration file doesn't work when run with yarn-client mode

2014-05-20 Thread Arun Ahuja
Yes, we are on Spark 0.9.0 so that explains the first piece, thanks! Also, yes, I meant SPARK_WORKER_MEMORY. Thanks for the hierarchy. Similarly is there some best practice on setting SPARK_WORKER_INSTANCES and spark.default.parallelism? Thanks, Arun On Tue, May 20, 2014 at 3:04 PM, Andrew Or

Re: reading large XML files

2014-05-20 Thread Nathan Kronenfeld
Thanks, that sounds perfect On Tue, May 20, 2014 at 1:38 PM, Xiangrui Meng men...@gmail.com wrote: You can search for XMLInputFormat on Google. There are some implementations that allow you to specify the tag to split on, e.g.:
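
To make that concrete, here is a hedged sketch using one such implementation -- Mahout's XmlInputFormat, which splits on a configurable start/end tag so records never straddle partition boundaries (the class path, tag, and file path are from memory and illustrative; any XMLInputFormat with the same contract works the same way; assumes a shell SparkContext sc):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.mahout.classifier.bayes.XmlInputFormat

    // Each record will be one complete <node>...</node> element,
    // regardless of where HDFS block boundaries fall.
    val conf = new Configuration()
    conf.set("xmlinput.start", "<node>")
    conf.set("xmlinput.end", "</node>")
    val records = sc.newAPIHadoopFile(
      "hdfs:///data/big.graphml",
      classOf[XmlInputFormat], classOf[LongWritable], classOf[Text], conf)
    val xmlStrings = records.map { case (_, text) => text.toString }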

java.lang.NoClassDefFoundError: org/apache/hadoop/io/Writable

2014-05-20 Thread pcutil
This is the first time I'm trying Spark. I just downloaded it and am trying the SimpleApp Java program using Maven. I added 2 Maven dependencies -- spark-core and scala-library. Even though my program is in Java, I was forced to add the Scala dependency. Is that really required? Now, I'm able
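
A NoClassDefFoundError for org/apache/hadoop/io/Writable usually means the Hadoop client classes are missing from the classpath. As a hedged pointer only (versions here are illustrative), the typical Maven coordinates are:

    org.apache.spark : spark-core_2.10 : 0.9.1    # pulls in scala-library transitively
    org.apache.hadoop : hadoop-client : 2.4.0     # provides org.apache.hadoop.io.Writable

With spark-core declared, scala-library normally does not need to be listed explicitly.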

Re: Evaluating Spark just for Cluster Computing

2014-05-20 Thread Sean Owen
My $0.02: If you are simply reading input records, running a model, and outputting the result, then it's a simple map-only problem and you're mostly looking for a process to baby-sit these operations. Lots of things work -- Spark, M/R (+ Crunch), Hadoop Streaming, etc. I'd choose whatever is

Re: Spark stalling during shuffle (maybe a memory issue)

2014-05-20 Thread Andrew Ash
If the distribution of the keys in your groupByKey is skewed (some keys appear way more often than others) you should consider modifying your job to use reduceByKey instead wherever possible. On May 20, 2014 12:53 PM, Jon Keebler jkeeble...@gmail.com wrote: So we upped the spark.akka.frameSize
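
As an illustration of the substitution (a hedged sketch with made-up data, assuming a shell SparkContext sc): reduceByKey pre-aggregates within each partition before the shuffle, so far less data crosses the network when keys are skewed.

    val pairs = sc.parallelize(Seq(("a", 1), ("a", 1), ("b", 1)))

    // groupByKey ships every value for a key to one reducer before summing:
    val viaGroup = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey combines map-side first, then merges the partial sums:
    val viaReduce = pairs.reduceByKey(_ + _)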

Imports that need to be specified in a Spark application jar?

2014-05-20 Thread Shivani Rao
Hello All, I am learning that there are certain imports done automatically by the Spark REPL (which is used to invoke and run code in a Spark shell) that I would have to add explicitly if I need the same functionality in a Spark jar run from the command line. I am repeatedly getting a serialization error of an
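
For reference, a hedged sketch of the imports a standalone jar typically needs to declare that the REPL performs for you; in particular, SparkContext._ brings in the implicit conversions that enable pair-RDD operations in this era of Spark:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._  // implicits, e.g. RDD[(K, V)] => PairRDDFunctions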

Re: Setting queue for spark job on yarn

2014-05-20 Thread Sandy Ryza
Hi Ron, What version are you using? For 0.9, you need to set it outside your code with the SPARK_YARN_QUEUE environment variable. -Sandy On Mon, May 19, 2014 at 9:29 PM, Ron Gonzalez zlgonza...@yahoo.com wrote: Hi, How does one submit a spark job to yarn and specify a queue? The code
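
For example, something like the following before launching (the queue name is hypothetical):

    export SPARK_YARN_QUEUE=research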

Re: Spark stalling during shuffle (maybe a memory issue)

2014-05-20 Thread jonathan.keebler
Thanks for the suggestion, Andrew. We have also implemented our solution using reduceByKey, but we observe the same behavior. For example, if we do the following: map1 -> groupByKey -> map2 -> saveAsTextFile, then the stalling will occur during the map1 + groupByKey execution. If we do map1 -> reduceByKey -> map2

Re: Spark Streaming and Shark | Streaming Taking All CPUs

2014-05-20 Thread anishs...@yahoo.co.in
Thanks Mayur, it is working :) -- Anish Sneh http://in.linkedin.com/in/anishsneh

Re: Spark stalling during shuffle (maybe a memory issue)

2014-05-20 Thread Aaron Davidson
So the current stalling is simply sitting there with no log output? Have you jstack'd an Executor to see where it may be hanging? Are you observing memory or disk pressure (df and df -i)? On Tue, May 20, 2014 at 2:03 PM, jonathan.keebler jkeeble...@gmail.com wrote: Thanks for the suggestion,

Spark Performace Comparison Spark on YARN vs Spark Standalone

2014-05-20 Thread anishs...@yahoo.co.in
Hi All, I need to analyse the performance of Spark on YARN vs Spark Standalone. Please suggest if there are any pre-published comparison statistics available. TIA -- Anish Sneh http://in.linkedin.com/in/anishsneh

Re: facebook data mining with Spark

2014-05-20 Thread Michael Cutler
Hello Joe, The first step is acquiring some data, either through the Facebook API (https://developers.facebook.com/) or a third-party service like Datasift (https://datasift.com/, paid). Once you've acquired some data, and got it somewhere Spark can access it (like HDFS), you can then load and

Python, Spark and HBase

2014-05-20 Thread twizansk
Hello, This seems like a basic question but I have been unable to find an answer in the archives or other online sources. I would like to know if there is any way to load an RDD from HBase in Python. In Java/Scala I can do this by initializing a NewAPIHadoopRDD with a TableInputFormat class. Is
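
For reference, the Java/Scala route the poster mentions looks roughly like this -- a hedged sketch, with the table name hypothetical and a shell SparkContext sc assumed:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat

    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")  // hypothetical table

    // Each record is (row key, full Result for that row)
    val hbaseRDD = sc.newAPIHadoopRDD(hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])
    hbaseRDD.count()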

Using Spark to analyze complex JSON

2014-05-20 Thread Nick Chammas
The Apache Drill (http://incubator.apache.org/drill/) home page has an interesting heading: "Liberate Nested Data". Is there any current or planned functionality in Spark SQL or Shark to enable SQL-like querying of complex JSON? Nick

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-20 Thread Nicholas Chammas
Any tips on how to troubleshoot this? On Thu, May 15, 2014 at 4:15 PM, Nick Chammas nicholas.cham...@gmail.com wrote: I’m trying to do a simple count() on a large number of GZipped files in S3. My job is failing with the following message: 14/05/15 19:12:37 WARN scheduler.TaskSetManager:

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-20 Thread Madhu
I have read gzip files from S3 successfully. It sounds like a file is corrupt or not a valid gzip file. Does it work with fewer gzip files? How are you reading the files? - Madhu https://www.linkedin.com/in/msiddalingaiah

How to Unsubscribe from the Spark user list

2014-05-20 Thread Nick Chammas
Send an email to this address to unsubscribe from the Spark user list: user-unsubscr...@spark.apache.org Sending an email to the Spark user list itself (i.e. this list) *does not do anything*, even if you put unsubscribe as the subject. We will all just see your email. Nick

Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Aaron Davidson
Unfortunately, those errors are actually due to an Executor that exited, such that the connection between the Worker and Executor failed. This is not a fatal issue, unless there are analogous messages from the Worker to the Master (which should be present, if they exist, at around the same point

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-20 Thread Nicholas Chammas
Yes, it does work with fewer GZipped files. I am reading the files in using sc.textFile() and a pattern string. For example: a = sc.textFile('s3n://bucket/2014-??-??/*.gz') a.count() Nick On Tue, May 20, 2014 at 10:09 PM, Madhu ma...@madhu.com wrote: I have read gzip files from S3
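
One hedged way to find which file is corrupt is to count each matched file individually, so the failing path surfaces on its own. A sketch in Scala (paths are illustrative; assumes a shell SparkContext sc):

    // A corrupt gzip fails its own count() instead of killing one big job.
    val paths = Seq("s3n://bucket/2014-05-12/part-1.gz",
                    "s3n://bucket/2014-05-13/part-1.gz")  // hypothetical
    paths.foreach { p =>
      try println(s"$p: ${sc.textFile(p).count()} lines")
      catch { case e: Exception => println(s"$p FAILED: ${e.getMessage}") }
    }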

IllegalStateException when creating Job from shell

2014-05-20 Thread Alex Holmes
Hi, I'm trying to work with Spark from the shell and create a Hadoop Job instance. I get the exception you see below because Job.toString doesn't like to be called until the job has been submitted. I tried using the :silent command but that didn't seem to have any impact. scala> import

Re: Python, Spark and HBase

2014-05-20 Thread Matei Zaharia
Unfortunately this is not yet possible. There’s a patch in progress posted here though: https://github.com/apache/spark/pull/455 — it would be great to get your feedback on it. Matei On May 20, 2014, at 4:21 PM, twizansk twiza...@gmail.com wrote: Hello, This seems like a basic question

Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Josh Marcus
Aaron: I see this in the Master's logs: 14/05/20 01:17:37 INFO Master: Attempted to re-register worker at same address: akka.tcp://sparkwor...@hdn3.int.meetup.com:50038 14/05/20 01:17:37 WARN Master: Got heartbeat from unregistered worker worker-20140520011737-hdn3.int.meetup.com-50038 There

any way to control memory usage when the streaming input arrives faster than spark streaming can process it?

2014-05-20 Thread Francis . Hu
Sparkers, Is there a better way to control memory usage when the streaming input arrives faster than Spark Streaming can process it? Thanks, Francis.Hu