Re: ALS.trainImplicit running out of mem when using higher rank

2015-01-18 Thread Raghavendra Pandey
If you are running Spark in local mode, executor parameters are not used, since there is no separate executor process. You should set the corresponding driver parameter instead. On Mon, Jan 19, 2015, 00:21 Sean Owen so...@cloudera.com wrote: OK. Are you sure the executor has the memory you think? -Xmx24g in
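
A minimal sketch of that advice, assuming a Spark 1.2-era Scala application (the app name and the ALS call are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    object LocalAlsApp {
      def main(args: Array[String]): Unit = {
        // In local mode the "executor" runs inside the driver JVM, so
        // spark.executor.memory is ignored; the heap must go to the driver,
        // and it must be set before the JVM starts, e.g.
        //   spark-submit --driver-memory 24g ...  or spark-defaults.conf.
        // Setting spark.driver.memory here in SparkConf is too late, because
        // this code runs after the driver JVM is already up.
        val conf = new SparkConf().setMaster("local[*]").setAppName("LocalAlsApp")
        val sc = new SparkContext(conf)
        // ... ALS.trainImplicit(ratings, rank, iterations) would run here ...
        sc.stop()
      }
    }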

Re: running a job on ec2 Initial job has not accepted any resources

2015-01-18 Thread Akhil Das
Just make sure both versions of Spark are the same (the one from which you are submitting the job, and the one to which you are submitting it). Another possible reason is firewall issues if you are submitting the job from another network/remote machine. Thanks Best Regards On Sun, Jan 18, 2015 at

Re: Join DStream With Other Datasets

2015-01-18 Thread Ji ZHANG
Hi Sean, Thanks for your advice, a normal 'val' will suffice. But will it be serialized and transferred every batch and every partition? That's why broadcast exists, right? For now I'm going to use 'val', but I'm still looking for a broadcast-way solution. On Sun, Jan 18, 2015 at 5:36 PM, Sean

Re: maven doesn't build dependencies with Scala 2.11

2015-01-18 Thread Ted Yu
bq. there was no 2.11 Kafka available That's right. Adding external/kafka module resulted in: [ERROR] Failed to execute goal on project spark-streaming-kafka_2.11: Could not resolve dependencies for project org.apache.spark:spark-streaming-kafka_2.11:jar:1.3.0-SNAPSHOT: Could not find artifact

RE: Streaming with Java: Expected ReduceByWindow to Return JavaDStream

2015-01-18 Thread Shao, Saisai
Hi Jeff, From my understanding it seems more like a bug: since JavaDStreamLike is used from Java code, returning a Scala DStream is not reasonable. You can fix this by submitting a PR, or I can help you fix it. Thanks Jerry From: Jeff Nadler [mailto:jnad...@srcginc.com] Sent: Monday, January

RE: SparkSQL 1.2.0 sources API error

2015-01-18 Thread Cheng, Hao
It seems the netty jar in use has an incompatible method signature. Can you check whether there are different versions of the netty jar in your classpath? From: Walrus theCat [mailto:walrusthe...@gmail.com] Sent: Sunday, January 18, 2015 3:37 PM To: user@spark.apache.org Subject: Re: SparkSQL 1.2.0 sources

Re: SparkSQL 1.2.0 sources API error

2015-01-18 Thread Ted Yu
NioWorkerPool(Executor workerExecutor, int workerCount) was added in netty 3.5.4 https://github.com/netty/netty/blob/netty-3.5.4.Final/src/main/java/org/jboss/netty/channel/socket/nio/NioWorkerPool.java If there is a netty jar in the classpath older than the above release, you would see the
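
A quick classpath check along those lines (a sketch to paste into spark-shell; it prints which jar NioWorkerPool was actually loaded from):

    val nettySource = Class.forName("org.jboss.netty.channel.socket.nio.NioWorkerPool")
      .getProtectionDomain.getCodeSource.getLocation
    println(nettySource)  // an older-than-3.5.4 netty jar here would explain the error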

Re: Recent Git Builds Application WebUI Problem and Exception Stating Log directory /tmp/spark-events does not exist.

2015-01-18 Thread Josh Rosen
This looks like a bug in the master branch of Spark, related to some recent changes to EventLoggingListener. You can reproduce this bug on a fresh Spark checkout by running ./bin/spark-shell --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=/tmp/nonexistent-dir where
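
A minimal workaround sketch until the bug is fixed: create the event-log directory before the SparkContext starts (the path and app name are placeholders):

    import java.io.File
    import org.apache.spark.{SparkConf, SparkContext}

    val eventLogDir = "/tmp/spark-events"
    new File(eventLogDir).mkdirs()          // make sure the directory exists up front
    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("EventLogCheck")
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", eventLogDir)
    val sc = new SparkContext(conf)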

Re: No Output

2015-01-18 Thread Deep Pradhan
The error in the log file says: *java.lang.OutOfMemoryError: GC overhead limit exceeded* for a certain task ID, and the error repeats for further task IDs. What could be the problem? On Sun, Jan 18, 2015 at 2:45 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Updating the Spark version means

Re: running a job on ec2 Initial job has not accepted any resources

2015-01-18 Thread Grzegorz Dubicki
Hi mehrdad, I seem to have the same issue as you wrote about here. Did you manage to resolve it? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/running-a-job-on-ec2-Initial-job-has-not-accepted-any-resources-tp20607p21218.html Sent from the Apache Spark

Re: Avoid broadcasting huge variables

2015-01-18 Thread octavian.ganea
The singleton hack works very differently in Spark 1.2.0 (it does not work if there are multiple map-reduce jobs in the same program). I guess there should be official documentation on how to have each machine/node do an init step locally before executing any other instructions (e.g.

RE: Directory / File Reading Patterns

2015-01-18 Thread Bob Tiernay
Also, I used the following pattern to extract information from a file path and add it to the output of a transformation: https://gist.github.com/btiernay/1ad5e3dea08904fe07d9 You may find it useful as well. Cheers, Bob From: btier...@hotmail.com To: so...@cloudera.com;

Re: Reducer memory exceeded

2015-01-18 Thread Sean Owen
I think the problem is that you have a single object that is larger than 2GB and so fails to serialize to a byte array. I think it is best not to design it this way, as you can't parallelize combining the maps. You could go all the way to emitting key-value pairs and using reduceByKey. There are solutions
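
A minimal sketch of the reduceByKey alternative; trainingData and localGradients are hypothetical stand-ins for the original job's names:

    import org.apache.spark.rdd.RDD

    // Each record contributes (key, value) pairs instead of a whole Map per
    // partition, so no single combined map ever has to be serialized and the
    // ~2 GB byte-array limit on one object is avoided.
    def sumWeights(trainingData: RDD[String],
                   localGradients: String => Seq[(Int, Double)]): Map[Int, Double] = {
      val pairs: RDD[(Int, Double)] = trainingData.flatMap(localGradients)
      val summed = pairs.reduceByKey(_ + _)
      // Collect only if the keyed result is itself small enough for the driver.
      summed.collectAsMap().toMap
    }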

RE: Directory / File Reading Patterns

2015-01-18 Thread Bob Tiernay
You may also want to keep an eye on SPARK-5182 / SPARK-5302 which may help if you are using Spark SQL. It should be noted that this is possible with HiveContext today. Cheers, Bob Date: Sun, 18 Jan 2015 08:47:06 + Subject: Re: Directory / File Reading Patterns From: so...@cloudera.com

Reducer memory exceeded

2015-01-18 Thread octavian.ganea
Hi, Please help me with this problem. I would really appreciate your help! I am using Spark 1.2.0. I have a map-reduce job written in Spark in the following way: val sumW = splittedTrainingDataRDD.map(localTrainingData => LocalSGD(w, localTrainingData, numeratorCtEta, numitorCtEta, regularizer,

ExceptionInInitializerError when using a class defined in REPL

2015-01-18 Thread Kevin (Sangwoo) Kim
Hi experts, I'm getting ExceptionInInitializerError when using a class defined in the REPL. The code is something like this: case class TEST(a: String) sc.textFile(~~~).map(TEST(_)).count The code above used to work well until yesterday, but suddenly, for some reason, it doesn't work and throws the error.

Trying to find where Spark persists RDDs when run with YARN

2015-01-18 Thread Hemanth Yamijala
Hi, I am trying to find where Spark persists RDDs when we call the persist() API while running under YARN. This is purely for understanding... In my driver program, I wait indefinitely, so as to avoid any cleanup problems. In the actual job, I roughly do the following: JavaRDD<String> lines =

Re: No Output

2015-01-18 Thread Deep Pradhan
Updating the Spark version means setting up the entire cluster once more? Or can we update it in some other way? On Sat, Jan 17, 2015 at 3:22 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Can you paste the code? Also you can try updating your spark version. Thanks Best Regards On Sat,

How to create distributed matrices from Hive tables.

2015-01-18 Thread guxiaobo1982
Hi, We have large datasets in the data format for Spark MLlib matrices, but they are pre-computed by Hive and stored inside Hive. My question is: can we create a distributed matrix such as IndexedRowMatrix directly from Hive tables, avoiding reading data from Hive tables and feeding them into an
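
No built-in bridge is mentioned in the thread, but a sketch along these lines should work, assuming a hypothetical Hive table features(id BIGINT, vals ARRAY<DOUBLE>); the rows stay distributed throughout:

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

    def matrixFromHive(sc: SparkContext): IndexedRowMatrix = {
      val hive = new HiveContext(sc)
      // Map each Hive row to an IndexedRow without collecting to the driver.
      val rows = hive.sql("SELECT id, vals FROM features").map { row =>
        IndexedRow(row.getLong(0), Vectors.dense(row(1).asInstanceOf[Seq[Double]].toArray))
      }
      new IndexedRowMatrix(rows)
    }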

Re: How to share a NonSerializable variable among tasks in the same worker node?

2015-01-18 Thread octavian.ganea
The singleton hack works very differently in Spark 1.2.0 (it does not work if there are multiple map-reduce jobs in the same program). I guess there should be official documentation on how to have each machine/node do an init step locally before executing any other instructions (e.g.

Re: Avoid broadcasting huge variables

2015-01-18 Thread Sean Owen
Why do you say it does not work? The singleton pattern works the same as ever. It is not a pattern that involves Spark. On Jan 18, 2015 12:57 PM, octavian.ganea octavian.ga...@inf.ethz.ch wrote: The singleton hack works very different in spark 1.2.0 (it does not work if the program has multiple
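
For reference, a minimal sketch of that singleton pattern (the file path and loadDictionary are hypothetical); the lazy val is initialized at most once per executor JVM and then reused by every task and every job in the same program:

    object NodeLocalState {
      // Built lazily on first access inside whichever JVM touches it.
      lazy val dictionary: Map[String, Int] = loadDictionary()

      private def loadDictionary(): Map[String, Int] =
        scala.io.Source.fromFile("/tmp/dictionary.txt").getLines().zipWithIndex.toMap
    }
    // Used inside any transformation, e.g.:
    //   rdd.map(word => NodeLocalState.dictionary.getOrElse(word, -1))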

Re: Join DStream With Other Datasets

2015-01-18 Thread Sean Owen
I think that this problem is not Spark-specific, since you are simply side-loading some data into memory. Therefore you do not need an answer that uses Spark. Simply load the data and then poll for an update each time it is accessed? Or at some reasonable interval? This is just something you write in
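
A minimal sketch of that load-and-poll idea, with a hypothetical file path and a 60-second refresh interval; nothing about it is Spark-specific:

    object SpamWords {
      private val refreshMs = 60 * 1000L
      @volatile private var loadedAt = 0L
      @volatile private var words: Set[String] = Set.empty

      // Reloads the file at most once per refresh interval, in whichever JVM calls it.
      def get(): Set[String] = synchronized {
        val now = System.currentTimeMillis()
        if (now - loadedAt > refreshMs) {
          words = scala.io.Source.fromFile("/tmp/spam-words.txt").getLines().toSet
          loadedAt = now
        }
        words
      }
    }
    // e.g. stream.filter(word => !SpamWords.get().contains(word))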

Re: Join DStream With Other Datasets

2015-01-18 Thread Ji ZHANG
Hi, After some experiments, there are three methods that work for this 'join a DStream with another dataset which is updated periodically' problem. 1. Create an RDD in a transform operation: val words = ssc.socketTextStream(localhost, ).flatMap(_.split(_)) val filtered = words transform { rdd => val spam =
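
A minimal sketch of method 1, with the elided details filled in by hypothetical values (port 9999, a spam-word file on local disk); the file is re-read on every batch, so updates are picked up automatically:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setMaster("local[2]").setAppName("JoinWithSideData")
    val ssc = new StreamingContext(conf, Seconds(1))

    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
    val filtered = words.transform { rdd =>
      // Reload the side dataset on every batch; cheap as long as the file is small.
      val spam = rdd.sparkContext.textFile("/tmp/spam-words.txt").collect().toSet
      rdd.filter(word => !spam.contains(word))
    }
    filtered.print()

    ssc.start()
    ssc.awaitTermination()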

Re: Directory / File Reading Patterns

2015-01-18 Thread Sean Owen
I think that putting part of the data (only) in a filename is an anti-pattern, but we sometimes have to play them where they lie. You can list all the directory paths containing the CSV files, map each to an RDD with textFile, transform the RDDs to include info from the path, and then simply
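
A minimal sketch of that pattern, assuming a hypothetical layout of one directory per date (e.g. /data/2015-01-18/*.csv) where the date exists only in the path:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    def readWithPathInfo(sc: SparkContext, dirs: Seq[String]): RDD[(String, String)] = {
      val perDir = dirs.map { dir =>
        val tag = dir.split("/").last              // the piece of data recovered from the path
        sc.textFile(dir + "/*.csv").map(line => (tag, line))
      }
      sc.union(perDir)                             // back to one logical dataset
    }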

Re: No Output

2015-01-18 Thread Akhil Das
You can try increasing the parallelism. Can you be more specific about the task that you are doing? Maybe pasting the piece of code would help. On 18 Jan 2015 13:22, Deep Pradhan pradhandeep1...@gmail.com wrote: The error in the log file says: *java.lang.OutOfMemoryError: GC overhead limit
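
A minimal sketch of the parallelism suggestion (sc comes from the shell; the path and partition counts are hypothetical): more, smaller partitions mean less data per task and less GC pressure.

    val data = sc.textFile("hdfs:///input", 200)   // ask for more input splits up front
    val reshaped = data.repartition(400)           // or re-split an existing RDD
    reshaped.count()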

Re: ALS.trainImplicit running out of mem when using higher rank

2015-01-18 Thread Sean Owen
OK. Are you sure the executor has the memory you think? -Xmx24g in its command line? It may be that for some reason your job is reserving an exceptionally large amount of non-heap memory. I am not sure that's to be expected with the ALS job though. Even if the settings work, consider using the

Re: Maven out of memory error

2015-01-18 Thread Sean Owen
Oh: are you running the tests with a different profile setting than what the last assembly was built with? This particular test depends on those matching. Not 100% sure that's the problem, but a good guess. On Sat, Jan 17, 2015 at 4:54 PM, Ted Yu yuzhih...@gmail.com wrote: The test passed here:

Re: maven doesn't build dependencies with Scala 2.11

2015-01-18 Thread Sean Owen
I could be wrong, but I thought this was on purpose. At the time it was set up, there was no 2.11 Kafka available? Or one of its dependencies wouldn't work with 2.11? But I'm not sure what the OP means by 'maven doesn't build Spark's dependencies', because Ted indicates it does, and of course you

Re: Cluster hangs in 'ssh-ready' state using Spark 1.2 EC2 launch script

2015-01-18 Thread Nicholas Chammas
Nathan, I posted a bunch of questions for you as a comment on your question http://stackoverflow.com/q/28002443/877069 on Stack Overflow. If you answer them (don't forget to @ping me) I may be able to help you. Nick On Sat Jan 17 2015 at 3:49:54 PM gen tang gen.tan...@gmail.com wrote: Hi,

Re: Spark Streaming with Kafka

2015-01-18 Thread Eduardo Alfaia
I have the same issue. - Original message - From: Rasika Pohankar rasikapohan...@gmail.com Sent: 18/01/2015 18:48 To: user@spark.apache.org user@spark.apache.org Subject: Spark Streaming with Kafka I am using Spark Streaming to process data received through Kafka. The Spark

Re: Trying to find where Spark persists RDDs when run with YARN

2015-01-18 Thread Sean Owen
These will be under the working directory of the YARN container running the executor. I don't have it handy but think it will also be a spark-local or similar directory. On Sun, Jan 18, 2015 at 2:50 PM, Hemanth Yamijala yhema...@gmail.com wrote: Hi, I am trying to find where Spark persists
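
One way to watch this on a node (a sketch; sc comes from a running spark-shell on YARN, the input path is hypothetical, and the exact directory names vary by version): persist to disk, then look under the executor containers' local directories for the spark-local-* block files.

    import org.apache.spark.storage.StorageLevel

    val lines = sc.textFile("hdfs:///some/input")
    lines.persist(StorageLevel.DISK_ONLY)
    lines.count()   // forces the blocks onto the executors' local disks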

Re: Row similarities

2015-01-18 Thread Pat Ferrel
Right, done with matrix blocks. Seems like a lot of duplicate effort, but that’s the way of OSS sometimes. I didn’t see transpose in the Jira. Are there plans for transpose and rowSimilarity without transpose? The latter seems easier than columnSimilarity in the general/naive case. Thresholds

Spark Streaming with Kafka

2015-01-18 Thread Rasika Pohankar
I am using Spark Streaming to process data received through Kafka. The Spark version is 1.2.0. I have written the code in Java and am compiling it using sbt. The program runs and receives data from Kafka and processes it as well. But it stops receiving data suddenly after some time (it has run for

Recent Git Builds Application WebUI Problem and Exception Stating Log directory /tmp/spark-events does not exist.

2015-01-18 Thread Ganon Pierce
I posted about the Application WebUI error (specifically the application WebUI, not the master WebUI generally) and have spent at least a few hours a day for over a week trying to resolve it, so I’d be very grateful for any suggestions. It is quite troubling that I appear to be the only one

Re: Maven out of memory error

2015-01-18 Thread Ted Yu
Yes. That could be the cause. On Sun, Jan 18, 2015 at 11:47 AM, Sean Owen so...@cloudera.com wrote: Oh: are you running the tests with a different profile setting than what the last assembly was built with? this particular test depends on those matching. Not 100% sure that's the problem, but