Task result in Spark Worker Node

2015-03-29 Thread raggy
I am a PhD student working on a research project related to Apache Spark. I am trying to modify the Spark source code so that, instead of sending the final result RDD from the worker nodes to a master node, the final result RDDs are sent to some different node. In order to do this,

Pregel API Abstraction for GraphX

2015-03-29 Thread Kenny Bastani
Hi all, I have been working hard to make it easier for developers to make community contributions to the Spark GraphX algorithm library. At the core of this I found that the Pregel API is a difficult concept to understand and I think I can help make it better. Can you please review

java.io.FileNotFoundException when using HDFS in cluster mode

2015-03-29 Thread Nick Travers
Hi List, I'm following this example here https://github.com/databricks/learning-spark/tree/master/mini-complete-example with the following: $SPARK_HOME/bin/spark-submit \ --deploy-mode cluster \ --master spark://host.domain.ex:7077 \ --class

Re: Can't run spark-submit with an application jar on a Mesos cluster

2015-03-29 Thread seglo
Thanks for the response. I'll admit I'm rather new to Mesos. Due to the nature of my setup I can't use the Mesos web portal effectively because I'm not connected by VPN, so the local network links on the mesos-master dashboard that I SSH-tunnelled to aren't working. Anyway, I was able to dig up some

Re: Running Spark in Local Mode

2015-03-29 Thread Saisai Shao
Hi, I think for local mode the number N (in local[N]) basically equals the number of available cores in ONE executor (worker), not N workers. You can think of local[N] as one worker with N cores. I'm not sure you can set the memory usage for each thread; for Spark the memory is
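A minimal sketch of the distinction, using the standard SparkConf API (the app name is illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    // local[4]: one JVM acting as a single executor with 4 task threads,
    // not 4 separate workers. Memory is shared by all threads in that JVM;
    // there is no per-thread memory setting -- size the JVM itself instead
    // (e.g. spark-submit --driver-memory 2g).
    val conf = new SparkConf().setMaster("local[4]").setAppName("LocalModeDemo")
    val sc = new SparkContext(conf)
    sc.parallelize(1 to 1000, 4).map(_ * 2).count() // uses up to 4 threads
    sc.stop()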

Re: Untangling dependency issues in spark streaming

2015-03-29 Thread Ted Yu
For Gradle, there are: https://github.com/musketyr/gradle-fatjar-plugin https://github.com/johnrengelman/shadow FYI On Sun, Mar 29, 2015 at 4:29 PM, jay vyas jayunit100.apa...@gmail.com wrote: thanks for posting this! I've run into similar issues before, and generally it's a bad idea to swap

Re: How to specify the port for AM Actor ...

2015-03-29 Thread Shixiong Zhu
LGTM. Could you open a JIRA and send a PR? Thanks. Best Regards, Shixiong Zhu 2015-03-28 7:14 GMT+08:00 Manoj Samel manojsamelt...@gmail.com: I looked at the 1.3.0 code and figured out where this can be added. In org.apache.spark.deploy.yarn, ApplicationMaster.scala:282 is actorSystem =
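For context, a sketch of the kind of change being discussed -- not the actual patch -- where the hard-coded port 0 (pick a random port) is replaced by a configurable property; the key name spark.yarn.am.port is an assumption here:

    // Sketch only: inside ApplicationMaster.scala, read the AM actor port
    // from a config property instead of hard-coding 0 (random port).
    val amPort = sparkConf.getInt("spark.yarn.am.port", 0)
    actorSystem = AkkaUtils.createActorSystem("sparkYarnAM", Utils.localHostName,
      amPort, conf = sparkConf, securityManager = securityMgr)._1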

RE: Port configuration for BlockManagerId

2015-03-29 Thread Manish Gupta 8
Has anyone else faced this issue of running spark-shell (yarn client mode) in an environment with strict firewall rules (on fixed allowed incoming ports)? How can this be rectified? Thanks, Manish From: Manish Gupta 8 Sent: Thursday, March 26, 2015 4:09 PM To: user@spark.apache.org Subject:

Fwd: How does Spark Streaming output messages to Kafka?

2015-03-29 Thread luohui20001
Hi guys, I am using Spark Streaming to receive messages from Kafka, process them, and then send them back to Kafka. However, the Kafka consumer cannot receive any messages. Anyone have ideas to share? Here is my code: object SparkStreamingSampleDirectApproach { def main(args: Array[String]): Unit =
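There is no built-in Kafka output for DStreams in this version, so the usual pattern is to write from foreachRDD/foreachPartition, creating the producer on the executors (it is not serializable). A minimal sketch, assuming `stream` is a DStream[String] and using placeholder broker/topic names:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092") // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        // One producer per partition: create it here, not on the driver.
        val producer = new KafkaProducer[String, String](props)
        records.foreach(r => producer.send(new ProducerRecord[String, String]("out-topic", r)))
        producer.close()
      }
    }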

What is the meaning of 'STATE' in a worker/an executor?

2015-03-29 Thread Niranda Perera
Hi, I have noticed in the Spark UI that workers and executors run in several states: ALIVE, LOADING, RUNNING, DEAD, etc. What exactly do these states mean, and what effect do they have on working with those executors? E.g., can an executor not be used in the LOADING state, etc.? cheers --

Re: What is the meaning of 'STATE' in a worker/an executor?

2015-03-29 Thread Mark Hamstra
A LOADING Executor is on the way to RUNNING, but hasn't yet been registered with the Master, so it isn't quite ready to do useful work. On Mar 29, 2015, at 9:09 PM, Niranda Perera niranda.per...@gmail.com wrote: Hi, I have noticed in the Spark UI that workers and executors run in several

Re: RDD Persistance synchronization

2015-03-29 Thread Harut Martirosyan
Thanks to you again, Sean. The thing is that we persist and count that RDD in the hope that later actions on it won't trigger recalculation; it's not really about performance here, it's because the recalculation involves UUID generation, which should stay the same for further actions. I

Re: input size too large | Performance issues with Spark

2015-03-29 Thread Akhil Das
Go through this once, if you haven't read it already. https://spark.apache.org/docs/latest/tuning.html Thanks Best Regards On Sat, Mar 28, 2015 at 7:33 PM, nsareen nsar...@gmail.com wrote: Hi All, I'm facing performance issues with spark implementation, and was briefly investigating on

Re: Build fails on 1.3 Branch

2015-03-29 Thread Marty Bower
Yes, that worked - thank you very much. On Sun, Mar 29, 2015 at 9:05 AM Ted Yu yuzhih...@gmail.com wrote: Jenkins build failed too: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.3-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/326/consoleFull For the moment, you

Re: Shuffle Read and Write

2015-03-29 Thread Akhil Das
Depends on the task that you are doing. I think what is displayed on the web UI (attached pic) is for your complete application. Thanks Best Regards On Fri, Mar 27, 2015 at 8:28 PM, Laeeq Ahmed laeeqsp...@yahoo.com.invalid wrote: Hi, Is shuffle read and write for the complete streaming

Re: RDD collect hangs on large input data

2015-03-29 Thread Akhil Das
Don't call .collect() if your data size is huge; you can simply do a count() to trigger the execution. Can you paste your exception stack trace so that we'll know what's happening? Thanks Best Regards On Fri, Mar 27, 2015 at 9:18 PM, Zsolt Tóth toth.zsolt@gmail.com wrote: Hi, I have a simple
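The difference in a two-line sketch: count() aggregates on the executors and returns a single number, while collect() ships every element back to the driver and can exhaust its memory on large inputs:

    val n = rdd.count()        // forces computation; only a Long returns to the driver
    // val all = rdd.collect() // pulls the whole RDD into driver memory -- avoid when huge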

Re: Unable to run NetworkWordCount.java

2015-03-29 Thread Akhil Das
You need to set your master to local[2]; 2 is the minimum number of threads (in local mode) needed to consume and process a streaming application. Thanks Best Regards On Sun, Mar 29, 2015 at 12:56 PM, mehak.soni mehak.s...@gmail.com wrote: I am trying to run the NetworkWordCount.java in Spark
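A minimal sketch of the fix, following the standard NetworkWordCount setup (host/port are the example's placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // local[2]: the receiver permanently occupies one thread, so at least
    // one more is needed to process batches; local[1] would starve the job.
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()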

Re: [Spark Streaming] Disk not being cleaned up during runtime after RDD being processed

2015-03-29 Thread Akhil Das
Try these: - Disable shuffle spill: spark.shuffle.spill=false (it might end up in OOM) - Enable log rotation: sparkConf.set("spark.executor.logs.rolling.strategy", "size").set("spark.executor.logs.rolling.size.maxBytes", "1024").set("spark.executor.logs.rolling.maxRetainedFiles", "3") Also see what's really
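The same suggestion restated as a complete, compilable configuration (the values are just the ones from the message; tune to taste):

    import org.apache.spark.SparkConf

    val sparkConf = new SparkConf()
      .set("spark.shuffle.spill", "false")                      // may cause OOM
      .set("spark.executor.logs.rolling.strategy", "size")
      .set("spark.executor.logs.rolling.size.maxBytes", "1024") // per log file
      .set("spark.executor.logs.rolling.maxRetainedFiles", "3") // keep newest 3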

Build fails on 1.3 Branch

2015-03-29 Thread mjhb
I tried pulling the source and building for the first time, but cannot get past the "object NoRelation is not a member of package org.apache.spark.sql.catalyst.plans.logical" error below on the 1.3 branch. I can build the 1.2 branch. I have tried with both -Dscala-2.11 and 2.10 (after running the

Re: [Spark Streaming] Disk not being cleaned up during runtime after RDD being processed

2015-03-29 Thread Ted Yu
Nathan: Please look in the log files for any of the following: doCleanupRDD(): case e: Exception => logError("Error cleaning RDD " + rddId, e) doCleanupShuffle(): case e: Exception => logError("Error cleaning shuffle " + shuffleId, e) doCleanupBroadcast(): case e: Exception =>

Re: Build fails on 1.3 Branch

2015-03-29 Thread Ted Yu
Jenkins build failed too: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.3-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/326/consoleFull For the moment, you can apply the following change: diff --git

Re: Does Spark HiveContext supported with JavaSparkContext?

2015-03-29 Thread Cheng Lian
You may simply pass in JavaSparkContext.sc On 3/29/15 9:25 PM, Vincent He wrote: All, I am trying Spark SQL with Java and find that HiveContext does not accept JavaSparkContext; is this true? Or is there any special build of Spark I need (I built with Hive and the thrift server)? Can we use HiveContext in

Re: Anyone has some simple example with spark-sql with spark 1.3

2015-03-29 Thread Vincent He
No luck, it does not work. Does anyone know whether there is some special setting for the spark-sql CLI so we do not need to write code to use Spark SQL? Does anyone have a simple example of this? I appreciate any help; thanks in advance. On Sat, Mar 28, 2015 at 9:05 AM, Ted Yu yuzhih...@gmail.com wrote: See

Re: Does Spark HiveContext supported with JavaSparkContext?

2015-03-29 Thread Vincent He
Thanks. It does not work and does not compile, as the HiveContext constructor does not accept JavaSparkContext and JavaSparkContext is not a subclass of SparkContext. Anyone else have any idea? I suspect this is not supported now. On Sun, Mar 29, 2015 at 8:54 AM, Cheng Lian lian.cs@gmail.com

Does Spark HiveContext supported with JavaSparkContext?

2015-03-29 Thread Vincent He
All, I am trying Spark SQL with Java and find that HiveContext does not accept JavaSparkContext; is this true? Or is there any special build of Spark I need (I built with Hive and the thrift server)? Can we use HiveContext in Java? Thanks in advance.

Re: Does Spark HiveContext supported with JavaSparkContext?

2015-03-29 Thread Cheng Lian
I mean JavaSparkContext has a field named sc, whose type is SparkContext. You may pass this sc to HiveContext. On 3/29/15 9:59 PM, Vincent He wrote: Thanks. It does not work and does not compile, as the HiveContext constructor does not accept JavaSparkContext and JavaSparkContext is not
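A sketch of the suggestion (in Scala the field is jsc.sc; from Java the equivalent accessor is jsc.sc()):

    import org.apache.spark.SparkConf
    import org.apache.spark.api.java.JavaSparkContext
    import org.apache.spark.sql.hive.HiveContext

    val jsc = new JavaSparkContext(new SparkConf().setAppName("HiveFromJava"))
    // HiveContext wants a SparkContext; JavaSparkContext wraps one and
    // exposes it, so no special build beyond Hive support is needed.
    val hiveContext = new HiveContext(jsc.sc)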

Arguments/parameters in Spark shell scripts?

2015-03-29 Thread Minnow Noir
How does one consume parameters passed to a Scala script via spark-shell -i? 1. If I use an object with a main() method, the println outputs nothing, as if main() were never called: import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf object Test {
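spark-shell -i just replays the file into the REPL: a main() method gets compiled but never invoked, and there is no argv. One common workaround (a sketch, not an official mechanism; MY_INPUT is our own name) is to pass values in through environment variables and read them at the top level of the script:

    // Launch as:  MY_INPUT=/data/foo spark-shell -i script.scala
    // Inside script.scala -- top-level statements run as the file is replayed:
    val input = sys.env.getOrElse("MY_INPUT", "/tmp/default")
    println(s"Processing $input")
    // A main() method is only defined, never called, unless you call it
    // yourself at the end of the script, e.g.: Test.main(Array())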

Re: RDD Persistance synchronization

2015-03-29 Thread Sean Owen
persist() completes immediately since it only marks the RDD for persistence. count() triggers computation of the RDD, and as the RDD is computed it will be persisted. The following transform should therefore only start after count(), and therefore after the persistence completes. I think there might be
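The pattern under discussion, as a sketch (assuming `data` is an existing RDD; UUID generation stands in for any non-deterministic computation):

    import java.util.UUID
    import org.apache.spark.storage.StorageLevel

    // Non-deterministic: recomputation would yield different UUIDs.
    val withIds = data.map(x => (UUID.randomUUID().toString, x))
    withIds.persist(StorageLevel.MEMORY_AND_DISK)
    withIds.count() // action: materializes and caches every partition

    // Later actions reuse the cached blocks (same UUIDs), unless a block is
    // lost and recomputed -- which is why writing the IDs out to stable
    // storage such as HDFS is the only hard guarantee (see below).
    val nonNull = withIds.filter(_._2 != null).count()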

Re: Why KMeans with mllib is so slow ?

2015-03-29 Thread Xi Shen
Hi Burak, Unfortunately, I am expected to do my work in an HDInsight environment, which only supports Spark 1.2.0 with Microsoft's flavor. I cannot simply replace it with Spark 1.3. I think the problem I am observing is caused by the kmeans|| initialization step. I will open another thread to discuss

Re: converting DStream[String] into RDD[String] in spark streaming [I]

2015-03-29 Thread Deenar Toraskar
Sean, thank you very much for your response. I have a requirement to run a function only over the new inputs in a Spark Streaming sliding window, i.e. the latest batch of events only. Do I just get a new DStream using a slide duration equal to the window duration? Such as: val sparkConf = new

RDD Persistance synchronization

2015-03-29 Thread Harut Martirosyan
Hi. rdd.persist(); rdd.count(); rdd.transform()... Is there a chance transform() runs before persist() is complete? -- RGRDZ Harut

Re: FW: converting DStream[String] into RDD[String] in spark streaming [I]

2015-03-29 Thread Sean Owen
You're just describing the operation of Spark Streaming at its simplest, without windowing. You get non-overlapping RDDs of the most recent data each time. On Sun, Mar 29, 2015 at 8:44 AM, Deenar Toraskar deenar.toras...@gmail.com wrote: Sean, thank you very much for your response. I have a
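In other words (a sketch; the batch interval and source are illustrative, and myFunction is a stand-in for the user's per-batch function): each batch of a plain DStream already contains exactly the new data since the last batch, with no overlap:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // With a 30s batch interval, each foreachRDD call sees only the events
    // that arrived in the last 30 seconds -- no window() needed for
    // non-overlapping latest-batch-only processing.
    val ssc = new StreamingContext(sparkConf, Seconds(30))
    val events = ssc.socketTextStream("localhost", 9999)
    events.foreachRDD(rdd => myFunction(rdd))
    ssc.start()
    ssc.awaitTermination()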

Re: Is it possible to use json4s 3.2.11 with Spark 1.3.0?

2015-03-29 Thread Alexey Zinoviev
I figured out that Logging is a DeveloperApi and it should not be used outside Spark code, so everything is fine now. Thanks again, Marcelo. On 24 Mar 2015, at 20:06, Marcelo Vanzin van...@cloudera.com wrote: From the exception it seems like your app is also repackaging Scala classes

Untangling dependency issues in spark streaming

2015-03-29 Thread Neelesh
Hi, My streaming app uses org.apache.httpcomponents:httpclient:4.3.6, but Spark uses 4.2.6, and I believe that's what's causing the following error. I've tried setting spark.executor.userClassPathFirst and spark.driver.userClassPathFirst to true in the config, but that does not solve it either.

Unable to run NetworkWordCount.java

2015-03-29 Thread mehak.soni
I am trying to run NetworkWordCount.java from the Spark Streaming examples. I was able to run it using run-example. I was now trying to run the same code from an app I created. This is the code; it looks pretty much like the existing code: import scala.Tuple2; import

kmeans|| in Spark is not real paralleled?

2015-03-29 Thread Xi Shen
Hi, I have opened a couple of threads asking about a k-means performance problem in Spark. I think I have made a little progress. Previously I used the simplest form, KMeans.train(rdd, k, maxIterations). It uses the kmeans|| initialization algorithm, which is supposedly a faster version of kmeans++, and
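For reference, the initialization mode can be chosen explicitly through the overloaded train method, as a sketch (rdd, k and maxIterations as in the original call); "random" sidesteps the kmeans|| seeding that this thread suspects is the bottleneck, at some cost in seed quality:

    import org.apache.spark.mllib.clustering.KMeans

    // Same call, but with explicit runs (1) and initialization mode.
    val model = KMeans.train(rdd, k, maxIterations, 1, KMeans.RANDOM)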

Re: RDD Persistance synchronization

2015-03-29 Thread Sean Owen
I don't think you can guarantee that there is no recomputation. Even if you persist(), you might lose the block and have to recompute it. You can persist your UUIDs to storage like HDFS; they won't change then, of course. I suppose you still face a much narrower problem: that the act of computing

Re: Can't run spark-submit with an application jar on a Mesos cluster

2015-03-29 Thread Timothy Chen
I left a comment on your stackoverflow earlier. Can you share the output in your stderr log from your Mesos task? It can be found in the Mesos UI by going to the task's sandbox. Tim Sent from my iPhone On Mar 29, 2015, at 12:14 PM, seglo wla...@gmail.com wrote: The latter part of this

Running Spark in Local Mode

2015-03-29 Thread FreePeter
Hi, I am trying to use Spark for my own applications, and I am currently profiling the performance in local mode, and I have a couple of questions: 1. When I set spark.master to local[N], it means Spark will use up to N worker *threads* on the single machine. Is this equivalent to saying there are N

Can't run spark-submit with an application jar on a Mesos cluster

2015-03-29 Thread seglo
Mesosphere did a great job of simplifying the process of running Spark on Mesos. I am using this guide to set up a development Mesos cluster on Google Cloud Compute. https://mesosphere.com/docs/tutorials/run-spark-on-mesos/ I can run the example that's in the guide by using spark-shell (finding

Re: Can't run spark-submit with an application jar on a Mesos cluster

2015-03-29 Thread seglo
The latter part of this question, where I try to submit the application by referring to it on HDFS, is very similar to the recent question "Spark-submit not working when application jar is in hdfs"

Re: Build fails on 1.3 Branch

2015-03-29 Thread Reynold Xin
I pushed a hotfix to the branch. Should work now. On Sun, Mar 29, 2015 at 9:23 AM, Marty Bower sp...@mjhb.com wrote: Yes, that worked - thank you very much. On Sun, Mar 29, 2015 at 9:05 AM Ted Yu yuzhih...@gmail.com wrote: Jenkins build failed too:

Re: Understanding Spark Memory distribution

2015-03-29 Thread Ankur Srivastava
Hi Wisely, I am running on Amazon EC2 instances, so I cannot blame the hardware. Moreover, my other pipelines run successfully except for this one, which involves broadcasting a large object. My spark-env.sh settings are: SPARK_MASTER_IP=MASTER-IP SPARK_LOCAL_IP=LOCAL-IP SPARK_DRIVER_MEMORY=24g

Re: Can't run spark-submit with an application jar on a Mesos cluster

2015-03-29 Thread hbogert
/20150329-232522-84118794-5050-18181-/executors/5/runs/latest/stderr

Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-29 Thread Eran Medan
Hi Sean, I think your point about the ETL costs is the winning argument here, but I would like to see more research on the topic. What I would like to see researched is the ability to run a specialized set of common algorithms in a fast local mode, just like a compiler optimizer can decide to inline

Re: [spark-sql] What is the right way to represent an “Any” type in Spark SQL?

2015-03-29 Thread Eran Medan
Thanks Michael! Can you please point me to the docs / source location for that automatic casting? I'm just using it to extract the data and put it in a Map[String, Any] (long story on the reason...), so I think the casting rules won't know what to cast it to... right? I guess I can have the JSON /

Re: Untangling dependency issues in spark streaming

2015-03-29 Thread jay vyas
thanks for posting this! I've run into similar issues before, and generally it's a bad idea to swap the libraries out and pray for the best, so the shade functionality is probably the best feature. Unfortunately, I'm not sure how well SBT and Gradle support shading... how do folks using next-gen
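For SBT, newer versions of sbt-assembly (0.14+) do support shading via ShadeRule; a sketch of relocating the httpclient packages (the rename pattern and shaded prefix are illustrative):

    // build.sbt -- requires sbt-assembly 0.14+, which added ShadeRule
    assemblyShadeRules in assembly := Seq(
      // Relocate our httpclient 4.3.x so it cannot collide with the 4.2.x
      // classes already on Spark's classpath.
      ShadeRule.rename("org.apache.http.**" -> "shaded.org.apache.http.@1").inAll
    )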

Re: Spark-submit not working when application jar is in hdfs

2015-03-29 Thread dilm
Made it work by using yarn-cluster as master instead of local.