Re: Announcing Calliope releases 0.8.1-GA and 0.9.0-EA

2014-02-05 Thread Rohit Rai
Yes, we do track the issues in GitHub. You can report it in the Calliope EA repo http://github.com/tuplejump/calliope (just send me your GitHub ID or sign up from the Calliope homepage, [Get Early Access link: https://docs.google.com/forms/d/1jFTqKnp_13vTjXwy3Zex58X1JKRsFJLLWNhyZ9mQUDg/viewform] and we

Re: Spark Streaming StreamingContext error

2014-02-05 Thread Tathagata Das
Seems like it is not able to find a particular class: org.apache.spark.metrics.sink.MetricsServlet. How are you running your program? Is this an intermittent error? Does it go away if you do a clean compilation of your project and run again? TD On Tue, Feb 4, 2014 at 9:22 AM, soojin

Re: Message processing rate of spark

2014-02-05 Thread Tathagata Das
Hi Sourav, For the number of records received per second, you could use something like this to calculate the number of records in each batch, and divide it by your batch size. yourKafkaStream.foreachRDD(rdd => { val count = rdd.count; println("Current rate = " + (count / batchSize) + " records / second")
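A minimal, cleaned-up sketch of the pattern above; the stream name kafkaStream and the batch length batchSizeSeconds are illustrative stand-ins for whatever your application uses:

    kafkaStream.foreachRDD { rdd =>
      val count = rdd.count()
      println("Current rate = " + (count / batchSizeSeconds) + " records / second")
    }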

Scheduler delay

2014-02-05 Thread Sourav Chandra
Hi, For some of the stages I am getting a scheduler delay of ~300-400 ms, whereas for some other tasks it is ~100 ms. I am curious to know what factors I should look into when debugging scheduler delay problems. How can I fix this? Thanks, -- Sourav Chandra Senior Software Engineer

Re: Announcing Calliope releases 0.8.1-GA and 0.9.0-EA

2014-02-05 Thread Rohit Rai
To start the discussion on Calliope core (the Spark+Cassandra part of it) becoming a contrib module in Spark (spark-cassandra), I have opened an issue on the Spark JIRA: https://spark-project.atlassian.net/browse/SPARK-1054 Please let us know what you all feel about it... great/good idea/bad idea/doesn't make

Re: Message processing rate of spark

2014-02-05 Thread Sourav Chandra
Hi Tathagata, How can I find the batch size? Thanks, Sourav On Wed, Feb 5, 2014 at 2:02 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Hi Sourav, For the number of records received per second, you could use something like this to calculate the number of records in each batch, and divide it
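For reference, the batch size (batch interval) is whatever duration was passed when the StreamingContext was created; a minimal sketch, with an illustrative master URL and a one-second interval:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // The third argument is the batch interval, i.e. the batch size in time.
    val ssc = new StreamingContext("local[4]", "RateExample", Seconds(1))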

Re: spark streaming questions

2014-02-05 Thread Tathagata Das
Responses inline. On Mon, Feb 3, 2014 at 11:03 AM, Liam Stewart liam.stew...@gmail.comwrote: I'm looking at adding spark / shark to our analytics pipeline and would also like to use spark streaming for some incremental computations, but I have some questions about the suitability of spark

executors in single local mode

2014-02-05 Thread aecc
Hi, I'm trying to execute a streaming application using local[4]; however, I just see one executor in the web UI. Shouldn't there be more? One executor per worker thread? I'm trying to open connections in all the worker nodes to a MySQL database and keep them open until the end of the stream. Do you guys

Spark 0.9.0 support for windows

2014-02-05 Thread goi cto
Hi, Spark 0.8.1 had built-in support for Windows, and its sbt directory contained sbt.bat as well as sbt-launch-0.11.3-2.jar. Spark 0.9.0 does not have these files. - How should we run it on Windows? -- Eran | CTO

Could not find resource path for Web UI: org/apache/spark/ui/static

2014-02-05 Thread zgalic
Hi, I have a problem running examples in IntelliJ IDEA 13.02. Any ideas? Thanks /usr/lib64/jvm/java-1.7.0-openjdk-1.7.0/bin/java -Didea.launcher.port=7534 -Didea.launcher.bin.path=/home/zgalic/development/idea-IC-133.696/bin -Dfile.encoding=UTF-8 -classpath

Re: Could not find resource path for Web UI: org/apache/spark/ui/static

2014-02-05 Thread goi cto
Not sure it is the same, but I recall having the same problem yesterday when I ran it using mvn. It appears that the mvn package did not complete well, since I had to change something in the pom.xml; so when I ran mvn exec:java -Dexec.mainClass=SimpleApp it failed with the same error. HTH Eran

Re: Spark 0.9.0 support for windows

2014-02-05 Thread Stevo Slavić
I'm using it through Git Bash and GitHub's Git Shell. On Wed, Feb 5, 2014 at 3:33 PM, goi cto goi@gmail.com wrote: Hi, Spark 0.8.1 had built in support for windows and sbt directory contains sbt.bat as well as sbt-launch-0.11.3-2.jar. Spark 0.9.0 does not have these files. - How

Re: Spark 0.9.0 support for windows

2014-02-05 Thread Haris Osmanagic
Did you try it with Cygwin? On Wed, Feb 5, 2014 at 4:08 PM, goi cto goi@gmail.com wrote: So what is the equivalent of building it in Git sbt\sbt assembly ? On Wed, Feb 5, 2014 at 5:06 PM, Stevo Slavić ssla...@gmail.com wrote: I'm using it through Git Bash and GitHub's Git Shell.

Re: Spark 0.9.0 support for windows

2014-02-05 Thread goi cto
So what is the equivalent of building it in Git sbt\sbt assembly ? On Wed, Feb 5, 2014 at 5:06 PM, Stevo Slavić ssla...@gmail.com wrote: I'm using it through Git Bash and GitHub's Git Shell. On Wed, Feb 5, 2014 at 3:33 PM, goi cto goi@gmail.com wrote: Hi, Spark 0.8.1 had built in

Re: Spark 0.9.0 support for windows

2014-02-05 Thread goi cto
Cool, got it to work just fine with Git Bash. Thanks! On Wed, Feb 5, 2014 at 5:10 PM, Haris Osmanagic haris.osmana...@gmail.comwrote: Did you try it with Cygwin? On Wed, Feb 5, 2014 at 4:08 PM, goi cto goi@gmail.com wrote: So what is the equivalent of building it in Git sbt\sbt

Problem connecting to Spark Cluster from a standalone Scala program

2014-02-05 Thread Soumya Simanta
I'm running a Spark cluster (Spark-0.9.0_SNAPSHOT). I connect to the Spark cluster from the spark-shell. I can see the Spark web UI on n001:8080, and it shows that the master is running on spark://n001:7077. However, when I try to connect to it using a standalone Scala program, I'm getting
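A minimal sketch of what the standalone program's context creation could look like, using the master URL shown in the web UI; the Spark home and application jar paths below are illustrative:

    import org.apache.spark.SparkContext

    val sc = new SparkContext("spark://n001:7077", "StandaloneTest",
      "/path/to/spark", Seq("/path/to/your-app.jar"))
    println(sc.parallelize(1 to 100).count())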

Re: Spark 0.9.0 support for windows

2014-02-05 Thread Josh Rosen
You could also follow SBT's or Maven's instructions for installing under Windows (http://www.scala-sbt.org/release/docs/Getting-Started/Setup.html) and use that to build from cmd.exe or PowerShell. We no longer provide an sbt.bat file because we had to stop bundling binary artifacts (such as the SBT

Re: What I am missing from configuration?

2014-02-05 Thread Dana Tontea
Hi Matei, Firstly, thank you a lot for the answer. You are right, I was missing the hadoop-client dependency locally. But on my cluster I deployed the latest version of spark-0.9.0, and now with the same code I get the following error from sbt package: [warn] :: [warn]

Re: What I am missing from configuration?

2014-02-05 Thread Andrew Ash
Try depending on spark-core_2.10 rather than _2.10.3 -- the third digit was dropped in the Maven artifact, and I hit this just yesterday as well. Sent from my mobile phone On Feb 5, 2014 10:41 AM, Dana Tontea d...@cylex.ro wrote: Hi Matei, Firstly, thank you a lot for the answer. You are right, I was
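In sbt terms, the corrected dependency would look roughly like this, assuming the 0.9.0-incubating release is the target:

    // The artifact name carries only the Scala binary version (2.10), not the full 2.10.3.
    libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "0.9.0-incubating"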

Re: What I am missing from configuration?

2014-02-05 Thread Mark Hamstra
What do you mean by the last version of spark-0.9.0? To be precise, there isn't anything known as spark-0.9.0. What was released recently is spark-0.9.0-incubating, and there is and only ever will be one version of that. If you're talking about a 0.9.0-incubating-SNAPSHOT built locally, then

Using Parquet from an interactive Spark shell

2014-02-05 Thread Uri Laserson
Has anyone tried this? I'd like to read a bunch of Avro GenericRecords from a Parquet file. I'm having a bit of trouble with respect to dependencies. My latest attempt looks like this: export

Re: Using Parquet from an interactive Spark shell

2014-02-05 Thread Andrew Ash
I'm assuming you checked all the jars in SPARK_CLASSPATH to confirm that parquet/org/codehaus/jackson/JsonGenerationException.class exists in one of them? On Wed, Feb 5, 2014 at 12:02 PM, Uri Laserson laser...@cloudera.com wrote: Has anyone tried this? I'd like to read a bunch of Avro
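One quick way to check this from the shell itself is to try resolving the relocated class by name (the class name is taken from the error above):

    // Throws ClassNotFoundException if the shaded Jackson class is not on the classpath.
    Class.forName("parquet.org.codehaus.jackson.JsonGenerationException")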

Re: Problem connecting to Spark Cluster from a standalone Scala program

2014-02-05 Thread Andrew Ash
When you look in the webui (port 8080) for the master does it list at least one connected worker? On Wed, Feb 5, 2014 at 7:19 AM, Soumya Simanta soumya.sima...@gmail.comwrote: I'm running a Spark cluster. (Spark-0.9.0_SNAPSHOT). I connect to the Spark cluster from the spark-shell. I can see

Have anyone tried to run Spark 0.9 built with Hadoop 2.2 on Mesos 0.15

2014-02-05 Thread elyast
I have successfully run spark-shell with Spark 0.9 and CDH4, but when I tried the same with CDH5 I keep getting a JVM crash. I am running an HDFS cluster with CDH5 (I tried both versions of the client, 2.2.0-mr1-CDH5.0.0-beta-1 and 2.2.0-CDH5.0.0). Attached is the error log from the JVM: hs_err_pid28948.log

Re: MLLib Sparse Input

2014-02-05 Thread Imran Rashid
Hi, I was hoping to have some discussion about how sparse matrices are represented in MLLib. I noticed a few commits to add a basic MatrixEntry object: https://github.com/apache/incubator-spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/MatrixEntry.scala I know that
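For context, MatrixEntry is essentially a coordinate-list (COO) record: one (row, column, value) triple per non-zero. A self-contained sketch of that shape (the case class below is a stand-in, not the MLlib class, and the field names are illustrative):

    case class Entry(i: Long, j: Long, value: Double)

    // Three non-zeros of a sparse matrix as an RDD of coordinate entries.
    val entries = sc.parallelize(Seq(
      Entry(0L, 0L, 1.0),
      Entry(0L, 3L, 2.5),
      Entry(2L, 1L, -4.0)))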

Is HDFS the only possible data source for spark with python?

2014-02-05 Thread cwhiten
I'm evaluating whether Spark would be a good fit in my current streaming data processing pipeline, and I'm just a bit confused about the differentiation between spark and spark streaming. Spark seems to have a mature Python API that I plan on trying out, but Spark Streaming appears to NOT have

Re: Using Parquet from an interactive Spark shell

2014-02-05 Thread Uri Laserson
Yes, of course. That class is a jackson class, and I'm not sure why it's being referred to as *parquet.*org.codehaus.jackson.JsonGenerationException. org.codehaus.jackson.JsonGenerationException is on the classpath. But not when it's prefixed by parquet. On Wed, Feb 5, 2014 at 12:06 PM,

Re: MLLib Sparse Input

2014-02-05 Thread Debasish Das
Hi Xiangrui, We are also adding support for a sparse format in MLlib... if you have a pull request or JIRA link, could you please point to it? Jblas did not implement sparse formats the last time I looked at it, but Colt had sparse formats which could be reused... Thanks. Deb On Jan 31, 2014 11:15

Re: MLLib Sparse Input

2014-02-05 Thread Xiangrui Meng
Hi, I created a JIRA to discuss and track the progress: https://spark-project.atlassian.net/browse/MLLIB-18 Let us move our discussion there. Best, Xiangrui On Wed, Feb 5, 2014 at 3:35 PM, Debasish Das debasish.da...@gmail.com wrote: Hi Xiangrui, We are also adding support for a sparse

Re: MLLib Sparse Input

2014-02-05 Thread Xiangrui Meng
Hi Imran, Yes, for better performance in computation, we should use CSC or CSR format. I believe that we are moving towards that direction. But let us first discuss the format for sparse input. Do you mind moving our discussion to the JIRA I created?
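As a reminder of what CSC looks like, a toy example with plain arrays (not an MLlib API): the 3x3 matrix [[1, 0, 2], [0, 0, 3], [0, 4, 0]] stored column by column:

    val values     = Array(1.0, 4.0, 2.0, 3.0) // non-zero values, column-major
    val rowIndices = Array(0, 2, 0, 1)         // row index of each value
    val colPtrs    = Array(0, 1, 2, 4)         // offset in values where each column starts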

Clean up app metadata on worker nodes

2014-02-05 Thread Mingyu Kim
After creating a lot of Spark connections, work/app-* folders in Worker nodes keep getting created without any clean-up being done. This particularly becomes a problem when the Spark driver programs ship jars or files. Is there any way to garbage collect these without manually deleting them?

Re: Clean up app metadata on worker nodes

2014-02-05 Thread Andrew Ash
I'm observing this as well on 0.9.0, with several 10s of GB accumulating in that directory but never being cleaned up. I think this has gotten more pronounced in 0.9.0 as well with large reducers spilling to disk. On Wed, Feb 5, 2014 at 3:46 PM, Mingyu Kim m...@palantir.com wrote: After

Re: Using Parquet from an interactive Spark shell

2014-02-05 Thread Uri Laserson
Yep, I did not include that jar in the class path. Now I've got some real errors to try to work through. Thanks! On Wed, Feb 5, 2014 at 3:52 PM, Jey Kottalam j...@cs.berkeley.edu wrote: Hi Uri, Could you try adding the parquet-jackson JAR to your classpath? There may possibly be other

serialization exceptions in spark-shell with 0.9

2014-02-05 Thread Stephen Haberman
Hey, In spark-shell, I'm doing:

    val s3 = // connection to s3 using aws-java-sdk
    val mapping: Map[String, String] = { // use s3 to load file, create a plain map }
    val rdd = sc.loadSomeData().map { // use mapping local var, but *not* s3 }
    rdd.count()

This blows up with Task not serializable

Re: serialization exceptions in spark-shell with 0.9

2014-02-05 Thread Stephen Haberman
val s3 = // connection to s3 using aws-java-sdk Turns out: @transient val s3 = ... Works. I suppose I should have thought of this (...although I really think this did work before :-), but I stumbled across: @transient val sc = org.apache.spark.repl.Main.interp.createSparkContext(); In
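Putting the two messages together, the working shape of the shell session is roughly the following; sc.loadSomeData() is the placeholder from the original message, and the S3 client is replaced by a null stand-in since its construction was elided:

    // @transient keeps the non-serializable S3 client out of the closure Spark ships to executors.
    @transient val s3: AnyRef = null // stand-in for the aws-java-sdk client

    val mapping: Map[String, String] = Map() // built locally using s3

    val rdd = sc.loadSomeData().map { x =>
      // uses the mapping val, but *not* s3
      x
    }
    rdd.count()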

Re: Using Parquet from an interactive Spark shell

2014-02-05 Thread Uri Laserson
I am cross-posting on the parquet mailing list. Short recap: I am trying to read Parquet data from the spark interactive shell. I have added all the necessary parquet jars to SPARK_CLASSPATH: export

Re: [parquet-dev] Re: Using Parquet from an interactive Spark shell

2014-02-05 Thread Uri Laserson
My spark is 0.9.0-SNAPSHOT, built from wherever master was at the time (like a week or two ago). If you're referring to the cloneRecords parameter, it appears to default to true, but even when I add it explicitly, I get the same error. On Wed, Feb 5, 2014 at 7:17 PM, Frank Austin Nothaft

Re: [parquet-dev] Re: Using Parquet from an interactive Spark shell

2014-02-05 Thread Frank Austin Nothaft
Uri, Er, yes, it is the cloneRecords, and when I said true, I meant false… Apologies for the misdirection there. Regards, Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 On Feb 5, 2014, at 7:44 PM, Uri Laserson laser...@cloudera.com wrote: My spark is

Re: [parquet-dev] Re: Using Parquet from an interactive Spark shell

2014-02-05 Thread Prashant Sharma
That cloneRecords parameter is gone, so either use the released 0.9.0 or the current master. On Thu, Feb 6, 2014 at 9:17 AM, Frank Austin Nothaft fnoth...@berkeley.eduwrote: Uri, Er, yes, it is the cloneRecords, and when I said true, I meant false... Apologies for the misdirection there.

data locality in logs

2014-02-05 Thread Tsai Li Ming
Hi, In older posts on Google Groups, there was mention of checking the logs on “preferred/non-preferred” for data locality. But I can’t seem to find this on 0.9.0 anymore? Has this been changed to “PROCESS_LOCAL” , like this: 14/02/06 13:51:45 INFO TaskSetManager: Starting task 9.0:50 as TID

Re: data locality in logs

2014-02-05 Thread Andrew Ash
If you have multiple executors running on a single node then you might have data that's on the same server but in different JVMs. Just on the same server is NODE_LOCAL, but being in the same JVM is PROCESS_LOCAL. Yes it was changed to be more specific than just preferred/non-preferred. The new

[0.9.0] MEMORY_AND_DISK_SER not falling back to disk

2014-02-05 Thread Andrew Ash
// version 0.9.0 Hi Spark users, My understanding of the MEMORY_AND_DISK_SER persistence level was that if an RDD could fit into memory then it would be left there (same as MEMORY_ONLY), and only if it was too big for memory would it spill to disk. Here's how the docs describe it:
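In code, the level under discussion is requested like this (a sketch; the RDD name is illustrative):

    import org.apache.spark.storage.StorageLevel

    // Serialized in memory where it fits; partitions that do not fit are expected to spill to disk.
    rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)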