Re: Restarting a Streaming Context

2014-07-09 Thread Tathagata Das
I confirm that is indeed the case. It is designed to be so because it keeps things simpler - fewer chances of issues related to cleanup when stop() is called. Also it keeps things consistent with the SparkContext - once a SparkContext is stopped it cannot be used any more. You can create a new s
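A minimal sketch of the pattern described above - stop one StreamingContext while keeping the SparkContext alive, then build a fresh StreamingContext on the same SparkContext. The batch interval and variable names are illustrative:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // assumes an existing SparkContext `sc` (e.g. the one in the Spark shell)
    val ssc1 = new StreamingContext(sc, Seconds(10))
    // ... define DStreams and output operations ...
    ssc1.start()
    // stop the streaming side only; the SparkContext stays alive
    ssc1.stop(stopSparkContext = false, stopGracefully = true)

    // a stopped StreamingContext cannot be restarted - create a new one instead
    val ssc2 = new StreamingContext(sc, Seconds(10))
    // ... re-define DStreams and output operations ...
    ssc2.start()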

Does the MLlib Naive Bayes implementation incorporate Laplace smoothing?

2014-07-09 Thread Rahul Bhojwani
The discussion is in the context of Spark 0.9.1. Does the MLlib Naive Bayes implementation incorporate Laplace smoothing? Or any other smoothing? Or does it not incorporate any smoothing? Please advise. Thanks, -- Rahul K Bhojwani 3rd Year B.Tech Computer Science and Engineering National Institute of T

Re: Restarting a Streaming Context

2014-07-09 Thread Burak Yavuz
Someone can correct me if I'm wrong, but unfortunately for now, once a streaming context is stopped, it can't be restarted. - Original Message - From: "Nick Chammas" To: u...@spark.incubator.apache.org Sent: Wednesday, July 9, 2014 6:11:51 PM Subject: Restarting a Streaming Context So

Re: Map Function does not seem to be executing over RDD

2014-07-09 Thread Yana Kadiyska
Does this line println("Retuning "+string) from the hash function print what you expect? If you're not seeing that output in the executor log I'd also put some debug statements in "case other", since your match in the "interesting" case is conditioned on if( fieldsList.contains(index)) -- maybe th

Re: executor failed, cannot find compute-classpath.sh

2014-07-09 Thread Yana Kadiyska
https://github.com/apache/spark/pull/1244 I think is what you're looking for On Wed, Jul 9, 2014 at 9:42 PM, cjwang wrote: > The link: > > https://github.com/apache/incubator-spark/pull/192 > > is no longer available. Could someone attach the solution or point me > another location? Thanks. > >

Re: Cannot submit to a Spark Application to a remote cluster Spark 1.0

2014-07-09 Thread Yana Kadiyska
class java.io.IOException: Cannot run program "/Users/aris.vlasakakis/Documents/spark-1.0.0/bin/compute-classpath.sh" (in directory "."): error=2, No such file or directory By any chance, are your SPARK_HOME directories different on the machine where you're submitting from and the cluster? I'm on

Map Function does not seem to be executing over RDD

2014-07-09 Thread Raza Rehman
Hello everyone, I am having some problems with a simple Scala/Spark code in which I am trying to replace certain fields in a CSV with their hashes: class DSV (var line:String="", fieldsList:Seq[Int], var delimiter:String=",") extends Serializable { def hash(s:String):String={

Re: executor failed, cannot find compute-classpath.sh

2014-07-09 Thread cjwang
The link: https://github.com/apache/incubator-spark/pull/192 is no longer available. Could someone attach the solution or point me another location? Thanks. (I am using 1.0.0) C.J. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/executor-failed-cannot

Re: Purpose of spark-submit?

2014-07-09 Thread Andrew Or
I don't see why using SparkSubmit.scala as your entry point would be any different, because all that does is invoke the main class of Client.scala (e.g. for Yarn) after setting up all the class paths and configuration options. (Though I haven't tried this myself) 2014-07-09 9:40 GMT-07:00 Ron Gon

Restarting a Streaming Context

2014-07-09 Thread Nick Chammas
So I do this from the Spark shell: // set things up// ssc.start() // let things happen for a few minutes ssc.stop(stopSparkContext = false, stopGracefully = true) Then I want to restart the Streaming Context: ssc.start() // still in the shell; Spark Context is still alive Which yields: org.

Spark Streaming - What does Spark Streaming checkpoint?

2014-07-09 Thread Yan Fang
Hi guys, I am a little confused by the checkpointing in Spark Streaming. It checkpoints the intermediate data for the stateful operations for sure. Does it also checkpoint the information of the StreamingContext? Because it seems we can recreate the SC from the checkpoint in a driver node failure sce
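For reference, a sketch of the driver-recovery pattern from the Spark Streaming documentation of this era, assuming StreamingContext.getOrCreate is available in the version in use; the checkpoint path, app name and batch interval are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///tmp/streaming-checkpoint"   // hypothetical path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("CheckpointedApp")
      val ssc = new StreamingContext(conf, Seconds(60))
      ssc.checkpoint(checkpointDir)   // checkpoints metadata as well as stateful-operation data
      // ... define DStreams, stateful operations, output operations ...
      ssc
    }

    // on (re)start, rebuild the context from the checkpoint if one exists,
    // otherwise create it from scratch
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()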

Re: Use Spark Streaming to update result whenever data come

2014-07-09 Thread Tobias Pfeiffer
Bill, good to know you found your bottleneck. Unfortunately, I don't know how to solve this; until now, I have used Spark only with embarrassingly parallel operations such as map or filter. I hope someone else might provide more insight here. Tobias On Thu, Jul 10, 2014 at 9:57 AM, Bill Jay wr

Re: Some question about SQL and streaming

2014-07-09 Thread Tobias Pfeiffer
Siyuan, I do it like this: // get data from Kafka val ssc = new StreamingContext(...) val kvPairs = KafkaUtils.createStream(...) // we need to wrap the data in a case class for registerAsTable() to succeed val lines = kvPairs.map(_._2).map(s => StringWrapper(s)) val result = lines.transform((rdd,
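Expanding the truncated snippet above into a self-contained sketch against the Spark 1.0-era API (SchemaRDD / registerAsTable); the Kafka connection details, case class and query are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    case class StringWrapper(line: String)

    val conf = new SparkConf().setAppName("StreamingSQL")
    val ssc = new StreamingContext(conf, Seconds(10))
    val sqlContext = new SQLContext(ssc.sparkContext)
    import sqlContext.createSchemaRDD   // implicit: RDD[case class] -> SchemaRDD

    // placeholder ZooKeeper quorum, consumer group and topic map
    val kvPairs = KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map("my-topic" -> 1))

    // wrap each message in a case class so it can be registered as a table
    val lines = kvPairs.map(_._2).map(s => StringWrapper(s))
    val result = lines.transform { rdd =>
      rdd.registerAsTable("lines")
      sqlContext.sql("SELECT line FROM lines WHERE line LIKE '%ERROR%'")
    }
    result.print()
    ssc.start()
    ssc.awaitTermination()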

Re: error in creating external table

2014-07-09 Thread Du Li
Hi, I got an error when trying to create an external table with its location on a remote HDFS address. I meant to quickly try out the basic features of Spark SQL 1.0-JDBC and so started the thrift server on one terminal and the beeline CLI on another. Didn’t do any extra configuration on spark sql, hi

Re: Use Spark Streaming to update result whenever data come

2014-07-09 Thread Bill Jay
Hi Tobias, Now I did the re-partition and ran the program again. I found a bottleneck in the whole program. In the streaming, there is a stage marked as "combineByKey at ShuffledDStream.scala:42" in the Spark UI. This stage is repeatedly executed. However, during some batches, the number of executors

Re: Use Spark Streaming to update result whenever data come

2014-07-09 Thread Tobias Pfeiffer
Bill, I haven't worked with Yarn, but I would try adding a repartition() call after you receive your data from Kafka. I would be surprised if that didn't help. On Thu, Jul 10, 2014 at 6:23 AM, Bill Jay wrote: > Hi Tobias, > > I was using Spark 0.9 before and the master I used was yarn-standalo

Re: How should I add a jar?

2014-07-09 Thread Nicholas Chammas
Public service announcement: If you're trying to do some stream processing on Twitter data, you'll need version 3.0.6 of twitter4j. That should work with the Spark Streaming 1.0.0 Twitter library. The latest version of twitter4j, 4.0.2, appears to have breaking cha

spark1.0 principal component analysis

2014-07-09 Thread fintis
Hi, Can anyone please shed more light on the PCA implementation in Spark? The documentation is a bit lacking, as I am not sure I understand the output. According to the docs, the output is a local matrix with the columns as principal components and columns sorted in descending order of covariance

Pyspark, references to different rdds being overwritten to point to the same rdd, different results when using .cache()

2014-07-09 Thread nimbus
Discovered this in ipynb, and I haven't yet checked to see if it happens elsewhere. Here's a simple example: this produces the output: which is not what I wanted. Alarmingly, if I call .cache() on these RDDs, it changes the result and I get what I wanted, which produces: It's very unexpe

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-09 Thread vs
The Hortonworks Tech Preview of Spark is for Spark on YARN. It does not require Spark to be installed on all nodes manually. When you submit the Spark assembly jar it will have all its dependencies. YARN will instantiate Spark App Master & Containers based on this jar. -- View this message in co

CoarseGrainedExecutorBackend: Driver Disassociated

2014-07-09 Thread Sameer Tilak
Hi, This time instead of manually starting the worker node using ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT, I used the start-slaves script on the master node. I also enabled -v (verbose flag) in ssh. Here is the o/p that I see. The log file for the worker node was not

Cannot submit to a Spark Application to a remote cluster Spark 1.0

2014-07-09 Thread Aris Vlasakakis
Hello everybody, I am trying to figure out how to submit a Spark application from a separate physical machine to a Spark standalone cluster. I have an application that I wrote in Python that works if I am on the 1-Node Spark server itself, and from that Spark installation I run bin/spark-submit

Number of executors change during job running

2014-07-09 Thread Bill Jay
Hi all, I have a Spark Streaming job running on YARN. It consumes data from Kafka and groups the data by a certain field. The data size is 480k lines per minute with a batch size of 1 minute. For some batches, the program sometimes takes more than 3 minutes to finish the groupBy operation, which s

Re: Requirements for Spark cluster

2014-07-09 Thread Krishna Sankar
I rsync the spark-1.0.1 directory to all the nodes. Yep, one needs Spark in all the nodes irrespective of Hadoop/YARN. Cheers On Tue, Jul 8, 2014 at 6:24 PM, Robert James wrote: > I have a Spark app which runs well on local master. I'm now ready to > put it on a cluster. What needs to be ins

Re: Understanding how to install in HDP

2014-07-09 Thread Andrew Or
Hi Abel and Krishna, You shouldn't have to do any manual rsync'ing. If you're using HDP, then you can just change the configs through Ambari. As for passing the assembly jar to all executor nodes, the Spark on YARN code automatically uploads the jar to a distributed cache (HDFS) and all executors

Re: Apache Spark, Hadoop 2.2.0 without Yarn Integration

2014-07-09 Thread Nick R. Katsipoulakis
Krishna, Ok, thank you. I just wanted to make sure that this can be done. Cheers, Nick On Wed, Jul 9, 2014 at 3:30 PM, Krishna Sankar wrote: > Nick, >AFAIK, you can compile with yarn=true and still run spark in stand > alone cluster mode. > Cheers > > > > On Wed, Jul 9, 2014 at 9:27 AM,

Re: Apache Spark, Hadoop 2.2.0 without Yarn Integration

2014-07-09 Thread Krishna Sankar
Nick, AFAIK, you can compile with yarn=true and still run Spark in standalone cluster mode. Cheers On Wed, Jul 9, 2014 at 9:27 AM, Nick R. Katsipoulakis wrote: > Hello, > > I am currently learning Apache Spark and I want to see how it integrates > with an existing Hadoop Cluster. > > My cu

Re: Understanding how to install in HDP

2014-07-09 Thread Krishna Sankar
Abel, I rsync the spark-1.0.1 directory to all the nodes. Then whenever the configuration changes, rsync the conf directory. Cheers On Wed, Jul 9, 2014 at 2:06 PM, Abel Coronado Iruegas < acoronadoirue...@gmail.com> wrote: > Hi everybody > > We have hortonworks cluster with many nodes, we wa

Re: Use Spark Streaming to update result whenever data come

2014-07-09 Thread Bill Jay
Hi Tobias, I was using Spark 0.9 before and the master I used was yarn-standalone. In Spark 1.0, the master will be either yarn-cluster or yarn-client. I am not sure whether it is the reason why more machines do not provide better scalability. What is the difference between these two modes in term

Understanding how to install in HDP

2014-07-09 Thread Abel Coronado Iruegas
Hi everybody, We have a Hortonworks cluster with many nodes, and we want to test a deployment of Spark. What's the recommended path to follow? I mean, we can compile the sources on the Name Node, but I don't really understand how to pass the executable jar and configuration to the rest of the nodes. Thank

SPARK_CLASSPATH Warning

2014-07-09 Thread Nick R. Katsipoulakis
Hello, I have installed Apache Spark v1.0.0 on a machine with a proprietary Hadoop distribution installed (v2.2.0 without YARN). Because the Hadoop distribution that I am using uses a list of jars, I make the following changes to conf/spark-env.sh: #!/usr/bin/env bash export HADO

Re: Compilation error in Spark 1.0.0

2014-07-09 Thread Silvina Caíno Lores
Right, the compile error is a casting issue telling me I cannot assign a JavaPairRDD of one parameterized type to a JavaPairRDD of another. It happens in the mapToPair() method. On 9 July 2014 19:52, Sean Owen wrote: > You forgot the compile error! > > > On Wed, Jul 9, 2014 at 6:14 PM, Silvina Caíno Lores > wrote: > >> Hi every

Re: How should I add a jar?

2014-07-09 Thread Nicholas Chammas
Awww ye. That worked! Thank you Sameer. Is this documented somewhere? I feel there's a slight doc deficiency here. Nick On Wed, Jul 9, 2014 at 2:50 PM, Sameer Tilak wrote: > Hi Nicholas, > > I am using Spark 1.0 and I use this method to specify the additional jars. > First jar is th

Some question about SQL and streaming

2014-07-09 Thread hsy...@gmail.com
Hi guys, I'm a new user to Spark. I would like to know whether there is an example of how to use Spark SQL and Spark Streaming together? My use case is that I want to do some SQL on the input stream from Kafka. Thanks! Best, Siyuan

RE: How should I add a jar?

2014-07-09 Thread Sameer Tilak
Hi Nicholas, I am using Spark 1.0 and I use this method to specify the additional jars. First jar is the dependency and the second one is my application. Hope this will work for you. ./spark-shell --jars /apps/software/secondstring/secondstring/dist/lib/secondstring-20140630.jar,/apps/software

Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-09 Thread M Singh
Hi Folks: I am working on an application which uses Spark Streaming (version 1.1.0 snapshot on a standalone cluster) to process text files and save counters in Cassandra based on fields in each row. I am testing the application in two modes: * Process each row and save the counter i

Re: Spark Streaming - two questions about the streamingcontext

2014-07-09 Thread Yan Fang
Great. Thank you! Fang, Yan yanfang...@gmail.com +1 (206) 849-4108 On Wed, Jul 9, 2014 at 11:45 AM, Tathagata Das wrote: > 1. Multiple output operations are processed in the order they are defined. > That is because by default each one output operation is processed at a > time. This *can* be p

Re: Spark Streaming - two questions about the streamingcontext

2014-07-09 Thread Tathagata Das
1. Multiple output operations are processed in the order they are defined. That is because by default only one output operation is processed at a time. This *can* be parallelized using an undocumented config parameter "spark.streaming.concurrentJobs", which is by default set to 1. 2. Yes, the outpu
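A tiny sketch of how that setting would be supplied, assuming it is set on the SparkConf used to build the StreamingContext; the app name is illustrative:

    import org.apache.spark.SparkConf

    // default is 1, i.e. output operations run one at a time per batch
    val conf = new SparkConf()
      .setAppName("MyStreamingApp")
      .set("spark.streaming.concurrentJobs", "2")   // let two jobs run concurrently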

How should I add a jar?

2014-07-09 Thread Nick Chammas
I’m just starting to use the Scala version of Spark’s shell, and I’d like to add in a jar I believe I need to access Twitter data live, twitter4j. I’m confused over where and how to add this jar in. SPARK-1089

Spark Streaming - two questions about the streamingcontext

2014-07-09 Thread Yan Fang
I am using Spark Streaming and have the following two questions: 1. If more than one output operation is put in the same StreamingContext (basically, I mean, I put all the output operations in the same class), are they processed one by one in the order they appear in the class? Or they are a

Re: Spark SQL - java.lang.NoClassDefFoundError: Could not initialize class $line10.$read$

2014-07-09 Thread Michael Armbrust
At first glance that looks like an error with the class shipping in the spark shell. (i.e. the lines that you type into the spark shell are compiled into classes and then shipped to the executors where they run). Are you able to run other spark examples with closures in the same shell? Michael

Re: Spark on Yarn: Connecting to Existing Instance

2014-07-09 Thread John Omernik
So how do I do the "long-lived server continually satisfying requests" in the Cloudera application? I am very confused by that at this point. On Wed, Jul 9, 2014 at 12:49 PM, Sandy Ryza wrote: > Spark doesn't currently offer you anything special to do this. I.e. if > you want to write a Spark

Re: Execution stalls in LogisticRegressionWithSGD

2014-07-09 Thread Xiangrui Meng
We have maven-enforcer-plugin defined in the pom. I don't know why it didn't work for you. Could you try rebuilding with maven2 and confirm that there is no error message? If that is the case, please create a JIRA for it. Thanks! -Xiangrui On Wed, Jul 9, 2014 at 3:53 AM, Bharath Ravi Kumar wrote: >

Re: Spark on Yarn: Connecting to Existing Instance

2014-07-09 Thread Sandy Ryza
Spark doesn't currently offer you anything special to do this. I.e. if you want to write a Spark application that fires off jobs on behalf of remote processes, you would need to implement the communication between those remote processes and your Spark application code yourself. On Wed, Jul 9, 20

Re: Spark on Yarn: Connecting to Existing Instance

2014-07-09 Thread John Omernik
So basically, I have Spark on Yarn running (spark-shell); how do I connect to it with another tool I am trying to test, using the spark://IP:7077 URL it's expecting? If that won't work with spark-shell, or yarn-client mode, how do I set up Spark on Yarn to be able to handle that? Thanks! On Wed,

Re: Spark on Yarn: Connecting to Existing Instance

2014-07-09 Thread John Omernik
Thank you for the link. In that link the following is written: For those familiar with the Spark API, an application corresponds to an instance of the SparkContext class. An application can be used for a single batch job, an interactive session with multiple jobs spaced apart, or a long-lived ser

Re: Comparative study

2014-07-09 Thread Keith Simmons
Good point. Shows how personal use cases color how we interpret products. On Wed, Jul 9, 2014 at 1:08 AM, Sean Owen wrote: > On Wed, Jul 9, 2014 at 1:52 AM, Keith Simmons wrote: > >> Impala is *not* built on map/reduce, though it was built to replace >> Hive, which is map/reduce based. It h

Compilation error in Spark 1.0.0

2014-07-09 Thread Silvina Caíno Lores
Hi everyone, I am new to Spark and I'm having problems making my code compile. I have the feeling I might be misunderstanding the functions, so I would be very glad to get some insight into what could be wrong. The problematic code is the following: JavaRDD bodies = lines.map(l -> {Body b = new Bo

Re: Terminal freeze during SVM

2014-07-09 Thread DB Tsai
It means pulling the code from the latest development branch of the git repository. On Jul 9, 2014 9:45 AM, "AlexanderRiggers" wrote: > By latest branch you mean Apache Spark 1.0.0 ? and what do you mean by > master? Because I am using v 1.0.0 - Alex > > > > -- > View this message in context: > http://

Re: Spark 0.9.1 implementation of MLlib-NaiveBayes has a bug.

2014-07-09 Thread Sean Owen
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Make a JIRA with enough detail to reproduce the error ideally: https://issues.apache.org/jira/browse/SPARK and then even more ideally open a PR with a fix: https://github.com/apache/spark On Wed, Jul 9, 2014 at 5:57 PM, Rah

Re: How to clear the list of Completed Appliations in Spark web UI?

2014-07-09 Thread Marcelo Vanzin
And if you think that's too many, you can control the number by setting "spark.deploy.retainedApplications". On Wed, Jul 9, 2014 at 12:38 AM, Patrick Wendell wrote: > There isn't currently a way to do this, but it will start dropping > older applications once more than 200 are stored. > > On Wed,

Spark 0.9.1 implementation of MLlib-NaiveBayes has a bug.

2014-07-09 Thread Rahul Bhojwani
In my opinion there is a bug in the MLlib Naive Bayes implementation in Spark 0.9.1. Whom should I report this to, or with whom should I discuss it? I can discuss this over a call as well. My Skype ID: rahul.bhijwani Phone no: +91-9945197359 Thanks, -- Rahul K Bhojwani 3rd Year B.Tech Computer Scien

Re: Terminal freeze during SVM

2014-07-09 Thread AlexanderRiggers
By latest branch do you mean Apache Spark 1.0.0? And what do you mean by master? Because I am using v1.0.0 - Alex -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Terminal-freeze-during-SVM-Broken-pipe-tp9022p9208.html Sent from the Apache Spark User List mail

Re: issues with ./bin/spark-shell for standalone mode

2014-07-09 Thread Mikhail Strebkov
Hi Patrick, I used 1.0 branch, but it was not an official release, I just git pulled whatever was there and compiled. Thanks, Mikhail -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/issues-with-bin-spark-shell-for-standalone-mode-tp9107p9206.html Sent from

Re: Purpose of spark-submit?

2014-07-09 Thread Ron Gonzalez
I am able to use Client.scala or LauncherExecutor.scala as my programmatic entry point for Yarn. Thanks, Ron Sent from my iPad > On Jul 9, 2014, at 7:14 AM, Jerry Lam wrote: > > +1 as well for being able to submit jobs programmatically without using shell > script. > > we also experience is

Re: Purpose of spark-submit?

2014-07-09 Thread Ron Gonzalez
Koert, Yeah I had the same problems trying to do programmatic submission of spark jobs to my Yarn cluster. I was ultimately able to resolve it by reviewing the classpath and debugging through all the different things that the Spark Yarn client (Client.scala) did for submitting to Yarn (like env

Re: Spark Streaming using File Stream in Java

2014-07-09 Thread Aravind
Hi Akhil, It didn't work. Here is the code... package com.paypal; import org.apache.spark.SparkConf; import org.apache.spark.storage.StorageLevel; import org.apache.spark.streaming.api.java.JavaPairInputDStream; import org.apache.spark.streaming.api.java.JavaStreamingContext; import org.apache.sp

Apache Spark, Hadoop 2.2.0 without Yarn Integration

2014-07-09 Thread Nick R. Katsipoulakis
Hello, I am currently learning Apache Spark and I want to see how it integrates with an existing Hadoop cluster. My current Hadoop configuration is version 2.2.0 without Yarn. I have built Apache Spark (v1.0.0) following the instructions in the README file. Only setting the SPARK_HADOOP_VERSION=1

Re: Purpose of spark-submit?

2014-07-09 Thread Sandy Ryza
Are you able to share the error you're getting? On Wed, Jul 9, 2014 at 9:25 AM, Jerry Lam wrote: > Sandy, I experienced the similar behavior as Koert just mentioned. I don't > understand why there is a difference between using spark-submit and > programmatic execution. Maybe there is something

Re: Purpose of spark-submit?

2014-07-09 Thread Jerry Lam
Sandy, I experienced similar behavior to what Koert just mentioned. I don't understand why there is a difference between using spark-submit and programmatic execution. Maybe there is something else we need to add to the spark conf/spark context in order to launch spark jobs programmatically that are

Error with Stream Kafka Kryo

2014-07-09 Thread richiesgr
Hi, My setup uses local mode standalone, Spark 1.0.0 release version, Scala 2.10.4. I made a job that receives serialized objects from a Kafka broker. The objects are serialized using Kryo. The code: val sparkConf = new SparkConf().setMaster("local[4]").setAppName("SparkTest") .set("sp
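The conf shown in the question is cut off; a sketch of a typical Kryo setup of this era - the event type, registrator class and package name are placeholders:

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    case class MyEvent(id: Long, payload: String)   // placeholder for the type sent through Kafka

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[MyEvent])   // registering classes avoids writing full class names
      }
    }

    val sparkConf = new SparkConf()
      .setMaster("local[4]")
      .setAppName("SparkTest")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "com.example.MyRegistrator")   // fully-qualified registrator name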

Re: Spark on Yarn: Connecting to Existing Instance

2014-07-09 Thread Sandy Ryza
To add to Ron's answer, this post explains what it means to run Spark against a YARN cluster, the difference between yarn-client and yarn-cluster mode, and the reason spark-shell only works in yarn-client mode. http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-mode

Re: Purpose of spark-submit?

2014-07-09 Thread Koert Kuipers
sandy, that makes sense. however i had trouble doing programmatic execution on yarn in client mode as well. the application-master in yarn came up but then bombed because it was looking for jars that don't exist (it was looking in the original file paths on the driver side, which are not available o

Mechanics of passing functions to Spark?

2014-07-09 Thread Seref Arikan
Greetings, The documentation at http://spark.apache.org/docs/latest/programming-guide.html#passing-functions-to-spark says: "Note that while it is also possible to pass a reference to a method in a class instance (as opposed to a singleton object), this requires sending the object that contains tha
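A sketch of the two variants that passage contrasts - referencing an instance field (which drags the whole object into the closure) versus copying it into a local variable first; the class and field names are illustrative:

    import org.apache.spark.rdd.RDD

    class MyClass {
      val field = "prefix-"

      // referencing `field` captures `this`, so the whole MyClass instance has to be
      // serialized and shipped to the executors (and fails if MyClass is not Serializable)
      def badMap(rdd: RDD[String]): RDD[String] = rdd.map(x => field + x)

      // copying the field into a local variable keeps the closure self-contained
      def goodMap(rdd: RDD[String]): RDD[String] = {
        val localField = field
        rdd.map(x => localField + x)
      }
    }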

Re: Spark on Yarn: Connecting to Existing Instance

2014-07-09 Thread Ron Gonzalez
The idea behind YARN is that you can run different application types like MapReduce, Storm and Spark. I would recommend that you build your spark jobs in the main method without specifying how you deploy it. Then you can use spark-submit to tell Spark how you would want to deploy to it using ya

Re: Purpose of spark-submit?

2014-07-09 Thread Sandy Ryza
Spark still supports the ability to submit jobs programmatically without shell scripts. Koert, The main reason that the unification can't be a part of SparkContext is that YARN and standalone support deploy modes where the driver runs in a managed process on the cluster. In this case, the SparkCo

Re: Purpose of spark-submit?

2014-07-09 Thread Andrei
Another +1. For me it's a question of embedding. With SparkConf/SparkContext I can easily create larger projects with Spark as a separate service (just like MySQL and JDBC, for example). With spark-submit I'm bound to Spark as the main framework that defines how my application should look. I

Spark on Yarn: Connecting to Existing Instance

2014-07-09 Thread John Omernik
I am trying to get my head around using Spark on Yarn from the perspective of a cluster. I can start a Spark shell with no issues in Yarn. Works easily. This is done in yarn-client mode and it all works well. In multiple examples, I see instances where people have set up Spark clusters in standalone mod

Re: RDD Cleanup

2014-07-09 Thread Koert Kuipers
we simply hold on to the reference to the rdd after it has been cached. so we have a single Map[String, RDD[X]] for cached RDDs for the application On Wed, Jul 9, 2014 at 11:00 AM, premdass wrote: > Hi, > > Yes . I am caching the RDD's by calling cache method.. > > > May i ask, how you are sha
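A hypothetical sketch of such a registry - a mutable map of cached RDDs held by the long-running application so later jobs can reuse them; all names are illustrative:

    import scala.collection.mutable
    import org.apache.spark.rdd.RDD

    object RddRegistry {
      private val cached = mutable.Map[String, RDD[_]]()

      // cache the RDD and remember it under a name for reuse by later jobs
      def put[T](name: String, rdd: RDD[T]): RDD[T] = {
        cached(name) = rdd.cache()
        rdd
      }

      def get[T](name: String): Option[RDD[T]] =
        cached.get(name).map(_.asInstanceOf[RDD[T]])
    }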

Re: RDD Cleanup

2014-07-09 Thread premdass
Hi, Yes, I am caching the RDDs by calling the cache method. May I ask how you are sharing RDDs across jobs in the same context? By the RDD name. I tried printing the RDDs of the Spark context, and when referenceTracking is enabled, I get an empty list after the cleanup. Thanks, Prem -- Vie

Re: Spark Streaming and Storm

2014-07-09 Thread Dan H.
Xichen_tju, I recently evaluated Storm for a period of months (using 2Us, 2.4GHz CPU, 24GB RAM with 3 servers) and was not able to achieve a realistic scale for my business domain needs. Storm is really only a framework, which allows you to put in code to do whatever it is you need for a distri

Re: Cassandra driver Spark question

2014-07-09 Thread Luis Ángel Vicente Sánchez
Yes, I'm using it to count concurrent users from a Kafka stream of events without problems. I'm currently testing it using local mode, but any serialization problem would have already appeared, so I don't expect any serialization issue when I deploy to my cluster. 2014-07-09 15:39 GMT+01:00 R

Re: RDD Cleanup

2014-07-09 Thread Koert Kuipers
did you explicitly cache the rdd? we cache rdds and share them between jobs just fine within one context in spark 1.0.x. but we do not use the ooyala job server... On Wed, Jul 9, 2014 at 10:03 AM, premdass wrote: > Hi, > > I using spark 1.0.0 , using Ooyala Job Server, for a low latency query

Re: Cassandra driver Spark question

2014-07-09 Thread RodrigoB
Hi Luis, Yes it's actually an output of the previous RDD. Have you ever used the Cassandra Spark Driver on the driver app? I believe these limitations go around that - it's designed to save RDDs from the nodes. tnks, Rod -- View this message in context: http://apache-spark-user-list.1001560.n

Re: Purpose of spark-submit?

2014-07-09 Thread Jerry Lam
+1 as well for being able to submit jobs programmatically without using a shell script. We also experience issues submitting jobs programmatically without using spark-submit. In fact, even in the Hadoop world, I rarely used "hadoop jar" to submit jobs in a shell. On Wed, Jul 9, 2014 at 9:47 AM,

Re: window analysis with Spark and Spark streaming

2014-07-09 Thread Laeeq Ahmed
Hi, For QueueRDD, have a look here. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/QueueStream.scala Regards, Laeeq, PhD candidate, KTH, Stockholm. On Sunday, July 6, 2014 10:20 AM, alessandro finamore wrote: On 5 July 2014 23:0
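The linked example boils down to something like this sketch - a queue of RDDs whose entries are consumed one per batch; the interval and data are illustrative:

    import scala.collection.mutable
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // assumes an existing SparkContext `sc`
    val ssc = new StreamingContext(sc, Seconds(1))
    val rddQueue = new mutable.SynchronizedQueue[RDD[Int]]()

    val stream = ssc.queueStream(rddQueue)          // each queued RDD becomes one batch
    stream.map(x => (x % 10, 1)).reduceByKey(_ + _).print()
    ssc.start()

    // push RDDs in from the driver, as the QueueStream example does
    for (_ <- 1 to 5) {
      rddQueue += sc.makeRDD(1 to 1000, 10)
      Thread.sleep(1000)
    }
    ssc.stop()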

Re: SparkSQL registerAsTable - No TypeTag available Error

2014-07-09 Thread premdass
Michael, Thanks for the response. Yes, Moving the Case class solved the issue. Thanks, Prem -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-registerAsTable-No-TypeTag-available-Error-tp7623p9183.html Sent from the Apache Spark User List mailing li

Re: how to convert JavaDStream to JavaRDD

2014-07-09 Thread Laeeq Ahmed
Hi, First use foreachRDD and then use collect, as in DStream.foreachRDD(rdd => { rdd.collect.foreach({ Also, it's better to use Scala. Less verbose. Regards, Laeeq On Wednesday, July 9, 2014 3:29 PM, Madabhattula Rajesh Kumar wrote: Hi Team, Could you please help me to resolve bel
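In Scala, the pattern looks like this sketch (using the HDFS path from the question below; collect() is only safe when a batch comfortably fits on the driver):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // assumes an existing SparkContext `sc`; batch interval is illustrative
    val ssc = new StreamingContext(sc, Seconds(10))
    val lines = ssc.textFileStream("hdfs://localhost:9000/user/rajesh/EventsDirectory/")

    // each batch of a DStream is an RDD; foreachRDD hands it to you directly
    lines.foreachRDD { rdd =>
      rdd.collect().foreach(println)   // pulls the batch to the driver
    }
    ssc.start()
    ssc.awaitTermination()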

RDD Cleanup

2014-07-09 Thread premdass
Hi, I am using Spark 1.0.0 with the Ooyala Job Server for a low-latency query system. Basically a long-running context is created, which makes it possible to run multiple jobs under the same context, and hence share the data. It was working fine in 0.9.1. However, in the Spark 1.0 release, the RDDs created

Re: controlling the time in spark-streaming

2014-07-09 Thread Laeeq Ahmed
Hi, For QueueRDD, have a look here. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/QueueStream.scala   Regards, Laeeq On Friday, May 23, 2014 10:33 AM, Mayur Rustagi wrote: Well its hard to use text data as time of input.  But if

Re: Purpose of spark-submit?

2014-07-09 Thread Robert James
+1 for being able to do anything via SparkConf/SparkContext. Our app worked fine in Spark 0.9, but, after several days of wrestling with uber jars and spark-submit, and so far failing to get Spark 1.0 working, we'd like to go back to doing it ourselves with SparkConf. As the previous poster said, a few

Re: Cassandra driver Spark question

2014-07-09 Thread Luis Ángel Vicente Sánchez
Is MyType serializable? Everything inside the foreachRDD closure has to be serializable. 2014-07-09 14:24 GMT+01:00 RodrigoB : > Hi all, > > I am currently trying to save to Cassandra after some Spark Streaming > computation. > > I call a myDStream.foreachRDD so that I can collect each RDD in th

how to convert JavaDStream to JavaRDD

2014-07-09 Thread Madabhattula Rajesh Kumar
Hi Team, Could you please help me resolve the query below. My use case is: I'm using JavaStreamingContext to read text files from a Hadoop HDFS directory: JavaDStream lines_2 = ssc.textFileStream("hdfs://localhost:9000/user/rajesh/EventsDirectory/"); How do I convert the JavaDStream result to JavaRDD

Cassandra driver Spark question

2014-07-09 Thread RodrigoB
Hi all, I am currently trying to save to Cassandra after some Spark Streaming computation. I call myDStream.foreachRDD so that I can collect each RDD in the driver app at runtime, and inside I do something like this: myDStream.foreachRDD(rdd => { var someCol = Seq[MyType]() foreach(kv =>{ someC

Re: Spark job tracker.

2014-07-09 Thread Mayur Rustagi
var sem = 0 sc.addSparkListener(new SparkListener { override def onTaskStart(taskStart: SparkListenerTaskStart) { sem += 1 } }) sc = spark context Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi

Re: Need advice to create an objectfile of set of images from Spark

2014-07-09 Thread Jaonary Rabarisoa
The idea is to run a job that uses images as input so that each worker will process a subset of the images On Wed, Jul 9, 2014 at 2:30 PM, Mayur Rustagi wrote: > RDD can only keep objects. How do you plan to encode these images so that > they can be loaded. Keeping the whole image as a single obje

Re: Need advice to create an objectfile of set of images from Spark

2014-07-09 Thread Mayur Rustagi
An RDD can only keep objects. How do you plan to encode these images so that they can be loaded? Keeping the whole image as a single object in one RDD would perhaps not be super optimized. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi

Re: Purpose of spark-submit?

2014-07-09 Thread Surendranauth Hiraman
Are there any gaps beyond convenience and code/config separation in using spark-submit versus SparkConf/SparkContext if you are willing to set your own config? If there are any gaps, +1 on having parity within SparkConf/SparkContext where possible. In my use case, we launch our jobs programmatical

Re: Re: Pig 0.13, Spark, Spork

2014-07-09 Thread Mayur Rustagi
Also, it's far from bug-free :) Let me know if you need any help to try it out. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Wed, Jul 9, 2014 at 12:58 PM, Akhil Das wrote: > Hi Bertrand, > > We've updated the document

Re: Purpose of spark-submit?

2014-07-09 Thread Koert Kuipers
not sure I understand why unifying how you submit apps for different platforms and dynamic configuration cannot be part of SparkConf and SparkContext? for the classpath, a simple script similar to "hadoop classpath" that shows what needs to be added should be sufficient. on spark standalone I can launc

Spark SQL - java.lang.NoClassDefFoundError: Could not initialize class $line10.$read$

2014-07-09 Thread gil...@gmail.com
Hello, While trying to run this example below I am getting errors. I have built Spark using the following command: $ SPARK_HADOOP_VERSION=2.4.0 SPARK_YARN=true SPARK_HIVE=true sbt/sbt clean assembly - Running the example using Spark-shell ---

FW: memory question

2014-07-09 Thread michael.lewis
Hi, Does anyone know if it is possible to call the MetadataCleaner on demand? i.e. rather than set spark.cleaner.ttl and have this run periodically, I'd like to run it on demand. The problem with periodic cleaning is that it can remove RDDs that we still require (some calcs are short, others very

Docker Scripts

2014-07-09 Thread dmpour23
Hi, Regarding the docker scripts, I know I can change the base image easily, but is there any specific reason why the base image is hadoop_1.2.1? Why is this preferred to Hadoop 2 (HDP2, CDH5) distributions? Now that Amazon supports Docker, could this replace the ec2 scripts? Kind regards Dimitri --

Re: Filtering data during the read

2014-07-09 Thread Mayur Rustagi
Hi, Spark does that out of the box for you :) It compresses down the execution steps as much as possible. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Wed, Jul 9, 2014 at 3:15 PM, Konstantin Kudryavtsev <
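Concretely, for the example in the question below: transformations are lazy and the filter runs on each line as partitions are scanned, so the whole file never needs to be materialized in memory first. A sketch:

    // assumes an existing SparkContext `sc`
    val file = sc.textFile("hdfs://...")                        // nothing is read yet
    val errors = file.filter(line => line.contains("ERROR"))    // still nothing is read

    // only an action triggers the read; the filter is applied as the data streams
    // through, not after the full file has been loaded
    val numErrors = errors.count()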

Filtering data during the read

2014-07-09 Thread Konstantin Kudryavtsev
Hi all, I wonder if you could help me clarify the following situation: in the classic example val file = spark.textFile("hdfs://...") val errors = file.filter(line => line.contains("ERROR")) As I understand it, the data is read into memory first, and after that the filter is applied. Is there any way

Error using MLlib-NaiveBayes : "Matrices are not aligned"

2014-07-09 Thread Rahul Bhojwani
I am using Naive Bayes in MLlib. Below I have printed the log of model.theta after training on the train data. You can check that it contains 9 features for 2-class classification. >>print numpy.log(model.theta) [[ 0.31618962 0.16636852 0.07200358 0.05411449 0.08542039 0.17620751 0.03711986

How to run a job on all workers?

2014-07-09 Thread silvermast
Is it possible to run a job that assigns work to every worker in the system? My bootleg right now is to have a spark listener hear whenever a block manager is added and to increase a split count by 1. It runs a spark job with that split count and hopes that it will at least run on the newest worker

Re: TaskContext stageId = 0

2014-07-09 Thread silvermast
Oh well, never mind. The problem is that ResultTask's stageId is immutable and is used to construct the Task superclass. Anyway, my solution now is to use this.id for the rddId and to gather all rddIds using a spark listener on stage completed to clean up for any activity registered for those rdds.

RE: Kryo is slower, and the size saving is minimal

2014-07-09 Thread innowireless TaeYun Kim
Thank you for your response. Maybe that applies to my case. In my test case, the types of almost all of the data are either primitive types, Joda DateTime, or String. But I'm somewhat disappointed with the speed. At least it should not be slower than the Java default serializer... -Original Messa

Re: Kryo is slower, and the size saving is minimal

2014-07-09 Thread wxhsdp
I'm not familiar with Kryo and my opinion may not be right. In my case, Kryo only saves about 5% of the original size when dealing with primitive types such as arrays. I'm not sure whether that is the common case. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.
