Shared variable in Spark Streaming

2014-08-08 Thread Soumitra Kumar
Hello, I want to count the number of elements in a DStream, like RDD.count(). Since there is no such method on DStream, I thought of using DStream.count and an accumulator. How do I count the number of elements in a DStream? How do I create a shared variable in

Re: Spark Streaming on Yarn Input from Flume

2014-08-08 Thread Hari Shreedharan
Do you see anything suspicious in the logs? How did you run the application? On Thu, Aug 7, 2014 at 10:02 PM, XiaoQinyu xiaoqinyu_sp...@outlook.com wrote: Hi~ I run a Spark Streaming app to receive data from Flume events. When I run on standalone, Spark Streaming can receive the Flume event

Re: How can I implement eigenvalue decomposition in Spark?

2014-08-08 Thread Chunnan Yao
I think the eigenvalues and eigenvectors you are talking about are those of M^T*M or M*M^T, given the SVD M = U*S*V^T. What I want are the eigenvectors and eigenvalues of M itself. Is this my misunderstanding of linear algebra or of the API? M^{*} M = V \Sigma^{*} U^{*} U \Sigma V^{*} = V (\Sigma^{*} \Sigma) V^{*}

Re: Partitioning: Where do my partitions go?

2014-08-08 Thread losmi83
I'd appreciate if anyone could confirm whether this is a bug or intended behavior of Spark. Thanks, Milos -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Partitioning-Where-do-my-partitions-go-tp11635p11766.html Sent from the Apache Spark User List mailing

Re: Where do my partitions go?

2014-08-08 Thread Xiangrui Meng
They are two different RDDs. Spark doesn't guarantee that the first partition of RDD1 and the first partition of RDD2 will stay on the same worker node. If that were the case and you had 1000 single-partition RDDs, the first worker would carry a very heavy load. -Xiangrui On Thu, Aug 7, 2014 at 2:20

Re: Shared variable in Spark Streaming

2014-08-08 Thread Tathagata Das
Do you mean that you want a continuously updated count as more events/records are received in the DStream (remember, a DStream is a continuous stream of data)? Assuming that is what you want, you can use a global counter: var globalCount = 0L; dstream.count().foreachRDD(rdd => { globalCount +=
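The snippet above is truncated in the archive; a minimal sketch of the pattern Tathagata describes, with the update and print statements filled in as assumptions:

    import org.apache.spark.streaming.dstream.DStream

    // Running total of records seen so far; lives in the driver because
    // foreachRDD executes its body on the driver once per batch.
    var globalCount = 0L

    def trackCount(dstream: DStream[String]): Unit = {
      // dstream.count() yields a DStream whose RDDs each hold a single Long:
      // the number of records in that batch.
      dstream.count().foreachRDD { rdd =>
        globalCount += rdd.first()   // add this batch's count to the running total
        println(s"Records so far: $globalCount")
      }
    }

Because globalCount is only ever updated on the driver, no accumulator is needed for this particular use.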

Re: Shared variable in Spark Streaming

2014-08-08 Thread Mayur Rustagi
You can also use the updateStateByKey interface to store this shared variable. As for the count, you can use foreachRDD to run counts on each RDD and then store that as another RDD, or put it into updateStateByKey. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi
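A rough sketch of the updateStateByKey variant (the single dummy key and the names used here are assumptions, not from the thread):

    import org.apache.spark.streaming.StreamingContext._   // pair DStream operations on older Spark versions
    import org.apache.spark.streaming.dstream.DStream

    // updateStateByKey keeps per-key state across batches; it requires
    // checkpointing to be enabled, e.g. ssc.checkpoint("hdfs:///tmp/ckpt").
    def runningCount(events: DStream[String]): DStream[(String, Long)] = {
      // Collapse every record onto one dummy key so the state is a single running total.
      events.map(_ => ("total", 1L))
        .updateStateByKey[Long] { (batch: Seq[Long], state: Option[Long]) =>
          Some(state.getOrElse(0L) + batch.sum)
        }
    }

The resulting DStream emits the cumulative count on every batch interval.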

Spark hang with mesos after many task failures

2014-08-08 Thread Xu Zhongxing
My Spark program just hangs there after Mesos sees many task failures. I read an earlier post https://groups.google.com/forum/#!msg/spark-users/RThPAN-5zX8/vuxXp27P5-MJ which says that we should set Mesos's failover_timeout; otherwise Mesos uses a rather long default failover timeout.

Re: Low Performance of Shark over Spark.

2014-08-08 Thread Mayur Rustagi
Hi Vinay, First of all you should probably migrate to Spark SQL, as Shark is not actively supported anymore. The 100x benefit comes from in-memory caching and the DAG execution model; since you are not able to cache, the performance can be quite low. Alternatives you can explore: 1. Use Parquet as storage, which will push down

Re: Where do my partitions go?

2014-08-08 Thread losmi83
Thanks for your answer. But the same problem appears if you start from one common RDD: val partitioner = new HashPartitioner(10); val dummyJob = sc.parallelize(0 until 10).map(x => (x, x)); dummyJob.partitionBy(partitioner).foreach { case (ind, x) => println("Dummy1 - Id = " +

Unable to access worker web UI or application UI (EC2)

2014-08-08 Thread sparkuser2345
I'm running spark 1.0.0 on EMR. I'm able to access the master web UI but not the worker web UIs or the application detail UI (Server not found). I added the following inbound rule to the ElasticMapreduce-slave security group but it didn't help: Type = All TCP Port range = 0 - 65535 Source = My

[GraphX] Is it normal to shuffle write 15GB while the data is only 30MB?

2014-08-08 Thread Bin
Hi All, I am running a customized label propagation using Pregel. After a few iterations, the program becomes slow and wastes a lot of time in mapPartitions (at GraphImpl.scala:184 or VertexRDD.scala:318, or VertexRDD.scala:323). And the amount of shuffle write reaches 15GB, while the size of

Time series in Spark / Spark Streaming

2014-08-08 Thread PiR
I would like to use Spark (and Spark streaming) to do some processing on time series. I have text files with many lines where each line contains a timestamp and values associated with this timestamp. Each timestamp is unique. Timestamps are ordered. I am considering them as keys. The lines in my

Re: Low Performance of Shark over Spark.

2014-08-08 Thread vinay.kashyap
Hi Mayur, I cannot use Spark SQL in this case because many of the aggregations are not supported yet. Hence I migrated back to using Shark, as all those aggregation functions are supported.

Re: How can I implement eigenvalue decomposition in Spark?

2014-08-08 Thread Chitturi Padma
Hi, I have a similar problem. I need matrix operations such as dot product, cross product, transpose, and matrix multiplication to be performed on Spark. Does Spark have a built-in API to support these? I see a matrix factorization implementation in MLlib. On Fri, Aug 8, 2014 at 12:38 PM, yaochunnan [via
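Not from the thread, but for reference: MLlib's RowMatrix (available as of Spark 1.0) supports multiplying a distributed matrix by a local one; a small sketch, with the data made up for illustration:

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.linalg.{Matrices, Vectors}
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    def multiplyExample(sc: SparkContext): RowMatrix = {
      // A 3x2 distributed matrix, one row per RDD element.
      val rows = sc.parallelize(Seq(
        Vectors.dense(1.0, 2.0),
        Vectors.dense(3.0, 4.0),
        Vectors.dense(5.0, 6.0)))
      val mat = new RowMatrix(rows)

      // Multiply by a local 2x2 matrix (values in column-major order);
      // the result is another 3x2 RowMatrix.
      val local = Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0))
      mat.multiply(local)
    }

General element-wise and transpose operations on distributed matrices are not covered by this API, so those still need to be expressed as plain RDD transformations.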

Re: JVM Error while building spark

2014-08-08 Thread Rasika Pohankar
Hello, Thanks a lot! I installed Maven 3.2.2 and the build worked with Maven. But I also got the prebuilt version to run, so I will be using the prebuilt version. Is there any downside to using the prebuilt version? Also, could you tell me what I would need to do if I had to build it without

How to detect mesos slave down in Spark programs

2014-08-08 Thread Xu Zhongxing
When I run a Spark program with Mesos and all slaves are down, the program just hangs there. I can check the Mesos UI or logs to see that all slaves are down. Is there a way to detect this situation in Spark programs automatically, or to detect it programmatically?

Re: How to use spark-cassandra-connector in spark-shell?

2014-08-08 Thread chutium
Try to add the following jars to the classpath:

Re: Unable to access worker web UI or application UI (EC2)

2014-08-08 Thread Akhil Das
It could be an issue with the way you access it. If you are able to see http://master-public-ip:8080, then ideally the application UI (if you haven't changed the default) will be available at http://master-public-ip:4040. Similarly, you can see the worker UIs at http://worker-public-ip:8081

Re: How can I implement eigenvalue decomposition in Spark?

2014-08-08 Thread Sean Owen
The SVD does not in general give you the eigenvalues of its input. Are you just trying to access the U and V matrices? They are also returned in the API, but they are not the eigenvectors of M, as you note. I don't think MLlib has anything to help with the general eigenvector problem. Maybe you can

Re: NoClassDefFoundError: org/codehaus/jackson/annotate/JsonClass with spark-submit

2014-08-08 Thread Nick Pentreath
By the way, for anyone using elasticsearch-hadoop, there is a fix for this here: https://github.com/elasticsearch/elasticsearch-hadoop/issues/239 Ryan - using the nightly snapshot build of 2.1.0.BUILD-SNAPSHOT fixed this for me. On Thu, Aug 7, 2014 at 3:58 PM, Nick Pentreath

Job ACL's on SPark

2014-08-08 Thread Manoj kumar
Hi Team, Do we have job ACLs for Spark similar to Hadoop job ACLs, where I can restrict who can submit jobs to the Spark master service? In our Hadoop cluster we enabled job ACLs by using job queues, restricting the default queues, and using the Fair Scheduler to manage the

Re: KMeans Input Format

2014-08-08 Thread AlexanderRiggers
Thanks for your answers. I added some lines to my code and it went through, but now I get an error message for my compute-cost call... scala> val WSSSE = model.computeCost(train) 14/08/08 15:48:42 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(driver, 192.168.0.33, 49242, 0)

Re: Spark Streaming + reduceByWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration

2014-08-08 Thread salemi
Thank you. I will look into setting up a hadoop hdfs node. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-reduceByWindow-reduceFunc-invReduceFunc-windowDuration-slideDuration-tp11591p11790.html Sent from the Apache Spark User List mailing

Re: Spark: Could not load native gpl library

2014-08-08 Thread Jikai Lei
Thanks. I tried this option, but still got the same error. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Could-not-load-native-gpl-library-tp11743p11791.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Shared variable in Spark Streaming

2014-08-08 Thread Soumitra Kumar
I want to keep track of the events processed in a batch. How come 'globalCount' works for a DStream? I think a similar construct won't work for an RDD; that's why there is the accumulator. On Fri, Aug 8, 2014 at 12:52 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Do you mean that you want a

Re: Spark Streaming + reduceByWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration

2014-08-08 Thread salemi
Is it possible to keep the events in memory rather than pushing them out to the file system? Ali -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-reduceByWindow-reduceFunc-invReduceFunc-windowDuration-slideDuration-tp11591p11792.html Sent

Re: Spark: Could not load native gpl library

2014-08-08 Thread Jikai Lei
Thanks Andrew. Actually my job did not use any data in .lzo format. Here is the program itself: import org.apache.spark._ import org.apache.spark.mllib.util.MLUtils import org.apache.spark.mllib.classification.LogisticRegressionWithSGD object Test { def main(args: Array[String]) { val

Re: KMeans Input Format

2014-08-08 Thread Sean Owen
(-incubator, +user) It's a method of KMeansModel, not KMeans. At first glance it looks like model should be a KMeansModel, but Scala says it's not. The problem is... val model = new KMeans().setInitializationMode("k-means||").setK(2).setMaxIterations(2).setEpsilon(1e-4).setRuns(1).run(train)
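For reference, a minimal sketch of the intended usage (the data setup and variable names here are assumptions): run() on the KMeans builder is what returns the KMeansModel that owns computeCost.

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
    import org.apache.spark.mllib.linalg.Vectors

    def trainAndScore(sc: SparkContext): Double = {
      val train = sc.parallelize(Seq(
        Vectors.dense(0.0, 0.0),
        Vectors.dense(1.0, 1.0),
        Vectors.dense(9.0, 8.0),
        Vectors.dense(8.0, 9.0)))

      // KMeans is the trainer; KMeansModel is what run() returns.
      val model: KMeansModel = new KMeans()
        .setInitializationMode("k-means||")
        .setK(2)
        .setMaxIterations(2)
        .setEpsilon(1e-4)
        .setRuns(1)
        .run(train)

      // computeCost (within-set sum of squared errors) is defined on the model.
      model.computeCost(train)
    }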

Re: questions about MLLib recommendation models

2014-08-08 Thread Jay Hutfles
Ah, that makes perfect sense. Thanks for the concise explanation! On Thu, Aug 7, 2014 at 9:14 PM, Xiangrui Meng men...@gmail.com wrote: ratings.map { case Rating(u, m, r) => { val pred = model.predict(u, m); (r - pred) * (r - pred) } }.mean() The code doesn't work because the
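The quoted explanation is cut off above. Whatever the exact reason given, a common way to compute the mean squared error without calling model.predict inside a task closure is to use the RDD-based predict and join the results back; a sketch, with names assumed:

    import org.apache.spark.SparkContext._   // pair-RDD and double-RDD operations on older Spark versions
    import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}
    import org.apache.spark.rdd.RDD

    def meanSquaredError(model: MatrixFactorizationModel, ratings: RDD[Rating]): Double = {
      // Ask the model to predict all (user, product) pairs at once, on the cluster.
      val usersProducts = ratings.map { case Rating(u, p, _) => (u, p) }
      val predictions = model.predict(usersProducts)
        .map { case Rating(u, p, pred) => ((u, p), pred) }

      // Join predictions back to the observed ratings and average the squared error.
      ratings.map { case Rating(u, p, r) => ((u, p), r) }
        .join(predictions)
        .map { case (_, (r, pred)) => (r - pred) * (r - pred) }
        .mean()
    }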

Minimum Split of Hadoop RDD

2014-08-08 Thread Deep Pradhan
Hi, I am using a single-node Spark cluster on HDFS. When I was going through the SparkPageRank.scala code, I came across the following line: *val lines = ctx.textFile(args(0), 1)* where args(0) is the path of the input file on HDFS, and the second argument is the minimum split of the Hadoop
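For context (not from the thread): the second argument to textFile is the minimum number of partitions to create; a small sketch of how it changes partitioning:

    import org.apache.spark.SparkContext

    def loadWithSplits(sc: SparkContext, path: String): (Int, Int) = {
      // Default: roughly one partition per HDFS block (with a small minimum).
      val byBlocks = sc.textFile(path)

      // Ask for at least 8 partitions, e.g. for more parallelism on a small file.
      // This is only a hint for the minimum; non-splittable files (e.g. gzip)
      // still end up in a single partition.
      val atLeast8 = sc.textFile(path, 8)

      (byBlocks.partitions.length, atLeast8.partitions.length)
    }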

error with pyspark

2014-08-08 Thread Baoqiang Cao
Hi There I ran into a problem and can’t find a solution. I was running bin/pyspark ../python/wordcount.py The wordcount.py is here: import sys from operator import add from pyspark import SparkContext datafile = '/mnt/data/m1.txt' sc =

Re: Use SparkStreaming to find the max of a dataset?

2014-08-08 Thread bumble123
Do you know how I might do a percentile then? I can't figure out how to order my data and count it so that I can calculate the percentile. -- View this message in context:

Re: Use SparkStreaming to find the max of a dataset?

2014-08-08 Thread bumble123
Also, I tried that code and I keep getting this error: console:26: error: overloaded method value max with alternatives: (x$1: Double,x$2: Double)Double and (x$1: Float,x$2: Float)Float and (x$1: Long,x$2: Long)Long and (x$1: Int,x$2: Int)Int cannot be applied to (String, Int.type)

Re: scopt.OptionParser

2014-08-08 Thread SK
I was using sbt package when I got this error. Then I switched to using sbt assembly and that solved the issue. To run sbt assembly, you need a file called plugins.sbt in the project root/project directory containing the following line: addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

Re: scopt.OptionParser

2014-08-08 Thread Xiangrui Meng
Thanks for posting the solution! You can also append `% provided` to the `spark-mllib` dependency line and remove `spark-core` (because spark-mllib already depends on spark-core) to make the assembly jar smaller. -Xiangrui On Fri, Aug 8, 2014 at 10:05 AM, SK skrishna...@gmail.com wrote: i was
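A sketch of what the resulting build definition might look like (the name and versions below are illustrative assumptions):

    // project/plugins.sbt
    // addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

    // build.sbt -- spark-mllib pulls in spark-core transitively, and marking it
    // "provided" keeps Spark's own classes out of the assembly jar, since the
    // cluster already ships them.
    name := "my-spark-app"

    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-mllib" % "1.0.2" % "provided"
    )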

Re: Use SparkStreaming to find the max of a dataset?

2014-08-08 Thread bumble123
Just realized that my DStream was being read in as a String stream. I'm trying to use socketTextStream, but the .toInt method doesn't seem to be working. Is there another way to get a numerical stream from a socket? -- View this message in context:

Custom Transformations in Spark

2014-08-08 Thread Jeevak Kasarkod
Is it possible to create custom transformations in Spark? For example, data security transforms such as encrypt and decrypt. Ideally it's something one would like to reuse across Spark Streaming, Spark SQL, and core Spark.

Re: How can I implement eigenvalue decomposition in Spark?

2014-08-08 Thread Li Pu
@Miles, eigen-decomposition of an asymmetric matrix doesn't always give real-valued solutions, and it doesn't have the nice properties that a symmetric matrix has. Usually you want to symmetrize your asymmetric matrix in some way, e.g. see

Spark Streaming worker underutilized?

2014-08-08 Thread maddenpj
I currently have a 4-node Spark setup, 1 master and 3 workers, running in Spark standalone mode. I am stress testing a Spark application I wrote that reads data from Kafka and puts it into Redshift. I'm pretty happy with the performance (reading about 6k messages per second out of Kafka)

Re: PySpark + executor lost

2014-08-08 Thread Avishek Saha
So I think I have a better idea of the problem now. The environment is yarn-client, and IIRC PySpark doesn't run in yarn-cluster mode. So my client is heavily loaded, which causes it to lose a lot of executors, which might be part of the problem. Btw, are there any plans to support PySpark on YARN clusters

Re: Lost executors

2014-08-08 Thread Avishek Saha
Same here, Ravi. See my post on a similar thread. Are you running in yarn-client mode? On Aug 7, 2014 2:56 PM, rpandya r...@iecommerce.com wrote: I'm running into a problem with executors failing, and it's not clear what's causing it. Any suggestions on how to diagnose and fix it would be

Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-08-08 Thread Jenny Zhao
Hi, I am able to run my HQL query in yarn-cluster mode when connecting to the default Hive metastore defined in hive-site.xml. However, if I want to switch to a different database, like hql("use other-database"), it only works in yarn-client mode, but fails in yarn-cluster mode with the

Re: Use SparkStreaming to find the max of a dataset?

2014-08-08 Thread bumble123
Figured it out! Just mapped it to a .toInt version of itself. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Use-SparkStreaming-to-find-the-max-of-a-dataset-tp11734p11812.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
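Presumably something along these lines (a sketch; the host, port, and stream setup are assumptions):

    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.dstream.DStream

    def numericStream(ssc: StreamingContext): DStream[Int] = {
      // socketTextStream delivers lines of text, so parse each line into an Int
      // before doing numeric operations such as max or sum.
      ssc.socketTextStream("localhost", 9999)
        .map(_.trim.toInt)
    }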

RDD partitioner or repartition examples?

2014-08-08 Thread buntu
I'm processing about 10GB of tab-delimited raw data with a few fields (page and user id, along with the timestamp when the user viewed the page) using a 40-node cluster, and using Spark SQL to compute the number of unique visitors per page at various intervals. I'm currently just reading the data as

Re: PySpark + executor lost

2014-08-08 Thread Sandy Ryza
Hi Avishek, As of Spark 1.0, PySpark does in fact run on YARN. -Sandy On Fri, Aug 8, 2014 at 12:47 PM, Avishek Saha avishek.s...@gmail.com wrote: So I think I have a better idea of the problem now. The environment is YARN client and IIRC PySpark doesn't run on YARN cluster. So my client

Re: PySpark + executor lost

2014-08-08 Thread Avishek Saha
You mean yarn-cluster, right? Also, my jobs run through all their stages just fine, but everything crashes when I do a saveAsTextFile. On 8 August 2014 13:24, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Avishek, As of Spark 1.0, PySpark does in fact run on YARN. -Sandy On Fri, Aug 8,

Does Spark 1.0.1 stil collect results in serial???

2014-08-08 Thread makevnin
Hey, I was reading the Berkeley paper "Spark: Cluster Computing with Working Sets" and came across a sentence which is bothering me. Currently I am trying to run a Python script on Spark which executes a parallel k-means ... my problem is ... after the algorithm finishes working with the dataset (ca.

Re: PySpark + executor lost

2014-08-08 Thread Avishek Saha
Btw, I get this for Spark 1.0.2, so I guess yarn-cluster is still not supported for PySpark: Error: Cluster deploy mode is currently not supported for python. Run with --help for usage help or --verbose for debug

Re: Does Spark 1.0.1 stil collect results in serial???

2014-08-08 Thread Xiangrui Meng
For the reduce/aggregate question, the driver collects results in sequence. We now use tree aggregation in MLlib to reduce the driver's load: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/RDDFunctions.scala#L89 It is faster than aggregate when there are many
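Roughly, tree aggregation combines partition results in extra rounds on the executors before the final values reach the driver. A sketch of how it can be used, assuming the MLlib RDDFunctions implicit (later Spark versions expose treeAggregate on RDD directly):

    import org.apache.spark.mllib.rdd.RDDFunctions._
    import org.apache.spark.rdd.RDD

    def sumOfSquares(data: RDD[Double]): Double = {
      // With depth 2, partition results are first merged into a handful of
      // intermediate values on the executors, so the driver only combines a few.
      data.treeAggregate(0.0)(
        (acc, x) => acc + x * x,   // fold each element into the partition's accumulator
        (a, b) => a + b,           // merge accumulators
        2)                         // depth of the aggregation tree
    }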

Re: Lost executors

2014-08-08 Thread rpandya
Hi Avishek, I'm running on a manual cluster setup, and all the code is Scala. The load averages don't seem high when I see these failures (about 12 on a 16-core machine). Ravi -- View this message in context:

Re: How to use spark-cassandra-connector in spark-shell?

2014-08-08 Thread Thomas Nieborowski
near bottom: http://tobert.github.io/post/2014-07-15-installing-cassandra-spark-stack.html On Fri, Aug 8, 2014 at 2:00 AM, chutium teng@gmail.com wrote: try to add following jars in classpath

Re: PySpark + executor lost

2014-08-08 Thread Sandy Ryza
What exactly do you mean by YARN cluster? Do you mean running Spark against a YARN cluster in general, or specifically in yarn-cluster mode, where the driver runs inside the Spark application master? Also, what error are you seeing in your executors? -Sandy On Fri, Aug 8, 2014 at 2:00 PM,

Executors for Spark shell take much longer to be ready

2014-08-08 Thread durin
I recently moved my Spark installation from one Linux user to another, i.e. changed the folder and the ownership of the files. That was all; no other settings were changed and no different machines were used. However, now it suddenly takes three minutes to have all executors in the Spark shell

Re: How can I implement eigenvalue decomposition in Spark?

2014-08-08 Thread x
Generally, an adjacency matrix for a social network is undirected (symmetric), so you can get eigenvectors from the computed SVD result, A = U D V^T. The first column of U is the eigenvector corresponding to the largest eigenvalue, the first value of D. xj @ Tokyo On Sat, Aug 9, 2014 at 4:08 AM, Li Pu
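A rough sketch of that approach with MLlib's RowMatrix (assuming the matrix has already been symmetrized; names are illustrative):

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.rdd.RDD

    // For a symmetric matrix A, the SVD A = U D V^T gives U as the eigenvectors
    // (up to sign) and D as the absolute values of the eigenvalues.
    def topEigen(symmetricRows: RDD[Vector], k: Int) = {
      val mat = new RowMatrix(symmetricRows)
      val svd = mat.computeSVD(k, computeU = true)
      (svd.s, svd.U)   // top-k singular values and the corresponding vectors
    }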

Spark SQL dialect

2014-08-08 Thread Sathish Kumaran Vairavelu
Hi, Can anyone point me to where I can find the SQL dialect for Spark SQL? Unlike HQL, there are a lot of tasks involved in creating and querying tables, which is very cumbersome. If we have to fire multiple queries on tens or hundreds of tables, it is very difficult at this point. Given Spark

Re: Unit Test for Spark Streaming

2014-08-08 Thread JiajiaJing
Hi TD, I tried some different Maven setups these days, and now I at least get some output when running mvn test. However, it seems like ScalaTest cannot find the test cases specified in the test suite. Here is the output I get:

Re: Custom Transformations in Spark

2014-08-08 Thread Tathagata Das
You can always define an arbitrary RDD-to-RDD function and use it from both Spark and Spark Streaming. For example, def myTransformation(rdd: RDD[X]): RDD[Y] = { ... } In Spark you can obviously apply it to an RDD. In Spark Streaming, you can apply it to the RDDs of a DStream by
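The message is cut off here; presumably it continues with DStream.transform. A minimal sketch of the pattern (the "encrypt" body is a stand-in assumption, not a real cipher):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.dstream.DStream

    // One reusable RDD-to-RDD transformation, e.g. a toy "encrypt" step.
    def encrypt(rdd: RDD[String]): RDD[String] =
      rdd.map(line => line.reverse)   // placeholder for a real encryption routine

    // Batch: apply it to an RDD directly.
    def encryptBatch(data: RDD[String]): RDD[String] = encrypt(data)

    // Streaming: apply the same function to each RDD of a DStream via transform.
    def encryptStream(stream: DStream[String]): DStream[String] =
      stream.transform(encrypt _)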

increase parallelism of reading from hdfs

2014-08-08 Thread Chen Song
In Spark Streaming, StreamingContext.fileStream gives a FileInputDStream. Within each batch interval, it launches map tasks for the new files detected during that interval. It appears that the way Spark computes the number of map tasks is based on the block size of the files. Below is the quote from

OOM writing out sorted RDD

2014-08-08 Thread Bharath Ravi Kumar
Our prototype application reads a 20GB dataset from HDFS (nearly 180 partitions), groups it by key, sorts by rank, and writes it out to HDFS in that order. The job runs against two nodes (16G, 24 cores per node available to the job). I noticed that the execution plan results in two sortByKey stages,