Re: SparkR : lapplyPartition transforms the data in vertical format

2014-08-07 Thread Pranay Dave
Hello Shivram, Thanks for your reply. Here is a simple data set input. This data is in a file called /sparkdev/datafiles/covariance.txt: 1,1 2,2 3,3 4,4 5,5 6,6 7,7 8,8 9,9 10,10 The output I would like to see is a total of the columns. It can be done with reduce, but I wanted to test lapply. Output I

Re: Using Python IDE for Spark Application Development

2014-08-07 Thread Mohit Singh
Take a look at this gist https://gist.github.com/bigaidream/40fe0f8267a80e7c9cf8 That worked for me. On Wed, Aug 6, 2014 at 7:32 PM, Sathish Kumaran Vairavelu vsathishkuma...@gmail.com wrote: Mohit, This doesn't seem to be working; can you please provide more details? when I use from pyspark

Re: Spark Streaming + reduceByWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration

2014-08-07 Thread Tathagata Das
Okay, going back to your original question, it wasn't clear what reduce function you are trying to implement. Going by the 2nd example using the window() operation, followed by a count+filter (using sql), I am guessing you are trying to maintain a count of all the active states in the

Re: PySpark, numpy arrays and binary data

2014-08-07 Thread Rok Roskar
thanks for the quick answer! numpy arrays can only support basic types, so we cannot use them during collect() by default. Sure, but if you knew that a numpy array went in on one end, you could safely use it on the other end, no? Perhaps it would require an extension of the RDD class and

Re: SparkR : lapplyPartition transforms the data in vertical format

2014-08-07 Thread Zongheng Yang
Hi Pranay, If this data format is to be assumed, then I believe the issue starts at lines <- textFile(sc,/sparkdev/datafiles/covariance.txt) totals <- lapply(lines, function(lines) After the first line, `lines` becomes an RDD of strings, each of which is a line of the form 1,1.

Re: Naive Bayes parameters

2014-08-07 Thread SK
I followed the example in examples/src/main/scala/org/apache/spark/examples/mllib/SparseNaiveBayes.scala. In this file Params is defined as follows: case class Params ( input: String = null, minPartitions: Int = 0, numFeatures: Int = -1, lambda: Double = 1.0) In the main

Re: [GraphX] Can't zip RDDs with unequal numbers of partitions

2014-08-07 Thread Bin
OK, I think I've figured it out. It seems to be a bug which has been reported at: https://issues.apache.org/jira/browse/SPARK-2823 and https://github.com/apache/spark/pull/1763. As it says: If the users set “spark.default.parallelism” and the value is different with the EdgeRDD partition

Re: Regularization parameters

2014-08-07 Thread SK
Hi, I am following the code in examples/src/main/scala/org/apache/spark/examples/mllib/BinaryClassification.scala For setting the parameters and parsing the command line options, I am just reusing that code. Params is defined as follows. case class Params( input: String = null,

Re: Naive Bayes parameters

2014-08-07 Thread Xiangrui Meng
It is used in data loading: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/SparseNaiveBayes.scala#L76 On Thu, Aug 7, 2014 at 12:47 AM, SK skrishna...@gmail.com wrote: I followed the example in

Re: Spark Streaming- Input from Kafka, output to HBase

2014-08-07 Thread Khanderao Kand
I hope this has been resolved. Were you connected to the right zookeeper? Did Kafka and HBase share the same zookeeper and port? If not, did you set the right config for the HBase job? -- Khanderao On Wed, Jul 2, 2014 at 4:12 PM, JiajiaJing jj.jing0...@gmail.com wrote: Hi, I am trying to write a program

Re: Spark with HBase

2014-08-07 Thread Akhil Das
You can download and compile spark against your existing hadoop version. Here's a quick start https://spark.apache.org/docs/latest/cluster-overview.html#cluster-manager-types You can also read a bit here http://docs.sigmoidanalytics.com/index.php/Installing_Spark_andSetting_Up_Your_Cluster ( the

Re: spark streaming actor receiver doesn't play well with kryoserializer

2014-08-07 Thread Rohit Rai
Alan/TD, We are facing this problem in a project going to production. Was there any progress on this? Are we able to confirm that this is a bug/limitation in the current streaming code? Or is there anything wrong in user scope? Regards, Rohit Founder CEO, Tuplejump, Inc.

Got error “java.lang.IllegalAccessError” when using HiveContext in Spark shell on AWS

2014-08-07 Thread Zhun Shen
Hi, When I try to use HiveContext in Spark shell on AWS, I got the error java.lang.IllegalAccessError: tried to access method com.google.common.collect.MapMaker.makeComputingMap(Lcom/google/common/base/Function;)Ljava/util/concurrent/ConcurrentMap. I follow the steps below to compile and install

Re: Got error “java.lang.IllegalAccessError” when using HiveContext in Spark shell on AWS

2014-08-07 Thread Cheng Lian
Hey Zhun, Thanks for the detailed problem description. Please see my comments inlined below. On Thu, Aug 7, 2014 at 6:18 PM, Zhun Shen shenzhunal...@gmail.com wrote: Caused by: java.lang.IllegalAccessError: tried to access method

Re: Bad Digest error while doing aws s3 put

2014-08-07 Thread lmk
This was a completely misleading error message. The problem was due to a log message getting dumped to stdout. This was accumulating on the workers and hence there was no space left on the device after some time. When I re-tested with spark-0.9.1, the saveAsTextFile api threw no space

Re: Spark SQL

2014-08-07 Thread vdiwakar.malladi
Thanks for your response. I am now able to compile my code. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-tp11618p11644.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

reduceByKey to get all associated values

2014-08-07 Thread Konstantin Kudryavtsev
Hi there, I'm interested in whether it is possible to get the same behavior as the reduce function from the MR framework. I mean, for each key K, get the list of associated values List<V>. There is the function reduceByKey, but it works only with separate V's from the list. Does any way exist to get the whole list? Because I have to

How can I implement eigenvalue decomposition in Spark?

2014-08-07 Thread yaochunnan
Our lab needs to do some simulation on online social networks. We need to handle a 5000*5000 adjacency matrix, namely, to get its largest eigenvalue and corresponding eigenvector. Matlab can be used but it is time-consuming. Is Spark effective in linear algebra calculations and transformations?

Re: Spark with HBase

2014-08-07 Thread chutium
These two posts should be good for setting up a spark+hbase environment and using the results of an hbase table scan as an RDD. Settings: http://www.abcn.net/2014/07/lighting-spark-with-hbase-full-edition.html some samples: http://www.abcn.net/2014/07/spark-hbase-result-keyvalue-bytearray.html -- View
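For reference, a minimal sketch of the pattern those posts describe, reading an HBase table scan into an RDD via TableInputFormat (Spark 1.x with the HBase client on the classpath; the table name my_table is a placeholder):

  import org.apache.hadoop.hbase.HBaseConfiguration
  import org.apache.hadoop.hbase.client.Result
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable
  import org.apache.hadoop.hbase.mapreduce.TableInputFormat
  import org.apache.spark.SparkContext

  val sc = new SparkContext("local[2]", "hbase-scan")

  // Tell TableInputFormat which table to scan; "my_table" is a placeholder.
  val hbaseConf = HBaseConfiguration.create()
  hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")

  // Each element is (row key, Result); pull the columns you need out of Result.
  val hbaseRDD = sc.newAPIHadoopRDD(hbaseConf,
    classOf[TableInputFormat],
    classOf[ImmutableBytesWritable],
    classOf[Result])

  println("rows scanned: " + hbaseRDD.count())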

Re: reduceByKey to get all associated values

2014-08-07 Thread Cheng Lian
You may use groupByKey in this case. On Aug 7, 2014, at 9:18 PM, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote: Hi there, I'm interested in whether it is possible to get the same behavior as the reduce function from the MR framework. I mean, for each key K, get the list of associated

Low Performance of Shark over Spark.

2014-08-07 Thread vinay . kashyap
Dear all, I am using Spark 0.9.2 in Standalone mode. Hive and HDFS in CDH 5.1.0. 6 worker nodes each with memory 96GB and 32 cores. I am using Shark Shell to execute queries on Spark. I have a raw_table ( of size 3TB with replication 3 ) which is partitioned by year, month and day. I am running

Re: NoClassDefFoundError: org/codehaus/jackson/annotate/JsonClass with spark-submit

2014-08-07 Thread Nick Pentreath
I'm also getting this - Ryan, we both seem to be running into this issue with elasticsearch-hadoop :) I tried spark.files.userClassPathFirst true on the command line and that doesn't work. If I put that line in spark/conf/spark-defaults it works, but now I'm getting: java.lang.NoClassDefFoundError:

Re: Spark Hbase job taking long time

2014-08-07 Thread Ted Yu
Forgot to include user@ Another email from Amit indicated that there is 1 region in his table. This wouldn't give you the benefit TableInputFormat is expected to deliver. Please split your table into multiple regions. See http://hbase.apache.org/book.html#d3593e6847 and related links. Cheers

Re: reduceByKey to get all associated values

2014-08-07 Thread chutium
A long time ago, at Spark Summit 2013, Patrick Wendell said in his talk about performance (http://spark-summit.org/talk/wendell-understanding-the-performance-of-spark-applications/) that reduceByKey will be more efficient than groupByKey... he mentioned groupByKey copies all the data over the network.

Re: How to read a multipart s3 file?

2014-08-07 Thread sparkuser2345
sparkuser2345 wrote I'm using Spark 1.0.0. The same works when - Using Spark 0.9.1. - Saving to and reading from local file system (Spark 1.0.0) - Saving to and reading from HDFS (Spark 1.0.0) -- View this message in context:

Re: Save an RDD to a SQL Database

2014-08-07 Thread 诺铁
I haven't seen people write directly to a SQL database, mainly because it's difficult to deal with failure. What if the network breaks halfway through the process? Should we drop all data in the database and restart from the beginning? If the process is appending data to the database, then things become even more complex.

[Compile error] Spark 1.0.2 against cloudera 2.0.0-cdh4.6.0 error

2014-08-07 Thread linkpatrickliu
Hi, Following the document: # Cloudera CDH 4.2.0 mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.2.0 -DskipTests clean package I compile Spark 1.0.2 with this cmd: mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.6.0 -DskipTests clean package However, I got two errors: [INFO] Compiling 14 Scala

Re: reduceByKey to get all associated values

2014-08-07 Thread Cheng Lian
The point is that in many cases the operation passed to reduceByKey aggregates data into a much smaller size, say + and * for integers. String concatenation doesn’t actually “shrink” data, thus in your case rdd.reduceByKey(_ ++ _) and rdd.groupByKey suffer from a similar performance issue. In general,
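To make the distinction concrete, a small sketch with made-up data: reduceByKey shines when the function shrinks the values, while collecting every value per key is exactly what groupByKey does:

  import org.apache.spark.SparkContext
  import org.apache.spark.SparkContext._  // pair-RDD functions in Spark 1.x

  val sc = new SparkContext("local[2]", "reduce-vs-group")
  val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3), ("a", 4)))

  // Each key collapses to a single Int; partial sums are combined map-side
  // before the shuffle, so little data moves over the network.
  val sums = pairs.reduceByKey(_ + _)      // ("a", 7), ("b", 3)

  // Every value is kept and shipped to the reducer that owns its key.
  val grouped = pairs.groupByKey()         // ("a", [1, 2, 4]), ("b", [3])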

Re: Save an RDD to a SQL Database

2014-08-07 Thread Cheng Lian
Maybe a little off topic, but would you mind sharing your motivation for saving the RDD into an SQL DB? If you’re just trying to do further transformations/queries with SQL for convenience, then you may just use Spark SQL directly within your Spark application without saving them into a DB:

Re: Save an RDD to a SQL Database

2014-08-07 Thread Nicholas Chammas
On Thu, Aug 7, 2014 at 11:08 AM, 诺铁 noty...@gmail.com wrote: What if the network breaks halfway through the process? Should we drop all data in the database and restart from the beginning? The best way to deal with this -- which, unfortunately, is not commonly supported -- is with a two-phase commit that can

spark streaming multiple file output paths

2014-08-07 Thread Chen Song
In Spark Streaming, is there a way to write output to different paths based on the partition key? The saveAsTextFiles method will write output in the same directory. For example, if the partition key has an hour/day column and I want to separate DStream output into different directories by

Re: Save an RDD to a SQL Database

2014-08-07 Thread Nicholas Chammas
On Thu, Aug 7, 2014 at 11:25 AM, Cheng Lian lian.cs@gmail.com wrote: Maybe a little off topic, but would you mind sharing your motivation for saving the RDD into an SQL DB? Many possible reasons (Vida, please chime in with yours!): - You have an existing database you want to load new

JVM Error while building spark

2014-08-07 Thread Rasika Pohankar
Hello, I am trying to build Apache Spark version 1.0.1 on Ubuntu 12.04 LTS. After unzipping the file and running sbt/sbt assembly I get the following error : rasika@rasikap:~/spark-1.0.1$ sbt/sbt package Error occurred during initialization of VM Could not reserve enough space for object heap

Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread contractor
Hello all, I am not sure what is going on – I am getting a NotSerializedException and initially I thought it was due to not registering one of my classes with Kryo but that doesn’t seem to be the case. I am essentially eliminating duplicates in a spark streaming application by using a “window”

Re: KMeans Input Format

2014-08-07 Thread Burak Yavuz
Hi, Could you try running spark-shell with the flag --driver-memory 2g or more if you have more RAM available and try again? Thanks, Burak - Original Message - From: AlexanderRiggers alexander.rigg...@gmail.com To: u...@spark.incubator.apache.org Sent: Thursday, August 7, 2014 7:37:40

Initial job has not accepted any resources

2014-08-07 Thread arnaudbriche
Hi, I'm trying a simple thing: create an RDD from a text file (~3GB) located in GlusterFS, which is mounted by all Spark cluster machines, and calling rdd.count(); but Spark never managed to complete the job, giving messages like the following: WARN TaskSchedulerImpl: Initial job has not accepted

Re: reduceByKey to get all associated values

2014-08-07 Thread Evan R. Sparks
Specifically, reduceByKey expects a commutative/associative reduce operation, and will automatically do this locally before a shuffle, which means it acts like a combiner in MapReduce terms - http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions On Thu,

Re: Initial job has not accepted any resources

2014-08-07 Thread Marcelo Vanzin
There are two problems that might be happening: - You're requesting more resources than the master has available, so your executors are not starting. Given your explanation this doesn't seem to be the case. - The executors are starting, but are having problems connecting back to the driver. In

Re: Save an RDD to a SQL Database

2014-08-07 Thread chutium
Right, Spark acts more like an OLAP system; I believe no one will use Spark as an OLTP system, so there is always some question about how to share data between these two platforms efficiently. More importantly, most enterprise BI tools rely on an RDBMS, or at least on a JDBC/ODBC interface

Re: How to read a multipart s3 file?

2014-08-07 Thread paul
darkjh wrote: But in my experience, when reading directly from s3n, spark creates only 1 input partition per file, regardless of the file size. This may lead to some performance problems if you have big files. This is actually not true; Spark uses the underlying hadoop input formats to read the

Re: Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread Tathagata Das
It could be because of the variable enableOpStat. Since it's defined outside foreachRDD, referring to it inside the rdd.foreach is probably causing the whole streaming context to be included in the closure. Scala funkiness. Try this, see if it works. msgCount.join(ddCount).foreachRDD((rdd:

Re: spark streaming multiple file output paths

2014-08-07 Thread Tathagata Das
The problem boils down to how to write an RDD in that way. You could use the HDFS Filesystem API to write each partition directly. pairRDD.groupByKey().foreachPartition(iterator => iterator.map { case (key, values) => // Open an output stream to destination file base-path/key/whatever
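A fuller sketch of that approach, assuming a DStream[(String, String)] named pairDStream, an HDFS base path, and one file per key per partition (all of these names and paths are placeholders):

  import java.net.URI
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}
  import org.apache.spark.streaming.StreamingContext._  // pair-DStream functions in Spark 1.x

  val basePath = "hdfs:///output/base-path"  // placeholder destination

  pairDStream.groupByKey().foreachRDD { rdd =>
    rdd.foreachPartition { iterator =>
      // One FileSystem handle per partition, reused for every key in it.
      val fs = FileSystem.get(new URI(basePath), new Configuration())
      iterator.foreach { case (key, values) =>
        val out = fs.create(new Path(s"$basePath/$key/part-${java.util.UUID.randomUUID}"))
        values.foreach(v => out.write((v + "\n").getBytes("UTF-8")))
        out.close()
      }
    }
  }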

Re: Spark Streaming- Input from Kafka, output to HBase

2014-08-07 Thread Tathagata Das
For future reference in this thread, a better set of examples than the MetricAggregatorHBase on the JIRA to look at are here https://github.com/tmalaska/SparkOnHBase On Thu, Aug 7, 2014 at 1:41 AM, Khanderao Kand khanderao.k...@gmail.com wrote: I hope this has been resolved, were u

Spark Streaming Workflow Validation

2014-08-07 Thread Dan H.
I wanted to post for validation, to understand if there is a more efficient way to achieve my goal. I'm currently performing this flow for two distinct calculations executing in parallel: 1) Sum key/value pairs by using a simple witnessed count (apply 1 to a mapToPair() and then groupByKey()) 2)

Re: spark streaming actor receiver doesn't play well with kryoserializer

2014-08-07 Thread Tathagata Das
Another possible reason behind this maybe that there are two versions of Akka present in the classpath, which are interfering with each other. This could happen through many scenarios. 1. Launching Spark application with Scala brings in Akka from Scala, which interferes with Spark's Akka 2.

Re: How can I implement eigenvalue decomposition in Spark?

2014-08-07 Thread Sean Owen
(-incubator, +user) If your matrix is symmetric (and real I presume), and if my linear algebra isn't too rusty, then its SVD is its eigendecomposition. The SingularValueDecomposition object you get back has U and V, both of which have columns that are the eigenvectors. There are a few SVDs in

Re: KMeans Input Format

2014-08-07 Thread Sean Owen
It's not running out of memory on the driver though, right? The executors may need more memory, or use more executors. --executor-memory would let you increase from the default of 512MB. On Thu, Aug 7, 2014 at 5:07 PM, Burak Yavuz bya...@stanford.edu wrote: Hi, Could you try running

Re: Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread contractor
As a follow-up, I commented out that entire block of code and I am still getting the exception. It may be related to what you are suggesting, so are there any best practices I can use to audit other parts of the code? Thanks, Mahesh From: Padmanabhan, Mahesh Padmanabhan

Re: JVM Error while building spark

2014-08-07 Thread Sean Owen
(-incubator, +user) It's not Spark running out of memory, but SBT, so those env variables have no effect. They're options for Spark at runtime anyway, not compile time, and you're intending to compile, I take it. SBT is a memory hog, and Spark is a big build. You will probably need to give it more

Re: How to read a multipart s3 file?

2014-08-07 Thread sparkuser2345
Ashish Rangole wrote Specify a folder instead of a file name for input and output code, as in: Output: s3n://your-bucket-name/your-data-folder Input: (when consuming the above output) s3n://your-bucket-name/your-data-folder/* Unfortunately no luck: Exception in thread main

Re: Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread Tathagata Das
Can you enable the java flag -Dsun.io.serialization.extendedDebugInfo=true for the driver in your driver startup script? That should give an indication of the sequence of object references that lead to the StreamingContext being included in the closure. TD On Thu, Aug 7, 2014 at 10:23 AM,

Re: [Compile error] Spark 1.0.2 against cloudera 2.0.0-cdh4.6.0 error

2014-08-07 Thread Marcelo Vanzin
Can you try with -Pyarn instead of -Pyarn-alpha? I'm pretty sure CDH4 ships with the newer Yarn API. On Thu, Aug 7, 2014 at 8:11 AM, linkpatrickliu linkpatrick...@live.com wrote: Hi, Following the document: # Cloudera CDH 4.2.0 mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.2.0 -DskipTests

Re: Regularization parameters

2014-08-07 Thread Xiangrui Meng
Then this may be a bug. Do you mind sharing the dataset that we can use to reproduce the problem? -Xiangrui On Thu, Aug 7, 2014 at 1:20 AM, SK skrishna...@gmail.com wrote: Spark 1.0.1 thanks -- View this message in context:

Re: Low Performance of Shark over Spark.

2014-08-07 Thread Xiangrui Meng
Did you cache the table? There are a couple of ways of caching a table in Shark: https://github.com/amplab/shark/wiki/Shark-User-Guide On Thu, Aug 7, 2014 at 6:51 AM, vinay.kash...@socialinfra.net wrote: Dear all, I am using Spark 0.9.2 in Standalone mode. Hive and HDFS in CDH 5.1.0. 6 worker

Re: How to read a multipart s3 file?

2014-08-07 Thread Sean Owen
That won't be it, since you can see from the directory listing that there are no data files under test -- only _ files and dirs. The output looks like it was written, or partially written at least, but didn't finish, in that the part-* files were never moved to the target dir. I don't know why,

Re: How can I implement eigenvalue decomposition in Spark?

2014-08-07 Thread Evan R. Sparks
Reza Zadeh has contributed the distributed implementation of (Tall/Skinny) SVD (http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html), which is in MLlib (Spark 1.0) and a distributed sparse SVD coming in Spark 1.1. (https://issues.apache.org/jira/browse/SPARK-1782). If your data

Re: Save an RDD to a SQL Database

2014-08-07 Thread Vida Ha
The use case I was thinking of was outputting calculations made in Spark into a SQL database for the presentation layer to access. So in other words, having a Spark backend in Java that writes to a SQL database and then having a Rails front-end that can display the data nicely. On Thu, Aug 7,

Re: How can I implement eigenvalue decomposition in Spark?

2014-08-07 Thread Li Pu
@Miles, the latest SVD implementation in mllib is partially distributed. Matrix-vector multiplication is computed among all workers, but the right singular vectors are all stored in the driver. If your symmetric matrix is n x n and you want the first k eigenvalues, you will need to fit n x k
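For reference, a minimal sketch of that route with the MLlib API in Spark 1.0+; rows is assumed to be an RDD[Vector] holding the rows of the (real, symmetric, positive semi-definite) matrix and k the number of leading eigenpairs wanted:

  import org.apache.spark.mllib.linalg.Vector
  import org.apache.spark.mllib.linalg.distributed.RowMatrix
  import org.apache.spark.rdd.RDD

  def topEigen(rows: RDD[Vector], k: Int) = {
    val mat = new RowMatrix(rows)
    // For a symmetric positive semi-definite matrix, the singular values equal
    // the eigenvalues, and the right singular vectors V (an n x k local matrix
    // kept on the driver) are the eigenvectors.
    val svd = mat.computeSVD(k, computeU = false)
    (svd.s, svd.V)
  }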

Re: Spark Streaming Workflow Validation

2014-08-07 Thread Tathagata Das
I am not sure if it is a typo or not, but how are you using groupByKey to get the summed_values? Assuming you meant reduceByKey(), these workflows seem pretty efficient. TD On Thu, Aug 7, 2014 at 10:18 AM, Dan H. dch.ema...@gmail.com wrote: I wanted to post for validation to understand

trouble with saveAsParquetFile

2014-08-07 Thread Brad Miller
Hi All, I'm having a bit of trouble with nested data structures in pyspark with saveAsParquetFile. I'm running master (as of yesterday) with this pull request added: https://github.com/apache/spark/pull/1802. *# these all work* sqlCtx.jsonRDD(sc.parallelize(['{record:

Re: Save an RDD to a SQL Database

2014-08-07 Thread Nicholas Chammas
Vida, What kind of database are you trying to write to? For example, I found that for loading into Redshift, by far the easiest thing to do was to save my output from Spark as a CSV to S3, and then load it from there into Redshift. This is not as slow as you think, because Spark can write the
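As a sketch of that route, assuming the job's output is a pair RDD named results and that the bucket/path below are placeholders (S3 credentials configured in the Hadoop conf):

  // results: RDD[(String, Double)] computed by the Spark job
  val csvLines = results.map { case (key, value) => key + "," + value }

  // Spark writes the part-* files in parallel; Redshift's COPY (or any bulk
  // loader) can then pick up s3://your-bucket-name/reports/2014-08-07/part-*.
  csvLines.saveAsTextFile("s3n://your-bucket-name/reports/2014-08-07")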

Re: How can I implement eigenvalue decomposition in Spark?

2014-08-07 Thread Shivaram Venkataraman
If you just want to find the top eigenvalue / eigenvector you can do something like the Lanczos method. There is a description of a MapReduce based algorithm in Section 4.2 of [1] [1] http://www.cs.cmu.edu/~ukang/papers/HeigenPAKDD2011.pdf On Thu, Aug 7, 2014 at 10:54 AM, Li Pu

Re: KMeans Input Format

2014-08-07 Thread AlexanderRiggers
Thanks for your answers. The dataset is only 400MB, so I shouldn't run out of memory. I restructured my code now, because I forgot to cache my dataset, and set the number of iterations down to 2, but I still get kicked out of Spark. Did I cache the data wrong (sorry, not an expert)? scala> import

Re: [Compile error] Spark 1.0.2 against cloudera 2.0.0-cdh4.6.0 error

2014-08-07 Thread Sean Owen
Yep, this command given in the Spark docs is correct: mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.2.0 -DskipTests clean package and while I also would hope that this works, it doesn't compile: mvn -Pyarn -Dhadoop.version=2.0.0-cdh4.6.0 -DskipTests clean package I believe later 4.x includes

Re: Save an RDD to a SQL Database

2014-08-07 Thread Flavio Pompermaier
Isn't sqoop export meant for that? http://hadooped.blogspot.it/2013/06/apache-sqoop-part-3-data-transfer.html?m=1 On Aug 7, 2014 7:59 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Vida, What kind of database are you trying to write to? For example, I found that for loading into

Re: Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread contractor
Does this help? I can’t figure out anything new from this extra information. Thanks, Mahesh 2014-08-07 12:27:00,170 [spark-akka.actor.default-dispatcher-4] ERROR akka.actor.OneForOneStrategy - org.apache.spark.streaming.StreamingContext - field (class

Re: [Compile error] Spark 1.0.2 against cloudera 2.0.0-cdh4.6.0 error

2014-08-07 Thread Marcelo Vanzin
I think Cloudera only started adding Spark to CDH4 starting with 4.6, so maybe that's the minimum if you want to try out Spark on CDH4. On Thu, Aug 7, 2014 at 11:22 AM, Sean Owen so...@cloudera.com wrote: Yep, this command given in the Spark docs is correct: mvn -Pyarn-alpha

Re: Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread amit
There is one more configuration option called spark.closure.serializer that can be used to specify the serializer for closures. Maybe in the class you have the StreamingContext as a field, so when spark tries to serialize the whole class it uses the spark.closure.serializer to serialize even the

Re: Save an RDD to a SQL Database

2014-08-07 Thread Vida Ha
That's a good idea - to write to files first and then load. Thanks. On Thu, Aug 7, 2014 at 11:26 AM, Flavio Pompermaier pomperma...@okkam.it wrote: Isn't sqoop export meant for that? http://hadooped.blogspot.it/2013/06/apache-sqoop-part-3-data-transfer.html?m=1 On Aug 7, 2014 7:59 PM,

Re: Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread Tathagata Das
From the extended info, I see that you have a function called createStreamingContext() in your code. Somehow that is getting referenced in the foreach function. Is the whole foreachRDD code inside the createStreamingContext() function? Did you try marking the ssc field as transient? Here is a
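A minimal sketch of the @transient suggestion (not necessarily what TD was about to paste), assuming the streaming context is a field of an enclosing class; all names except createStreamingContext are placeholders:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  class DedupJob {
    // transient: if this instance is ever pulled into a closure, the
    // StreamingContext field is skipped during serialization.
    @transient private val ssc: StreamingContext = createStreamingContext()

    private def createStreamingContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("dedup-job")
      new StreamingContext(conf, Seconds(5))
    }
  }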

Re: Spark Streaming Workflow Validation

2014-08-07 Thread Dan H.
Yes, thanks, I did in fact mean reduceByKey(), thus allowing the convenience method process the summation by key. Thanks for your feedback! DH -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Workflow-Validation-tp11677p11706.html Sent from

Re: Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread contractor
Thanks TD, Amit. I think I figured out where the problem is through the process of commenting out individual lines of code one at a time :( Can either of you help me find the right solution? I tried creating the SparkContext outside the foreachRDD but that didn’t help. I have an object (let’s

questions about MLLib recommendation models

2014-08-07 Thread Jay Hutfles
I have a few questions regarding a collaborative filtering model, and was hoping for some recommendations (no pun intended...) *Setup* I have a csv file with user/movie/ratings named unimaginatively 'movies.csv'. Here are the contents: 0,0,5 0,1,5 0,2,0 0,3,0 1,0,5 1,3,0 2,1,4 2,2,0 3,0,0

Re: questions about MLLib recommendation models

2014-08-07 Thread Sean Owen
On Thu, Aug 7, 2014 at 9:06 PM, Jay Hutfles jayhutf...@gmail.com wrote: 0,0,5 0,1,5 0,2,0 0,3,0 1,0,5 1,3,0 2,1,4 2,2,0 3,0,0 3,1,0 3,2,5 3,3,4 4,0,0 4,1,0 4,2,5 val rank = 10 This is likely the problem: your rank is actually larger than the number of users or items. The error

Re: questions about MLLib recommendation models

2014-08-07 Thread Burak Yavuz
Hi Jay, I've had the same problem you've been having in Question 1 with a synthetic dataset. I thought I wasn't producing the dataset well enough. This seems to be a bug. I will open a JIRA for it. Instead of using: ratings.map{ case Rating(u,m,r) => { val pred = model.predict(u, m) (r
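For reference, a sketch of the batch-predict alternative (one predict() over an RDD of (user, product) pairs instead of one call per rating); the rank/iterations/lambda values below are placeholders, not a recommendation:

  import org.apache.spark.SparkContext._
  import org.apache.spark.mllib.recommendation.{ALS, Rating}

  // ratings: RDD[Rating] parsed from movies.csv
  val model = ALS.train(ratings, 4 /* rank */, 10 /* iterations */, 0.01 /* lambda */)

  // Score every (user, product) pair in one distributed pass.
  val userProducts = ratings.map { case Rating(u, m, _) => (u, m) }
  val predictions = model.predict(userProducts)
    .map { case Rating(u, m, pred) => ((u, m), pred) }

  // Join predictions back to the actual ratings, e.g. to compute MSE.
  val actualAndPred = ratings.map { case Rating(u, m, r) => ((u, m), r) }.join(predictions)
  val mse = actualAndPred.map { case (_, (r, pred)) => (r - pred) * (r - pred) }.mean()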

Re: memory issue on standalone master

2014-08-07 Thread maddenpj
It looks like your Java heap space is too low: -Xmx512m. It's only using .5G of RAM, try bumping this up -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/memory-issue-on-standalone-master-tp11610p11711.html Sent from the Apache Spark User List mailing list

Re: trouble with saveAsParquetFile

2014-08-07 Thread Yin Huai
Hi Brad, It is a bug. I have filed https://issues.apache.org/jira/browse/SPARK-2908 to track it. It will be fixed soon. Thanks, Yin On Thu, Aug 7, 2014 at 10:55 AM, Brad Miller bmill...@eecs.berkeley.edu wrote: Hi All, I'm having a bit of trouble with nested data structures in pyspark

RE: Save an RDD to a SQL Database

2014-08-07 Thread Jim Donahue
Depending on what you mean by save, you might be able to use the Twitter Storehaus package to do this. There was a nice talk about this at a Spark meetup -- Stores, Monoids and Dependency Injection - Abstractions for Spark Streaming Jobs. Video here:

Re: trouble with saveAsParquetFile

2014-08-07 Thread Brad Miller
Thanks Yin! best, -Brad On Thu, Aug 7, 2014 at 1:39 PM, Yin Huai yh...@databricks.com wrote: Hi Brad, It is a bug. I have filed https://issues.apache.org/jira/browse/SPARK-2908 to track it. It will be fixed soon. Thanks, Yin On Thu, Aug 7, 2014 at 10:55 AM, Brad Miller

Re: trouble with saveAsParquetFile

2014-08-07 Thread Yin Huai
Actually, the issue is that if the values of a field are always null (or the field is missing), we cannot figure out the data type, so we use NullType (an internal data type). Right now, we have a step to convert the data type from NullType to StringType. This logic in the master has a bug. We

Re: trouble with saveAsParquetFile

2014-08-07 Thread Yin Huai
The PR is https://github.com/apache/spark/pull/1840. On Thu, Aug 7, 2014 at 1:48 PM, Yin Huai yh...@databricks.com wrote: Actually, the issue is if values of a field are always null (or this field is missing), we cannot figure out the data type. So, we use NullType (it is an internal data

Re: Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread Tathagata Das
Well, I don't see the rdd in the foreachRDD being passed into A.func1(), so I am not sure what the purpose of the function is. Assuming that you do want to pass that RDD into that function, and also want to have access to the sparkContext, you can only pass on the RDD and then access the
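In code form, roughly (msgCount, ddCount, and A.func1 come from the earlier messages; the exact signature of func1 is an assumption):

  msgCount.join(ddCount).foreachRDD { rdd =>
    // Take the SparkContext from the RDD itself instead of referencing ssc or
    // an enclosing object, so the StreamingContext never enters the closure.
    val sc = rdd.context
    A.func1(rdd, sc)
  }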

Re: Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread contractor
Slap my head moment – using rdd.context solved it! Thanks TD, Mahesh From: Tathagata Das tathagata.das1...@gmail.com Date: Thursday, August 7, 2014 at 3:06 PM To: Mahesh Padmanabhan

Re: Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread Tathagata Das
LOL! Glad it solved it. TD On Thu, Aug 7, 2014 at 2:23 PM, Padmanabhan, Mahesh (contractor) mahesh.padmanab...@twc-contractor.com wrote: Slap my head moment – using rdd.context solved it! Thanks TD, Mahesh From: Tathagata Das tathagata.das1...@gmail.com Date: Thursday, August 7, 2014

Re: KMeans Input Format

2014-08-07 Thread durin
Not all memory can be used for Java heap space, so maybe it does run out. Could you try repartitioning the data? To my knowledge you shouldn't be thrown out as long as a single partition fits into memory, even if the whole dataset does not. To do that, exchange val train = parsedData.cache()
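Concretely, something along these lines (the partition count and k are placeholders to tune for the cluster, and parsedData is assumed to be the RDD[Vector] from the earlier post):

  import org.apache.spark.mllib.clustering.KMeans

  // Spread the ~400MB of parsed vectors over more, smaller partitions and cache them.
  val train = parsedData.repartition(50).cache()

  val numClusters = 5
  val numIterations = 2
  val model = KMeans.train(train, numClusters, numIterations)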

Lost executors

2014-08-07 Thread rpandya
I'm running into a problem with executors failing, and it's not clear what's causing it. Any suggestions on how to diagnose and fix it would be appreciated. There are a variety of errors in the logs, and I don't see a consistent triggering error. I've tried varying the number of executors per

Re: PySpark + executor lost

2014-08-07 Thread Davies Liu
What is the environment? YARN, Mesos, or Standalone? It would be more helpful if you could show more of the logs. On Wed, Aug 6, 2014 at 7:25 PM, Avishek Saha avishek.s...@gmail.com wrote: Hi, I get a lot of executor lost errors for saveAsTextFile with PySpark and Hadoop 2.4. For small

Missing SparkSQLCLIDriver and Beeline drivers in Spark

2014-08-07 Thread ajatix
Hi I wish to migrate from shark to the spark-sql shell, where I am facing some difficulties in setting up. I cloned the branch-1.0-jdbc to test out the spark-sql shell, but I am unable to run it after building the source. I've tried two methods for building (with Hadoop 1.0.4) - sbt/sbt

Re: SparkR : lapplyPartition transforms the data in vertical format

2014-08-07 Thread Pranay Dave
Hello Zongheng, In fact the problem is in lapplyPartition. lapply gives output as 1,1 2,2 3,3 ... 10,10 However lapplyPartition gives output as 55, NA 55, NA Why is the lapply output horizontal while the lapplyPartition output is vertical? Here is my code: library(SparkR) sc <- sparkR.init(local) lines <-

Re: All of the tasks have been completed but the Stage is still shown as Active?

2014-08-07 Thread anthonyjschu...@gmail.com
Similarly, I am seeing tasks moved to the completed section which apparently haven't finished all elements... (succeeded/total 1)... is this related? -- View this message in context:

Re: PySpark, numpy arrays and binary data

2014-08-07 Thread Davies Liu
On Thu, Aug 7, 2014 at 12:06 AM, Rok Roskar rokros...@gmail.com wrote: sure, but if you knew that a numpy array went in on one end, you could safely use it on the other end, no? Perhaps it would require an extension of the RDD class and overriding the colect() method. Could you give a short

Re: Spark Streaming + reduceByWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration

2014-08-07 Thread salemi
Hi, Thank you for your help. With the new code I am getting the following error in the driver. What is going wrong here? 14/08/07 13:22:28 ERROR JobScheduler: Error running job streaming job 1407450148000 ms.0 org.apache.spark.SparkException: Checkpoint RDD CheckpointRDD[4528] at apply at

Re: Spark Streaming + reduceByWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration

2014-08-07 Thread Tathagata Das
Are you running on a cluster but giving a local path in ssc.checkpoint(...) ? TD On Thu, Aug 7, 2014 at 3:24 PM, salemi alireza.sal...@udo.edu wrote: Hi, Thank you or your help. With the new code I am getting the following error in the driver. What is going wrong here? 14/08/07 13:22:28

Re: Spark Streaming + reduceByWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration

2014-08-07 Thread salemi
That is correct. I do ssc.checkpoint(checkpoint). Why is the checkpoint required? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-reduceByWindow-reduceFunc-invReduceFunc-windowDuration-slideDuration-tp11591p11731.html Sent from the Apache

Re: Spark Streaming- Input from Kafka, output to HBase

2014-08-07 Thread JiajiaJing
Hi Khanderao and TD, Thank you very much for your reply and the new example. I have resolved the problem. The zookeeper port I used wasn't right; the default port is not the one that I was supposed to use. So I set hbase.zookeeper.property.clientPort to the correct port and everything worked.

Re: Spark Streaming + reduceByWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration

2014-08-07 Thread Tathagata Das
That is required for driver fault-tolerance, as well as for some transformations like updateStateByKey that persist information across batches. It must be an HDFS directory when running on a cluster. TD On Thu, Aug 7, 2014 at 4:25 PM, salemi alireza.sal...@udo.edu wrote: That is correct. I do
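In practice that means pointing the checkpoint at a path every node can reach, for example (the path below is a placeholder):

  // A local directory only works in local mode; on a cluster use a shared
  // filesystem such as HDFS.
  ssc.checkpoint("hdfs://namenode:8020/checkpoints/my-streaming-app")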

Use SparkStreaming to find the max of a dataset?

2014-08-07 Thread bumble123
I can't figure out how to use Spark Streaming to find the max of a 5 second batch of data and keep updating the max every 5 seconds. How would I do this? -- View this message in context:

Re: Use SparkStreaming to find the max of a dataset?

2014-08-07 Thread Tathagata Das
You can do the following. var globalMax = ... dstreamOfNumericalType.foreachRDD( rdd => { globalMax = math.max(rdd.max, globalMax) }) globalMax will keep getting updated after every batch TD On Thu, Aug 7, 2014 at 5:31 PM, bumble123 tc1...@att.com wrote: I can't figure out how to use
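Spelled out as a complete snippet, assuming a DStream[Double] named values; the empty-batch guard is an addition here because rdd.max() fails on an empty RDD:

  var globalMax = Double.NegativeInfinity

  values.foreachRDD { rdd =>
    if (rdd.take(1).nonEmpty) {   // skip empty batches, max() would throw
      globalMax = math.max(rdd.max(), globalMax)
    }
    println("max so far: " + globalMax)
  }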

Re: memory issue on standalone master

2014-08-07 Thread Baoqiang Cao
My problem was that I didn’t know how to add. For what might be worthy, it was solved by editing the spark-env.sh. Thanks anyway! Baoqiang Cao Blog: http://baoqiang.org Email: bqcaom...@gmail.com On Aug 7, 2014, at 3:27 PM, maddenpj madde...@gmail.com wrote: It looks like your Java heap

Re: Regularization parameters

2014-08-07 Thread SK
What is the definition of regParam and what is the range of values it is allowed to take? thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Regularization-parameters-tp11601p11737.html Sent from the Apache Spark User List mailing list archive at
