Re: Spark and Scala

2014-09-12 Thread Deep Pradhan
Is it always true that whenever we apply operations on an RDD, we get another RDD? Or does it depend on the return type of the operation? On Sat, Sep 13, 2014 at 9:45 AM, Soumya Simanta wrote: > > An RDD is a fault-tolerant distributed structure. It is the primary > abstraction in Spark. > > I w

Re: How to initialize StateDStream

2014-09-12 Thread qihong
There's no need to initialize StateDStream. Take a look at the example StatefulNetworkWordCount.scala; it's part of the Spark source code. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-initialize-StateDStream-tp14113p14146.html Sent from the Apache Spark Us
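For reference, a minimal sketch of the updateStateByKey pattern that example uses (assuming a pair DStream wordDstream of (String, Int) counts and that ssc.checkpoint(...) has been set, as stateful operations require):

    def updateFunc(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
      // fold the new counts for this key into the running total
      Some(newValues.sum + runningCount.getOrElse(0))
    }
    val stateDstream = wordDstream.updateStateByKey[Int](updateFunc _)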

SPARK_MASTER_IP

2014-09-12 Thread Koert Kuipers
A grep for SPARK_MASTER_IP shows that sbin/start-master.sh and sbin/start-slaves.sh are the only ones that use it. Yet, for example, in CDH5 the spark-master is started from /etc/init.d/spark-master by running bin/spark-class. Does that mean SPARK_MASTER_IP is simply ignored? it looks like that to

Re: compiling spark source code

2014-09-12 Thread qihong
follow the instruction here: http://spark.apache.org/docs/latest/building-with-maven.html -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/compiling-spark-source-code-tp13980p14144.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Looking for a good sample of Using Spark to do things Hadoop can do

2014-09-12 Thread Steve Lewis
Assume I have a large book with many Chapters and many lines of text. Assume I have a function that tells me the similarity of two lines of text. The objective is to find the most similar line in the same chapter within 200 lines of the line found. The real problem involves biology and is beyond t

Re: sc.textFile problem due to newlines within a CSV record

2014-09-12 Thread Xiangrui Meng
I wrote an input format for Redshift's tables unloaded with the UNLOAD command using the ESCAPE option: https://github.com/mengxr/redshift-input-format , which can recognize multi-line records. Redshift puts a backslash before any in-record `\\`, `\r`, `\n`, and the delimiter character. You can apply the same escaping b

Re: Spark 1.1.0: Cannot load main class from JAR

2014-09-12 Thread Patrick Wendell
Hey SK, Yeah, the documented format is the same (we expect users to add the jar at the end) but the old spark-submit had a bug where it would actually accept inputs that did not match the documented format. Sorry if this was difficult to find! - Patrick On Fri, Sep 12, 2014 at 1:50 PM, SK wrote

Re: Spark and Scala

2014-09-12 Thread Soumya Simanta
An RDD is a fault-tolerant distributed structure. It is the primary abstraction in Spark. I would strongly suggest that you have a look at the following to get a basic idea. http://www.cs.berkeley.edu/~pwendell/strataconf/api/core/spark/RDD.html http://spark.apache.org/docs/latest/quick-start.htm

Re: NullWritable not serializable

2014-09-12 Thread Matei Zaharia
Hi Du, I don't think NullWritable has ever been serializable, so you must be doing something differently from your previous program. In this case though, just use a map() to turn your Writables to serializable types (e.g. null and String). Matei On September 12, 2014 at 8:48:36 PM, Du Li (l...
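A minimal sketch of that workaround (assuming a SequenceFile keyed by NullWritable with Text values; the path is illustrative):

    import org.apache.hadoop.io.{NullWritable, Text}

    val raw = sc.sequenceFile("hdfs:///data/input.seq", classOf[NullWritable], classOf[Text])
    // drop the non-serializable NullWritable key and convert Text to a plain String
    val serializable = raw.map { case (_, v) => v.toString }
    serializable.collect()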

Re: Spark and Scala

2014-09-12 Thread Deep Pradhan
Take for example this: I have declared one queue *val queue = Queue.empty[Int]*, which is a pure Scala line in the program. I actually want the queue to be an RDD, but there are no direct methods to create an RDD which is a queue, right? What do you say about this? Does there exist something like: *Cr

Re: Spark and Scala

2014-09-12 Thread Hari Shreedharan
No, Scala primitives remain primitives. Unless you create an RDD using one of the many methods, you would not be able to access any of the RDD methods. There is no automatic porting. Spark is an application as far as Scala is concerned - there is no compilation (except of course, the scala, JIT co
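A minimal sketch of the distinction (assuming a SparkContext sc is in scope): a plain Scala collection only becomes an RDD when you explicitly hand it to Spark, for example via parallelize.

    import scala.collection.immutable.Queue

    val queue = Queue(1, 2, 3)        // plain Scala value; no RDD methods here
    val rdd = sc.parallelize(queue)   // now a distributed RDD[Int]
    rdd.map(_ * 2).collect()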

Re: Spark and Scala

2014-09-12 Thread Deep Pradhan
I know that unpersist is a method on RDD. But my confusion is that, when we port our Scala programs to Spark, doesn't everything change to RDDs? On Fri, Sep 12, 2014 at 10:16 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > unpersist is a method on RDDs. RDDs are abstractions introduce

workload for spark

2014-09-12 Thread 牛兆捷
We know some memory in Spark is used for computing (e.g., shuffle buffers) and some is used for caching RDDs for future use. Is there any existing workload which utilizes both of them? I want to do some performance study by adjusting the ratio between them.
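For such a study, a minimal sketch of the knobs that control this split in Spark 1.x (the fraction values below are arbitrary examples):

    val conf = new org.apache.spark.SparkConf()
      // fraction of executor heap reserved for cached RDDs (default 0.6)
      .set("spark.storage.memoryFraction", "0.5")
      // fraction of executor heap used for shuffle buffers (default 0.2)
      .set("spark.shuffle.memoryFraction", "0.3")
    val sc = new org.apache.spark.SparkContext(conf)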

sc.textFile problem due to newlines within a CSV record

2014-09-12 Thread Mohit Jaggi
Folks, I think this might be due to the default TextInputFormat in Hadoop. Any pointers to solutions much appreciated. >> More powerfully, you can define your own *InputFormat* implementations to format the input to your programs however you want. For example, the default TextInputFormat reads line

Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-09-12 Thread Tsai Li Ming
Another observation I had was reading over the local filesystem with "file://". It was stated as PROCESS_LOCAL, which was confusing. Regards, Liming On 13 Sep, 2014, at 3:12 am, Nicholas Chammas wrote: > Andrew, > > This email was pretty helpful. I feel like this stuff should be summarized in >

NullWritable not serializable

2014-09-12 Thread Du Li
Hi, I was trying the following on spark-shell (built with apache master and hadoop 2.4.0). Both calling rdd2.collect and calling rdd3.collect threw java.io.NotSerializableException: org.apache.hadoop.io.NullWritable. I got the same problem in similar code of my app which uses the newly released

Re: spark 1.1 failure. class conflict?

2014-09-12 Thread freedafeng
The same command passed in another quick-start VM (v4.7) which has HBase 0.96 installed. Maybe there are some conflicts between the newer HBase version and Spark 1.1.0? Just my guess. Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-1-failure-cla

Executor garbage collection

2014-09-12 Thread Tim Smith
Hi, Anyone setting any explicit GC options for the executor jvm? If yes, what and how did you arrive at them? Thanks, - Tim - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@

Re: Spark SQL Thrift JDBC server deployment for production

2014-09-12 Thread Michael Armbrust
Something like the following should let you launch the thrift server on yarn. HADOOP_CONF_DIR=/etc/hadoop/conf HIVE_SERVER2_THRIFT_PORT=12345 MASTER=yarn- client ./sbin/start-thriftserver.sh On Thu, Sep 11, 2014 at 8:30 PM, Denny Lee wrote: > Could you provide some context about running this

spark 1.1 failure. class conflict?

2014-09-12 Thread freedafeng
Newbie for Java, so please be specific on how to resolve this. The command I was running is $ ./spark-submit --driver-class-path /home/cloudera/Downloads/spark-1.1.0-bin-hadoop2.3/lib/spark-examples-1.1.0-hadoop2.3.0.jar /home/cloudera/Downloads/spark-1.1.0-bin-hadoop2.3/examples/src/main/python/

Where do logs go in StandAlone mode

2014-09-12 Thread Tim Smith
Spark 1.0.0 I write logs out from my app using this object: object LogService extends Logging { /** Set reasonable logging levels for streaming if the user has not configured log4j. */ def setStreamingLogLevels() { val log4jInitialized = Logger.getRootLogger.getAllAppenders.hasMoreElements

Re: Use Case of mutable RDD - any ideas around will help.

2014-09-12 Thread Cheng Lian
Ah, I see. So basically what you need is something like cache write-through support, which exists in Shark but is not implemented in Spark SQL yet. In Shark, when inserting data into a table that has already been cached, the newly inserted data will be automatically cached and “union”-ed with the exist

Re: SparkSQL hang due to

2014-09-12 Thread Michael Armbrust
What is in your hive-site.xml? On Thu, Sep 11, 2014 at 11:04 PM, linkpatrickliu wrote: > I am running Spark Standalone mode with Spark 1.1 > > I started SparkSQL thrift server as follows: > ./sbin/start-thriftserver.sh > > Then I use beeline to connect to it. > Now, I can "CREATE", "SELECT", "SH

Re: Spark 1.1.0: Cannot load main class from JAR

2014-09-12 Thread SK
This issue is resolved. Looks like in the new spark-submit, the jar path has to be at the end of the options. Earlier I could specify this path in any order on the command line. thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-1-0-Cannot-load

Spark 1.1.0: Cannot load main class from JAR

2014-09-12 Thread SK
Hi, I am using the Spark 1.1.0 version that was released yesterday. I recompiled my program to use the latest version using "sbt assembly" after modifying .sbt to use the 1.1.0 version. The compilation goes through and the jar is built. When I run the jar using spark-submit, I get an error: "Canno

Re: Use Case of mutable RDD - any ideas around will help.

2014-09-12 Thread Archit Thakur
Little code snippet: line1: cacheTable(existingRDDTableName) line2: //some operations which will materialize existingRDD dataset. line3: existingRDD.union(newRDD).registerAsTable(new_existingRDDTableName) line4: cacheTable(new_existingRDDTableName) line5: //some operation that will materialize new

Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-09-12 Thread Nicholas Chammas
Andrew, This email was pretty helpful. I feel like this stuff should be summarized in the docs somewhere, or perhaps in a blog post. Do you know if it is? Nick On Thu, Jun 5, 2014 at 6:36 PM, Andrew Ash wrote: > The locality is how close the data is to the code that's processing it. > PROCE

Re: Configuring Spark for heterogenous hardware

2014-09-12 Thread Victor Tso-Guillen
Ping... On Thu, Sep 11, 2014 at 5:44 PM, Victor Tso-Guillen wrote: > So I have a bunch of hardware with different core and memory setups. Is > there a way to do one of the following: > > 1. Express a ratio of cores to memory to retain. The spark worker config > would represent all of the cores a

EOFException when reading from HDFS

2014-09-12 Thread kents
I just started playing with Spark. So I ran the SimpleApp program from tutorial (https://spark.apache.org/docs/1.0.0/quick-start.html), which works fine. However, if I change the file location from local to hdfs, then I get an EOFException. I did some search online which suggests this error is ca

Re: Yarn Over-allocating Containers

2014-09-12 Thread Sandy Ryza
Hi Praveen, I believe you are correct. I noticed this a little while ago and had a fix for it as part of SPARK-1714, but that's been delayed. I'll look into this a little deeper and file a JIRA. -Sandy On Thu, Sep 11, 2014 at 11:44 PM, praveen seluka wrote: > Hi all > > Am seeing a strange i

Re: EOFException when reading from HDFS

2014-09-12 Thread kent
Can anyone help me with this? I have been stuck on this for a few days and don't know what to try anymore. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/EOFException-when-reading-from-HDFS-tp13844p14115.html Sent from the Apache Spark User List mailing li

Re: Spark SQL and running parquet tables?

2014-09-12 Thread DanteSama
Turns out it was Spray with a bad route -- the results weren't updating despite the table running. This thread can be ignored. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-and-running-parquet-tables-tp13987p14114.html Sent from the Apache Spark

How to initialize StateDStream

2014-09-12 Thread Soumitra Kumar
Hello, How do I initialize StateDStream used in updateStateByKey? -Soumitra.

Re: Use Case of mutable RDD - any ideas around will help.

2014-09-12 Thread pankaj arora
I think I should elaborate the use case a little more. So we have a UI dashboard whose response time is quite fast as all the data is cached. Users query data based on time range, and there is always new data coming into the system at a predefined frequency, let's say 1 hour. As you said, I can uncache t

Re: Use Case of mutable RDD - any ideas around will help.

2014-09-12 Thread Cheng Lian
You can always use sqlContext.uncacheTable to uncache the old table. ​ On Fri, Sep 12, 2014 at 10:33 AM, pankaj.arora wrote: > Hi Patrick, > > What if all the data has to be keep in cache all time. If applying union > result in new RDD then caching this would result into keeping older as well >

Re: Unable to ship external Python libraries in PYSPARK

2014-09-12 Thread Davies Liu
By SparkContext.addPyFile("xx.zip"), the xx.zip will be copied to all the workers and stored in a temporary directory, and the path to xx.zip will be on the sys.path on worker machines, so you can "import xx" in your jobs; it does not need to be installed on worker machines. PS: the package or module sh

Re: Some Serious Issue with Spark Streaming ? Blocks Getting Removed and Jobs have Failed..

2014-09-12 Thread Tim Smith
Similar issue (Spark 1.0.0). Streaming app runs for a few seconds before these errors start to pop all over the driver logs: 14/09/12 17:30:23 WARN TaskSetManager: Loss was due to java.lang.Exception java.lang.Exception: Could not compute split, block input-4-1410542878200 not found at org

Re: Use Case of mutable RDD - any ideas around will help.

2014-09-12 Thread pankaj.arora
Hi Patrick, What if all the data has to be kept in cache all the time? If applying union results in a new RDD, then caching this would result in keeping the older as well as this in memory, hence duplicating data. Below is what I understood from your comment. sqlContext.cacheTable(existingRDD)// caches t

Re: split a RDD by pencetage

2014-09-12 Thread pankaj.arora
You can use mapPartitions to achieve this. // split each partition into 10 equal parts, with each part having a number as its id val splittedRDD = self.mapPartitions((itr)=> { Iterate over this iterator and break it into 10 parts. val iterators = Array[ArrayBuffer[T]](10) var i =0 for(tuple
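If approximate proportions are acceptable, RDD.randomSplit is a simpler alternative to hand-rolled partition splitting; a minimal sketch (assuming an existing rdd and Spark 1.x):

    // split an RDD into ten roughly equal parts by weight
    val parts: Array[org.apache.spark.rdd.RDD[Int]] =
      rdd.randomSplit(Array.fill(10)(0.1), seed = 42L)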

Stable spark streaming app

2014-09-12 Thread Tim Smith
Hi, Anyone have a stable streaming app running in "production"? Can you share some overview of the app and setup like number of nodes, events per second, broad stream processing workflow, config highlights etc? Thanks, Tim - To

Re: Spark and Scala

2014-09-12 Thread Nicholas Chammas
unpersist is a method on RDDs. RDDs are abstractions introduced by Spark. An Int is just a Scala Int. You can't call unpersist on Int in Scala, and that doesn't change in Spark. On Fri, Sep 12, 2014 at 12:33 PM, Deep Pradhan wrote: > There is one thing that I am confused about. > Spark has code
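A minimal sketch of the distinction (assuming a SparkContext sc):

    val cached = sc.parallelize(1 to 10).cache()
    cached.unpersist()    // fine: unpersist is defined on RDD
    val n: Int = 42
    // n.unpersist()      // would not compile: Int is plain Scala, not an RDD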

Spark and Scala

2014-09-12 Thread Deep Pradhan
There is one thing that I am confused about. Spark has code that has been implemented in Scala. Now, can we run any Scala code on the Spark framework? What will be the difference in the execution of the Scala code in normal systems and on Spark? The reason for my question is the following: I had

Re: coalesce on SchemaRDD in pyspark

2014-09-12 Thread Davies Liu
On Fri, Sep 12, 2014 at 8:55 AM, Brad Miller wrote: > Hi Davies, > > Thanks for the quick fix. I'm sorry to send out a bug report on release day > - 1.1.0 really is a great release. I've been running the 1.1 branch for a > while and there's definitely lots of good stuff. > > For the workaround, I

Re: Spark SQL and running parquet tables?

2014-09-12 Thread DanteSama
So, after toying around a bit, here's what I ended up with. First off, there's no function "registerTempTable" -- "registerTable" seems to be enough to work (it's the same whether directly on a SchemaRDD or on a SqlContext being passed an RDD). The problem I encountered after was reloading a table

Re: Nested Case Classes (Found and Required Same)

2014-09-12 Thread Prashant Sharma
What is your spark version ? This was fixed I suppose. Can you try it with latest release ? Prashant Sharma On Fri, Sep 12, 2014 at 9:47 PM, Ramaraju Indukuri wrote: > This is only a problem in shell, but works fine in batch mode though. I am > also interested in how others are solving the p

slides from df talk at global big data conference

2014-09-12 Thread Mohit Jaggi
http://engineering.ayasdi.com/2014/09/11/df-dataframes-on-spark/

Re: Nested Case Classes (Found and Required Same)

2014-09-12 Thread Ramaraju Indukuri
This is only a problem in shell, but works fine in batch mode though. I am also interested in how others are solving the problem of case class limitation on number of variables. Regards Ram On Fri, Sep 12, 2014 at 12:12 PM, iramaraju wrote: > I think this is a popular issue, but need help figur

Nested Case Classes (Found and Required Same)

2014-09-12 Thread iramaraju
I think this is a popular issue, but I need help figuring a way around it if this issue is unresolved. I have a dataset that has more than 70 columns. To have all the columns fit into my RDD, I am experimenting with the following. (I intend to use the InputData to parse the file and have 3 or 4 columnsets to
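A hypothetical sketch of that approach (field names and column layout are invented for illustration): since Scala 2.10 case classes are limited to 22 fields, the wide record is grouped into a few smaller case classes nested inside a top-level one.

    case class ColSetA(c1: String, c2: String, c3: String)
    case class ColSetB(c4: Double, c5: Double)
    case class InputData(id: String, a: ColSetA, b: ColSetB)

    val records = sc.textFile("data.csv").map { line =>
      val f = line.split(",")
      InputData(f(0), ColSetA(f(1), f(2), f(3)), ColSetB(f(4).toDouble, f(5).toDouble))
    }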

Re: Use Case of mutable RDD - any ideas around will help.

2014-09-12 Thread Patrick Wendell
[moving to user@] This would typically be accomplished with a union() operation. You can't mutate an RDD in-place, but you can create a new RDD with a union() which is an inexpensive operator. On Fri, Sep 12, 2014 at 5:28 AM, Archit Thakur wrote: > Hi, > > We have a use case where we are plannin
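A minimal sketch of that pattern: RDDs are immutable, so "appending" means creating a new RDD with union (assuming a SparkContext sc).

    val existing = sc.parallelize(Seq(1, 2, 3))
    val incoming = sc.parallelize(Seq(4, 5))
    val combined = existing.union(incoming)   // new RDD; existing is unchanged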

Re: coalesce on SchemaRDD in pyspark

2014-09-12 Thread Brad Miller
Hi Davies, Thanks for the quick fix. I'm sorry to send out a bug report on release day - 1.1.0 really is a great release. I've been running the 1.1 branch for a while and there's definitely lots of good stuff. For the workaround, I think you may have meant: srdd2 = SchemaRDD(srdd._jschema_rdd.c

Why I get java.lang.OutOfMemoryError: Java heap space with join ?

2014-09-12 Thread Jaonary Rabarisoa
Dear all, I'm facing the following problem and I can't figure out how to solve it. I need to join 2 RDDs in order to find their intersections. The first RDD represents an image encoded in a base64 string associated with an image id. The second RDD represents a set of geometric primitives (rectangles) associa

How to initiate a shutdown of Spark Streaming context?

2014-09-12 Thread stanley
In the Spark Streaming programming guide, it specifically states how to shut down a Spark Streaming context: The existing application is shut down gracefully (see StreamingContext.stop(...) or JavaStreamingContext.stop(...)
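A minimal sketch of a graceful shutdown, assuming Spark 1.x's two-argument StreamingContext.stop and a context named ssc:

    // stop the streaming context, let in-flight batches finish,
    // and also stop the underlying SparkContext
    ssc.stop(stopSparkContext = true, stopGracefully = true)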

Fwd: Define the name of the outputs with Java-Spark.

2014-09-12 Thread Guillermo Ortiz
I would like to define the names of my outputs in Spark. I have a process which writes many files and I would like to name them; is it possible? I guess that it's not possible with the saveAsText method. It would be something similar to the MultipleOutput of Hadoop.
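One commonly suggested workaround (a sketch, not an official Spark API for this): key each record by the desired file name and write through Hadoop's MultipleTextOutputFormat with saveAsHadoopFile. Here nameFor is a hypothetical function that picks the output name for a record, and rdd is assumed to exist.

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
    import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.x

    class NamedOutputFormat extends MultipleTextOutputFormat[Any, Any] {
      // use the record's key as the output file name
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.toString
      // don't write the key into the file itself
      override def generateActualKey(key: Any, value: Any): Any =
        NullWritable.get()
    }

    rdd.map(v => (nameFor(v), v.toString))
       .saveAsHadoopFile("/output/path", classOf[String], classOf[String], classOf[NamedOutputFormat])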

Re: spark sql - create new_table as select * from table

2014-09-12 Thread jamborta
thanks. I will try to do that way. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-create-new-table-as-select-from-table-tp14006p14090.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --

Re: Error "Driver disassociated" while running the spark job

2014-09-12 Thread Akhil Das
What is your system setup? Can you paste the spark-env.sh? Looks like you have some issues with your configuration. Thanks Best Regards On Fri, Sep 12, 2014 at 6:31 PM, 남윤민 wrote: > I got this error from the executor's stderr: > > > > > > Using Spark's default log4j profile: > org/apache/spark

Re: What is a pre built package of Apache Spark

2014-09-12 Thread andrew.craft
Hi, I do not see any pre-built binaries on the site currently. I am using make-distribution.sh to create a binary package. After that is done, the file it generates will allow you to execute the scripts in the bin folder. HTH, Andrew -- View this message in context: http://apa

Re: Network requirements between Driver, Master, and Slave

2014-09-12 Thread Mayur Rustagi
The driver needs a consistent connection to the master in standalone mode, as a whole bunch of client stuff happens on the driver. So calls like parallelize send data from the driver to the master & collect sends data from the master to the driver.  If you are looking to avoid the connection you can look into embe

Re: Network requirements between Driver, Master, and Slave

2014-09-12 Thread Jim Carroll
Hi Akhil, Thanks! I guess in short that means the master (or slaves?) connect back to the driver. This seems like a really odd way to work given the driver needs to already connect to the master on port 7077. I would have thought that if the driver could initiate a connection to the master, that w

Error "Driver disassociated" while running the spark job

2014-09-12 Thread 남윤민
I got this error from the executor's stderr: [akka.tcp://sparkDriver@saturn00:49464] disassociated! Shutting down. What is the reason of "Actor not found"? // Yoonmin Nam - To unsubscribe, e-mail: user-unsubscr...@spark

Re: Out of memory with Spark Streaming

2014-09-12 Thread Aniket Bhatnagar
Hi all Sorry but this was totally my mistake. In my persistence logic, I was creating async http client instance in RDD foreach but was never closing it leading to memory leaks. Apologies for wasting everyone's time. Thanks, Aniket On 12 September 2014 02:20, Tathagata Das wrote: > Which vers

Re: Re[2]: HBase 0.96+ with Spark 1.0+

2014-09-12 Thread Aniket Bhatnagar
Hi Reinis Try if the exclude suggestion from me and Sean works for you. If not, can you turn on verbose class loading to see from where javax.servlet.ServletRegistration is loaded? The class should load from "org.mortbay.jetty" % "servlet-api" % jettyVersion. If it loads from some other jar, you w

Re: Some Serious Issue with Spark Streaming ? Blocks Getting Removed and Jobs have Failed..

2014-09-12 Thread Dibyendu Bhattacharya
I agree. Even the Low Level Kafka Consumer which I have written has tunable IO throttling which helps me solve this issue ... But the question remains: even if there is a large backlog, why does Spark drop the unprocessed memory blocks? Dib On Fri, Sep 12, 2014 at 5:47 PM, Jeoffrey Lim wrote: > Our iss

Re: Some Serious Issue with Spark Streaming ? Blocks Getting Removed and Jobs have Failed..

2014-09-12 Thread Jeoffrey Lim
Our issue could be related to this problem as described in: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-in-1-hour-batch-duration-RDD-files-gets-lost-td14027.html which the DStream is processed for every 1 hour batch duration. I have implemented IO throttling in the Receiver

Re: Applications status missing when Spark HA(zookeeper) enabled

2014-09-12 Thread jason chen
Has anybody met this high availability problem with ZooKeeper? 2014-09-12 10:34 GMT+08:00 jason chen : > Hi guys, > > I configured Spark with the configuration in spark-env.sh: > export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER > -Dspark.deploy.zookeeper.url=host1:2181,host2:218

Using filter in joined dataset

2014-09-12 Thread vishnu86
I am a newbie to Scala and Spark. I am joining two datasets, the first one coming from a stream and the second one which is in HDFS. I am using Scala in Spark. After joining the two datasets, I need to apply a filter on the joined datasets, but here I am facing an issue. Please assist to resolve. I am using

Serving data

2014-09-12 Thread Marius Soutier
Hi there, I’m pretty new to Spark, and so far I’ve written my jobs the same way I wrote Scalding jobs - one-off, read data from HDFS, count words, write counts back to HDFS. Now I want to display these counts in a dashboard. Since Spark allows to cache RDDs in-memory and you have to explicitly

Re: Some Serious Issue with Spark Streaming ? Blocks Getting Removed and Jobs have Failed..

2014-09-12 Thread Dibyendu Bhattacharya
Dear all, I am sorry. This was a false alarm. There was some issue in the RDD processing logic which led to a large backlog. Once I fixed the issues in my processing logic, I can see all messages being pulled nicely without any Block Removed error. I need to tune certain configurations in my Kafka

Unable to ship external Python libraries in PYSPARK

2014-09-12 Thread yh18190
Hi all, I am currently working on pyspark for NLP processing etc. I am using the TextBlob python library. Normally in standalone mode it is easy to install external python libraries. In cluster mode I am facing a problem installing these libraries on worker nodes remotely. I cannot access each a

Re: replicate() method in BlockManager.scala choosing only one node for replication.

2014-09-12 Thread Kartheek.R
When I see the storage details of the RDD in the webUI, I find that each block is replicated twice and not on a single node. All the nodes in the cluster are hosting some block or the other. Why is there this difference? The trace of the replicate() method shows only one node, but the webUI shows multiple nod

Re: Kyro deserialisation error

2014-09-12 Thread ayandas84
Hi, I am also facing the same problem. Has any one found out the solution yet? It just returns a vague set of characters. Please help.. Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Exception while deserializing and fetching task: com.esotericsof

Perserving conf files when restarting ec2 cluster

2014-09-12 Thread jerryye
Hi, I'm using --use-existing-master to launch a previously stopped ec2 cluster with spark-ec2. However, my configuration files are overwritten once the cluster is set up. What's the best way of preserving existing configuration files in spark/conf? Alternatively, what I'm trying to do is set SPARK

Re: RDD memory questions

2014-09-12 Thread Boxian Dong
Thank you very much for your help :) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-memory-questions-tp13805p14069.html Sent from the Apache Spark User List mailing list archive at Nabble.com. ---

Re: Computing mean and standard deviation by key

2014-09-12 Thread rzykov
Thank you, David! It works. import org.apache.spark.util.StatCounter val a = ordersRDD.join(ordersRDD).map{case((partnerid, itemid),((matchedida, pricea), (matchedidb, priceb))) => ((matchedida, matchedidb), (if(priceb > 0) (pricea/priceb).toDouble else 0.toDouble))} .groupByKey .

Re: Computing mean and standard deviation by key

2014-09-12 Thread David Rowe
Oh I see, I think you're trying to do something like (in SQL): SELECT order, mean(price) FROM orders GROUP BY order In this case, I'm not aware of a way to use the DoubleRDDFunctions, since you have a single RDD of pairs where each pair is of type (KeyType, Iterable[Double]). It seems to me that

Re: Computing mean and standard deviation by key

2014-09-12 Thread Sean Owen
These functions operate on an RDD of Double which is not what you have, so no this is not a way to use DoubleRDDFunctions. See earlier in the thread for canonical solutions. On Sep 12, 2014 8:06 AM, "rzykov" wrote: > Tried this: > > ordersRDD.join(ordersRDD).map{case((partnerid, itemid),((matched
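For reference, one canonical per-key approach (a minimal sketch, assuming an RDD named pairs of (key, Double) pairs and Spark 1.1's aggregateByKey) builds a StatCounter per key instead of using DoubleRDDFunctions:

    import org.apache.spark.SparkContext._   // pair-RDD functions
    import org.apache.spark.util.StatCounter

    val statsByKey = pairs.aggregateByKey(new StatCounter())(
      (stats, value) => stats.merge(value),   // fold one value into the per-key stats
      (s1, s2) => s1.merge(s2))               // combine partial stats across partitions
    statsByKey.mapValues(s => (s.mean, s.stdev)).collect()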

Re: Computing mean and standard deviation by key

2014-09-12 Thread rzykov
Tried this: ordersRDD.join(ordersRDD).map{case((partnerid, itemid),((matchedida, pricea), (matchedidb, priceb))) => ((matchedida, matchedidb), (if(priceb > 0) (pricea/priceb).toDouble else 0.toDouble))} .groupByKey .values.stats .first Error: :37: error: could not find imp