SparkR Installation

2014-06-18 Thread Stuti Awasthi
Hi All, I wanted to try SparkR. Do we need R preinstalled on all the nodes of the cluster before installing the SparkR package? Please guide me on how to proceed with this. As of now, I work with R only on a single node. Please suggest. Thanks, Stuti Awasthi

Re: Enormous EC2 price jump makes r3.large patch more important

2014-06-18 Thread Jeremy Lee
Ah, right. So only the launch script has changed. Everything else is still essentially binary compatible? Well, that makes it too easy! Thanks! On Wed, Jun 18, 2014 at 2:35 PM, Patrick Wendell pwend...@gmail.com wrote: Actually you'll just want to clone the 1.0 branch then use the spark-ec2

Re: question about setting SPARK_CLASSPATH IN spark_env.sh

2014-06-18 Thread santhoma
Thanks, I hope this problem will go away once I upgrade to Spark 1.0, where we can send the cluster-wide classpaths using the spark-submit command.

Re: Memory footprint of Calliope: Spark - Cassandra writes

2014-06-18 Thread tj opensource
Gerard, We haven't done a test on Calliope vs a driver. The thing is, Calliope builds on C* Thrift (and the latest build on the DS driver), and the performance in terms of simple writes will be similar to any existing driver. But then that is not the use case for Calliope. It is built to be used from

Re: Wildcard support in input path

2014-06-18 Thread Jianshi Huang
Hi Andrew, Strangely, in my Spark (1.0.0, compiled against Hadoop 2.4.0) log it says file not found. I'll try again. Jianshi On Wed, Jun 18, 2014 at 12:36 PM, Andrew Ash and...@andrewash.com wrote: In Spark you can use the normal globs supported by Hadoop's FileSystem, which are documented
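
A minimal sketch of those globs in use (the path is illustrative):

    // *, ? and {a,b} are expanded by Hadoop's FileSystem, so this reads every
    // matching part file under all matching date directories
    val logs = sc.textFile("hdfs://namenode/logs/2014-06-*/part-*")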

Re: Wildcard support in input path

2014-06-18 Thread Jianshi Huang
Hi all, Thanks for the reply. I'm using parquetFile as input, is that a problem? In hadoop fs -ls, the path (hdfs://domain/user/jianshuang/data/parquet/table/month=2014*) will list all the files. I'll test it again. Jianshi On Wed, Jun 18, 2014 at 2:23 PM, Jianshi Huang

Re: Contribution to Spark MLLib

2014-06-18 Thread Jayati
Hello Xiangrui, Thanks for sharing the roadmap. It really helped. Regards, Jayati

Re: Spark Streaming Example with CDH5

2014-06-18 Thread Sean Owen
There is nothing special about CDH5 Spark in this regard. CDH 5.0.x has Spark 0.9.0, and the imminent next release will have 1.0.0 + upstream patches. You're simply accessing a class that was not present in 0.9.0, but is present after that:

Re: join operation is taking too much time

2014-06-18 Thread MEETHU MATHEW
Hi, Thanks Andrew and Daniel for the response. Setting spark.shuffle.spill to false didn't make any difference. 5 days completed in 6 min, and 10 days was stuck after around 1 hr. Daniel, in my current use case I can't read all the files into a single RDD. But I have another use case where I did it

Re: Unit test failure: Address already in use

2014-06-18 Thread Anselme Vignon
Hi, Could your problem come from the fact that you run your tests in parallel? If you are running Spark in local mode, you cannot have concurrent Spark instances running. This means that your tests instantiating a SparkContext cannot be run in parallel. The easiest fix is to tell sbt not to run tests in parallel
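
A minimal sketch of that sbt setting, in build.sbt:

    // Run test suites sequentially so only one local SparkContext exists at a time
    parallelExecution in Test := false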

Re: get schema from SchemaRDD

2014-06-18 Thread Michael Armbrust
We just merged a feature into master that lets you print the schema or view it as a string (printSchema() and schemaTreeString on SchemaRDD). There is also this JIRA targeting 1.1 for presenting a nice programmatic API for this information: https://issues.apache.org/jira/browse/SPARK-2179 On
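
A minimal sketch of those calls (assuming a build of master that includes the feature; the input path is illustrative):

    val people = sqlContext.parquetFile("people.parquet")
    people.printSchema()                       // pretty-prints the schema to stdout
    val tree: String = people.schemaTreeString // or capture it as a String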

Re: rdd.cache() is not faster?

2014-06-18 Thread Gaurav Jain
You cannot assume that caching would always reduce the execution time, especially if the data-set is large. It appears that if too much memory is used for caching, then less memory is left for the actual computation itself. There has to be a balance between the two. Page 33 of this thesis from

Re: Cannot print a derived DStream after reduceByKey

2014-06-18 Thread haopu
I guess this is a basic question about the usage of reduce. Please shed some light, thank you!

Cannot print a derived DStream after reduceByKey

2014-06-18 Thread haopu
In the test application, I create a DStream by connecting to a socket. Then I want to count the RDDs in the DStream which match another reference RDD. Below is the Java code for my application:

    public class TestSparkStreaming {
        public static void main(String[] args) {
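
For reference, a minimal Scala sketch of the same setup (host, port and batch interval are illustrative):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.map(w => (w, 1L)).reduceByKey(_ + _)
    counts.print() // prints a sample of each batch's counts
    ssc.start()
    ssc.awaitTermination()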

BSP realization on Spark

2014-06-18 Thread Ghousia
Hi, We are trying to implement a BSP model in Spark with the help of GraphX. One thing I encountered is the Pregel operator in the Graph class. But what I fail to understand is how the Master and Worker need to be assigned (BSP), and how barrier synchronization would happen. The pregel operator
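
For context, the Pregel operator runs the superstep loop and barrier-style synchronization itself; the caller supplies only the three functions. A minimal sketch (single-source shortest paths over a graph with Double edge weights; sourceId is illustrative):

    import org.apache.spark.graphx._

    val sourceId: VertexId = 0L
    val g = graph.mapVertices((id, _) =>
      if (id == sourceId) 0.0 else Double.PositiveInfinity)
    val sssp = g.pregel(Double.PositiveInfinity)(
      (id, dist, newDist) => math.min(dist, newDist),   // vertex program
      triplet =>                                        // send messages
        if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
          Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
        else Iterator.empty,
      (a, b) => math.min(a, b)                          // merge messages
    )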

Re: Java IO Stream Corrupted - Invalid Type AC?

2014-06-18 Thread Surendranauth Hiraman
Patrick, My team is using shuffle consolidation but not speculation. We are also using persist(DISK_ONLY) for caching. Here are some config changes that are in our work-in-progress. We've been trying for 2 weeks to get our production flow (maybe around 50-70 stages, a few forks and joins with

Re: Contribution to Spark MLLib

2014-06-18 Thread Denis Turdakov
Hello everybody, Xiangrui, thanks for the link to the roadmap. I saw it is planned to implement LDA in MLlib 1.1. What do you think about PLSA? I understand that LDA is more popular now, but recent research shows that modifications of PLSA sometimes perform better [1]. Furthermore, the most

Re: rdd.cache() is not faster?

2014-06-18 Thread Wei Tan
Hi Gaurav, thanks for your pointer. The observation in the link is (at least qualitatively) similar to mine. Now the question is, if I do have big data (40GB, cached size is 60GB) and even big memory (192 GB), I cannot benefit from RDD cache, and should persist on disk and leverage filesystem

Re: rdd.cache() is not faster?

2014-06-18 Thread Gaurav Jain
if I do have big data (40GB, cached size is 60GB) and even big memory (192 GB), I cannot benefit from RDD cache, and should persist on disk and leverage filesystem cache? The answer to the question of whether to persist (spill-over) data on disk is not always immediately clear, because generally

Re: Wildcard support in input path

2014-06-18 Thread Nicholas Chammas
Is that month= syntax something special, or do your files actually have that string as part of their name? On Wed, Jun 18, 2014 at 2:25 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi all, Thanks for the reply. I'm using parquetFile as input, is that a problem? In hadoop fs -ls, the

Re: Wildcard support in input path

2014-06-18 Thread Jianshi Huang
Hi Nicholas, month= is for Hive to auto discover the partitions. It's part of the url of my files. Jianshi On Wed, Jun 18, 2014 at 11:52 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Is that month= syntax something special, or do your files actually have that string as part of

Re: Wildcard support in input path

2014-06-18 Thread Nicholas Chammas
I wonder if that's the problem. Is there an equivalent hadoop fs -ls command you can run that returns the same files you want but doesn't have that month= string? On Wed, Jun 18, 2014 at 12:25 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi Nicholas, month= is for Hive to auto discover

HDFS folder .sparkStaging not deleted and filled up HDFS in yarn mode

2014-06-18 Thread Andrew Lee
Hi All, Has anyone run into the same problem? Looking at the source code in the official release (rc11), this property setting is false by default; however, I'm seeing the .sparkStaging folder remain on HDFS, causing it to fill up the disk pretty fast, since SparkContext deploys

RE: HDFS folder .sparkStaging not deleted and filled up HDFS in yarn mode

2014-06-18 Thread Andrew Lee
Forgot to mention that I am using spark-submit to submit jobs, and a verbose-mode printout looks like this with the SparkPi example. The .sparkStaging folder won't be deleted. My thought is that this should be part of the staging and should be cleaned up as well when sc gets terminated.

Re: question about setting SPARK_CLASSPATH IN spark_env.sh

2014-06-18 Thread santhoma
By the way, any idea how to sync the Spark config dir with the other nodes in the cluster? ~santhosh

RE: Unit test failure: Address already in use

2014-06-18 Thread Lisonbee, Todd
Disabling parallelExecution has worked for me. Other alternatives I’ve tried that also work include: 1. Using a lock – this will let tests execute in parallel except for those using a SparkContext. If you have a large number of tests that could execute in parallel, this can shave off some
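
A minimal sketch of the lock approach (all names are illustrative):

    import org.apache.spark.SparkContext
    import org.junit.Test

    object SparkTestLock // shared lock: serializes only the tests that need a SparkContext

    class MySparkSuite {
      @Test def countsRecords() {
        SparkTestLock.synchronized {
          val sc = new SparkContext("local", "test")
          try {
            assert(sc.parallelize(1 to 10).count() == 10)
          } finally {
            sc.stop()
          }
        }
      }
    }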

Re: Unit test failure: Address already in use

2014-06-18 Thread Philip Ogren
In my unit tests I have a base class that all my tests extend, with setup and teardown methods that they inherit. They look something like this:

    var spark: SparkContext = _
    @Before def setUp() {
      Thread.sleep(100L) //this seems to give spark more time to reset from the
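
A completed version of that pattern might look like this sketch (the master URL, app name, and cleared property are assumptions, not from the original message):

    import org.apache.spark.SparkContext
    import org.junit.{After, Before}

    var spark: SparkContext = _

    @Before def setUp() {
      Thread.sleep(100L) // give Spark time to fully shut down from the previous test
      spark = new SparkContext("local", "test")
    }

    @After def tearDown() {
      if (spark != null) {
        spark.stop()
        spark = null
      }
      System.clearProperty("spark.driver.port") // avoid 'Address already in use' across tests
    }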

Spark is now available via Homebrew

2014-06-18 Thread Nick Chammas
OS X / Homebrew users, It looks like you can now download Spark simply by doing: brew install apache-spark I’m new to Homebrew, so I’m not too sure how people are intended to use this. I’m guessing this would just be a convenient way to get the latest release onto your workstation, and from

Re: No Intercept for Python

2014-06-18 Thread Reza Zadeh
Hi Naftali, Yes you're right. For now please add a column of ones. We are working on adding a weighted regularization term, and exposing the scala intercept option in the python binding. Best, Reza On Mon, Jun 16, 2014 at 12:19 PM, Naftali Harris naft...@affirm.com wrote: Hi everyone, The
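
A minimal sketch of the column-of-ones workaround in PySpark (the input RDD and names are illustrative):

    from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

    # Append a constant 1.0 feature; its learned weight then acts as the intercept
    def with_ones(p):
        return LabeledPoint(p.label, list(p.features) + [1.0])

    model = LinearRegressionWithSGD.train(points.map(with_ones))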

Re: Spark is now available via Homebrew

2014-06-18 Thread Matei Zaharia
Interesting, does anyone know the people over there who set it up? It would be good if Apache itself could publish packages there, though I’m not sure what’s involved. Since Spark just depends on Java and Python it should be easy for us to update. Matei On Jun 18, 2014, at 1:37 PM, Nick

Re: No Intercept for Python

2014-06-18 Thread Naftali Harris
Thanks Reza! :-D Naftali On Wed, Jun 18, 2014 at 1:47 PM, Reza Zadeh r...@databricks.com wrote: Hi Naftali, Yes you're right. For now please add a column of ones. We are working on adding a weighted regularization term, and exposing the scala intercept option in the python binding.

Re: Spark is now available via Homebrew

2014-06-18 Thread Sheryl John
Cool. Looking at the pull requests, the upgrade to 1.0.0 was just merged yesterday. https://github.com/Homebrew/homebrew/pull/30231 https://github.com/Homebrew/homebrew/blob/master/Library/Formula/apache-spark.rb On Wed, Jun 18, 2014 at 1:57 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

Spark 0.9.1 java.lang.outOfMemoryError: Java Heap Space

2014-06-18 Thread Shivani Rao
I am trying to process a file that contains 4 log lines (not very long) and then write my parsed out case classes to a destination folder, and I get the following error: java.lang.OutOfMemoryError: Java heap space at

Re: Spark is now available via Homebrew

2014-06-18 Thread Nicholas Chammas
Agreed, it would be better if Apache controlled or managed this directly. I think making such a change is just a matter of opening a new issue https://github.com/Homebrew/homebrew/issues/new on the Homebrew issue tracker. I believe that's how Spark made it in there in the first place--it was just

Re: Spark is now available via Homebrew

2014-06-18 Thread Nicholas Chammas
Matei, You might want to comment on that issue Sheryl linked to, or perhaps this one https://github.com/Homebrew/homebrew/issues/30228, to ask about how Apache can manage this going forward. I know that mikemcquaid https://github.com/mikemcquaid is very active on the Homebrew repo and is one of

Re: Spark is now available via Homebrew

2014-06-18 Thread Andrew Ash
What's the advantage of Apache maintaining the brew installer vs users? Apache handling it means more work on this dev team, but probably a better experience for brew users. Just wanted to weigh pros/cons before committing to support this installation method. Andrew On Wed, Jun 18, 2014 at

Re: Spark 0.9.1 java.lang.outOfMemoryError: Java Heap Space

2014-06-18 Thread Andrew Ash
Wait, so the file only has four lines and the job is running out of heap space? Can you share the code you're running that does the processing? I'd guess that you're doing some intense processing on every line, but just writing parsed case classes back to disk sounds very lightweight. On Wed,

java.lang.OutOfMemoryError with saveAsTextFile

2014-06-18 Thread Muttineni, Vinay
Hi, I have a 5 million record, 300 column data set. I am running a spark job in yarn-cluster mode, with the following args --driver-memory 11G --executor-memory 11G --executor-cores 16 --num-executors 500 The spark job replaces all categorical variables with some integers. I am getting the below

Spark streaming and rate limit

2014-06-18 Thread Flavio Pompermaier
Hi to all, in my use case I'd like to receive events and call an external service as they pass through. Is it possible to limit the number of concurrent calls to that service (to avoid DoS) using Spark Streaming? If so, limiting the rate implies possible buffer growth... how can I control the

Re: Issue while trying to aggregate with a sliding window

2014-06-18 Thread Hatch M
Ok, that patch does fix the key lookup exception. However, curious about the time validity check... isValidTime ( https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala#L264 ) Why does (time - zeroTime) have to be a multiple of the slide

Re: Spark streaming and rate limit

2014-06-18 Thread Soumya Simanta
You can add a back-pressure-enabled component in front that feeds data into Spark. This component can control the input rate to Spark. On Jun 18, 2014, at 6:13 PM, Flavio Pompermaier pomperma...@okkam.it wrote: Hi to all, in my use case I'd like to receive events and call an external
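
One way to sketch such a component (not Spark-specific; the queue size and names are illustrative): a bounded blocking queue between the event source and whatever feeds the Spark receiver, so the producer blocks instead of overrunning the consumer:

    import java.util.concurrent.ArrayBlockingQueue

    // Bounded queue: put() blocks when full, pushing back on the event source
    val buffer = new ArrayBlockingQueue[String](10000)

    def offerEvent(e: String): Unit = buffer.put(e)  // producer side: blocks when Spark lags
    def nextEvent(): String = buffer.take()          // consumer side: feeds the Spark receiver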

Re: Spark streaming and rate limit

2014-06-18 Thread Flavio Pompermaier
Thanks for the quick reply Soumya. Unfortunately I'm a newbie with Spark... what do you mean? Is there any reference on how to do that? On Thu, Jun 19, 2014 at 12:24 AM, Soumya Simanta soumya.sima...@gmail.com wrote: You can add a back-pressure-enabled component in front that feeds data into

create SparkContext dynamically

2014-06-18 Thread jamborta
Hi all, I am setting up a system where Spark contexts would be created by a web server that would handle the computation and return the results. I have the following code (in Python):

    os.environ['SPARK_HOME'] = '/home/spark/spark-1.0.0-bin-hadoop2/'
    sc =
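
A sketch of how that context creation might continue (the master URL and app name are assumptions):

    import os
    from pyspark import SparkConf, SparkContext

    os.environ['SPARK_HOME'] = '/home/spark/spark-1.0.0-bin-hadoop2/'

    conf = SparkConf().setMaster('spark://master:7077').setAppName('web-backend-job')
    sc = SparkContext(conf=conf)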

Trailing Tasks Saving to HDFS

2014-06-18 Thread Surendranauth Hiraman
I have a flow that ends with saveAsTextFile() to HDFS. It seems all the expected files per partition have been written out, based on the number of part files and the file sizes. But the driver logs show 2 tasks still not completed, with no activity, and the worker logs show no activity for

Patterns for making multiple aggregations in one pass

2014-06-18 Thread Nick Chammas
The following is a simplified example of what I am trying to accomplish. Say I have an RDD of objects like this:

    { "country": "USA", "name": "Franklin", "age": 24, "hits": 224 }
    { "country": "USA", "name": "Bob", "age": 55, "hits": 108 }
    { "country": "France", "name": "Remi", "age":

Re: Patterns for making multiple aggregations in one pass

2014-06-18 Thread Doris Xin
Hi Nick, Instead of using reduceByKey(), you might want to look into using aggregateByKey(), which allows you to return a different value type U instead of the input value type V for each input tuple (K, V). You can define U to be a datatype that holds both the average and total and have seqOp
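
A minimal sketch of that approach (aggregateByKey landed after the 1.0.0 release, so this assumes a newer build; it tracks (sum, count) per key in one pass and derives the average at the end):

    // pairs: RDD[(String, Long)] of (country, age); U = (sum, count)
    val sumCount = pairs.aggregateByKey((0L, 0L))(
      (acc, v) => (acc._1 + v, acc._2 + 1),  // seqOp: fold one value into the accumulator
      (a, b) => (a._1 + b._1, a._2 + b._2)   // combOp: merge per-partition accumulators
    )
    val avgAndTotal = sumCount.mapValues { case (sum, n) => (sum.toDouble / n, sum) }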

Re: Patterns for making multiple aggregations in one pass

2014-06-18 Thread Nicholas Chammas
Ah, this looks like exactly what I need! It looks like this was recently added into PySpark https://github.com/apache/spark/pull/705/files#diff-6 (and Spark Core), but it's not in the 1.0.0 release. Thank you. Nick On Wed, Jun 18, 2014 at 7:42 PM, Doris Xin doris.s@gmail.com wrote: Hi

Re: Patterns for making multiple aggregations in one pass

2014-06-18 Thread Evan R. Sparks
This looks like a job for SparkSQL!

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext._
    case class MyRecord(country: String, name: String, age: Int, hits: Long)
    val data = sc.parallelize(Array(MyRecord("USA", "Franklin", 24, 234),
      MyRecord("USA", "Bob", 55, 108),
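
A completed version of that example might continue along these lines (a sketch against the Spark 1.0 API; the table name and query are illustrative):

    data.registerAsTable("records") // renamed registerTempTable in later releases
    val agg = sqlContext.sql(
      "SELECT country, AVG(age), SUM(hits) FROM records GROUP BY country")
    agg.collect().foreach(println)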

Re: Patterns for making multiple aggregations in one pass

2014-06-18 Thread Matei Zaharia
I was going to suggest the same thing :). On Jun 18, 2014, at 4:56 PM, Evan R. Sparks evan.spa...@gmail.com wrote: This looks like a job for SparkSQL! val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext._ case class MyRecord(country: String, name: String, age: Int,

Re: Patterns for making multiple aggregations in one pass

2014-06-18 Thread Nicholas Chammas
That's pretty neat! So I guess if you start with an RDD of objects, you'd first do something like RDD.map(lambda x: Record(x['field_1'], x['field_2'], ...)) in order to register it as a table, and from there run your aggregates. Very nice. On Wed, Jun 18, 2014 at 7:56 PM, Evan R. Sparks

Re: Patterns for making multiple aggregations in one pass

2014-06-18 Thread Zongheng Yang
If your input data is JSON, you can also try out the recently merged in initial JSON support: https://github.com/apache/spark/commit/d2f4f30b12f99358953e2781957468e2cfe3c916 On Wed, Jun 18, 2014 at 5:27 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: That’s pretty neat! So I guess if you
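
A minimal sketch of that JSON support (as merged into master at the time; the path and query are illustrative):

    // Infer the schema from a file of JSON objects, one per line
    val records = sqlContext.jsonFile("hdfs://namenode/data/records.json")
    records.registerAsTable("records")
    sqlContext.sql("SELECT country, AVG(age) FROM records GROUP BY country")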

Re: Trailing Tasks Saving to HDFS

2014-06-18 Thread Surendranauth Hiraman
Looks like eventually there was some type of reset or timeout and the tasks have been reassigned. I'm guessing they'll keep failing until max failure count. The machine it disconnected from was a remote machine, though I've seen such failures from connections to itself with other problems. The

Re: Patterns for making multiple aggregations in one pass

2014-06-18 Thread Nicholas Chammas
This is exciting! Here is the relevant alpha doc http://yhuai.github.io/site/sql-programming-guide.html#json-datasets for this feature, for others reading this. I'm going to try this out. Will this be released with 1.1.0? On Wed, Jun 18, 2014 at 8:31 PM, Zongheng Yang zonghen...@gmail.com

Re: Spark streaming and rate limit

2014-06-18 Thread Soumya Simanta
Flavio - I'm new to Spark as well, but I've done stream processing using other frameworks. My comments below are not Spark Streaming specific. Maybe someone who knows more can provide better insights. I read your post on my phone and I believe my answer doesn't completely address the issue you have

Re: Execution stalls in LogisticRegressionWithSGD

2014-06-18 Thread Xiangrui Meng
Hi Bharath, This is related to SPARK-1112, which we already found the root cause. I will let you know when this is fixed. Best, Xiangrui On Tue, Jun 17, 2014 at 7:37 PM, Bharath Ravi Kumar reachb...@gmail.com wrote: Couple more points: 1)The inexplicable stalling of execution with large

Re: Execution stalls in LogisticRegressionWithSGD

2014-06-18 Thread Bharath Ravi Kumar
Thanks. I'll await the fix to re-run my test. On Thu, Jun 19, 2014 at 8:28 AM, Xiangrui Meng men...@gmail.com wrote: Hi Bharath, This is related to SPARK-1112, which we already found the root cause. I will let you know when this is fixed. Best, Xiangrui On Tue, Jun 17, 2014 at 7:37 PM,

Fwd: BSP realization on Spark

2014-06-18 Thread Ghousia
-- Forwarded message -- From: Ghousia ghousia.ath...@gmail.com Date: Wed, Jun 18, 2014 at 5:41 PM Subject: BSP realization on Spark To: user@spark.apache.org Hi, We are trying to implement a BSP model in Spark with the help of GraphX. One thing I encountered is a Pregel operator

options set in spark-env.sh is not reflecting on actual execution

2014-06-18 Thread MEETHU MATHEW
Hi all, I have a doubt regarding the options in spark-env.sh. I set the following values in the file on the master and 2 workers:

    SPARK_WORKER_MEMORY=7g
    SPARK_EXECUTOR_MEMORY=6g
    SPARK_DAEMON_JAVA_OPTS+=-Dspark.akka.timeout=30 -Dspark.akka.frameSize=1 -Dspark.blockManagerHeartBeatMs=80

Re: Best practices for removing lineage of a RDD or Graph object?

2014-06-18 Thread dash
Hi Roy, Thanks for your help. I wrote a small code snippet that could reproduce the problem. Could you help me read through it and see if I did anything wrong? Thanks!

    def main(args: Array[String]) {
      val conf = new SparkConf().setAppName("TEST")
        .setMaster("local[4]")
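
For readers following this thread: the usual way to truncate an RDD's lineage is checkpointing; a minimal sketch (the checkpoint directory and loop structure are illustrative):

    sc.setCheckpointDir("hdfs://namenode/checkpoints")
    var rdd = sc.parallelize(1 to 1000)
    for (i <- 1 to 100) {
      rdd = rdd.map(_ + 1)
      if (i % 10 == 0) {
        rdd.checkpoint() // marks the RDD; its lineage is cut once materialized
        rdd.count()      // force evaluation so the checkpoint is actually written
      }
    }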