SPARK_MASTER_IP

2014-09-13 Thread Koert Kuipers
A grep for SPARK_MASTER_IP shows that sbin/start-master.sh and sbin/start-slaves.sh are the only scripts that use it. Yet, for example, in CDH5 the Spark master is started from /etc/init.d/spark-master by running bin/spark-class. Does that mean SPARK_MASTER_IP is simply ignored? It looks like that to

Re: How to initialize StateDStream

2014-09-13 Thread qihong
There's no need to initialize StateDStream. Take a look at the example StatefulNetworkWordCount.scala; it's part of the Spark source code.
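
For reference, a minimal sketch of the updateStateByKey pattern that StatefulNetworkWordCount demonstrates (the stream source, port, and batch interval below are illustrative assumptions, not taken from the thread):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StatefulWordCountSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("StatefulWordCountSketch"), Seconds(10))
        ssc.checkpoint("checkpoint")  // stateful operations require a checkpoint directory

        // newValues holds this batch's counts for a word; runningCount is the carried-over state.
        val updateFunc = (newValues: Seq[Int], runningCount: Option[Int]) =>
          Some(newValues.sum + runningCount.getOrElse(0))

        val wordCounts = ssc.socketTextStream("localhost", 9999)
          .flatMap(_.split(" "))
          .map(word => (word, 1))
          .updateStateByKey[Int](updateFunc)

        wordCounts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }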

Re: Spark and Scala

2014-09-13 Thread Deep Pradhan
Is it always true that whenever we apply operations on an RDD, we get another RDD? Or does it depend on the return type of the operation? On Sat, Sep 13, 2014 at 9:45 AM, Soumya Simanta soumya.sima...@gmail.com wrote: An RDD is a fault-tolerant distributed structure. It is the primary

Re: How to save mllib model to hdfs and reload it

2014-09-13 Thread Yanbo Liang
Shixiong, These two snippets behave differently in Scala. In the second snippet, you define a variable named m, and the right-hand side is evaluated as part of the definition. In other words, the variable is replaced by the pre-computed value of Array(1.0) in the subsequent code. So in the second
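
A small standalone illustration of that point about val evaluation (not the snippets from the original thread): the right-hand side of a val runs once, at the definition, while a def re-runs it on every reference.

    object ValVsDef {
      def main(args: Array[String]): Unit = {
        val evaluatedOnce = { println("evaluating val"); Array(1.0) }      // prints once, right here
        def evaluatedEachTime = { println("evaluating def"); Array(1.0) }  // prints on every call

        println(evaluatedOnce.mkString(","))      // reuses the value computed at definition time
        println(evaluatedEachTime.mkString(","))  // re-evaluates the right-hand side
      }
    }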

Re: Spark and Scala

2014-09-13 Thread Mark Hamstra
This is all covered in http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations By definition, RDD transformations take an RDD to another RDD; actions produce some other type as a value on the driver program. On Fri, Sep 12, 2014 at 11:15 PM, Deep Pradhan
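
A small sketch of that distinction (the data is made up for illustration): map and filter are transformations returning RDDs, while count and collect are actions returning plain values on the driver.

    import org.apache.spark.{SparkConf, SparkContext}

    object TransformationsVsActions {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("TransformationsVsActions"))

        val numbers = sc.parallelize(1 to 10)     // RDD[Int]
        val doubled = numbers.map(_ * 2)          // transformation: returns another RDD
        val evens = doubled.filter(_ % 4 == 0)    // transformation: still an RDD

        val howMany: Long = evens.count()          // action: a Long on the driver
        val asArray: Array[Int] = evens.collect()  // action: an Array on the driver

        println(s"$howMany values: ${asArray.mkString(",")}")
        sc.stop()
      }
    }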

Re: Serving data

2014-09-13 Thread Mayur Rustagi
You can cache data in memory and query it using Spark Job Server. Most folks dump data down to a queue/db for retrieval. You can batch up data and store it into Parquet partitions as well, then query it using another Spark SQL shell; the JDBC driver in Spark SQL is part of 1.1, I believe. -- Regards, Mayur
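
A possible shape of the "store as Parquet, query with Spark SQL" option, assuming Spark 1.1; the Event class, paths, and query are made-up placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    case class Event(id: Long, rate: Double)  // hypothetical schema

    object ParquetServingSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("ParquetServingSketch"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.createSchemaRDD

        // Batch side: write the data out as Parquet.
        sc.parallelize(Seq(Event(1L, 0.5), Event(2L, 1.5)))
          .saveAsParquetFile("hdfs:///tmp/events.parquet")

        // Serving side (e.g. another Spark SQL shell): load it back and query it.
        val events = sqlContext.parquetFile("hdfs:///tmp/events.parquet")
        events.registerTempTable("events")
        sqlContext.sql("SELECT id, rate FROM events WHERE rate > 1.0").collect().foreach(println)

        sc.stop()
      }
    }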

Re: Spark and Scala

2014-09-13 Thread Deep Pradhan
Take for example this: val lines = sc.textFile(args(0)); val nodes = lines.map(s => { val fields = s.split("\\s+"); (fields(0), fields(1)) }).distinct().groupByKey().cache(); val nodeSizeTuple = nodes.map(node => (node._1.toInt, node._2.size)); val rootNode =

Re: sc.textFile problem due to newlines within a CSV record

2014-09-13 Thread Mohit Jaggi
Thanks Xiangrui. This file already exists w/o escapes. I could probably try to preprocess it and add the escaping. On Fri, Sep 12, 2014 at 9:38 PM, Xiangrui Meng men...@gmail.com wrote: I wrote an input format for Redshift's tables unloaded via UNLOAD with the ESCAPE option:

Re: Spark and Scala

2014-09-13 Thread Mark Hamstra
Again, RDD operations are of two basic varieties: transformations, which produce further RDDs; and actions, which return values to the driver program. You've used several RDD transformations and then finally the top(1) action, which returns an array of one element to your driver program. That

Re: [mllib] LogisticRegressionWithLBFGS interface is not consistent with LogisticRegressionWithSGD

2014-09-13 Thread Yanbo Liang
I also found that https://github.com/apache/spark/commit/8f6e2e9df41e7de22b1d1cbd524e20881f861dd0 had resolved this issue, but it seems the corrected code snippet does not appear in master or the 1.1 release. 2014-09-13 17:12 GMT+08:00 Yanbo Liang yanboha...@gmail.com: Hi All, I found that

[mllib] LogisticRegressionWithLBFGS interface is not consistent with LogisticRegressionWithSGD

2014-09-13 Thread Yanbo Liang
Hi All, I found that the LogisticRegressionWithLBFGS interface is not consistent with LogisticRegressionWithSGD in master and the 1.1 release. https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala#L199 In the above code snippet,

Re: [mllib] LogisticRegressionWithLBFGS interface is not consistent with LogisticRegressionWithSGD

2014-09-13 Thread DB Tsai
Hi Yanbo, We made the change here: https://github.com/apache/spark/commit/5d25c0b74f6397d78164b96afb8b8cbb1b15cfbd Those APIs to set the parameters are very difficult to maintain, so we decided not to provide them. In the next release, Spark 1.2, we will have a better API design for parameter setting.

RDDs and Immutability

2014-09-13 Thread Deep Pradhan
Hi, We all know that RDDs are immutable. There don't seem to be enough operations to achieve anything and everything on RDDs. Take this for example: I want an Array of Bytes filled with zeros, which should change during the program; some elements of that Array should change to 1. If I make an RDD with

ReduceByKey performance optimisation

2014-09-13 Thread Julien Carme
Hello, I am facing performance issues with reduceByKey. I know that this topic has already been covered, but I did not really find answers to my question. I am using reduceByKey to remove entries with identical keys, using (a, b) => a as the reduce function. It seems to be a relatively

Re: Serving data

2014-09-13 Thread andy petrella
However, the cache is not guaranteed to remain: if other jobs are launched in the cluster and require more memory than what's left in the overall caching memory, previous RDDs will be discarded. Using an off-heap cache like Tachyon as a dump repo can help. In general, I'd say that using a

Re: ReduceByKey performance optimisation

2014-09-13 Thread Sean Owen
If you are just looking for distinct keys, .keys.distinct() should be much better. On Sat, Sep 13, 2014 at 10:46 AM, Julien Carme julien.ca...@gmail.com wrote: Hello, I am facing performance issues with reduceByKey. I know that this topic has already been covered but I did not really find

Re: JMXSink for YARN deployment

2014-09-13 Thread Otis Gospodnetic
Hi, Jerry said "I'm guessing", so maybe the thing to try is to check whether his guess is correct. What about running sudo lsof | grep metrics.properties? I imagine you should be able to see it if the file was found and read. If Jerry is right, then I think you will NOT see it. Next, how about

Re: ReduceByKey performance optimisation

2014-09-13 Thread Julien Carme
I need to remove objects with duplicate keys, but I need the whole object. Objects which have the same key are not necessarily equal, though (but I can drop any one of those that have an identical key). 2014-09-13 12:50 GMT+02:00 Sean Owen so...@cloudera.com: If you are just looking for distinct

Re: How to initiate a shutdown of Spark Streaming context?

2014-09-13 Thread Sean Owen
Your app is the running Spark Streaming system. It would be up to you to build some mechanism that lets you cause it to call stop() in response to some signal from you. On Fri, Sep 12, 2014 at 3:59 PM, stanley wangshua...@yahoo.com wrote: In spark streaming programming document
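
One way such a mechanism is often built (a sketch only; the marker-file path and batch interval are assumptions, and the actual computation is elided): poll for an external marker and call stop() with graceful shutdown when it appears.

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StoppableStreamingApp {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("StoppableStreamingApp"), Seconds(10))

        // ... define the streaming computation on ssc here ...

        ssc.start()

        // Poll for an external "stop" marker; when it appears, shut down gracefully.
        val marker = new Path("hdfs:///tmp/stop-streaming-app")  // hypothetical path
        val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)
        while (!fs.exists(marker)) {
          Thread.sleep(10000)
        }
        // Finish in-flight batches, then stop the streaming and Spark contexts.
        ssc.stop(stopSparkContext = true, stopGracefully = true)
      }
    }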

Re: spark 1.1 failure. class conflict?

2014-09-13 Thread Sean Owen
No, your error is right there in the logs. Unset SPARK_CLASSPATH. On Fri, Sep 12, 2014 at 10:20 PM, freedafeng freedaf...@yahoo.com wrote: : org.apache.spark.SparkException: Found both spark.driver.extraClassPath and SPARK_CLASSPATH. Use only the former.

Re: compiling spark source code

2014-09-13 Thread kkptninja
Hi, I too am having a problem compiling Spark from source. However, my problem is different. I downloaded the latest version (1.1.0) and ran ./sbt/sbt assembly from the command line. I end up with the following error: [info] SHA-1: 20abd673d1e0690a6d5b64951868eef8d332d084 [info] Packaging

Re: How to initialize StateDStream

2014-09-13 Thread Soumitra Kumar
I had looked at that. If I have a set of saved word counts from a previous run and want to load them in the next run, what is the best way to do it? I am thinking of hacking the Spark code to have an initial RDD in StateDStream and use that for the first batch. On Fri, Sep 12, 2014 at 11:04

Re: ReduceByKey performance optimisation

2014-09-13 Thread Sean Owen
This is more concise: x.groupBy(_.fieldtobekey).values.map(_.head) ... but I doubt it's faster. If all objects with the same fieldtobekey are within the same partition, then yes, I imagine your biggest speedup comes from exploiting that. How about ... x.mapPartitions(_.map(obj =>
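
A completed version of that (truncated) mapPartitions idea, with a hypothetical Record class standing in for the real objects; the per-partition variant only removes all duplicates if records sharing a key sit in the same partition:

    import org.apache.spark.{SparkConf, SparkContext}

    object DedupByKeySketch {
      case class Record(fieldtobekey: String, payload: String)  // hypothetical schema

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("DedupByKeySketch"))
        val x = sc.parallelize(Seq(Record("k1", "a"), Record("k1", "b"), Record("k2", "c")))

        // Per-partition dedup: keeps the first record seen for each key in each partition.
        val dedupedLocal = x.mapPartitions { iter =>
          val seen = scala.collection.mutable.HashSet[String]()
          iter.filter(rec => seen.add(rec.fieldtobekey))  // HashSet.add returns false for duplicates
        }

        // Cluster-wide dedup, equivalent to the original reduceByKey((a, b) => a) approach.
        val dedupedGlobal = x.map(rec => (rec.fieldtobekey, rec)).reduceByKey((a, b) => a).values

        println(dedupedLocal.count() + " / " + dedupedGlobal.count())
        sc.stop()
      }
    }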

Re: RDDs and Immutability

2014-09-13 Thread Nicholas Chammas
Have you tried using RDD.map() to transform some of the RDD elements from 0 to 1? Why doesn’t that work? That’s how you change data in Spark, by defining a new RDD that’s a transformation of an old one. On Sat, Sep 13, 2014 at 5:39 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, We all
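
A small sketch of that map-based approach (the size and the positions to flip are made up): the "change" is expressed as a new RDD derived from the old one.

    import org.apache.spark.{SparkConf, SparkContext}

    object FlipBytesSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("FlipBytesSketch"))

        // An "array of bytes filled with zeros", indexed so positions can be addressed.
        val zeros = sc.parallelize(0 until 100).map(i => (i, 0.toByte))

        // Changing some elements to 1 means deriving a new RDD, not mutating the old one.
        val toFlip = Set(3, 7, 42)  // hypothetical positions
        val updated = zeros.map { case (i, b) => (i, if (toFlip.contains(i)) 1.toByte else b) }

        updated.filter(_._2 == 1).collect().foreach(println)
        sc.stop()
      }
    }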

Re: compiling spark source code

2014-09-13 Thread Ted Yu
bq. [error] (repl/compile:compile) Compilation failed Can you pastebin more of the output ? Cheers

Re: Nested Case Classes (Found and Required Same)

2014-09-13 Thread Ramaraju Indukuri
Upgraded to 1.1 and the issue is resolved. Thanks. I still wonder if there is a better way to approach a large attribute dataset. On Fri, Sep 12, 2014 at 12:20 PM, Prashant Sharma scrapco...@gmail.com wrote: What is your spark version ? This was fixed I suppose. Can you try it with latest

Re: ReduceByKey performance optimisation

2014-09-13 Thread Julien Carme
OK, mapPartitions seems to be the way to go. Thanks for the help! On 13 Sep 2014 at 16:41, Sean Owen so...@cloudera.com wrote: This is more concise: x.groupBy(_.fieldtobekey).values.map(_.head) ... but I doubt it's faster. If all objects with the same fieldtobekey are within the same

Write 1 RDD to multiple output paths in one go

2014-09-13 Thread Nick Chammas
Howdy doody Spark Users, I’d like to somehow write out a single RDD to multiple paths in one go. Here’s an example. I have an RDD of (key, value) pairs like this: a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda x: x[0]) a.collect() [('N', 'Nick'), ('N', 'Nancy'),
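
One commonly used approach on the Scala side (a sketch, not necessarily the answer given later in the thread): subclass Hadoop's MultipleTextOutputFormat so that each key selects its own output subdirectory, then write with saveAsHadoopFile. The output path is a placeholder.

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    // Route each record to a subdirectory named after its key.
    class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.toString + "/" + name
      override def generateActualKey(key: Any, value: Any): Any =
        NullWritable.get()  // don't repeat the key inside the output files
    }

    object MultiPathOutputSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("MultiPathOutputSketch"))
        val a = sc.parallelize(Seq("Nick", "Nancy", "Bob", "Ben", "Frankie")).keyBy(_.substring(0, 1))
        a.saveAsHadoopFile("/tmp/split-by-key", classOf[String], classOf[String], classOf[KeyBasedOutput])
        sc.stop()
      }
    }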

Re: compiling spark source code

2014-09-13 Thread kkptninja
Hi Ted, Thanks for the prompt reply :) please find details of the issue at this url: http://pastebin.com/Xt0hZ38q Kind Regards

Re: How to initialize StateDStream

2014-09-13 Thread qihong
I'm not sure what you mean by previous run. Is it the previous batch, or a previous run of spark-submit? If it's the previous batch (Spark Streaming creates a batch every batch interval), then there's nothing to do. If it's a previous run of spark-submit (assuming you are able to save the result somewhere),

Re: compiling spark source code

2014-09-13 Thread Ted Yu
bq. [error] File name too long It is not clear which file(s) loadfiles was loading. Is the filename in an earlier part of the output? Cheers On Sat, Sep 13, 2014 at 10:58 AM, kkptninja kkptni...@gmail.com wrote: Hi Ted, Thanks for the prompt reply :) please find details of the issue at this

Workload for spark testing

2014-09-13 Thread 牛兆捷
Hi All: We know that some of Spark's memory is used for computing (e.g., spark.shuffle.memoryFraction) and some is used for caching RDDs for future use (e.g., spark.storage.memoryFraction). Is there any existing workload which can utilize both of them during its running life cycle? I want to do some
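
A minimal sketch of a job that exercises both pools at once, under the assumption that the Spark 1.x fractions above are the knobs in play: caching the RDD uses storage memory, and the reduceByKey shuffle over it uses shuffle memory. The data and fraction values are arbitrary.

    import org.apache.spark.{SparkConf, SparkContext}

    object MixedMemoryWorkload {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("MixedMemoryWorkload")
          .set("spark.storage.memoryFraction", "0.5")   // pool for cached RDD blocks
          .set("spark.shuffle.memoryFraction", "0.3")   // pool for shuffle aggregation buffers
        val sc = new SparkContext(conf)

        // Caching this RDD consumes storage memory ...
        val data = sc.parallelize(1 to 10000000).map(i => (i % 1000, i.toLong)).cache()
        data.count()  // materialize the cache

        // ... and the shuffle behind reduceByKey consumes shuffle memory.
        println(data.reduceByKey(_ + _).count())

        sc.stop()
      }
    }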

Re: compiling spark source code

2014-09-13 Thread Yin Huai
Can you try sbt/sbt clean first? On Sat, Sep 13, 2014 at 4:29 PM, Ted Yu yuzhih...@gmail.com wrote: bq. [error] File name too long It is not clear which file(s) loadfiles was loading. Is the filename in earlier part of the output ? Cheers On Sat, Sep 13, 2014 at 10:58 AM, kkptninja

Re: How to initialize StateDStream

2014-09-13 Thread Soumitra Kumar
Thanks for the pointers. I meant a previous run of spark-submit. For 1: this would be a bit more computation in every batch. For 2: it's a good idea, but it may be inefficient to retrieve each value. In general, for a generic state machine the initialization and input sequence are critical for
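
One way to seed the state from a previous spark-submit run without patching StateDStream (a sketch only; it assumes the earlier run saved counts as word,count text at a made-up path): broadcast the saved counts and fall back to them the first time a key appears.

    import org.apache.spark.{HashPartitioner, SparkConf}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object SeededStatefulCounts {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("SeededStatefulCounts"), Seconds(10))
        ssc.checkpoint("checkpoint")

        // Counts saved by the previous run as "word,count" lines (hypothetical path and format).
        val previous = ssc.sparkContext
          .textFile("hdfs:///tmp/previous-word-counts")
          .map { line => val Array(w, c) = line.split(","); (w, c.toInt) }
          .collectAsMap()
        val seed = ssc.sparkContext.broadcast(previous)

        // Per-key update that starts from the saved count the first time a key is seen.
        val updateFunc = (entries: Iterator[(String, Seq[Int], Option[Int])]) =>
          entries.map { case (word, newValues, state) =>
            val base = state.getOrElse(seed.value.getOrElse(word, 0))
            (word, newValues.sum + base)
          }

        val counts = ssc.socketTextStream("localhost", 9999)
          .flatMap(_.split(" "))
          .map((_, 1))
          .updateStateByKey[Int](updateFunc, new HashPartitioner(ssc.sparkContext.defaultParallelism), true)

        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }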

Re: spark 1.1.0 unit tests fail

2014-09-13 Thread Andrew Or
Hi Koert, Thanks for reporting this. These tests have been flaky even on the master branch for a long time. You can safely disregard these test failures, as the root cause is port collisions from the many SparkContexts we create over the course of the entire test. There is a patch that fixes this

Spark SQL

2014-09-13 Thread rkishore999
val file = sc.textFile("hdfs://ec2-54-164-243-97.compute-1.amazonaws.com:9010/user/fin/events.txt") 1. val xyz = file.map(line => extractCurRate(sqlContext.sql("select rate from CurrencyCodeRates where txCurCode = '" + line.substring(202,205) + "' and fxCurCode = '" +