[VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 1.5.1. The vote is open until Sun, Sep 27, 2015 at 10:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.5.1 [ ] -1 Do not release this package because ...

Re: Get only updated RDDs from or after updateStateByKey

2015-09-24 Thread Bin Wang
Data that are not updated should have been saved earlier: when data is added to the DStream for the first time, it is considered updated, so saving the same data again is a waste. What is the community working on? Is there any doc or discussion I can look at? Thanks. Shixiong Zhu

Re: Get only updated RDDs from or after updateStateByKey

2015-09-24 Thread Shixiong Zhu
You can create the connection like this: val updateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => { val dbConnection = create a db connection iterator.flatMap { case (key, values, stateOption) => if (values.isEmpty) { // don't access database
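The preview cuts off the snippet; a minimal runnable sketch of the per-partition-connection pattern described here, suitable for pasting into spark-shell, might look like the following. DbConnection and createDbConnection() are hypothetical stand-ins for a real database client:

    // Hypothetical stand-ins for a real database client.
    trait DbConnection { def save(key: String, value: Int): Unit }
    def createDbConnection(): DbConnection = ???  // e.g. open a JDBC connection here

    val updateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
      // Opened lazily, at most once per partition per batch; a real
      // implementation would also close or pool the connection.
      lazy val conn = createDbConnection()
      iterator.flatMap { case (key, values, stateOption) =>
        if (values.isEmpty) {
          // No new values for this key in this batch: keep the old
          // state and do not touch the database.
          stateOption.map(state => (key, state))
        } else {
          val newState = values.sum + stateOption.getOrElse(0)
          conn.save(key, newState)  // write only keys that actually changed
          Some((key, newState))
        }
      }
    }

    // Wired up via the Iterator-based overload of updateStateByKey:
    // stream.updateStateByKey(updateFunc, new HashPartitioner(4), rememberPartitioner = true)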

Re: Get only updated RDDs from or after updateStateByKey

2015-09-24 Thread Shixiong Zhu
Could you write your update func like this? val updateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => { iterator.flatMap { case (key, values, stateOption) => if (values.isEmpty) { // don't access database } else { // update to new

Re: Get only updated RDDs from or after updateStateByKey

2015-09-24 Thread Bin Wang
It seems like a workaround, but I don't know how to get a database connection on the worker nodes. Shixiong Zhu wrote on Thursday, Sep 24, 2015 at 5:37 PM: > Could you write your update func like this? > > val updateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) > => { >

Re: Get only updated RDDs from or after updateStateByKey

2015-09-24 Thread Shixiong Zhu
For data that are not updated, where do you save them? Or do you only want to avoid accessing the database for entries that are not updated? Besides, the community is working on optimizing updateStateByKey's performance. Hope it will be delivered soon. Best Regards, Shixiong Zhu 2015-09-24 13:45

Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Sean Owen
+1 non-binding. This is the first time I've seen all tests pass the first time with Java 8 + Ubuntu + "-Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver". Clearly the test improvement efforts are paying off. As usual the license, sigs, etc are OK. On Thu, Sep 24, 2015 at 8:27 AM, Reynold Xin

Re: JENKINS: downtime next week, wed and thurs mornings (9-23 and 9-24)

2015-09-24 Thread shane knapp
this is happening now. On Tue, Sep 22, 2015 at 10:07 AM, shane knapp wrote: > ok, here's the updated downtime schedule for this week: > > wednesday, sept 23rd: > > firewall maintenance cancelled, as jon took care of the update > saturday morning while we were bringing

Re: Get only updated RDDs from or after updateStateByKey

2015-09-24 Thread Bin Wang
Thanks, that seems good, though it is a bit of a hack. And here is another question: updateStateByKey computes over all the data from the beginning, but in many situations we just need to update with the newly arrived data. This could be a big improvement in speed and resource usage. Will this be supported in the future? Shixiong
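The optimization alluded to above shipped later as mapWithState in Spark 1.6, which calls the state function only for keys that received new data in the current batch rather than recomputing over the entire state. A minimal sketch against that API, assuming a DStream[(String, Int)] named pairs:

    import org.apache.spark.streaming.{State, StateSpec}

    // Called only for keys with new data in the current batch.
    val mappingFunc = (key: String, value: Option[Int], state: State[Int]) => {
      val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
      state.update(sum)
      (key, sum)  // emitted only for keys updated in this batch
    }

    // val updatedOnly = pairs.mapWithState(StateSpec.function(mappingFunc))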

Re: JENKINS: downtime next week, wed and thurs mornings (9-23 and 9-24)

2015-09-24 Thread shane knapp
...and we're finished and now building! On Thu, Sep 24, 2015 at 7:19 AM, shane knapp wrote: > this is happening now. > > On Tue, Sep 22, 2015 at 10:07 AM, shane knapp wrote: >> ok, here's the updated downtime schedule for this week: >> >> wednesday,

Re: [Discuss] NOTICE file for transitive "NOTICE"s

2015-09-24 Thread Richard Hillegas
Thanks for forking the new email thread, Reynold. It is entirely possible that I am being overly skittish. I have posed a question for our legal experts: https://issues.apache.org/jira/browse/LEGAL-226 To answer Sean's question on the previous email thread, I would propose making changes like

Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Sean Owen
On Thu, Sep 24, 2015 at 6:45 PM, Richard Hillegas wrote: > Under your guidance, I would be happy to help compile a NOTICE file which > follows the pattern used by Derby and the JDK. This effort might proceed in > parallel with vetting 1.5.1 and could be targeted at a later

Re: [Discuss] NOTICE file for transitive "NOTICE"s

2015-09-24 Thread Sean Owen
Have a look at http://www.apache.org/dev/licensing-howto.html#mod-notice though, which makes a good point about limiting what goes into NOTICE to what is required. That's what makes me think we shouldn't do this. On Thu, Sep 24, 2015 at 7:24 PM, Richard Hillegas wrote: > To

Re: [Discuss] NOTICE file for transitive "NOTICE"s

2015-09-24 Thread Sean Owen
Yes, the issue of where 3rd-party license information goes is different, and varies by license. I think the BSD/MIT licenses are all already listed in LICENSE accordingly. Let me know if you spy an omission. On Thu, Sep 24, 2015 at 8:36 PM, Richard Hillegas wrote: > Thanks

[Discuss] NOTICE file for transitive "NOTICE"s

2015-09-24 Thread Reynold Xin
Richard, Thanks for bringing this up and this is a great point. Let's start another thread for it so we don't hijack the release thread. On Thu, Sep 24, 2015 at 10:51 AM, Sean Owen wrote: > On Thu, Sep 24, 2015 at 6:45 PM, Richard Hillegas > wrote: >

Re: SparkR package path

2015-09-24 Thread Shivaram Venkataraman
I don't think the crux of the problem is about users who download the source -- Spark's source distribution is clearly marked as something that needs to be built and they can run `mvn -DskipTests -Psparkr package` based on instructions in the Spark docs. The crux of the problem is that with a

Re: [Discuss] NOTICE file for transitive "NOTICE"s

2015-09-24 Thread Richard Hillegas
Thanks for that pointer, Sean. It may be that Derby is putting the license information in the wrong place, viz. in the NOTICE file. But the 3rd party license text may need to go somewhere else. See for instance the advice a little further up the page at

Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Richard Hillegas
Hi Sean and Wendell, I share your concerns about how difficult and important it is to get this right. I think that the Spark community has compiled a very readable and well organized NOTICE file. A lot of careful thought went into gathering together 3rd party projects which share the same

Re: SparkR package path

2015-09-24 Thread Hossein
Requiring users to download the entire Spark distribution to connect to a remote cluster (which is already running Spark) seems like overkill. Even for most Spark users who download the Spark source, it is very unintuitive that they need to run a script named "install-dev.sh" before they can run SparkR.

Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Patrick Wendell
Hey Richard, My assessment (just looked before I saw Sean's email) is the same as his. The NOTICE file embeds other projects' licenses. If those licenses themselves have pointers to other files or dependencies, we don't embed them. I think this is standard practice. - Patrick On Thu, Sep 24,

Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Reynold Xin
I'm going to +1 this myself. Tested on my laptop. On Thu, Sep 24, 2015 at 10:56 AM, Reynold Xin wrote: > I forked a new thread for this. Please discuss NOTICE file related things > there so it doesn't hijack this thread. > > > On Thu, Sep 24, 2015 at 10:51 AM, Sean Owen

Re: SparkR package path

2015-09-24 Thread Hossein
Right now in sparkR.R the backend hostname is hard coded to "localhost" ( https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L156). If we make that address configurable / parameterized, then a user can connect to a remote Spark cluster with no need to have Spark jars on their local

Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Xiangrui Meng
+1. Checked user guide and API doc, and ran some MLlib and SparkR examples. -Xiangrui On Thu, Sep 24, 2015 at 2:54 PM, Reynold Xin wrote: > I'm going to +1 this myself. Tested on my laptop. > > > > On Thu, Sep 24, 2015 at 10:56 AM, Reynold Xin wrote: >>

Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Hossein
+1 tested SparkR on Mac and Linux. --Hossein On Thu, Sep 24, 2015 at 3:10 PM, Xiangrui Meng wrote: > +1. Checked user guide and API doc, and ran some MLlib and SparkR > examples. -Xiangrui > > On Thu, Sep 24, 2015 at 2:54 PM, Reynold Xin wrote: > > I'm

Re: How to get the HDFS path for each RDD

2015-09-24 Thread Anchit Choudhry
Hi Fengdong, Thanks for your question. Spark already has a function called wholeTextFiles within SparkContext which can help you with that: Python hdfs://a-hdfs-path/part-0 hdfs://a-hdfs-path/part-1 ... hdfs://a-hdfs-path/part-n rdd =

Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Richard Hillegas
-1 (non-binding) I was able to build Spark cleanly from the source distribution using the command in README.md: build/mvn -DskipTests clean package However, while I was waiting for the build to complete, I started going through the NOTICE file. I was confused about where to find licenses

Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Sean Owen
Hi Richard, those are messages reproduced from other projects' NOTICE files, not created by Spark. They need to be reproduced in Spark's NOTICE file to comply with the license, but their text may or may not apply to Spark's distribution. The intent is that users would track this back to the source

Re: SparkR package path

2015-09-24 Thread Luciano Resende
For host information, are you looking for something like this (which is available today in Spark 1.5 already) ? # Spark related configuration Sys.setenv("SPARK_MASTER_IP"="127.0.0.1") Sys.setenv("SPARK_LOCAL_IP"="127.0.0.1") #Load libraries library("rJava") library(SparkR,

RE: SparkR package path

2015-09-24 Thread Sun, Rui
Yes, the current implementation requires the backend to be on the same host as the SparkR package. But this does not prevent SparkR from connecting to a remote Spark cluster specified by a Spark master URL. The only thing needed is a Spark JAR co-located with the SparkR package on

Re: How to get the HDFS path for each RDD

2015-09-24 Thread Fengdong Yu
Hi Anchit, Thanks for the quick answer. My exact question is: I want to add the HDFS location to each line of my JSON data. > On Sep 25, 2015, at 11:25, Anchit Choudhry wrote: > > Hi Fengdong, > > Thanks for your question. > > Spark already has a function

Re: How to get the HDFS path for each RDD

2015-09-24 Thread Anchit Choudhry
Hi Fengdong, So I created two files in HDFS under a test folder. test/dt=20100101.json { "key1" : "value1" } test/dt=20100102.json { "key2" : "value2" } Then inside PySpark shell rdd = sc.wholeTextFiles('./test/*') rdd.collect() [(u'hdfs://localhost:9000/user/hduser/test/dt=20100101.json',

Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Luciano Resende
+1 (non-binding) Compiled on Mac OS with: build/mvn -Pyarn,sparkr,hive,hive-thriftserver -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package. Checked around R. Looked into legal files. All looks good. On Thu, Sep 24, 2015 at 12:27 AM, Reynold Xin wrote: > Please

Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Joseph Bradley
+1 Tested MLlib on Mac OS X On Thu, Sep 24, 2015 at 6:14 PM, Reynold Xin wrote: > Krishna, > > Thanks for testing every release! > > > On Thu, Sep 24, 2015 at 6:08 PM, Krishna Sankar > wrote: > >> +1 (non-binding, of course) >> >> 1. Compiled OSX

Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Reynold Xin
Krishna, Thanks for testing every release! On Thu, Sep 24, 2015 at 6:08 PM, Krishna Sankar wrote: > +1 (non-binding, of course) > > 1. Compiled OSX 10.10 (Yosemite) OK Total time: 26:48 min > mvn clean package -Pyarn -Phadoop-2.6 -DskipTests > 2. Tested pyspark,

RE: SparkR package path

2015-09-24 Thread Sun, Rui
If a user downloads the Spark source, of course he needs to build it before running it. But a user can download a pre-built Spark binary distribution and then use SparkR directly after deploying the Spark cluster. From: Hossein [mailto:fal...@gmail.com] Sent: Friday, September 25, 2015 2:37

How to get the HDFS path for each RDD

2015-09-24 Thread Fengdong Yu
Hi, I have multiple files in JSON format, such as: /data/test1_data/sub100/test.data /data/test2_data/sub200/test.data I can sc.textFile("/data/*/*"), but I want to add {"source" : "HDFS_LOCATION"} to each line and then save it to one target HDFS location. How can I do that? Thanks.

Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Sean McNamara
Ran tests + built/ran an internal Spark Streaming app w/ 1.5.1 artifacts. +1 Cheers, Sean On Sep 24, 2015, at 1:28 AM, Reynold Xin wrote: Please vote on releasing the following candidate as Apache Spark version 1.5.1. The vote is open until

Re: How to get the HDFS path for each RDD

2015-09-24 Thread Fengdong Yu
Yes. Say I have two data sets: data set A: /data/test1/dt=20100101 data set B: /data/test2/dt=20100202 All data has the same JSON format, such as: {"key1" : "value1", "key2" : "value2"} My expected output: {"key1" : "value1", "key2" : "value2", "source" : "test1", "date" : "20100101"}
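
A minimal sketch of one way to produce that output with wholeTextFiles, assuming line-delimited JSON under the /data/<source>/dt=<date> layout above; the string splice is a naive stand-in for a real JSON library, and the output path is illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    object TagSource {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("tag-hdfs-source"))

        // Matches paths like hdfs://host/data/test1/dt=20100101/part-0
        val PathPattern = """.*/data/([^/]+)/dt=(\d+).*""".r

        // wholeTextFiles yields (path, fileContent) pairs, so each record
        // carries the HDFS location it came from.
        val tagged = sc.wholeTextFiles("/data/*/dt=*").flatMap { case (path, content) =>
          path match {
            case PathPattern(source, date) =>
              content.split("\n").filter(_.trim.nonEmpty).map { line =>
                // Naive splice: append the source and date fields before the
                // closing brace. A real job would parse with a JSON library.
                line.trim.stripSuffix("}") +
                  s""", "source" : "$source", "date" : "$date" }"""
              }.toSeq
            case _ => Seq.empty[String]
          }
        }

        tagged.saveAsTextFile("/data/tagged-output")
        sc.stop()
      }
    }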