Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Luciano Resende
+1 (non-binding)

Compiled on Mac OS with:
build/mvn -Pyarn,sparkr,hive,hive-thriftserver
-Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package

Checked around R
Looked into legal files

All looks good.


On Thu, Sep 24, 2015 at 12:27 AM, Reynold Xin  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.5.1. The vote is open until Sun, Sep 27, 2015 at 10:00 UTC and passes if
> a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.5.1
> [ ] -1 Do not release this package because ...
>
>
> The release fixes 81 known issues in Spark 1.5.0, listed here:
> http://s.apache.org/spark-1.5.1
>
> The tag to be voted on is v1.5.1-rc1:
>
> https://github.com/apache/spark/commit/4df97937dbf68a9868de58408b9be0bf87dbbb94
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release (1.5.1) can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1148/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-docs/
>
>
> ===
> How can I help test this release?
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> 
> What justifies a -1 vote for this release?
> 
> -1 vote should occur for regressions from Spark 1.5.0. Bugs already
> present in 1.5.0 will not block this release.
>
> ===
> What should happen to JIRA tickets still targeting 1.5.1?
> ===
> Please target 1.5.2 or 1.6.0.
>
>
>
>


-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: How to get the HDFS path for each RDD

2015-09-24 Thread Anchit Choudhry
Hi Fengdong,

So I created two files in HDFS under a test folder.

test/dt=20100101.json
{ "key1" : "value1" }

test/dt=20100102.json
{ "key2" : "value2" }

Then inside PySpark shell

rdd = sc.wholeTextFiles('./test/*')
rdd.collect()
[(u'hdfs://localhost:9000/user/hduser/test/dt=20100101.json', u'{ "key1" : "value1" }'),
 (u'hdfs://localhost:9000/user/hduser/test/dt=20100102.json', u'{ "key2" : "value2" }')]
import json
def editMe(y, x):
  j = json.loads(y)
  j['source'] = x
  return j

rdd.map(lambda (x,y): editMe(y,x)).collect()
[{'source': u'hdfs://localhost:9000/user/hduser/test/dt=20100101.json',
u'key1': u'value1'}, {u'key2': u'value2', 'source': u'hdfs://localhost
:9000/user/hduser/test/dt=20100102.json'}]

Similarly you could modify the function to return 'source' and 'date' with
some string manipulation per your requirements.
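A minimal sketch of that string manipulation, assuming the files keep the
dt=YYYYMMDD naming shown above (the enrich helper and its field names are
illustrative, not something fixed by Spark):

import json
import os

def enrich(path, content):
    # content is the file's JSON text, path is its full HDFS location.
    record = json.loads(content)
    record['source'] = path
    # Pull the date out of a path segment like '.../test/dt=20100101.json'.
    basename = os.path.basename(path)                  # 'dt=20100101.json'
    record['date'] = os.path.splitext(basename)[0].split('=')[-1]
    return record

# wholeTextFiles yields (path, content) pairs, so hand both to the helper.
rdd.map(lambda pair: enrich(pair[0], pair[1])).collect()
# [{'source': u'hdfs://.../test/dt=20100101.json', 'date': '20100101',
#   u'key1': u'value1'}, ...]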

Let me know if this helps.

Thanks,
Anchit


On 24 September 2015 at 23:55, Fengdong Yu  wrote:

>
> yes. such as I have two data sets:
>
> data set A: /data/test1/dt=20100101
> data set B: /data/test2/dt=20100202
>
>
> all data has the same JSON format, such as:
> {"key1" : "value1", "key2" : "value2" }
>
>
> my output expected:
> {"key1" : "value1", "key2" : "value2", "source" : "test1", "date" : "20100101"}
> {"key1" : "value1", "key2" : "value2", "source" : "test2", "date" : "20100202"}
>
>
> On Sep 25, 2015, at 11:52, Anchit Choudhry 
> wrote:
>
> Sure. May I ask for a sample input(could be just few lines) and the output
> you are expecting to bring clarity to my thoughts?
>
> On Thu, Sep 24, 2015, 23:44 Fengdong Yu  wrote:
>
>> Hi Anchit,
>>
>> Thanks for the quick answer.
>>
>> my exact question is : I want to add HDFS location into each line in my
>> JSON  data.
>>
>>
>>
>> On Sep 25, 2015, at 11:25, Anchit Choudhry 
>> wrote:
>>
>> Hi Fengdong,
>>
>> Thanks for your question.
>>
>> Spark already has a function called wholeTextFiles within sparkContext
>> which can help you with that:
>>
>> Python
>>
>> hdfs://a-hdfs-path/part-0
>> hdfs://a-hdfs-path/part-1
>> ...
>> hdfs://a-hdfs-path/part-n
>>
>> rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")
>>
>> (a-hdfs-path/part-0, its content)
>> (a-hdfs-path/part-1, its content)
>> ...
>> (a-hdfs-path/part-n, its content)
>>
>> More info: http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=wholetext#pyspark.SparkContext.wholeTextFiles
>>
>> 
>>
>> Scala
>>
>> val rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")
>>
>> More info: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext@wholeTextFiles(String,Int):RDD[(String,String)]
>>
>> Let us know if this helps or you need more help.
>>
>> Thanks,
>> Anchit Choudhry
>>
>> On 24 September 2015 at 23:12, Fengdong Yu 
>> wrote:
>>
>>> Hi,
>>>
>>> I have  multiple files with JSON format, such as:
>>>
>>> /data/test1_data/sub100/test.data
>>> /data/test2_data/sub200/test.data
>>>
>>>
>>> I can sc.textFile(“/data/*/*”)
>>>
>>> but I want to add the {“source” : “HDFS_LOCATION”} to each line, then
>>> save it the one target HDFS location.
>>>
>>> how to do it, Thanks.
>>>
>>>
>>>
>>>
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>>
>>
>>
>


Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Sean McNamara
Ran tests + built/ran an internal Spark Streaming app with the 1.5.1 artifacts.

+1

Cheers,

Sean


On Sep 24, 2015, at 1:28 AM, Reynold Xin <r...@databricks.com> wrote:

Please vote on releasing the following candidate as Apache Spark version 1.5.1. 
The vote is open until Sun, Sep 27, 2015 at 10:00 UTC and passes if a majority 
of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.5.1
[ ] -1 Do not release this package because ...


The release fixes 81 known issues in Spark 1.5.0, listed here:
http://s.apache.org/spark-1.5.1

The tag to be voted on is v1.5.1-rc1:
https://github.com/apache/spark/commit/4df97937dbf68a9868de58408b9be0bf87dbbb94

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release (1.5.1) can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1148/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-docs/


===
How can I help test this release?
===
If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then reporting 
any regressions.


What justifies a -1 vote for this release?

-1 vote should occur for regressions from Spark 1.5.0. Bugs already present in 
1.5.0 will not block this release.

===
What should happen to JIRA tickets still targeting 1.5.1?
===
Please target 1.5.2 or 1.6.0.






Re: How to get the HDFS path for each RDD

2015-09-24 Thread Fengdong Yu

Yes. For example, I have two data sets:

data set A: /data/test1/dt=20100101
data set B: /data/test2/dt=20100202


All data has the same JSON format, such as:
{"key1" : "value1", "key2" : "value2" }


My expected output:
{"key1" : "value1", "key2" : "value2", "source" : "test1", "date" : "20100101"}
{"key1" : "value1", "key2" : "value2", "source" : "test2", "date" : "20100202"}


> On Sep 25, 2015, at 11:52, Anchit Choudhry  wrote:
> 
> Sure. May I ask for a sample input(could be just few lines) and the output 
> you are expecting to bring clarity to my thoughts?
> 
> On Thu, Sep 24, 2015, 23:44 Fengdong Yu wrote:
> Hi Anchit, 
> 
> Thanks for the quick answer.
> 
> my exact question is : I want to add HDFS location into each line in my JSON  
> data.
> 
> 
> 
>> On Sep 25, 2015, at 11:25, Anchit Choudhry wrote:
>> 
>> Hi Fengdong,
>> 
>> Thanks for your question.
>> 
>> Spark already has a function called wholeTextFiles within sparkContext which 
>> can help you with that:
>> 
>> Python
>> hdfs://a-hdfs-path/part-0
>> hdfs://a-hdfs-path/part-1
>> ...
>> hdfs://a-hdfs-path/part-n
>> rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")
>> (a-hdfs-path/part-0, its content)
>> (a-hdfs-path/part-1, its content)
>> ...
>> (a-hdfs-path/part-n, its content)
>> More info: http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=wholetext#pyspark.SparkContext.wholeTextFiles
>> 
>> 
>> 
>> Scala
>> 
>> val rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")
>> 
>> More info: 
>> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext@wholeTextFiles(String,Int):RDD[(String,String)]
>>  
>> Let us know if this helps or you need more help.
>> 
>> Thanks,
>> Anchit Choudhry
>> 
>> On 24 September 2015 at 23:12, Fengdong Yu wrote:
>> Hi,
>> 
>> I have  multiple files with JSON format, such as:
>> 
>> /data/test1_data/sub100/test.data
>> /data/test2_data/sub200/test.data
>> 
>> 
>> I can sc.textFile(“/data/*/*”)
>> 
>> but I want to add the {“source” : “HDFS_LOCATION”} to each line, then save 
>> it the one target HDFS location.
>> 
>> how to do it, Thanks.
>> 
>> 
>> 
>> 
>> 
>> 
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
>> 
>> For additional commands, e-mail: dev-h...@spark.apache.org 
>> 
>> 
>> 
> 



Re: How to get the HDFS path for each RDD

2015-09-24 Thread Anchit Choudhry
Sure. May I ask for a sample input (could be just a few lines) and the output
you are expecting, to bring clarity to my thoughts?

On Thu, Sep 24, 2015, 23:44 Fengdong Yu  wrote:

> Hi Anchit,
>
> Thanks for the quick answer.
>
> my exact question is : I want to add HDFS location into each line in my
> JSON  data.
>
>
>
> On Sep 25, 2015, at 11:25, Anchit Choudhry 
> wrote:
>
> Hi Fengdong,
>
> Thanks for your question.
>
> Spark already has a function called wholeTextFiles within sparkContext
> which can help you with that:
>
> Python
>
> hdfs://a-hdfs-path/part-0
> hdfs://a-hdfs-path/part-1
> ...
> hdfs://a-hdfs-path/part-n
>
> rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")
>
> (a-hdfs-path/part-0, its content)
> (a-hdfs-path/part-1, its content)
> ...
> (a-hdfs-path/part-n, its content)
>
> More info: http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=wholetext#pyspark.SparkContext.wholeTextFiles
>
> 
>
> Scala
>
> val rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")
>
> More info: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext@wholeTextFiles(String,Int):RDD[(String,String)]
>
> Let us know if this helps or you need more help.
>
> Thanks,
> Anchit Choudhry
>
> On 24 September 2015 at 23:12, Fengdong Yu 
> wrote:
>
>> Hi,
>>
>> I have  multiple files with JSON format, such as:
>>
>> /data/test1_data/sub100/test.data
>> /data/test2_data/sub200/test.data
>>
>>
>> I can sc.textFile(“/data/*/*”)
>>
>> but I want to add the {“source” : “HDFS_LOCATION”} to each line, then
>> save it the one target HDFS location.
>>
>> how to do it, Thanks.
>>
>>
>>
>>
>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>
>


Re: How to get the HDFS path for each RDD

2015-09-24 Thread Fengdong Yu
Hi Anchit, 

Thanks for the quick answer.

My exact question is: I want to add the HDFS location to each line in my JSON
data.


> On Sep 25, 2015, at 11:25, Anchit Choudhry  wrote:
> 
> Hi Fengdong,
> 
> Thanks for your question.
> 
> Spark already has a function called wholeTextFiles within sparkContext which 
> can help you with that:
> 
> Python
> hdfs://a-hdfs-path/part-0
> hdfs://a-hdfs-path/part-1
> ...
> hdfs://a-hdfs-path/part-n
> rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")
> (a-hdfs-path/part-0, its content)
> (a-hdfs-path/part-1, its content)
> ...
> (a-hdfs-path/part-n, its content)
> More info: http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=wholetext#pyspark.SparkContext.wholeTextFiles
> 
> 
> 
> Scala
> 
> val rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")
> 
> More info: 
> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext@wholeTextFiles(String,Int):RDD[(String,String)]
>  
> Let us know if this helps or you need more help.
> 
> Thanks,
> Anchit Choudhry
> 
> On 24 September 2015 at 23:12, Fengdong Yu wrote:
> Hi,
> 
> I have  multiple files with JSON format, such as:
> 
> /data/test1_data/sub100/test.data
> /data/test2_data/sub200/test.data
> 
> 
> I can sc.textFile(“/data/*/*”)
> 
> but I want to add the {“source” : “HDFS_LOCATION”} to each line, then save it 
> the one target HDFS location.
> 
> how to do it, Thanks.
> 
> 
> 
> 
> 
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> 
> For additional commands, e-mail: dev-h...@spark.apache.org 
> 
> 
> 



Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Joseph Bradley
+1  Tested MLlib on Mac OS X

On Thu, Sep 24, 2015 at 6:14 PM, Reynold Xin  wrote:

> Krishna,
>
> Thanks for testing every release!
>
>
> On Thu, Sep 24, 2015 at 6:08 PM, Krishna Sankar 
> wrote:
>
>> +1 (non-binding, of course)
>>
>> 1. Compiled OSX 10.10 (Yosemite) OK Total time: 26:48 min
>>  mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
>> 2. Tested pyspark, mllib (iPython 4.0, FYI, notebook install is separate
>> “conda install python” and then “conda install jupyter”)
>> 2.1. statistics (min,max,mean,Pearson,Spearman) OK
>> 2.2. Linear/Ridge/Lasso Regression OK
>> 2.3. Decision Tree, Naive Bayes OK
>> 2.4. KMeans OK
>>Center And Scale OK
>> 2.5. RDD operations OK
>>   State of the Union Texts - MapReduce, Filter,sortByKey (word count)
>> 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
>>Model evaluation/optimization (rank, numIter, lambda) with
>> itertools OK
>> 3. Scala - MLlib
>> 3.1. statistics (min,max,mean,Pearson,Spearman) OK
>> 3.2. LinearRegressionWithSGD OK
>> 3.3. Decision Tree OK
>> 3.4. KMeans OK
>> 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
>> 3.6. saveAsParquetFile OK
>> 3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
>> registerTempTable, sql OK
>> 3.8. result = sqlContext.sql("SELECT
>> OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
>> JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
>> 4.0. Spark SQL from Python OK
>> 4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
>> 5.0. Packages
>> 5.1. com.databricks.spark.csv - read/write OK (--packages
>> com.databricks:spark-csv_2.10:1.2.0)
>> 6.0. DataFrames
>> 6.1. cast,dtypes OK
>> 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
>> 6.3. All joins,sql,set operations,udf OK
>> *Notes:*
>> 1. Speed improvement in DataFrame functions groupBy, avg,sum et al. *Good
>> work*. I am working on a project to reduce processing time from ~24 hrs
>> to ... Let us see what Spark does. The speedups would help a lot.
>> 2. FYI, UDFs getM and getY work now (Thanks). Slower; saturates the CPU.
>> A non-scientific snapshot below. I know that this really has to be done
>> more rigorously, on a bigger machine, with more cores et al..
>> [image: Inline image 1] [image: Inline image 2]
>>
>> On Thu, Sep 24, 2015 at 12:27 AM, Reynold Xin 
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 1.5.1. The vote is open until Sun, Sep 27, 2015 at 10:00 UTC and passes if
>>> a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 1.5.1
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> The release fixes 81 known issues in Spark 1.5.0, listed here:
>>> http://s.apache.org/spark-1.5.1
>>>
>>> The tag to be voted on is v1.5.1-rc1:
>>>
>>> https://github.com/apache/spark/commit/4df97937dbf68a9868de58408b9be0bf87dbbb94
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release (1.5.1) can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1148/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-docs/
>>>
>>>
>>> ===
>>> How can I help test this release?
>>> ===
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> 
>>> What justifies a -1 vote for this release?
>>> 
>>> -1 vote should occur for regressions from Spark 1.5.0. Bugs already
>>> present in 1.5.0 will not block this release.
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 1.5.1?
>>> ===
>>> Please target 1.5.2 or 1.6.0.
>>>
>>>
>>>
>>>
>>
>


Re: How to get the HDFS path for each RDD

2015-09-24 Thread Anchit Choudhry
Hi Fengdong,

Thanks for your question.

Spark already has a function called wholeTextFiles within sparkContext
which can help you with that:

Python

hdfs://a-hdfs-path/part-0
hdfs://a-hdfs-path/part-1
...
hdfs://a-hdfs-path/part-n

rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")

(a-hdfs-path/part-0, its content)
(a-hdfs-path/part-1, its content)
...
(a-hdfs-path/part-n, its content)

More info: http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=wholetext#pyspark.SparkContext.wholeTextFiles



Scala

val rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")

More info: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext@wholeTextFiles(String,Int):RDD[(String,String)]

Let us know if this helps or you need more help.

Thanks,
Anchit Choudhry

On 24 September 2015 at 23:12, Fengdong Yu  wrote:

> Hi,
>
> I have  multiple files with JSON format, such as:
>
> /data/test1_data/sub100/test.data
> /data/test2_data/sub200/test.data
>
>
> I can sc.textFile(“/data/*/*”)
>
> but I want to add the {“source” : “HDFS_LOCATION”} to each line, then save
> it the one target HDFS location.
>
> how to do it, Thanks.
>
>
>
>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


How to get the HDFS path for each RDD

2015-09-24 Thread Fengdong Yu
Hi,

I have multiple files in JSON format, such as:

/data/test1_data/sub100/test.data
/data/test2_data/sub200/test.data


I can sc.textFile("/data/*/*"),

but I want to add {"source" : "HDFS_LOCATION"} to each line and then save it
all to one target HDFS location.

How to do it? Thanks.
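A minimal end-to-end PySpark sketch of the approach discussed in the replies
above, assuming each file holds one JSON object per line and that tagging each
record with the full HDFS path of its file is acceptable (the application name
and output path below are placeholders, not from the thread):

import json

from pyspark import SparkContext

sc = SparkContext(appName="tag-json-with-source")   # placeholder app name

# wholeTextFiles yields one (hdfs_path, file_content) pair per file.
files = sc.wholeTextFiles("/data/*/*")

def tag_lines(path, content):
    for line in content.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        record["source"] = path          # the HDFS location of this file
        yield json.dumps(record)

tagged = files.flatMap(lambda pair: tag_lines(pair[0], pair[1]))

# Write everything under a single target location (placeholder path).
tagged.saveAsTextFile("/data/output/tagged")

Note that wholeTextFiles reads each file into memory in full, so this suits
many small files better than a few very large ones.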






-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



RE: SparkR package path

2015-09-24 Thread Sun, Rui
Yes, the current implementation requires the backend to be on the same host as
the SparkR package. But this does not prevent SparkR from connecting to a remote
Spark cluster specified by a Spark master URL. The only thing needed is a Spark
JAR co-located with the SparkR package on the same client machine. This is
similar to any Spark application, which also depends on the Spark JAR.

Theoretically, since the SparkR package communicates with the backend over a
socket, the backend could run on a different host. But this would make launching
SparkR more complex, require a non-trivial change to spark-submit, and incur
additional network traffic overhead. I can't see a compelling demand for this.

From: Hossein [mailto:fal...@gmail.com]
Sent: Friday, September 25, 2015 5:09 AM
To: shiva...@eecs.berkeley.edu
Cc: Sun, Rui; dev@spark.apache.org; Dan Putler
Subject: Re: SparkR package path

Right now in sparkR.R the backend hostname is hard coded to "localhost" 
(https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L156).

If we make that address configurable / parameterized, then a user can connect a 
remote Spark cluster with no need to have spark jars on their local machine. I 
have got this request from some R users. Their company has a Spark cluster 
(usually managed by another team), and they want to connect to it from their 
workstation (e.g., from within RStudio, etc).



--Hossein

On Thu, Sep 24, 2015 at 12:25 PM, Shivaram Venkataraman <shiva...@eecs.berkeley.edu> wrote:
I don't think the crux of the problem is about users who download the
source -- Spark's source distribution is clearly marked as something
that needs to be built and they can run `mvn -DskipTests -Psparkr
package` based on instructions in the Spark docs.

The crux of the problem is that with a source or binary R package, the
client side the SparkR code needs the Spark JARs to be available. So
we can't just connect to a remote Spark cluster using just the R
scripts as we need the Scala classes around to create a Spark context
etc.

But this is a use case that I've heard from a lot of users -- my take
is that this should be a separate package / layer on top of SparkR.
Dan Putler (cc'd) had a proposal on a client package for this and
maybe able to add more.

Thanks
Shivaram

On Thu, Sep 24, 2015 at 11:36 AM, Hossein <fal...@gmail.com> wrote:
> Requiring users to download entire Spark distribution to connect to a remote
> cluster (which is already running Spark) seems an over kill. Even for most
> spark users who download Spark source, it is very unintuitive that they need
> to run a script named "install-dev.sh" before they can run SparkR.
>
> --Hossein
>
> On Wed, Sep 23, 2015 at 7:28 PM, Sun, Rui <rui@intel.com> wrote:
>>
>> SparkR package is not a standalone R package, as it is actually R API of
>> Spark and needs to co-operate with a matching version of Spark, so exposing
>> it in CRAN does not ease use of R users as they need to download matching
>> Spark distribution, unless we expose a bundled SparkR package to CRAN
>> (packageing with Spark), is this desirable? Actually, for normal users who
>> are not developers, they are not required to download Spark source, build
>> and install SparkR package. They just need to download a Spark distribution,
>> and then use SparkR.
>>
>>
>>
>> For using SparkR in Rstudio, there is a documentation at
>> https://github.com/apache/spark/tree/master/R
>>
>>
>>
>>
>>
>>
>>
>> From: Hossein [mailto:fal...@gmail.com]
>> Sent: Thursday, September 24, 2015 1:42 AM
>> To: shiva...@eecs.berkeley.edu
>> Cc: Sun, Rui; dev@spark.apache.org
>> Subject: Re: SparkR package path
>>
>>
>>
>> Yes, I think exposing SparkR in CRAN can significantly expand the reach of
>> both SparkR and Spark itself to a larger community of data scientists (and
>> statisticians).
>>
>>
>>
>> I have been getting questions on how to use SparkR in RStudio. Most of
>> these folks have a Spark Cluster and wish to talk to it from RStudio. While
>> that is a bigger task, for now, first step could be not requiring them to
>> download Spark source and run a script that is named install-dev.sh. I filed
>> SPARK-10776 to track this.
>>
>>
>>
>>
>> --Hossein
>>
>>
>>
>> On Tue, Sep 22, 2015 at 7:21 PM, Shivaram Venkataraman <shiva...@eecs.berkeley.edu> wrote:
>>
>> As Rui says it would be good to understand the use case we want to
>> support (supporting CRAN installs could be one for example). I don't
>> think it should be very hard to do as the RBackend itself doesn't use
>> the R source files. The RRDD does use it and the value comes from
>>
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RUtils.scala#L29
>> AFAIK -- So we could introduce a new config flag that can be used for
>> this new mode.
>>
>> Thanks
>> Shivaram
>>
>>
>> On Mo

RE: SparkR package path

2015-09-24 Thread Sun, Rui
If a user downloads the Spark source, of course he needs to build it before
running it. But a user can download a pre-built Spark binary distribution and
then use SparkR directly after deploying the Spark cluster.

From: Hossein [mailto:fal...@gmail.com]
Sent: Friday, September 25, 2015 2:37 AM
To: Sun, Rui
Cc: shiva...@eecs.berkeley.edu; dev@spark.apache.org
Subject: Re: SparkR package path

Requiring users to download entire Spark distribution to connect to a remote 
cluster (which is already running Spark) seems an over kill. Even for most 
spark users who download Spark source, it is very unintuitive that they need to 
run a script named "install-dev.sh" before they can run SparkR.

--Hossein

On Wed, Sep 23, 2015 at 7:28 PM, Sun, Rui <rui@intel.com> wrote:
The SparkR package is not a standalone R package: it is actually the R API of
Spark and needs to cooperate with a matching version of Spark. So exposing it on
CRAN does not make things easier for R users, since they would still need to
download a matching Spark distribution, unless we publish to CRAN a SparkR
package bundled with Spark. Is that desirable? Actually, normal users who are not
developers are not required to download the Spark source or build and install
the SparkR package. They just need to download a Spark distribution and then use
SparkR.

For using SparkR in Rstudio, there is a documentation at 
https://github.com/apache/spark/tree/master/R



From: Hossein [mailto:fal...@gmail.com]
Sent: Thursday, September 24, 2015 1:42 AM
To: shiva...@eecs.berkeley.edu
Cc: Sun, Rui; dev@spark.apache.org
Subject: Re: SparkR package path

Yes, I think exposing SparkR in CRAN can significantly expand the reach of both 
SparkR and Spark itself to a larger community of data scientists (and 
statisticians).

I have been getting questions on how to use SparkR in RStudio. Most of these 
folks have a Spark Cluster and wish to talk to it from RStudio. While that is a 
bigger task, for now, first step could be not requiring them to download Spark 
source and run a script that is named install-dev.sh. I filed SPARK-10776 to 
track this.


--Hossein

On Tue, Sep 22, 2015 at 7:21 PM, Shivaram Venkataraman <shiva...@eecs.berkeley.edu> wrote:
As Rui says it would be good to understand the use case we want to
support (supporting CRAN installs could be one for example). I don't
think it should be very hard to do as the RBackend itself doesn't use
the R source files. The RRDD does use it and the value comes from
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RUtils.scala#L29
AFAIK -- So we could introduce a new config flag that can be used for
this new mode.

Thanks
Shivaram

On Mon, Sep 21, 2015 at 8:15 PM, Sun, Rui <rui@intel.com> wrote:
> Hossein,
>
>
>
> Any strong reason to download and install SparkR source package separately
> from the Spark distribution?
>
> An R user can simply download the spark distribution, which contains SparkR
> source and binary package, and directly use sparkR. No need to install
> SparkR package at all.
>
>
>
> From: Hossein [mailto:fal...@gmail.com]
> Sent: Tuesday, September 22, 2015 9:19 AM
> To: dev@spark.apache.org
> Subject: SparkR package path
>
>
>
> Hi dev list,
>
>
>
> SparkR backend assumes SparkR source files are located under
> "SPARK_HOME/R/lib/." This directory is created by running R/install-dev.sh.
> This setting makes sense for Spark developers, but if an R user downloads
> and installs SparkR source package, the source files are going to be in
> placed different locations.
>
>
>
> In the R runtime it is easy to find location of package files using
> path.package("SparkR"). But we need to make some changes to R backend and/or
> spark-submit so that, JVM process learns the location of worker.R and
> daemon.R and shell.R from the R runtime.
>
>
>
> Do you think this change is feasible?
>
>
>
> Thanks,
>
> --Hossein




Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Reynold Xin
Krishna,

Thanks for testing every release!


On Thu, Sep 24, 2015 at 6:08 PM, Krishna Sankar  wrote:

> +1 (non-binding, of course)
>
> 1. Compiled OSX 10.10 (Yosemite) OK Total time: 26:48 min
>  mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
> 2. Tested pyspark, mllib (iPython 4.0, FYI, notebook install is separate
> “conda install python” and then “conda install jupyter”)
> 2.1. statistics (min,max,mean,Pearson,Spearman) OK
> 2.2. Linear/Ridge/Lasso Regression OK
> 2.3. Decision Tree, Naive Bayes OK
> 2.4. KMeans OK
>Center And Scale OK
> 2.5. RDD operations OK
>   State of the Union Texts - MapReduce, Filter,sortByKey (word count)
> 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
>Model evaluation/optimization (rank, numIter, lambda) with
> itertools OK
> 3. Scala - MLlib
> 3.1. statistics (min,max,mean,Pearson,Spearman) OK
> 3.2. LinearRegressionWithSGD OK
> 3.3. Decision Tree OK
> 3.4. KMeans OK
> 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
> 3.6. saveAsParquetFile OK
> 3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
> registerTempTable, sql OK
> 3.8. result = sqlContext.sql("SELECT
> OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
> JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
> 4.0. Spark SQL from Python OK
> 4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
> 5.0. Packages
> 5.1. com.databricks.spark.csv - read/write OK (--packages
> com.databricks:spark-csv_2.10:1.2.0)
> 6.0. DataFrames
> 6.1. cast,dtypes OK
> 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
> 6.3. All joins,sql,set operations,udf OK
> *Notes:*
> 1. Speed improvement in DataFrame functions groupBy, avg,sum et al. *Good
> work*. I am working on a project to reduce processing time from ~24 hrs
> to ... Let us see what Spark does. The speedups would help a lot.
> 2. FYI, UDFs getM and getY work now (Thanks). Slower; saturates the CPU. A
> non-scientific snapshot below. I know that this really has to be done more
> rigorously, on a bigger machine, with more cores et al..
> [image: Inline image 1] [image: Inline image 2]
>
> On Thu, Sep 24, 2015 at 12:27 AM, Reynold Xin  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.5.1. The vote is open until Sun, Sep 27, 2015 at 10:00 UTC and passes if
>> a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.5.1
>> [ ] -1 Do not release this package because ...
>>
>>
>> The release fixes 81 known issues in Spark 1.5.0, listed here:
>> http://s.apache.org/spark-1.5.1
>>
>> The tag to be voted on is v1.5.1-rc1:
>>
>> https://github.com/apache/spark/commit/4df97937dbf68a9868de58408b9be0bf87dbbb94
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release (1.5.1) can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1148/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-docs/
>>
>>
>> ===
>> How can I help test this release?
>> ===
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> 
>> What justifies a -1 vote for this release?
>> 
>> -1 vote should occur for regressions from Spark 1.5.0. Bugs already
>> present in 1.5.0 will not block this release.
>>
>> ===
>> What should happen to JIRA tickets still targeting 1.5.1?
>> ===
>> Please target 1.5.2 or 1.6.0.
>>
>>
>>
>>
>


Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Krishna Sankar
+1 (non-binding, of course)

1. Compiled OSX 10.10 (Yosemite) OK Total time: 26:48 min
 mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
2. Tested pyspark, mllib (iPython 4.0, FYI, notebook install is separate
“conda install python” and then “conda install jupyter”)
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
   Center And Scale OK
2.5. RDD operations OK
  State of the Union Texts - MapReduce, Filter,sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
   Model evaluation/optimization (rank, numIter, lambda) with itertools
OK
3. Scala - MLlib
3.1. statistics (min,max,mean,Pearson,Spearman) OK
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
3.6. saveAsParquetFile OK
3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
registerTempTable, sql OK
3.8. result = sqlContext.sql("SELECT
OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
4.0. Spark SQL from Python OK
4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
5.0. Packages
5.1. com.databricks.spark.csv - read/write OK (--packages
com.databricks:spark-csv_2.10:1.2.0)
6.0. DataFrames
6.1. cast,dtypes OK
6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
6.3. All joins,sql,set operations,udf OK
*Notes:*
1. Speed improvement in DataFrame functions groupBy, avg,sum et al. *Good
work*. I am working on a project to reduce processing time from ~24 hrs to
... Let us see what Spark does. The speedups would help a lot.
2. FYI, UDFs getM and getY work now (Thanks). Slower; saturates the CPU. A
non-scientific snapshot below. I know that this really has to be done more
rigorously, on a bigger machine, with more cores et al..
[image: Inline image 1] [image: Inline image 2]

On Thu, Sep 24, 2015 at 12:27 AM, Reynold Xin  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.5.1. The vote is open until Sun, Sep 27, 2015 at 10:00 UTC and passes if
> a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.5.1
> [ ] -1 Do not release this package because ...
>
>
> The release fixes 81 known issues in Spark 1.5.0, listed here:
> http://s.apache.org/spark-1.5.1
>
> The tag to be voted on is v1.5.1-rc1:
>
> https://github.com/apache/spark/commit/4df97937dbf68a9868de58408b9be0bf87dbbb94
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release (1.5.1) can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1148/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-docs/
>
>
> ===
> How can I help test this release?
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> 
> What justifies a -1 vote for this release?
> 
> -1 vote should occur for regressions from Spark 1.5.0. Bugs already
> present in 1.5.0 will not block this release.
>
> ===
> What should happen to JIRA tickets still targeting 1.5.1?
> ===
> Please target 1.5.2 or 1.6.0.
>
>
>
>


Re: SparkR package path

2015-09-24 Thread Luciano Resende
For host information, are you looking for something like this (which is
already available in Spark 1.5)?

# Spark related configuration
Sys.setenv("SPARK_MASTER_IP"="127.0.0.1")
Sys.setenv("SPARK_LOCAL_IP"="127.0.0.1")

#Load libraries
library("rJava")
library(SparkR, lib.loc="/./spark-bin/R/lib")

#Initialize Spark context
sc <- sparkR.init(sparkHome = "/./spark-bin",
sparkPackages="com.databricks:spark-csv_2.11:1.2.0")



On Thu, Sep 24, 2015 at 2:09 PM, Hossein  wrote:

> Right now in sparkR.R the backend hostname is hard coded to "localhost" (
> https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L156).
>
> If we make that address configurable / parameterized, then a user can
> connect a remote Spark cluster with no need to have spark jars on their
> local machine. I have got this request from some R users. Their company has
> a Spark cluster (usually managed by another team), and they want to connect
> to it from their workstation (e.g., from within RStudio, etc).
>
>
>
> --Hossein
>
> On Thu, Sep 24, 2015 at 12:25 PM, Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
>> I don't think the crux of the problem is about users who download the
>> source -- Spark's source distribution is clearly marked as something
>> that needs to be built and they can run `mvn -DskipTests -Psparkr
>> package` based on instructions in the Spark docs.
>>
>> The crux of the problem is that with a source or binary R package, the
>> client side the SparkR code needs the Spark JARs to be available. So
>> we can't just connect to a remote Spark cluster using just the R
>> scripts as we need the Scala classes around to create a Spark context
>> etc.
>>
>> But this is a use case that I've heard from a lot of users -- my take
>> is that this should be a separate package / layer on top of SparkR.
>> Dan Putler (cc'd) had a proposal on a client package for this and
>> maybe able to add more.
>>
>> Thanks
>> Shivaram
>>
>> On Thu, Sep 24, 2015 at 11:36 AM, Hossein  wrote:
>> > Requiring users to download entire Spark distribution to connect to a
>> remote
>> > cluster (which is already running Spark) seems an over kill. Even for
>> most
>> > spark users who download Spark source, it is very unintuitive that they
>> need
>> > to run a script named "install-dev.sh" before they can run SparkR.
>> >
>> > --Hossein
>> >
>> > On Wed, Sep 23, 2015 at 7:28 PM, Sun, Rui  wrote:
>> >>
>> >> SparkR package is not a standalone R package, as it is actually R API
>> of
>> >> Spark and needs to co-operate with a matching version of Spark, so
>> exposing
>> >> it in CRAN does not ease use of R users as they need to download
>> matching
>> >> Spark distribution, unless we expose a bundled SparkR package to CRAN
>> >> (packageing with Spark), is this desirable? Actually, for normal users
>> who
>> >> are not developers, they are not required to download Spark source,
>> build
>> >> and install SparkR package. They just need to download a Spark
>> distribution,
>> >> and then use SparkR.
>> >>
>> >>
>> >>
>> >> For using SparkR in Rstudio, there is a documentation at
>> >> https://github.com/apache/spark/tree/master/R
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> From: Hossein [mailto:fal...@gmail.com]
>> >> Sent: Thursday, September 24, 2015 1:42 AM
>> >> To: shiva...@eecs.berkeley.edu
>> >> Cc: Sun, Rui; dev@spark.apache.org
>> >> Subject: Re: SparkR package path
>> >>
>> >>
>> >>
>> >> Yes, I think exposing SparkR in CRAN can significantly expand the
>> reach of
>> >> both SparkR and Spark itself to a larger community of data scientists
>> (and
>> >> statisticians).
>> >>
>> >>
>> >>
>> >> I have been getting questions on how to use SparkR in RStudio. Most of
>> >> these folks have a Spark Cluster and wish to talk to it from RStudio.
>> While
>> >> that is a bigger task, for now, first step could be not requiring them
>> to
>> >> download Spark source and run a script that is named install-dev.sh. I
>> filed
>> >> SPARK-10776 to track this.
>> >>
>> >>
>> >>
>> >>
>> >> --Hossein
>> >>
>> >>
>> >>
>> >> On Tue, Sep 22, 2015 at 7:21 PM, Shivaram Venkataraman
>> >>  wrote:
>> >>
>> >> As Rui says it would be good to understand the use case we want to
>> >> support (supporting CRAN installs could be one for example). I don't
>> >> think it should be very hard to do as the RBackend itself doesn't use
>> >> the R source files. The RRDD does use it and the value comes from
>> >>
>> >>
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RUtils.scala#L29
>> >> AFAIK -- So we could introduce a new config flag that can be used for
>> >> this new mode.
>> >>
>> >> Thanks
>> >> Shivaram
>> >>
>> >>
>> >> On Mon, Sep 21, 2015 at 8:15 PM, Sun, Rui  wrote:
>> >> > Hossein,
>> >> >
>> >> >
>> >> >
>> >> > Any strong reason to download and install SparkR source package
>> >> > separately
>> >> > from the Spark distribution?
>> >> >
>> >> > An R user can simply download the spark distributio

Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Hossein
+1 tested SparkR on Mac and Linux.

--Hossein

On Thu, Sep 24, 2015 at 3:10 PM, Xiangrui Meng  wrote:

> +1. Checked user guide and API doc, and ran some MLlib and SparkR
> examples. -Xiangrui
>
> On Thu, Sep 24, 2015 at 2:54 PM, Reynold Xin  wrote:
> > I'm going to +1 this myself. Tested on my laptop.
> >
> >
> >
> > On Thu, Sep 24, 2015 at 10:56 AM, Reynold Xin 
> wrote:
> >>
> >> I forked a new thread for this. Please discuss NOTICE file related
> things
> >> there so it doesn't hijack this thread.
> >>
> >>
> >> On Thu, Sep 24, 2015 at 10:51 AM, Sean Owen  wrote:
> >>>
> >>> On Thu, Sep 24, 2015 at 6:45 PM, Richard Hillegas 
> >>> wrote:
> >>> > Under your guidance, I would be happy to help compile a NOTICE file
> >>> > which
> >>> > follows the pattern used by Derby and the JDK. This effort might
> >>> > proceed in
> >>> > parallel with vetting 1.5.1 and could be targeted at a later release
> >>> > vehicle. I don't think that the ASF's exposure is greatly increased
> by
> >>> > one
> >>> > more release which follows the old pattern.
> >>>
> >>> I'd prefer to use the ASF's preferred pattern, no? That's what we've
> >>> been trying to do and seems like we're even required to do so, not
> >>> follow a different convention. There is some specific guidance there
> >>> about what to add, and not add, to these files. Specifically, because
> >>> the AL2 requires downstream projects to embed the contents of NOTICE,
> >>> the guidance is to only include elements in NOTICE that must appear
> >>> there.
> >>>
> >>> Put it this way -- what would you like to change specifically? (you
> >>> can start another thread for that)
> >>>
> >>> >> My assessment (just looked before I saw Sean's email) is the same as
> >>> >> his. The NOTICE file embeds other projects' licenses.
> >>> >
> >>> > This may be where our perspectives diverge. I did not find those
> >>> > licenses
> >>> > embedded in the NOTICE file. As I see it, the licenses are cited but
> >>> > not
> >>> > included.
> >>>
> >>> Pretty sure that was meant to say that NOTICE embeds other projects'
> >>> "notices", not licenses. And those notices can have all kinds of
> >>> stuff, including licenses.
> >>>
> >>> -
> >>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >>> For additional commands, e-mail: dev-h...@spark.apache.org
> >>>
> >>
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Xiangrui Meng
+1. Checked user guide and API doc, and ran some MLlib and SparkR
examples. -Xiangrui

On Thu, Sep 24, 2015 at 2:54 PM, Reynold Xin  wrote:
> I'm going to +1 this myself. Tested on my laptop.
>
>
>
> On Thu, Sep 24, 2015 at 10:56 AM, Reynold Xin  wrote:
>>
>> I forked a new thread for this. Please discuss NOTICE file related things
>> there so it doesn't hijack this thread.
>>
>>
>> On Thu, Sep 24, 2015 at 10:51 AM, Sean Owen  wrote:
>>>
>>> On Thu, Sep 24, 2015 at 6:45 PM, Richard Hillegas 
>>> wrote:
>>> > Under your guidance, I would be happy to help compile a NOTICE file
>>> > which
>>> > follows the pattern used by Derby and the JDK. This effort might
>>> > proceed in
>>> > parallel with vetting 1.5.1 and could be targeted at a later release
>>> > vehicle. I don't think that the ASF's exposure is greatly increased by
>>> > one
>>> > more release which follows the old pattern.
>>>
>>> I'd prefer to use the ASF's preferred pattern, no? That's what we've
>>> been trying to do and seems like we're even required to do so, not
>>> follow a different convention. There is some specific guidance there
>>> about what to add, and not add, to these files. Specifically, because
>>> the AL2 requires downstream projects to embed the contents of NOTICE,
>>> the guidance is to only include elements in NOTICE that must appear
>>> there.
>>>
>>> Put it this way -- what would you like to change specifically? (you
>>> can start another thread for that)
>>>
>>> >> My assessment (just looked before I saw Sean's email) is the same as
>>> >> his. The NOTICE file embeds other projects' licenses.
>>> >
>>> > This may be where our perspectives diverge. I did not find those
>>> > licenses
>>> > embedded in the NOTICE file. As I see it, the licenses are cited but
>>> > not
>>> > included.
>>>
>>> Pretty sure that was meant to say that NOTICE embeds other projects'
>>> "notices", not licenses. And those notices can have all kinds of
>>> stuff, including licenses.
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Reynold Xin
I'm going to +1 this myself. Tested on my laptop.



On Thu, Sep 24, 2015 at 10:56 AM, Reynold Xin  wrote:

> I forked a new thread for this. Please discuss NOTICE file related things
> there so it doesn't hijack this thread.
>
>
> On Thu, Sep 24, 2015 at 10:51 AM, Sean Owen  wrote:
>
>> On Thu, Sep 24, 2015 at 6:45 PM, Richard Hillegas 
>> wrote:
>> > Under your guidance, I would be happy to help compile a NOTICE file
>> which
>> > follows the pattern used by Derby and the JDK. This effort might
>> proceed in
>> > parallel with vetting 1.5.1 and could be targeted at a later release
>> > vehicle. I don't think that the ASF's exposure is greatly increased by
>> one
>> > more release which follows the old pattern.
>>
>> I'd prefer to use the ASF's preferred pattern, no? That's what we've
>> been trying to do and seems like we're even required to do so, not
>> follow a different convention. There is some specific guidance there
>> about what to add, and not add, to these files. Specifically, because
>> the AL2 requires downstream projects to embed the contents of NOTICE,
>> the guidance is to only include elements in NOTICE that must appear
>> there.
>>
>> Put it this way -- what would you like to change specifically? (you
>> can start another thread for that)
>>
>> >> My assessment (just looked before I saw Sean's email) is the same as
>> >> his. The NOTICE file embeds other projects' licenses.
>> >
>> > This may be where our perspectives diverge. I did not find those
>> licenses
>> > embedded in the NOTICE file. As I see it, the licenses are cited but not
>> > included.
>>
>> Pretty sure that was meant to say that NOTICE embeds other projects'
>> "notices", not licenses. And those notices can have all kinds of
>> stuff, including licenses.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>


Re: SparkR package path

2015-09-24 Thread Hossein
Right now in sparkR.R the backend hostname is hard-coded to "localhost" (
https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L156).

If we make that address configurable / parameterized, then a user can connect
to a remote Spark cluster without needing the Spark JARs on their local
machine. I have gotten this request from some R users. Their company has a
Spark cluster (usually managed by another team), and they want to connect to
it from their workstation (e.g., from within RStudio, etc.).



--Hossein

On Thu, Sep 24, 2015 at 12:25 PM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> I don't think the crux of the problem is about users who download the
> source -- Spark's source distribution is clearly marked as something
> that needs to be built and they can run `mvn -DskipTests -Psparkr
> package` based on instructions in the Spark docs.
>
> The crux of the problem is that with a source or binary R package, the
> client side the SparkR code needs the Spark JARs to be available. So
> we can't just connect to a remote Spark cluster using just the R
> scripts as we need the Scala classes around to create a Spark context
> etc.
>
> But this is a use case that I've heard from a lot of users -- my take
> is that this should be a separate package / layer on top of SparkR.
> Dan Putler (cc'd) had a proposal on a client package for this and
> maybe able to add more.
>
> Thanks
> Shivaram
>
> On Thu, Sep 24, 2015 at 11:36 AM, Hossein  wrote:
> > Requiring users to download entire Spark distribution to connect to a
> remote
> > cluster (which is already running Spark) seems an over kill. Even for
> most
> > spark users who download Spark source, it is very unintuitive that they
> need
> > to run a script named "install-dev.sh" before they can run SparkR.
> >
> > --Hossein
> >
> > On Wed, Sep 23, 2015 at 7:28 PM, Sun, Rui  wrote:
> >>
> >> SparkR package is not a standalone R package, as it is actually R API of
> >> Spark and needs to co-operate with a matching version of Spark, so
> exposing
> >> it in CRAN does not ease use of R users as they need to download
> matching
> >> Spark distribution, unless we expose a bundled SparkR package to CRAN
> >> (packageing with Spark), is this desirable? Actually, for normal users
> who
> >> are not developers, they are not required to download Spark source,
> build
> >> and install SparkR package. They just need to download a Spark
> distribution,
> >> and then use SparkR.
> >>
> >>
> >>
> >> For using SparkR in Rstudio, there is a documentation at
> >> https://github.com/apache/spark/tree/master/R
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> From: Hossein [mailto:fal...@gmail.com]
> >> Sent: Thursday, September 24, 2015 1:42 AM
> >> To: shiva...@eecs.berkeley.edu
> >> Cc: Sun, Rui; dev@spark.apache.org
> >> Subject: Re: SparkR package path
> >>
> >>
> >>
> >> Yes, I think exposing SparkR in CRAN can significantly expand the reach
> of
> >> both SparkR and Spark itself to a larger community of data scientists
> (and
> >> statisticians).
> >>
> >>
> >>
> >> I have been getting questions on how to use SparkR in RStudio. Most of
> >> these folks have a Spark Cluster and wish to talk to it from RStudio.
> While
> >> that is a bigger task, for now, first step could be not requiring them
> to
> >> download Spark source and run a script that is named install-dev.sh. I
> filed
> >> SPARK-10776 to track this.
> >>
> >>
> >>
> >>
> >> --Hossein
> >>
> >>
> >>
> >> On Tue, Sep 22, 2015 at 7:21 PM, Shivaram Venkataraman
> >>  wrote:
> >>
> >> As Rui says it would be good to understand the use case we want to
> >> support (supporting CRAN installs could be one for example). I don't
> >> think it should be very hard to do as the RBackend itself doesn't use
> >> the R source files. The RRDD does use it and the value comes from
> >>
> >>
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RUtils.scala#L29
> >> AFAIK -- So we could introduce a new config flag that can be used for
> >> this new mode.
> >>
> >> Thanks
> >> Shivaram
> >>
> >>
> >> On Mon, Sep 21, 2015 at 8:15 PM, Sun, Rui  wrote:
> >> > Hossein,
> >> >
> >> >
> >> >
> >> > Any strong reason to download and install SparkR source package
> >> > separately
> >> > from the Spark distribution?
> >> >
> >> > An R user can simply download the spark distribution, which contains
> >> > SparkR
> >> > source and binary package, and directly use sparkR. No need to install
> >> > SparkR package at all.
> >> >
> >> >
> >> >
> >> > From: Hossein [mailto:fal...@gmail.com]
> >> > Sent: Tuesday, September 22, 2015 9:19 AM
> >> > To: dev@spark.apache.org
> >> > Subject: SparkR package path
> >> >
> >> >
> >> >
> >> > Hi dev list,
> >> >
> >> >
> >> >
> >> > SparkR backend assumes SparkR source files are located under
> >> > "SPARK_HOME/R/lib/." This directory is created by running
> >> > R/install-dev.sh.
> >> > This setting makes sense for Spark developers, but if an R user
> >> > downloads
> >> 

Re: [Discuss] NOTICE file for transitive "NOTICE"s

2015-09-24 Thread Sean Owen
Yes, but the ASF's reading seems to be clear:
http://www.apache.org/dev/licensing-howto.html#permissive-deps
"In LICENSE, add a pointer to the dependency's license within the
source tree and a short note summarizing its licensing:"

I'd be concerned if you get a different interpretation from the ASF. I
suppose it's OK to ask the question again, but for the moment I don't
see a reason to believe there's a problem.

On Thu, Sep 24, 2015 at 9:05 PM, Richard Hillegas  wrote:
> Hi Sean,
>
> My reading would be that a separate copy of the BSD license, with copyright
> years filled in, is required for each BSD-licensed dependency. Same for
> MIT-licensed dependencies. Hopefully, we will receive some guidance on
> https://issues.apache.org/jira/browse/LEGAL-226
>
> Thanks,
> -Rick
>
>
>
> Sean Owen  wrote on 09/24/2015 12:40:12 PM:
>
>> From: Sean Owen 
>> To: Richard Hillegas/San Francisco/IBM@IBMUS
>> Cc: "dev@spark.apache.org" 
>> Date: 09/24/2015 12:40 PM
>
>
>> Subject: Re: [Discuss] NOTICE file for transitive "NOTICE"s
>>
>> Yes, the issue of where 3rd-party license information goes is
>> different, and varies by license. I think the BSD/MIT licenses are all
>> already listed in LICENSE accordingly. Let me know if you spy an
>> omission.
>>
>> On Thu, Sep 24, 2015 at 8:36 PM, Richard Hillegas 
>> wrote:
>> > Thanks for that pointer, Sean. It may be that Derby is putting the
>> > license
>> > information in the wrong place, viz. in the NOTICE file. But the 3rd
>> > party
>> > license text may need to go somewhere else. See for instance the advice
>> > a
>> > little further up the page at
>> > http://www.apache.org/dev/licensing-howto.html#permissive-deps
>> >
>> > Thanks,
>> > -Rick
>> >
>> > Sean Owen  wrote on 09/24/2015 12:07:01 PM:
>> >
>> >> From: Sean Owen 
>> >> To: Richard Hillegas/San Francisco/IBM@IBMUS
>> >> Cc: "dev@spark.apache.org" 
>> >> Date: 09/24/2015 12:08 PM
>> >> Subject: Re: [Discuss] NOTICE file for transitive "NOTICE"s
>> >
>> >
>> >>
>> >> Have a look at
>> >> http://www.apache.org/dev/licensing-howto.html#mod-notice
>> >> though, which makes a good point about limiting what goes into NOTICE
>> >> to what is required. That's what makes me think we shouldn't do this.
>> >>
>> >> On Thu, Sep 24, 2015 at 7:24 PM, Richard Hillegas 
>> >> wrote:
>> >> > To answer Sean's question on the previous email thread, I would
>> >> > propose
>> >> > making changes like the following to the NOTICE file:
>> >>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [Discuss] NOTICE file for transitive "NOTICE"s

2015-09-24 Thread Richard Hillegas

Hi Sean,

My reading would be that a separate copy of the BSD license, with copyright
years filled in, is required for each BSD-licensed dependency. Same for
MIT-licensed dependencies. Hopefully, we will receive some guidance on
https://issues.apache.org/jira/browse/LEGAL-226

Thanks,
-Rick



Sean Owen  wrote on 09/24/2015 12:40:12 PM:

> From: Sean Owen 
> To: Richard Hillegas/San Francisco/IBM@IBMUS
> Cc: "dev@spark.apache.org" 
> Date: 09/24/2015 12:40 PM
> Subject: Re: [Discuss] NOTICE file for transitive "NOTICE"s
>
> Yes, the issue of where 3rd-party license information goes is
> different, and varies by license. I think the BSD/MIT licenses are all
> already listed in LICENSE accordingly. Let me know if you spy an
> omission.
>
> On Thu, Sep 24, 2015 at 8:36 PM, Richard Hillegas 
wrote:
> > Thanks for that pointer, Sean. It may be that Derby is putting the
license
> > information in the wrong place, viz. in the NOTICE file. But the 3rd
party
> > license text may need to go somewhere else. See for instance the advice
a
> > little further up the page at
> > http://www.apache.org/dev/licensing-howto.html#permissive-deps
> >
> > Thanks,
> > -Rick
> >
> > Sean Owen  wrote on 09/24/2015 12:07:01 PM:
> >
> >> From: Sean Owen 
> >> To: Richard Hillegas/San Francisco/IBM@IBMUS
> >> Cc: "dev@spark.apache.org" 
> >> Date: 09/24/2015 12:08 PM
> >> Subject: Re: [Discuss] NOTICE file for transitive "NOTICE"s
> >
> >
> >>
> >> Have a look at
http://www.apache.org/dev/licensing-howto.html#mod-notice
> >> though, which makes a good point about limiting what goes into NOTICE
> >> to what is required. That's what makes me think we shouldn't do this.
> >>
> >> On Thu, Sep 24, 2015 at 7:24 PM, Richard Hillegas 
> >> wrote:
> >> > To answer Sean's question on the previous email thread, I would
propose
> >> > making changes like the following to the NOTICE file:
> >>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

Re: [Discuss] NOTICE file for transitive "NOTICE"s

2015-09-24 Thread Sean Owen
Yes, the issue of where 3rd-party license information goes is
different, and varies by license. I think the BSD/MIT licenses are all
already listed in LICENSE accordingly. Let me know if you spy an
omission.

On Thu, Sep 24, 2015 at 8:36 PM, Richard Hillegas  wrote:
> Thanks for that pointer, Sean. It may be that Derby is putting the license
> information in the wrong place, viz. in the NOTICE file. But the 3rd party
> license text may need to go somewhere else. See for instance the advice a
> little further up the page at
> http://www.apache.org/dev/licensing-howto.html#permissive-deps
>
> Thanks,
> -Rick
>
> Sean Owen  wrote on 09/24/2015 12:07:01 PM:
>
>> From: Sean Owen 
>> To: Richard Hillegas/San Francisco/IBM@IBMUS
>> Cc: "dev@spark.apache.org" 
>> Date: 09/24/2015 12:08 PM
>> Subject: Re: [Discuss] NOTICE file for transitive "NOTICE"s
>
>
>>
>> Have a look at http://www.apache.org/dev/licensing-howto.html#mod-notice
>> though, which makes a good point about limiting what goes into NOTICE
>> to what is required. That's what makes me think we shouldn't do this.
>>
>> On Thu, Sep 24, 2015 at 7:24 PM, Richard Hillegas 
>> wrote:
>> > To answer Sean's question on the previous email thread, I would propose
>> > making changes like the following to the NOTICE file:
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [Discuss] NOTICE file for transitive "NOTICE"s

2015-09-24 Thread Richard Hillegas

Thanks for that pointer, Sean. It may be that Derby is putting the license
information in the wrong place, viz. in the NOTICE file. But the 3rd party
license text may need to go somewhere else. See for instance the advice a
little further up the page at
http://www.apache.org/dev/licensing-howto.html#permissive-deps

Thanks,
-Rick

Sean Owen  wrote on 09/24/2015 12:07:01 PM:

> From: Sean Owen 
> To: Richard Hillegas/San Francisco/IBM@IBMUS
> Cc: "dev@spark.apache.org" 
> Date: 09/24/2015 12:08 PM
> Subject: Re: [Discuss] NOTICE file for transitive "NOTICE"s
>
> Have a look at http://www.apache.org/dev/licensing-howto.html#mod-notice
> though, which makes a good point about limiting what goes into NOTICE
> to what is required. That's what makes me think we shouldn't do this.
>
> On Thu, Sep 24, 2015 at 7:24 PM, Richard Hillegas 
wrote:
> > To answer Sean's question on the previous email thread, I would propose
> > making changes like the following to the NOTICE file:
>

Re: SparkR package path

2015-09-24 Thread Shivaram Venkataraman
I don't think the crux of the problem is about users who download the
source -- Spark's source distribution is clearly marked as something
that needs to be built and they can run `mvn -DskipTests -Psparkr
package` based on instructions in the Spark docs.

The crux of the problem is that, with a source or binary R package, the
client-side SparkR code needs the Spark JARs to be available. So we can't
just connect to a remote Spark cluster using only the R scripts, as we need
the Scala classes around to create a Spark context, etc.

But this is a use case that I've heard from a lot of users -- my take
is that this should be a separate package / layer on top of SparkR.
Dan Putler (cc'd) had a proposal for a client package for this and may be
able to add more.

Thanks
Shivaram

On Thu, Sep 24, 2015 at 11:36 AM, Hossein  wrote:
> Requiring users to download the entire Spark distribution to connect to a
> remote cluster (which is already running Spark) seems like overkill. Even for
> most Spark users who download the Spark source, it is very unintuitive that
> they need to run a script named "install-dev.sh" before they can run SparkR.
>
> --Hossein
>
> On Wed, Sep 23, 2015 at 7:28 PM, Sun, Rui  wrote:
>>
>> The SparkR package is not a standalone R package; it is actually the R API
>> of Spark and needs to co-operate with a matching version of Spark. So
>> exposing it on CRAN does not make things easier for R users, since they
>> would still need to download a matching Spark distribution, unless we
>> publish a bundled SparkR package to CRAN (packaged with Spark); is that
>> desirable? Actually, normal users who are not developers are not required
>> to download the Spark source, build, and install the SparkR package. They
>> just need to download a Spark distribution, and then use SparkR.
>>
>>
>>
>> For using SparkR in Rstudio, there is a documentation at
>> https://github.com/apache/spark/tree/master/R
>>
>>
>>
>>
>>
>>
>>
>> From: Hossein [mailto:fal...@gmail.com]
>> Sent: Thursday, September 24, 2015 1:42 AM
>> To: shiva...@eecs.berkeley.edu
>> Cc: Sun, Rui; dev@spark.apache.org
>> Subject: Re: SparkR package path
>>
>>
>>
>> Yes, I think exposing SparkR in CRAN can significantly expand the reach of
>> both SparkR and Spark itself to a larger community of data scientists (and
>> statisticians).
>>
>>
>>
>> I have been getting questions on how to use SparkR in RStudio. Most of
>> these folks have a Spark Cluster and wish to talk to it from RStudio. While
>> that is a bigger task, for now, first step could be not requiring them to
>> download Spark source and run a script that is named install-dev.sh. I filed
>> SPARK-10776 to track this.
>>
>>
>>
>>
>> --Hossein
>>
>>
>>
>> On Tue, Sep 22, 2015 at 7:21 PM, Shivaram Venkataraman
>>  wrote:
>>
>> As Rui says it would be good to understand the use case we want to
>> support (supporting CRAN installs could be one for example). I don't
>> think it should be very hard to do as the RBackend itself doesn't use
>> the R source files. The RRDD does use it and the value comes from
>>
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RUtils.scala#L29
>> AFAIK -- So we could introduce a new config flag that can be used for
>> this new mode.
>>
>> Thanks
>> Shivaram
>>
>>
>> On Mon, Sep 21, 2015 at 8:15 PM, Sun, Rui  wrote:
>> > Hossein,
>> >
>> >
>> >
>> > Any strong reason to download and install SparkR source package
>> > separately
>> > from the Spark distribution?
>> >
>> > An R user can simply download the spark distribution, which contains
>> > SparkR
>> > source and binary package, and directly use sparkR. No need to install
>> > SparkR package at all.
>> >
>> >
>> >
>> > From: Hossein [mailto:fal...@gmail.com]
>> > Sent: Tuesday, September 22, 2015 9:19 AM
>> > To: dev@spark.apache.org
>> > Subject: SparkR package path
>> >
>> >
>> >
>> > Hi dev list,
>> >
>> >
>> >
>> > SparkR backend assumes SparkR source files are located under
>> > "SPARK_HOME/R/lib/." This directory is created by running
>> > R/install-dev.sh.
>> > This setting makes sense for Spark developers, but if an R user
>> > downloads and installs the SparkR source package, the source files are
>> > going to be placed in different locations.
>> >
>> >
>> >
>> > In the R runtime it is easy to find the location of package files using
>> > path.package("SparkR"). But we need to make some changes to the R backend
>> > and/or spark-submit so that the JVM process learns the location of
>> > worker.R, daemon.R, and shell.R from the R runtime.
>> >
>> >
>> >
>> > Do you think this change is feasible?
>> >
>> >
>> >
>> > Thanks,
>> >
>> > --Hossein
>>
>>
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
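
As a rough illustration of the config-flag idea Shivaram mentions above, the JVM
side could resolve the SparkR script directory from an explicit setting before
falling back to the current SPARK_HOME/R/lib convention. This is only a sketch;
the key name "spark.r.lib.dir" is hypothetical and not an existing Spark setting.

import java.io.File
import org.apache.spark.SparkConf

// Sketch only: prefer a hypothetical "spark.r.lib.dir" setting, then fall back
// to the SPARK_HOME/R/lib layout produced by R/install-dev.sh.
def resolveSparkRLibDir(conf: SparkConf): File = {
  val fromConf = conf.getOption("spark.r.lib.dir")
  val fromEnv  = sys.env.get("SPARK_HOME").map(home => s"$home/R/lib")
  new File(fromConf.orElse(fromEnv).getOrElse("R/lib"))
}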



Re: [Discuss] NOTICE file for transitive "NOTICE"s

2015-09-24 Thread Sean Owen
Have a look at http://www.apache.org/dev/licensing-howto.html#mod-notice
though, which makes a good point about limiting what goes into NOTICE
to what is required. That's what makes me think we shouldn't do this.

On Thu, Sep 24, 2015 at 7:24 PM, Richard Hillegas  wrote:
> To answer Sean's question on the previous email thread, I would propose
> making changes like the following to the NOTICE file:

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: SparkR package path

2015-09-24 Thread Hossein
Requiring users to download the entire Spark distribution to connect to a
remote cluster (which is already running Spark) seems like overkill. Even for
most Spark users who download the Spark source, it is very unintuitive that
they need to run a script named "install-dev.sh" before they can run SparkR.

--Hossein

On Wed, Sep 23, 2015 at 7:28 PM, Sun, Rui  wrote:

> The SparkR package is not a standalone R package; it is actually the R API
> of Spark and needs to co-operate with a matching version of Spark. So
> exposing it on CRAN does not make things easier for R users, since they
> would still need to download a matching Spark distribution, unless we
> publish a bundled SparkR package to CRAN (packaged with Spark); is that
> desirable? Actually, normal users who are not developers are not required
> to download the Spark source, build, and install the SparkR package. They
> just need to download a Spark distribution, and then use SparkR.
>
>
>
> For using SparkR in Rstudio, there is a documentation at
> https://github.com/apache/spark/tree/master/R
>
>
>
>
>
>
>
> *From:* Hossein [mailto:fal...@gmail.com]
> *Sent:* Thursday, September 24, 2015 1:42 AM
> *To:* shiva...@eecs.berkeley.edu
> *Cc:* Sun, Rui; dev@spark.apache.org
> *Subject:* Re: SparkR package path
>
>
>
> Yes, I think exposing SparkR in CRAN can significantly expand the reach of
> both SparkR and Spark itself to a larger community of data scientists (and
> statisticians).
>
>
>
> I have been getting questions on how to use SparkR in RStudio. Most of
> these folks have a Spark Cluster and wish to talk to it from RStudio. While
> that is a bigger task, for now, first step could be not requiring them to
> download Spark source and run a script that is named install-dev.sh. I
> filed SPARK-10776 to track this.
>
>
>
>
> --Hossein
>
>
>
> On Tue, Sep 22, 2015 at 7:21 PM, Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
> As Rui says it would be good to understand the use case we want to
> support (supporting CRAN installs could be one for example). I don't
> think it should be very hard to do as the RBackend itself doesn't use
> the R source files. The RRDD does use it and the value comes from
>
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RUtils.scala#L29
> AFAIK -- So we could introduce a new config flag that can be used for
> this new mode.
>
> Thanks
> Shivaram
>
>
> On Mon, Sep 21, 2015 at 8:15 PM, Sun, Rui  wrote:
> > Hossein,
> >
> >
> >
> > Any strong reason to download and install SparkR source package
> separately
> > from the Spark distribution?
> >
> > An R user can simply download the spark distribution, which contains
> SparkR
> > source and binary package, and directly use sparkR. No need to install
> > SparkR package at all.
> >
> >
> >
> > From: Hossein [mailto:fal...@gmail.com]
> > Sent: Tuesday, September 22, 2015 9:19 AM
> > To: dev@spark.apache.org
> > Subject: SparkR package path
> >
> >
> >
> > Hi dev list,
> >
> >
> >
> > SparkR backend assumes SparkR source files are located under
> > "SPARK_HOME/R/lib/." This directory is created by running
> R/install-dev.sh.
> > This setting makes sense for Spark developers, but if an R user downloads
> > and installs the SparkR source package, the source files are going to be
> > placed in different locations.
> >
> >
> >
> > In the R runtime it is easy to find the location of package files using
> > path.package("SparkR"). But we need to make some changes to the R backend
> > and/or spark-submit so that the JVM process learns the location of
> > worker.R, daemon.R, and shell.R from the R runtime.
> >
> >
> >
> > Do you think this change is feasible?
> >
> >
> >
> > Thanks,
> >
> > --Hossein
>
>
>


Re: [Discuss] NOTICE file for transitive "NOTICE"s

2015-09-24 Thread Richard Hillegas

Thanks for forking the new email thread, Reynold. It is entirely possible
that I am being overly skittish. I have posed a question for our legal
experts: https://issues.apache.org/jira/browse/LEGAL-226

To answer Sean's question on the previous email thread, I would propose
making changes like the following to the NOTICE file:

Replace a stanza like this...

"This product contains a modified version of 'JZlib', a re-implementation
of
zlib in pure Java, which can be obtained at:

  * LICENSE:
* license/LICENSE.jzlib.txt (BSD Style License)
  * HOMEPAGE:
* http://www.jcraft.com/jzlib/";

...with full license text like this

"This product contains a modified version of 'JZlib', a re-implementation
of
zlib in pure Java, which can be obtained at:

  * HOMEPAGE:
* http://www.jcraft.com/jzlib/

The ZLIB license text follows:

JZlib 0.0.* were released under the GNU LGPL license.  Later, we have
switched
over to a BSD-style license.

--
Copyright (c) 2000-2011 ymnk, JCraft,Inc. All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice,
 this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
 notice, this list of conditions and the following disclaimer in
 the documentation and/or other materials provided with the
distribution.

  3. The names of the authors may not be used to endorse or promote
products
 derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
WARRANTIES,
INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
AND
FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL JCRAFT,
INC. OR ANY CONTRIBUTORS TO THIS SOFTWARE BE LIABLE FOR ANY DIRECT,
INDIRECT,
INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA,
OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE,
EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE."

Thanks,
-Rick



Reynold Xin  wrote on 09/24/2015 10:55:53 AM:

> From: Reynold Xin 
> To: Sean Owen 
> Cc: Richard Hillegas/San Francisco/IBM@IBMUS, "dev@spark.apache.org"
> 
> Date: 09/24/2015 10:56 AM
> Subject: [Discuss] NOTICE file for transitive "NOTICE"s
>
> Richard,
>
> Thanks for bringing this up and this is a great point. Let's start
> another thread for it so we don't hijack the release thread.
>
> On Thu, Sep 24, 2015 at 10:51 AM, Sean Owen  wrote:
> On Thu, Sep 24, 2015 at 6:45 PM, Richard Hillegas 
wrote:
> > Under your guidance, I would be happy to help compile a NOTICE file
which
> > follows the pattern used by Derby and the JDK. This effort might
proceed in
> > parallel with vetting 1.5.1 and could be targeted at a later release
> > vehicle. I don't think that the ASF's exposure is greatly increased by
one
> > more release which follows the old pattern.
>
> I'd prefer to use the ASF's preferred pattern, no? That's what we've
> been trying to do and seems like we're even required to do so, not
> follow a different convention. There is some specific guidance there
> about what to add, and not add, to these files. Specifically, because
> the AL2 requires downstream projects to embed the contents of NOTICE,
> the guidance is to only include elements in NOTICE that must appear
> there.
>
> Put it this way -- what would you like to change specifically? (you
> can start another thread for that)
>
> >> My assessment (just looked before I saw Sean's email) is the same as
> >> his. The NOTICE file embeds other projects' licenses.
> >
> > This may be where our perspectives diverge. I did not find those
licenses
> > embedded in the NOTICE file. As I see it, the licenses are cited but
not
> > included.
>
> Pretty sure that was meant to say that NOTICE embeds other projects'
> "notices", not licenses. And those notices can have all kinds of
> stuff, including licenses.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org

Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Reynold Xin
I forked a new thread for this. Please discuss NOTICE file related things
there so it doesn't hijack this thread.


On Thu, Sep 24, 2015 at 10:51 AM, Sean Owen  wrote:

> On Thu, Sep 24, 2015 at 6:45 PM, Richard Hillegas 
> wrote:
> > Under your guidance, I would be happy to help compile a NOTICE file which
> > follows the pattern used by Derby and the JDK. This effort might proceed
> in
> > parallel with vetting 1.5.1 and could be targeted at a later release
> > vehicle. I don't think that the ASF's exposure is greatly increased by
> one
> > more release which follows the old pattern.
>
> I'd prefer to use the ASF's preferred pattern, no? That's what we've
> been trying to do and seems like we're even required to do so, not
> follow a different convention. There is some specific guidance there
> about what to add, and not add, to these files. Specifically, because
> the AL2 requires downstream projects to embed the contents of NOTICE,
> the guidance is to only include elements in NOTICE that must appear
> there.
>
> Put it this way -- what would you like to change specifically? (you
> can start another thread for that)
>
> >> My assessment (just looked before I saw Sean's email) is the same as
> >> his. The NOTICE file embeds other projects' licenses.
> >
> > This may be where our perspectives diverge. I did not find those licenses
> > embedded in the NOTICE file. As I see it, the licenses are cited but not
> > included.
>
> Pretty sure that was meant to say that NOTICE embeds other projects'
> "notices", not licenses. And those notices can have all kinds of
> stuff, including licenses.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


[Discuss] NOTICE file for transitive "NOTICE"s

2015-09-24 Thread Reynold Xin
Richard,

Thanks for bringing this up and this is a great point. Let's start another
thread for it so we don't hijack the release thread.



On Thu, Sep 24, 2015 at 10:51 AM, Sean Owen  wrote:

> On Thu, Sep 24, 2015 at 6:45 PM, Richard Hillegas 
> wrote:
> > Under your guidance, I would be happy to help compile a NOTICE file which
> > follows the pattern used by Derby and the JDK. This effort might proceed
> in
> > parallel with vetting 1.5.1 and could be targeted at a later release
> > vehicle. I don't think that the ASF's exposure is greatly increased by
> one
> > more release which follows the old pattern.
>
> I'd prefer to use the ASF's preferred pattern, no? That's what we've
> been trying to do and seems like we're even required to do so, not
> follow a different convention. There is some specific guidance there
> about what to add, and not add, to these files. Specifically, because
> the AL2 requires downstream projects to embed the contents of NOTICE,
> the guidance is to only include elements in NOTICE that must appear
> there.
>
> Put it this way -- what would you like to change specifically? (you
> can start another thread for that)
>
> >> My assessment (just looked before I saw Sean's email) is the same as
> >> his. The NOTICE file embeds other projects' licenses.
> >
> > This may be where our perspectives diverge. I did not find those licenses
> > embedded in the NOTICE file. As I see it, the licenses are cited but not
> > included.
>
> Pretty sure that was meant to say that NOTICE embeds other projects'
> "notices", not licenses. And those notices can have all kinds of
> stuff, including licenses.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Sean Owen
On Thu, Sep 24, 2015 at 6:45 PM, Richard Hillegas  wrote:
> Under your guidance, I would be happy to help compile a NOTICE file which
> follows the pattern used by Derby and the JDK. This effort might proceed in
> parallel with vetting 1.5.1 and could be targeted at a later release
> vehicle. I don't think that the ASF's exposure is greatly increased by one
> more release which follows the old pattern.

I'd prefer to use the ASF's preferred pattern, no? That's what we've
been trying to do and seems like we're even required to do so, not
follow a different convention. There is some specific guidance there
about what to add, and not add, to these files. Specifically, because
the AL2 requires downstream projects to embed the contents of NOTICE,
the guidance is to only include elements in NOTICE that must appear
there.

Put it this way -- what would you like to change specifically? (you
can start another thread for that)

>> My assessment (just looked before I saw Sean's email) is the same as
>> his. The NOTICE file embeds other projects' licenses.
>
> This may be where our perspectives diverge. I did not find those licenses
> embedded in the NOTICE file. As I see it, the licenses are cited but not
> included.

Pretty sure that was meant to say that NOTICE embeds other projects'
"notices", not licenses. And those notices can have all kinds of
stuff, including licenses.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Richard Hillegas

Hi Sean and Wendell,

I share your concerns about how difficult and important it is to get this
right. I think that the Spark community has compiled a very readable and
well organized NOTICE file. A lot of careful thought went into gathering
together 3rd party projects which share the same license text.

All I can offer is my own experience of having served as a release manager
for a sister Apache project (Derby) over the past ten years. The Derby
NOTICE file recites 3rd party licenses verbatim. This is also the approach
taken by the THIRDPARTYLICENSEREADME.txt in the JDK. I am not a lawyer.
However, I have great respect for the experience and legal sensitivities of
the people who compile that JDK license file.

Under your guidance, I would be happy to help compile a NOTICE file which
follows the pattern used by Derby and the JDK. This effort might proceed in
parallel with vetting 1.5.1 and could be targeted at a later release
vehicle. I don't think that the ASF's exposure is greatly increased by one
more release which follows the old pattern.

Another comment inline...

Patrick Wendell  wrote on 09/24/2015 10:24:25 AM:

> From: Patrick Wendell 
> To: Sean Owen 
> Cc: Richard Hillegas/San Francisco/IBM@IBMUS, "dev@spark.apache.org"
> 
> Date: 09/24/2015 10:24 AM
> Subject: Re: [VOTE] Release Apache Spark 1.5.1 (RC1)
>
> Hey Richard,
>
> My assessment (just looked before I saw Sean's email) is the same as
> his. The NOTICE file embeds other projects' licenses.

This may be where our perspectives diverge. I did not find those licenses
embedded in the NOTICE file. As I see it, the licenses are cited but not
included.

Thanks,
-Rick


> If those
> licenses themselves have pointers to other files or dependencies, we
> don't embed them. I think this is standard practice.
>
> - Patrick
>
> On Thu, Sep 24, 2015 at 10:00 AM, Sean Owen  wrote:
> > Hi Richard, those are messages reproduced from other projects' NOTICE
> > files, not created by Spark. They need to be reproduced in Spark's
> > NOTICE file to comply with the license, but their text may or may not
> > apply to Spark's distribution. The intent is that users would track
> > this back to the source project if interested to investigate what the
> > upstream notice is about.
> >
> > Requirements vary by license, but I do not believe there is additional
> > requirement to reproduce these other files. Their license information
> > is already indicated in accordance with the license terms.
> >
> > What licenses are you looking for in LICENSE that you believe
> should be there?
> >
> > Getting all this right is both difficult and important. I've made some
> > efforts over time to strictly comply with the Apache take on
> > licensing, which is at http://www.apache.org/legal/resolved.html  It's
> > entirely possible there's still a mistake somewhere in here (possibly
> > a new dependency, etc). Please point it out if you see such a thing.
> >
> > But so far what you describe is "working as intended", as far as I
> > know, according to Apache.
> >
> >
> > On Thu, Sep 24, 2015 at 5:52 PM, Richard Hillegas
>  wrote:
> >> -1 (non-binding)
> >>
> >> I was able to build Spark cleanly from the source distribution using
the
> >> command in README.md:
> >>
> >> build/mvn -DskipTests clean package
> >>
> >> However, while I was waiting for the build to complete, I started
going
> >> through the NOTICE file. I was confused about where to find
> licenses for 3rd
> >> party software bundled with Spark. About halfway through the NOTICE
file,
> >> starting with Java Collections Framework, there is a list of
> licenses of the
> >> form
> >>
> >>license/*.txt
> >>
> >> But there is no license subdirectory in the source distro. I couldn't
find
> >> the  *.txt license files for Java Collections Framework, Base64
Encoder, or
> >> JZlib anywhere in the source distro. I couldn't find those files in
license
> >> subdirectories at the indicated home pages for those projects. (I did
find
> >> the license for JZLIB somewhere else, however:
> >> http://www.jcraft.com/jzlib/LICENSE.txt.)
> >>
> >> In addition, I couldn't find licenses for those projects in the master
> >> LICENSE file.
> >>
> >> Are users supposed to get licenses from the indicated 3rd party web
sites?
> >> Those online licenses could change. I would feel more comfortable if
the ASF
> >> were protected by our bundling the licenses inside our source distros.
> >>
> >> After looking for those three licenses, I stopped reading the NOTICE
file.
> >> Maybe I'm confused about how to read the NOTICE file. Where should
users
> >> expect to find the 3rd party licenses?
> >>
> >> Thanks,
> >> -Rick
> >>
> >> Reynold Xin  wrote on 09/24/2015 12:27:25 AM:
> >>
> >>> From: Reynold Xin 
> >>> To: "dev@spark.apache.org" 
> >>> Date: 09/24/2015 12:28 AM
> >>> Subject: [VOTE] Release Apache Spark 1.5.1 (RC1)
> >>
> >>
> >>>
> >>> Please vote on releasing the following candidate as Apache Spark
> >>> version 1.5.1. The vote is open unti

Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Patrick Wendell
Hey Richard,

My assessment (just looked before I saw Sean's email) is the same as
his. The NOTICE file embeds other projects' licenses. If those
licenses themselves have pointers to other files or dependencies, we
don't embed them. I think this is standard practice.

- Patrick

On Thu, Sep 24, 2015 at 10:00 AM, Sean Owen  wrote:
> Hi Richard, those are messages reproduced from other projects' NOTICE
> files, not created by Spark. They need to be reproduced in Spark's
> NOTICE file to comply with the license, but their text may or may not
> apply to Spark's distribution. The intent is that users would track
> this back to the source project if interested to investigate what the
> upstream notice is about.
>
> Requirements vary by license, but I do not believe there is additional
> requirement to reproduce these other files. Their license information
> is already indicated in accordance with the license terms.
>
> What licenses are you looking for in LICENSE that you believe should be there?
>
> Getting all this right is both difficult and important. I've made some
> efforts over time to strictly comply with the Apache take on
> licensing, which is at http://www.apache.org/legal/resolved.html  It's
> entirely possible there's still a mistake somewhere in here (possibly
> a new dependency, etc). Please point it out if you see such a thing.
>
> But so far what you describe is "working as intended", as far as I
> know, according to Apache.
>
>
> On Thu, Sep 24, 2015 at 5:52 PM, Richard Hillegas  wrote:
>> -1 (non-binding)
>>
>> I was able to build Spark cleanly from the source distribution using the
>> command in README.md:
>>
>> build/mvn -DskipTests clean package
>>
>> However, while I was waiting for the build to complete, I started going
>> through the NOTICE file. I was confused about where to find licenses for 3rd
>> party software bundled with Spark. About halfway through the NOTICE file,
>> starting with Java Collections Framework, there is a list of licenses of the
>> form
>>
>>license/*.txt
>>
>> But there is no license subdirectory in the source distro. I couldn't find
>> the  *.txt license files for Java Collections Framework, Base64 Encoder, or
>> JZlib anywhere in the source distro. I couldn't find those files in license
>> subdirectories at the indicated home pages for those projects. (I did find
>> the license for JZLIB somewhere else, however:
>> http://www.jcraft.com/jzlib/LICENSE.txt.)
>>
>> In addition, I couldn't find licenses for those projects in the master
>> LICENSE file.
>>
>> Are users supposed to get licenses from the indicated 3rd party web sites?
>> Those online licenses could change. I would feel more comfortable if the ASF
>> were protected by our bundling the licenses inside our source distros.
>>
>> After looking for those three licenses, I stopped reading the NOTICE file.
>> Maybe I'm confused about how to read the NOTICE file. Where should users
>> expect to find the 3rd party licenses?
>>
>> Thanks,
>> -Rick
>>
>> Reynold Xin  wrote on 09/24/2015 12:27:25 AM:
>>
>>> From: Reynold Xin 
>>> To: "dev@spark.apache.org" 
>>> Date: 09/24/2015 12:28 AM
>>> Subject: [VOTE] Release Apache Spark 1.5.1 (RC1)
>>
>>
>>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 1.5.1. The vote is open until Sun, Sep 27, 2015 at 10:00 UTC
>>> and passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 1.5.1
>>> [ ] -1 Do not release this package because ...
>>>
>>> The release fixes 81 known issues in Spark 1.5.0, listed here:
>>> http://s.apache.org/spark-1.5.1
>>>
>>> The tag to be voted on is v1.5.1-rc1:
>>> https://github.com/apache/spark/commit/4df97937dbf68a9868de58408b9be0bf87dbbb94
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release (1.5.1) can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1148/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-docs/
>>>
>>> ===
>>> How can I help test this release?
>>> ===
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate,
>>> then reporting any regressions.
>>>
>>> 
>>> What justifies a -1 vote for this release?
>>> 
>>> -1 vote should occur for regressions from Spark 1.5.0. Bugs already
>>> present in 1.5.0 will not block this release.
>>>
>>> ===
>>

Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Sean Owen
Hi Richard, those are messages reproduced from other projects' NOTICE
files, not created by Spark. They need to be reproduced in Spark's
NOTICE file to comply with the license, but their text may or may not
apply to Spark's distribution. The intent is that users would track
this back to the source project if interested to investigate what the
upstream notice is about.

Requirements vary by license, but I do not believe there is additional
requirement to reproduce these other files. Their license information
is already indicated in accordance with the license terms.

What licenses are you looking for in LICENSE that you believe should be there?

Getting all this right is both difficult and important. I've made some
efforts over time to strictly comply with the Apache take on
licensing, which is at http://www.apache.org/legal/resolved.html  It's
entirely possible there's still a mistake somewhere in here (possibly
a new dependency, etc). Please point it out if you see such a thing.

But so far what you describe is "working as intended", as far as I
know, according to Apache.


On Thu, Sep 24, 2015 at 5:52 PM, Richard Hillegas  wrote:
> -1 (non-binding)
>
> I was able to build Spark cleanly from the source distribution using the
> command in README.md:
>
> build/mvn -DskipTests clean package
>
> However, while I was waiting for the build to complete, I started going
> through the NOTICE file. I was confused about where to find licenses for 3rd
> party software bundled with Spark. About halfway through the NOTICE file,
> starting with Java Collections Framework, there is a list of licenses of the
> form
>
>license/*.txt
>
> But there is no license subdirectory in the source distro. I couldn't find
> the  *.txt license files for Java Collections Framework, Base64 Encoder, or
> JZlib anywhere in the source distro. I couldn't find those files in license
> subdirectories at the indicated home pages for those projects. (I did find
> the license for JZLIB somewhere else, however:
> http://www.jcraft.com/jzlib/LICENSE.txt.)
>
> In addition, I couldn't find licenses for those projects in the master
> LICENSE file.
>
> Are users supposed to get licenses from the indicated 3rd party web sites?
> Those online licenses could change. I would feel more comfortable if the ASF
> were protected by our bundling the licenses inside our source distros.
>
> After looking for those three licenses, I stopped reading the NOTICE file.
> Maybe I'm confused about how to read the NOTICE file. Where should users
> expect to find the 3rd party licenses?
>
> Thanks,
> -Rick
>
> Reynold Xin  wrote on 09/24/2015 12:27:25 AM:
>
>> From: Reynold Xin 
>> To: "dev@spark.apache.org" 
>> Date: 09/24/2015 12:28 AM
>> Subject: [VOTE] Release Apache Spark 1.5.1 (RC1)
>
>
>>
>> Please vote on releasing the following candidate as Apache Spark
>> version 1.5.1. The vote is open until Sun, Sep 27, 2015 at 10:00 UTC
>> and passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.5.1
>> [ ] -1 Do not release this package because ...
>>
>> The release fixes 81 known issues in Spark 1.5.0, listed here:
>> http://s.apache.org/spark-1.5.1
>>
>> The tag to be voted on is v1.5.1-rc1:
>> https://github.com/apache/spark/commit/4df97937dbf68a9868de58408b9be0bf87dbbb94
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release (1.5.1) can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1148/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-docs/
>>
>> ===
>> How can I help test this release?
>> ===
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate,
>> then reporting any regressions.
>>
>> 
>> What justifies a -1 vote for this release?
>> 
>> -1 vote should occur for regressions from Spark 1.5.0. Bugs already
>> present in 1.5.0 will not block this release.
>>
>> ===
>> What should happen to JIRA tickets still targeting 1.5.1?
>> ===
>> Please target 1.5.2 or 1.6.0.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Richard Hillegas

-1 (non-binding)

I was able to build Spark cleanly from the source distribution using the
command in README.md:

build/mvn -DskipTests clean package

However, while I was waiting for the build to complete, I started going
through the NOTICE file. I was confused about where to find licenses for
3rd party software bundled with Spark. About halfway through the NOTICE
file, starting with Java Collections Framework, there is a list of licenses
of the form

   license/*.txt

But there is no license subdirectory in the source distro. I couldn't find
the  *.txt license files for Java Collections Framework, Base64 Encoder, or
JZlib anywhere in the source distro. I couldn't find those files in license
subdirectories at the indicated home pages for those projects. (I did find
the license for JZLIB somewhere else, however:
http://www.jcraft.com/jzlib/LICENSE.txt.)

In addition, I couldn't find licenses for those projects in the master
LICENSE file.

Are users supposed to get licenses from the indicated 3rd party web sites?
Those online licenses could change. I would feel more comfortable if the
ASF were protected by our bundling the licenses inside our source distros.

After looking for those three licenses, I stopped reading the NOTICE file.
Maybe I'm confused about how to read the NOTICE file. Where should users
expect to find the 3rd party licenses?

Thanks,
-Rick

Reynold Xin  wrote on 09/24/2015 12:27:25 AM:

> From: Reynold Xin 
> To: "dev@spark.apache.org" 
> Date: 09/24/2015 12:28 AM
> Subject: [VOTE] Release Apache Spark 1.5.1 (RC1)
>
> Please vote on releasing the following candidate as Apache Spark
> version 1.5.1. The vote is open until Sun, Sep 27, 2015 at 10:00 UTC
> and passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.5.1
> [ ] -1 Do not release this package because ...
>
> The release fixes 81 known issues in Spark 1.5.0, listed here:
> http://s.apache.org/spark-1.5.1
>
> The tag to be voted on is v1.5.1-rc1:
> https://github.com/apache/spark/commit/4df97937dbf68a9868de58408b9be0bf87dbbb94
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release (1.5.1) can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1148/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-docs/
>
> ===
> How can I help test this release?
> ===
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate,
> then reporting any regressions.
>
> 
> What justifies a -1 vote for this release?
> 
> -1 vote should occur for regressions from Spark 1.5.0. Bugs already
> present in 1.5.0 will not block this release.
>
> ===
> What should happen to JIRA tickets still targeting 1.5.1?
> ===
> Please target 1.5.2 or 1.6.0.

Re: JENKINS: downtime next week, wed and thurs mornings (9-23 and 9-24)

2015-09-24 Thread shane knapp
...and we're finished and now building!

On Thu, Sep 24, 2015 at 7:19 AM, shane knapp  wrote:
> this is happening now.
>
> On Tue, Sep 22, 2015 at 10:07 AM, shane knapp  wrote:
>> ok, here's the updated downtime schedule for this week:
>>
>> wednesday, sept 23rd:
>>
>> firewall maintenance cancelled, as jon took care of the update
>> saturday morning while we were bringing jenkins back up after the colo
>> fire
>>
>> thursday, sept 24th:
>>
>> jenkins maintenance is still scheduled, but abbreviated as some of the
>> maintenance was performed saturday morning as well
>> * new builds will stop being accepted ~630am PDT
>>   - i'll kill any hangers-on at 730am, and after maintenance is done,
>> i will retrigger any killed jobs
>> * jenkins worker system package updates
>>   - amp-jenkins-master was completed on saturday
>>   - this will NOT include kernel updates as moving to
>> 2.6.32-573.3.1.el6 bricked amp-jenkins-master
>> * moving default system java for builds from jdk1.7.0_71 to jdk1.7.0_79
>> * all systems get a reboot
>> * expected downtime:  3.5 hours or so
>>
>> i'll post updates as i progress.
>>
>> also, i'll post a copy of our post-mortem once the dust settles.  it's
>> been, shall we say, a pretty crazy few days.
>>
>> http://news.berkeley.edu/2015/09/19/campus-network-outage/
>>
>> :)
>>
>> On Mon, Sep 21, 2015 at 10:11 AM, shane knapp  wrote:
>>> quick update:  we actually did some of the maintenance on our systems
>>> after the berkeley-wide outage caused by one of our (non-jenkins)
>>> servers halting and catching fire.
>>>
>>> we'll still have some downtime early wednesday, but tomorrow's will be
>>> cancelled.  i'll send out another update real soon now with what we'll
>>> be covering on wednesday once we get our current situation more under
>>> control.  :)
>>>
>>> On Wed, Sep 16, 2015 at 12:15 PM, shane knapp  wrote:
> 630am-10am thursday, 9-24-15:
> * jenkins update to 1.629 (we're a few months behind in versions, and
> some big bugs have been fixed)
> * jenkins master and worker system package updates
> * all systems get a reboot (lots of hanging java processes have been
> building up over the months)
> * builds will stop being accepted ~630am, and i'll kill any hangers-on
> at 730am, and retrigger once we're done
> * expected downtime:  3.5 hours or so
> * i will also be testing out some of my shiny new ansible playbooks
> for the system updates!
>
 i forgot one thing:

 * moving default system java for builds from jdk1.7.0_71 to jdk1.7.0_79

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: JENKINS: downtime next week, wed and thurs mornings (9-23 and 9-24)

2015-09-24 Thread shane knapp
this is happening now.

On Tue, Sep 22, 2015 at 10:07 AM, shane knapp  wrote:
> ok, here's the updated downtime schedule for this week:
>
> wednesday, sept 23rd:
>
> firewall maintenance cancelled, as jon took care of the update
> saturday morning while we were bringing jenkins back up after the colo
> fire
>
> thursday, sept 24th:
>
> jenkins maintenance is still scheduled, but abbreviated as some of the
> maintenance was performed saturday morning as well
> * new builds will stop being accepted ~630am PDT
>   - i'll kill any hangers-on at 730am, and after maintenance is done,
> i will retrigger any killed jobs
> * jenkins worker system package updates
>   - amp-jenkins-master was completed on saturday
>   - this will NOT include kernel updates as moving to
> 2.6.32-573.3.1.el6 bricked amp-jenkins-master
> * moving default system java for builds from jdk1.7.0_71 to jdk1.7.0_79
> * all systems get a reboot
> * expected downtime:  3.5 hours or so
>
> i'll post updates as i progress.
>
> also, i'll post a copy of our post-mortem once the dust settles.  it's
> been, shall we say, a pretty crazy few days.
>
> http://news.berkeley.edu/2015/09/19/campus-network-outage/
>
> :)
>
> On Mon, Sep 21, 2015 at 10:11 AM, shane knapp  wrote:
>> quick update:  we actually did some of the maintenance on our systems
>> after the berkeley-wide outage caused by one of our (non-jenkins)
>> servers halting and catching fire.
>>
>> we'll still have some downtime early wednesday, but tomorrow's will be
>> cancelled.  i'll send out another update real soon now with what we'll
>> be covering on wednesday once we get our current situation more under
>> control.  :)
>>
>> On Wed, Sep 16, 2015 at 12:15 PM, shane knapp  wrote:
 630am-10am thursday, 9-24-15:
 * jenkins update to 1.629 (we're a few months behind in versions, and
 some big bugs have been fixed)
 * jenkins master and worker system package updates
 * all systems get a reboot (lots of hanging java processes have been
 building up over the months)
 * builds will stop being accepted ~630am, and i'll kill any hangers-on
 at 730am, and retrigger once we're done
 * expected downtime:  3.5 hours or so
 * i will also be testing out some of my shiny new ansible playbooks
 for the system updates!

>>> i forgot one thing:
>>>
>>> * moving default system java for builds from jdk1.7.0_71 to jdk1.7.0_79

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Get only updated RDDs from or after updateStateBykey

2015-09-24 Thread Bin Wang
Thanks, that seems good, though a little hacky.

And here is another question: updateStateByKey computes over all the data from
the beginning, but in many situations we only need to update the newly arrived
data. This could be a big improvement in speed and resource usage. Will this be
supported in the future?

On Thu, Sep 24, 2015 at 6:01 PM, Shixiong Zhu wrote:

> You can create connection like this:
>
> val updateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])])
> => {
>   val dbConnection = create a db connection
>   iterator.flatMap { case (key, values, stateOption) =>
> if (values.isEmpty) {
>   // don't access database
> } else {
>   // update to new state and save to database
> }
> // return new state
>   }
>   TaskContext.get().addTaskCompletionListener(_ => db.disconnect())
> }
>
>
> Best Regards,
> Shixiong Zhu
>
> 2015-09-24 17:42 GMT+08:00 Bin Wang :
>
>> It seems like a workaround. But I don't know how to get the database
>> connection on the worker nodes.
>>
>> On Thu, Sep 24, 2015 at 5:37 PM, Shixiong Zhu wrote:
>>
>>> Could you write your update func like this?
>>>
>>> val updateFunc = (iterator: Iterator[(String, Seq[Int],
>>> Option[Int])]) => {
>>>   iterator.flatMap { case (key, values, stateOption) =>
>>> if (values.isEmpty) {
>>>   // don't access database
>>> } else {
>>>   // update to new state and save to database
>>> }
>>> // return new state
>>>   }
>>> }
>>>
>>> and use this overload:
>>>
>>> def updateStateByKey[S: ClassTag](
>>>   updateFunc: (Seq[V], Option[S]) => Option[S],
>>>   partitioner: Partitioner
>>> ): DStream[(K, S)]
>>>
>>> There is a JIRA: https://issues.apache.org/jira/browse/SPARK-2629 but
>>> doesn't have a doc now...
>>>
>>>
>>> Best Regards,
>>> Shixiong Zhu
>>>
>>> 2015-09-24 17:26 GMT+08:00 Bin Wang :
>>>
 Data that are not updated should already have been saved earlier: when data
 is added to the DStream for the first time, it should be considered updated.
 So saving the same data again is a waste.

 What is the community working on here? Is there any doc or discussion that I
 can look at? Thanks.



 On Thu, Sep 24, 2015 at 4:27 PM, Shixiong Zhu wrote:

> For data that are not updated, where do you save? Or do you only want
> to avoid accessing database for those that are not updated?
>
> Besides,  the community is working on optimizing "updateStateBykey"'s
> performance. Hope it will be delivered soon.
>
> Best Regards,
> Shixiong Zhu
>
> 2015-09-24 13:45 GMT+08:00 Bin Wang :
>
>> I've read the source code and it seems to be impossible, but I'd like
>> to confirm it.
>>
>> It is a very useful feature. For example, I need to store the state
>> of the DStream into my database, in order to recover it after the next
>> redeploy. But I only need to save the updated ones; saving all keys into
>> the database is a lot of waste.
>>
>> From the source code, I think it could be added easily: StateDStream
>> can get prevStateRDD, so it can compute a diff. Is there any chance to
>> add this as an API of StateDStream? If so, I can work on this feature.
>>
>> If that's not possible, is there any workaround or hack to do this
>> myself?
>>
>
>
>>>
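
For reference, a minimal Scala sketch of the pattern discussed in this thread:
the iterator-based updateStateByKey overload with one database connection per
partition, so only keys that received new values in the current batch are
written out. KVStore and MyStore are hypothetical placeholders for a real
database client, and the counting logic is just an example.

import org.apache.spark.{HashPartitioner, TaskContext}
import org.apache.spark.streaming.dstream.DStream

// Hypothetical key-value store client, standing in for a real database driver.
trait KVStore { def save(key: String, value: Int): Unit; def close(): Unit }
object MyStore { def connect(): KVStore = ??? /* open a real connection here */ }

def countsWithIncrementalSave(events: DStream[(String, Int)]): DStream[(String, Int)] = {
  val updateFunc = (iter: Iterator[(String, Seq[Int], Option[Int])]) => {
    val db = MyStore.connect()                                    // one connection per partition
    TaskContext.get().addTaskCompletionListener(_ => db.close())  // closed when the task finishes
    iter.map { case (key, values, state) =>
      val newCount = state.getOrElse(0) + values.sum
      if (values.nonEmpty) db.save(key, newCount)                 // untouched keys never hit the DB
      (key, newCount)
    }
  }
  val partitioner = new HashPartitioner(events.context.sparkContext.defaultParallelism)
  events.updateStateByKey(updateFunc, partitioner, rememberPartitioner = true)
}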
>


Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Sean Owen
+1 non-binding. This is the first time I've seen all tests pass the
first time with Java 8 + Ubuntu + "-Pyarn -Phadoop-2.6 -Phive
-Phive-thriftserver". Clearly the test improvement efforts are paying
off.

As usual the license, sigs, etc are OK.

On Thu, Sep 24, 2015 at 8:27 AM, Reynold Xin  wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 1.5.1. The vote is open until Sun, Sep 27, 2015 at 10:00 UTC and passes if a
> majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.5.1
> [ ] -1 Do not release this package because ...
>
>
> The release fixes 81 known issues in Spark 1.5.0, listed here:
> http://s.apache.org/spark-1.5.1
>
> The tag to be voted on is v1.5.1-rc1:
> https://github.com/apache/spark/commit/4df97937dbf68a9868de58408b9be0bf87dbbb94
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release (1.5.1) can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1148/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-docs/
>
>
> ===
> How can I help test this release?
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> 
> What justifies a -1 vote for this release?
> 
> -1 vote should occur for regressions from Spark 1.5.0. Bugs already present
> in 1.5.0 will not block this release.
>
> ===
> What should happen to JIRA tickets still targeting 1.5.1?
> ===
> Please target 1.5.2 or 1.6.0.
>
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Get only updated RDDs from or after updateStateBykey

2015-09-24 Thread Shixiong Zhu
You can create connection like this:

val updateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])])
=> {
  val dbConnection = create a db connection
  iterator.flatMap { case (key, values, stateOption) =>
if (values.isEmpty) {
  // don't access database
} else {
  // update to new state and save to database
}
// return new state
  }
  TaskContext.get().addTaskCompletionListener(_ => db.disconnect())
}


Best Regards,
Shixiong Zhu

2015-09-24 17:42 GMT+08:00 Bin Wang :

> It seems like a workaround. But I don't know how to get the database
> connection on the worker nodes.
>
> On Thu, Sep 24, 2015 at 5:37 PM, Shixiong Zhu wrote:
>
>> Could you write your update func like this?
>>
>> val updateFunc = (iterator: Iterator[(String, Seq[Int],
>> Option[Int])]) => {
>>   iterator.flatMap { case (key, values, stateOption) =>
>> if (values.isEmpty) {
>>   // don't access database
>> } else {
>>   // update to new state and save to database
>> }
>> // return new state
>>   }
>> }
>>
>> and use this overload:
>>
>> def updateStateByKey[S: ClassTag](
>>   updateFunc: (Seq[V], Option[S]) => Option[S],
>>   partitioner: Partitioner
>> ): DStream[(K, S)]
>>
>> There is a JIRA: https://issues.apache.org/jira/browse/SPARK-2629 but
>> doesn't have a doc now...
>>
>>
>> Best Regards,
>> Shixiong Zhu
>>
>> 2015-09-24 17:26 GMT+08:00 Bin Wang :
>>
>>> Data that are not updated should already have been saved earlier: when data
>>> is added to the DStream for the first time, it should be considered updated.
>>> So saving the same data again is a waste.
>>>
>>> What is the community working on here? Is there any doc or discussion that I
>>> can look at? Thanks.
>>>
>>>
>>>
>>> On Thu, Sep 24, 2015 at 4:27 PM, Shixiong Zhu wrote:
>>>
 For data that are not updated, where do you save? Or do you only want
 to avoid accessing database for those that are not updated?

 Besides,  the community is working on optimizing "updateStateBykey"'s
 performance. Hope it will be delivered soon.

 Best Regards,
 Shixiong Zhu

 2015-09-24 13:45 GMT+08:00 Bin Wang :

> I've read the source code and it seems to be impossible, but I'd like
> to confirm it.
>
> It is a very useful feature. For example, I need to store the state of the
> DStream into my database, in order to recover it after the next redeploy.
> But I only need to save the updated ones; saving all keys into the database
> is a lot of waste.
>
> From the source code, I think it could be added easily: StateDStream
> can get prevStateRDD, so it can compute a diff. Is there any chance to
> add this as an API of StateDStream? If so, I can work on this feature.
>
> If that's not possible, is there any workaround or hack to do this myself?
>


>>


Re: Get only updated RDDs from or after updateStateBykey

2015-09-24 Thread Bin Wang
It seems like a workaround. But I don't know how to get the database
connection on the worker nodes.

On Thu, Sep 24, 2015 at 5:37 PM, Shixiong Zhu wrote:

> Could you write your update func like this?
>
> val updateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])])
> => {
>   iterator.flatMap { case (key, values, stateOption) =>
> if (values.isEmpty) {
>   // don't access database
> } else {
>   // update to new state and save to database
> }
> // return new state
>   }
> }
>
> and use this overload:
>
> def updateStateByKey[S: ClassTag](
>   updateFunc: (Seq[V], Option[S]) => Option[S],
>   partitioner: Partitioner
> ): DStream[(K, S)]
>
> There is a JIRA: https://issues.apache.org/jira/browse/SPARK-2629 but
> doesn't have a doc now...
>
>
> Best Regards,
> Shixiong Zhu
>
> 2015-09-24 17:26 GMT+08:00 Bin Wang :
>
>> Data that are not updated should already have been saved earlier: when data
>> is added to the DStream for the first time, it should be considered updated.
>> So saving the same data again is a waste.
>>
>> What is the community working on here? Is there any doc or discussion that I
>> can look at? Thanks.
>>
>>
>>
>> On Thu, Sep 24, 2015 at 4:27 PM, Shixiong Zhu wrote:
>>
>>> For data that are not updated, where do you save? Or do you only want to
>>> avoid accessing database for those that are not updated?
>>>
>>> Besides,  the community is working on optimizing "updateStateBykey"'s
>>> performance. Hope it will be delivered soon.
>>>
>>> Best Regards,
>>> Shixiong Zhu
>>>
>>> 2015-09-24 13:45 GMT+08:00 Bin Wang :
>>>
 I've read the source code and it seems to be impossible, but I'd like
 to confirm it.

 It is a very useful feature. For example, I need to store the state of the
 DStream into my database, in order to recover it after the next redeploy.
 But I only need to save the updated ones; saving all keys into the database
 is a lot of waste.

 From the source code, I think it could be added easily: StateDStream
 can get prevStateRDD, so it can compute a diff. Is there any chance to add
 this as an API of StateDStream? If so, I can work on this feature.

 If that's not possible, is there any workaround or hack to do this myself?

>>>
>>>
>


Re: Get only updated RDDs from or after updateStateBykey

2015-09-24 Thread Shixiong Zhu
Could you write your update func like this?

val updateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])])
=> {
  iterator.flatMap { case (key, values, stateOption) =>
if (values.isEmpty) {
  // don't access database
} else {
  // update to new state and save to database
}
// return new state
  }
}

and use this overload:

def updateStateByKey[S: ClassTag](
  updateFunc: (Seq[V], Option[S]) => Option[S],
  partitioner: Partitioner
): DStream[(K, S)]

There is a JIRA: https://issues.apache.org/jira/browse/SPARK-2629 but
doesn't have a doc now...


Best Regards,
Shixiong Zhu

2015-09-24 17:26 GMT+08:00 Bin Wang :

> Data that are not updated should already have been saved earlier: when data
> is added to the DStream for the first time, it should be considered updated.
> So saving the same data again is a waste.
>
> What is the community working on here? Is there any doc or discussion that I
> can look at? Thanks.
>
>
>
> On Thu, Sep 24, 2015 at 4:27 PM, Shixiong Zhu wrote:
>
>> For data that are not updated, where do you save? Or do you only want to
>> avoid accessing database for those that are not updated?
>>
>> Besides,  the community is working on optimizing "updateStateBykey"'s
>> performance. Hope it will be delivered soon.
>>
>> Best Regards,
>> Shixiong Zhu
>>
>> 2015-09-24 13:45 GMT+08:00 Bin Wang :
>>
>>> I've read the source code and it seems to be impossible, but I'd like to
>>> confirm it.
>>>
>>> It is a very useful feature. For example, I need to store the state of the
>>> DStream into my database, in order to recover it after the next redeploy.
>>> But I only need to save the updated ones; saving all keys into the database
>>> is a lot of waste.
>>>
>>> From the source code, I think it could be added easily: StateDStream
>>> can get prevStateRDD, so it can compute a diff. Is there any chance to add
>>> this as an API of StateDStream? If so, I can work on this feature.
>>>
>>> If that's not possible, is there any workaround or hack to do this myself?
>>>
>>
>>


Re: Get only updated RDDs from or after updateStateBykey

2015-09-24 Thread Bin Wang
Data that are not updated should have been saved earlier: when data is added
to the DStream for the first time, it should be considered updated, so saving
the same data again is a waste.

What is the community doing? Is there any doc or discussion that I can
look for? Thanks.



Shixiong Zhu wrote on Thu, Sep 24, 2015 at 4:27 PM:

> For data that are not updated, where do you save them? Or do you only want
> to avoid accessing the database for those that are not updated?
>
> Besides, the community is working on optimizing "updateStateByKey"'s
> performance. Hope it will be delivered soon.
>
> Best Regards,
> Shixiong Zhu
>
> 2015-09-24 13:45 GMT+08:00 Bin Wang :
>
>> I've read the source code and it seems to be impossible, but I'd like to
>> confirm it.
>>
>> It is a very useful feature. For example, I need to store the state of the
>> DStream in my database, in order to recover it after the next redeploy. But
>> I only need to save the updated keys; saving all keys into the database is
>> a lot of waste.
>>
>> From the source code, I think it could be added easily: StateDStream can
>> get prevStateRDD so that it can compute a diff. Is there any chance to add
>> this as an API of StateDStream? If so, I can work on this feature.
>>
>> If that is not possible, is there any workaround or hack to do this myself?
>>
>
>


Re: Checkpoint directory structure

2015-09-24 Thread Tathagata Das
Thanks for the log file. Unfortunately, it is not sufficient on its own, as it
does not show why the file does not exist. It could be that the file was
somehow deleted before the failure. To determine that, I need to see both the
before-failure and after-recovery logs. If this can be reproduced, could you
generate logs from both before and after the failure?
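
For reference, the failing recovery path in the quoted stack trace below is
the standard StreamingContext.getOrCreate checkpoint pattern; a minimal sketch
follows. The checkpoint path, app name, and batch interval are placeholders
and are not taken from StatCounter.scala.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointRecoverySketch {
  // Placeholder checkpoint directory; in this thread it is an HDFS path.
  val checkpointDir = "hdfs:///user/root/checkpoint"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("checkpoint-recovery-sketch")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)  // metadata and RDD checkpoints are written here
    // ... define the DStream graph here ...
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On a clean start this calls createContext(); after a failure it rebuilds
    // the context from the checkpoint files, which is where the "Checkpoint
    // directory does not exist" error in the log below is thrown.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}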

On Wed, Sep 23, 2015 at 7:33 PM, Bin Wang  wrote:

> I've attached the full log. The error is like this:
>
> 15/09/23 17:47:39 ERROR yarn.ApplicationMaster: User class threw
> exception: java.lang.IllegalArgumentException: requirement failed:
> Checkpoint directory does not exist: hdfs://
> szq2.appadhoc.com:8020/user/root/checkpoint/d3714249-e03a-45c7-a0d5-1dc870b7d9f2/rdd-26909
> java.lang.IllegalArgumentException: requirement failed: Checkpoint
> directory does not exist: hdfs://
> szq2.appadhoc.com:8020/user/root/checkpoint/d3714249-e03a-45c7-a0d5-1dc870b7d9f2/rdd-26909
> at scala.Predef$.require(Predef.scala:233)
> at
> org.apache.spark.rdd.ReliableCheckpointRDD.<init>(ReliableCheckpointRDD.scala:45)
> at
> org.apache.spark.SparkContext$$anonfun$checkpointFile$1.apply(SparkContext.scala:1227)
> at
> org.apache.spark.SparkContext$$anonfun$checkpointFile$1.apply(SparkContext.scala:1227)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
> at org.apache.spark.SparkContext.withScope(SparkContext.scala:709)
> at org.apache.spark.SparkContext.checkpointFile(SparkContext.scala:1226)
> at
> org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$restore$1.apply(DStreamCheckpointData.scala:112)
> at
> org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$restore$1.apply(DStreamCheckpointData.scala:109)
> at
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
> at
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
> at
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
> at
> org.apache.spark.streaming.dstream.DStreamCheckpointData.restore(DStreamCheckpointData.scala:109)
> at
> org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:487)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:488)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:488)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at
> org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:488)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:488)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:488)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at
> org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:488)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:488)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:488)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at
> org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:488)
> at
> org.apache.spark.streaming.DStreamGraph$$anonfun$restoreCheckpointData$2.apply(DStreamGraph.scala:153)
> at
> org.apache.spark.streaming.DStreamGraph$$anonfun$restoreCheckpointData$2.apply(DStreamGraph.scala:153)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> org.apache.spark.streaming.DStreamGraph.restoreCheckpointData(DStreamGraph.scala:153)
> at
> org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:158)
> at
> org.apache.spark.streaming.StreamingContext$$anonfun$getOrCreate$1.apply(StreamingContext.scala:837)
> at
> org.apache.spark.streaming.StreamingContext$$anonfun$getOrCreate$1.apply(StreamingContext.scala:837)
> at scala.Option.map(Option.scala:145)
> at
> org.apache.spark.streaming.StreamingContext$.getOrCreate(StreamingContext.scala:837)
> at com.appadhoc.data.main.StatCounter$.main(StatCounter.scala:51)
> at com.appadhoc.data.main.StatCounter.main(StatCounter.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:525)
> 15/09/23 17:47:39 INFO yarn.ApplicationMaster: Final app status: FAILED,
> exitCode: 15, (reason: User class threw 

Re: Get only updated RDDs from or after updateStateBykey

2015-09-24 Thread Shixiong Zhu
For data that are not updated, where do you save them? Or do you only want to
avoid accessing the database for those that are not updated?

Besides, the community is working on optimizing "updateStateByKey"'s
performance. Hope it will be delivered soon.

Best Regards,
Shixiong Zhu

2015-09-24 13:45 GMT+08:00 Bin Wang :

> I've read the source code and it seems to be impossible, but I'd like to
> confirm it.
>
> It is a very useful feature. For example, I need to store the state of the
> DStream in my database, in order to recover it after the next redeploy. But
> I only need to save the updated keys; saving all keys into the database is
> a lot of waste.
>
> From the source code, I think it could be added easily: StateDStream can
> get prevStateRDD so that it can compute a diff. Is there any chance to add
> this as an API of StateDStream? If so, I can work on this feature.
>
> If that is not possible, is there any workaround or hack to do this myself?
>


[VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version
1.5.1. The vote is open until Sun, Sep 27, 2015 at 10:00 UTC and passes if
a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.5.1
[ ] -1 Do not release this package because ...


The release fixes 81 known issues in Spark 1.5.0, listed here:
http://s.apache.org/spark-1.5.1

The tag to be voted on is v1.5.1-rc1:
https://github.com/apache/spark/commit/4df97937dbf68a9868de58408b9be0bf87dbbb94

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release (1.5.1) can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1148/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-docs/


===
How can I help test this release?
===
If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions.
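
For anyone testing from a build tool, a minimal build.sbt sketch (assuming sbt
with Scala 2.10; only spark-core is shown, so add whichever modules your
workload needs) that resolves the candidate artifacts from the staging
repository listed above:

scalaVersion := "2.10.4"

resolvers += "spark-1.5.1-rc1 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1148/"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1" % "provided"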


===
What justifies a -1 vote for this release?
===
A -1 vote should occur for regressions from Spark 1.5.0. Bugs already present
in 1.5.0 will not block this release.

===
What should happen to JIRA tickets still targeting 1.5.1?
===
Please target 1.5.2 or 1.6.0.