+1 (non-binding)
built from source and ran some jobs against YARN
-Sandy
On Sat, Aug 29, 2015 at 5:50 AM, vaquar khan vaquar.k...@gmail.com wrote:
+1 (1.5.0 RC2)
Compiled on Windows with YARN.
Regards,
Vaquar khan
+1 (non-binding, of course)
1. Compiled OSX 10.10 (Yosemite) OK Total time: 42:36 min
mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
2. Tested pyspark, mllib
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
Center And Scale OK
2.5. RDD operations OK
State of the Union Texts - MapReduce, Filter,sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
Model evaluation/optimization (rank, numIter, lambda) with
itertools OK
3. Scala - MLlib
3.1. statistics (min,max,mean,Pearson,Spearman) OK
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
3.6. saveAsParquetFile OK
3.7. Read and verify the 3.6 save (above) - sqlContext.parquetFile,
registerTempTable, sql OK
3.8. result = sqlContext.sql("SELECT
OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
4.0. Spark SQL from Python OK
4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
5.0. Packages
5.1. com.databricks.spark.csv - read/write OK
(--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work. But
com.databricks:spark-csv_2.11:1.2.0 worked)
6.0. DataFrames
6.1. cast,dtypes OK
6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
6.3. joins,sql,set operations,udf OK
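
For readers who want to reproduce something like the model evaluation step in 2.6, here is a minimal sketch of a grid search over (rank, numIter, lambda) driven by itertools. It is not the script used for the test run above. The function names grid_search and toy_score are hypothetical, and the scorer is a stand-in so the snippet runs without a cluster; in a real run the scorer would presumably train a recommendation model on the ratings and return a validation error such as RMSE.

```python
import itertools


def grid_search(train_and_score, ranks, num_iters, lambdas):
    """Try every (rank, num_iter, lambda) combination and keep the lowest score.

    train_and_score is a caller-supplied function (hypothetical here) that
    trains a model with the given parameters and returns an error to minimize.
    """
    best = None
    for rank, num_iter, lmbda in itertools.product(ranks, num_iters, lambdas):
        score = train_and_score(rank, num_iter, lmbda)
        if best is None or score < best[0]:
            best = (score, (rank, num_iter, lmbda))
    return best


def toy_score(rank, num_iter, lmbda):
    # Stand-in scorer (assumption): a made-up error surface, not real data.
    # Replace with code that trains on the ratings and computes validation error.
    return abs(rank - 12) + abs(num_iter - 20) / 10.0 + abs(lmbda - 0.1)


best_score, best_params = grid_search(toy_score, [8, 12], [10, 20], [0.1, 1.0])
print(best_params, best_score)
```

itertools.product keeps the loop flat, so adding another hyperparameter only means adding another list.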
Cheers
k/
On Tue, Aug 25, 2015 at 9:28 PM, Reynold Xin r...@databricks.com wrote:
Please vote on releasing the following candidate as Apache Spark version
1.5.0. The vote is open until Friday, Aug 29, 2015 at 5:00 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.
[ ] +1 Release this package as Apache Spark 1.5.0
[ ] -1 Do not release this package because ...
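
Since this thread mixes binding and non-binding votes, the passing rule above can be sketched as a small tally. This assumes the rule means at least three binding (PMC) +1 votes and more binding +1 than binding -1 votes; the function name vote_passes and the (value, binding) tuple layout are hypothetical, chosen only for illustration.

```python
def vote_passes(votes):
    """Return True if the vote passes under the assumed reading of the rule.

    votes is an iterable of (value, binding) pairs, where value is +1 or -1
    and binding is True for PMC votes. Non-binding votes are not counted.
    """
    plus = sum(1 for value, binding in votes if binding and value > 0)
    minus = sum(1 for value, binding in votes if binding and value < 0)
    return plus >= 3 and plus > minus


# Three binding +1 votes pass; non-binding +1 votes alone do not.
print(vote_passes([(1, True), (1, True), (1, True), (1, False)]))
print(vote_passes([(1, True), (1, True), (1, False), (1, False)]))
```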
To learn more about Apache Spark, please see http://spark.apache.org/
The tag to be voted on is v1.5.0-rc2:
https://github.com/apache/spark/tree/727771352855dbb780008c449a877f5aaa5fc27a
The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-bin/
Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc
The staging repository for this release (published as 1.5.0-rc2) can be
found at:
https://repository.apache.org/content/repositories/orgapachespark-1141/
The staging repository for this release (published as 1.5.0) can be found
at:
https://repository.apache.org/content/repositories/orgapachespark-1140/
The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-docs/
===
How can I help test this release?
===
If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions.
===
What justifies a -1 vote for this release?
===
This vote is happening towards the end of the 1.5 QA period, so -1 votes
should only occur for significant regressions from 1.4. Bugs already
present in 1.4, minor regressions, or bugs related to new features will not
block this release.
===
What should happen to JIRA tickets still targeting 1.5.0?
===
1. It is OK for documentation patches to target 1.5.0 and still go into
branch-1.5, since documentation will be packaged separately from the
release.
2. New features for non-alpha-modules should target 1.6+.
3. Non-blocker bug fixes should target 1.5.1 or 1.6.0, or drop the target
version.
==
Major changes to help you focus your testing
==
As of today, Spark 1.5 contains more than 1000 commits from 220+
contributors. I've curated a list of important changes for 1.5. For the
complete list, please refer to the Apache JIRA changelog.
RDD/DataFrame/SQL APIs
- New UDAF interface
- DataFrame hints for broadcast join
- expr function for turning a SQL expression into a DataFrame column
- Improved support for NaN values
- StructType now supports ordering
- TimestampType precision is reduced to 1us
- 100 new built-in expressions, including date/time, string, math
- memory and local disk only checkpointing
DataFrame/SQL Backend Execution
- Code generation on by default
- Improved join,