Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-30 Thread Sandy Ryza
+1 (non-binding)
built from source and ran some jobs against YARN

-Sandy

On Sat, Aug 29, 2015 at 5:50 AM, vaquar khan vaquar.k...@gmail.com wrote:


 +1 (1.5.0 RC2). Compiled on Windows with YARN.

 Regards,
 Vaquar khan
 +1 (non-binding, of course)

 1. Compiled OSX 10.10 (Yosemite) OK Total time: 42:36 min
  mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
 2. Tested pyspark, mllib
 2.1. statistics (min,max,mean,Pearson,Spearman) OK
 2.2. Linear/Ridge/Lasso Regression OK
 2.3. Decision Tree, Naive Bayes OK
 2.4. KMeans OK
Center And Scale OK
 2.5. RDD operations OK
   State of the Union Texts - MapReduce, Filter,sortByKey (word count)
 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
Model evaluation/optimization (rank, numIter, lambda) with
 itertools OK
 3. Scala - MLlib
 3.1. statistics (min,max,mean,Pearson,Spearman) OK
 3.2. LinearRegressionWithSGD OK
 3.3. Decision Tree OK
 3.4. KMeans OK
 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
 3.6. saveAsParquetFile OK
 3.7. Read and verify the 3.6 save (above) - sqlContext.parquetFile,
 registerTempTable, sql OK
 3.8. result = sqlContext.sql("SELECT
 OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
 JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
 4.0. Spark SQL from Python OK
 4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
 5.0. Packages
 5.1. com.databricks.spark.csv - read/write OK
 (--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work. But
 com.databricks:spark-csv_2.11:1.2.0 worked)
 6.0. DataFrames
 6.1. cast,dtypes OK
 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
 6.3. joins,sql,set operations,udf OK
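
 A rough Scala sketch of the parquet/SQL/DataFrame checks in 3.6-3.8, 5.1 and
 6.x above (paths and input data are placeholders; assumes the spark-shell
 sqlContext):

   // Placeholder inputs standing in for the Orders/OrderDetails test data.
   val orders = sqlContext.read.json("/tmp/orders.json")
   val orderDetails = sqlContext.read.json("/tmp/order_details.json")

   // 3.6 / 3.7: save as parquet, read back, register and query.
   orders.write.parquet("/tmp/orders.parquet")
   val reloaded = sqlContext.read.parquet("/tmp/orders.parquet")
   reloaded.registerTempTable("Orders")
   orderDetails.registerTempTable("OrderDetails")

   // 3.8: join via SQL.
   val result = sqlContext.sql(
     "SELECT OrderDetails.OrderID, ShipCountry, UnitPrice, Qty, Discount " +
     "FROM Orders INNER JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID")

   // 5.1: spark-csv read (package loaded via --packages com.databricks:spark-csv_2.11:1.2.0).
   val csv = sqlContext.read.format("com.databricks.spark.csv")
     .option("header", "true").load("/tmp/orders.csv")

   // 6.x-style DataFrame checks.
   result.dtypes
   result.groupBy("ShipCountry").avg("UnitPrice").show()
   result.na.drop().count()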

 Cheers
 k/

 On Tue, Aug 25, 2015 at 9:28 PM, Reynold Xin r...@databricks.com wrote:

 Please vote on releasing the following candidate as Apache Spark version
 1.5.0. The vote is open until Friday, Aug 29, 2015 at 5:00 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.5.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see http://spark.apache.org/


 The tag to be voted on is v1.5.0-rc2:

 https://github.com/apache/spark/tree/727771352855dbb780008c449a877f5aaa5fc27a

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-bin/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release (published as 1.5.0-rc2) can be
 found at:
 https://repository.apache.org/content/repositories/orgapachespark-1141/

 The staging repository for this release (published as 1.5.0) can be found
 at:
 https://repository.apache.org/content/repositories/orgapachespark-1140/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-docs/


 ===
 How can I help test this release?
 ===
 If you are a Spark user, you can help us test this release by taking an
 existing Spark workload and running it on this release candidate, then
 reporting any regressions.


 
 What justifies a -1 vote for this release?
 
 This vote is happening towards the end of the 1.5 QA period, so -1 votes
 should only occur for significant regressions from 1.4. Bugs already
 present in 1.4, minor regressions, or bugs related to new features will not
 block this release.


 ===
 What should happen to JIRA tickets still targeting 1.5.0?
 ===
 1. It is OK for documentation patches to target 1.5.0 and still go into
 branch-1.5, since documentation will be packaged separately from the
 release.
 2. New features for non-alpha-modules should target 1.6+.
 3. Non-blocker bug fixes should target 1.5.1 or 1.6.0, or drop the target
 version.


 ==
 Major changes to help you focus your testing
 ==

 As of today, Spark 1.5 contains more than 1000 commits from 220+
 contributors. I've curated a list of important changes for 1.5. For the
 complete list, please refer to Apache JIRA changelog.

 RDD/DataFrame/SQL APIs

 - New UDAF interface
 - DataFrame hints for broadcast join
 - expr function for turning a SQL expression into a DataFrame column (see
 the short sketch after this list)
 - Improved support for NaN values
 - StructType now supports ordering
 - TimestampType precision is reduced to 1us
 - 100 new built-in expressions, including date/time, string, math
 - memory and local disk only checkpointing
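
 As a rough illustration of the broadcast-hint and expr items above (the
 orders/details DataFrames here are just placeholders, not part of the
 release notes):

   import org.apache.spark.sql.functions.{broadcast, expr}

   // Hint to the planner that `details` is small enough to broadcast.
   val joined = orders.join(broadcast(details), "OrderID")

   // expr() turns a SQL expression string into a Column.
   val withNet = joined.withColumn("net", expr("UnitPrice * Qty * (1 - Discount)"))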

 DataFrame/SQL Backend Execution

 - Code generation on by default
 - Improved join, 

Re: Tungsten off heap memory access for C++ libraries

2015-08-30 Thread Paul Weiss
Reynold,

That is great to hear.  Definitely interested in how point 2 is being
implemented and how it will be exposed in C++.  One important aspect of
leveraging the off-heap memory is how the data is organized, as well as
being able to easily access it from the C++ side.  For example, how would
you store a multi-dimensional array of doubles, and how would you specify
that?  Perhaps Avro or Protobuf could be used for storing complex nested
structures, although making that zero-copy could be a challenge.
Regardless of how the internals lay the data out in memory, the important
requirements are:

a) ensuring zero copy
b) providing a friendly api on the C++ side so folks don't have to deal
with raw bytes, serialization, and JNI
c) ability to specify a complex (multi type and nested) structure via a
schema for memory storage (compile time generated would be sufficient but
run time dynamically would be extremely flexible)

Perhaps a simple way to accomplish this would be to enhance dataframes to
have a C++ API that can access the off-heap memory in a clean way from Spark
(in process and w/ zero copy); a rough sketch of the JVM-side handoff is
below.
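
To make (a) and (b) a bit more concrete, here is a rough, non-Spark sketch of
the JVM side of such a handoff: the doubles live in a direct (off-heap)
ByteBuffer that is passed to a native method, so the C++ side can wrap the
address (e.g. via GetDirectBufferAddress) as a double* without copying.  The
NativeBridge library and method names here are hypothetical:

  import java.nio.{ByteBuffer, ByteOrder}

  object NativeBridge {
    // Hypothetical JNI entry point; a C++ implementation would wrap the
    // buffer's address in a typed view rather than copying the bytes.
    @native def processDoubles(buf: ByteBuffer, numDoubles: Int): Unit
    // System.loadLibrary("nativebridge")  // assumes a separately built library
  }

  object OffHeapHandoffSketch {
    def main(args: Array[String]): Unit = {
      val n = 1024
      // Direct (off-heap) allocation, in native byte order so the C++ side
      // can read the doubles as-is.
      val buf = ByteBuffer.allocateDirect(n * 8).order(ByteOrder.nativeOrder())
      var i = 0
      while (i < n) { buf.putDouble(i * 8, i.toDouble); i += 1 }
      // NativeBridge.processDoubles(buf, n)  // zero-copy call into C++
    }
  }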

Also, is this work being done on a branch I could look into further and try
out?

thanks,
-paul



On Sat, Aug 29, 2015 at 9:40 PM, Reynold Xin r...@databricks.com wrote:

 Supporting non-JVM code without memory copying and serialization is
 actually one of the motivations behind Tungsten. We didn't talk much about
 it since it is not end-user-facing and it is still too early. There are a
 few challenges still:

 1. Spark cannot run entirely in off-heap mode (by entirely here I'm
 referring to all the data-plane memory, not control-plane such as RPCs
 since those don't matter much). There is nothing fundamental. It just takes
 a while to make sure all code paths allocate/free memory using the proper
 allocators.

 2. The memory layout of data is still in flux, since we are only 4 months
 into Tungsten. It will change pretty frequently for the foreseeable
 future, and as a result, the C++ side of things will have to change as well.



 On Sat, Aug 29, 2015 at 12:29 PM, Timothy Chen tnac...@gmail.com wrote:

 I would also like to see data shared off-heap with a 3rd-party C++
 library via JNI. I think the complications would be how to manage this
 memory and how to make sure the 3rd-party libraries also adhere to the
 access contracts.

 Tim

 On Sat, Aug 29, 2015 at 12:17 PM, Paul Weiss paulweiss@gmail.com
 wrote:
  Hi,

  Would the benefits of project tungsten be available for access by non-JVM
  programs directly into the off-heap memory?  Spark using dataframes w/ the
  tungsten improvements will definitely help analytics within the JVM world,
  but accessing outside 3rd-party C++ libraries is a challenge, especially
  when trying to do it with a zero copy.

  Ideally the off-heap memory would be accessible to a non-JVM program and be
  invoked in process using JNI per each partition.  The alternatives to this
  involve the additional cost of starting another process if using pipes, as
  well as the additional cost of copying all the data.

  In addition to read-only non-JVM access in process, would there be a way to
  share the dataframe that is in memory out of process and across spark
  contexts?  This way an expensive, complicated initial build-up of a
  dataframe would not have to be replicated, and the startup costs would not
  have to be paid again on failure.

  thanks,

  -paul
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org





[ANNOUNCE] New testing capabilities for pull requests

2015-08-30 Thread Patrick Wendell
Hi All,

For pull requests that modify the build, you can now test different
build permutations as part of the pull request builder. To trigger
these, you add a special phrase to the title of the pull request.
Current options are:

[test-maven] - run tests using maven and not sbt
[test-hadoop1.0] - test using older hadoop versions (can use 1.0, 2.0,
2.2, and 2.3).
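
For example (the JIRA number here is just a placeholder), a pull request
titled

  [SPARK-XXXX] [test-maven] [test-hadoop1.0] Adjust the Hadoop build profiles

should trigger both the Maven run and the Hadoop 1.0 run for that pull
request.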

The relevant source code is here:
https://github.com/apache/spark/blob/master/dev/run-tests-jenkins#L193

This is useful because it allows up-front testing of build changes, so that
breakage shows up before a patch is merged rather than after.

I've documented this on the wiki:
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org