[GitHub] spark pull request: [SPARK-6612] [MLLib] [PySpark] Python KMeans p...

FlytxtRnD Mon, 04 May 2015 23:35:59 -0700

GitHub user FlytxtRnD reopened a pull request:

    https://github.com/apache/spark/pull/5647


    [SPARK-6612] [MLLib] [PySpark] Python KMeans parity

    The following items are added to Python kmeans:
    
    kmeans - setEpsilon, setInitializationSteps
    KMeansModel - computeCost, k
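
For readers unfamiliar with `computeCost`, the sketch below shows the quantity it is understood to report: the sum of squared Euclidean distances from each point to its nearest cluster center. It is plain Python rather than PySpark, and `compute_cost`, `points`, and `centers` are hypothetical names chosen for illustration; none of them come from this pull request.

```python
def compute_cost(points, centers):
    """Sum of squared Euclidean distances from each point to its nearest center."""
    def squared_distance(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))

    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# Hypothetical toy data: two tight groups and one center per group.
points = [(0.0, 0.0), (1.0, 1.0), (9.0, 8.0), (8.0, 9.0)]
centers = [(0.5, 0.5), (8.5, 8.5)]

cost = compute_cost(points, centers)  # each point contributes 0.5
k = len(centers)                      # number of clusters, analogous to the new `k` member
```

A lower cost for the same `k` indicates tighter clusters, which is why exposing both values on the model is useful when comparing runs.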

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/FlytxtRnD/spark newPyKmeansAPI

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5647.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5647
    
----
commit b61939a685f5cbdd6b0ef655b1d5a825f5646782
Author: Hrishikesh Subramonian <hrishikesh.subramon...@flytxt.com>
Date:   2015-04-22T10:29:48Z

    Python Kmeans - setEpsilon, setInitializationSteps, k and computeCost added.

commit 990383761841b444506e91f3052c2de3736d6052
Author: Hrishikesh Subramonian <hrishikesh.subramon...@flytxt.com>
Date:   2015-04-22T11:31:10Z

    added arguments in python tests

commit 1084663d0217b7adac40fb63b991476086ebd1fa
Author: Hrishikesh Subramonian <hrishikesh.subramon...@flytxt.com>
Date:   2015-04-28T04:59:15Z

    python 3 fixes

commit 7ecfd000af37899a920cae838cc41bcc5ceca053
Author: Hrishikesh Subramonian <hrishikesh.subramon...@flytxt.com>
Date:   2015-04-28T05:02:01Z

    Merge remote-tracking branch 'upstream/master' into newPyKmeansAPI

commit 703e8f609a8eb81b2a1b2492611909a562b0fbed
Author: Hrishikesh Subramonian <hrishikesh.subramon...@flytxt.com>
Date:   2015-04-29T03:53:17Z

    doc test corrections

commit d6d3a093719fb5ba606996b35cb3da2dfbf90c1f
Author: Hrishikesh Subramonian <hrishikesh.subramon...@flytxt.com>
Date:   2015-04-29T03:54:34Z

    Merge remote-tracking branch 'upstream/master' into newPyKmeansAPI

commit 9351b62f16371b538ab0715461011bfcba2cea31
Author: Hrishikesh Subramonian <hrishikesh.subramon...@flytxt.com>
Date:   2015-04-30T04:37:39Z

    set seed to fixed value in doc test

commit 0319821db7406f3cca359af5bc021d2f3fd92a17
Author: Hrishikesh Subramonian <hrishikesh.subramon...@flytxt.com>
Date:   2015-04-30T04:41:13Z

    Merge remote-tracking branch 'upstream/master' into newPyKmeansAPI

commit ba49eb1625b1190d8aaf2c55dc1f6309ac3e080c
Author: DB Tsai <d...@netflix.com>
Date:   2015-04-30T04:44:41Z

    Some code clean up.
    
    Author: DB Tsai <d...@netflix.com>
    
    Closes #5794 from dbtsai/clean and squashes the following commits:
    
    ad639dd [DB Tsai] Indentation
    834d527 [DB Tsai] Some code clean up.

commit 4459514497eb76e6f2465d071857854390453805
Author: Zhongshuai Pei <799203...@qq.com>
Date:   2015-04-30T05:44:14Z

    [SPARK-7225][SQL] CombineLimits optimizer does not work
    
    SQL
    ```
    select key from (select key from src limit 100) t2 limit 10
    ```
    Optimized Logical Plan before modifying
    ```
    == Optimized Logical Plan ==
    Limit 10
    Limit 100
    Project key#3
    MetastoreRelation default, src, None
    ```
    Optimized Logical Plan after modifying
    ```
    == Optimized Logical Plan ==
    Limit 10
     Project [key#1]
      MetastoreRelation default, src, None
    ```
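
The rewrite above can be sketched in a few lines. This is an illustrative Python model and not Catalyst code: the tuple-based plan encoding and the name `combine_limits` are assumptions made for the example.

```python
# Hypothetical plan encoding: ("Limit", n, child), ("Project", cols, child), ("Relation", name)
def combine_limits(plan):
    """Collapse directly nested Limit nodes into one Limit with the smaller bound."""
    kind = plan[0]
    if kind == "Relation":
        return plan
    if kind == "Project":
        return ("Project", plan[1], combine_limits(plan[2]))
    # kind == "Limit": rewrite the child first, then merge if it is also a Limit
    n, child = plan[1], combine_limits(plan[2])
    if child[0] == "Limit":
        return ("Limit", min(n, child[1]), child[2])
    return ("Limit", n, child)


before = ("Limit", 10, ("Limit", 100, ("Project", ["key"], ("Relation", "src"))))
after = combine_limits(before)
```

Taking the minimum of the two bounds is safe because the outer limit can never return more rows than the inner one produces.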
    
    Author: Zhongshuai Pei <799203...@qq.com>
    Author: DoingDone9 <799203...@qq.com>
    
    Closes #5770 from DoingDone9/limitOptimizer and squashes the following commits:
    
    c68eaa7 [Zhongshuai Pei] Update CombiningLimitsSuite.scala
    97e18cf [Zhongshuai Pei] Update Optimizer.scala
    19ab875 [Zhongshuai Pei] Update CombiningLimitsSuite.scala
    7db4566 [Zhongshuai Pei] Update CombiningLimitsSuite.scala
    e2a491d [Zhongshuai Pei] Update Optimizer.scala
    f03fe7f [Zhongshuai Pei] Merge pull request #12 from apache/master
    f12fa50 [Zhongshuai Pei] Merge pull request #10 from apache/master
    f61210c [Zhongshuai Pei] Merge pull request #9 from apache/master
    34b1a9a [Zhongshuai Pei] Merge pull request #8 from apache/master
    802261c [DoingDone9] Merge pull request #7 from apache/master
    d00303b [DoingDone9] Merge pull request #6 from apache/master
    98b134f [DoingDone9] Merge pull request #5 from apache/master
    161cae3 [DoingDone9] Merge pull request #4 from apache/master
    c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
    cb1852d [DoingDone9] Merge pull request #2 from apache/master
    c3f046f [DoingDone9] Merge pull request #1 from apache/master

commit 254e0509762937acc9c72b432d5d953bf72c3c52
Author: Vincenzo Selvaggio <vselvag...@hotmail.it>
Date:   2015-04-30T06:21:21Z

    [SPARK-1406] Mllib pmml model export
    
    See PDF attached to the JIRA issue 1406.
    
    The contribution is my original work and I license the work to the project under the project's open source license.
    
    Author: Vincenzo Selvaggio <vselvag...@hotmail.it>
    Author: Xiangrui Meng <m...@databricks.com>
    Author: selvinsource <vselvag...@hotmail.it>
    
    Closes #3062 from selvinsource/mllib_pmml_model_export_SPARK-1406 and squashes the following commits:
    
    852aac6 [Vincenzo Selvaggio] [SPARK-1406] Update JPMML version to 1.1.15 in LICENSE file
    085cf42 [Vincenzo Selvaggio] [SPARK-1406] Added Double Min and Max Fixed scala style
    30165c4 [Vincenzo Selvaggio] [SPARK-1406] Fixed extreme cases for logit
    7a5e0ec [Vincenzo Selvaggio] [SPARK-1406] Binary classification for SVM and Logistic Regression
    cfcb596 [Vincenzo Selvaggio] [SPARK-1406] Throw IllegalArgumentException when exporting a multinomial logistic regression
    25dce33 [Vincenzo Selvaggio] [SPARK-1406] Update code to latest pmml model
    dea98ca [Vincenzo Selvaggio] [SPARK-1406] Exclude transitive dependency for pmml model
    66b7c12 [Vincenzo Selvaggio] [SPARK-1406] Updated pmml model lib to 1.1.15, latest Java 6 compatible
    a0a55f7 [Vincenzo Selvaggio] Merge pull request #2 from mengxr/SPARK-1406
    3c22f79 [Xiangrui Meng] more code style
    e2313df [Vincenzo Selvaggio] Merge pull request #1 from mengxr/SPARK-1406
    472d757 [Xiangrui Meng] fix code style
    1676e15 [Vincenzo Selvaggio] fixed scala issue
    e2ffae8 [Vincenzo Selvaggio] fixed scala style
    b8823b0 [Vincenzo Selvaggio] Merge remote-tracking branch 'upstream/master' into mllib_pmml_model_export_SPARK-1406
    b25bbf7 [Vincenzo Selvaggio] [SPARK-1406] Added export of pmml to distributed file system using the spark context
    7a949d0 [Vincenzo Selvaggio] [SPARK-1406] Fixed scala style
    f46c75c [Vincenzo Selvaggio] [SPARK-1406] Added PMMLExportable to supported models
    7b33b4e [Vincenzo Selvaggio] [SPARK-1406] Added a PMMLExportable interface Restructured code in a new package mllib.pmml Supported models implements the new PMMLExportable interface: LogisticRegression, SVM, KMeansModel, LinearRegression, RidgeRegression, Lasso
    d559ec5 [Vincenzo Selvaggio] Merge remote-tracking branch 'upstream/master' into mllib_pmml_model_export_SPARK-1406
    8fe12bb [Vincenzo Selvaggio] [SPARK-1406] Adjusted logistic regression export description and target categories
    03bc3a5 [Vincenzo Selvaggio] added logistic regression
    da2ec11 [Vincenzo Selvaggio] [SPARK-1406] added linear SVM PMML export
    82f2131 [Vincenzo Selvaggio] Merge remote-tracking branch 'upstream/master' into mllib_pmml_model_export_SPARK-1406
    19adf29 [Vincenzo Selvaggio] [SPARK-1406] Fixed scala style
    1faf985 [Vincenzo Selvaggio] [SPARK-1406] Added target field to the regression model for completeness Adjusted unit test to deal with this change
    3ae8ae5 [Vincenzo Selvaggio] [SPARK-1406] Adjusted imported order according to the guidelines
    c67ce81 [Vincenzo Selvaggio] Merge remote-tracking branch 'upstream/master' into mllib_pmml_model_export_SPARK-1406
    78515ec [Vincenzo Selvaggio] [SPARK-1406] added pmml export for LinearRegressionModel, RidgeRegressionModel and LassoModel
    e29dfb9 [Vincenzo Selvaggio] removed version, by default is set to 4.2 (latest from jpmml) removed copyright
    ae8b993 [Vincenzo Selvaggio] updated some commented tests to use the new ModelExporter object reordered the imports
    df8a89e [Vincenzo Selvaggio] added pmml version to pmml model changed the copyright to spark
    a1b4dc3 [Vincenzo Selvaggio] updated imports
    834ca44 [Vincenzo Selvaggio] reordered the import accordingly to the guidelines
    349a76b [Vincenzo Selvaggio] new helper object to serialize the models to pmml format
    c3ef9b8 [Vincenzo Selvaggio] set it to private
    6357b98 [Vincenzo Selvaggio] set it to private
    e1eb251 [Vincenzo Selvaggio] removed serialization part, this will be part of the ModelExporter helper object
    aba5ee1 [Vincenzo Selvaggio] fixed cluster export
    cd6c07c [Vincenzo Selvaggio] fixed scala style to run tests
    f75b988 [Vincenzo Selvaggio] Merge remote-tracking branch 'origin/master' into mllib_pmml_model_export_SPARK-1406
    07a29bf [selvinsource] Update LICENSE
    8841439 [Vincenzo Selvaggio] adjust scala style in order to compile
    1433b11 [Vincenzo Selvaggio] complete suite tests
    8e71b8d [Vincenzo Selvaggio] kmeans pmml export implementation
    9bc494f [Vincenzo Selvaggio] added scala suite tests added saveLocalFile to ModelExport trait
    226e184 [Vincenzo Selvaggio] added javadoc and export model type in case there is a need to support other types of export (not just PMML)
    a0e3679 [Vincenzo Selvaggio] export and pmml export traits kmeans test implementation

commit 47bf406d608c4777f5f383ba439608f673034a1d
Author: Patrick Wendell <patr...@databricks.com>
Date:   2015-04-30T08:02:33Z

    [HOTFIX] Disabling flaky test (fix in progress as part of SPARK-7224)

commit 7dacc08ab36188991a001df23880167433844767
Author: Burak Yavuz <brk...@gmail.com>
Date:   2015-04-30T17:19:08Z

    [SPARK-7224] added mock repository generator for --packages tests
    
    This patch contains an `IvyTestUtils` file, which dynamically generates jars and pom files to test the `--packages` feature without having to rely on the internet and Maven Central.
    
    cc pwendell I know that there existed Util functions to create Jars and stuff already, but they didn't really serve my purposes as they appended random prefixes that were breaking things.
    
    I also added the local repository tests. Notice that they work without passing the `repo` to `resolveMavenCoordinates`.
    
    Author: Burak Yavuz <brk...@gmail.com>
    
    Closes #5790 from brkyvz/maven-utils and squashes the following commits:
    
    3ec79b7 [Burak Yavuz] addressed comments v0.2
    a39151b [Burak Yavuz] address comments v0.1
    172dfef [Burak Yavuz] use Ivy format
    7476d06 [Burak Yavuz] added mock repository generator

commit 6c65da6bb7d1213e6a4a9f7fd1597d029d87d07c
Author: Hari Shreedharan <hshreedha...@apache.org>
Date:   2015-04-30T18:03:23Z

    [SPARK-5342] [YARN] Allow long running Spark apps to run on secure YARN/HDFS
    
    Current Spark apps running on Secure YARN/HDFS would not be able to write data
    to HDFS after 7 days, since delegation tokens cannot be renewed beyond that. This
    means Spark Streaming apps will not be able to run on Secure YARN.
    
    This commit adds basic functionality to fix this issue. In this patch:
    - new parameters are added - principal and keytab, which can be used to log in to a KDC
    - the client logs in, and then gets tokens to start the AM
    - the keytab is copied to the staging directory
    - the AM waits for 60% of the time till expiry of the tokens and then logs in using the keytab
    - each time after 60% of the time, new tokens are created and sent to the executors
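
The 60% rule in the list above amounts to a small piece of arithmetic, sketched here in Python. This is not the code from the patch: `renewal_delay_ms`, its parameters, and the seven-day example lifetime are assumptions used only to illustrate the timing.

```python
def renewal_delay_ms(now_ms, expiry_ms, percent=60):
    """Milliseconds to wait before re-login: `percent` of the time left until expiry."""
    remaining = max(0, expiry_ms - now_ms)  # never schedule a negative delay
    return remaining * percent // 100       # integer arithmetic avoids float rounding


# Hypothetical example: tokens obtained at t=0 that expire after seven days.
SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000
delay = renewal_delay_ms(0, SEVEN_DAYS_MS)
```

Renewing well before expiry leaves a margin for retries if the login or the token distribution fails the first time.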
    
    Currently, to avoid complicating the architecture, we set the keytab and principal in the
    SparkHadoopUtil singleton, and schedule a login. Once the login is completed, a callback is scheduled.
    
    This is being posted to gather feedback on the general implementation.
    
    There are currently a bunch of things to do:
    - [x] logging
    - [x] testing - I plan to manually test this soon. If you have ideas of how to add unit tests, comment.
    - [x] add code to ensure that if these params are set in non-YARN cluster mode, we complain
    - [x] documentation
    - [x] Have the executors request credentials from the AM, so that retries are possible.
    
    Author: Hari Shreedharan <hshreedha...@apache.org>
    
    Closes #4688 from harishreedharan/kerberos-longrunning and squashes the following commits:
    
    36eb8a9 [Hari Shreedharan] Change the renewal interval config param. Fix a bunch of comments.
    611923a [Hari Shreedharan] Make sure the namenodes are listed correctly for creating tokens.
    09fe224 [Hari Shreedharan] Use token.renew to get token's renewal interval rather than using hdfs-site.xml
    6963bbc [Hari Shreedharan] Schedule renewal in AM before starting user class. Else, a restarted AM cannot access HDFS if the user class tries to.
    072659e [Hari Shreedharan] Fix build failure caused by thread factory getting moved to ThreadUtils.
    f041dd3 [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
    42eead4 [Hari Shreedharan] Remove RPC part. Refactor and move methods around, use renewal interval rather than max lifetime to create new tokens.
    ebb36f5 [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
    bc083e3 [Hari Shreedharan] Overload RegisteredExecutor to send tokens. Minor doc updates.
    7b19643 [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
    8a4f268 [Hari Shreedharan] Added docs in the security guide. Changed some code to ensure that the renewer objects are created only if required.
    e800c8b [Hari Shreedharan] Restore original RegisteredExecutor message, and send new tokens via NewTokens message.
    0e9507e [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
    7f1bc58 [Hari Shreedharan] Minor fixes, cleanup.
    bcd11f9 [Hari Shreedharan] Refactor AM and Executor token update code into separate classes, also send tokens via akka on executor startup.
    f74303c [Hari Shreedharan] Move the new logic into specialized classes. Add cleanup for old credentials files.
    2f9975c [Hari Shreedharan] Ensure new tokens are written out immediately on AM restart. Also, pick up the latest suffix from HDFS if the AM is restarted.
    61b2b27 [Hari Shreedharan] Account for AM restarts by making sure lastSuffix is read from the files on HDFS.
    62c45ce [Hari Shreedharan] Relogin from keytab periodically.
    fa233bd [Hari Shreedharan] Adding logging, fixing minor formatting and ordering issues.
    42813b4 [Hari Shreedharan] Remove utils.sh, which was re-added due to merge with master.
    0de27ee [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
    55522e3 [Hari Shreedharan] Fix failure caused by Preconditions ambiguity.
    9ef5f1b [Hari Shreedharan] Added explanation of how the credentials refresh works, some other minor fixes.
    f4fd711 [Hari Shreedharan] Fix SparkConf usage.
    2debcea [Hari Shreedharan] Change the file structure for credentials files. I will push a followup patch which adds a cleanup mechanism for old credentials files. The credentials files are small and few enough for it to cause issues on HDFS.
    af6d5f0 [Hari Shreedharan] Cleaning up files where changes weren't required.
    f0f54cb [Hari Shreedharan] Be more defensive when updating the credentials file.
    f6954da [Hari Shreedharan] Got rid of Akka communication to renew, instead the executors check a known file's modification time to read the credentials.
    5c11c3e [Hari Shreedharan] Move tests to YarnSparkHadoopUtil to fix compile issues.
    b4cb917 [Hari Shreedharan] Send keytab to AM via DistributedCache rather than directly via HDFS
    0985b4e [Hari Shreedharan] Write tokens to HDFS and read them back when required, rather than sending them over the wire.
    d79b2b9 [Hari Shreedharan] Make sure correct credentials are passed to FileSystem#addDelegationTokens()
    8c6928a [Hari Shreedharan] Fix issue caused by direct creation of Actor object.
    fb27f46 [Hari Shreedharan] Make sure principal and keytab are set before CoarseGrainedSchedulerBackend is started. Also schedule re-logins in CoarseGrainedSchedulerBackend#start()
    41efde0 [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
    d282d7a [Hari Shreedharan] Fix ClientSuite to set YARN mode, so that the correct class is used in tests.
    bcfc374 [Hari Shreedharan] Fix Hadoop-1 build by adding no-op methods in SparkHadoopUtil, with impl in YarnSparkHadoopUtil.
    f8fe694 [Hari Shreedharan] Handle None if keytab-login is not scheduled.
    2b0d745 [Hari Shreedharan] [SPARK-5342][YARN] Allow long running Spark apps to run on secure YARN/HDFS.
    ccba5bc [Hari Shreedharan] WIP: More changes wrt kerberos
    77914dd [Hari Shreedharan] WIP: Add kerberos principal and keytab to YARN client.

commit adbdb19a7d2cc939795f0cecbdc07c605dc946c1
Author: Joseph K. Bradley <jos...@databricks.com>
Date:   2015-04-30T21:39:27Z

    [SPARK-7207] [ML] [BUILD] Added ml.recommendation, ml.regression to SparkBuild
    
    Added ml.recommendation, ml.regression to SparkBuild
    
    CC: mengxr
    
    Author: Joseph K. Bradley <jos...@databricks.com>
    
    Closes #5758 from jkbradley/SPARK-7207 and squashes the following commits:
    
    a28158a [Joseph K. Bradley] Added ml.recommendation, ml.regression to SparkBuild

commit e0628f2fae7f99d096f9dd625876a60d11020d9b
Author: Patrick Wendell <patr...@databricks.com>
Date:   2015-04-30T21:59:20Z

    Revert "[SPARK-5342] [YARN] Allow long running Spark apps to run on secure YARN/HDFS"
    
    This reverts commit 6c65da6bb7d1213e6a4a9f7fd1597d029d87d07c.

commit 6702324b60f99dab55912c08ccd3d03805f6b7b2
Author: Liang-Chi Hsieh <vii...@gmail.com>
Date:   2015-04-30T22:13:43Z

    [SPARK-7196][SQL] Support precision and scale of decimal type for JDBC
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-7196
    
    Author: Liang-Chi Hsieh <vii...@gmail.com>
    
    Closes #5777 from viirya/jdbc_precision and squashes the following commits:
    
    f40f5e6 [Liang-Chi Hsieh] Support precision and scale for NUMERIC type.
    49acbf9 [Liang-Chi Hsieh] Add unit test.
    a509e19 [Liang-Chi Hsieh] Support precision and scale of decimal type for JDBC.

commit 07a86205f9efc43ea1ec5edb97c21c32abe7fb8a
Author: Josh Rosen <joshro...@databricks.com>
Date:   2015-04-30T22:21:00Z

    [SPARK-7288] Suppress compiler warnings due to use of sun.misc.Unsafe; add facade in front of Unsafe; remove use of Unsafe.setMemory
    
    This patch suppresses compiler warnings due to our use of `sun.misc.Unsafe` (introduced in #5725).  These warnings can only be suppressed via the `-XDignore.symbol.file` javac flag; the `SuppressWarnings` annotation won't work for these.
    
    In order to restrict uses of this compiler flag to the `unsafe` module, I placed a facade in front of `Unsafe` so that other modules won't call it directly. This facade will also help us to avoid accidental usage of deprecated Unsafe methods or methods that aren't supported in Java 6.
    
    I also removed an unnecessary use of `Unsafe.setMemory`, which isn't present in certain versions of Java 6, and excluded the new `unsafe` module from Javadoc.
    
    Author: Josh Rosen <joshro...@databricks.com>
    
    Closes #5814 from JoshRosen/unsafe-compiler-warnings-fixes and squashes the following commits:
    
    9e8c483 [Josh Rosen] Exclude new unsafe module from Javadoc
    ba75ecf [Josh Rosen] Only apply -XDignore.symbol.file flag in unsafe project.
    7403345 [Josh Rosen] Put facade in front of Unsafe.
    50230c0 [Josh Rosen] Remove usage of Unsafe.setMemory
    96d41c9 [Josh Rosen] Use -XDignore.symbol.file to suppress warnings about sun.misc.Unsafe usage

commit 77cc25fb7473d8a06b727d2ba5ee62db1c651cf8
Author: Zhongshuai Pei <799203...@qq.com>
Date:   2015-04-30T22:22:13Z

    [SPARK-7267][SQL] Push down Project when its child is Limit
    
    SQL
    ```
    select key from (select key,value from t1 limit 100) t2 limit 10
    ```
    Optimized Logical Plan before modifying
    ```
    == Optimized Logical Plan ==
    Limit 10
      Project key#228
        Limit 100
          MetastoreRelation default, t1, None
    ```
    Optimized Logical Plan after modifying
    ```
    == Optimized Logical Plan ==
    Limit 10
      Limit 100
        Project key#228
          MetastoreRelation default, t1, None
    ```
    After this, we can combine limits
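
A rough Python model of this rewrite follows. It is a sketch and not the Catalyst rule: the tuple-based plan encoding and the name `push_project_through_limit` are assumptions for illustration, and it presumes the projection is a plain row-wise column selection so that swapping it with a Limit does not change the result.

```python
# Hypothetical plan encoding: ("Limit", n, child), ("Project", cols, child), ("Relation", name)
def push_project_through_limit(plan):
    """Move a Project below a Limit that is its direct child."""
    kind = plan[0]
    if kind == "Relation":
        return plan
    if kind == "Limit":
        return ("Limit", plan[1], push_project_through_limit(plan[2]))
    # kind == "Project": rewrite the child first, then swap with a Limit child
    cols, child = plan[1], push_project_through_limit(plan[2])
    if child[0] == "Limit":
        pushed = push_project_through_limit(("Project", cols, child[2]))
        return ("Limit", child[1], pushed)
    return ("Project", cols, child)


before = ("Limit", 10, ("Project", ["key"], ("Limit", 100, ("Relation", "t1"))))
after = push_project_through_limit(before)
```

Once the two Limit nodes are adjacent, a limit-combining rule can merge them into a single `Limit 10`.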
    
    Author: Zhongshuai Pei <799203...@qq.com>
    Author: DoingDone9 <799203...@qq.com>
    
    Closes #5797 from DoingDone9/ProjectLimit and squashes the following commits:
    
    70d0fca [Zhongshuai Pei] Update FilterPushdownSuite.scala
    dc83ae9 [Zhongshuai Pei] Update FilterPushdownSuite.scala
    485c61c [Zhongshuai Pei] Update Optimizer.scala
    f03fe7f [Zhongshuai Pei] Merge pull request #12 from apache/master
    f12fa50 [Zhongshuai Pei] Merge pull request #10 from apache/master
    f61210c [Zhongshuai Pei] Merge pull request #9 from apache/master
    34b1a9a [Zhongshuai Pei] Merge pull request #8 from apache/master
    802261c [DoingDone9] Merge pull request #7 from apache/master
    d00303b [DoingDone9] Merge pull request #6 from apache/master
    98b134f [DoingDone9] Merge pull request #5 from apache/master
    161cae3 [DoingDone9] Merge pull request #4 from apache/master
    c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
    cb1852d [DoingDone9] Merge pull request #2 from apache/master
    c3f046f [DoingDone9] Merge pull request #1 from apache/master

commit fa01bec484fc000e0a31645b722ffde48556c4df
Author: Josh Rosen <joshro...@databricks.com>
Date:   2015-04-30T23:23:01Z

    [Build] Enable MiMa checks for SQL
    
    Now that 1.3 has been released, we should enable MiMa checks for the `sql` subproject.
    
    Author: Josh Rosen <joshro...@databricks.com>
    
    Closes #5727 from JoshRosen/enable-more-mima-checks and squashes the following commits:
    
    3ad302b [Josh Rosen] Merge remote-tracking branch 'origin/master' into enable-more-mima-checks
    0c48e4d [Josh Rosen] Merge remote-tracking branch 'origin/master' into enable-more-mima-checks
    e276cee [Josh Rosen] Fix SQL MiMa checks via excludes and private[sql]
    44d0d01 [Josh Rosen] Add back 'launcher' exclude
    1aae027 [Josh Rosen] Enable MiMa checks for launcher and sql projects.

commit 1c3e402e669d047410b00de9193adf3c329844a2
Author: DB Tsai <d...@netflix.com>
Date:   2015-04-30T23:26:51Z

    [SPARK-7279] Removed diffSum which is theoretically zero in LinearRegression and code formatting
    
    Author: DB Tsai <d...@netflix.com>
    
    Closes #5809 from dbtsai/format and squashes the following commits:
    
    6904eed [DB Tsai] trigger jenkins
    9146e19 [DB Tsai] initial commit

commit 149b3ee2dac992355adbe44e989570726c1f35d0
Author: Burak Yavuz <brk...@gmail.com>
Date:   2015-04-30T23:40:32Z

    [SPARK-7242][SQL][MLLIB] Frequent items for DataFrames
    
    Finds frequent items, possibly with false positives, using the algorithm described in `http://www.cs.umd.edu/~samir/498/karp.pdf`.
    The public API is:
    ```
    df.stat.freqItems(cols: Array[String], support: Double = 0.001): DataFrame
    ```
    
    The output is a local DataFrame whose columns are the input column names with `-freqItems` appended. This is a single-pass algorithm that may return false positives, but no false negatives.
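
To illustrate the "false positives, but no false negatives" property, here is a counter-based single-pass sketch in plain Python, in the spirit of the algorithm referenced above. It is not Spark's implementation: `freq_items`, the choice of `ceil(1 / support) - 1` counters, and the sample data are assumptions for the example.

```python
import math


def freq_items(items, support=0.001):
    """Single pass; every item occurring in more than `support` of the input is returned.

    Items below that threshold may also be returned (false positives).
    """
    k = math.ceil(1 / support)
    counts = {}  # holds at most k - 1 candidate items
    for x in items:
        if x in counts:
            counts[x] += 1
        elif len(counts) < k - 1:
            counts[x] = 1
        else:
            # No free counter: decrement all candidates and drop those reaching zero.
            for key in list(counts):
                counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]
    return set(counts)


# Hypothetical data: "a" makes up 50% of the input, above the 40% support threshold.
data = ["a"] * 5 + ["b"] * 3 + ["c", "d"]
result = freq_items(data, support=0.4)
```

Here `result` is guaranteed to contain `"a"`; it may also contain items below the threshold, which a second exact-counting pass could filter out if needed.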
    
    cc mengxr rxin
    
    Let's get the implementations in, I can add python API in a follow up PR.
    
    Author: Burak Yavuz <brk...@gmail.com>
    
    Closes #5799 from brkyvz/freq-items and squashes the following commits:
    
    a6ec82c [Burak Yavuz] addressed comments v?
    39b1bba [Burak Yavuz] removed toSeq
    0915e23 [Burak Yavuz] addressed comments v2.1
    3a5c177 [Burak Yavuz] addressed comments v2.0
    482e741 [Burak Yavuz] removed old import
    38e784d [Burak Yavuz] addressed comments v1.0
    8279d4d [Burak Yavuz] added default value for support
    3d82168 [Burak Yavuz] made base implementation

commit ee04413935f74b3178adbb6d8dee19b3320803e9
Author: rakeshchalasani <vnit.rak...@gmail.com>
Date:   2015-05-01T00:42:50Z

    [SPARK-7280][SQL] Add "drop" column/s on a data frame
    
    Takes column name(s) and returns a new DataFrame with the column(s) dropped.
    
    Author: rakeshchalasani <vnit.rak...@gmail.com>
    
    Closes #5818 from rakeshchalasani/SPARK-7280 and squashes the following commits:
    
    ce2ec09 [rakeshchalasani] Minor edit
    45c06f1 [rakeshchalasani] Change withColumnRename and format changes
    f68945a [rakeshchalasani] Minor fix
    0b9104d [rakeshchalasani] Drop one column at a time
    289afd2 [rakeshchalasani] [SPARK-7280][SQL] Add "drop" column/s on a data frame

commit 079733817f02c61ef814f5d9c0c8227498ff0058
Author: scwf <wangf...@huawei.com>
Date:   2015-05-01T01:15:56Z

    [SPARK-7093] [SQL] Using newPredicate in NestedLoopJoin to enable code generation
    
    Using newPredicate in NestedLoopJoin instead of InterpretedPredicate so that it can make use of code generation
    
    Author: scwf <wangf...@huawei.com>
    
    Closes #5665 from scwf/NLP and squashes the following commits:
    
    d19dd31 [scwf] improvement
    a887c02 [scwf] improve for NLP boundCondition

commit a0d8a61ab198b8c0ddbb3072bbe1d0e1dabc3e45
Author: wangfei <wangf...@huawei.com>
Date:   2015-05-01T01:18:54Z

    [SPARK-7109] [SQL] Push down left side filter for left semi join
    
    Currently the Spark SQL optimizer only pushes down the right-side filter for a left semi join. We can also push down the left-side filter, because a left semi join is essentially a filter on the left table.
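
The equivalence behind this optimization can be checked on toy data. The Python below is only a sketch with hypothetical names (`left_semi_join`, `keep`, and the sample rows); it models rows as dictionaries and is unrelated to the optimizer code itself.

```python
def left_semi_join(left, right, key):
    """Keep each left row that has at least one match in right; output left columns only."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] in right_keys]


def keep(row):
    """Hypothetical filter that references only left-side columns."""
    return row["value"] > 10


left = [{"key": 1, "value": 5}, {"key": 2, "value": 20}, {"key": 3, "value": 30}]
right = [{"key": 2}, {"key": 3}, {"key": 3}, {"key": 4}]

# Filter applied after the join versus pushed down to the left input.
filter_after = [row for row in left_semi_join(left, right, "key") if keep(row)]
filter_before = left_semi_join([row for row in left if keep(row)], right, "key")
```

Both orders yield the same rows, but filtering first means fewer left rows have to be probed against the right side.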
    
    Author: wangfei <wangf...@huawei.com>
    Author: scwf <wangf...@huawei.com>
    
    Closes #5677 from scwf/leftsemi and squashes the following commits:
    
    483d205 [wangfei] update with master to fix compile issue
    82df0e1 [wangfei] Merge branch 'master' of https://github.com/apache/spark into leftsemi
    d68a053 [wangfei] added apply
    8f48a3d [scwf] added test
    ebadaa9 [wangfei] left filter push down for left semi join

commit e991255e7203a0f7080efbd71f57574f46076711
Author: Vyacheslav Baranov <slavik.bara...@gmail.com>
Date:   2015-05-01T01:45:14Z

    [SPARK-6913][SQL] Fixed "java.sql.SQLException: No suitable driver found"
    
    Fixed `java.sql.SQLException: No suitable driver found` when loading DataFrame into Spark SQL if the driver is supplied with `--jars` argument.
    
    The problem is in `java.sql.DriverManager` class that can't access drivers loaded by Spark ClassLoader.
    
    Wrappers that forward requests are created for these drivers.
    
    Also, it's not necessary any more to include JDBC drivers in `--driver-class-path` in local mode; specifying them in the `--jars` argument is sufficient.
    
    Author: Vyacheslav Baranov <slavik.bara...@gmail.com>
    
    Closes #5782 from SlavikBaranov/SPARK-6913 and squashes the following commits:
    
    510c43f [Vyacheslav Baranov] [SPARK-6913] Fixed review comments
    b2a727c [Vyacheslav Baranov] [SPARK-6913] Fixed thread race on driver registration
    c8294ae [Vyacheslav Baranov] [SPARK-6913] Fixed "No suitable driver found" when using JDBC driver added with SparkContext.addJar

commit 3ba5aaab8266822545ac82b9e733fd25cc215a77
Author: Cheng Hao <hao.ch...@intel.com>
Date:   2015-05-01T01:49:06Z

    [SPARK-5213] [SQL] Pluggable SQL Parser Support
    
    This PR aims to make the SQL parser pluggable, so that users can register their own parser via the Spark SQL CLI.
    
    ```
    # add the jar into the classpath
    $hchengmydesktop:spark>bin/spark-sql --jars sql99.jar
    
    -- switch to "hiveql" dialect
       spark-sql>SET spark.sql.dialect=hiveql;
       spark-sql>SELECT * FROM src LIMIT 1;
    
    -- switch to "sql" dialect
       spark-sql>SET spark.sql.dialect=sql;
       spark-sql>SELECT * FROM src LIMIT 1;
    
    -- switch to a custom dialect
       spark-sql>SET spark.sql.dialect=com.xxx.xxx.SQL99Dialect;
       spark-sql>SELECT * FROM src LIMIT 1;
    
    -- register a non-existent SQL dialect
       spark-sql> SET spark.sql.dialect=NotExistedClass;
       spark-sql> SELECT * FROM src LIMIT 1;
    -- An exception is thrown and the dialect switches back to the default ("sql" for SQLContext and "hiveql" for HiveContext)
    ```
    
    Author: Cheng Hao <hao.ch...@intel.com>
    
    Closes #4015 from chenghao-intel/sqlparser and squashes the following commits:
    
    493775c [Cheng Hao] update the code as feedback
    81a731f [Cheng Hao] remove the unnecessary comment
    aab0b0b [Cheng Hao] polish the code a little bit
    49b9d81 [Cheng Hao] shrink the comment for rebasing

commit 473552fa5db9fa81f1a800f4ebacd23472e8c212
Author: scwf <wangf...@huawei.com>
Date:   2015-05-01T01:50:14Z

    [SPARK-7123] [SQL] support table.star in sqlcontext
    
    Running the following SQL produces an error:
    `SELECT r.*
    FROM testData l join testData2 r on (l.key = r.a)`
    
    Author: scwf <wangf...@huawei.com>
    
    Closes #5690 from scwf/tablestar and squashes the following commits:
    
    3b2e2b6 [scwf] support table.star

commit beeafcfd6ee1e460c4d564cd1515d8781989b422
Author: Patrick Wendell <patr...@databricks.com>
Date:   2015-05-01T03:33:36Z

    Revert "[SPARK-5213] [SQL] Pluggable SQL Parser Support"
    
    This reverts commit 3ba5aaab8266822545ac82b9e733fd25cc215a77.

commit 69a739c7f5fd002432ece203957e1458deb2f4c3
Author: zsxwing <zsxw...@gmail.com>
Date:   2015-05-01T04:32:11Z

    [SPARK-7282] [STREAMING] Fix the race conditions in StreamingListenerSuite
    
    Fixed the following flaky test
    ```Scala
    [info] StreamingListenerSuite:
    [info] - batch info reporting (782 milliseconds)
    [info] - receiver info reporting *** FAILED *** (3 seconds, 911 milliseconds)
    [info]   The code passed to eventually never returned normally. Attempted 10 times over 3.4735783689999997 seconds. Last failure message: 0 did not equal 1. (StreamingListenerSuite.scala:104)
    [info]   org.scalatest.exceptions.TestFailedDueToTimeoutException:
    [info]   at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
    [info]   at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
    [info]   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
    [info]   at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307)
    [info]   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
    [info]   at org.apache.spark.streaming.StreamingListenerSuite$$anonfun$2.apply$mcV$sp(StreamingListenerSuite.scala:104)
    [info]   at org.apache.spark.streaming.StreamingListenerSuite$$anonfun$2.apply(StreamingListenerSuite.scala:94)
    [info]   at org.apache.spark.streaming.StreamingListenerSuite$$anonfun$2.apply(StreamingListenerSuite.scala:94)
    [info]   at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
    [info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
    [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
    [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
    [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
    [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
    [info]   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
    [info]   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
    [info]   at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
    [info]   at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
    [info]   at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
    [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
    [info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
    [info]   at org.apache.spark.streaming.StreamingListenerSuite.org$scalatest$BeforeAndAfter$$super$runTest(StreamingListenerSuite.scala:34)
    [info]   at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
    [info]   at org.apache.spark.streaming.StreamingListenerSuite.runTest(StreamingListenerSuite.scala:34)
    [info]   at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
    [info]   at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
    [info]   at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
    [info]   at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
    [info]   at scala.collection.immutable.List.foreach(List.scala:318)
    [info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
    [info]   at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
    [info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
    [info]   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
    [info]   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
    [info]   at org.scalatest.Suite$class.run(Suite.scala:1424)
    [info]   at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
    [info]   at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
    [info]   at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
    [info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
    [info]   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
    [info]   at org.apache.spark.streaming.StreamingListenerSuite.org$scalatest$BeforeAndAfter$$super$run(StreamingListenerSuite.scala:34)
    [info]   at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241)
    [info]   at 
org.apache.spark.streaming.StreamingListenerSuite.run(StreamingListenerSuite.scala:34)
    [info]   at 
org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
    [info]   at 
org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
    [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:294)
    [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:284)
    [info]   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    [info]   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    [info]   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    [info]   at java.lang.Thread.run(Thread.java:745)
    [info]   Cause: org.scalatest.exceptions.TestFailedException: 0 did not equal 1
    [info]   at 
org.scalatest.MatchersHelper$.newTestFailedException(MatchersHelper.scala:160)
    [info]   at 
org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6231)
    [info]   at 
org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6277)
    [info]   at 
org.apache.spark.streaming.StreamingListenerSuite$$anonfun$2$$anonfun$apply$mcV$sp$1.apply$mcV$sp(StreamingListenerSuite.scala:105)
    [info]   at 
org.apache.spark.streaming.StreamingListenerSuite$$anonfun$2$$anonfun$apply$mcV$sp$1.apply(StreamingListenerSuite.scala:104)
    [info]   at 
org.apache.spark.streaming.StreamingListenerSuite$$anonfun$2$$anonfun$apply$mcV$sp$1.apply(StreamingListenerSuite.scala:104)
    [info]   at 
org.scalatest.concurrent.Eventually$class.makeAValiantAttempt$1(Eventually.scala:394)
    [info]   at 
org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:408)
    [info]   at 
org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
    [info]   at 
org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
    [info]   at 
org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307)
    [info]   at 
org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
    [info]   at 
org.apache.spark.streaming.StreamingListenerSuite$$anonfun$2.apply$mcV$sp(StreamingListenerSuite.scala:104)
    [info]   at 
org.apache.spark.streaming.StreamingListenerSuite$$anonfun$2.apply(StreamingListenerSuite.scala:94)
    [info]   at 
org.apache.spark.streaming.StreamingListenerSuite$$anonfun$2.apply(StreamingListenerSuite.scala:94)
    [info]   at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
    [info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
    [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
    [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
    [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
    [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
    [info]   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
    [info]   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
    [info]   at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
    [info]   at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
    [info]   at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
    [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
    [info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
    [info]   at 
org.apache.spark.streaming.StreamingListenerSuite.org$scalatest$BeforeAndAfter$$super$runTest(StreamingListenerSuite.scala:34)
    [info]   at 
org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
    [info]   at 
org.apache.spark.streaming.StreamingListenerSuite.runTest(StreamingListenerSuite.scala:34)
    [info]   at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
    [info]   at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
    [info]   at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
    [info]   at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
    [info]   at scala.collection.immutable.List.foreach(List.scala:318)
    [info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
    [info]   at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
    [info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
    [info]   at 
org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
    [info]   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
    [info]   at org.scalatest.Suite$class.run(Suite.scala:1424)
    [info]   at 
org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
    [info]   at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
    [info]   at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
    [info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
    [info]   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
    [info]   at 
org.apache.spark.streaming.StreamingListenerSuite.org$scalatest$BeforeAndAfter$$super$run(StreamingListenerSuite.scala:34)
    [info]   at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241)
    [info]   at 
org.apache.spark.streaming.StreamingListenerSuite.run(StreamingListenerSuite.scala:34)
    [info]   at 
org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
    [info]   at 
org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
    [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:294)
    [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:284)
    [info]   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    [info]   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    [info]   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    [info]   at java.lang.Thread.run(Thread.java:745)
    ```
    
    The original code had no memory barrier in the `eventually` closure, 
so the test could fail: the JVM does not guarantee memory consistency 
between threads without a memory barrier.
    
    This PR used `ConcurrentLinkedQueue` to set up the memory barrier.
    
    Author: zsxwing <zsxw...@gmail.com>
    
    Closes #5812 from zsxwing/SPARK-7282 and squashes the following commits:
    
    59115ef [zsxwing] Use SynchronizedBuffer
    014dd2b [zsxwing] Fix the race conditions in StreamingListenerSuite
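
The commit message above attributes the flaky test to a missing memory barrier between the thread that records listener events and the test thread that polls for them. The following is a minimal Java sketch of that idea, not the code from the commit: `ListenerVisibilitySketch`, `RecordingListener`, and the `eventually` helper are hypothetical names made up for illustration, and the helper only imitates the polling behaviour of ScalaTest's `eventually`. The one real API it relies on is `java.util.concurrent.ConcurrentLinkedQueue`.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.BooleanSupplier;

class ListenerVisibilitySketch {

    /** Hypothetical stand-in for a listener whose callbacks run on another thread. */
    static final class RecordingListener {
        // A thread-safe queue: per the java.util.concurrent package docs, actions
        // before placing an object into a concurrent collection happen-before
        // actions that follow another thread's access to that element.
        private final ConcurrentLinkedQueue<String> events = new ConcurrentLinkedQueue<>();

        void onEvent(String event) {
            events.add(event);
        }

        int eventCount() {
            return events.size();
        }
    }

    /** Simplified polling helper (an assumption, not ScalaTest's real API). */
    static boolean eventually(BooleanSupplier condition, long timeoutMillis, long intervalMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        try {
            while (System.nanoTime() < deadline) {
                if (condition.getAsBoolean()) {
                    return true;
                }
                Thread.sleep(intervalMillis);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while polling", e);
        }
        // One last check after the deadline has passed.
        return condition.getAsBoolean();
    }

    /** A background thread reports one event; the calling thread polls until it sees it. */
    static boolean runScenario() {
        RecordingListener listener = new RecordingListener();
        Thread reporter = new Thread(() -> listener.onEvent("receiver started"));
        reporter.start();
        boolean seen = eventually(() -> listener.eventCount() == 1, 5000, 10);
        try {
            reporter.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while joining", e);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("event observed: " + runScenario());
    }
}
```

If the queue were replaced by a plain non-thread-safe collection or an unsynchronized counter, the polling thread would have no such visibility guarantee, which is the situation the commit message describes.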

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
