spark git commit: [SPARK-13761][ML] Remove remaining uses of validateParams

2016-03-19 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 4c08e2c08 -> b39e80d39 [SPARK-13761][ML] Remove remaining uses of validateParams ## What changes were proposed in this pull request? Cleanups from [https://github.com/apache/spark/pull/11620]: remove remaining uses of validateParams, and

spark git commit: [SPARK-13719][SQL] Parse JSON rows having an array type and a struct type in the same field

2016-03-19 Thread yhuai
Repository: spark Updated Branches: refs/heads/master ca9ef86c8 -> 917f4000b [SPARK-13719][SQL] Parse JSON rows having an array type and a struct type in the same field ## What changes were proposed in this pull request? This https://github.com/apache/spark/pull/2400 added the support to

spark git commit: [SPARK-13826][SQL] Addendum: update documentation for Datasets

2016-03-19 Thread rxin
Repository: spark Updated Branches: refs/heads/master 750ed64cd -> bb1fda01f [SPARK-13826][SQL] Addendum: update documentation for Datasets ## What changes were proposed in this pull request? This patch updates documentations for Datasets. I also updated some internal documentation for

spark git commit: [SPARK-12721][SQL] SQL Generation for Script Transformation

2016-03-19 Thread yhuai
Repository: spark Updated Branches: refs/heads/master 1d1de28a3 -> c4bd57602 [SPARK-12721][SQL] SQL Generation for Script Transformation What changes were proposed in this pull request? This PR is to convert to SQL from analyzed logical plans containing operator `ScriptTransformation`.

spark git commit: [SPARK-10788][MLLIB][ML] Remove duplicate bins for decision trees

2016-03-19 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master b39e80d39 -> 1614485fd [SPARK-10788][MLLIB][ML] Remove duplicate bins for decision trees Decision trees in spark.ml (RandomForest.scala) communicate twice as much data as needed for unordered categorical features. Here's an example. Say
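The snippet above is cut off mid-example; as a hedged illustration of the arithmetic involved (plain Python, not Spark code): an unordered categorical feature with M categories admits 2^(M-1) - 1 distinct binary splits, so tracking one bin per split, instead of one bin per side of each split, halves the statistics communicated.

```python
# Illustrative arithmetic only: each split partitions M categories into two
# non-empty sets, giving 2^(M-1) - 1 distinct splits. One bin per split is
# half of the 2 * numSplits bins tracked before the fix.
def num_unordered_splits(m: int) -> int:
    return 2 ** (m - 1) - 1

for m in (3, 4, 5):
    splits = num_unordered_splits(m)
    # categories, bins needed, bins before the fix
    print(m, splits, 2 * splits)
```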

spark git commit: [SPARK-13038][PYSPARK] Add load/save to pipeline

2016-03-19 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master c4bd57602 -> ae6c677c8 [SPARK-13038][PYSPARK] Add load/save to pipeline ## What changes were proposed in this pull request? JIRA issue: https://issues.apache.org/jira/browse/SPARK-13038 1. Add load/save to PySpark Pipeline and

spark git commit: [SPARK-13838] [SQL] Clear variable code to prevent it from being re-evaluated in BoundAttribute

2016-03-19 Thread davies
Repository: spark Updated Branches: refs/heads/master 637a78f1d -> 5f3bda6fe [SPARK-13838] [SQL] Clear variable code to prevent it from being re-evaluated in BoundAttribute JIRA: https://issues.apache.org/jira/browse/SPARK-13838 ## What changes were proposed in this pull request? We should also

spark git commit: [SPARK-13989] [SQL] Remove non-vectorized/unsafe-row parquet record reader

2016-03-19 Thread davies
Repository: spark Updated Branches: refs/heads/master 238fb485b -> 54794113a [SPARK-13989] [SQL] Remove non-vectorized/unsafe-row parquet record reader ## What changes were proposed in this pull request? This PR cleans up the new parquet record reader with the following changes: 1. Removes

spark git commit: [SPARK-13901][CORE] correct the logDebug information when jumping to the next locality level

2016-03-19 Thread srowen
Repository: spark Updated Branches: refs/heads/master 357d82d84 -> ea9ca6f04 [SPARK-13901][CORE] correct the logDebug information when jumping to the next locality level JIRA Issue: https://issues.apache.org/jira/browse/SPARK-13901 In the getAllowedLocalityLevel method of TaskSetManager, we get the wrong

spark git commit: [SPARK-11011][SQL] Narrow type of UDT serialization

2016-03-19 Thread meng
Repository: spark Updated Branches: refs/heads/master 77ba3021c -> d4d84936f [SPARK-11011][SQL] Narrow type of UDT serialization ## What changes were proposed in this pull request? Narrow down the parameter type of `UserDefinedType#serialize()`. Currently, the parameter type is `Any`,
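A hedged Python analogue of the change (the Scala method narrows `serialize` from `Any` to the user type; the class and names below are illustrative, not Spark's API): with a generic parameter, a type checker rejects wrong arguments that an `Any`-typed signature would silently accept.

```python
# Sketch of narrowing a serialize() parameter from Any to the user type T.
from typing import Any, Generic, TypeVar

T = TypeVar("T")

class UntypedUDT:
    def serialize(self, obj: Any) -> list:  # too loose: accepts anything
        raise NotImplementedError

class UserDefinedType(Generic[T]):
    def serialize(self, obj: T) -> list:    # narrowed: only T is valid
        raise NotImplementedError

class Point:
    def __init__(self, x: float, y: float):
        self.x, self.y = x, y

class PointUDT(UserDefinedType[Point]):
    def serialize(self, obj: Point) -> list:
        return [obj.x, obj.y]

print(PointUDT().serialize(Point(1.0, 2.0)))
```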

spark git commit: [SPARK-13360][PYSPARK][YARN] PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON…

2016-03-19 Thread vanzin
Repository: spark Updated Branches: refs/heads/master 5f6bdf97c -> eacd9d8ed [SPARK-13360][PYSPARK][YARN] PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON… … is not picked up in yarn-cluster mode Author: Jeff Zhang Closes #11238 from zjffdu/SPARK-13360. Project:

spark git commit: Revert "[SPARK-13840][SQL] Split Optimizer Rule ColumnPruning to ColumnPruning and EliminateOperator"

2016-03-19 Thread davies
Repository: spark Updated Branches: refs/heads/master 82066a166 -> 30c18841e Revert "[SPARK-13840][SQL] Split Optimizer Rule ColumnPruning to ColumnPruning and EliminateOperator" This reverts commit 99bd2f0e94657687834c5c59c4270c1484c9f595. Project:

[2/5] spark git commit: [SPARK-13928] Move org.apache.spark.Logging into org.apache.spark.internal.Logging

2016-03-19 Thread wenchen
http://git-wip-us.apache.org/repos/asf/spark/blob/8ef3399a/mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala -- diff --git

spark git commit: [SPARK-13977] [SQL] Brings back Shuffled hash join

2016-03-19 Thread davies
Repository: spark Updated Branches: refs/heads/master 14c7236dc -> 9c23c818c [SPARK-13977] [SQL] Brings back Shuffled hash join ## What changes were proposed in this pull request? ShuffledHashJoin (also outer join) is removed in 1.6, in favor of SortMergeJoin, which is more robust and also

[1/5] spark git commit: [SPARK-13928] Move org.apache.spark.Logging into org.apache.spark.internal.Logging

2016-03-19 Thread wenchen
Repository: spark Updated Branches: refs/heads/master ea9ca6f04 -> 8ef3399af http://git-wip-us.apache.org/repos/asf/spark/blob/8ef3399a/sql/hive/src/main/scala/org/apache/spark/sql/hive/SQLBuilder.scala -- diff --git

spark git commit: [SPARK-12719][SQL] SQL generation support for Generate

2016-03-19 Thread wenchen
Repository: spark Updated Branches: refs/heads/master 8ef3399af -> 1974d1d34 [SPARK-12719][SQL] SQL generation support for Generate ## What changes were proposed in this pull request? This PR adds SQL generation support for `Generate` operator. It always converts `Generate` operator into

spark git commit: [SPARK-13776][WEBUI] Limit the max number of acceptors and selectors for Jetty

2016-03-19 Thread srowen
Repository: spark Updated Branches: refs/heads/master 1974d1d34 -> 65b75e66e [SPARK-13776][WEBUI] Limit the max number of acceptors and selectors for Jetty ## What changes were proposed in this pull request? As each acceptor/selector in Jetty will use one thread, the number of threads
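Since each Jetty acceptor and selector occupies one thread, capping their counts bounds thread usage regardless of core count. A rough sketch of that idea (the cap value and the core-based formula here are illustrative assumptions, not Jetty's or Spark's actual defaults):

```python
# Bound acceptor/selector counts so the connector's thread usage stays
# constant on large machines. Numbers below are made up for illustration.
def bounded_connector_threads(num_cores: int, max_each: int = 8) -> int:
    acceptors = min(max(1, num_cores // 2), max_each)
    selectors = min(max(1, num_cores), max_each)
    return acceptors + selectors  # one thread per acceptor/selector

print(bounded_connector_threads(4))    # small box: scales with cores
print(bounded_connector_threads(64))   # large box: capped
```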

spark git commit: [MINOR][DOC] Fix nits in JavaStreamingTestExample

2016-03-19 Thread srowen
Repository: spark Updated Branches: refs/heads/master 0f1015ffd -> 53f32a22d [MINOR][DOC] Fix nits in JavaStreamingTestExample ## What changes were proposed in this pull request? Fix some nits discussed in https://github.com/apache/spark/pull/11776#issuecomment-198207419 use !rdd.isEmpty

spark git commit: [SPARK-10680][TESTS] Increase 'connectionTimeout' to make RequestTimeoutIntegrationSuite more stable

2016-03-19 Thread yhuai
Repository: spark Updated Branches: refs/heads/master dcaa01661 -> d630a203d [SPARK-10680][TESTS] Increase 'connectionTimeout' to make RequestTimeoutIntegrationSuite more stable ## What changes were proposed in this pull request? Increase 'connectionTimeout' to make

spark git commit: [SPARK-13976][SQL] do not remove sub-queries added by user when generating SQL

2016-03-19 Thread wenchen
Repository: spark Updated Branches: refs/heads/master 453455c47 -> 6037ed0a1 [SPARK-13976][SQL] do not remove sub-queries added by user when generating SQL ## What changes were proposed in this pull request? We haven't figured out the correct logic to add sub-queries yet, so we should not

spark git commit: [SPARK-13897][SQL] RelationalGroupedDataset and KeyValueGroupedDataset

2016-03-19 Thread rxin
Repository: spark Updated Branches: refs/heads/master 2082a4956 -> dcaa01661 [SPARK-13897][SQL] RelationalGroupedDataset and KeyValueGroupedDataset ## What changes were proposed in this pull request? Previously, Dataset.groupBy returned a GroupedData, and Dataset.groupByKey returned a

spark git commit: [SPARK-13826][SQL] Revises Dataset ScalaDoc

2016-03-19 Thread rxin
Repository: spark Updated Branches: refs/heads/master 90a1d8db7 -> 10ef4f3e7 [SPARK-13826][SQL] Revises Dataset ScalaDoc ## What changes were proposed in this pull request? This PR revises Dataset API ScalaDoc. All public methods are divided into the following groups * `groupname basic`:

spark git commit: [SPARK-13613][ML] Provide ignored tests to export test dataset into CSV format

2016-03-19 Thread meng
Repository: spark Updated Branches: refs/heads/master ae6c677c8 -> 3f06eb72c [SPARK-13613][ML] Provide ignored tests to export test dataset into CSV format ## What changes were proposed in this pull request? Provide ignored test cases to export the test dataset into CSV format in

spark git commit: [MINOR][SQL][BUILD] Remove duplicated lines

2016-03-19 Thread rxin
Repository: spark Updated Branches: refs/heads/master 7eef2463a -> c890c359b [MINOR][SQL][BUILD] Remove duplicated lines ## What changes were proposed in this pull request? This PR removes three minor duplicated lines. The first one causes the following unreachable code warning. ```

spark git commit: [SPARK-13118][SQL] Expression encoding for optional synthetic classes

2016-03-19 Thread rxin
Repository: spark Updated Branches: refs/heads/master c100d31dd -> 7eef2463a [SPARK-13118][SQL] Expression encoding for optional synthetic classes ## What changes were proposed in this pull request? Fix expression generation for optional types. Standard Java reflection causes issues when

spark git commit: [SPARK-13823][HOTFIX] Increase tryAcquire timeout and assert it succeeds to fix failure on slow machines

2016-03-19 Thread srowen
Repository: spark Updated Branches: refs/heads/master 496d2a2b4 -> 9412547e7 [SPARK-13823][HOTFIX] Increase tryAcquire timeout and assert it succeeds to fix failure on slow machines ## What changes were proposed in this pull request? I'm seeing several PR builder builds fail after

spark git commit: [SPARK-13922][SQL] Filter rows with null attributes in vectorized parquet reader

2016-03-19 Thread yhuai
Repository: spark Updated Branches: refs/heads/master 4ce2d24e2 -> b90c0206f [SPARK-13922][SQL] Filter rows with null attributes in vectorized parquet reader # What changes were proposed in this pull request? It's common for many SQL operators to not care about reading `null` values for
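The idea can be sketched in plain Python (this is not the vectorized reader itself, just the filtering behavior it applies): drop rows where any column the downstream operator requires is null, before handing the batch on.

```python
# Keep only rows whose required columns are all non-null.
def filter_null_rows(rows, required_cols):
    return [r for r in rows
            if all(r.get(c) is not None for c in required_cols)]

batch = [{"a": 1, "b": 2}, {"a": None, "b": 3}, {"a": 4, "b": None}]
print(filter_null_rows(batch, ["a", "b"]))  # only the fully populated row
```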

spark git commit: [SPARK-11891] Model export/import for RFormula and RFormulaModel

2016-03-19 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 828213d4c -> edf8b8775 [SPARK-11891] Model export/import for RFormula and RFormulaModel https://issues.apache.org/jira/browse/SPARK-11891 Author: Xusen Yin Closes #9884 from yinxusen/SPARK-11891. Project:

spark git commit: [SPARK-13948] MiMa check should catch if the visibility changes to private

2016-03-19 Thread rxin
Repository: spark Updated Branches: refs/heads/master 5faba9fac -> 82066a166 [SPARK-13948] MiMa check should catch if the visibility changes to private MiMa excludes are currently generated using both the current Spark version's classes and Spark 1.2.0's classes, but this doesn't make sense:

spark git commit: [SPARK-13894][SQL] SqlContext.range return type from DataFrame to DataSet

2016-03-19 Thread rxin
Repository: spark Updated Branches: refs/heads/master d9e8f26d0 -> d9670f847 [SPARK-13894][SQL] SqlContext.range return type from DataFrame to DataSet ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-13894 Change the return type of the

spark git commit: [SPARK-13930] [SQL] Apply fast serialization on collect limit operator

2016-03-19 Thread davies
Repository: spark Updated Branches: refs/heads/master 10ef4f3e7 -> 750ed64cd [SPARK-13930] [SQL] Apply fast serialization on collect limit operator ## What changes were proposed in this pull request? JIRA: https://issues.apache.org/jira/browse/SPARK-13930 Recently the fast serialization has

spark git commit: [SPARK-13816][GRAPHX] Add parameter checks for algorithms in Graphx

2016-03-19 Thread rxin
Repository: spark Updated Branches: refs/heads/master d9670f847 -> 91984978e [SPARK-13816][GRAPHX] Add parameter checks for algorithms in Graphx JIRA: https://issues.apache.org/jira/browse/SPARK-13816 ## What changes were proposed in this pull request? Add parameter checks for algorithms in

spark git commit: [MINOR][DOCS] Use `spark-submit` instead of `sparkR` to submit R script.

2016-03-19 Thread srowen
Repository: spark Updated Branches: refs/heads/master 1970d911d -> 2082a4956 [MINOR][DOCS] Use `spark-submit` instead of `sparkR` to submit R script. ## What changes were proposed in this pull request? Since `sparkR` is not used for submitting R Scripts from Spark 2.0, a user faces the

spark git commit: [SPARK-13871][SQL] Support for inferring filters from data constraints

2016-03-19 Thread yhuai
Repository: spark Updated Branches: refs/heads/master b90c0206f -> f96997ba2 [SPARK-13871][SQL] Support for inferring filters from data constraints ## What changes were proposed in this pull request? This PR generalizes the `NullFiltering` optimizer rule in catalyst to

spark git commit: [SPARK-13869][SQL] Remove redundant conditions while combining filters

2016-03-19 Thread yhuai
Repository: spark Updated Branches: refs/heads/master f96997ba2 -> 77ba3021c [SPARK-13869][SQL] Remove redundant conditions while combining filters ## What changes were proposed in this pull request? **[I'll link it to the JIRA once ASF JIRA is back online]** This PR modifies the existing
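A minimal sketch of the optimization (conditions shown as strings for illustration; the real rule works on catalyst expression trees): when two adjacent filters are merged, a predicate already present in one need not be repeated in the combined condition.

```python
# Combine the conditions of two stacked Filter operators, skipping
# predicates that would be redundant in the merged filter.
def combine_filters(outer_conds, inner_conds):
    combined = list(outer_conds)
    for cond in inner_conds:
        if cond not in combined:  # drop exact duplicates
            combined.append(cond)
    return combined

print(combine_filters(["a > 1", "b IS NOT NULL"],
                      ["b IS NOT NULL", "c = 5"]))
```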

spark git commit: [SPARK-14004][SQL][MINOR] AttributeReference and Alias should only use the first qualifier to generate SQL strings

2016-03-19 Thread lian
Repository: spark Updated Branches: refs/heads/master 0acb32a3f -> 14c7236dc [SPARK-14004][SQL][MINOR] AttributeReference and Alias should only use the first qualifier to generate SQL strings ## What changes were proposed in this pull request? Current implementations of

spark git commit: [SPARK-13827][SQL] Can't add subquery to an operator with same-name outputs while generating SQL strings

2016-03-19 Thread yhuai
Repository: spark Updated Branches: refs/heads/master 91984978e -> 1d1de28a3 [SPARK-13827][SQL] Can't add subquery to an operator with same-name outputs while generating SQL strings ## What changes were proposed in this pull request? This PR tries to solve a fundamental issue in the

[5/5] spark git commit: [SPARK-13928] Move org.apache.spark.Logging into org.apache.spark.internal.Logging

2016-03-19 Thread wenchen
[SPARK-13928] Move org.apache.spark.Logging into org.apache.spark.internal.Logging ## What changes were proposed in this pull request? Logging was made private in Spark 2.0. If we move it, then users would be able to create a Logging trait themselves to avoid changing their own code. ## How

[1/2] spark git commit: [SPARK-13923][SQL] Implement SessionCatalog

2016-03-19 Thread yhuai
Repository: spark Updated Branches: refs/heads/master 92b70576e -> ca9ef86c8 http://git-wip-us.apache.org/repos/asf/spark/blob/ca9ef86c/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala

spark git commit: [MINOR][ML] When trainingSummary is None, it should throw RuntimeException.

2016-03-19 Thread srowen
Repository: spark Updated Branches: refs/heads/master bb1fda01f -> 7783b6f38 [MINOR][ML] When trainingSummary is None, it should throw RuntimeException. ## What changes were proposed in this pull request? When trainingSummary is None, it should throw a `RuntimeException`. cc mengxr ## How

spark git commit: [SPARK-13921] Store serialized blocks as multiple chunks in MemoryStore

2016-03-19 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 6037ed0a1 -> 6c2d894a2 [SPARK-13921] Store serialized blocks as multiple chunks in MemoryStore This patch modifies the BlockManager, MemoryStore, and several other storage components so that serialized cached blocks are stored as multiple
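The storage strategy can be sketched in plain Python (a toy analogue, not Spark's MemoryStore): appending fixed-size chunks avoids allocating, and copying into, one huge contiguous buffer as a block grows.

```python
# Grow a serialized block chunk by chunk instead of as one large array.
class ChunkedByteBuffer:
    def __init__(self, chunk_size: int):
        self.chunk_size = chunk_size
        self.chunks = [bytearray()]

    def write(self, data: bytes) -> None:
        for b in data:
            if len(self.chunks[-1]) == self.chunk_size:
                self.chunks.append(bytearray())  # new chunk, no copy of old data
            self.chunks[-1].append(b)

    def to_bytes(self) -> bytes:  # only materialize contiguously on demand
        return b"".join(self.chunks)

buf = ChunkedByteBuffer(chunk_size=4)
buf.write(b"0123456789")
print(len(buf.chunks), buf.to_bytes())
```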

spark git commit: [SPARK-13629][ML] Add binary toggle Param to CountVectorizer

2016-03-19 Thread mlnick
Repository: spark Updated Branches: refs/heads/master 204c9dec2 -> 357d82d84 [SPARK-13629][ML] Add binary toggle Param to CountVectorizer ## What changes were proposed in this pull request? It would be handy to add a binary toggle Param to CountVectorizer, as in the scikit-learn one:

spark git commit: [SPARK-13972][SQL][FOLLOW-UP] When creating the query execution for a converted SQL query, we eagerly trigger analysis

2016-03-19 Thread yhuai
Repository: spark Updated Branches: refs/heads/master 2e0c5284f -> 238fb485b [SPARK-13972][SQL][FOLLOW-UP] When creating the query execution for a converted SQL query, we eagerly trigger analysis ## What changes were proposed in this pull request? As part of testing generating SQL query from

[4/5] spark git commit: [SPARK-13928] Move org.apache.spark.Logging into org.apache.spark.internal.Logging

2016-03-19 Thread wenchen
http://git-wip-us.apache.org/repos/asf/spark/blob/8ef3399a/core/src/main/scala/org/apache/spark/internal/Logging.scala -- diff --git a/core/src/main/scala/org/apache/spark/internal/Logging.scala

spark git commit: [SPARK-13034] PySpark ml.classification support export/import

2016-03-19 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 85c42fda9 -> 27e1f3885 [SPARK-13034] PySpark ml.classification support export/import ## What changes were proposed in this pull request? Add export/import for all estimators and transformers(which have Scala implementation) under

spark git commit: [SPARK-13958] Executor OOM due to unbounded growth of pointer array in…

2016-03-19 Thread davies
Repository: spark Updated Branches: refs/heads/master 353778216 -> 2e0c5284f [SPARK-13958] Executor OOM due to unbounded growth of pointer array in… ## What changes were proposed in this pull request? This change fixes the executor OOM which was recently introduced in PR
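A hypothetical sketch of the failure mode and the fix (names and numbers are illustrative, not Spark's sorter internals): doubling an in-memory pointer array without bound can OOM, so growth is capped and the caller spills to disk once the cap is hit.

```python
# Cap pointer-array growth; signal a spill instead of doubling forever.
def grow_pointer_array(current_len: int, max_len: int):
    if current_len >= max_len:
        return current_len, True  # caller should spill to disk
    return min(current_len * 2, max_len), False

size, spill = 1024, False
while not spill and size < 10_000:
    size, spill = grow_pointer_array(size, max_len=8192)
print(size, spill)  # growth stops at the cap and requests a spill
```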

spark git commit: [SPARK-13427][SQL] Support USING clause in JOIN.

2016-03-19 Thread marmbrus
Repository: spark Updated Branches: refs/heads/master 65b75e66e -> 637a78f1d [SPARK-13427][SQL] Support USING clause in JOIN. ## What changes were proposed in this pull request? Support queries that JOIN tables with USING clause. SELECT * from table1 JOIN table2 USING USING clause can be
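The semantics of `JOIN ... USING (k)` can be sketched in plain Python (a toy nested-loop join, not Spark's implementation): rows match on the shared column, which appears once in the output rather than once per side as in a plain equi-join.

```python
# Toy JOIN ... USING: match on the shared key, emit it once.
def join_using(left, right, key):
    out = []
    for l in left:
        for r in right:
            if l[key] == r[key]:
                row = {key: l[key]}  # shared column emitted once
                row.update({c: v for c, v in l.items() if c != key})
                row.update({c: v for c, v in r.items() if c != key})
                out.append(row)
    return out

t1 = [{"id": 1, "a": "x"}]
t2 = [{"id": 1, "b": "y"}]
print(join_using(t1, t2, "id"))
```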

spark git commit: Revert "[SPARK-12719][HOTFIX] Fix compilation against Scala 2.10"

2016-03-19 Thread yhuai
Repository: spark Updated Branches: refs/heads/master edf8b8775 -> 4c08e2c08 Revert "[SPARK-12719][HOTFIX] Fix compilation against Scala 2.10" This reverts commit 3ee7996187bbef008c10681bc4e048c6383f5187. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit:

spark git commit: [SPARK-13937][PYSPARK][ML] Change JavaWrapper _java_obj from static to member variable

2016-03-19 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 3ee799618 -> 828213d4c [SPARK-13937][PYSPARK][ML] Change JavaWrapper _java_obj from static to member variable ## What changes were proposed in this pull request? In PySpark wrapper.py JavaWrapper change _java_obj from an unused static

spark git commit: [SPARK-11888][ML] Decision tree persistence in spark.ml

2016-03-19 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 3f06eb72c -> 6fc2b6541 [SPARK-11888][ML] Decision tree persistence in spark.ml ### What changes were proposed in this pull request? Made these MLReadable and MLWritable: DecisionTreeClassifier, DecisionTreeClassificationModel,

spark git commit: [SPARK-14018][SQL] Use 64-bit num records in BenchmarkWholeStageCodegen

2016-03-19 Thread rxin
Repository: spark Updated Branches: refs/heads/master b39594472 -> 1970d911d [SPARK-14018][SQL] Use 64-bit num records in BenchmarkWholeStageCodegen ## What changes were proposed in this pull request? 500L << 20 is actually pretty close to 32-bit int limit. I was trying to increase this to
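The arithmetic behind the change: `500 << 20` is already about a quarter of the 32-bit signed limit, so scaling the benchmark row count a few times further would overflow an `Int`, which is why a 64-bit count is needed.

```python
# 500L << 20 versus the 32-bit signed integer limit.
INT_MAX = 2**31 - 1
records = 500 << 20
print(records, INT_MAX)        # ~524M vs ~2147M: only ~4x headroom
print((2100 << 20) > INT_MAX)  # a modest increase already overflows
```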

spark git commit: [SPARK-13281][CORE] Switch broadcast of RDD to exception from warning

2016-03-19 Thread srowen
Repository: spark Updated Branches: refs/heads/master 9412547e7 -> 5f6bdf97c [SPARK-13281][CORE] Switch broadcast of RDD to exception from warning ## What changes were proposed in this pull request? In SparkContext, throw an IllegalArgumentException when trying to broadcast an RDD directly,
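A sketch of the behavioral change (class and function names here are stand-ins, not Spark's API): broadcasting an RDD now fails fast with an exception instead of merely logging a warning.

```python
# Reject broadcasting an RDD directly; broadcast plain values as before.
class RDD:  # stand-in for the real RDD class
    pass

def broadcast(value):
    if isinstance(value, RDD):
        raise ValueError(
            "Can not directly broadcast RDDs; collect() first and "
            "broadcast the result")
    return value  # real code would wrap this in a Broadcast handle

try:
    broadcast(RDD())
except ValueError as e:
    print("rejected:", e)
print(broadcast([1, 2, 3]))
```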

spark git commit: [SPARK-13926] Automatically use Kryo serializer when shuffling RDDs with simple types

2016-03-19 Thread rxin
Repository: spark Updated Branches: refs/heads/master d1c193a2f -> de1a84e56 [SPARK-13926] Automatically use Kryo serializer when shuffling RDDs with simple types Because ClassTags are available when constructing ShuffledRDD we can use them to automatically use Kryo for shuffle
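The dispatch idea can be illustrated in plain Python (Spark inspects Scala ClassTags; the type check and serializer names below are illustrative): when the element type is known to be a simple type, pick the faster serializer automatically instead of the general-purpose default.

```python
# Choose a serializer from the statically known element type.
SIMPLE_TYPES = (int, float, bool, str, bytes)

def choose_serializer(elem_type):
    return "kryo" if issubclass(elem_type, SIMPLE_TYPES) else "java"

print(choose_serializer(int))    # simple type -> fast path
print(choose_serializer(dict))   # anything else -> general serializer
```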

spark git commit: [SPARK-13924][SQL] officially support multi-insert

2016-03-19 Thread rxin
Repository: spark Updated Branches: refs/heads/master eacd9d8ed -> d9e8f26d0 [SPARK-13924][SQL] officially support multi-insert ## What changes were proposed in this pull request? There is a feature of Hive SQL called multi-insert. For example: ``` FROM src INSERT OVERWRITE TABLE dest1

spark git commit: [SPARK-13873] [SQL] Avoid copy of UnsafeRow when there is no join in whole stage codegen

2016-03-19 Thread davies
Repository: spark Updated Branches: refs/heads/master 917f4000b -> c100d31dd [SPARK-13873] [SQL] Avoid copy of UnsafeRow when there is no join in whole stage codegen ## What changes were proposed in this pull request? We need to copy the UnsafeRow since a Join could produce multiple rows

spark git commit: [SPARK-14012][SQL] Extract VectorizedColumnReader from VectorizedParquetRecordReader

2016-03-19 Thread rxin
Repository: spark Updated Branches: refs/heads/master c11ea2e41 -> b39594472 [SPARK-14012][SQL] Extract VectorizedColumnReader from VectorizedParquetRecordReader ## What changes were proposed in this pull request? This is a minor followup on https://github.com/apache/spark/pull/11799 that