GitHub user damnMeddlingKid opened a pull request: https://github.com/apache/spark/pull/11330
[SPARK-13410][SQL] Support unionAll for DataFrames with UDT columns.

## What changes were proposed in this pull request?

This PR adds equality operators to UDT classes so that they can be correctly tested for dataType equality during union operations. This was previously causing `AnalysisException: u"unresolved operator 'Union;"` when trying to unionAll two DataFrames with UDT columns, as below.

```
from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
from pyspark.sql import types

schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)])
a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema)
b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema)
c = a.unionAll(b)
```

## How was this patch tested?

Tested using two unit tests in sql/test.py and the DataFrameSuite.

Additional information here: https://issues.apache.org/jira/browse/SPARK-13410

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/damnMeddlingKid/spark udt-union-all

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11330.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #11330

----

commit 6f0f1d9e04a8db47e2f6f8fcfe9dea9de0f633da
Author: Cheng Lian <l...@databricks.com>
Date: 2016-01-25T23:05:05Z

[SPARK-12934][SQL] Count-min sketch serialization

This PR adds serialization support for `CountMinSketch`. A version number is added to version the serialized binary format.

Author: Cheng Lian <l...@databricks.com>
Closes #10893 from liancheng/cms-serialization.
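For background, the update/estimate idea behind the Count-Min Sketch being serialized above can be sketched in pure Python. This is a simplified model, not Spark's implementation; the depth/width sizing and the MD5-based hashing are illustrative choices.

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counting in sub-linear space: estimates are
    never below the true count, and over-count only on hash collisions."""

    def __init__(self, depth=5, width=256):
        self.depth = depth          # number of independent hash rows
        self.width = width          # counters per row
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash per row by salting the item with the row number.
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # The minimum over rows bounds the over-count from collisions.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))
```

By construction the estimate is an upper bound on the true count; a version number like the one mentioned above would simply prefix the counter table in the serialized binary format.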
commit be375fcbd200fb0e210b8edcfceb5a1bcdbba94b
Author: Wenchen Fan <wenc...@databricks.com>
Date: 2016-01-26T00:23:59Z

[SPARK-12879] [SQL] improve the unsafe row writing framework

As we begin to use the unsafe row writing framework (`BufferHolder` and `UnsafeRowWriter`) in more and more places (`UnsafeProjection`, `UnsafeRowParquetRecordReader`, `GenerateColumnAccessor`, etc.), we should add more docs to it and make it easier to use.

This PR abstracts the technique used in `UnsafeRowParquetRecordReader`: avoid unnecessary operations as much as possible. For example, do not always point the row to the buffer at the end; we only need to update the size of the row. If all fields are of primitive type, we can even skip the row size update. We can then apply this technique to more places easily.

A local benchmark shows `UnsafeProjection` is up to 1.7x faster after this PR:

**old version**
```
Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz
unsafe projection:           Avg Time(ms)    Avg Rate(M/s)    Relative Rate
-------------------------------------------------------------------------------
single long                       2616.04          102.61           1.00 X
single nullable long              3032.54           88.52           0.86 X
primitive types                   9121.05           29.43           0.29 X
nullable primitive types         12410.60           21.63           0.21 X
```

**new version**
```
Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz
unsafe projection:           Avg Time(ms)    Avg Rate(M/s)    Relative Rate
-------------------------------------------------------------------------------
single long                       1533.34          175.07           1.00 X
single nullable long              2306.73          116.37           0.66 X
primitive types                   8403.93           31.94           0.18 X
nullable primitive types         12448.39           21.56           0.12 X
```

For a single non-nullable long (the best case), we get about a 1.7x speed-up. Even when it's nullable, we still get a 1.3x speed-up. The other cases see less of a boost, since the saved operations are only a small proportion of the whole process. The benchmark code is included in this PR.

Author: Wenchen Fan <wenc...@databricks.com>
Closes #10809 from cloud-fan/unsafe-projection.
commit 109061f7ad27225669cbe609ec38756b31d4e1b9
Author: Wenchen Fan <wenc...@databricks.com>
Date: 2016-01-26T01:58:11Z

[SPARK-12936][SQL] Initial bloom filter implementation

This PR adds an initial implementation of a bloom filter in the newly added sketch module. The implementation is based on the [`BloomFilter` class in guava](https://code.google.com/p/guava-libraries/source/browse/guava/src/com/google/common/hash/BloomFilter.java). Some differences from the design doc:

* expose `bitSize` instead of `sizeInBytes` to the user.
* always require the `expectedInsertions` parameter when creating a bloom filter.

Author: Wenchen Fan <wenc...@databricks.com>
Closes #10883 from cloud-fan/bloom-filter.

commit fdcc3512f7b45e5b067fc26cb05146f79c4a5177
Author: tedyu <yuzhih...@gmail.com>
Date: 2016-01-26T02:23:47Z

[SPARK-12934] use try-with-resources for streams

liancheng, please take a look.

Author: tedyu <yuzhih...@gmail.com>
Closes #10906 from tedyu/master.

commit b66afdeb5253913d916dcf159aaed4ffdc15fd4b
Author: Holden Karau <hol...@us.ibm.com>
Date: 2016-01-26T06:38:31Z

[SPARK-11922][PYSPARK][ML] Python api for ml.feature.quantile discretizer

Add a Python API for ml.feature.QuantileDiscretizer. One open question: do we want to re-use the Java model, create a new model, or use a different wrapper around the Java model?

cc brkyvz & mengxr

Author: Holden Karau <hol...@us.ibm.com>
Closes #10085 from holdenk/SPARK-11937-SPARK-11922-Python-API-for-ml.feature.QuantileDiscretizer.

commit ae47ba718a280fc12720a71b981c38dbe647f35b
Author: Xusen Yin <yinxu...@gmail.com>
Date: 2016-01-26T06:41:52Z

[SPARK-12834] Change ser/de of JavaArray and JavaList

https://issues.apache.org/jira/browse/SPARK-12834

We use `SerDe.dumps()` to serialize `JavaArray` and `JavaList` in `PythonMLLibAPI`, then deserialize them with `PickleSerializer` on the Python side. However, there is no need to transform them in such an inefficient way. Instead, we can use type conversion to convert them, e.g.
`list(JavaArray)` or `list(JavaList)`. What's more, there is an issue with ser/de of Scala Array, as I noted in https://issues.apache.org/jira/browse/SPARK-12780.

Author: Xusen Yin <yinxu...@gmail.com>
Closes #10772 from yinxusen/SPARK-12834.

commit 27c910f7f29087d1ac216d4933d641d6515fd6ad
Author: Xiangrui Meng <m...@databricks.com>
Date: 2016-01-26T06:53:34Z

[SPARK-10086][MLLIB][STREAMING][PYSPARK] ignore StreamingKMeans test in PySpark for now

I saw several failures in recent PR builds, e.g., https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50015/consoleFull. This PR marks the test as ignored; we will fix the flakiness under SPARK-10086.

gliptak Do you know why the test failure didn't show up in the Jenkins "Test Result"?

cc: jkbradley

Author: Xiangrui Meng <m...@databricks.com>
Closes #10909 from mengxr/SPARK-10086.

commit d54cfed5a6953a9ce2b9de2f31ee2d673cb5cc62
Author: Reynold Xin <r...@databricks.com>
Date: 2016-01-26T08:51:08Z

[SQL][MINOR] A few minor tweaks to CSV reader.

This pull request simply fixes a few minor coding style issues in CSV, as I was reviewing the change post-hoc.

Author: Reynold Xin <r...@databricks.com>
Closes #10919 from rxin/csv-minor.

commit 6743de3a98e3f0d0e6064ca1872fa88c3aeaa143
Author: Wenchen Fan <wenc...@databricks.com>
Date: 2016-01-26T08:53:05Z

[SPARK-12937][SQL] bloom filter serialization

This PR adds serialization support for BloomFilter. A version number is added to version the serialized binary format.

Author: Wenchen Fan <wenc...@databricks.com>
Closes #10920 from cloud-fan/bloom-filter.

commit 5936bf9fa85ccf7f0216145356140161c2801682
Author: Liang-Chi Hsieh <vii...@gmail.com>
Date: 2016-01-26T11:36:00Z

[SPARK-12961][CORE] Prevent snappy-java memory leak

JIRA: https://issues.apache.org/jira/browse/SPARK-12961

To prevent a memory leak in snappy-java, just call the method once and cache the result. Once the library releases a new version, we can remove this object.
JoshRosen

Author: Liang-Chi Hsieh <vii...@gmail.com>
Closes #10875 from viirya/prevent-snappy-memory-leak.

commit 649e9d0f5b2d5fc13f2dd5be675331510525927f
Author: Sean Owen <so...@cloudera.com>
Date: 2016-01-26T11:55:28Z

[SPARK-3369][CORE][STREAMING] Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator

Fix the Java function API methods for flatMap and mapPartitions to require producing only an Iterator, not an Iterable. Also fix DStream.flatMap to require a function producing TraversableOnce only, not Traversable.

CC rxin pwendell for the API change; tdas since it also touches streaming.

Author: Sean Owen <so...@cloudera.com>
Closes #10413 from srowen/SPARK-3369.

commit ae0309a8812a4fade3a0ea67d8986ca870aeb9eb
Author: zhuol <zh...@yahoo-inc.com>
Date: 2016-01-26T15:40:02Z

[SPARK-10911] Executors should System.exit on clean shutdown.

Call System.exit explicitly to make sure non-daemon user threads terminate. Without this, user applications might live forever if the cluster manager does not appropriately kill them. E.g., YARN had this bug: HADOOP-12441.

Author: zhuol <zh...@yahoo-inc.com>
Closes #9946 from zhuoliu/10911.

commit 08c781ca672820be9ba32838bbe40d2643c4bde4
Author: Sameer Agarwal <sam...@databricks.com>
Date: 2016-01-26T15:50:37Z

[SPARK-12682][SQL] Add support for (optionally) not storing tables in hive metadata format

This PR adds a new table option (`skip_hive_metadata`) that allows the user to skip storing the table metadata in Hive metadata format. While this could be useful in general, the specific use case for this change is that Hive doesn't handle wide schemas well (see https://issues.apache.org/jira/browse/SPARK-12682 and https://issues.apache.org/jira/browse/SPARK-6024), which in turn prevents such tables from being queried in Spark SQL.

Author: Sameer Agarwal <sam...@databricks.com>
Closes #10826 from sameeragarwal/skip-hive-metadata.
commit cbd507d69cea24adfb335d8fe26ab5a13c053ffc
Author: Shixiong Zhu <shixi...@databricks.com>
Date: 2016-01-26T19:31:54Z

[SPARK-7799][STREAMING][DOCUMENT] Add the linking and deploying instructions for streaming-akka project

Since `actorStream` is in an external project, we should add the linking and deploying instructions for it. A follow-up PR of #10744.

Author: Shixiong Zhu <shixi...@databricks.com>
Closes #10856 from zsxwing/akka-link-instruction.

commit 8beab68152348c44cf2f89850f792f164b06470d
Author: Xusen Yin <yinxu...@gmail.com>
Date: 2016-01-26T19:56:46Z

[SPARK-11923][ML] Python API for ml.feature.ChiSqSelector

https://issues.apache.org/jira/browse/SPARK-11923

Author: Xusen Yin <yinxu...@gmail.com>
Closes #10186 from yinxusen/SPARK-11923.

commit fbf7623d49525e3aa6b08f482afd7ee8118d80cb
Author: Xusen Yin <yinxu...@gmail.com>
Date: 2016-01-26T21:18:01Z

[SPARK-12952] EMLDAOptimizer initialize() should return EMLDAOptimizer rather than its parent class

https://issues.apache.org/jira/browse/SPARK-12952

Author: Xusen Yin <yinxu...@gmail.com>
Closes #10863 from yinxusen/SPARK-12952.

commit ee74498de372b16fe6350e3617e9e6ec87c6ae7b
Author: Josh Rosen <joshro...@databricks.com>
Date: 2016-01-26T22:20:11Z

[SPARK-8725][PROJECT-INFRA] Test modules in topologically-sorted order in dev/run-tests

This patch improves our `dev/run-tests` script to test modules in a topologically-sorted order based on the modules' dependencies. This will help ensure that bugs in upstream projects are not misattributed to downstream projects because those projects' tests were the first ones to exhibit the failure. Topological sorting is also useful for shortening the feedback loop when testing pull requests: if I make a change in SQL, then the SQL tests should run before MLlib, not after. In addition, this patch also updates our test module definitions to split `sql` into `catalyst`, `sql`, and `hive` in order to allow more tests to be skipped when changing only `hive/` files.
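For reference, the topological ordering described above can be sketched with Kahn's algorithm in Python. The module names and dependency map below are illustrative, not the actual dev/run-tests module graph:

```python
from collections import deque

def topo_sort(deps):
    """deps maps module -> set of modules it depends on.
    Returns the modules ordered so dependencies come before dependents."""
    indegree = {m: len(d) for m, d in deps.items()}
    dependents = {m: [] for m in deps}
    for mod, upstream in deps.items():
        for up in upstream:
            dependents[up].append(mod)
    # Start from modules with no unmet dependencies.
    ready = deque(sorted(m for m, n in indegree.items() if n == 0))
    order = []
    while ready:
        mod = ready.popleft()
        order.append(mod)
        for down in dependents[mod]:
            indegree[down] -= 1
            if indegree[down] == 0:
                ready.append(down)
    if len(order) != len(deps):
        raise ValueError("dependency cycle detected")
    return order

# Illustrative graph: sql depends on catalyst; hive and mllib sit downstream.
modules = {"catalyst": set(), "sql": {"catalyst"},
           "hive": {"sql"}, "mllib": {"sql"}}
```

With this ordering, a failure in `catalyst` surfaces in `catalyst`'s own tests before any downstream module runs.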
Author: Josh Rosen <joshro...@databricks.com>
Closes #10885 from JoshRosen/SPARK-8725.

commit 83507fea9f45c336d73dd4795b8cb37bcd63e31d
Author: Cheng Lian <l...@databricks.com>
Date: 2016-01-26T22:29:29Z

[SQL] Minor Scaladoc format fix

Otherwise the `^` character is always marked as an error in IntelliJ, since it represents an unclosed superscript markup tag.

Author: Cheng Lian <l...@databricks.com>
Closes #10926 from liancheng/agg-doc-fix.

commit 19fdb21afbf0eae4483cf6d4ef32daffd1994b89
Author: Jeff Zhang <zjf...@apache.org>
Date: 2016-01-26T22:58:39Z

[SPARK-12993][PYSPARK] Remove usage of ADD_FILES in pyspark environment variable

ADD_FILES was created for adding Python files on the Spark context to be distributed to executors (SPARK-865); it is deprecated now. Users are encouraged to use --py-files for adding Python files.

Author: Jeff Zhang <zjf...@apache.org>
Closes #10913 from zjffdu/SPARK-12993.

commit eb917291ca1a2d68ca0639cb4b1464a546603eba
Author: Holden Karau <hol...@us.ibm.com>
Date: 2016-01-26T23:53:48Z

[SPARK-10509][PYSPARK] Reduce excessive param boiler plate code

The current Python ML params require cut-and-pasting the param setup and description between the class & ```__init__``` methods. Remove this possible source of errors & simplify the use of custom params by adding a ```_copy_new_parent``` method to Param, so as to avoid cut and pasting (and cut and pasting at different indentation levels, urgh).

Author: Holden Karau <hol...@us.ibm.com>
Closes #10216 from holdenk/SPARK-10509-excessive-param-boiler-plate-code.

commit 22662b241629b56205719ede2f801a476e10a3cd
Author: Shixiong Zhu <shixi...@databricks.com>
Date: 2016-01-27T01:24:40Z

[SPARK-12614][CORE] Don't throw non fatal exception from ask

Right now RpcEndpointRef.ask may throw an exception in some corner cases, such as calling ask after stopping the RpcEnv. It's better to avoid throwing exceptions from RpcEndpointRef.ask; we can instead send the exception to the future returned by `ask`.
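The pattern of delivering failures through the returned future rather than throwing can be sketched in plain Python. The `ask` function and its canned reply below are illustrative stand-ins, not Spark's RPC API:

```python
from concurrent.futures import Future

def ask(endpoint_alive, message):
    """Return a Future carrying either the reply or the failure;
    the caller never sees an exception raised by ask() itself."""
    future = Future()
    try:
        if not endpoint_alive:
            raise RuntimeError("RpcEnv already stopped")
        future.set_result(f"ack:{message}")   # stand-in for a real reply
    except Exception as exc:
        future.set_exception(exc)             # failure travels via the future
    return future
```

Callers then handle both outcomes in one place, by inspecting the future instead of wrapping every `ask` call in try/except.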
Author: Shixiong Zhu <shixi...@databricks.com>
Closes #10568 from zsxwing/send-ask-fail.

commit 1dac964c1b996d38c65818414fc8401961a1de8a
Author: Jeff Zhang <zjf...@apache.org>
Date: 2016-01-27T01:31:19Z

[SPARK-11622][MLLIB] Make LibSVMRelation extends HadoopFsRelation and… … Add LibSVMOutputWriter

The behavior of LibSVMRelation is not changed except for adding LibSVMOutputWriter:

* Partitioning is still not supported
* Multiple input paths are not supported

Author: Jeff Zhang <zjf...@apache.org>
Closes #9595 from zjffdu/SPARK-11622.

commit 555127387accdd7c1cf236912941822ba8af0a52
Author: Nong Li <n...@databricks.com>
Date: 2016-01-27T01:34:01Z

[SPARK-12854][SQL] Implement complex types support in ColumnarBatch

This patch adds support for complex types in ColumnarBatch. ColumnarBatch supports structs and arrays; there is a simple mapping between the richer Catalyst types and these two. Strings are treated as an array of bytes.

ColumnarBatch will contain a column for each node of the schema. Non-complex schemas consist of just leaf nodes. Structs represent an internal node with one child for each field. Arrays are internal nodes with one child. Structs just contain nullability. Arrays contain offsets and lengths into the child array. This structure is able to handle arbitrary nesting. It has the key property that we maintain a columnar layout throughout and that primitive types are only stored in the leaf nodes and contiguous across rows.

For example, if the schema is `array<array<int>>`, there are three columns in the schema. The internal nodes each have one child. The leaf node contains all the int data stored consecutively.

As part of this, this patch adds append APIs in addition to the put APIs (e.g. `putLong(rowid, v)` vs `appendLong(v)`). These APIs are necessary when the batch contains variable-length elements. The vectors are not fixed length and will grow as necessary. This should make the usage a lot simpler for the writer.
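The offsets-and-lengths layout described above can be sketched in pure Python for an `array<array<int>>` schema. This is a simplified model, not the actual ColumnarBatch code: each internal level stores (offset, length) pairs into its child column, and all leaf ints sit contiguously.

```python
def encode(rows):
    """Encode a list of array<array<int>> rows into three columnar arrays:
    outer (offset, length) pairs, inner (offset, length) pairs, leaf ints."""
    outer, inner, leaves = [], [], []
    for row in rows:
        outer.append((len(inner), len(row)))       # row spans these inner entries
        for arr in row:
            inner.append((len(leaves), len(arr)))  # arr spans these leaf ints
            leaves.extend(arr)
    return outer, inner, leaves

def decode(outer, inner, leaves):
    """Rebuild the nested rows from the three columnar arrays."""
    rows = []
    for o_off, o_len in outer:
        row = [leaves[i_off:i_off + i_len]
               for i_off, i_len in inner[o_off:o_off + o_len]]
        rows.append(row)
    return rows
```

Note the key property from the description: the leaf column stays contiguous across rows, so appending a new row only extends the three flat arrays.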
Author: Nong Li <n...@databricks.com>
Closes #10820 from nongli/spark-12854.

commit b72611f20a03c790b6fd341b6ffdb3b5437609ee
Author: Holden Karau <hol...@us.ibm.com>
Date: 2016-01-27T01:59:05Z

[SPARK-7780][MLLIB] intercept in logisticregressionwith lbfgs should not be regularized

The intercept in logistic regression represents a prior on the categories and should not be regularized. In MLlib, regularization is handled through an Updater, and the Updater penalizes all the components without excluding the intercept, which results in poor training accuracy with regularization. The new implementation in the ML framework handles this properly, and we should call the implementation in ML from MLlib, since the majority of users are still using the MLlib API. Note that both of them do feature scaling to improve convergence; the only difference is that the ML version doesn't regularize the intercept. As a result, when lambda is zero, they converge to the same solution.

Previously partially reviewed at https://github.com/apache/spark/pull/6386#issuecomment-168781424, re-opening for dbtsai to review.

Author: Holden Karau <hol...@us.ibm.com>
Author: Holden Karau <hol...@pigscanfly.ca>
Closes #10788 from holdenk/SPARK-7780-intercept-in-logisticregressionwithLBFGS-should-not-be-regularized.

commit e7f9199e709c46a6b5ad6b03c9ecf12cc19e3a41
Author: Yanbo Liang <yblia...@gmail.com>
Date: 2016-01-27T03:29:47Z

[SPARK-12903][SPARKR] Add covar_samp and covar_pop for SparkR

Add ```covar_samp``` and ```covar_pop``` for SparkR. Should we also provide a ```cov``` alias for ```covar_samp```? There is a ```cov``` implementation in stats.R which masks ```stats::cov``` already, but it may bring a breaking API change.

cc sun-rui felixcheung shivaram

Author: Yanbo Liang <yblia...@gmail.com>
Closes #10829 from yanboliang/spark-12903.
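For reference, `covar_pop` and `covar_samp` differ only in the divisor: population covariance divides by n, sample covariance by n - 1. A minimal plain-Python sketch of the formulas (not the SparkR implementation):

```python
def covar(xs, ys, sample=False):
    """Covariance of two equal-length sequences.
    sample=False -> population covariance (divide by n, like covar_pop);
    sample=True  -> sample covariance (divide by n - 1, like covar_samp)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1) if sample else total / n
```

For example, `covar([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], sample=True)` gives 2.0, while the population flavor gives 4/3.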
commit ce38a35b764397fcf561ac81de6da96579f5c13e
Author: Cheng Lian <l...@databricks.com>
Date: 2016-01-27T04:12:34Z

[SPARK-12935][SQL] DataFrame API for Count-Min Sketch

This PR integrates Count-Min Sketch from spark-sketch into DataFrame. This version resorts to `RDD.aggregate` for building the sketch. A more performant UDAF version can be built in future follow-up PRs.

Author: Cheng Lian <l...@databricks.com>
Closes #10911 from liancheng/cms-df-api.

commit 58f5d8c1da6feeb598aa5f74ffe1593d4839d11d
Author: Cheng Lian <l...@databricks.com>
Date: 2016-01-27T04:30:13Z

[SPARK-12728][SQL] Integrates SQL generation with native view

This PR is a follow-up of PR #10541. It integrates the newly introduced SQL generation feature with native view to make native view canonical.

In this PR, a new SQL option `spark.sql.nativeView.canonical` is added. When this option and `spark.sql.nativeView` are both `true`, Spark SQL tries to handle `CREATE VIEW` DDL statements using SQL query strings generated from the view definition logical plan. If we fail to map the plan to SQL, we fall back to the original native view approach.

One important issue this PR fixes is that we can now use a CTE when defining a view. Originally, when native view is turned on, we wrap the view definition text with an extra `SELECT`. However, the HiveQL parser doesn't allow a CTE to appear as a subquery. Namely, something like this is disallowed:

```sql
SELECT n FROM (
  WITH w AS (SELECT 1 AS n)
  SELECT * FROM w
) v
```

This PR fixes the issue because the extra `SELECT` is no longer needed (also, CTE expressions are inlined as subqueries during the analysis phase, so there won't be CTE expressions in the generated SQL query string).

Author: Cheng Lian <l...@databricks.com>
Author: Yin Huai <yh...@databricks.com>
Closes #10733 from liancheng/spark-12728.integrate-sql-gen-with-native-view.
commit bae3c9a4eb0c320999e5dbafd62692c12823e07d
Author: Nishkam Ravi <nishkamr...@gmail.com>
Date: 2016-01-27T05:14:39Z

[SPARK-12967][NETTY] Avoid NettyRpc error message during sparkContext shutdown

If there's an RPC issue while the SparkContext is alive but stopped (which would happen only while executing SparkContext.stop), log a warning instead. This is a common occurrence.

vanzin

Author: Nishkam Ravi <nishkamr...@gmail.com>
Author: nishkamravi2 <nishkamr...@gmail.com>
Closes #10881 from nishkamravi2/master_netty.

commit 4db255c7aa756daa224d61905db745b6bccc9173
Author: Xusen Yin <yinxu...@gmail.com>
Date: 2016-01-27T05:16:56Z

[SPARK-12780] Inconsistency in returned values of ML Python models' properties

https://issues.apache.org/jira/browse/SPARK-12780

Author: Xusen Yin <yinxu...@gmail.com>
Closes #10724 from yinxusen/SPARK-12780.

commit 90b0e562406a8bac529e190472e7f5da4030bf5c
Author: BenFradet <benjamin.fra...@gmail.com>
Date: 2016-01-27T09:27:11Z

[SPARK-12983][CORE][DOC] Correct metrics.properties.template

There are some typos and plain unintelligible sentences in the metrics template.

Author: BenFradet <benjamin.fra...@gmail.com>
Closes #10902 from BenFradet/SPARK-12983.

----