[GitHub] spark pull request #17098: Branch 2.1

forceto Tue, 28 Feb 2017 04:38:45 -0800

GitHub user forceto reopened a pull request:

    https://github.com/apache/spark/pull/17098


    Branch 2.1

    Hello, I am wondering how to find your latest thesis. In the Web of 
Science, which keywords should I search for? Is there a list of your theses 
about Spark?


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-2.1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17098.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17098
    
----
commit d20e0d6b8919eccaab9ae7db94ba80fdfac03c9d
Author: Tathagata Das <tathagata.das1...@gmail.com>
Date:   2016-12-06T21:05:22Z

    [SPARK-18671][SS][TEST] Added tests to ensure stability of all 
Structured Streaming log formats
    
    ## What changes were proposed in this pull request?
    
    To be able to restart StreamingQueries across Spark versions, we have 
already made the logs (offset log, file source log, file sink log) use JSON. We 
should add tests with actual JSON files in Spark so that any 
incompatible change in reading the logs is caught immediately. This PR adds 
tests for FileStreamSourceLog, FileStreamSinkLog, and OffsetSeqLog.
    
    ## How was this patch tested?
    new unit tests
    
    Author: Tathagata Das <tathagata.das1...@gmail.com>
    
    Closes #16128 from tdas/SPARK-18671.
    
    (cherry picked from commit 1ef6b296d7cd2d93cdfd5f54940842d6bb915ce0)
    Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit 65f5331a7f3a9de8ca7382b2a14db6c0670c4015
Author: Shuai Lin <linshuai2...@gmail.com>
Date:   2016-12-06T22:09:27Z

    [SPARK-18652][PYTHON] Include the example data and third-party licenses in 
pyspark package.
    
    ## What changes were proposed in this pull request?
    
    Since we already include the python examples in the pyspark package, we 
should include the example data with it as well.
    
    We should also include the third-party licences since we distribute their 
jars with the pyspark package.
    
    ## How was this patch tested?
    
    Manually tested with python2.7 and python3.4
    ```sh
    $ ./build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Pmesos clean 
package
    $ cd python
    $ python setup.py sdist
    $ pip install  dist/pyspark-2.1.0.dev0.tar.gz
    
    $ ls -1 /usr/local/lib/python2.7/dist-packages/pyspark/data/
    graphx
    mllib
    streaming
    
    $ du -sh /usr/local/lib/python2.7/dist-packages/pyspark/data/
    600K    /usr/local/lib/python2.7/dist-packages/pyspark/data/
    
    $ ls -1  /usr/local/lib/python2.7/dist-packages/pyspark/licenses/|head -5
    LICENSE-AnchorJS.txt
    LICENSE-DPark.txt
    LICENSE-Mockito.txt
    LICENSE-SnapTree.txt
    LICENSE-antlr.txt
    ```
    
    Author: Shuai Lin <linshuai2...@gmail.com>
    
    Closes #16082 from lins05/include-data-in-pyspark-dist.
    
    (cherry picked from commit bd9a4a5ac3abcc48131d1249df55e7d68266343a)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 9b5bc2a6aeb9580fc2dde3f37a77b4d1fbc6299e
Author: Tathagata Das <tathagata.das1...@gmail.com>
Date:   2016-12-07T01:04:26Z

    [SPARK-18734][SS] Represent timestamp in StreamingQueryProgress as 
formatted string instead of millis
    
    ## What changes were proposed in this pull request?
    
    A formatted string (in ISO 8601 format) is easier to read while debugging 
than a value in milliseconds.
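
    As a rough illustration of the difference (not the Spark implementation), the 
sketch below converts epoch milliseconds to an ISO 8601 string in UTC. The 
function name `millis_to_iso8601` and the millisecond-precision, `Z`-suffixed 
pattern are assumptions; the commit message does not state the exact pattern.

```python
from datetime import datetime, timezone


def millis_to_iso8601(millis):
    """Format epoch milliseconds as an ISO 8601 UTC string (assumed pattern)."""
    # Integer arithmetic avoids floating-point rounding of the millisecond part.
    seconds, ms = divmod(millis, 1000)
    dt = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%S") + ".{:03d}Z".format(ms)


if __name__ == "__main__":
    print(millis_to_iso8601(1500))
```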
    
    ## How was this patch tested?
    Updated unit tests
    
    Author: Tathagata Das <tathagata.das1...@gmail.com>
    
    Closes #16166 from tdas/SPARK-18734.
    
    (cherry picked from commit 539bb3cf9573be5cd86e7e6502523ce89c0de170)
    Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit 3750c6e9b580be0f2e25f691a1fd582f1b7e430a
Author: Tathagata Das <tathagata.das1...@gmail.com>
Date:   2016-12-07T05:51:38Z

    [SPARK-18671][SS][TEST-MAVEN] Follow up PR to fix test for Maven
    
    ## What changes were proposed in this pull request?
    
    Maven compilation does not seem to allow resources in sql/test to be easily 
referred to in kafka-0-10-sql tests, so this PR moves 
kafka-source-offset-version-2.1.0 from the sql test resources to the kafka-0-10-sql 
test resources.
    
    ## How was this patch tested?
    
    Manually ran maven test
    
    Author: Tathagata Das <tathagata.das1...@gmail.com>
    
    Closes #16183 from tdas/SPARK-18671-1.
    
    (cherry picked from commit 5c6bcdbda4dd23bbd112a7395cd9d1cfd04cf4bb)
    Signed-off-by: Tathagata Das <tathagata.das1...@gmail.com>

commit 340e9aea4853805c42b8739004d93efe8fe16ba4
Author: Yanbo Liang <yblia...@gmail.com>
Date:   2016-12-07T08:31:11Z

    [SPARK-18686][SPARKR][ML] Several cleanup and improvements for spark.logit.
    
    ## What changes were proposed in this pull request?
    Several cleanups and improvements for ```spark.logit```:
    * ```summary``` should return the coefficients matrix, and should output labels 
for each class if the model is a multinomial logistic regression model.
    * ```summary``` should not return ```areaUnderROC, roc, pr, ...```, since 
most of them are DataFrames, which are less important for R users. Moreover, 
these metrics ignore instance weights (setting all to 1.0), which will 
change in a later Spark version. Since that could introduce breaking changes, we 
do not expose them for now.
    * SparkR test improvement: compare the training result with native R 
glmnet.
    * Remove the argument ```aggregationDepth``` from ```spark.logit```, since it is 
an expert Param (related to Spark architecture and job execution) that would 
rarely be used by R users.
    
    ## How was this patch tested?
    Unit tests.
    
    The ```summary``` output after this change:
    multinomial logistic regression:
    ```
    > df <- suppressWarnings(createDataFrame(iris))
    > model <- spark.logit(df, Species ~ ., regParam = 0.5)
    > summary(model)
    $coefficients
                 versicolor  virginica   setosa
    (Intercept)  1.514031    -2.609108   1.095077
    Sepal_Length 0.02511006  0.2649821   -0.2900921
    Sepal_Width  -0.5291215  -0.02016446 0.549286
    Petal_Length 0.03647411  0.1544119   -0.190886
    Petal_Width  0.000236092 0.4195804   -0.4198165
    ```
    binomial logistic regression:
    ```
    > df <- suppressWarnings(createDataFrame(iris))
    > training <- df[df$Species %in% c("versicolor", "virginica"), ]
    > model <- spark.logit(training, Species ~ ., regParam = 0.5)
    > summary(model)
    $coefficients
                 Estimate
    (Intercept)  -6.053815
    Sepal_Length 0.2449379
    Sepal_Width  0.1648321
    Petal_Length 0.4730718
    Petal_Width  1.031947
    ```
    
    Author: Yanbo Liang <yblia...@gmail.com>
    
    Closes #16117 from yanboliang/spark-18686.
    
    (cherry picked from commit 90b59d1bf262b41c3a5f780697f504030f9d079c)
    Signed-off-by: Yanbo Liang <yblia...@gmail.com>

commit 99c293eeaa9733fc424404d04a9671e9525a1e36
Author: actuaryzhang <actuaryzhan...@gmail.com>
Date:   2016-12-07T08:37:25Z

    [SPARK-18701][ML] Fix Poisson GLM failure due to wrong initialization
    
    Poisson GLM fails for many standard data sets (see example in test or 
JIRA). The issue is incorrect initialization leading to almost zero probability 
and weights. Specifically, the mean is initialized as the response, which could 
be zero. Applying the log link results in very negative numbers (protected 
against -Inf), which again leads to close to zero probability and weights in 
the weighted least squares. Fix and test are included in the commits.
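
    The failure mode described above can be sketched in a few lines. This is 
only an illustration, not the code in the patch: the names `naive_init`, 
`floored_init` and `irls_start`, the floor of 0.1 and the epsilon guard are all 
assumptions. It relies on the standard fact that, for a Poisson family with a 
log link, the working weight in iteratively reweighted least squares equals the 
mean.

```python
import math

EPSILON = 1e-16  # assumed guard that keeps log() away from -Inf


def naive_init(y):
    """Initialize the mean as the response itself (problematic when y == 0)."""
    return y


def floored_init(y, floor=0.1):
    """Initialize the mean with a small positive floor (illustrative value)."""
    return max(y, floor)


def irls_start(y, init):
    """Return the starting linear predictor and working weight for one row."""
    mu = max(init(y), EPSILON)
    eta = math.log(mu)   # log link
    weight = mu          # Poisson with log link: working weight equals the mean
    return eta, weight


if __name__ == "__main__":
    print(irls_start(0.0, naive_init))    # hugely negative eta, near-zero weight
    print(irls_start(0.0, floored_init))  # moderate eta, usable weight
```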
    
    ## What changes were proposed in this pull request?
    Update initialization in Poisson GLM
    
    ## How was this patch tested?
    Add test in GeneralizedLinearRegressionSuite
    
    srowen sethah yanboliang HyukjinKwon mengxr
    
    Author: actuaryzhang <actuaryzhan...@gmail.com>
    
    Closes #16131 from actuaryzhang/master.
    
    (cherry picked from commit b8280271396eb74638da6546d76bbb2d06c7011b)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 51754d6df703c02ecb23ec1779889602ff8fb038
Author: Sean Owen <so...@cloudera.com>
Date:   2016-12-07T09:34:45Z

    [SPARK-18678][ML] Skewed reservoir sampling in SamplingUtils
    
    ## What changes were proposed in this pull request?
    
    Fix reservoir sampling bias for small k. An off-by-one error meant that the 
probability of replacement was slightly too high -- k/(l-1) after l elements 
instead of k/l, which matters for small k.
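
    For reference, a minimal sketch of textbook reservoir sampling with the 
intended k/l replacement probability is shown below. It is not the 
SamplingUtils code; the function name `reservoir_sample` and the `rng` parameter 
are assumptions made for this illustration.

```python
import random


def reservoir_sample(items, k, rng=None):
    """Return up to k items drawn uniformly at random from an iterable."""
    rng = rng or random.Random()
    reservoir = []
    for seen, item in enumerate(items, start=1):
        if seen <= k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # After `seen` items, replace with probability k / seen:
            # draw j uniformly from [0, seen) and replace slot j if j < k.
            # Drawing from [0, seen - 1) instead would give k / (seen - 1),
            # which is the kind of off-by-one bias described above.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = item
    return reservoir
```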
    
    ## How was this patch tested?
    
    Existing test plus new test case.
    
    Author: Sean Owen <so...@cloudera.com>
    
    Closes #16129 from srowen/SPARK-18678.
    
    (cherry picked from commit 79f5f281bb69cb2de9f64006180abd753e8ae427)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 4432a2a8386f951775957f352e4ba223c6ce4fa3
Author: Jie Xiong <jiexi...@fb.com>
Date:   2016-12-07T12:33:30Z

    [SPARK-18208][SHUFFLE] Executor OOM due to a growing LongArray in 
BytesToBytesMap
    
    ## What changes were proposed in this pull request?
    
    BytesToBytesMap currently does not release the in-memory storage (the 
longArray variable) after it spills to disk. This is typically not a problem 
during aggregation because the longArray should be much smaller than the pages, 
and because we grow the longArray at a conservative rate.
    
    However, this can lead to an OOM when an already running task is allocated 
more than its fair share, which can happen because of a scheduling delay. In 
this case the longArray can grow beyond the fair share of memory for the task. 
This becomes problematic when the task spills and the longArray is not freed, 
which causes subsequent memory allocation requests to be denied by the memory 
manager, resulting in an OOM.
    
    This PR fixes this issue by freeing the longArray when the 
BytesToBytesMap spills.
    
    ## How was this patch tested?
    
    Existing tests, and tested on real-world workloads.
    
    Author: Jie Xiong <jiexi...@fb.com>
    Author: jiexiong <jiexi...@gmail.com>
    
    Closes #15722 from jiexiong/jie_oom_fix.
    
    (cherry picked from commit c496d03b5289f7c604661a12af86f6accddcf125)
    Signed-off-by: Herman van Hovell <hvanhov...@databricks.com>

commit 5dbcd4fcfbc14ba8c17e1cb364ca45b99aa90708
Author: Andrew Ray <ray.and...@gmail.com>
Date:   2016-12-07T12:44:14Z

    [SPARK-17760][SQL] AnalysisException with dataframe pivot when groupBy 
column is not attribute
    
    ## What changes were proposed in this pull request?
    
    Fixes AnalysisException for pivot queries whose group-by columns 
are expressions rather than attributes, by substituting the expression's output 
attribute in the second aggregation and final projection.
    
    ## How was this patch tested?
    
    existing and additional unit tests
    
    Author: Andrew Ray <ray.and...@gmail.com>
    
    Closes #16177 from aray/SPARK-17760.
    
    (cherry picked from commit f1fca81b165c5a673f7d86b268e04ea42a6c267e)
    Signed-off-by: Herman van Hovell <hvanhov...@databricks.com>

commit acb6ac5da7a5694cc3270772c6d68933b7d761dc
Author: Shixiong Zhu <shixi...@databricks.com>
Date:   2016-12-07T18:30:05Z

    [SPARK-18764][CORE] Add a warning log when skipping a corrupted file
    
    ## What changes were proposed in this pull request?
    
    It's better to add a warning log when skipping a corrupted file. This is 
helpful when we want to finish the job first, then find the corrupted files in 
the log and fix them.
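
    The idea can be sketched as follows. This is a generic illustration, not 
Spark's reader code: `read_all`, the `read_one` callback, the logger name and 
the set of exceptions treated as corruption are all hypothetical.

```python
import logging

logger = logging.getLogger("corrupt_file_sketch")


def read_all(paths, read_one):
    """Read every path, logging a warning for each file that cannot be read."""
    results = []
    for path in paths:
        try:
            results.append(read_one(path))
        except (OSError, ValueError) as exc:
            # Keep the job going, but leave a trace so the file can be fixed later.
            logger.warning("Skipping corrupted file %s: %s", path, exc)
    return results
```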
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <shixi...@databricks.com>
    
    Closes #16192 from zsxwing/SPARK-18764.
    
    (cherry picked from commit dbf3e298a1a35c0243f087814ddf88034ff96d66)
    Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit 76e1f1651f5a7207c9c66686616709b62b798fa3
Author: sarutak <saru...@oss.nttdata.co.jp>
Date:   2016-12-07T19:41:23Z

    [SPARK-18762][WEBUI] Web UI should be http:4040 instead of https:4040
    
    ## What changes were proposed in this pull request?
    
    When SSL is enabled, the Spark shell shows:
    ```
    Spark context Web UI available at https://192.168.99.1:4040
    ```
    This is wrong because 4040 is http, not https. It redirects to the https 
port.
    More importantly, this introduces several broken links in the UI. For 
example, in the master UI, the worker link is https:8081 instead of http:8081 
or https:8481.
    
    CC: mengxr liancheng
    
    I manually tested by accessing MasterPage, WorkerPage and 
HistoryServer with SSL enabled.
    
    Author: sarutak <saru...@oss.nttdata.co.jp>
    
    Closes #16190 from sarutak/SPARK-18761.
    
    (cherry picked from commit bb94f61a7ac97bf904ec0e8d5a4ab69a4142443f)
    Signed-off-by: Marcelo Vanzin <van...@cloudera.com>

commit e9b3afac9ce5ea4bffb8201a58856598c521a3a9
Author: Shixiong Zhu <shixi...@databricks.com>
Date:   2016-12-07T21:47:44Z

    [SPARK-18588][TESTS] Fix flaky test: 
KafkaSourceStressForDontFailOnDataLossSuite
    
    ## What changes were proposed in this pull request?
    
    Fixed the following failures:
    
    ```
    org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed 
to eventually never returned normally. Attempted 3745 times over 
1.0000790851666665 minutes. Last failure message: assertion failed: 
failOnDataLoss-0 not deleted after timeout.
    ```
    
    ```
    sbt.ForkMain$ForkError: 
org.apache.spark.sql.streaming.StreamingQueryException: Query query-66 
terminated with exception: null
        at 
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:252)
        at 
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:146)
    Caused by: sbt.ForkMain$ForkError: java.lang.NullPointerException: null
        at java.util.ArrayList.addAll(ArrayList.java:577)
        at 
org.apache.kafka.clients.Metadata.getClusterForCurrentTopics(Metadata.java:257)
        at org.apache.kafka.clients.Metadata.update(Metadata.java:177)
        at 
org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.handleResponse(NetworkClient.java:605)
        at 
org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.maybeHandleCompletedReceive(NetworkClient.java:582)
        at 
org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:450)
        at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:269)
        at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.clientPoll(ConsumerNetworkClient.java:360)
        at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:224)
        at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:192)
        at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.awaitPendingRequests(ConsumerNetworkClient.java:260)
        at 
org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:222)
        at 
org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.ensurePartitionAssignment(ConsumerCoordinator.java:366)
        at 
org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:978)
        at 
org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:938)
        at
    ...
    ```
    
    ## How was this patch tested?
    
    Tested in #16048 by running many times.
    
    Author: Shixiong Zhu <shixi...@databricks.com>
    
    Closes #16109 from zsxwing/fix-kafka-flaky-test.
    
    (cherry picked from commit edc87e18922b98be47c298cdc3daa2b049a737e9)
    Signed-off-by: Tathagata Das <tathagata.das1...@gmail.com>

commit 1c6419718aadf0bdc200f9b328242062a07f2277
Author: Michael Armbrust <mich...@databricks.com>
Date:   2016-12-07T23:36:29Z

    [SPARK-18754][SS] Rename recentProgresses to recentProgress
    
    Based on an informal survey, users find this option easier to understand / 
remember.
    
    Author: Michael Armbrust <mich...@databricks.com>
    
    Closes #16182 from marmbrus/renameRecentProgress.
    
    (cherry picked from commit 70b2bf717d367d598c5a238d569d62c777e63fde)
    Signed-off-by: Tathagata Das <tathagata.das1...@gmail.com>

commit 839c2eb9723ba51baf6022fea8c29caecf7c0612
Author: wm...@hotmail.com <wm...@hotmail.com>
Date:   2016-12-08T02:12:49Z

    [SPARK-18633][ML][EXAMPLE] Add multiclass logistic regression summary 
python example and document
    
    ## What changes were proposed in this pull request?
    Logistic Regression summary has been added to the Python API. We need to add an example 
and documentation for the summary.
    
    The newly added example is consistent with Scala and Java examples.
    
    ## How was this patch tested?
    
    Manual tests: ran the example with spark-submit; copied and pasted the code into 
pyspark; built the documentation and checked it.
    
    Author: wm...@hotmail.com <wm...@hotmail.com>
    
    Closes #16064 from wangmiao1981/py.
    
    (cherry picked from commit aad11209eb4db585f991ba09d08d90576f315bb4)
    Signed-off-by: Joseph K. Bradley <jos...@databricks.com>

commit 617ce3ba765e13e354eaa9b7e13851aef40c9ceb
Author: Tathagata Das <tathagata.das1...@gmail.com>
Date:   2016-12-08T03:23:27Z

    [SPARK-18758][SS] StreamingQueryListener events from a StreamingQuery 
should be sent only to the listeners in the same session as the query
    
    ## What changes were proposed in this pull request?
    
    Listeners added with `sparkSession.streams.addListener(l)` are added to a 
SparkSession, so only events from queries in the same session as a listener 
should be posted to that listener. Currently, all events get rerouted 
through Spark's main listener bus, that is:
    - A StreamingQuery posts an event to the StreamingQueryListenerBus. Only the queries 
associated with the same session as the bus post events to it.
    - The StreamingQueryListenerBus posts the event to Spark's main LiveListenerBus as 
a SparkEvent.
    - The StreamingQueryListenerBus also subscribes to LiveListenerBus events, thus 
getting back the posted event in a different thread.
    - The received event is posted to the registered listeners.
    
    The problem is that *all StreamingQueryListenerBuses in all sessions* get 
the events and post them to their listeners. This is wrong.
    
    In this PR, I solve it by making StreamingQueryListenerBus track active 
queries (by their runIds) when a query posts the QueryStarted event to the bus. 
This allows the rerouted events to be filtered using the tracked queries.
    
    Note that this list needs to be maintained separately
    from `StreamingQueryManager.activeQueries`, because a terminated query 
is cleared from
    `StreamingQueryManager.activeQueries` as soon as it is stopped, but 
this ListenerBus must
    clear a query only after the termination event of that query has been 
posted, which happens lazily, well after the query has terminated.
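
    The filtering scheme can be sketched as below. This is a single-threaded 
toy model, not the StreamingQueryListenerBus implementation: the class 
`SessionListenerBus`, its method names and the dictionary-shaped events are 
assumptions made for this illustration.

```python
class SessionListenerBus:
    """Per-session bus that only forwards events from queries it has seen start."""

    def __init__(self):
        self._active_run_ids = set()
        self._listeners = []

    def add_listener(self, listener):
        self._listeners.append(listener)

    def post_from_query(self, event):
        # Called only by queries in this session. Start tracking the run id
        # when the query announces that it has started.
        if event["type"] == "started":
            self._active_run_ids.add(event["run_id"])
        # The event would now be handed to the shared, application-wide bus.

    def on_shared_bus_event(self, event):
        # Every session's bus sees every event coming back from the shared bus,
        # so drop the ones that belong to queries from other sessions.
        if event["run_id"] not in self._active_run_ids:
            return
        for listener in self._listeners:
            listener(event)
        # Forget the query only after its termination event has been delivered.
        if event["type"] == "terminated":
            self._active_run_ids.discard(event["run_id"])
```

    Clearing the run id in `on_shared_bus_event` rather than when the query 
stops mirrors the note above: the termination event may arrive well after the 
query itself has terminated.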
    
    Credit goes to zsxwing for coming up with the initial idea.
    
    ## How was this patch tested?
    Updated test harness code to use the correct session, and added new unit 
test.
    
    Author: Tathagata Das <tathagata.das1...@gmail.com>
    
    Closes #16186 from tdas/SPARK-18758.
    
    (cherry picked from commit 9ab725eabbb4ad515a663b395bd2f91bb5853a23)
    Signed-off-by: Tathagata Das <tathagata.das1...@gmail.com>

commit ab865cfd9dc87154e7d4fc5d09168868c88db6b0
Author: sethah <seth.hendrickso...@gmail.com>
Date:   2016-12-08T03:41:32Z

    [SPARK-18705][ML][DOC] Update user guide to reflect one pass solver for L1 
and elastic-net
    
    ## What changes were proposed in this pull request?
    
    WeightedLeastSquares now supports L1 and elastic net penalties and has an 
additional solver option: QuasiNewton. The docs are updated to reflect this 
change.
    
    ## How was this patch tested?
    
    Docs only. Generated documentation to make sure Latex looks ok.
    
    Author: sethah <seth.hendrickso...@gmail.com>
    
    Closes #16139 from sethah/SPARK-18705.
    
    (cherry picked from commit 82253617f5b3cdbd418c48f94e748651ee80077e)
    Signed-off-by: Yanbo Liang <yblia...@gmail.com>

commit 1c3f1da82356426b6b550fee67e66dc82eaf1c85
Author: Yanbo Liang <yblia...@gmail.com>
Date:   2016-12-08T04:23:28Z

    [SPARK-18326][SPARKR][ML] Review SparkR ML wrappers API for 2.1
    
    ## What changes were proposed in this pull request?
    Reviewing SparkR ML wrappers API for 2.1 release, mainly two issues:
    * Remove ```probabilityCol``` from the argument list of ```spark.logit``` 
and ```spark.randomForest```. It is used when making predictions and 
should be an argument of ```predict```; we will work on this in 
[SPARK-18618](https://issues.apache.org/jira/browse/SPARK-18618) in the next 
release cycle.
    * Fix ```spark.als``` params to make it consistent with MLlib.
    
    ## How was this patch tested?
    Existing tests.
    
    Author: Yanbo Liang <yblia...@gmail.com>
    
    Closes #16169 from yanboliang/spark-18326.
    
    (cherry picked from commit 97255497d885f0f8ccfc808e868bc8aa5e4d1063)
    Signed-off-by: Yanbo Liang <yblia...@gmail.com>

commit 080717497365b83bc202ab16812ced93eb1ea7bd
Author: Patrick Wendell <pwend...@gmail.com>
Date:   2016-12-08T06:29:49Z

    Preparing Spark release v2.1.0-rc2

commit 48aa6775d6b54ccecdbe2287ae75d99c00b02d18
Author: Patrick Wendell <pwend...@gmail.com>
Date:   2016-12-08T06:29:55Z

    Preparing development version 2.1.1-SNAPSHOT

commit 9095c152e7fedf469dcc4887f5b6a1882cd74c28
Author: Yanbo Liang <yblia...@gmail.com>
Date:   2016-12-08T14:19:38Z

    [SPARK-18325][SPARKR][ML] SparkR ML wrappers example code and user guide
    
    ## What changes were proposed in this pull request?
    * Add all R examples for ML wrappers which were added during 2.1 release 
cycle.
    * Split the whole ```ml.R``` example file into an individual example for each 
algorithm, which makes it convenient for users to rerun them.
    * Add corresponding examples to ML user guide.
    * Update ML section of SparkR user guide.
    
    Note: MLlib Scala/Java/Python examples will be consistent; however, SparkR 
examples may differ from them, since R users may use the algorithms in a 
different way, for example, using R ```formula``` to specify ```featuresCol``` 
and ```labelCol```.
    
    ## How was this patch tested?
    Run all examples manually.
    
    Author: Yanbo Liang <yblia...@gmail.com>
    
    Closes #16148 from yanboliang/spark-18325.
    
    (cherry picked from commit 9bf8f3cd4f62f921c32fb50b8abf49576a80874f)
    Signed-off-by: Yanbo Liang <yblia...@gmail.com>

commit 726217eb7f783e10571a043546694b5b3c90ac77
Author: Liang-Chi Hsieh <vii...@gmail.com>
Date:   2016-12-08T15:22:18Z

    [SPARK-18667][PYSPARK][SQL] Change the way to group row in 
BatchEvalPythonExec so input_file_name function can work with UDF in pyspark
    
    ## What changes were proposed in this pull request?
    
    `input_file_name` doesn't return filename when working with UDF in PySpark. 
An example shows the problem:
    
        from pyspark.sql.functions import input_file_name, udf
        from pyspark.sql.types import StringType
    
        def filename(path):
            return path
    
        sourceFile = udf(filename, StringType())
        spark.read.json("tmp.json").select(sourceFile(input_file_name())).show()
    
        +---------------------------+
        |filename(input_file_name())|
        +---------------------------+
        |                           |
        +---------------------------+
    
    The cause of this issue is that we group rows in `BatchEvalPythonExec` for 
batch processing of PythonUDF. Currently we group rows first and then 
evaluate expressions on the rows. If the data has fewer rows than the number 
required for a group, the iterator is consumed to the end before the 
evaluation. However, once the iterator reaches the end, the input 
filename is unset, so the `input_file_name` expression can't return the correct filename.
    
    This patch changes how the batch of rows is grouped: we evaluate the 
expression first and then group the evaluated results into batches.
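
    The ordering problem can be reproduced with a small pure-Python model. It 
does not use Spark at all; `FileRows`, `batches`, `group_then_evaluate` and 
`evaluate_then_group` are hypothetical names, and `current_file` stands in for 
the input filename that is unset once the iterator is exhausted.

```python
from itertools import islice


class FileRows:
    """Toy row source that clears `current_file` once it has been exhausted."""

    def __init__(self, path, rows):
        self.current_file = None
        self._path = path
        self._rows = rows

    def __iter__(self):
        self.current_file = self._path
        for row in self._rows:
            yield row
        self.current_file = None  # unset at the end of the input


def batches(iterable, size):
    """Yield lists of up to `size` consecutive items."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def group_then_evaluate(source, size):
    # Old order: building the batch drains the source before anything is evaluated.
    out = []
    for batch in batches(source, size):
        out.extend(source.current_file for _ in batch)
    return out


def evaluate_then_group(source, size):
    # New order: evaluate per row while the source is still positioned on it.
    evaluated = (source.current_file for _ in source)
    return [value for batch in batches(evaluated, size) for value in batch]
```

    With two rows and a batch size of 100, the first function sees an unset 
filename for every row, while the second sees the real path.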
    
    ## How was this patch tested?
    
    Added unit test to PySpark.
    
    Author: Liang-Chi Hsieh <vii...@gmail.com>
    
    Closes #16115 from viirya/fix-py-udf-input-filename.
    
    (cherry picked from commit 6a5a7254dc37952505989e9e580a14543adb730c)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit e0173f14e3ea28d83c1c46bf97f7d3755960a8fc
Author: Andrew Ray <ray.and...@gmail.com>
Date:   2016-12-08T19:08:12Z

    [SPARK-16589] [PYTHON] Chained cartesian produces incorrect number of 
records
    
    ## What changes were proposed in this pull request?
    
    Fixes a bug in the Python implementation of the RDD cartesian product related 
to batching that showed up in repeated cartesian products with seemingly random 
results. The root cause was multiple iterators pulling from the same stream 
in the wrong order because of logic that ignored batching.
    
    `CartesianDeserializer` and `PairDeserializer` were changed to implement 
`_load_stream_without_unbatching` and borrow the one line implementation of 
`load_stream` from `BatchedSerializer`. The default implementation of 
`_load_stream_without_unbatching` was changed to give consistent results 
(always an iterable) so that it could be used without additional checks.
    
    `PairDeserializer` no longer extends `CartesianDeserializer`, as that was not 
really proper. If wanted, a new common superclass could be added.
    
    Both `CartesianDeserializer` and `PairDeserializer` now only extend 
`Serializer` (which has no `dump_stream` implementation) since they are only 
meant for *de*serialization.
    
    ## How was this patch tested?
    
    Additional unit tests (sourced from #14248) plus one for testing a 
cartesian with zip.
    
    Author: Andrew Ray <ray.and...@gmail.com>
    
    Closes #16121 from aray/fix-cartesian.
    
    (cherry picked from commit 3c68944b229aaaeeaee3efcbae3e3be9a2914855)
    Signed-off-by: Davies Liu <davies....@gmail.com>

commit d69df9073274f7ab3a3598bb182a3233fd7775cd
Author: Felix Cheung <felixcheun...@hotmail.com>
Date:   2016-12-08T19:29:31Z

    [SPARK-18590][SPARKR] build R source package when making distribution
    
    This PR has two key changes. First, we build a source package (aka bundle 
package) for SparkR that could be released on CRAN. Second, the 
official Spark binary distributions should include SparkR installed from this source 
package instead (which has the help/vignettes rds files needed for those to work 
when the SparkR package is loaded in R, which the earlier approach with devtools 
does not provide).
    
    But, because of various differences in how R performs different tasks, this 
PR is a fair bit more complicated. More details below.
    
    This PR also includes a few minor fixes.
    
    These are the additional steps in make-distribution; please see 
[here](https://github.com/apache/spark/blob/master/R/CRAN_RELEASE.md) for what 
goes into a CRAN release, which is now run during make-distribution.sh.
    1. The package needs to be installed because the first code block in the vignettes 
is `library(SparkR)` without a lib path
    2. `R CMD build` will build the vignettes (this process runs Spark/SparkR code 
and captures the outputs into pdf documentation)
    3. `R CMD check` on the source package will install the package and build 
the vignettes again (this time from the packaged source) - this is a key step required 
to release an R package on CRAN
     (tests are skipped here, but tests will need to pass for the CRAN release process 
to succeed - ideally, during release signoff we should install from the R 
source package and run tests)
    4. `R CMD INSTALL` on the source package (this is the only way to generate 
the doc/vignettes rds files correctly, not in step # 1)
     (the output of this step is what we package into the Spark dist and sparkr.zip)
    
    Alternatively,
       `R CMD build` should already be installing the package in a temp directory, 
though it might just be finding this location and setting it as the lib.loc parameter; 
another approach is perhaps to try calling `R CMD INSTALL --build pkg` 
instead.
     In any case, despite installing the package multiple times, this is 
relatively fast.
    Building vignettes takes a while, though.
    
    Manually, CI.
    
    Author: Felix Cheung <felixcheun...@hotmail.com>
    
    Closes #16014 from felixcheung/rdist.
    
    (cherry picked from commit c3d3a9d0e85b834abef87069e4edd27db87fc607)
    Signed-off-by: Shivaram Venkataraman <shiva...@cs.berkeley.edu>

commit a035644182646a2160ac16ecd6c7f4d98be2caad
Author: Shixiong Zhu <shixi...@databricks.com>
Date:   2016-12-08T19:54:04Z

    [SPARK-18751][CORE] Fix deadlock when SparkContext.stop is called in 
Utils.tryOrStopSparkContext
    
    ## What changes were proposed in this pull request?
    
    When `SparkContext.stop` is called in `Utils.tryOrStopSparkContext` (the 
following three places), it will cause deadlock because the `stop` method needs 
to wait for the thread running `stop` to exit.
    
    - ContextCleaner.keepCleaning
    - LiveListenerBus.listenerThread.run
    - TaskSchedulerImpl.start
    
    This PR adds `SparkContext.stopInNewThread` and uses it to eliminate the 
potential deadlock. I also removed my changes in #15775 since they are not 
necessary now.
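
    The pattern can be sketched in Python as follows. This is an analogy 
rather than the SparkContext code: the `Service` class and its method names are 
hypothetical. One difference is worth noting: a JVM thread that joins itself 
blocks forever, whereas Python's `Thread.join` raises `RuntimeError` when called 
on the current thread.

```python
import threading


class Service:
    """Toy stand-in for a context whose stop() waits for its worker thread."""

    def __init__(self):
        self.fail = threading.Event()     # set to simulate an error in the worker
        self.stopped = threading.Event()  # set once stop() has finished
        self._worker = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._worker.start()

    def _run(self):
        self.fail.wait()
        # Calling self.stop() here would make the worker wait for itself.
        # Delegating to a fresh thread lets the worker return first.
        self.stop_in_new_thread()

    def stop(self):
        self._worker.join()  # must not be called from the worker thread itself
        self.stopped.set()

    def stop_in_new_thread(self):
        threading.Thread(target=self.stop, daemon=True).start()
```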
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <shixi...@databricks.com>
    
    Closes #16178 from zsxwing/fix-stop-deadlock.
    
    (cherry picked from commit 26432df9cc6ffe569583aa628c6ecd7050b38316)
    Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit 9483242f4c6cc13001e5a967810718b26beb2361
Author: Reynold Xin <r...@databricks.com>
Date:   2016-12-08T20:52:05Z

    [SPARK-18760][SQL] Consistent format specification for FileFormats
    
    ## What changes were proposed in this pull request?
    This patch fixes the format specification in explain for file sources 
(Parquet and Text formats are the only two that are different from the rest):
    
    Before:
    ```
    scala> spark.read.text("test.text").explain()
    == Physical Plan ==
    *FileScan text [value#15] Batched: false, Format: 
org.apache.spark.sql.execution.datasources.text.TextFileFormatxyz, Location: 
InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], 
PushedFilters: [], ReadSchema: struct<value:string>
    ```
    
    After:
    ```
    scala> spark.read.text("test.text").explain()
    == Physical Plan ==
    *FileScan text [value#15] Batched: false, Format: Text, Location: 
InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], 
PushedFilters: [], ReadSchema: struct<value:string>
    ```
    
    Also closes #14680.
    
    ## How was this patch tested?
    Verified in spark-shell.
    
    Author: Reynold Xin <r...@databricks.com>
    
    Closes #16187 from rxin/SPARK-18760.
    
    (cherry picked from commit 5f894d23a54ea99f75f8b722e111e5270f7f80cf)
    Signed-off-by: Reynold Xin <r...@databricks.com>

commit e43209fe2a69fb239dff8bc1a18297d3696f0dcd
Author: Shivaram Venkataraman <shiva...@cs.berkeley.edu>
Date:   2016-12-08T21:01:46Z

    [SPARK-18590][SPARKR] Change the R source build to Hadoop 2.6
    
    This PR changes the SparkR source release tarball to be built using the 
Hadoop 2.6 profile. Previously it used the without-hadoop profile, which 
leads to an error, as discussed in 
https://github.com/apache/spark/pull/16014#issuecomment-265843991
    
    Author: Shivaram Venkataraman <shiva...@cs.berkeley.edu>
    
    Closes #16218 from shivaram/fix-sparkr-release-build.
    
    (cherry picked from commit 202fcd21ce01393fa6dfaa1c2126e18e9b85ee96)
    Signed-off-by: Shivaram Venkataraman <shiva...@cs.berkeley.edu>

commit fcd22e5389a7dffda32be0e143d772f611a0f3d9
Author: Tathagata Das <tathagata.das1...@gmail.com>
Date:   2016-12-09T01:53:34Z

    [SPARK-18776][SS] Make Offset for FileStreamSource correctly formatted in 
JSON
    
    ## What changes were proposed in this pull request?
    
    - Changed FileStreamSource to use the new FileStreamSourceOffset rather than 
LongOffset. The field is named `logOffset` to make it clearer that this 
is an offset in the file stream log.
    - Fixed a bug in FileStreamSourceLog: the field endId in 
FileStreamSourceLog.get(startId, endId) was not being used at all. No test 
caught it earlier; only my updated tests caught it.
    
    Other minor changes
    - Don't use batchId in the FileStreamSource, as calling it a batch id is 
extremely misleading. With multiple sources, a new batch may have no new data 
from a file source, so the offset of FileStreamSource != batchId 
after that batch.
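    
    The two changes above can be sketched as follows. This is an illustrative 
sketch only: `SketchOffset` and `SketchLog` are hypothetical names, the JSON is 
hand-formatted to avoid dependencies, and the range lookup assumes that both 
bounds are inclusive and optional, which is not confirmed by this message:
    
    ```scala
    // Hypothetical sketch; not Spark's FileStreamSourceOffset.
    final case class SketchOffset(logOffset: Long) {
      // Hand-rolled JSON keeps the sketch dependency-free.
      def json: String = s"""{"logOffset":$logOffset}"""
    }

    object SketchLog {
      // Returns entries with startId <= id <= endId, sorted by id.
      // A missing bound is treated as open; endId is honored here.
      def get[T](
          entries: Map[Long, T],
          startId: Option[Long],
          endId: Option[Long]): Seq[(Long, T)] =
        entries.toSeq
          .filter { case (id, _) =>
            startId.forall(id >= _) && endId.forall(id <= _)
          }
          .sortBy(_._1)
    }
    ```
    
    A test that passes an endId smaller than the largest stored id is the kind 
of check that exposes an ignored upper bound.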
    
    ## How was this patch tested?
    
    Updated unit test.
    
    Author: Tathagata Das <tathagata.das1...@gmail.com>
    
    Closes #16205 from tdas/SPARK-18776.
    
    (cherry picked from commit 458fa3325e5f8c21c50e406ac8059d6236f93a9c)
    Signed-off-by: Tathagata Das <tathagata.das1...@gmail.com>

commit 1cafc76ea1e9eef40b24060d1cd7c4aaf9f16a49
Author: Shixiong Zhu <shixi...@databricks.com>
Date:   2016-12-09T01:58:44Z

    [SPARK-18774][CORE][SQL] Ignore non-existing files when ignoreCorruptFiles 
is enabled (branch 2.1)
    
    ## What changes were proposed in this pull request?
    
    Backport #16203 to branch 2.1.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <shixi...@databricks.com>
    
    Closes #16216 from zsxwing/SPARK-18774-2.1.

commit ef5646b4c6792a96e85d1dd4bb3103ba8306949b
Author: Shivaram Venkataraman <shiva...@cs.berkeley.edu>
Date:   2016-12-09T02:26:54Z

    [SPARKR][PYSPARK] Fix R source package name to match Spark version. Remove 
pip tar.gz from distribution
    
    ## What changes were proposed in this pull request?
    
    Fixes name of R source package so that the `cp` in release-build.sh works 
correctly.
    
    Issue discussed in 
https://github.com/apache/spark/pull/16014#issuecomment-265867125
    
    Author: Shivaram Venkataraman <shiva...@cs.berkeley.edu>
    
    Closes #16221 from shivaram/fix-sparkr-release-build-name.
    
    (cherry picked from commit 4ac8b20bf2f962d9b8b6b209468896758d49efe3)
    Signed-off-by: Shivaram Venkataraman <shiva...@cs.berkeley.edu>

commit 4ceed95b43d0cd9665004865095a40926efcc289
Author: wm...@hotmail.com <wm...@hotmail.com>
Date:   2016-12-09T06:08:19Z

    [SPARK-18349][SPARKR] Update R API documentation on ml model summary
    
    ## What changes were proposed in this pull request?
    In this PR, the documentation of the `summary` method is improved to follow 
this format:
    
    returns summary information of the fitted model, which is a list. The list 
includes .......
    
    Since `summary` in R is mainly about the model, which is not the same as the 
`summary` object on the Scala side (if there is one), the Scala API doc is not 
referenced here.
    
    In the current documentation, some `return` descriptions end with `.` and 
some don't. A `.` is added to those that were missing one.
    
    Since the spark.logit `summary` has undergone a big refactoring, this PR 
doesn't include it. It will be changed when the `spark.logit` PR is merged.
    
    ## How was this patch tested?
    
    Manual build.
    
    Author: wm...@hotmail.com <wm...@hotmail.com>
    
    Closes #16150 from wangmiao1981/audit2.
    
    (cherry picked from commit 86a96034ccb47c5bba2cd739d793240afcfc25f6)
    Signed-off-by: Felix Cheung <felixche...@apache.org>

----

