[GitHub] spark issue #20372: Improved block merging logic for partitions
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/20372 please see https://spark.apache.org/contributing.html. Could you open a JIRA and update this PR? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20404: [SPARK-23228][PYSPARK] Add Python Created jsparkS...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/20404#discussion_r164266681 --- Diff: python/pyspark/sql/session.py --- @@ -225,6 +225,7 @@ def __init__(self, sparkContext, jsparkSession=None): if SparkSession._instantiatedSession is None \ or SparkSession._instantiatedSession._sc._jsc is None: SparkSession._instantiatedSession = self + self._jvm.org.apache.spark.sql.SparkSession.setDefaultSession(self._jsparkSession) --- End diff -- I think this is somewhat conflicting... If I call SparkSession.getOrCreate in Python and the JVM already has a defaultSession, what should happen? I suppose one approach would be to keep Python independent, but with this change it would overwrite a defaultSession that might still be valid? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
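For illustration, a minimal sketch of the kind of guard this comment is hinting at (purely hypothetical, not what the PR does; it assumes the JVM-side `getDefaultSession` method is reachable through py4j):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
jvm_session_obj = spark._jvm.org.apache.spark.sql.SparkSession

# Hypothetical guard: only install the Python-created session as the JVM
# default when the JVM does not already have one, so an existing, still-valid
# default session is not silently overwritten.
if jvm_session_obj.getDefaultSession().isEmpty():
    jvm_session_obj.setDefaultSession(spark._jsparkSession)
```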
[GitHub] spark issue #20383: [SPARK-23200] Reset Kubernetes-specific config on Checkp...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/20383 without this fix a streaming app can't properly recover from a checkpoint, correct? that seems fairly important to me. was apache-spark-on-k8s/spark-integration run before this PR was opened? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20403: [SPARK-23238][PYTHON] Externalize SQLConf spark.sql.exec...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20403 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20409 About the test case, can't we just use the reproducer in the PR description to check whether we change the deterministic status of the UDF? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20409 retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20413: [SPARK-23245][SS][TESTS] Don't access `lastExecut...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20413 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20413: [SPARK-23245][SS][TESTS] Don't access `lastExecution.exe...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/20413 Thanks! Merging to master and 2.3. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20415: [SPARK-23247][SQL]combines Unsafe operations and ...
GitHub user heary-cao opened a pull request: https://github.com/apache/spark/pull/20415 [SPARK-23247][SQL] combines Unsafe operations and statistics operations in Scan Data Source

## What changes were proposed in this pull request?

Currently, in the execution plan of a data source scan, we first apply the unsafe projection to each row of data and then traverse the data again to count the rows. The second pass is unnecessary from a performance standpoint. This PR combines the two operations and updates the row-count statistics while performing the unsafe projection.

Before the change:

```
val unsafeRow = rdd.mapPartitionsWithIndexInternal { (index, iter) =>
  val proj = UnsafeProjection.create(schema)
  proj.initialize(index)
  iter.map(proj)
}
val numOutputRows = longMetric("numOutputRows")
unsafeRow.map { r =>
  numOutputRows += 1
  r
}
```

After the change:

```
val numOutputRows = longMetric("numOutputRows")
rdd.mapPartitionsWithIndexInternal { (index, iter) =>
  val proj = UnsafeProjection.create(schema)
  proj.initialize(index)
  iter.map( r => {
    numOutputRows += 1
    proj(r)
  })
}
```

## How was this patch tested?

The existing test cases.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/heary-cao/spark DataSourceScanExec Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20415.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20415

commit e3e09d98072bd39328a4e7d4de1ddd38594c6232 Author: caoxuewen Date: 2018-01-27T06:27:37Z combines Unsafe operations and statistics operations in Scan Data Source

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20413: [SPARK-23245][SS][TESTS] Don't access `lastExecution.exe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20413 **[Test build #4079 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4079/testReport)** for PR 20413 at commit [`3ce18c4`](https://github.com/apache/spark/commit/3ce18c47fb50f6c1f12db1fb34a6196e5040694d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19285: [SPARK-22068][CORE]Reduce the duplicate code between put...
Github user ConeyLiu commented on the issue: https://github.com/apache/spark/pull/19285 thanks all. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), un...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/20400#discussion_r164261850 --- Diff: python/pyspark/sql/window.py --- @@ -124,16 +124,19 @@ def rangeBetween(start, end): values directly. :param start: boundary start, inclusive. - The frame is unbounded if this is ``Window.unboundedPreceding``, or + The frame is unbounded if this is ``Window.unboundedPreceding``, + ``org.apache.spark.sql.catalyst.expressions.UnboundedPreceding``, or any value less than or equal to max(-sys.maxsize, -9223372036854775808). :param end: boundary end, inclusive. -The frame is unbounded if this is ``Window.unboundedFollowing``, or +The frame is unbounded if this is ``Window.unboundedFollowing``, + ``org.apache.spark.sql.catalyst.expressions.UnboundedPFollowing``, or any value greater than or equal to min(sys.maxsize, 9223372036854775807). """ -if start <= Window._PRECEDING_THRESHOLD: -start = Window.unboundedPreceding -if end >= Window._FOLLOWING_THRESHOLD: -end = Window.unboundedFollowing +if isinstance(start, int) and isinstance(end, int): +if start <= Window._PRECEDING_THRESHOLD: +start = Window.unboundedPreceding --- End diff --

@jiangxb1987 Do you mean to change it to

```
if isinstance(start, int) and isinstance(end, int):
    if start == Window._PRECEDING_THRESHOLD:  # Window._PRECEDING_THRESHOLD == Long.MinValue
        start = Window.unboundedPreceding
    if end == Window._FOLLOWING_THRESHOLD:  # Window._FOLLOWING_THRESHOLD == Long.MaxValue
        end = Window.unboundedFollowing
```

I ran the Python tests, and tests.py failed at

```
with patch("sys.maxsize", 2 ** 127 - 1):
    importlib.reload(window)
    self.assertTrue(rows_frame_match())
    self.assertTrue(range_frame_match())
```

So I guess I will keep ```if start <= Window._PRECEDING_THRESHOLD``` and ```if end >= Window._FOLLOWING_THRESHOLD```.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20401: [MINOR][SS][DOC] Fix `Trigger` Scala/Java doc examples
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20401 Thank you for merging this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19575: [SPARK-22221][DOCS] Adding User Documentation for...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19575#discussion_r164261725 --- Diff: docs/sql-programming-guide.md --- @@ -1640,6 +1640,133 @@ Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` a You may run `./bin/spark-sql --help` for a complete list of all available options. +# PySpark Usage Guide for Pandas with Apache Arrow + +## Apache Arrow in Spark + +Apache Arrow is an in-memory columnar data format that is used in Spark to efficiently transfer +data between JVM and Python processes. This currently is most beneficial to Python users that +work with Pandas/NumPy data. Its usage is not automatic and might require some minor +changes to configuration or code to take full advantage and ensure compatibility. This guide will +give a high-level description of how to use Arrow in Spark and highlight any differences when +working with Arrow-enabled data. + +### Ensure PyArrow Installed + +If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the +SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow +is installed and available on all cluster nodes. The current supported version is 0.8.0. +You can install using pip or conda from the conda-forge channel. See PyArrow +[installation](https://arrow.apache.org/docs/python/install.html) for details. + +## Enabling for Conversion to/from Pandas + +Arrow is available as an optimization when converting a Spark DataFrame to Pandas using the call +`toPandas()` and when creating a Spark DataFrame from Pandas with `createDataFrame(pandas_df)`. +To use Arrow when executing these calls, users need to first set the Spark configuration +'spark.sql.execution.arrow.enabled' to 'true'. This is disabled by default. + + + +{% include_example dataframe_with_arrow python/sql/arrow.py %} + + + +Using the above optimizations with Arrow will produce the same results as when Arrow is not +enabled. Note that even with Arrow, `toPandas()` results in the collection of all records in the +DataFrame to the driver program and should be done on a small subset of the data. Not all Spark +data types are currently supported and an error can be raised if a column has an unsupported type, +see [Supported Types](#supported-sql-arrow-types). If an error occurs during `createDataFrame()`, +Spark will fall back to create the DataFrame without Arrow. + +## Pandas UDFs (a.k.a. Vectorized UDFs) + +Pandas UDFs are user defined functions that are executed by Spark using Arrow to transfer data and +Pandas to work with the data. A Pandas UDF is defined using the keyword `pandas_udf` as a decorator +or to wrap the function, no additional configuration is required. Currently, there are two types of +Pandas UDF: Scalar and Group Map. + +### Scalar + +Scalar Pandas UDFs are used for vectorizing scalar operations. They can be used with functions such +as `select` and `withColumn`. The Python function should take `pandas.Series` as inputs and return +a `pandas.Series` of the same length. Internally, Spark will execute a Pandas UDF by splitting +columns into batches and calling the function for each batch as a subset of the data, then +concatenating the results together. + +The following example shows how to create a scalar Pandas UDF that computes the product of 2 columns. 
+ + + +{% include_example scalar_pandas_udf python/sql/arrow.py %} + + + +### Group Map +Group map Pandas UDFs are used with `groupBy().apply()` which implements the "split-apply-combine" pattern. +Split-apply-combine consists of three steps: +* Split the data into groups by using `DataFrame.groupBy`. +* Apply a function on each group. The input and output of the function are both `pandas.DataFrame`. The + input data contains all the rows and columns for each group. +* Combine the results into a new `DataFrame`. + +To use `groupBy().apply()`, the user needs to define the following: +* A Python function that defines the computation for each group. +* A `StructType` object or a string that defines the schema of the output `DataFrame`. + --- End diff -- We need to warn users that, the Group Map Pandas UDF requires to load all the data of a group into memory, which is not controlled by `spark.sql.execution.arrow.maxRecordsPerBatch`, and may OOM if the data is skewed and some partitions have a lot of records. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
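As a concrete reference for the group map type discussed above (the actual example lives in `python/sql/arrow.py` and is pulled in by the `include_example` tag, so it does not appear in this diff), a rough sketch of the subtract-the-mean group map Pandas UDF described in the doc is shown below; the schema string and data are illustrative, and released Spark versions name the function type `PandasUDFType.GROUPED_MAP`:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v"))

@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    # pdf is a pandas.DataFrame holding ALL rows of one group in memory,
    # which is why a heavily skewed group can cause an OOM, as noted above.
    return pdf.assign(v=pdf.v - pdf.v.mean())

df.groupby("id").apply(subtract_mean).show()
```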
[GitHub] spark pull request #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), un...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20400#discussion_r164261678 --- Diff: python/pyspark/sql/window.py --- @@ -124,16 +124,19 @@ def rangeBetween(start, end): values directly. :param start: boundary start, inclusive. - The frame is unbounded if this is ``Window.unboundedPreceding``, or + The frame is unbounded if this is ``Window.unboundedPreceding``, + ``org.apache.spark.sql.catalyst.expressions.UnboundedPreceding``, or any value less than or equal to max(-sys.maxsize, -9223372036854775808). :param end: boundary end, inclusive. -The frame is unbounded if this is ``Window.unboundedFollowing``, or +The frame is unbounded if this is ``Window.unboundedFollowing``, + ``org.apache.spark.sql.catalyst.expressions.UnboundedPFollowing``, or any value greater than or equal to min(sys.maxsize, 9223372036854775807). """ -if start <= Window._PRECEDING_THRESHOLD: -start = Window.unboundedPreceding -if end >= Window._FOLLOWING_THRESHOLD: -end = Window.unboundedFollowing +if isinstance(start, int) and isinstance(end, int): --- End diff --

Yup, but I think we can still get a `long` type value:

```
>>> long(1)
1L
>>> isinstance(long(1), int)
False
```

You can simply do something like `isinstance(long(1), (int, long))` with

```
if sys.version >= '3':
    long = int
```

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19575: [SPARK-22221][DOCS] Adding User Documentation for...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19575#discussion_r164261639 --- Diff: docs/sql-programming-guide.md --- @@ -1640,6 +1640,133 @@ Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` a You may run `./bin/spark-sql --help` for a complete list of all available options. +# PySpark Usage Guide for Pandas with Apache Arrow + +## Apache Arrow in Spark + +Apache Arrow is an in-memory columnar data format that is used in Spark to efficiently transfer +data between JVM and Python processes. This currently is most beneficial to Python users that +work with Pandas/NumPy data. Its usage is not automatic and might require some minor +changes to configuration or code to take full advantage and ensure compatibility. This guide will +give a high-level description of how to use Arrow in Spark and highlight any differences when +working with Arrow-enabled data. + +### Ensure PyArrow Installed + +If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the +SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow +is installed and available on all cluster nodes. The current supported version is 0.8.0. +You can install using pip or conda from the conda-forge channel. See PyArrow +[installation](https://arrow.apache.org/docs/python/install.html) for details. + +## Enabling for Conversion to/from Pandas + +Arrow is available as an optimization when converting a Spark DataFrame to Pandas using the call +`toPandas()` and when creating a Spark DataFrame from Pandas with `createDataFrame(pandas_df)`. +To use Arrow when executing these calls, users need to first set the Spark configuration +'spark.sql.execution.arrow.enabled' to 'true'. This is disabled by default. + + + +{% include_example dataframe_with_arrow python/sql/arrow.py %} + + + +Using the above optimizations with Arrow will produce the same results as when Arrow is not +enabled. Note that even with Arrow, `toPandas()` results in the collection of all records in the +DataFrame to the driver program and should be done on a small subset of the data. Not all Spark +data types are currently supported and an error can be raised if a column has an unsupported type, +see [Supported Types](#supported-sql-arrow-types). If an error occurs during `createDataFrame()`, +Spark will fall back to create the DataFrame without Arrow. + +## Pandas UDFs (a.k.a. Vectorized UDFs) + +Pandas UDFs are user defined functions that are executed by Spark using Arrow to transfer data and +Pandas to work with the data. A Pandas UDF is defined using the keyword `pandas_udf` as a decorator +or to wrap the function, no additional configuration is required. Currently, there are two types of +Pandas UDF: Scalar and Group Map. + +### Scalar + +Scalar Pandas UDFs are used for vectorizing scalar operations. They can be used with functions such +as `select` and `withColumn`. The Python function should take `pandas.Series` as inputs and return +a `pandas.Series` of the same length. Internally, Spark will execute a Pandas UDF by splitting +columns into batches and calling the function for each batch as a subset of the data, then +concatenating the results together. + +The following example shows how to create a scalar Pandas UDF that computes the product of 2 columns. 
+ + + +{% include_example scalar_pandas_udf python/sql/arrow.py %} + + + +### Group Map +Group map Pandas UDFs are used with `groupBy().apply()` which implements the "split-apply-combine" pattern. +Split-apply-combine consists of three steps: +* Split the data into groups by using `DataFrame.groupBy`. +* Apply a function on each group. The input and output of the function are both `pandas.DataFrame`. The + input data contains all the rows and columns for each group. +* Combine the results into a new `DataFrame`. + +To use `groupBy().apply()`, the user needs to define the following: +* A Python function that defines the computation for each group. +* A `StructType` object or a string that defines the schema of the output `DataFrame`. + +The following example shows how to use `groupby().apply()` to subtract the mean from each value in the group. + + + +{% include_example group_map_pandas_udf python/sql/arrow.py %} + + + +For detailed usage, please see [`pyspark.sql.functions.pandas_udf`](api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf) and +[`pyspark.sql.GroupedData.apply`](api/python/pyspark.sql.html#pyspark.sql.GroupedData.apply). + +## Usage Notes + +###
[GitHub] spark pull request #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), un...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/20400#discussion_r164261413 --- Diff: python/pyspark/sql/window.py --- @@ -124,16 +124,19 @@ def rangeBetween(start, end): values directly. :param start: boundary start, inclusive. - The frame is unbounded if this is ``Window.unboundedPreceding``, or + The frame is unbounded if this is ``Window.unboundedPreceding``, + ``org.apache.spark.sql.catalyst.expressions.UnboundedPreceding``, or any value less than or equal to max(-sys.maxsize, -9223372036854775808). :param end: boundary end, inclusive. -The frame is unbounded if this is ``Window.unboundedFollowing``, or +The frame is unbounded if this is ``Window.unboundedFollowing``, + ``org.apache.spark.sql.catalyst.expressions.UnboundedPFollowing``, or any value greater than or equal to min(sys.maxsize, 9223372036854775807). """ -if start <= Window._PRECEDING_THRESHOLD: -start = Window.unboundedPreceding -if end >= Window._FOLLOWING_THRESHOLD: -end = Window.unboundedFollowing +if isinstance(start, int) and isinstance(end, int): --- End diff --

@HyukjinKwon @jiangxb1987 Thank you very much for your comments. It seems to me that int and long are "unified" in Python 2. I tried the following:

```
Python 2.7.10 (default, Oct 23 2015, 19:19:21)
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> value = 9223372036854775807
>>> isinstance(value, int)
True
```

It seems to me that we don't have to handle long for Python 2. I guess I will keep ```if isinstance(start, int) and isinstance(end, int)```?

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20409 test this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20409 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20403: [SPARK-23238][PYTHON] Externalize SQLConf spark.sql.exec...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20403 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19575: [SPARK-22221][DOCS] Adding User Documentation for...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19575#discussion_r164260764 --- Diff: docs/sql-programming-guide.md --- @@ -1640,6 +1640,133 @@ Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` a You may run `./bin/spark-sql --help` for a complete list of all available options. +# PySpark Usage Guide for Pandas with Apache Arrow + +## Apache Arrow in Spark + +Apache Arrow is an in-memory columnar data format that is used in Spark to efficiently transfer +data between JVM and Python processes. This currently is most beneficial to Python users that +work with Pandas/NumPy data. Its usage is not automatic and might require some minor +changes to configuration or code to take full advantage and ensure compatibility. This guide will +give a high-level description of how to use Arrow in Spark and highlight any differences when +working with Arrow-enabled data. + +### Ensure PyArrow Installed + +If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the +SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow +is installed and available on all cluster nodes. The current supported version is 0.8.0. +You can install using pip or conda from the conda-forge channel. See PyArrow +[installation](https://arrow.apache.org/docs/python/install.html) for details. + +## Enabling for Conversion to/from Pandas + +Arrow is available as an optimization when converting a Spark DataFrame to Pandas using the call +`toPandas()` and when creating a Spark DataFrame from Pandas with `createDataFrame(pandas_df)`. +To use Arrow when executing these calls, users need to first set the Spark configuration +'spark.sql.execution.arrow.enabled' to 'true'. This is disabled by default. + + + +{% include_example dataframe_with_arrow python/sql/arrow.py %} + + + +Using the above optimizations with Arrow will produce the same results as when Arrow is not +enabled. Note that even with Arrow, `toPandas()` results in the collection of all records in the +DataFrame to the driver program and should be done on a small subset of the data. Not all Spark +data types are currently supported and an error can be raised if a column has an unsupported type, +see [Supported Types](#supported-sql-arrow-types). If an error occurs during `createDataFrame()`, +Spark will fall back to create the DataFrame without Arrow. + +## Pandas UDFs (a.k.a. Vectorized UDFs) + +Pandas UDFs are user defined functions that are executed by Spark using Arrow to transfer data and +Pandas to work with the data. A Pandas UDF is defined using the keyword `pandas_udf` as a decorator +or to wrap the function, no additional configuration is required. Currently, there are two types of +Pandas UDF: Scalar and Group Map. + +### Scalar + +Scalar Pandas UDFs are used for vectorizing scalar operations. They can be used with functions such +as `select` and `withColumn`. The Python function should take `pandas.Series` as inputs and return +a `pandas.Series` of the same length. Internally, Spark will execute a Pandas UDF by splitting +columns into batches and calling the function for each batch as a subset of the data, then +concatenating the results together. + +The following example shows how to create a scalar Pandas UDF that computes the product of 2 columns. 
+ + + +{% include_example scalar_pandas_udf python/sql/arrow.py %} + + + +### Group Map +Group map Pandas UDFs are used with `groupBy().apply()` which implements the "split-apply-combine" pattern. +Split-apply-combine consists of three steps: +* Split the data into groups by using `DataFrame.groupBy`. +* Apply a function on each group. The input and output of the function are both `pandas.DataFrame`. The + input data contains all the rows and columns for each group. +* Combine the results into a new `DataFrame`. + +To use `groupBy().apply()`, the user needs to define the following: +* A Python function that defines the computation for each group. +* A `StructType` object or a string that defines the schema of the output `DataFrame`. + +The following example shows how to use `groupby().apply()` to subtract the mean from each value in the group. + + + +{% include_example group_map_pandas_udf python/sql/arrow.py %} + + + +For detailed usage, please see [`pyspark.sql.functions.pandas_udf`](api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf) and +[`pyspark.sql.GroupedData.apply`](api/python/pyspark.sql.html#pyspark.sql.GroupedData.apply). + +## Usage Notes + +###
[GitHub] spark issue #20369: [SPARK-23196] Unify continuous and microbatch V2 sinks
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20369 LGTM, pending jenkins --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19575: [SPARK-22221][DOCS] Adding User Documentation for...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19575#discussion_r164260619 --- Diff: docs/sql-programming-guide.md --- @@ -1640,6 +1640,133 @@ Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` a You may run `./bin/spark-sql --help` for a complete list of all available options. +# PySpark Usage Guide for Pandas with Apache Arrow + +## Apache Arrow in Spark + +Apache Arrow is an in-memory columnar data format that is used in Spark to efficiently transfer +data between JVM and Python processes. This currently is most beneficial to Python users that +work with Pandas/NumPy data. Its usage is not automatic and might require some minor +changes to configuration or code to take full advantage and ensure compatibility. This guide will +give a high-level description of how to use Arrow in Spark and highlight any differences when +working with Arrow-enabled data. + +### Ensure PyArrow Installed + +If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the +SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow +is installed and available on all cluster nodes. The current supported version is 0.8.0. +You can install using pip or conda from the conda-forge channel. See PyArrow +[installation](https://arrow.apache.org/docs/python/install.html) for details. + +## Enabling for Conversion to/from Pandas + +Arrow is available as an optimization when converting a Spark DataFrame to Pandas using the call +`toPandas()` and when creating a Spark DataFrame from Pandas with `createDataFrame(pandas_df)`. +To use Arrow when executing these calls, users need to first set the Spark configuration +'spark.sql.execution.arrow.enabled' to 'true'. This is disabled by default. + + + +{% include_example dataframe_with_arrow python/sql/arrow.py %} + + + +Using the above optimizations with Arrow will produce the same results as when Arrow is not +enabled. Note that even with Arrow, `toPandas()` results in the collection of all records in the +DataFrame to the driver program and should be done on a small subset of the data. Not all Spark +data types are currently supported and an error can be raised if a column has an unsupported type, +see [Supported Types](#supported-sql-arrow-types). If an error occurs during `createDataFrame()`, +Spark will fall back to create the DataFrame without Arrow. + +## Pandas UDFs (a.k.a. Vectorized UDFs) + +Pandas UDFs are user defined functions that are executed by Spark using Arrow to transfer data and +Pandas to work with the data. A Pandas UDF is defined using the keyword `pandas_udf` as a decorator +or to wrap the function, no additional configuration is required. Currently, there are two types of +Pandas UDF: Scalar and Group Map. + +### Scalar + +Scalar Pandas UDFs are used for vectorizing scalar operations. They can be used with functions such +as `select` and `withColumn`. The Python function should take `pandas.Series` as inputs and return +a `pandas.Series` of the same length. Internally, Spark will execute a Pandas UDF by splitting +columns into batches and calling the function for each batch as a subset of the data, then +concatenating the results together. + +The following example shows how to create a scalar Pandas UDF that computes the product of 2 columns. 
+ + + +{% include_example scalar_pandas_udf python/sql/arrow.py %} + + + +### Group Map +Group map Pandas UDFs are used with `groupBy().apply()` which implements the "split-apply-combine" pattern. --- End diff -- Which name? If you mean "split-apply-combine", I think it's fine - https://pandas.pydata.org/pandas-docs/stable/groupby.html --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
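For reference, the split-apply-combine pattern described in the pandas documentation linked above can be sketched in plain pandas (no Spark involved; the tiny DataFrame and column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2, 2], "v": [1.0, 2.0, 3.0, 5.0]})

def subtract_mean(pdf):
    # "apply": operate on the pandas.DataFrame of one group at a time
    return pdf.assign(v=pdf.v - pdf.v.mean())

# "split" on id, "apply" the function per group, "combine" the results
result = df.groupby("id", group_keys=False).apply(subtract_mean)
print(result)
```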
[GitHub] spark pull request #20367: [SPARK-23166][ML] Add maxDF Parameter to CountVec...
Github user ymazari commented on a diff in the pull request: https://github.com/apache/spark/pull/20367#discussion_r164260027 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala --- @@ -160,6 +187,11 @@ class CountVectorizer @Since("1.5.0") (@Since("1.5.0") override val uid: String) } else { $(minDF) * input.cache().count() } +val maxDf = if ($(maxDF) >= 1.0) { + $(maxDF) +} else { + $(maxDF) * input.cache().count() +} --- End diff --

Good points.
- I added a check that maxDF >= minDF.
- Changed the code so that counting (and caching) is done only once.
- Refactored the code so that "filter()" is only invoked if minDF or maxDF is set.
- Added un-persisting of the input after the counting is done.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20369: [SPARK-23196] Unify continuous and microbatch V2 sinks
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20369 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20407: [SPARK-23124][SQL] Allow to disable BroadcastNestedLoopJ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20407 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20407: [SPARK-23124][SQL] Allow to disable BroadcastNestedLoopJ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20407 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86722/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20407: [SPARK-23124][SQL] Allow to disable BroadcastNestedLoopJ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20407 **[Test build #86722 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86722/testReport)** for PR 20407 at commit [`074c342`](https://github.com/apache/spark/commit/074c34245d300901390d2d5ed74bb69e32539b8a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20208 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86725/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20208 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20208 **[Test build #86725 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86725/testReport)** for PR 20208 at commit [`29c281d`](https://github.com/apache/spark/commit/29c281dbe3c6f63614d9abc286c68e283786649b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20372: Improved block merging logic for partitions
Github user vgankidi commented on the issue: https://github.com/apache/spark/pull/20372 I agree with @ash211. Applications shouldn't rely on the order of the files within a partition. This optimization looks good to me. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20413: [SPARK-23245][SS][TESTS] Don't access `lastExecution.exe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20413 **[Test build #4079 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4079/testReport)** for PR 20413 at commit [`3ce18c4`](https://github.com/apache/spark/commit/3ce18c47fb50f6c1f12db1fb34a6196e5040694d). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20401: [MINOR][SS][DOC] Fix `Trigger` Scala/Java doc exa...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20401 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20401: [MINOR][SS][DOC] Fix `Trigger` Scala/Java doc examples
Github user srowen commented on the issue: https://github.com/apache/spark/pull/20401 Merged to master/2.3 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20413: [SPARK-23245][SS][TESTS] Don't access `lastExecution.exe...
Github user jose-torres commented on the issue: https://github.com/apache/spark/pull/20413 +1. I originally wrote this line, and I'm reasonably confident that (as indicated by the comment) I didn't intend to check anything other than the nullity of lastExecution. I continue to be confused about why I wrote such a strange null check. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20394: [SPARK-23214][SQL] cached data should not carry e...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20394 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20413: [SPARK-23245][SS][TESTS] Don't access `lastExecution.exe...
Github user tdas commented on the issue: https://github.com/apache/spark/pull/20413 LGTM. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20394: [SPARK-23214][SQL] cached data should not carry extra hi...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20394 LGTM Thanks! Merged to master/2.3 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20394: [SPARK-23214][SQL] cached data should not carry e...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20394#discussion_r164254321 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala --- @@ -73,11 +73,16 @@ case class InMemoryRelation( @transient val partitionStatistics = new PartitionStatistics(output) override def computeStats(): Statistics = { -if (batchStats.value == 0L) { - // Underlying columnar RDD hasn't been materialized, use the stats from the plan to cache - statsOfPlanToCache +if (sizeInBytesStats.value == 0L) { + // Underlying columnar RDD hasn't been materialized, use the stats from the plan to cache. + // Note that we should drop the hint info here. We may cache a plan whose root node is a hint + // node. When we lookup the cache with a semantically same plan without hint info, the plan + // returned by cache lookup should not have hint info. If we lookup the cache with a + // semantically same plan with a different hint info, `CacheManager.useCachedData` will take + // care of it and retain the hint info in the lookup input plan. + statsOfPlanToCache.copy(hints = HintInfo()) --- End diff -- This is a new behavior we introduced in 2.3. I will first keep the behavior unchanged and merge it to 2.3. We can have more discussion in the next release. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD ...
GitHub user jiangxb1987 opened a pull request: https://github.com/apache/spark/pull/20414 [SPARK-23243][SQL] Shuffle+Repartition on an RDD could lead to incorrect answers

## What changes were proposed in this pull request?

RDD repartition also uses the round-robin way to distribute data, so it can produce incorrect answers on RDD workloads in a similar way to #20393. However, the approach that fixes DataFrame.repartition() does not apply to the RDD repartition issue, because the input data can be non-comparable, as discussed in https://github.com/apache/spark/pull/20393#issuecomment-360912451. Here, I propose a quick fix that distributes elements using their hashes. This will cause a perf regression if you have highly skewed input data, but it will ensure result correctness.

## How was this patch tested?

Added a test case in `RDDSuite` to ensure `RDD.repartition()` generates consistent answers.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/jiangxb1987/spark rdd-repartition Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20414.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20414

commit 6910ed62c272bedfa251cab589bb52bed36be3ed Author: Xingbo Jiang Date: 2018-01-27T00:34:24Z fix RDD.repartition()

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
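The actual patch is in Scala inside `RDD.repartition()`, but the hash-based distribution idea can be sketched roughly in PySpark as follows; `num_partitions` and the use of the built-in `hash` are illustrative assumptions, not the patch itself:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
num_partitions = 4
rdd = sc.parallelize(range(100))

# Key each element by its hash, partition by that key, then drop the key.
# Unlike round-robin assignment, the target partition of an element does not
# depend on iteration order, so a recomputed (and possibly reordered) input
# still sends every element to the same partition -- at the cost of possible
# skew when many elements share a hash value.
repartitioned = (rdd.map(lambda x: (hash(x), x))
                    .partitionBy(num_partitions)
                    .values())
print(repartitioned.glom().collect())
```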
[GitHub] spark pull request #20404: [SPARK-23228][PYSPARK] Add Python Created jsparkS...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20404#discussion_r164253931 --- Diff: python/pyspark/sql/session.py --- @@ -225,6 +225,7 @@ def __init__(self, sparkContext, jsparkSession=None): if SparkSession._instantiatedSession is None \ or SparkSession._instantiatedSession._sc._jsc is None: SparkSession._instantiatedSession = self + self._jvm.org.apache.spark.sql.SparkSession.setDefaultSession(self._jsparkSession) --- End diff -- Yeah, I think we can clear it in `def stop`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on a DataFrame co...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/20393 I'm fine with merging this -- I just don't want this issue to be forgotten for RDDs, as I think it's a major correctness issue. @mridulm @sameeragarwal Let's continue the discussion on the new JIRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20413: [SC-9624][SS][TESTS] Don't access `lastExecution....
GitHub user zsxwing opened a pull request: https://github.com/apache/spark/pull/20413 [SC-9624][SS][TESTS] Don't access `lastExecution.executedPlan` in StreamTest

## What changes were proposed in this pull request?

`lastExecution.executedPlan` is a lazy val, so accessing it in StreamTest may need to acquire the lock of `lastExecution`. The test may wait forever when the streaming thread is holding that lock while running a continuous Spark job. This PR changes the check to test whether `s.lastExecution` is null, to avoid accessing `lastExecution.executedPlan`.

## How was this patch tested?

Jenkins

You can merge this pull request into a Git repository by running: $ git pull https://github.com/zsxwing/spark SPARK-23245 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20413.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20413

commit 3ce18c47fb50f6c1f12db1fb34a6196e5040694d Author: Jose Torres Date: 2018-01-26T22:57:18Z better check

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19575: [SPARK-22221][DOCS] Adding User Documentation for...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19575#discussion_r164252270 --- Diff: docs/sql-programming-guide.md --- @@ -1640,6 +1640,133 @@ Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` a You may run `./bin/spark-sql --help` for a complete list of all available options. +# PySpark Usage Guide for Pandas with Apache Arrow + +## Apache Arrow in Spark + +Apache Arrow is an in-memory columnar data format that is used in Spark to efficiently transfer +data between JVM and Python processes. This currently is most beneficial to Python users that +work with Pandas/NumPy data. Its usage is not automatic and might require some minor +changes to configuration or code to take full advantage and ensure compatibility. This guide will +give a high-level description of how to use Arrow in Spark and highlight any differences when +working with Arrow-enabled data. + +### Ensure PyArrow Installed + +If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the +SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow +is installed and available on all cluster nodes. The current supported version is 0.8.0. +You can install using pip or conda from the conda-forge channel. See PyArrow +[installation](https://arrow.apache.org/docs/python/install.html) for details. + +## Enabling for Conversion to/from Pandas + +Arrow is available as an optimization when converting a Spark DataFrame to Pandas using the call +`toPandas()` and when creating a Spark DataFrame from Pandas with `createDataFrame(pandas_df)`. +To use Arrow when executing these calls, users need to first set the Spark configuration +'spark.sql.execution.arrow.enabled' to 'true'. This is disabled by default. + + + +{% include_example dataframe_with_arrow python/sql/arrow.py %} + + + +Using the above optimizations with Arrow will produce the same results as when Arrow is not +enabled. Note that even with Arrow, `toPandas()` results in the collection of all records in the +DataFrame to the driver program and should be done on a small subset of the data. Not all Spark +data types are currently supported and an error can be raised if a column has an unsupported type, +see [Supported Types](#supported-sql-arrow-types). If an error occurs during `createDataFrame()`, +Spark will fall back to create the DataFrame without Arrow. + +## Pandas UDFs (a.k.a. Vectorized UDFs) + +Pandas UDFs are user defined functions that are executed by Spark using Arrow to transfer data and +Pandas to work with the data. A Pandas UDF is defined using the keyword `pandas_udf` as a decorator +or to wrap the function, no additional configuration is required. Currently, there are two types of +Pandas UDF: Scalar and Group Map. + +### Scalar + +Scalar Pandas UDFs are used for vectorizing scalar operations. They can be used with functions such +as `select` and `withColumn`. The Python function should take `pandas.Series` as inputs and return +a `pandas.Series` of the same length. Internally, Spark will execute a Pandas UDF by splitting +columns into batches and calling the function for each batch as a subset of the data, then +concatenating the results together. + +The following example shows how to create a scalar Pandas UDF that computes the product of 2 columns. 
+ + + +{% include_example scalar_pandas_udf python/sql/arrow.py %} + + + +### Group Map +Group map Pandas UDFs are used with `groupBy().apply()` which implements the "split-apply-combine" pattern. +Split-apply-combine consists of three steps: +* Split the data into groups by using `DataFrame.groupBy`. +* Apply a function on each group. The input and output of the function are both `pandas.DataFrame`. The + input data contains all the rows and columns for each group. +* Combine the results into a new `DataFrame`. + +To use `groupBy().apply()`, the user needs to define the following: +* A Python function that defines the computation for each group. +* A `StructType` object or a string that defines the schema of the output `DataFrame`. + +The following example shows how to use `groupby().apply()` to subtract the mean from each value in the group. + + + +{% include_example group_map_pandas_udf python/sql/arrow.py %} + + + +For detailed usage, please see [`pyspark.sql.functions.pandas_udf`](api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf) and +[`pyspark.sql.GroupedData.apply`](api/python/pyspark.sql.html#pyspark.sql.GroupedData.apply). + +## Usage Notes + +###
[GitHub] spark issue #20413: [SC-9624][SS][TESTS] Don't access `lastExecution.execute...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/20413 cc @jose-torres @tdas --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19575: [SPARK-22221][DOCS] Adding User Documentation for...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19575#discussion_r164252424 --- Diff: docs/sql-programming-guide.md --- @@ -1640,6 +1640,133 @@ Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` a You may run `./bin/spark-sql --help` for a complete list of all available options. +# PySpark Usage Guide for Pandas with Apache Arrow + +## Apache Arrow in Spark + +Apache Arrow is an in-memory columnar data format that is used in Spark to efficiently transfer +data between JVM and Python processes. This currently is most beneficial to Python users that +work with Pandas/NumPy data. Its usage is not automatic and might require some minor +changes to configuration or code to take full advantage and ensure compatibility. This guide will +give a high-level description of how to use Arrow in Spark and highlight any differences when +working with Arrow-enabled data. + +### Ensure PyArrow Installed + +If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the +SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow +is installed and available on all cluster nodes. The current supported version is 0.8.0. +You can install using pip or conda from the conda-forge channel. See PyArrow +[installation](https://arrow.apache.org/docs/python/install.html) for details. + +## Enabling for Conversion to/from Pandas + +Arrow is available as an optimization when converting a Spark DataFrame to Pandas using the call +`toPandas()` and when creating a Spark DataFrame from Pandas with `createDataFrame(pandas_df)`. +To use Arrow when executing these calls, users need to first set the Spark configuration +'spark.sql.execution.arrow.enabled' to 'true'. This is disabled by default. + + + +{% include_example dataframe_with_arrow python/sql/arrow.py %} + + + +Using the above optimizations with Arrow will produce the same results as when Arrow is not +enabled. Note that even with Arrow, `toPandas()` results in the collection of all records in the +DataFrame to the driver program and should be done on a small subset of the data. Not all Spark +data types are currently supported and an error can be raised if a column has an unsupported type, +see [Supported Types](#supported-sql-arrow-types). If an error occurs during `createDataFrame()`, +Spark will fall back to create the DataFrame without Arrow. + +## Pandas UDFs (a.k.a. Vectorized UDFs) + +Pandas UDFs are user defined functions that are executed by Spark using Arrow to transfer data and +Pandas to work with the data. A Pandas UDF is defined using the keyword `pandas_udf` as a decorator +or to wrap the function, no additional configuration is required. Currently, there are two types of +Pandas UDF: Scalar and Group Map. + +### Scalar + +Scalar Pandas UDFs are used for vectorizing scalar operations. They can be used with functions such +as `select` and `withColumn`. The Python function should take `pandas.Series` as inputs and return +a `pandas.Series` of the same length. Internally, Spark will execute a Pandas UDF by splitting +columns into batches and calling the function for each batch as a subset of the data, then +concatenating the results together. + +The following example shows how to create a scalar Pandas UDF that computes the product of 2 columns. 
+ + + +{% include_example scalar_pandas_udf python/sql/arrow.py %} + + + +### Group Map +Group map Pandas UDFs are used with `groupBy().apply()` which implements the "split-apply-combine" pattern. +Split-apply-combine consists of three steps: +* Split the data into groups by using `DataFrame.groupBy`. +* Apply a function on each group. The input and output of the function are both `pandas.DataFrame`. The + input data contains all the rows and columns for each group. +* Combine the results into a new `DataFrame`. + +To use `groupBy().apply()`, the user needs to define the following: +* A Python function that defines the computation for each group. +* A `StructType` object or a string that defines the schema of the output `DataFrame`. + +The following example shows how to use `groupby().apply()` to subtract the mean from each value in the group. + + + +{% include_example group_map_pandas_udf python/sql/arrow.py %} + + + +For detailed usage, please see [`pyspark.sql.functions.pandas_udf`](api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf) and +[`pyspark.sql.GroupedData.apply`](api/python/pyspark.sql.html#pyspark.sql.GroupedData.apply). + +## Usage Notes + +###
[GitHub] spark pull request #19575: [SPARK-22221][DOCS] Adding User Documentation for...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19575#discussion_r164251821 --- Diff: docs/sql-programming-guide.md --- @@ -1640,6 +1640,133 @@ Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` a You may run `./bin/spark-sql --help` for a complete list of all available options. +# PySpark Usage Guide for Pandas with Apache Arrow + +## Apache Arrow in Spark + +Apache Arrow is an in-memory columnar data format that is used in Spark to efficiently transfer +data between JVM and Python processes. This currently is most beneficial to Python users that +work with Pandas/NumPy data. Its usage is not automatic and might require some minor +changes to configuration or code to take full advantage and ensure compatibility. This guide will +give a high-level description of how to use Arrow in Spark and highlight any differences when +working with Arrow-enabled data. + +### Ensure PyArrow Installed + +If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the +SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow +is installed and available on all cluster nodes. The current supported version is 0.8.0. +You can install using pip or conda from the conda-forge channel. See PyArrow +[installation](https://arrow.apache.org/docs/python/install.html) for details. + +## Enabling for Conversion to/from Pandas + +Arrow is available as an optimization when converting a Spark DataFrame to Pandas using the call +`toPandas()` and when creating a Spark DataFrame from Pandas with `createDataFrame(pandas_df)`. +To use Arrow when executing these calls, users need to first set the Spark configuration +'spark.sql.execution.arrow.enabled' to 'true'. This is disabled by default. + + + +{% include_example dataframe_with_arrow python/sql/arrow.py %} + + + +Using the above optimizations with Arrow will produce the same results as when Arrow is not +enabled. Note that even with Arrow, `toPandas()` results in the collection of all records in the +DataFrame to the driver program and should be done on a small subset of the data. Not all Spark +data types are currently supported and an error can be raised if a column has an unsupported type, +see [Supported Types](#supported-sql-arrow-types). If an error occurs during `createDataFrame()`, +Spark will fall back to create the DataFrame without Arrow. + +## Pandas UDFs (a.k.a. Vectorized UDFs) + +Pandas UDFs are user defined functions that are executed by Spark using Arrow to transfer data and +Pandas to work with the data. A Pandas UDF is defined using the keyword `pandas_udf` as a decorator +or to wrap the function, no additional configuration is required. Currently, there are two types of +Pandas UDF: Scalar and Group Map. + +### Scalar + +Scalar Pandas UDFs are used for vectorizing scalar operations. They can be used with functions such +as `select` and `withColumn`. The Python function should take `pandas.Series` as inputs and return +a `pandas.Series` of the same length. Internally, Spark will execute a Pandas UDF by splitting +columns into batches and calling the function for each batch as a subset of the data, then +concatenating the results together. + +The following example shows how to create a scalar Pandas UDF that computes the product of 2 columns. 
+ + + +{% include_example scalar_pandas_udf python/sql/arrow.py %} + + + +### Group Map +Group map Pandas UDFs are used with `groupBy().apply()` which implements the "split-apply-combine" pattern. +Split-apply-combine consists of three steps: +* Split the data into groups by using `DataFrame.groupBy`. +* Apply a function on each group. The input and output of the function are both `pandas.DataFrame`. The + input data contains all the rows and columns for each group. +* Combine the results into a new `DataFrame`. + +To use `groupBy().apply()`, the user needs to define the following: +* A Python function that defines the computation for each group. +* A `StructType` object or a string that defines the schema of the output `DataFrame`. + +The following example shows how to use `groupby().apply()` to subtract the mean from each value in the group. + + + +{% include_example group_map_pandas_udf python/sql/arrow.py %} + + + +For detailed usage, please see [`pyspark.sql.functions.pandas_udf`](api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf) and +[`pyspark.sql.GroupedData.apply`](api/python/pyspark.sql.html#pyspark.sql.GroupedData.apply). + +## Usage Notes + +###
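The quoted guide refers to the `dataframe_with_arrow` example in `python/sql/arrow.py`. As a rough, self-contained sketch of that conversion path (assuming an active `SparkSession` bound to `spark`; column names are illustrative and not taken from the PR):

```python
import numpy as np
import pandas as pd

# Opt in to Arrow for toPandas() and createDataFrame(pandas_df); off by default.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# Pandas -> Spark: createDataFrame(pandas_df) uses Arrow when enabled.
pdf = pd.DataFrame(np.random.rand(100, 3), columns=["a", "b", "c"])
df = spark.createDataFrame(pdf)

# Spark -> Pandas: toPandas() still collects everything to the driver,
# so only do this on a small subset of the data.
result_pdf = df.select("a", "b").toPandas()
```

If a column has a type Arrow does not support, `createDataFrame()` falls back to the non-Arrow path, as the guide notes.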
[GitHub] spark pull request #19575: [SPARK-22221][DOCS] Adding User Documentation for...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19575#discussion_r164251759 --- Diff: docs/sql-programming-guide.md --- @@ -1640,6 +1640,133 @@ Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` a You may run `./bin/spark-sql --help` for a complete list of all available options. +# PySpark Usage Guide for Pandas with Apache Arrow + +## Apache Arrow in Spark + +Apache Arrow is an in-memory columnar data format that is used in Spark to efficiently transfer +data between JVM and Python processes. This currently is most beneficial to Python users that +work with Pandas/NumPy data. Its usage is not automatic and might require some minor +changes to configuration or code to take full advantage and ensure compatibility. This guide will +give a high-level description of how to use Arrow in Spark and highlight any differences when +working with Arrow-enabled data. + +### Ensure PyArrow Installed + +If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the +SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow +is installed and available on all cluster nodes. The current supported version is 0.8.0. +You can install using pip or conda from the conda-forge channel. See PyArrow +[installation](https://arrow.apache.org/docs/python/install.html) for details. + +## Enabling for Conversion to/from Pandas + +Arrow is available as an optimization when converting a Spark DataFrame to Pandas using the call +`toPandas()` and when creating a Spark DataFrame from Pandas with `createDataFrame(pandas_df)`. +To use Arrow when executing these calls, users need to first set the Spark configuration +'spark.sql.execution.arrow.enabled' to 'true'. This is disabled by default. + + + +{% include_example dataframe_with_arrow python/sql/arrow.py %} + + + +Using the above optimizations with Arrow will produce the same results as when Arrow is not +enabled. Note that even with Arrow, `toPandas()` results in the collection of all records in the +DataFrame to the driver program and should be done on a small subset of the data. Not all Spark +data types are currently supported and an error can be raised if a column has an unsupported type, +see [Supported Types](#supported-sql-arrow-types). If an error occurs during `createDataFrame()`, +Spark will fall back to create the DataFrame without Arrow. + +## Pandas UDFs (a.k.a. Vectorized UDFs) + +Pandas UDFs are user defined functions that are executed by Spark using Arrow to transfer data and +Pandas to work with the data. A Pandas UDF is defined using the keyword `pandas_udf` as a decorator +or to wrap the function, no additional configuration is required. Currently, there are two types of +Pandas UDF: Scalar and Group Map. + +### Scalar + +Scalar Pandas UDFs are used for vectorizing scalar operations. They can be used with functions such +as `select` and `withColumn`. The Python function should take `pandas.Series` as inputs and return +a `pandas.Series` of the same length. Internally, Spark will execute a Pandas UDF by splitting +columns into batches and calling the function for each batch as a subset of the data, then +concatenating the results together. + +The following example shows how to create a scalar Pandas UDF that computes the product of 2 columns. 
+ + + +{% include_example scalar_pandas_udf python/sql/arrow.py %} + + + +### Group Map +Group map Pandas UDFs are used with `groupBy().apply()` which implements the "split-apply-combine" pattern. +Split-apply-combine consists of three steps: +* Split the data into groups by using `DataFrame.groupBy`. +* Apply a function on each group. The input and output of the function are both `pandas.DataFrame`. The + input data contains all the rows and columns for each group. +* Combine the results into a new `DataFrame`. + +To use `groupBy().apply()`, the user needs to define the following: +* A Python function that defines the computation for each group. +* A `StructType` object or a string that defines the schema of the output `DataFrame`. + +The following example shows how to use `groupby().apply()` to subtract the mean from each value in the group. + + + +{% include_example group_map_pandas_udf python/sql/arrow.py %} + + + +For detailed usage, please see [`pyspark.sql.functions.pandas_udf`](api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf) and +[`pyspark.sql.GroupedData.apply`](api/python/pyspark.sql.html#pyspark.sql.GroupedData.apply). + +## Usage Notes + +###
[GitHub] spark pull request #20412: [SPARK-23242][SS][Tests]Don't run tests in KafkaS...
Github user zsxwing closed the pull request at: https://github.com/apache/spark/pull/20412 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20412: [SPARK-23242][SS][Tests]Don't run tests in KafkaSourceSu...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/20412 Thanks! Merging to master and 2.3. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20412: [SPARK-23242][SS][Tests]Don't run tests in KafkaSourceSu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20412 **[Test build #4078 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4078/testReport)** for PR 20412 at commit [`bd124b3`](https://github.com/apache/spark/commit/bd124b3c23a60200c83c25b6c786795a188bb9aa). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `abstract class KafkaSourceSuiteBase extends KafkaSourceTest ` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20405: [SPARK-23229][SQL] Dataset.hint should use planWi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20405#discussion_r164250206 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -1216,7 +1216,7 @@ class Dataset[T] private[sql]( */ @scala.annotation.varargs def hint(name: String, parameters: Any*): Dataset[T] = withTypedPlan { -UnresolvedHint(name, parameters, logicalPlan) +UnresolvedHint(name, parameters, planWithBarrier) --- End diff -- I think @viirya has a valid concern. think about ``` val df1 = spark.table("t").select("id") df1.hint("broadcast", "id") ``` We should transform down the plan of `df1`, find the bottom table relation and apply the hint. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19575: [SPARK-22221][DOCS] Adding User Documentation for...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19575#discussion_r164250138 --- Diff: docs/sql-programming-guide.md --- @@ -1640,6 +1640,133 @@ Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` a You may run `./bin/spark-sql --help` for a complete list of all available options. +# PySpark Usage Guide for Pandas with Apache Arrow + +## Apache Arrow in Spark + +Apache Arrow is an in-memory columnar data format that is used in Spark to efficiently transfer +data between JVM and Python processes. This currently is most beneficial to Python users that +work with Pandas/NumPy data. Its usage is not automatic and might require some minor +changes to configuration or code to take full advantage and ensure compatibility. This guide will +give a high-level description of how to use Arrow in Spark and highlight any differences when +working with Arrow-enabled data. + +### Ensure PyArrow Installed + +If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the +SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow +is installed and available on all cluster nodes. The current supported version is 0.8.0. +You can install using pip or conda from the conda-forge channel. See PyArrow +[installation](https://arrow.apache.org/docs/python/install.html) for details. + +## Enabling for Conversion to/from Pandas + +Arrow is available as an optimization when converting a Spark DataFrame to Pandas using the call +`toPandas()` and when creating a Spark DataFrame from Pandas with `createDataFrame(pandas_df)`. +To use Arrow when executing these calls, users need to first set the Spark configuration +'spark.sql.execution.arrow.enabled' to 'true'. This is disabled by default. + + + +{% include_example dataframe_with_arrow python/sql/arrow.py %} + + + +Using the above optimizations with Arrow will produce the same results as when Arrow is not +enabled. Note that even with Arrow, `toPandas()` results in the collection of all records in the +DataFrame to the driver program and should be done on a small subset of the data. Not all Spark +data types are currently supported and an error can be raised if a column has an unsupported type, +see [Supported Types](#supported-sql-arrow-types). If an error occurs during `createDataFrame()`, +Spark will fall back to create the DataFrame without Arrow. + +## Pandas UDFs (a.k.a. Vectorized UDFs) + +Pandas UDFs are user defined functions that are executed by Spark using Arrow to transfer data and +Pandas to work with the data. A Pandas UDF is defined using the keyword `pandas_udf` as a decorator +or to wrap the function, no additional configuration is required. Currently, there are two types of +Pandas UDF: Scalar and Group Map. + +### Scalar + +Scalar Pandas UDFs are used for vectorizing scalar operations. They can be used with functions such +as `select` and `withColumn`. The Python function should take `pandas.Series` as inputs and return +a `pandas.Series` of the same length. Internally, Spark will execute a Pandas UDF by splitting +columns into batches and calling the function for each batch as a subset of the data, then +concatenating the results together. + +The following example shows how to create a scalar Pandas UDF that computes the product of 2 columns. 
+ + + +{% include_example scalar_pandas_udf python/sql/arrow.py %} + + + +### Group Map +Group map Pandas UDFs are used with `groupBy().apply()` which implements the "split-apply-combine" pattern. --- End diff -- @rxin WDYT about this name? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
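As a hedged, self-contained sketch of the two UDF flavours the quoted text describes (assuming an active `SparkSession` bound to `spark`; the column names are made up, and the `GROUP_MAP` enum value simply follows the "Group Map" wording quoted above, so the name in the released API may differ):

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

# Scalar: pandas.Series in, pandas.Series of the same length out.
@pandas_udf("double", PandasUDFType.SCALAR)
def multiply(a, b):
    return a * b

df = spark.createDataFrame([(1.0, 4.0), (2.0, 5.0)], ("x", "y"))
df.select(multiply("x", "y")).show()

# Group map: each group arrives as a pandas.DataFrame, and a pandas.DataFrame
# matching the declared schema is returned.
@pandas_udf("id long, v double", PandasUDFType.GROUP_MAP)
def subtract_mean(pdf):
    return pdf.assign(v=pdf.v - pdf.v.mean())

df2 = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))
df2.groupby("id").apply(subtract_mean).show()
```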
[GitHub] spark pull request #20395: [SPARK-23218][SQL] simplify ColumnVector.getArray
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/20395#discussion_r164249345 --- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java --- @@ -450,13 +439,11 @@ final boolean isNullAt(int rowId) { } @Override -final int getArrayLength(int rowId) { - return accessor.getInnerValueCountAt(rowId); -} - -@Override -final int getArrayOffset(int rowId) { - return accessor.getOffsetBuffer().getInt(rowId * accessor.OFFSET_WIDTH); +final ColumnarArray getArray(int rowId) { + int index = rowId * accessor.OFFSET_WIDTH; + int start = offsets.getInt(index); + int end = offsets.getInt(index + accessor.OFFSET_WIDTH); --- End diff -- yes, the code is correct. `offsetBuffer` will be sized to at least `(numRows + 1) * accessor.OFFSET_WIDTH` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
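For readers less familiar with Arrow's layout, the offset arithmetic discussed above can be pictured with a plain-Python sketch (the values are made up; the real logic lives in the Java `ArrowColumnVector.getArray`):

```python
# With numRows = 4, the offset buffer holds numRows + 1 entries.
offsets = [0, 3, 3, 7, 9]
values = [10, 11, 12, 20, 21, 22, 23, 30, 31]

def get_array(row_id):
    # Mirrors reading the offsets at rowId and rowId + 1: the element range
    # for a row is [offsets[row_id], offsets[row_id + 1]).
    start, end = offsets[row_id], offsets[row_id + 1]
    return values[start:end]

print(get_array(0))  # [10, 11, 12]
print(get_array(1))  # []  (row 1 is an empty array)
print(get_array(3))  # [30, 31]
```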
[GitHub] spark pull request #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFa...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/20397#discussion_r164248501 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsScanColumnarBatch.java --- @@ -30,21 +30,21 @@ @InterfaceStability.Evolving public interface SupportsScanColumnarBatch extends DataSourceV2Reader { @Override - default List createReadTasks() { + default List createDataReaderFactories() { --- End diff -- We shall create only one `DataReaderFactory`, and have that create multiple data readers. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), un...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/20400#discussion_r164248098 --- Diff: python/pyspark/sql/window.py --- @@ -124,16 +124,19 @@ def rangeBetween(start, end): values directly. :param start: boundary start, inclusive. - The frame is unbounded if this is ``Window.unboundedPreceding``, or + The frame is unbounded if this is ``Window.unboundedPreceding``, + ``org.apache.spark.sql.catalyst.expressions.UnboundedPreceding``, or any value less than or equal to max(-sys.maxsize, -9223372036854775808). :param end: boundary end, inclusive. -The frame is unbounded if this is ``Window.unboundedFollowing``, or +The frame is unbounded if this is ``Window.unboundedFollowing``, + ``org.apache.spark.sql.catalyst.expressions.UnboundedPFollowing``, or any value greater than or equal to min(sys.maxsize, 9223372036854775807). """ -if start <= Window._PRECEDING_THRESHOLD: -start = Window.unboundedPreceding -if end >= Window._FOLLOWING_THRESHOLD: -end = Window.unboundedFollowing +if isinstance(start, int) and isinstance(end, int): --- End diff -- oh, thanks @HyukjinKwon , I'm not familiar with python :) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20412: [SPARK-23242][SS][Tests]Don't run tests in KafkaSourceSu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20412 **[Test build #4078 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4078/testReport)** for PR 20412 at commit [`bd124b3`](https://github.com/apache/spark/commit/bd124b3c23a60200c83c25b6c786795a188bb9aa). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), un...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20400#discussion_r164245801 --- Diff: python/pyspark/sql/window.py --- @@ -124,16 +124,19 @@ def rangeBetween(start, end): values directly. :param start: boundary start, inclusive. - The frame is unbounded if this is ``Window.unboundedPreceding``, or + The frame is unbounded if this is ``Window.unboundedPreceding``, + ``org.apache.spark.sql.catalyst.expressions.UnboundedPreceding``, or any value less than or equal to max(-sys.maxsize, -9223372036854775808). :param end: boundary end, inclusive. -The frame is unbounded if this is ``Window.unboundedFollowing``, or +The frame is unbounded if this is ``Window.unboundedFollowing``, + ``org.apache.spark.sql.catalyst.expressions.UnboundedPFollowing``, or any value greater than or equal to min(sys.maxsize, 9223372036854775807). """ -if start <= Window._PRECEDING_THRESHOLD: -start = Window.unboundedPreceding -if end >= Window._FOLLOWING_THRESHOLD: -end = Window.unboundedFollowing +if isinstance(start, int) and isinstance(end, int): --- End diff -- FYI, Python 3 doesn't have `long` and it was merged to `int`. We should do `long` here and assign `int` to `long` in Python 3. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
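A small sketch of the compatibility pattern suggested above (the alias trick is the usual idiom; the helper name is made up for illustration):

```python
import sys

if sys.version >= "3":
    long = int  # Python 3 folded `long` into `int`

def is_integral(value):
    # On Python 2, a literal such as 9223372036854775808 is a `long`, not an
    # `int`, so isinstance(value, int) alone would reject valid boundaries.
    return isinstance(value, (int, long))

print(is_integral(10))                   # True
print(is_integral(9223372036854775808))  # True on both Python 2 and 3
print(is_integral(1.5))                  # False
```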
[GitHub] spark pull request #20410: [SPARK-23234][ML][PYSPARK] Remove setting default...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/20410#discussion_r164243811 --- Diff: python/pyspark/ml/wrapper.py --- @@ -118,10 +118,9 @@ def _transfer_params_to_java(self): """ Transforms the embedded params to the companion Java object. """ -paramMap = self.extractParamMap() for param in self.params: -if param in paramMap: -pair = self._make_java_param_pair(param, paramMap[param]) +if param in self._paramMap: +pair = self._make_java_param_pair(param, self._paramMap[param]) --- End diff -- The intent would be clearer if you called ``` if self.isSet(param): pair = self._make_java_param_pair(param, self.getOrDefault(param)) ``` And to be 100% consistent with Scala, it might be a good idea to transfer default values anyway. That way, if the Python default were ever changed or differed from the Scala one, it wouldn't cause an issue that would be really hard to detect. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20409: [SPARK-23233][PYTHON] Reset the cache in asNondeterminis...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20409 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20398: [SPARK-23221][SS][TEST] Fix KafkaContinuousSourceStressF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20398 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20398: [SPARK-23221][SS][TEST] Fix KafkaContinuousSourceStressF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20398 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86723/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20393: [SPARK-23207][SQL] Shuffle+Repartition on a DataF...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20393 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20398: [SPARK-23221][SS][TEST] Fix KafkaContinuousSourceStressF...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20398 **[Test build #86723 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86723/testReport)** for PR 20398 at commit [`3093d91`](https://github.com/apache/spark/commit/3093d919ada739fca10b1fcc73e9b2c620eeeb16). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on a DataFrame co...
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/20393 I opened https://issues.apache.org/jira/browse/SPARK-23243 to track the RDD.repartition() patch. Thanks for all the discussion! @shivaram @mridulm @sameeragarwal @gatorsmile --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on a DataFrame co...
Github user sameeragarwal commented on the issue: https://github.com/apache/spark/pull/20393 LGTM but we should get a broader consensus on this. In the meantime, I'm merging this patch to master/2.3. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20409: [SPARK-23233][PYTHON] Reset the cache in asNondet...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20409#discussion_r164242518 --- Diff: python/pyspark/sql/udf.py --- @@ -188,6 +188,9 @@ def asNondeterministic(self): .. versionadded:: 2.3 """ +# Here, we explicitly clean the cache to create a JVM UDF instance +# with 'deterministic' updated. See SPARK-23233. +self._judf_placeholder = None --- End diff -- Do you mean the reproducer above? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
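For context, a rough sketch of the caching problem this line addresses (assuming an active `SparkSession` bound to `spark`; the authoritative reproducer is the one in the PR description):

```python
from pyspark.sql.functions import udf

f = udf(lambda x: x, "string")
spark.range(1).select(f("id")).collect()  # first use builds and caches the JVM UDF (deterministic)

f = f.asNondeterministic()
# Without resetting _judf_placeholder, the cached JVM UDF would be reused here
# and still report itself as deterministic.
spark.range(1).select(f("id")).collect()
```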
[GitHub] spark issue #20408: [SPARK-23189][Core][Web UI] Reflect stage level blacklis...
Github user ajbozarth commented on the issue: https://github.com/apache/spark/pull/20408 I'll take a look at the code when I have a moment, but from a UI perspective I have only one issue. Having the status of `Blacklisted in Stages: [...]` when an executor is Active could easily be misunderstood. I think it should still say `Active`, though `Active (Blacklisted in Stages: [...])` is a bit wordy, so I'm not sure of the best way to present it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on a DataFrame co...
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/20393 Updated the title. Does this PR sound good now? I'll open another one to address the RDD.repartition() issue (which will target 2.4). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on a DataFrame co...
Github user sameeragarwal commented on the issue: https://github.com/apache/spark/pull/20393 Another (possibly cleaner) approach here would be to make the shuffle block fetch order deterministic but I agree that it might not be safe to include it in 2.3 this late. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20412: [SPARK-23242][SS][Tests]Don't run tests in KafkaSourceSu...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/20412 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20412: [SPARK-23242][SS][Tests]Don't run tests in KafkaSourceSu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20412 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86724/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20412: [SPARK-23242][SS][Tests]Don't run tests in KafkaSourceSu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20412 **[Test build #86724 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86724/testReport)** for PR 20412 at commit [`bd124b3`](https://github.com/apache/spark/commit/bd124b3c23a60200c83c25b6c786795a188bb9aa). * This patch **fails MiMa tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `abstract class KafkaSourceSuiteBase extends KafkaSourceTest ` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20412: [SPARK-23242][SS][Tests]Don't run tests in KafkaSourceSu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20412 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20396: [SPARK-23217][ML] Add cosine distance measure to Cluster...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20396 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20410: [SPARK-23234][ML][PYSPARK] Remove setting defaults on Ja...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20410 I think that the problem is not SPARK-22797. The problem is that before this PR, the Python API considers all parameters with a default value as Defined but not Set, while the Scala/Java class representing them considers them all as Set. This has come up in this case, but it can cause other problems, both now and in the future, because it creates an inconsistency between the Python API and its representation in the JVM backend. Thus I do believe that this PR is needed, and it is not only a fix for the test failures. I think this is a first step; a second step would be to later drop all the `setDefault` calls in the Python API in favor of retrieving the defaults from the JVM backend. In this way, we would be sure there is no logical inconsistency between the API and the backend. Unfortunately, that second part is much bigger and has a large impact, so I think it would need a design doc or something similar. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
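To make the Defined-vs-Set distinction concrete, here is a small illustration (using `LogisticRegression` purely as an example of a Params holder; it shows the Python-side behaviour discussed above):

```python
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression()   # maxIter has a default of 100 and is never set by the user
param = lr.maxIter

print(lr.isDefined(param))     # True:  a default value exists
print(lr.isSet(param))         # False: the user did not set it explicitly
print(lr.getOrDefault(param))  # 100
# The inconsistency described above arises when such a default is transferred
# to the JVM object, which then reports the param as explicitly set.
```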
[GitHub] spark pull request #20332: [SPARK-23138][ML][DOC] Multiclass logistic regres...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/20332#discussion_r164237753 --- Diff: docs/ml-classification-regression.md --- @@ -111,10 +110,9 @@ Continuing the earlier example: [`LogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/LogisticRegressionTrainingSummary.html) provides a summary for a [`LogisticRegressionModel`](api/java/org/apache/spark/ml/classification/LogisticRegressionModel.html). -Currently, only binary classification is supported and the -summary must be explicitly cast to -[`BinaryLogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/BinaryLogisticRegressionTrainingSummary.html). -Support for multiclass model summaries will be added in the future. +In the case of binary classification, certain additional metrics are --- End diff -- Now `BinaryLogisticRegressionTrainingSummary` inherits `LogisticRegressionSummary`, so it inherits all the metrics in `LogisticRegressionSummary`. We'd better mention them in the doc. :) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20403: [SPARK-23238][PYTHON] Externalize SQLConf spark.sql.exec...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20403 **[Test build #86726 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86726/testReport)** for PR 20403 at commit [`1f4d288`](https://github.com/apache/spark/commit/1f4d2884ba5b56e06427ce3d91cb6ac5f8f2b7b6). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20403: [SPARK-23238][PYTHON] Externalize SQLConf spark.sql.exec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20403 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/302/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20403: [SPARK-23238][PYTHON] Externalize SQLConf spark.sql.exec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20403 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20208 Hi @rxin, @cloud-fan, @sameeragarwal, @HyukjinKwon. Could you give me your opinions on this PR? I know that Xiao Li is busy during this period, so I didn't ping him. For me, this PR is important. Sorry for bothering you all. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20403: [SPARK-23238][PYTHON] Externalize SQLConf spark.sql.exec...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/20403 this has been failing due to #19892 which was recently reverted --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20403: [SPARK-23238][PYTHON] Externalize SQLConf spark.sql.exec...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/20403 jenkins retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20412: [SPARK-23242][SS][Tests]Don't run tests in KafkaSourceSu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20412 **[Test build #86724 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86724/testReport)** for PR 20412 at commit [`bd124b3`](https://github.com/apache/spark/commit/bd124b3c23a60200c83c25b6c786795a188bb9aa). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20208 **[Test build #86725 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86725/testReport)** for PR 20208 at commit [`29c281d`](https://github.com/apache/spark/commit/29c281dbe3c6f63614d9abc286c68e283786649b). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20208 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/301/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20208 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20407: [SPARK-23124][SQL] Allow to disable BroadcastNestedLoopJ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20407 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/300/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20412: [SPARK-23242][SS][Tests]Don't run tests in KafkaSourceSu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20412 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20407: [SPARK-23124][SQL] Allow to disable BroadcastNestedLoopJ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20407 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20412: [SPARK-23242][SS][Tests]Don't run tests in KafkaSourceSu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20412 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/299/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20412: [SPARK-23242][SS][Tests]Don't run tests in KafkaSourceSu...
Github user tdas commented on the issue: https://github.com/apache/spark/pull/20412 LGTM. My bad, I should have caught this in the original PR that @jose-torres made. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20208 Retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20407: [SPARK-23124][SQL] Allow to disable BroadcastNestedLoopJ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20407 **[Test build #86722 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86722/testReport)** for PR 20407 at commit [`074c342`](https://github.com/apache/spark/commit/074c34245d300901390d2d5ed74bb69e32539b8a). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20398: [SPARK-23221][SS][TEST] Fix KafkaContinuousSourceStressF...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20398 **[Test build #86723 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86723/testReport)** for PR 20398 at commit [`3093d91`](https://github.com/apache/spark/commit/3093d919ada739fca10b1fcc73e9b2c620eeeb16). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20412: [SPARK-23242][SS][Tests]Don't run tests in KafkaS...
GitHub user zsxwing opened a pull request: https://github.com/apache/spark/pull/20412 [SPARK-23242][SS][Tests]Don't run tests in KafkaSourceSuiteBase twice ## What changes were proposed in this pull request? KafkaSourceSuiteBase should be an abstract class; otherwise KafkaSourceSuiteBase itself will also run. ## How was this patch tested? Jenkins You can merge this pull request into a Git repository by running: $ git pull https://github.com/zsxwing/spark SPARK-23242 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20412.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20412 commit bd124b3c23a60200c83c25b6c786795a188bb9aa Author: Shixiong Zhu Date: 2018-01-26T22:04:12Z Don't run KafkaSourceSuiteBase --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20407: [SPARK-23124][SQL] Allow to disable BroadcastNestedLoopJ...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20407 Retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20407: [SPARK-23124][SQL] Allow to disable BroadcastNestedLoopJ...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20407 SPARK-23234 is reverted now. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/20393 @jiangxb1987 Btw, we could argue this is a correctness issue since we added repartition - so not necessarily blocker :-) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org