[GitHub] [spark] cloud-fan commented on a change in pull request #29771: [SPARK-32635][SQL] Fix foldable propagation

2020-09-16 Thread GitBox


cloud-fan commented on a change in pull request #29771:
URL: https://github.com/apache/spark/pull/29771#discussion_r489988813



##
File path: 
sql/catalyst/src/main/scala-2.12/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala
##
@@ -26,6 +26,8 @@ object AttributeMap {
   def apply[A](kvs: Seq[(Attribute, A)]): AttributeMap[A] = {
     new AttributeMap(kvs.map(kv => (kv._1.exprId, kv)).toMap)
   }
+
+  def empty[A]: AttributeMap[A] = new AttributeMap(Map.empty)

Review comment:
   We should add it in the scala-2.13 source tree as well.
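   
   A minimal sketch (hedged, illustrative) of the matching addition for the scala-2.13 source tree, assuming the 2.13 `AttributeMap` exposes the same private constructor as the 2.12 one:
   
   ```scala
   // Hypothetical: sql/catalyst/src/main/scala-2.13/.../AttributeMap.scala
   def empty[A]: AttributeMap[A] = new AttributeMap(Map.empty)
   ```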








[GitHub] [spark] sunchao commented on a change in pull request #29775: [SPARK-24994][SQL][FOLLOW-UP] Handle foldable, timezone and cleanup

2020-09-16 Thread GitBox


sunchao commented on a change in pull request #29775:
URL: https://github.com/apache/spark/pull/29775#discussion_r489988859



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparison.scala
##
@@ -103,9 +103,9 @@ object UnwrapCastInBinaryComparison extends Rule[LogicalPlan] {
     // In case both sides have integral type, optimize the comparison by removing casts or
     // moving cast to the literal side.
     case be @ BinaryComparison(
-      Cast(fromExp, toType: IntegralType, _), Literal(value, literalType))
+      Cast(fromExp, toType: IntegralType, tz), Literal(value, literalType))
         if canImplicitlyCast(fromExp, toType, literalType) =>
-      simplifyIntegralComparison(be, fromExp, toType, value)
+      simplifyIntegralComparison(be, fromExp, toType, value, tz)

Review comment:
   (oops, just found out my comment was not sent out successfully)
   
   This is because `ResolveTimeZone` will try to add timezone info to all expressions that don't have it during query analysis. However, since the `Cast` expr was generated at the optimization phase, it will not have the timezone info. As a result, `PlanTest.comparePlans` will fail because of the mismatch. I can try to come up with a test if necessary.
   
   On the other hand, I think instead of using `Cast`, we may just directly use the value, since the `Cast` will be optimized away by `ConstantFolding` later anyway. What do you think?
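   
   A hedged sketch of that alternative (illustrative names, not the PR's code): fold the cast on the literal eagerly instead of emitting a `Cast` node for `ConstantFolding` to remove later.
   
   ```scala
   import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
   import org.apache.spark.sql.types.DataType
   
   // Evaluate the cast immediately and wrap the result back into a Literal,
   // so the rewritten comparison carries a plain literal on one side.
   def castedLiteral(value: Any, from: DataType, to: DataType, tz: Option[String]): Literal = {
     val folded = Cast(Literal(value, from), to, tz).eval()
     Literal(folded, to)
   }
   ```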








[GitHub] [spark] cloud-fan commented on a change in pull request #29767: [SPARK-32896][SS] Add DataStreamWriter.table API

2020-09-16 Thread GitBox


cloud-fan commented on a change in pull request #29767:
URL: https://github.com/apache/spark/pull/29767#discussion_r489988109



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala
##
@@ -300,97 +301,108 @@ final class DataStreamWriter[T] private[sql](ds: Dataset[T]) {
         "write files of Hive data source directly.")
     }
 
-    if (source == "memory") {
-      assertNotPartitioned("memory")
-      if (extraOptions.get("queryName").isEmpty) {
-        throw new AnalysisException("queryName must be specified for memory sink")
-      }
-      val sink = new MemorySink()
-      val resultDf = Dataset.ofRows(df.sparkSession, new MemoryPlan(sink, df.schema.toAttributes))
-      val chkpointLoc = extraOptions.get("checkpointLocation")
-      val recoverFromChkpoint = outputMode == OutputMode.Complete()
-      val query = df.sparkSession.sessionState.streamingQueryManager.startQuery(
-        extraOptions.get("queryName"),
-        chkpointLoc,
-        df,
-        extraOptions.toMap,
-        sink,
-        outputMode,
-        useTempCheckpointLocation = true,
-        recoverFromCheckpointLocation = recoverFromChkpoint,
-        trigger = trigger)
-      resultDf.createOrReplaceTempView(query.name)
-      query
-    } else if (source == "foreach") {
-      assertNotPartitioned("foreach")
-      val sink = ForeachWriterTable[T](foreachWriter, ds.exprEnc)
-      df.sparkSession.sessionState.streamingQueryManager.startQuery(
-        extraOptions.get("queryName"),
-        extraOptions.get("checkpointLocation"),
-        df,
-        extraOptions.toMap,
-        sink,
-        outputMode,
-        useTempCheckpointLocation = true,
-        trigger = trigger)
-    } else if (source == "foreachBatch") {
-      assertNotPartitioned("foreachBatch")
-      if (trigger.isInstanceOf[ContinuousTrigger]) {
-        throw new AnalysisException("'foreachBatch' is not supported with continuous trigger")
-      }
-      val sink = new ForeachBatchSink[T](foreachBatchWriter, ds.exprEnc)
-      df.sparkSession.sessionState.streamingQueryManager.startQuery(
-        extraOptions.get("queryName"),
-        extraOptions.get("checkpointLocation"),
-        df,
-        extraOptions.toMap,
-        sink,
-        outputMode,
-        useTempCheckpointLocation = true,
-        trigger = trigger)
-    } else {
-      val cls = DataSource.lookupDataSource(source, df.sparkSession.sessionState.conf)
-      val disabledSources = df.sparkSession.sqlContext.conf.disabledV2StreamingWriters.split(",")
-      val useV1Source = disabledSources.contains(cls.getCanonicalName) ||
-        // file source v2 does not support streaming yet.
-        classOf[FileDataSourceV2].isAssignableFrom(cls)
-
-      val optionsWithPath = if (path.isEmpty) {
-        extraOptions
-      } else {
-        extraOptions + ("path" -> path.get)
-      }
+    val queryName = extraOptions.get("queryName")
+    val checkpointLocation = extraOptions.get("checkpointLocation")
+    val useTempCheckpointLocation = SOURCES_ALLOW_ONE_TIME_QUERY.contains(source)
+
+    val (sink, resultDf, recoverFromCheckpoint, newOptions) = {
+      if (source == SOURCE_NAME_TABLE) {
+        assertNotPartitioned("table")
+
+        import df.sparkSession.sessionState.analyzer.{NonSessionCatalogAndIdentifier, SessionCatalogAndIdentifier}
+
+        import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
+        val tableInstance = df.sparkSession.sessionState.sqlParser
+          .parseMultipartIdentifier(tableName) match {
+
+          case NonSessionCatalogAndIdentifier(catalog, ident) =>
+            catalog.asTableCatalog.loadTable(ident)
+
+          case SessionCatalogAndIdentifier(catalog, ident) =>
+            catalog.asTableCatalog.loadTable(ident)
+
+          case other =>
+            throw new AnalysisException(
+              s"Couldn't find a catalog to handle the identifier ${other.quoted}.")
+        }
 
-      val sink = if (classOf[TableProvider].isAssignableFrom(cls) && !useV1Source) {
-        val provider = cls.getConstructor().newInstance().asInstanceOf[TableProvider]
-        val sessionOptions = DataSourceV2Utils.extractSessionConfigs(
-          source = provider, conf = df.sparkSession.sessionState.conf)
-        val finalOptions = sessionOptions.filterKeys(!optionsWithPath.contains(_)).toMap ++
-          optionsWithPath.originalMap
-        val dsOptions = new CaseInsensitiveStringMap(finalOptions.asJava)
-        val table = DataSourceV2Utils.getTableFromProvider(
-          provider, dsOptions, userSpecifiedSchema = None)
         import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Implicits._
-        table match {
-          case table: SupportsWrite if table.supports(STREAMING_WRITE) =>
-            table
-          case _ => createV1Sink(optionsWithPath)
+        val sink = tableInstance match {
+          case t: SupportsWrite if
[GitHub] [spark] HyukjinKwon commented on a change in pull request #29781: [SPARK-32189][DOCS][PYTHON] Development - Setting PySpark with PyCharm

2020-09-16 Thread GitBox


HyukjinKwon commented on a change in pull request #29781:
URL: https://github.com/apache/spark/pull/29781#discussion_r489987359



##
File path: python/docs/source/development/setting.rst
##
@@ -0,0 +1,58 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+..    http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+==================
+Setting up PySpark
+==================
+
+This section describes how to setup PySpark on PyCharm.
+It guides step by step to the process of downloading the source code from GitHub and running the test code successfully.
+
+Firstly, download the Spark source code from GitHub using git url. You can download the source code by simply using ``git clone`` command as shown below.
+If you want to download the code from any forked repository rather than spark original repository, please change the url properly.
+
+.. code-block:: bash
+
+    $ git clone https://github.com/apache/spark.git
+
+When the download is completed, go to the ``spark`` directory and build the package.
+SBT build is generally much faster than others. More details about the build are documented `here `_.
+
+.. code-block:: bash
+
+    $ ./build/sbt package
+
+After building is finished, run PyCharm and select the path ``spark/python``.
+
+.. image:: ../../../../docs/img/pycharm-with-pyspark1.png
+    :alt: Setup PyCharm with PySpark
+
+
+Let's go to the path ``python/pyspark/tests`` in PyCharm and try to run the any test like ``test_join.py``.
+You might can see the ``KeyError: 'SPARK_HOME'`` because the environment variable has not been set yet.
+
+Go **Run -> Edit Configurations**, and set the environment variables as below.
+Please make sure to specify your own path for ``SPARK_HOME`` rather than ``/.../spark``. After completing the variable, click **Apply** to apply the changes.

Review comment:
   Looks like the image has to be updated too








[GitHub] [spark] HyukjinKwon commented on pull request #29781: [SPARK-32186][DOCS][PYTHON] Development - Setting PySpark with PyCharm

2020-09-16 Thread GitBox


HyukjinKwon commented on pull request #29781:
URL: https://github.com/apache/spark/pull/29781#issuecomment-693923179


   Nice, thanks @itholic. Can you also add a link to this page at 
https://github.com/apache/spark/blob/master/python/docs/source/development/debugging.rst#remote-debugging-pycharm-professional ?






[GitHub] [spark] HyukjinKwon commented on a change in pull request #29781: [SPARK-32186][DOCS][PYTHON] Development - Setting PySpark with PyCharm

2020-09-16 Thread GitBox


HyukjinKwon commented on a change in pull request #29781:
URL: https://github.com/apache/spark/pull/29781#discussion_r489985646



##
File path: python/docs/source/development/setting.rst
##
@@ -0,0 +1,58 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+..    http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+==================
+Setting up PySpark
+==================
+
+This section describes how to setup PySpark on PyCharm.
+It guides step by step to the process of downloading the source code from GitHub and running the test code successfully.
+
+Firstly, download the Spark source code from GitHub using git url. You can download the source code by simply using ``git clone`` command as shown below.
+If you want to download the code from any forked repository rather than spark original repository, please change the url properly.
+
+.. code-block:: bash
+
+    $ git clone https://github.com/apache/spark.git
+
+When the download is completed, go to the ``spark`` directory and build the package.
+SBT build is generally much faster than others. More details about the build are documented `here `_.
+
+.. code-block:: bash
+
+    $ ./build/sbt package
+
+After building is finished, run PyCharm and select the path ``spark/python``.
+
+.. image:: ../../../../docs/img/pycharm-with-pyspark1.png
+    :alt: Setup PyCharm with PySpark

Review comment:
   Can we have a different alternative texts for each image?








[GitHub] [spark] HyukjinKwon commented on a change in pull request #29781: [SPARK-32186][DOCS][PYTHON] Development - Setting PySpark with PyCharm

2020-09-16 Thread GitBox


HyukjinKwon commented on a change in pull request #29781:
URL: https://github.com/apache/spark/pull/29781#discussion_r489985515



##
File path: python/docs/source/development/setting.rst
##
@@ -0,0 +1,58 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+..    http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+==================
+Setting up PySpark
+==================
+
+This section describes how to setup PySpark on PyCharm.
+It guides step by step to the process of downloading the source code from GitHub and running the test code successfully.
+
+Firstly, download the Spark source code from GitHub using git url. You can download the source code by simply using ``git clone`` command as shown below.
+If you want to download the code from any forked repository rather than spark original repository, please change the url properly.
+
+.. code-block:: bash
+
+    $ git clone https://github.com/apache/spark.git
+
+When the download is completed, go to the ``spark`` directory and build the package.
+SBT build is generally much faster than others. More details about the build are documented `here `_.
+
+.. code-block:: bash
+
+    $ ./build/sbt package
+
+After building is finished, run PyCharm and select the path ``spark/python``.
+
+.. image:: ../../../../docs/img/pycharm-with-pyspark1.png
+    :alt: Setup PyCharm with PySpark
+
+
+Let's go to the path ``python/pyspark/tests`` in PyCharm and try to run the any test like ``test_join.py``.
+You might can see the ``KeyError: 'SPARK_HOME'`` because the environment variable has not been set yet.
+
+Go **Run -> Edit Configurations**, and set the environment variables as below.
+Please make sure to specify your own path for ``SPARK_HOME`` rather than ``/.../spark``. After completing the variable, click **Apply** to apply the changes.
+
+.. image:: ../../../../docs/img/pycharm-with-pyspark2.png
+    :alt: Setup PyCharm with PySpark
+
+
+Once ``SPARK_HOME`` is set properly, you'll be able to see the **Tests passed** when you run test again.

Review comment:
   `you'll be able to see the **Tests passed** when you run test again.` -> 
`you will be able to run the tests properly as below:`








[GitHub] [spark] cloud-fan commented on pull request #29775: [SPARK-24994][SQL][FOLLOW-UP] Handle foldable, timezone and cleanup

2020-09-16 Thread GitBox


cloud-fan commented on pull request #29775:
URL: https://github.com/apache/spark/pull/29775#issuecomment-693920672


   > pass timezone info to the generated cast on the literal value
   
   I'd also like to understand this more. `Cast.canonicalize` will drop the 
timezone if it's not needed, so `UnwrapCastInBinaryComparison` shouldn't care 
about timezone.
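   
   A hedged illustration of that claim, assuming an integral-to-integral cast (which needs no timezone):
   
   ```scala
   import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Cast}
   import org.apache.spark.sql.types.{IntegerType, LongType}
   
   val c = AttributeReference("c", LongType)()
   val withTz = Cast(c, IntegerType, timeZoneId = Some("UTC"))
   val noTz = Cast(c, IntegerType, timeZoneId = None)
   // If canonicalization drops the unneeded timezone, these should be equal:
   assert(withTz.canonicalized == noTz.canonicalized)
   ```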






[GitHub] [spark] HyukjinKwon commented on a change in pull request #29781: [SPARK-32186][DOCS][PYTHON] Development - Setting PySpark with PyCharm

2020-09-16 Thread GitBox


HyukjinKwon commented on a change in pull request #29781:
URL: https://github.com/apache/spark/pull/29781#discussion_r489985388



##
File path: python/docs/source/development/setting.rst
##
@@ -0,0 +1,58 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+..    http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+==================
+Setting up PySpark
+==================
+
+This section describes how to setup PySpark on PyCharm.
+It guides step by step to the process of downloading the source code from GitHub and running the test code successfully.
+
+Firstly, download the Spark source code from GitHub using git url. You can download the source code by simply using ``git clone`` command as shown below.
+If you want to download the code from any forked repository rather than spark original repository, please change the url properly.
+
+.. code-block:: bash
+
+    $ git clone https://github.com/apache/spark.git
+
+When the download is completed, go to the ``spark`` directory and build the package.
+SBT build is generally much faster than others. More details about the build are documented `here `_.
+
+.. code-block:: bash
+
+    $ ./build/sbt package
+
+After building is finished, run PyCharm and select the path ``spark/python``.
+
+.. image:: ../../../../docs/img/pycharm-with-pyspark1.png
+    :alt: Setup PyCharm with PySpark
+
+
+Let's go to the path ``python/pyspark/tests`` in PyCharm and try to run the any test like ``test_join.py``.
+You might can see the ``KeyError: 'SPARK_HOME'`` because the environment variable has not been set yet.
+
+Go **Run -> Edit Configurations**, and set the environment variables as below.
+Please make sure to specify your own path for ``SPARK_HOME`` rather than ``/.../spark``. After completing the variable, click **Apply** to apply the changes.

Review comment:
   I think we should click `**Okay**` so the dialog can be closed?








[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29761: [SPARK-32889][SQL] orc table column name supports special characters.

2020-09-16 Thread GitBox


dongjoon-hyun edited a comment on pull request #29761:
URL: https://github.com/apache/spark/pull/29761#issuecomment-693912369


   @jzc928 . I left a few comments. Please update the PR accordingly. Although this is different from Parquet, it is the same as the JSON data source. So, I think we can accept this approach after revising the PR and passing Jenkins CI tests.






[GitHub] [spark] HyukjinKwon commented on a change in pull request #29781: [SPARK-32186][DOCS][PYTHON] Development - Setting PySpark with PyCharm

2020-09-16 Thread GitBox


HyukjinKwon commented on a change in pull request #29781:
URL: https://github.com/apache/spark/pull/29781#discussion_r489985125



##
File path: python/docs/source/development/setting.rst
##
@@ -0,0 +1,58 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+..    http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+==================
+Setting up PySpark
+==================
+
+This section describes how to setup PySpark on PyCharm.
+It guides step by step to the process of downloading the source code from GitHub and running the test code successfully.
+
+Firstly, download the Spark source code from GitHub using git url. You can download the source code by simply using ``git clone`` command as shown below.
+If you want to download the code from any forked repository rather than spark original repository, please change the url properly.
+
+.. code-block:: bash
+
+    $ git clone https://github.com/apache/spark.git
+
+When the download is completed, go to the ``spark`` directory and build the package.
+SBT build is generally much faster than others. More details about the build are documented `here `_.
+
+.. code-block:: bash
+
+    $ ./build/sbt package

Review comment:
   `./build/sbt package` -> `build/sbt package` for consistency








[GitHub] [spark] AmplabJenkins removed a comment on pull request #29761: [SPARK-32889][SQL] orc table column name supports special characters.

2020-09-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29761:
URL: https://github.com/apache/spark/pull/29761#issuecomment-693918836










[GitHub] [spark] AmplabJenkins commented on pull request #29781: [SPARK-32186][DOCS][PYTHON] Development - Setting PySpark with PyCharm

2020-09-16 Thread GitBox


AmplabJenkins commented on pull request #29781:
URL: https://github.com/apache/spark/pull/29781#issuecomment-693918935










[GitHub] [spark] AmplabJenkins removed a comment on pull request #29781: [SPARK-32186][DOCS][PYTHON] Development - Setting PySpark with PyCharm

2020-09-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29781:
URL: https://github.com/apache/spark/pull/29781#issuecomment-693918935










[GitHub] [spark] AmplabJenkins commented on pull request #29761: [SPARK-32889][SQL] orc table column name supports special characters.

2020-09-16 Thread GitBox


AmplabJenkins commented on pull request #29761:
URL: https://github.com/apache/spark/pull/29761#issuecomment-693918836










[GitHub] [spark] HyukjinKwon commented on a change in pull request #29781: [SPARK-32186][DOCS][PYTHON] Development - Setting PySpark with PyCharm

2020-09-16 Thread GitBox


HyukjinKwon commented on a change in pull request #29781:
URL: https://github.com/apache/spark/pull/29781#discussion_r489984826



##
File path: python/docs/source/development/index.rst
##
@@ -25,3 +25,4 @@ Development
    contributing
    testing
    debugging
+   setting

Review comment:
   `setting` -> `setting_ide`

##
File path: python/docs/source/development/setting.rst
##
@@ -0,0 +1,58 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+..    http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+==================
+Setting up PySpark

Review comment:
   PySpark -> PyCharm

##
File path: python/docs/source/development/setting.rst
##
@@ -0,0 +1,58 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+..    http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+==================
+Setting up PySpark
+==================
+
+This section describes how to setup PySpark on PyCharm.
+It guides step by step to the process of downloading the source code from GitHub and running the test code successfully.
+
+Firstly, download the Spark source code from GitHub using git url. You can download the source code by simply using ``git clone`` command as shown below.
+If you want to download the code from any forked repository rather than spark original repository, please change the url properly.
+
+.. code-block:: bash
+
+    $ git clone https://github.com/apache/spark.git

Review comment:
   Let's remove `$` for consistency.








[GitHub] [spark] SparkQA commented on pull request #29761: [SPARK-32889][SQL] orc table column name supports special characters.

2020-09-16 Thread GitBox


SparkQA commented on pull request #29761:
URL: https://github.com/apache/spark/pull/29761#issuecomment-693916933


   **[Test build #128797 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128797/testReport)**
 for PR 29761 at commit 
[`c3c7f4c`](https://github.com/apache/spark/commit/c3c7f4cbd7d9b16ae4ebccd73d1f8d03f4446e8a).






[GitHub] [spark] Ngone51 commented on a change in pull request #29773: [SPARK-32287][CORE] Fix flaky o.a.s.ExecutorAllocationManagerSuite on GithubActions

2020-09-16 Thread GitBox


Ngone51 commented on a change in pull request #29773:
URL: https://github.com/apache/spark/pull/29773#discussion_r489984741



##
File path: core/src/main/scala/org/apache/spark/internal/config/Tests.scala
##
@@ -26,11 +26,11 @@ private[spark] object Tests {
       .longConf
       .createWithDefault(Runtime.getRuntime.maxMemory)
 
-  val TEST_SCHEDULE_INTERVAL =
-    ConfigBuilder("spark.testing.dynamicAllocation.scheduleInterval")
-      .version("2.3.0")
-      .longConf
-      .createWithDefault(100)
+  val TEST_DYNAMIC_ALLOCATION_SCHEDULE_ENABLED =

Review comment:
   ok








[GitHub] [spark] SparkQA commented on pull request #29781: [SPARK-32186][DOCS][PYTHON] Development - Setting PySpark with PyCharm

2020-09-16 Thread GitBox


SparkQA commented on pull request #29781:
URL: https://github.com/apache/spark/pull/29781#issuecomment-693916737


   **[Test build #128796 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128796/testReport)**
 for PR 29781 at commit 
[`b5feb02`](https://github.com/apache/spark/commit/b5feb02cdceaf7ddbfa64e3ec06d885a736800c6).






[GitHub] [spark] cloud-fan commented on a change in pull request #29773: [SPARK-32287][CORE] Fix flaky o.a.s.ExecutorAllocationManagerSuite on GithubActions

2020-09-16 Thread GitBox


cloud-fan commented on a change in pull request #29773:
URL: https://github.com/apache/spark/pull/29773#discussion_r489983915



##
File path: core/src/main/scala/org/apache/spark/internal/config/Tests.scala
##
@@ -26,11 +26,11 @@ private[spark] object Tests {
       .longConf
      .createWithDefault(Runtime.getRuntime.maxMemory)
 
-  val TEST_SCHEDULE_INTERVAL =
-    ConfigBuilder("spark.testing.dynamicAllocation.scheduleInterval")
-      .version("2.3.0")
-      .longConf
-      .createWithDefault(100)
+  val TEST_DYNAMIC_ALLOCATION_SCHEDULE_ENABLED =

Review comment:
   shall we turn it on by default then? I think we only need to disable it 
for one specific suite. This is also the behavior before this PR.
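   
   A sketch of the "on by default" variant being suggested (key name and version are illustrative, mirroring the `ConfigBuilder` pattern in the diff above):
   
   ```scala
   val TEST_DYNAMIC_ALLOCATION_SCHEDULE_ENABLED =
     ConfigBuilder("spark.testing.dynamicAllocation.schedule.enabled")
       .version("3.1.0")
       .booleanConf
       .createWithDefault(true)  // enabled by default; one suite opts out
   ```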








[GitHub] [spark] dongjoon-hyun commented on pull request #29761: [SPARK-32889][SQL] orc table column name supports special characters.

2020-09-16 Thread GitBox


dongjoon-hyun commented on pull request #29761:
URL: https://github.com/apache/spark/pull/29761#issuecomment-693912369


   @jzc928 . I left a few comments. Please update the PR accordingly. Although this is different from Parquet, it is the same as the JSON data source. So, I think we can accept this approach after revising the PR and passing Jenkins CI tests.






[GitHub] [spark] Ngone51 commented on a change in pull request #29773: [SPARK-32287][CORE] Fix flaky o.a.s.ExecutorAllocationManagerSuite on GithubActions

2020-09-16 Thread GitBox


Ngone51 commented on a change in pull request #29773:
URL: https://github.com/apache/spark/pull/29773#discussion_r489983365



##
File path: core/src/main/scala/org/apache/spark/internal/config/Tests.scala
##
@@ -26,11 +26,11 @@ private[spark] object Tests {
       .longConf
       .createWithDefault(Runtime.getRuntime.maxMemory)
 
-  val TEST_SCHEDULE_INTERVAL =
-    ConfigBuilder("spark.testing.dynamicAllocation.scheduleInterval")
-      .version("2.3.0")
-      .longConf
-      .createWithDefault(100)
+  val TEST_DYNAMIC_ALLOCATION_SCHEDULE_ENABLED =

Review comment:
   When a new test needs it in the future... we don't need it now. Do you suggest removing it?








[GitHub] [spark] itholic opened a new pull request #29781: [SPARK-32186][DOCS][PYTHON] Development - Setting PySpark with PyCharm

2020-09-16 Thread GitBox


itholic opened a new pull request #29781:
URL: https://github.com/apache/spark/pull/29781


   ### What changes were proposed in this pull request?
   
   This PR proposes to document the way of setting up PySpark with PyCharm.
   
   ![Screenshot 2020-09-17 2 40 34 PM](https://user-images.githubusercontent.com/44108233/93424837-cd1a0f80-f8f3-11ea-8496-5f000f0229d1.png)
   ![Screenshot 2020-09-17 2 40 50 PM](https://user-images.githubusercontent.com/44108233/93424845-cf7c6980-f8f3-11ea-93a4-fa9258a7d940.png)
   
   
   ### Why are the changes needed?
   
   To let users know how to set up PySpark with PyCharm.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, it adds a new page in the documentation about setting up PySpark.
   
   ### How was this patch tested?
   
   Manually built the doc.






[GitHub] [spark] viirya commented on pull request #29776: [SPARK-32903][SQL] GeneratePredicate should be able to eliminate common sub-expressions

2020-09-16 Thread GitBox


viirya commented on pull request #29776:
URL: https://github.com/apache/spark/pull/29776#issuecomment-693907967


   Thanks all!






[GitHub] [spark] dongjoon-hyun commented on pull request #29761: [SPARK-32889][SQL] orc table column name supports special characters.

2020-09-16 Thread GitBox


dongjoon-hyun commented on pull request #29761:
URL: https://github.com/apache/spark/pull/29761#issuecomment-693908501


   Retest this please.






[GitHub] [spark] cloud-fan closed pull request #29776: [SPARK-32903][SQL] GeneratePredicate should be able to eliminate common sub-expressions

2020-09-16 Thread GitBox


cloud-fan closed pull request #29776:
URL: https://github.com/apache/spark/pull/29776


   






[GitHub] [spark] cloud-fan commented on pull request #29776: [SPARK-32903][SQL] GeneratePredicate should be able to eliminate common sub-expressions

2020-09-16 Thread GitBox


cloud-fan commented on pull request #29776:
URL: https://github.com/apache/spark/pull/29776#issuecomment-693906765


   thanks, merging to master!






[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29761: [SPARK-32889][SQL] orc table column name supports special characters.

2020-09-16 Thread GitBox


dongjoon-hyun commented on a change in pull request #29761:
URL: https://github.com/apache/spark/pull/29761#discussion_r489981140



##
File path: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
##
@@ -2206,39 +2206,63 @@ abstract class SQLQuerySuiteBase extends QueryTest with SQLTestUtils with TestHi
     }
   }
 
-  test("SPARK-21912 ORC/Parquet table should not create invalid column names") {
+  test("SPARK-21912 Parquet table should not create invalid column names") {
     Seq(" ", ",", ";", "{", "}", "(", ")", "\n", "\t", "=").foreach { name =>
-      Seq("ORC", "PARQUET").foreach { source =>
-        withTable("t21912") {
-          val m = intercept[AnalysisException] {
-            sql(s"CREATE TABLE t21912(`col$name` INT) USING $source")
-          }.getMessage
-          assert(m.contains(s"contains invalid character(s)"))
+      val source = "PARQUET"
+      withTable("t21912") {
+        val m = intercept[AnalysisException] {
+          sql(s"CREATE TABLE t21912(`col$name` INT) USING $source")
+        }.getMessage
+        assert(m.contains(s"contains invalid character(s)"))
 
-          val m1 = intercept[AnalysisException] {
-            sql(s"CREATE TABLE t21912 STORED AS $source AS SELECT 1 `col$name`")
-          }.getMessage
-          assert(m1.contains(s"contains invalid character(s)"))
+        val m1 = intercept[AnalysisException] {
+          sql(s"CREATE TABLE t21912 STORED AS $source AS SELECT 1 `col$name`")
+        }.getMessage
+        assert(m1.contains(s"contains invalid character(s)"))
+
+        val m2 = intercept[AnalysisException] {
+          sql(s"CREATE TABLE t21912 USING $source AS SELECT 1 `col$name`")
+        }.getMessage
+        assert(m2.contains(s"contains invalid character(s)"))
 
-          val m2 = intercept[AnalysisException] {
-            sql(s"CREATE TABLE t21912 USING $source AS SELECT 1 `col$name`")
+        withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> "false") {
+          val m3 = intercept[AnalysisException] {
+            sql(s"CREATE TABLE t21912(`col$name` INT) USING hive OPTIONS (fileFormat '$source')")
           }.getMessage
-          assert(m2.contains(s"contains invalid character(s)"))
+          assert(m3.contains(s"contains invalid character(s)"))
+        }
 
-          withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> "false") {
-            val m3 = intercept[AnalysisException] {
-              sql(s"CREATE TABLE t21912(`col$name` INT) USING hive OPTIONS (fileFormat '$source')")
-            }.getMessage
-            assert(m3.contains(s"contains invalid character(s)"))
-          }
+        sql(s"CREATE TABLE t21912(`col` INT) USING $source")
+        val m4 = intercept[AnalysisException] {
+          sql(s"ALTER TABLE t21912 ADD COLUMNS(`col$name` INT)")
+        }.getMessage
+        assert(m4.contains(s"contains invalid character(s)"))
+      }
+    }
+  }
 
-      sql(s"CREATE TABLE t21912(`col` INT) USING $source")
-      val m4 = intercept[AnalysisException] {
-        sql(s"ALTER TABLE t21912 ADD COLUMNS(`col$name` INT)")
-      }.getMessage
-      assert(m4.contains(s"contains invalid character(s)"))
+  test("SPARK-32889 ORC table column name supports special characters like $ eg.") {

Review comment:
   Could you use the following line?
   ```scala
   test("SPARK-32889: ORC table column name supports special characters") {
   ```








[GitHub] [spark] jiangxb1987 commented on pull request #29732: [SPARK-32857][CORE] Fix flaky o.a.s.s.BarrierTaskContextSuite.throw exception if the number of barrier() calls are not the same on every

2020-09-16 Thread GitBox


jiangxb1987 commented on pull request #29732:
URL: https://github.com/apache/spark/pull/29732#issuecomment-693895471


   can we simply resolve the issue by setting a longer barrier sync timeout?
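   
   For reference, a minimal sketch of that suggestion (the value is illustrative):
   
   ```scala
   sc.conf.set("spark.barrier.sync.timeout", "60")  // seconds; generous enough for slow CI
   ```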






[GitHub] [spark] cloud-fan commented on a change in pull request #29773: [SPARK-32287][CORE] Fix flaky o.a.s.ExecutorAllocationManagerSuite on GithubActions

2020-09-16 Thread GitBox


cloud-fan commented on a change in pull request #29773:
URL: https://github.com/apache/spark/pull/29773#discussion_r489980140



##
File path: core/src/main/scala/org/apache/spark/internal/config/Tests.scala
##
@@ -26,11 +26,11 @@ private[spark] object Tests {
       .longConf
       .createWithDefault(Runtime.getRuntime.maxMemory)
 
-  val TEST_SCHEDULE_INTERVAL =
-    ConfigBuilder("spark.testing.dynamicAllocation.scheduleInterval")
-      .version("2.3.0")
-      .longConf
-      .createWithDefault(100)
+  val TEST_DYNAMIC_ALLOCATION_SCHEDULE_ENABLED =

Review comment:
   when will we set this conf?








[GitHub] [spark] jiangxb1987 commented on a change in pull request #29732: [SPARK-32857][CORE] Fix flaky o.a.s.s.BarrierTaskContextSuite.throw exception if the number of barrier() calls are not the sa

2020-09-16 Thread GitBox


jiangxb1987 commented on a change in pull request #29732:
URL: https://github.com/apache/spark/pull/29732#discussion_r489980087



##
File path: 
core/src/test/scala/org/apache/spark/scheduler/BarrierTaskContextSuite.scala
##
@@ -189,30 +189,23 @@ class BarrierTaskContextSuite extends SparkFunSuite with LocalSparkContext with
 
   test("throw exception if the number of barrier() calls are not the same on every task") {
     initLocalClusterSparkContext()
-    sc.conf.set("spark.barrier.sync.timeout", "1")
+    sc.conf.set("spark.barrier.sync.timeout", "3")
     val rdd = sc.makeRDD(1 to 10, 4)
     val rdd2 = rdd.barrier().mapPartitions { it =>
       val context = BarrierTaskContext.get()
-      try {
-        if (context.taskAttemptId == 0) {

Review comment:
   We still need to ensure that the number of `barrier()` calls is different across the tasks.
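   
   A hedged sketch of one way to guarantee the mismatch without relying on timing (illustrative, not the PR's code): make the call count depend on the partition id.
   
   ```scala
   val rdd2 = rdd.barrier().mapPartitions { it =>
     val context = BarrierTaskContext.get()
     context.barrier()
     if (context.partitionId() != 0) {
       context.barrier()  // every task but one makes a second call
     }
     it
   }
   ```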








[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29761: [SPARK-32889][SQL] orc table column name supports special characters.

2020-09-16 Thread GitBox


dongjoon-hyun commented on a change in pull request #29761:
URL: https://github.com/apache/spark/pull/29761#discussion_r489977682



##
File path: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
##
@@ -2206,39 +2206,63 @@ abstract class SQLQuerySuiteBase extends QueryTest with SQLTestUtils with TestHi
     }
   }
 
-  test("SPARK-21912 ORC/Parquet table should not create invalid column names") {
+  test("SPARK-21912 Parquet table should not create invalid column names") {
     Seq(" ", ",", ";", "{", "}", "(", ")", "\n", "\t", "=").foreach { name =>
-      Seq("ORC", "PARQUET").foreach { source =>

Review comment:
   @jzc928 . ~Sorry, but Apache Parquet is the de facto standard in Apache Spark. Why do we need to support something which Apache Parquet doesn't support?~








[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29761: [SPARK-32889][SQL] orc table column name supports special characters.

2020-09-16 Thread GitBox


dongjoon-hyun commented on a change in pull request #29761:
URL: https://github.com/apache/spark/pull/29761#discussion_r489979937



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala
##
@@ -233,6 +233,19 @@ class FileBasedDataSourceSuite extends QueryTest
     }
   }
 
+  test("column name supports special characters using orc") {
+    val format = "orc"
+    Seq("$", " ", ",", ";", "{", "}", "(", ")", "\n", "\t", "=").foreach { name =>
+      withTempDir { dir =>
+        val dataDir = new File(dir, "file").getCanonicalPath
+        Seq(1).toDF(name).write.orc(dataDir)
+        val schema = spark.read.orc(dataDir).schema
+        assert(schema.size == 1)
+        assertResult(name)(schema(0).name)
+      }
+    }
+  }
+

Review comment:
   Please change like this.
   ```scala
 Seq("json", "orc").foreach { format =>
   test(s"SPARK-32889: column name supports special characters using 
$format") {
 Seq("$", " ", ",", ";", "{", "}", "(", ")", "\n", "\t", "=").foreach { 
name =>
   withTempDir { dir =>
 val dataDir = new File(dir, "file").getCanonicalPath
 Seq(1).toDF(name).write.format(format).save(dataDir)
 val schema = spark.read.format(format).load(dataDir).schema
 assert(schema.size == 1)
 assertResult(name)(schema.head.name)
   }
 }
   }
 }
   ```








[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29761: [SPARK-32889][SQL] orc table column name supports special characters.

2020-09-16 Thread GitBox


dongjoon-hyun commented on a change in pull request #29761:
URL: https://github.com/apache/spark/pull/29761#discussion_r489978849



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala
##
@@ -233,6 +233,19 @@ class FileBasedDataSourceSuite extends QueryTest
     }
   }
 
+  test("column name supports special characters using orc") {
+    val format = "orc"
+    Seq("$", " ", ",", ";", "{", "}", "(", ")", "\n", "\t", "=").foreach { name =>
+      withTempDir { dir =>
+        val dataDir = new File(dir, "file").getCanonicalPath
+        Seq(1).toDF(name).write.orc(dataDir)
+        val schema = spark.read.orc(dataDir).schema
+        assert(schema.size == 1)
+        assertResult(name)(schema(0).name)

Review comment:
   `schema.head` instead of `schema(0)`?








[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29761: [SPARK-32889][SQL] orc table column name supports special characters.

2020-09-16 Thread GitBox


dongjoon-hyun commented on a change in pull request #29761:
URL: https://github.com/apache/spark/pull/29761#discussion_r489977682



##
File path: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
##
@@ -2206,39 +2206,63 @@ abstract class SQLQuerySuiteBase extends QueryTest with SQLTestUtils with TestHi
     }
   }
 
-  test("SPARK-21912 ORC/Parquet table should not create invalid column names") {
+  test("SPARK-21912 Parquet table should not create invalid column names") {
     Seq(" ", ",", ";", "{", "}", "(", ")", "\n", "\t", "=").foreach { name =>
-      Seq("ORC", "PARQUET").foreach { source =>

Review comment:
   @jzc928 . Sorry, but Apache Parquet is the de facto standard in Apache Spark. Why do we need to support something which Apache Parquet doesn't support?








[GitHub] [spark] AmplabJenkins removed a comment on pull request #29747: [SPARK-31848][CORE][TEST] DAGSchedulerSuite: Break down the very huge test file

2020-09-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29747:
URL: https://github.com/apache/spark/pull/29747#issuecomment-693881706










[GitHub] [spark] AmplabJenkins commented on pull request #29747: [SPARK-31848][CORE][TEST] DAGSchedulerSuite: Break down the very huge test file

2020-09-16 Thread GitBox


AmplabJenkins commented on pull request #29747:
URL: https://github.com/apache/spark/pull/29747#issuecomment-693881706










[GitHub] [spark] SparkQA removed a comment on pull request #29747: [SPARK-31848][CORE][TEST] DAGSchedulerSuite: Break down the very huge test file

2020-09-16 Thread GitBox


SparkQA removed a comment on pull request #29747:
URL: https://github.com/apache/spark/pull/29747#issuecomment-693766657


   **[Test build #128786 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128786/testReport)**
 for PR 29747 at commit 
[`d4ce868`](https://github.com/apache/spark/commit/d4ce868465e671eda6eef3db042ac878be98b3d3).






[GitHub] [spark] SparkQA commented on pull request #29747: [SPARK-31848][CORE][TEST] DAGSchedulerSuite: Break down the very huge test file

2020-09-16 Thread GitBox


SparkQA commented on pull request #29747:
URL: https://github.com/apache/spark/pull/29747#issuecomment-693877716


   **[Test build #128786 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128786/testReport)**
 for PR 29747 at commit 
[`d4ce868`](https://github.com/apache/spark/commit/d4ce868465e671eda6eef3db042ac878be98b3d3).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29761: [SPARK-32889][SQL] orc table column name supports special characters.

2020-09-16 Thread GitBox


dongjoon-hyun commented on a change in pull request #29761:
URL: https://github.com/apache/spark/pull/29761#discussion_r489975451



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala
##
@@ -233,6 +233,19 @@ class FileBasedDataSourceSuite extends QueryTest
 }
   }
 
+  test("column name supports special characters using orc") {
+val format = "orc"

Review comment:
   This is unused.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] viirya commented on pull request #29772: [SPARK-32900][CORE] Allow UnsafeExternalSorter to spill when there are nulls.

2020-09-16 Thread GitBox


viirya commented on pull request #29772:
URL: https://github.com/apache/spark/pull/29772#issuecomment-693863796


   > Currently, Spark determines whether UnsafeExternalSorter.SpillableIterator 
has spilled already by checking whether upstream is an instance of 
UnsafeInMemorySorter.SortedIterator
   
   Can we update the description a bit?
   
   This reads as if Spark thinks `UnsafeExternalSorter.SpillableIterator` has spilled already when `upstream` is `UnsafeInMemorySorter.SortedIterator`. But it is actually the opposite: Spark thinks `UnsafeExternalSorter.SpillableIterator` has spilled already when `upstream` is not `UnsafeInMemorySorter.SortedIterator`, right?
   
   ```scala
   if (!(upstream instanceof UnsafeInMemorySorter.SortedIterator && nextUpstream == null
 && numRecords > 0)) {
 return 0L;
   }
   ```
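
   A hedged Scala sketch of the inverted, more readable condition (the boolean inputs stand in for the instance checks above; `doSpill` is a hypothetical stand-in for the actual spill work):

   ```scala
   // "already spilled" means upstream is no longer the in-memory sorted
   // iterator, so there is nothing left in memory to spill.
   def spillableBytes(
       upstreamIsInMemorySorted: Boolean,
       hasPendingUpstream: Boolean,
       numRecords: Long)(doSpill: () => Long): Long = {
     val alreadySpilled = !upstreamIsInMemorySorted
     if (alreadySpilled || hasPendingUpstream || numRecords <= 0) 0L else doSpill()
   }
   ```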
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29779: [SPARK-32180][PYTHON][DOCS][FOLLOW-UP] Rephrase and add some more information in installation guide

2020-09-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29779:
URL: https://github.com/apache/spark/pull/29779#issuecomment-693855631







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29779: [SPARK-32180][PYTHON][DOCS][FOLLOW-UP] Rephrase and add some more information in installation guide

2020-09-16 Thread GitBox


AmplabJenkins commented on pull request #29779:
URL: https://github.com/apache/spark/pull/29779#issuecomment-693855631







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #29779: [SPARK-32180][PYTHON][DOCS][FOLLOW-UP] Rephrase and add some more information in installation guide

2020-09-16 Thread GitBox


SparkQA removed a comment on pull request #29779:
URL: https://github.com/apache/spark/pull/29779#issuecomment-693803882


   **[Test build #128794 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128794/testReport)**
 for PR 29779 at commit 
[`bd35323`](https://github.com/apache/spark/commit/bd35323c39fdc66029d8be6768560e4b70a71fb3).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29779: [SPARK-32180][PYTHON][DOCS][FOLLOW-UP] Rephrase and add some more information in installation guide

2020-09-16 Thread GitBox


SparkQA commented on pull request #29779:
URL: https://github.com/apache/spark/pull/29779#issuecomment-693854205


   **[Test build #128794 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128794/testReport)**
 for PR 29779 at commit 
[`bd35323`](https://github.com/apache/spark/commit/bd35323c39fdc66029d8be6768560e4b70a71fb3).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29761: [SPARK-32889][SQL] orc table column name supports special characters.

2020-09-16 Thread GitBox


dongjoon-hyun commented on a change in pull request #29761:
URL: https://github.com/apache/spark/pull/29761#discussion_r489969657



##
File path: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
##
@@ -2206,39 +2206,63 @@ abstract class SQLQuerySuiteBase extends QueryTest with 
SQLTestUtils with TestHi
 }
   }
 
-  test("SPARK-21912 ORC/Parquet table should not create invalid column names") 
{
+  test("SPARK-21912 Parquet table should not create invalid column names") {

Review comment:
   Oh, I missed that you meant to change this.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29780: [SPARK-32906][SQL] Struct field names should not change after normalizing floats

2020-09-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29780:
URL: https://github.com/apache/spark/pull/29780#issuecomment-693833247







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29780: [SPARK-32906][SQL] Struct field names should not change after normalizing floats

2020-09-16 Thread GitBox


AmplabJenkins commented on pull request #29780:
URL: https://github.com/apache/spark/pull/29780#issuecomment-693833247







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29780: [SPARK-32906][SQL] Struct field names should not change after normalizing floats

2020-09-16 Thread GitBox


SparkQA commented on pull request #29780:
URL: https://github.com/apache/spark/pull/29780#issuecomment-693832064


   **[Test build #128795 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128795/testReport)**
 for PR 29780 at commit 
[`1bf4f32`](https://github.com/apache/spark/commit/1bf4f32924df6a3a06623dfcd2e06b6749c6ebad).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gwax commented on a change in pull request #29720: [SPARK-32849][PYSPARK] Add default values for non-required keys when creating StructType

2020-09-16 Thread GitBox


gwax commented on a change in pull request #29720:
URL: https://github.com/apache/spark/pull/29720#discussion_r489967059



##
File path: python/pyspark/sql/types.py
##
@@ -305,7 +305,7 @@ def jsonValue(self):
 @classmethod
 def fromJson(cls, json):

Review comment:
   Unless there are plans to remove `.fromJson`, it is a publicly exposed 
interface and, I dare say, a rather useful one.
   
   JSON is currently the only schema definition structure that is a) human 
readable, b) machine readable without `exec`, and c) easy to generate with 
anything other than Python / Java.
   
   As far as I can tell, this PR:
   - Adds additional test coverage for an existing component
   - Makes an existing component more flexible for some use cases
   - Does not reduce any existing functionality
   
   Getting to use cases, I have frequently found value in providing a machine-readable schema that can be validated with JSON Schema and used in unit and integration tests to verify the expected schema against a SQL file.
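
   For reference, a small Scala sketch of the round-trip this relies on (the same `fromJson` path exists on the JVM side):

   ```scala
   import org.apache.spark.sql.types._

   // Serialize a schema to JSON and parse it back; the JSON form is what
   // external tools can generate and validate with JSON Schema.
   val schema = StructType(Seq(
     StructField("id", LongType, nullable = false),
     StructField("name", StringType)))
   val restored = DataType.fromJson(schema.json).asInstanceOf[StructType]
   assert(restored == schema)
   ```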





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] maropu opened a new pull request #29780: [SPARK-32906][SQL] Struct field names should not change after normalizing floats

2020-09-16 Thread GitBox


maropu opened a new pull request #29780:
URL: https://github.com/apache/spark/pull/29780


   
   
   ### What changes were proposed in this pull request?
   
   This PR intends to fix a minor bug when normalizing floats for struct types;
   ```
   scala> import org.apache.spark.sql.execution.aggregate.HashAggregateExec
   scala> val df = Seq(Tuple1(Tuple1(-0.0d)), Tuple1(Tuple1(0.0d))).toDF("k")
   scala> val agg = df.distinct()
   scala> agg.explain()
   == Physical Plan ==
   *(2) HashAggregate(keys=[k#40], functions=[])
   +- Exchange hashpartitioning(k#40, 200), true, [id=#62]
      +- *(1) HashAggregate(keys=[knownfloatingpointnormalized(if (isnull(k#40)) null else named_struct(col1, knownfloatingpointnormalized(normalizenanandzero(k#40._1)))) AS k#40], functions=[])
         +- *(1) LocalTableScan [k#40]
   
   scala> val aggOutput = agg.queryExecution.sparkPlan.collect { case a: 
HashAggregateExec => a.output.head }
   scala> aggOutput.foreach { attr => println(attr.prettyJson) }
   ### Final Aggregate ###
   [ {
 "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
 "num-children" : 0,
 "name" : "k",
 "dataType" : {
   "type" : "struct",
   "fields" : [ {
 "name" : "_1",
   ^^^
 "type" : "double",
 "nullable" : false,
 "metadata" : { }
   } ]
 },
 "nullable" : true,
 "metadata" : { },
 "exprId" : {
   "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId",
   "id" : 40,
   "jvmId" : "a824e83f-933e-4b85-a1ff-577b5a0e2366"
 },
 "qualifier" : [ ]
   } ]
   
   ### Partial Aggregate ###
   [ {
 "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
 "num-children" : 0,
 "name" : "k",
 "dataType" : {
   "type" : "struct",
   "fields" : [ {
 "name" : "col1",
   
 "type" : "double",
 "nullable" : true,
 "metadata" : { }
   } ]
 },
 "nullable" : true,
 "metadata" : { },
 "exprId" : {
   "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId",
   "id" : 40,
   "jvmId" : "a824e83f-933e-4b85-a1ff-577b5a0e2366"
 },
 "qualifier" : [ ]
   } ]
   ```
   
   ### Why are the changes needed?
   
   bugfix.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Added tests.
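
   For reference, a hedged sketch (assumed SparkSession `spark` with `spark.implicits._` imported) of a regression check along the lines of the reproduction above:

   ```scala
   import org.apache.spark.sql.execution.aggregate.HashAggregateExec
   import org.apache.spark.sql.types.StructType

   // Every HashAggregateExec output attribute should keep the original
   // struct field name; seeing "col1" instead of "_1" indicates the bug.
   val df = Seq(Tuple1(Tuple1(-0.0d)), Tuple1(Tuple1(0.0d))).toDF("k")
   val aggAttrs = df.distinct().queryExecution.sparkPlan.collect {
     case a: HashAggregateExec => a.output.head
   }
   aggAttrs.foreach { attr =>
     val names = attr.dataType.asInstanceOf[StructType].fieldNames
     assert(names.sameElements(Array("_1")))
   }
   ```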



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ulysses-you commented on pull request #29749: [SPARK-32877][SQL] Fix Hive UDF not support decimal type in complex type

2020-09-16 Thread GitBox


ulysses-you commented on pull request #29749:
URL: https://github.com/apache/spark/pull/29749#issuecomment-693825397


   also cc @cloud-fan @dongjoon-hyun; this is a similar issue to [#13930](https://github.com/apache/spark/pull/13930)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #29774: [SPARK-32902][SQL] Logging plan changes for AQE

2020-09-16 Thread GitBox


cloud-fan commented on a change in pull request #29774:
URL: https://github.com/apache/spark/pull/29774#discussion_r489964455



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala
##
@@ -630,8 +639,18 @@ object AdaptiveSparkPlanExec {
   /**
* Apply a list of physical operator rules on a [[SparkPlan]].
*/
-  def applyPhysicalRules(plan: SparkPlan, rules: Seq[Rule[SparkPlan]]): 
SparkPlan = {
-rules.foldLeft(plan) { case (sp, rule) => rule.apply(sp) }
+  def applyPhysicalRules(
+  batchName: String,
+  plan: SparkPlan,
+  rules: Seq[Rule[SparkPlan]]): SparkPlan = {
+val planChangeLogger = new PlanChangeLogger[SparkPlan]()

Review comment:
   shall we create the `PlanChangeLogger` instance only once? (i.e., keep it as a class member)
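
   A hedged sketch of the suggestion (the object name is hypothetical; `logRule`/`logBatch` are assumed to match the catalyst `PlanChangeLogger` API):

   ```scala
   import org.apache.spark.sql.catalyst.rules.{PlanChangeLogger, Rule}
   import org.apache.spark.sql.execution.SparkPlan

   object AqeRuleApplier {
     // One logger for the lifetime of the object, not one per call.
     private lazy val planChangeLogger = new PlanChangeLogger[SparkPlan]()

     def applyPhysicalRules(batchName: String, plan: SparkPlan,
         rules: Seq[Rule[SparkPlan]]): SparkPlan = {
       val newPlan = rules.foldLeft(plan) { case (sp, rule) =>
         val result = rule.apply(sp)
         planChangeLogger.logRule(rule.ruleName, sp, result)  // assumed API
         result
       }
       planChangeLogger.logBatch(batchName, plan, newPlan)    // assumed API
       newPlan
     }
   }
   ```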





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29779: [SPARK-32180][PYTHON][DOCS][FOLLOW-UP] Rephrase and add some more information in installation guide

2020-09-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29779:
URL: https://github.com/apache/spark/pull/29779#issuecomment-693817842







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #29779: [SPARK-32180][PYTHON][DOCS][FOLLOW-UP] Rephrase and add some more information in installation guide

2020-09-16 Thread GitBox


SparkQA removed a comment on pull request #29779:
URL: https://github.com/apache/spark/pull/29779#issuecomment-693792818


   **[Test build #128792 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128792/testReport)**
 for PR 29779 at commit 
[`8d6634b`](https://github.com/apache/spark/commit/8d6634b50f06ab9259a470a5e3a3de46e616ed3f).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29779: [SPARK-32180][PYTHON][DOCS][FOLLOW-UP] Rephrase and add some more information in installation guide

2020-09-16 Thread GitBox


AmplabJenkins commented on pull request #29779:
URL: https://github.com/apache/spark/pull/29779#issuecomment-693817842







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29779: [SPARK-32180][PYTHON][DOCS][FOLLOW-UP] Rephrase and add some more information in installation guide

2020-09-16 Thread GitBox


SparkQA commented on pull request #29779:
URL: https://github.com/apache/spark/pull/29779#issuecomment-693816448


   **[Test build #128792 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128792/testReport)**
 for PR 29779 at commit 
[`8d6634b`](https://github.com/apache/spark/commit/8d6634b50f06ab9259a470a5e3a3de46e616ed3f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #29774: [SPARK-32902][SQL] Logging plan changes for AQE

2020-09-16 Thread GitBox


cloud-fan commented on a change in pull request #29774:
URL: https://github.com/apache/spark/pull/29774#discussion_r489964148



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala
##
@@ -534,7 +542,8 @@ case class AdaptiveSparkPlanExec(
 logicalPlan.invalidateStatsCache()
 val optimized = optimizer.execute(logicalPlan)
 val sparkPlan = 
context.session.sessionState.planner.plan(ReturnAnswer(optimized)).next()
-val newPlan = applyPhysicalRules(sparkPlan, preprocessingRules ++ 
queryStagePreparationRules)
+val newPlan = applyPhysicalRules(
+  "AQE Preparations", sparkPlan, preprocessingRules ++ 
queryStagePreparationRules)

Review comment:
   `AQE replanning`





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #29774: [SPARK-32902][SQL] Logging plan changes for AQE

2020-09-16 Thread GitBox


cloud-fan commented on a change in pull request #29774:
URL: https://github.com/apache/spark/pull/29774#discussion_r489963857



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala
##
@@ -413,19 +416,24 @@ case class AdaptiveSparkPlanExec(
   }
 
   private def newQueryStage(e: Exchange): QueryStageExec = {
-val optimizedPlan = applyPhysicalRules(e.child, queryStageOptimizerRules)
+val optimizedPlan = applyPhysicalRules(
+  "AQE Physical Plan Optimization", e.child, queryStageOptimizerRules)

Review comment:
   `AQE query stage optimization`





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29778: [SPARK-18409][ML][FOLLOWUP] LSH approxNearestNeighbors optimization 2

2020-09-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29778:
URL: https://github.com/apache/spark/pull/29778#issuecomment-693813335







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #29774: [SPARK-32902][SQL] Logging plan changes for AQE

2020-09-16 Thread GitBox


cloud-fan commented on a change in pull request #29774:
URL: https://github.com/apache/spark/pull/29774#discussion_r489963773



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala
##
@@ -231,7 +232,9 @@ case class AdaptiveSparkPlanExec(
 
   // Run the final plan when there's no more unfinished stages.
   currentPhysicalPlan = applyPhysicalRules(
-result.newPlan, queryStageOptimizerRules ++ postStageCreationRules)
+"AQE Physical Plan Optimization",

Review comment:
   `AQE final query stage optimization`





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29778: [SPARK-18409][ML][FOLLOWUP] LSH approxNearestNeighbors optimization 2

2020-09-16 Thread GitBox


AmplabJenkins commented on pull request #29778:
URL: https://github.com/apache/spark/pull/29778#issuecomment-693813335







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #29774: [SPARK-32902][SQL] Logging plan changes for AQE

2020-09-16 Thread GitBox


cloud-fan commented on a change in pull request #29774:
URL: https://github.com/apache/spark/pull/29774#discussion_r489963445



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala
##
@@ -231,7 +232,9 @@ case class AdaptiveSparkPlanExec(
 

Review comment:
   I think they are different. `spark.sql.adaptive.logLevel` controls the logging of plan changes for each AQE re-optimization round, while this PR logs the plan change before/after each rule.
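
   For reference, a sketch of how the two knobs would be set (assuming a SparkSession `spark`; the levels are just examples):

   ```scala
   // Per-re-optimization plan logging inside AQE:
   spark.conf.set("spark.sql.adaptive.logLevel", "DEBUG")
   // Per-rule before/after plan-change logging (what this PR extends to AQE):
   spark.conf.set("spark.sql.planChangeLog.level", "INFO")
   ```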





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #29778: [SPARK-18409][ML][FOLLOWUP] LSH approxNearestNeighbors optimization 2

2020-09-16 Thread GitBox


SparkQA removed a comment on pull request #29778:
URL: https://github.com/apache/spark/pull/29778#issuecomment-693780395


   **[Test build #128790 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128790/testReport)**
 for PR 29778 at commit 
[`78121a1`](https://github.com/apache/spark/commit/78121a1ce487666d540d047441831dea89f33e72).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29778: [SPARK-18409][ML][FOLLOWUP] LSH approxNearestNeighbors optimization 2

2020-09-16 Thread GitBox


SparkQA commented on pull request #29778:
URL: https://github.com/apache/spark/pull/29778#issuecomment-693811537


   **[Test build #128790 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128790/testReport)**
 for PR 29778 at commit 
[`78121a1`](https://github.com/apache/spark/commit/78121a1ce487666d540d047441831dea89f33e72).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29779: [SPARK-32180][PYTHON][DOCS][FOLLOW-UP] Rephrase and add some more information in installation guide

2020-09-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29779:
URL: https://github.com/apache/spark/pull/29779#issuecomment-693805704







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29779: [SPARK-32180][PYTHON][DOCS][FOLLOW-UP] Rephrase and add some more information in installation guide

2020-09-16 Thread GitBox


AmplabJenkins commented on pull request #29779:
URL: https://github.com/apache/spark/pull/29779#issuecomment-693805704







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29779: [SPARK-32180][PYTHON][DOCS][FOLLOW-UP] Rephrase and add some more information in installation guide

2020-09-16 Thread GitBox


SparkQA commented on pull request #29779:
URL: https://github.com/apache/spark/pull/29779#issuecomment-693803882


   **[Test build #128794 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128794/testReport)**
 for PR 29779 at commit 
[`bd35323`](https://github.com/apache/spark/commit/bd35323c39fdc66029d8be6768560e4b70a71fb3).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #29774: [SPARK-32902][SQL] Logging plan changes for AQE

2020-09-16 Thread GitBox


HyukjinKwon commented on a change in pull request #29774:
URL: https://github.com/apache/spark/pull/29774#discussion_r489957155



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala
##
@@ -231,7 +232,9 @@ case class AdaptiveSparkPlanExec(
 

Review comment:
   qq: should we remove `spark.sql.adaptive.logLevel`, @cloud-fan, 
@maryannxue, and @maropu? 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

2020-09-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-693796763







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

2020-09-16 Thread GitBox


AmplabJenkins commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-693796763







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

2020-09-16 Thread GitBox


SparkQA commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-693796417


   **[Test build #128793 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128793/testReport)**
 for PR 29703 at commit 
[`09997b7`](https://github.com/apache/spark/commit/09997b7c92d608ea675d86d9d6d28e641654dc9f).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

2020-09-16 Thread GitBox


HyukjinKwon commented on a change in pull request #29703:
URL: https://github.com/apache/spark/pull/29703#discussion_r489952888



##
File path: python/docs/source/getting_started/installation.rst
##
@@ -38,8 +38,36 @@ PySpark installation using `PyPI 
`_
 .. code-block:: bash
 
 pip install pyspark
-   
-Using Conda  
+
+For PySpark with different Hadoop and/or Hive, you can install it by using 
``HIVE_VERSION`` and ``HADOOP_VERSION`` environment variables as below:
+
+.. code-block:: bash
+
+HIVE_VERSION=2.3 pip install pyspark
+HADOOP_VERSION=2.7 pip install pyspark
+HIVE_VERSION=1.2 HADOOP_VERSION=2.7 pip install pyspark
+
+The default distribution has built-in Hadoop 3.2 and Hive 2.3. If users 
specify different versions, the pip installation automatically
+downloads a different version and use it in PySpark. Downloading it can take a 
while depending on the network and the mirror chosen.
+It is recommended to use `-v` option in `pip` to track the installation and 
download status.
+
+.. code-block:: bash
+
+HADOOP_VERSION=2.7 pip install pyspark -v
+
+Supported versions are as below:
+
+====================================== === =============
+``HADOOP_VERSION`` \\ ``HIVE_VERSION`` 1.2 2.3 (default)
+====================================== === =============
+**2.7**                                O   O
+**3.2 (default)**                      X   O
+**without**                            X   O
+====================================== === =============
+
+Note that this installation of PySpark with different versions of Hadoop and 
Hive is experimental. It can change or be removed betweem minor releases.

Review comment:
   ```suggestion
   Note that this installation of PySpark with different versions of Hadoop and 
Hive is experimental. It can change or be removed between minor releases.
   ```





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29779: [SPARK-32180][PYTHON][DOCS][FOLLOW-UP] Rephrase and add some more information in installation guide

2020-09-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29779:
URL: https://github.com/apache/spark/pull/29779#issuecomment-693793104







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29779: [SPARK-32180][PYTHON][DOCS][FOLLOW-UP] Rephrase and add some more information in installation guide

2020-09-16 Thread GitBox


AmplabJenkins commented on pull request #29779:
URL: https://github.com/apache/spark/pull/29779#issuecomment-693793104







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29779: [SPARK-32180][PYTHON][DOCS][FOLLOW-UP] Rephrase and add some more information in installation guide

2020-09-16 Thread GitBox


SparkQA commented on pull request #29779:
URL: https://github.com/apache/spark/pull/29779#issuecomment-693792818


   **[Test build #128792 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128792/testReport)**
 for PR 29779 at commit 
[`8d6634b`](https://github.com/apache/spark/commit/8d6634b50f06ab9259a470a5e3a3de46e616ed3f).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon opened a new pull request #29779: [SPARK-32180][PYTHON][DOCS][FOLLOW-UP] Rephrase and add some more information in installation guide

2020-09-16 Thread GitBox


HyukjinKwon opened a new pull request #29779:
URL: https://github.com/apache/spark/pull/29779


   ### What changes were proposed in this pull request?
   
   This PR:
   - Rephrases some wording in the installation guide to avoid potentially ambiguous terms such as "different flavors"
   - Document extra dependency installation `pip install pyspark[sql]`
   - Use the link that corresponds to the released version. e.g.) 
https://spark.apache.org/docs/latest/building-spark.html vs 
https://spark.apache.org/docs/3.0.0/building-spark.html
   - Add some more details
   
   ### Why are the changes needed?
   
   To improve installation guide.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, it updates the user-facing installation guide.
   
   ### How was this patch tested?
   
   Manually built the doc and tested.
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

2020-09-16 Thread GitBox


HyukjinKwon commented on a change in pull request #29703:
URL: https://github.com/apache/spark/pull/29703#discussion_r489937250



##
File path: python/pyspark/install.py
##
@@ -0,0 +1,170 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import os
+import re
+import tarfile
+import traceback
+import urllib.request
+from shutil import rmtree
+# NOTE that we shouldn't import pyspark here because this is used in
+# setup.py, and assume there's no PySpark imported.
+
+DEFAULT_HADOOP = "hadoop3.2"
+DEFAULT_HIVE = "hive2.3"
+SUPPORTED_HADOOP_VERSIONS = ["hadoop2.7", "hadoop3.2", "without-hadoop"]
+SUPPORTED_HIVE_VERSIONS = ["hive1.2", "hive2.3"]
+UNSUPPORTED_COMBINATIONS = [
+("without-hadoop", "hive1.2"),
+("hadoop3.2", "hive1.2"),
+]
+
+
+def checked_package_name(spark_version, hadoop_version, hive_version):
+if hive_version == "hive1.2":
+return "%s-bin-%s-%s" % (spark_version, hadoop_version, hive_version)
+else:
+return "%s-bin-%s" % (spark_version, hadoop_version)
+
+
+def checked_versions(spark_version, hadoop_version, hive_version):
+"""
+Check the valid combinations of supported versions in Spark distributions.
+
+:param spark_version: Spark version. It should be X.X.X such as '3.0.0' or 
spark-3.0.0.
+:param hadoop_version: Hadoop version. It should be X.X such as '2.7' or 
'hadoop2.7'.
+'without' and 'without-hadoop' are supported as special keywords for 
Hadoop free
+distribution.
+:param hive_version: Hive version. It should be X.X such as '1.2' or 
'hive1.2'.
+
+:return it returns fully-qualified versions of Spark, Hadoop and Hive in a 
tuple.
+For example, spark-3.0.0, hadoop3.2 and hive2.3.
+"""
+if re.match("^[0-9]+\\.[0-9]+\\.[0-9]+$", spark_version):
+spark_version = "spark-%s" % spark_version
+if not spark_version.startswith("spark-"):
+raise RuntimeError(
+"Spark version should start with 'spark-' prefix; however, "
+"got %s" % spark_version)
+
+if hadoop_version == "without":
+hadoop_version = "without-hadoop"

Review comment:
   It is verified by `if hadoop_version not in SUPPORTED_HADOOP_VERSIONS:` below. There are test cases here: https://github.com/apache/spark/pull/29703/files/033a33ee515b95342e8c5a74e63054d915661579#diff-e23af4eb5cc3bf6af4bc26cb801b7e84R69 and https://github.com/apache/spark/pull/29703/files/033a33ee515b95342e8c5a74e63054d915661579#diff-e23af4eb5cc3bf6af4bc26cb801b7e84R88
   
   Users can specify the Hadoop and Hive versions such as `hadoop3.2` and `hive2.3` as well, but I didn't document this. These keywords are actually ported from SparkR, as in `SparkR::install.spark`.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #29776: [SPARK-32903][SQL] GeneratePredicate should be able to eliminate common sub-expressions

2020-09-16 Thread GitBox


SparkQA removed a comment on pull request #29776:
URL: https://github.com/apache/spark/pull/29776#issuecomment-693705023


   **[Test build #128781 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128781/testReport)**
 for PR 29776 at commit 
[`a0c8466`](https://github.com/apache/spark/commit/a0c84664d56c484b2e5c6a9ced966d6a77760633).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29776: [SPARK-32903][SQL] GeneratePredicate should be able to eliminate common sub-expressions

2020-09-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29776:
URL: https://github.com/apache/spark/pull/29776#issuecomment-693788093







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29776: [SPARK-32903][SQL] GeneratePredicate should be able to eliminate common sub-expressions

2020-09-16 Thread GitBox


AmplabJenkins commented on pull request #29776:
URL: https://github.com/apache/spark/pull/29776#issuecomment-693788093







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29776: [SPARK-32903][SQL] GeneratePredicate should be able to eliminate common sub-expressions

2020-09-16 Thread GitBox


SparkQA commented on pull request #29776:
URL: https://github.com/apache/spark/pull/29776#issuecomment-693787595


   **[Test build #128781 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128781/testReport)**
 for PR 29776 at commit 
[`a0c8466`](https://github.com/apache/spark/commit/a0c84664d56c484b2e5c6a9ced966d6a77760633).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
 * `case class ExprWithEvaluatedState() extends LeafExpression with 
CodegenFallback `



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29762: [SPARK-32892][CORE][SQL] Fix hash functions on big-endian platforms.

2020-09-16 Thread GitBox


dongjoon-hyun commented on a change in pull request #29762:
URL: https://github.com/apache/spark/pull/29762#discussion_r489932308



##
File path: 
common/sketch/src/main/java/org/apache/spark/util/sketch/Murmur3_x86_32.java
##
@@ -92,8 +96,10 @@ private static int hashBytesByInt(Object base, long offset, 
int lengthInBytes, i
 int h1 = seed;
 for (int i = 0; i < lengthInBytes; i += 4) {
   int halfWord = Platform.getInt(base, offset + i);
-  int k1 = mixK1(halfWord);
-  h1 = mixH1(h1, k1);
+  if (isBigEndian) {

Review comment:
   cc @rednaxelafx 
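
   For context, a hedged Scala rendering of the endianness guard in the diff above (`Integer.reverseBytes` is an assumption about the normalization used):

   ```scala
   import java.nio.ByteOrder

   // On big-endian JVMs, reverse each 4-byte word so the hash matches the
   // little-endian byte layout the algorithm was written against.
   val isBigEndian = ByteOrder.nativeOrder() == ByteOrder.BIG_ENDIAN
   def normalizeWord(halfWord: Int): Int =
     if (isBigEndian) Integer.reverseBytes(halfWord) else halfWord
   ```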





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] viirya commented on a change in pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

2020-09-16 Thread GitBox


viirya commented on a change in pull request #29703:
URL: https://github.com/apache/spark/pull/29703#discussion_r489919482



##
File path: python/docs/source/getting_started/installation.rst
##
@@ -38,8 +38,36 @@ PySpark installation using `PyPI 
`_
 .. code-block:: bash
 
 pip install pyspark
-   
-Using Conda  
+
+For PySpark with different Hadoop and/or Hive, you can install it by using 
``HIVE_VERSION`` and ``HADOOP_VERSION`` environment variables as below:
+
+.. code-block:: bash
+
+HIVE_VERSION=2.3 pip install pyspark
+HADOOP_VERSION=2.7 pip install pyspark
+HIVE_VERSION=1.2 HADOOP_VERSION=2.7 pip install pyspark
+
+The default distribution has built-in Hadoop 3.2 and Hive 2.3. If users 
specify different versions, the pip installation automatically
+downloads a different version and use it in PySpark. Downloading it can take a 
while depending on the network and the mirror chosen.
+It is recommended to use `-v` option in `pip` to track the installation and 
download status.
+
+.. code-block:: bash
+
+HADOOP_VERSION=2.7 pip install pyspark -v
+
+Supported versions are as below:
+
+====================================== === =============
+``HADOOP_VERSION`` \\ ``HIVE_VERSION`` 1.2 2.3 (default)
+====================================== === =============
+**2.7**                                O   O
+**3.2 (default)**                      X   O
+**without**                            X   O
+====================================== === =============
+
+Note that this installation of PySpark with different versions of Hadoop and 
Hive is experimental. It can change or be removed betweem minor releases.

Review comment:
   betweem -> between

##
File path: python/pyspark/install.py
##
@@ -0,0 +1,170 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import os
+import re
+import tarfile
+import traceback
+import urllib.request
+from shutil import rmtree
+# NOTE that we shouldn't import pyspark here because this is used in
+# setup.py, and assume there's no PySpark imported.
+
+DEFAULT_HADOOP = "hadoop3.2"
+DEFAULT_HIVE = "hive2.3"
+SUPPORTED_HADOOP_VERSIONS = ["hadoop2.7", "hadoop3.2", "without-hadoop"]
+SUPPORTED_HIVE_VERSIONS = ["hive1.2", "hive2.3"]
+UNSUPPORTED_COMBINATIONS = [
+("without-hadoop", "hive1.2"),
+("hadoop3.2", "hive1.2"),
+]
+
+
+def checked_package_name(spark_version, hadoop_version, hive_version):
+if hive_version == "hive1.2":
+return "%s-bin-%s-%s" % (spark_version, hadoop_version, hive_version)
+else:
+return "%s-bin-%s" % (spark_version, hadoop_version)
+
+
+def checked_versions(spark_version, hadoop_version, hive_version):
+"""
+Check the valid combinations of supported versions in Spark distributions.
+
+:param spark_version: Spark version. It should be X.X.X such as '3.0.0' or 
spark-3.0.0.
+:param hadoop_version: Hadoop version. It should be X.X such as '2.7' or 
'hadoop2.7'.
+'without' and 'without-hadoop' are supported as special keywords for 
Hadoop free
+distribution.
+:param hive_version: Hive version. It should be X.X such as '1.2' or 
'hive1.2'.
+
+:return it returns fully-qualified versions of Spark, Hadoop and Hive in a 
tuple.
+For example, spark-3.0.0, hadoop3.2 and hive2.3.
+"""
+if re.match("^[0-9]+\\.[0-9]+\\.[0-9]+$", spark_version):
+spark_version = "spark-%s" % spark_version
+if not spark_version.startswith("spark-"):
+raise RuntimeError(
+"Spark version should start with 'spark-' prefix; however, "
+"got %s" % spark_version)
+
+if hadoop_version == "without":
+hadoop_version = "without-hadoop"

Review comment:
   Is "without-hadoop" also supported as special keyword? Seems not see it 
is matched here?





[GitHub] [spark] maropu commented on a change in pull request #29092: [SPARK-32295][SQL] Add not null and size > 0 filters before inner explode to benefit from predicate pushdown

2020-09-16 Thread GitBox


maropu commented on a change in pull request #29092:
URL: https://github.com/apache/spark/pull/29092#discussion_r489922948



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##
@@ -1847,3 +1848,25 @@ object OptimizeLimitZero extends Rule[LogicalPlan] {
   empty(ll)
   }
 }
+
+/**
+ * Generates filters for exploded expression, such that rows that would have 
been removed
+ * by this [[Generate]] can be removed earlier - before joins and in data 
sources.
+ */
+object InferFiltersFromGenerate extends Rule[LogicalPlan] {

Review comment:
   At first I thought we might be able to do so, but on second thought it looks okay as it is.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29764: [SPARK-32738][CORE][2.4] Should reduce the number of active threads if fatal error happens in `Inbox.process`

2020-09-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29764:
URL: https://github.com/apache/spark/pull/29764#issuecomment-693781556


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/128791/
   Test FAILed.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29764: [SPARK-32738][CORE][2.4] Should reduce the number of active threads if fatal error happens in `Inbox.process`

2020-09-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29764:
URL: https://github.com/apache/spark/pull/29764#issuecomment-693781552


   Merged build finished. Test FAILed.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] tanelk commented on a change in pull request #29092: [SPARK-32295][SQL] Add not null and size > 0 filters before inner explode to benefit from predicate pushdown

2020-09-16 Thread GitBox


tanelk commented on a change in pull request #29092:
URL: https://github.com/apache/spark/pull/29092#discussion_r489919112



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##
@@ -1847,3 +1848,25 @@ object OptimizeLimitZero extends Rule[LogicalPlan] {
   empty(ll)
   }
 }
+
+/**
+ * Generates filters for exploded expression, such that rows that would have 
been removed
+ * by this [[Generate]] can be removed earlier - before joins and in data 
sources.
+ */
+object InferFiltersFromGenerate extends Rule[LogicalPlan] {

Review comment:
   The new one does not use constraints to infer filters. I could rename 
the existing one to just `InferFilters` and then combine these two.
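
   For context, a rough sketch of such a rule (the pattern shapes are assumptions, and a real rule would also need to avoid re-adding filters it already inferred):

   ```scala
   import org.apache.spark.sql.catalyst.expressions._
   import org.apache.spark.sql.catalyst.plans.logical._
   import org.apache.spark.sql.catalyst.rules.Rule

   object InferFiltersFromGenerateSketch extends Rule[LogicalPlan] {
     override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
       // A non-outer explode drops rows whose input is null or empty anyway,
       // so these filters only prune the same rows earlier.
       case g @ Generate(Explode(expr), _, false /* outer */, _, _, child) =>
         val inferred = And(IsNotNull(expr), GreaterThan(Size(expr), Literal(0)))
         g.copy(child = Filter(inferred, child))
     }
   }
   ```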





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29764: [SPARK-32738][CORE][2.4] Should reduce the number of active threads if fatal error happens in `Inbox.process`

2020-09-16 Thread GitBox


SparkQA commented on pull request #29764:
URL: https://github.com/apache/spark/pull/29764#issuecomment-693781543


   **[Test build #128791 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128791/testReport)**
 for PR 29764 at commit 
[`af64be8`](https://github.com/apache/spark/commit/af64be83dd3b07b148d8c1886633729d6d06eec5).
* This patch **fails build dependency tests**.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #29764: [SPARK-32738][CORE][2.4] Should reduce the number of active threads if fatal error happens in `Inbox.process`

2020-09-16 Thread GitBox


SparkQA removed a comment on pull request #29764:
URL: https://github.com/apache/spark/pull/29764#issuecomment-693780444


   **[Test build #128791 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128791/testReport)** for PR 29764 at commit [`af64be8`](https://github.com/apache/spark/commit/af64be83dd3b07b148d8c1886633729d6d06eec5).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29764: [SPARK-32738][CORE][2.4] Should reduce the number of active threads if fatal error happens in `Inbox.process`

2020-09-16 Thread GitBox


AmplabJenkins commented on pull request #29764:
URL: https://github.com/apache/spark/pull/29764#issuecomment-693781552







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29764: [SPARK-32738][CORE][2.4] Should reduce the number of active threads if fatal error happens in `Inbox.process`

2020-09-16 Thread GitBox


AmplabJenkins commented on pull request #29764:
URL: https://github.com/apache/spark/pull/29764#issuecomment-693780763







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29764: [SPARK-32738][CORE][2.4] Should reduce the number of active threads if fatal error happens in `Inbox.process`

2020-09-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29764:
URL: https://github.com/apache/spark/pull/29764#issuecomment-693780763







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29778: [SPARK-18409][ML][FOLLOWUP] LSH approxNearestNeighbors optimization 2

2020-09-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29778:
URL: https://github.com/apache/spark/pull/29778#issuecomment-693780826







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29778: [SPARK-18409][ML][FOLLOWUP] LSH approxNearestNeighbors optimization 2

2020-09-16 Thread GitBox


AmplabJenkins commented on pull request #29778:
URL: https://github.com/apache/spark/pull/29778#issuecomment-693780826







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29764: [SPARK-32738][CORE][2.4] Should reduce the number of active threads if fatal error happens in `Inbox.process`

2020-09-16 Thread GitBox


SparkQA commented on pull request #29764:
URL: https://github.com/apache/spark/pull/29764#issuecomment-693780444


   **[Test build #128791 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128791/testReport)** for PR 29764 at commit [`af64be8`](https://github.com/apache/spark/commit/af64be83dd3b07b148d8c1886633729d6d06eec5).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29778: [SPARK-18409][ML][FOLLOWUP] LSH approxNearestNeighbors optimization 2

2020-09-16 Thread GitBox


SparkQA commented on pull request #29778:
URL: https://github.com/apache/spark/pull/29778#issuecomment-693780395


   **[Test build #128790 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128790/testReport)** for PR 29778 at commit [`78121a1`](https://github.com/apache/spark/commit/78121a1ce487666d540d047441831dea89f33e72).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng opened a new pull request #29778: [SPARK-18409][ML][FOLLOWUP] LSH approxNearestNeighbors optimization 2

2020-09-16 Thread GitBox


zhengruifeng opened a new pull request #29778:
URL: https://github.com/apache/spark/pull/29778


   ### What changes were proposed in this pull request?
   Simplify the aggregation by getting `count` via `summary.count`.
   
   
   ### Why are the changes needed?
   It simplifies the aggregation.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   existing testsuites
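
   For illustration, a hedged sketch of the pattern this describes — reading `count` out of the same `Summarizer` pass instead of issuing a separate count aggregation. The DataFrame `df` and the `features` column are placeholders, not the code this PR actually touches:

```scala
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.functions.col

// One pass over the data computes both metrics; the count comes for free
// with the summary, so no second aggregation job is needed.
val row = df
  .select(Summarizer.metrics("min", "count").summary(col("features")).as("summary"))
  .select("summary.min", "summary.count")
  .first()
val count = row.getLong(1) // second selected column is the count
```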



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] wzhfy commented on pull request #29764: [SPARK-32738][CORE][2.4] Should reduce the number of active threads if fatal error happens in `Inbox.process`

2020-09-16 Thread GitBox


wzhfy commented on pull request #29764:
URL: https://github.com/apache/spark/pull/29764#issuecomment-693779495


   retest this please



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #29703: [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI

2020-09-16 Thread GitBox


HyukjinKwon commented on a change in pull request #29703:
URL: https://github.com/apache/spark/pull/29703#discussion_r489909606



##
File path: python/docs/source/getting_started/installation.rst
##
@@ -38,8 +38,36 @@ PySpark installation using `PyPI `_
 .. code-block:: bash

Review comment:
   I am going to rewrite this page after this PR gets merged.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


