[spark] branch master updated (7a1a5db -> 866b7df)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 7a1a5db [SPARK-30414][SQL] ParquetRowConverter optimizations: arrays, maps, plus misc. constant factors add 866b7df [SPARK-30335][SQL][DOCS] Add a note first, last, collect_list and collect_set can be non-deterministic in SQL function docs as well No new revisions were added by this update. Summary of changes: R/pkg/R/functions.R| 12 +++ python/pyspark/sql/functions.py| 12 +++ .../sql/catalyst/expressions/aggregate/First.scala | 4 +++ .../sql/catalyst/expressions/aggregate/Last.scala | 4 +++ .../catalyst/expressions/aggregate/collect.scala | 8 + .../scala/org/apache/spark/sql/functions.scala | 40 +++--- 6 files changed, 48 insertions(+), 32 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
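The note added by SPARK-30335 concerns order-sensitive aggregates: `first`, `last`, `collect_list`, and `collect_set` depend on the order in which rows arrive, which is not guaranteed after a shuffle. The effect can be illustrated with a minimal pure-Python sketch (no Spark required; `aggregate_first` and the partition lists are hypothetical stand-ins, not Spark APIs):

```python
# Sketch of why first()/collect_list() can be non-deterministic in Spark:
# after a shuffle, the order in which partitions are consumed is not
# guaranteed, so order-sensitive aggregates may differ between runs.

def aggregate_first(partitions):
    """Return the first row seen across partitions, in arrival order."""
    for part in partitions:
        for row in part:
            return row
    return None

p1, p2 = [1, 2], [3, 4]

# Run 1: partition p1 happens to be consumed first.
run1 = aggregate_first([p1, p2])
# Run 2: p2 is consumed first (e.g. different task scheduling).
run2 = aggregate_first([p2, p1])

print(run1, run2)  # same data, different "first" result
```

The data is identical in both runs; only the (unspecified) partition order changed, which is exactly the caveat the documentation note warns about.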
[spark] branch master updated (93d3ab8 -> 7a1a5db)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 93d3ab8 [SPARK-30338][SQL] Avoid unnecessary InternalRow copies in ParquetRowConverter add 7a1a5db [SPARK-30414][SQL] ParquetRowConverter optimizations: arrays, maps, plus misc. constant factors No new revisions were added by this update. Summary of changes: .../datasources/parquet/ParquetRowConverter.scala | 59 +++--- 1 file changed, 29 insertions(+), 30 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (da07615 -> 93d3ab8)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from da07615 [SPARK-30433][SQL] Make conflict attributes resolution more scalable in ResolveReferences add 93d3ab8 [SPARK-30338][SQL] Avoid unnecessary InternalRow copies in ParquetRowConverter No new revisions were added by this update. Summary of changes: .../datasources/parquet/ParquetRowConverter.scala | 30 +-- .../datasources/parquet/ParquetIOSuite.scala | 63 +- 2 files changed, 89 insertions(+), 4 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (17881a4 -> da07615)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 17881a4 [SPARK-19784][SPARK-25403][SQL] Refresh the table even table stats is empty add da07615 [SPARK-30433][SQL] Make conflict attributes resolution more scalable in ResolveReferences No new revisions were added by this update. Summary of changes: .../spark/sql/catalyst/analysis/Analyzer.scala | 46 -- 1 file changed, 25 insertions(+), 21 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (3ba175ef -> 17881a4)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 3ba175ef [SPARK-30430][PYTHON][DOCS] Add a note that UserDefinedFunction's constructor is private add 17881a4 [SPARK-19784][SPARK-25403][SQL] Refresh the table even table stats is empty No new revisions were added by this update. Summary of changes: .../spark/sql/execution/command/CommandUtils.scala | 3 ++ .../spark/sql/execution/command/DDLSuite.scala | 33 +- .../spark/sql/hive/HiveParquetMetastoreSuite.scala | 10 --- .../spark/sql/hive/orc/HiveOrcQuerySuite.scala | 4 +-- 4 files changed, 43 insertions(+), 7 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (88542bc -> 3ba175ef)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 88542bc [SPARK-30154][ML] PySpark UDF to convert MLlib vectors to dense arrays add 3ba175ef [SPARK-30430][PYTHON][DOCS] Add a note that UserDefinedFunction's constructor is private No new revisions were added by this update. Summary of changes: python/pyspark/sql/udf.py | 4 1 file changed, 4 insertions(+) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-30154][ML] PySpark UDF to convert MLlib vectors to dense arrays
This is an automated email from the ASF dual-hosted git repository. meng pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 88542bc [SPARK-30154][ML] PySpark UDF to convert MLlib vectors to dense arrays 88542bc is described below commit 88542bc3d9e506b1a0e852f3e9c632920d3fe553 Author: WeichenXu AuthorDate: Mon Jan 6 16:18:51 2020 -0800 [SPARK-30154][ML] PySpark UDF to convert MLlib vectors to dense arrays ### What changes were proposed in this pull request? PySpark UDF to convert MLlib vectors to dense arrays. Example: ``` from pyspark.ml.functions import vector_to_array df.select(vector_to_array(col("features"))) ``` ### Why are the changes needed? If a PySpark user wants to convert MLlib sparse/dense vectors in a DataFrame into dense arrays, an efficient approach is to do that in the JVM. However, it requires the PySpark user to write Scala code and register it as a UDF. Often this is infeasible for a pure Python project. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? UT. Closes #26910 from WeichenXu123/vector_to_array. 
Authored-by: WeichenXu Signed-off-by: Xiangrui Meng --- dev/sparktestsupport/modules.py| 1 + .../main/scala/org/apache/spark/ml/functions.scala | 48 +++ .../scala/org/apache/spark/ml/FunctionsSuite.scala | 65 + python/docs/pyspark.ml.rst | 8 +++ python/pyspark/ml/functions.py | 68 ++ 5 files changed, 190 insertions(+) diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py index 1443584..4179359 100644 --- a/dev/sparktestsupport/modules.py +++ b/dev/sparktestsupport/modules.py @@ -460,6 +460,7 @@ pyspark_ml = Module( "pyspark.ml.evaluation", "pyspark.ml.feature", "pyspark.ml.fpm", +"pyspark.ml.functions", "pyspark.ml.image", "pyspark.ml.linalg.__init__", "pyspark.ml.recommendation", diff --git a/mllib/src/main/scala/org/apache/spark/ml/functions.scala b/mllib/src/main/scala/org/apache/spark/ml/functions.scala new file mode 100644 index 000..1faf562 --- /dev/null +++ b/mllib/src/main/scala/org/apache/spark/ml/functions.scala @@ -0,0 +1,48 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml + +import org.apache.spark.annotation.Since +import org.apache.spark.ml.linalg.Vector +import org.apache.spark.mllib.linalg.{Vector => OldVector} +import org.apache.spark.sql.Column +import org.apache.spark.sql.functions.udf + +// scalastyle:off +@Since("3.0.0") +object functions { +// scalastyle:on + + private val vectorToArrayUdf = udf { vec: Any => +vec match { + case v: Vector => v.toArray + case v: OldVector => v.toArray + case v => throw new IllegalArgumentException( +"function vector_to_array requires a non-null input argument and input type must be " + +"`org.apache.spark.ml.linalg.Vector` or `org.apache.spark.mllib.linalg.Vector`, " + +s"but got ${ if (v == null) "null" else v.getClass.getName }.") +} + }.asNonNullable() + + /** + * Converts a column of MLlib sparse/dense vectors into a column of dense arrays. + * + * @since 3.0.0 + */ + def vector_to_array(v: Column): Column = vectorToArrayUdf(v) +} diff --git a/mllib/src/test/scala/org/apache/spark/ml/FunctionsSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/FunctionsSuite.scala new file mode 100644 index 000..2f5062c --- /dev/null +++ b/mllib/src/test/scala/org/apache/spark/ml/FunctionsSuite.scala @@ -0,0 +1,65 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a cop
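The semantics of the Scala UDF in the patch above can be sketched in pure Python without a Spark installation: both sparse and dense vectors are flattened to a plain dense array, and anything else (including null) is rejected. `DenseVec` and `SparseVec` here are simplified hypothetical stand-ins for the real `org.apache.spark.ml.linalg` classes, not Spark APIs:

```python
# Pure-Python sketch of what vector_to_array does, mirroring the Scala UDF:
# sparse and dense vectors both become plain dense arrays; other inputs fail.

class DenseVec:
    def __init__(self, values):
        self.values = list(values)

    def to_array(self):
        return list(self.values)

class SparseVec:
    def __init__(self, size, indices, values):
        self.size, self.indices, self.values = size, indices, values

    def to_array(self):
        # Expand (index, value) pairs into a zero-filled dense array.
        out = [0.0] * self.size
        for i, v in zip(self.indices, self.values):
            out[i] = v
        return out

def vector_to_array(vec):
    # Like the Scala UDF: accept either vector type, reject anything else.
    if isinstance(vec, (DenseVec, SparseVec)):
        return vec.to_array()
    raise ValueError(
        "vector_to_array requires a non-null vector input, got "
        + ("null" if vec is None else type(vec).__name__))

print(vector_to_array(SparseVec(4, [1, 3], [5.0, 7.0])))  # [0.0, 5.0, 0.0, 7.0]
```

This mirrors the `.asNonNullable()` match in the Scala code: the conversion is total over the two vector types and raises a descriptive error otherwise.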
[spark] branch master updated (604d679 -> 895e572)
This is an automated email from the ASF dual-hosted git repository. vanzin pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 604d679 [SPARK-30226][SQL] Remove withXXX functions in WriteBuilder add 895e572 [SPARK-30313][CORE] Ensure EndpointRef is available MasterWebUI/WorkerPage No new revisions were added by this update. Summary of changes: .../org/apache/spark/rpc/netty/Dispatcher.scala| 35 ++ 1 file changed, 23 insertions(+), 12 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (3eade74 -> 604d679)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 3eade74 [SPARK-29800][SQL] Rewrite non-correlated EXISTS subquery use ScalaSubquery to optimize perf add 604d679 [SPARK-30226][SQL] Remove withXXX functions in WriteBuilder No new revisions were added by this update. Summary of changes: .../org/apache/spark/sql/v2/avro/AvroTable.scala | 6 ++-- .../spark/sql/v2/avro/AvroWriteBuilder.scala | 8 ++--- .../spark/sql/kafka010/KafkaSourceProvider.scala | 12 +++- .../spark/sql/connector/catalog/StagedTable.java | 10 +++ .../sql/connector/catalog/StagingTableCatalog.java | 8 ++--- .../spark/sql/connector/catalog/SupportsWrite.java | 8 ++--- ...hysicalWriteInfo.java => LogicalWriteInfo.java} | 25 .../spark/sql/connector/write/WriteBuilder.java| 23 --- ...teInfoImpl.scala => LogicalWriteInfoImpl.scala} | 8 - .../apache/spark/sql/connector/InMemoryTable.scala | 4 +-- .../connector/StagingInMemoryTableCatalog.scala| 6 ++-- .../datasources/noop/NoopDataSource.scala | 4 +-- .../datasources/v2/FileWriteBuilder.scala | 22 -- .../datasources/v2/V1FallbackWriters.scala | 11 --- .../datasources/v2/WriteToDataSourceV2Exec.scala | 34 +- .../execution/datasources/v2/csv/CSVTable.scala| 6 ++-- .../datasources/v2/csv/CSVWriteBuilder.scala | 8 ++--- .../execution/datasources/v2/json/JsonTable.scala | 6 ++-- .../datasources/v2/json/JsonWriteBuilder.scala | 8 ++--- .../execution/datasources/v2/orc/OrcTable.scala| 6 ++-- .../datasources/v2/orc/OrcWriteBuilder.scala | 8 ++--- .../datasources/v2/parquet/ParquetTable.scala | 6 ++-- .../v2/parquet/ParquetWriteBuilder.scala | 9 +++--- .../execution/datasources/v2/text/TextTable.scala | 6 ++-- .../datasources/v2/text/TextWriteBuilder.scala | 8 ++--- .../sql/execution/streaming/StreamExecution.scala | 10 --- .../spark/sql/execution/streaming/console.scala| 13 +++-- .../streaming/sources/ForeachWriterTable.scala | 12 ++-- 
.../sql/execution/streaming/sources/memory.scala | 12 ++-- .../connector/FileDataSourceV2FallBackSuite.scala | 4 +-- .../sql/connector/SimpleWritableDataSource.scala | 14 - .../spark/sql/connector/V1WriteFallbackSuite.scala | 14 + .../execution/datasources/v2/FileTableSuite.scala | 4 +-- .../sources/StreamingDataSourceV2Suite.scala | 4 +-- 34 files changed, 161 insertions(+), 186 deletions(-) copy sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/{PhysicalWriteInfo.java => LogicalWriteInfo.java} (56%) copy sql/catalyst/src/main/scala/org/apache/spark/sql/connector/write/{PhysicalWriteInfoImpl.scala => LogicalWriteInfoImpl.scala} (76%) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (bc16bb1 -> 3eade74)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from bc16bb1 [SPARK-30426][SS][DOC] Fix the disorder of structured-streaming-kafka-integration page add 3eade74 [SPARK-29800][SQL] Rewrite non-correlated EXISTS subquery use ScalaSubquery to optimize perf No new revisions were added by this update. Summary of changes: .../spark/sql/catalyst/expressions/subquery.scala | 24 ++ .../spark/sql/catalyst/optimizer/Optimizer.scala | 1 + .../sql/catalyst/optimizer/finishAnalysis.scala| 15 ++ .../spark/sql/catalyst/optimizer/subquery.scala| 3 ++- .../org/apache/spark/sql/CachedTableSuite.scala| 15 +++--- .../scala/org/apache/spark/sql/SubquerySuite.scala | 4 ++-- 6 files changed, 52 insertions(+), 10 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org