[spark] branch master updated (7a1a5db -> 866b7df)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 7a1a5db [SPARK-30414][SQL] ParquetRowConverter optimizations: arrays, maps, plus misc. constant factors add 866b7df [SPARK-30335][SQL][DOCS] Add a note first, last, collect_list and collect_set can be non-deterministic in SQL function docs as well No new revisions were added by this update. Summary of changes: R/pkg/R/functions.R| 12 +++ python/pyspark/sql/functions.py| 12 +++ .../sql/catalyst/expressions/aggregate/First.scala | 4 +++ .../sql/catalyst/expressions/aggregate/Last.scala | 4 +++ .../catalyst/expressions/aggregate/collect.scala | 8 + .../scala/org/apache/spark/sql/functions.scala | 40 +++--- 6 files changed, 48 insertions(+), 32 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
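The note added by SPARK-30335 concerns order-sensitive aggregates: `first`, `last`, `collect_list`, and `collect_set` depend on the order in which rows arrive, which is not guaranteed after a shuffle. The effect can be illustrated with a minimal pure-Python sketch (no Spark required; `aggregate_first` and the partition lists are hypothetical stand-ins, not Spark APIs):

```python
# Sketch of why first()/collect_list() can be non-deterministic in Spark:
# after a shuffle, the order in which partitions are consumed is not
# guaranteed, so order-sensitive aggregates may differ between runs.

def aggregate_first(partitions):
    """Return the first row seen across partitions, in arrival order."""
    for part in partitions:
        for row in part:
            return row
    return None

p1, p2 = [1, 2], [3, 4]

# Run 1: partition p1 happens to be consumed first.
run1 = aggregate_first([p1, p2])
# Run 2: p2 is consumed first (e.g. different task scheduling).
run2 = aggregate_first([p2, p1])

print(run1, run2)  # same data, different "first" result
```

The data is identical in both runs; only the (unspecified) partition order changed, which is exactly the caveat the documentation note warns about.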
[spark] branch master updated (93d3ab8 -> 7a1a5db)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 93d3ab8 [SPARK-30338][SQL] Avoid unnecessary InternalRow copies in ParquetRowConverter add 7a1a5db [SPARK-30414][SQL] ParquetRowConverter optimizations: arrays, maps, plus misc. constant factors No new revisions were added by this update. Summary of changes: .../datasources/parquet/ParquetRowConverter.scala | 59 +++--- 1 file changed, 29 insertions(+), 30 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (da07615 -> 93d3ab8)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from da07615 [SPARK-30433][SQL] Make conflict attributes resolution more scalable in ResolveReferences add 93d3ab8 [SPARK-30338][SQL] Avoid unnecessary InternalRow copies in ParquetRowConverter No new revisions were added by this update. Summary of changes: .../datasources/parquet/ParquetRowConverter.scala | 30 +-- .../datasources/parquet/ParquetIOSuite.scala | 63 +- 2 files changed, 89 insertions(+), 4 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (17881a4 -> da07615)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 17881a4 [SPARK-19784][SPARK-25403][SQL] Refresh the table even table stats is empty add da07615 [SPARK-30433][SQL] Make conflict attributes resolution more scalable in ResolveReferences No new revisions were added by this update. Summary of changes: .../spark/sql/catalyst/analysis/Analyzer.scala | 46 -- 1 file changed, 25 insertions(+), 21 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (3ba175ef -> 17881a4)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 3ba175ef [SPARK-30430][PYTHON][DOCS] Add a note that UserDefinedFunction's constructor is private add 17881a4 [SPARK-19784][SPARK-25403][SQL] Refresh the table even table stats is empty No new revisions were added by this update. Summary of changes: .../spark/sql/execution/command/CommandUtils.scala | 3 ++ .../spark/sql/execution/command/DDLSuite.scala | 33 +- .../spark/sql/hive/HiveParquetMetastoreSuite.scala | 10 --- .../spark/sql/hive/orc/HiveOrcQuerySuite.scala | 4 +-- 4 files changed, 43 insertions(+), 7 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (88542bc -> 3ba175ef)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 88542bc [SPARK-30154][ML] PySpark UDF to convert MLlib vectors to dense arrays add 3ba175ef [SPARK-30430][PYTHON][DOCS] Add a note that UserDefinedFunction's constructor is private No new revisions were added by this update. Summary of changes: python/pyspark/sql/udf.py | 4 1 file changed, 4 insertions(+) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-30154][ML] PySpark UDF to convert MLlib vectors to dense arrays
This is an automated email from the ASF dual-hosted git repository. meng pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 88542bc [SPARK-30154][ML] PySpark UDF to convert MLlib vectors to dense arrays 88542bc is described below commit 88542bc3d9e506b1a0e852f3e9c632920d3fe553 Author: WeichenXu AuthorDate: Mon Jan 6 16:18:51 2020 -0800 [SPARK-30154][ML] PySpark UDF to convert MLlib vectors to dense arrays ### What changes were proposed in this pull request? PySpark UDF to convert MLlib vectors to dense arrays. Example: ``` from pyspark.ml.functions import vector_to_array df.select(vector_to_array(col("features"))) ``` ### Why are the changes needed? If a PySpark user wants to convert MLlib sparse/dense vectors in a DataFrame into dense arrays, an efficient approach is to do that in the JVM. However, it requires the PySpark user to write Scala code and register it as a UDF. Often this is infeasible for a pure Python project. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? UT. Closes #26910 from WeichenXu123/vector_to_array. 
Authored-by: WeichenXu Signed-off-by: Xiangrui Meng --- dev/sparktestsupport/modules.py| 1 + .../main/scala/org/apache/spark/ml/functions.scala | 48 +++ .../scala/org/apache/spark/ml/FunctionsSuite.scala | 65 + python/docs/pyspark.ml.rst | 8 +++ python/pyspark/ml/functions.py | 68 ++ 5 files changed, 190 insertions(+) diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py index 1443584..4179359 100644 --- a/dev/sparktestsupport/modules.py +++ b/dev/sparktestsupport/modules.py @@ -460,6 +460,7 @@ pyspark_ml = Module( "pyspark.ml.evaluation", "pyspark.ml.feature", "pyspark.ml.fpm", +"pyspark.ml.functions", "pyspark.ml.image", "pyspark.ml.linalg.__init__", "pyspark.ml.recommendation", diff --git a/mllib/src/main/scala/org/apache/spark/ml/functions.scala b/mllib/src/main/scala/org/apache/spark/ml/functions.scala new file mode 100644 index 000..1faf562 --- /dev/null +++ b/mllib/src/main/scala/org/apache/spark/ml/functions.scala @@ -0,0 +1,48 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml + +import org.apache.spark.annotation.Since +import org.apache.spark.ml.linalg.Vector +import org.apache.spark.mllib.linalg.{Vector => OldVector} +import org.apache.spark.sql.Column +import org.apache.spark.sql.functions.udf + +// scalastyle:off +@Since("3.0.0") +object functions { +// scalastyle:on + + private val vectorToArrayUdf = udf { vec: Any => +vec match { + case v: Vector => v.toArray + case v: OldVector => v.toArray + case v => throw new IllegalArgumentException( +"function vector_to_array requires a non-null input argument and input type must be " + +"`org.apache.spark.ml.linalg.Vector` or `org.apache.spark.mllib.linalg.Vector`, " + +s"but got ${ if (v == null) "null" else v.getClass.getName }.") +} + }.asNonNullable() + + /** + * Converts a column of MLlib sparse/dense vectors into a column of dense arrays. + * + * @since 3.0.0 + */ + def vector_to_array(v: Column): Column = vectorToArrayUdf(v) +} diff --git a/mllib/src/test/scala/org/apache/spark/ml/FunctionsSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/FunctionsSuite.scala new file mode 100644 index 000..2f5062c --- /dev/null +++ b/mllib/src/test/scala/org/apache/spark/ml/FunctionsSuite.scala @@ -0,0 +1,65 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a cop
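The semantics of the Scala UDF in the patch above can be sketched in pure Python without a Spark installation: both sparse and dense vectors are flattened to a plain dense array, and anything else (including null) is rejected. `DenseVec` and `SparseVec` here are simplified hypothetical stand-ins for the real `org.apache.spark.ml.linalg` classes, not Spark APIs:

```python
# Pure-Python sketch of what vector_to_array does, mirroring the Scala UDF:
# sparse and dense vectors both become plain dense arrays; other inputs fail.

class DenseVec:
    def __init__(self, values):
        self.values = list(values)

    def to_array(self):
        return list(self.values)

class SparseVec:
    def __init__(self, size, indices, values):
        self.size, self.indices, self.values = size, indices, values

    def to_array(self):
        # Expand (index, value) pairs into a zero-filled dense array.
        out = [0.0] * self.size
        for i, v in zip(self.indices, self.values):
            out[i] = v
        return out

def vector_to_array(vec):
    # Like the Scala UDF: accept either vector type, reject anything else.
    if isinstance(vec, (DenseVec, SparseVec)):
        return vec.to_array()
    raise ValueError(
        "vector_to_array requires a non-null vector input, got "
        + ("null" if vec is None else type(vec).__name__))

print(vector_to_array(SparseVec(4, [1, 3], [5.0, 7.0])))  # [0.0, 5.0, 0.0, 7.0]
```

This mirrors the `.asNonNullable()` match in the Scala code: the conversion is total over the two vector types and raises a descriptive error otherwise.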
[spark] branch master updated (604d679 -> 895e572)
This is an automated email from the ASF dual-hosted git repository. vanzin pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 604d679 [SPARK-30226][SQL] Remove withXXX functions in WriteBuilder add 895e572 [SPARK-30313][CORE] Ensure EndpointRef is available MasterWebUI/WorkerPage No new revisions were added by this update. Summary of changes: .../org/apache/spark/rpc/netty/Dispatcher.scala| 35 ++ 1 file changed, 23 insertions(+), 12 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (3eade74 -> 604d679)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 3eade74 [SPARK-29800][SQL] Rewrite non-correlated EXISTS subquery use ScalaSubquery to optimize perf add 604d679 [SPARK-30226][SQL] Remove withXXX functions in WriteBuilder No new revisions were added by this update. Summary of changes: .../org/apache/spark/sql/v2/avro/AvroTable.scala | 6 ++-- .../spark/sql/v2/avro/AvroWriteBuilder.scala | 8 ++--- .../spark/sql/kafka010/KafkaSourceProvider.scala | 12 +++- .../spark/sql/connector/catalog/StagedTable.java | 10 +++ .../sql/connector/catalog/StagingTableCatalog.java | 8 ++--- .../spark/sql/connector/catalog/SupportsWrite.java | 8 ++--- ...hysicalWriteInfo.java => LogicalWriteInfo.java} | 25 .../spark/sql/connector/write/WriteBuilder.java| 23 --- ...teInfoImpl.scala => LogicalWriteInfoImpl.scala} | 8 - .../apache/spark/sql/connector/InMemoryTable.scala | 4 +-- .../connector/StagingInMemoryTableCatalog.scala| 6 ++-- .../datasources/noop/NoopDataSource.scala | 4 +-- .../datasources/v2/FileWriteBuilder.scala | 22 -- .../datasources/v2/V1FallbackWriters.scala | 11 --- .../datasources/v2/WriteToDataSourceV2Exec.scala | 34 +- .../execution/datasources/v2/csv/CSVTable.scala| 6 ++-- .../datasources/v2/csv/CSVWriteBuilder.scala | 8 ++--- .../execution/datasources/v2/json/JsonTable.scala | 6 ++-- .../datasources/v2/json/JsonWriteBuilder.scala | 8 ++--- .../execution/datasources/v2/orc/OrcTable.scala| 6 ++-- .../datasources/v2/orc/OrcWriteBuilder.scala | 8 ++--- .../datasources/v2/parquet/ParquetTable.scala | 6 ++-- .../v2/parquet/ParquetWriteBuilder.scala | 9 +++--- .../execution/datasources/v2/text/TextTable.scala | 6 ++-- .../datasources/v2/text/TextWriteBuilder.scala | 8 ++--- .../sql/execution/streaming/StreamExecution.scala | 10 --- .../spark/sql/execution/streaming/console.scala| 13 +++-- .../streaming/sources/ForeachWriterTable.scala | 12 ++-- 
.../sql/execution/streaming/sources/memory.scala | 12 ++-- .../connector/FileDataSourceV2FallBackSuite.scala | 4 +-- .../sql/connector/SimpleWritableDataSource.scala | 14 - .../spark/sql/connector/V1WriteFallbackSuite.scala | 14 + .../execution/datasources/v2/FileTableSuite.scala | 4 +-- .../sources/StreamingDataSourceV2Suite.scala | 4 +-- 34 files changed, 161 insertions(+), 186 deletions(-) copy sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/{PhysicalWriteInfo.java => LogicalWriteInfo.java} (56%) copy sql/catalyst/src/main/scala/org/apache/spark/sql/connector/write/{PhysicalWriteInfoImpl.scala => LogicalWriteInfoImpl.scala} (76%) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (bc16bb1 -> 3eade74)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from bc16bb1 [SPARK-30426][SS][DOC] Fix the disorder of structured-streaming-kafka-integration page add 3eade74 [SPARK-29800][SQL] Rewrite non-correlated EXISTS subquery use ScalaSubquery to optimize perf No new revisions were added by this update. Summary of changes: .../spark/sql/catalyst/expressions/subquery.scala | 24 ++ .../spark/sql/catalyst/optimizer/Optimizer.scala | 1 + .../sql/catalyst/optimizer/finishAnalysis.scala| 15 ++ .../spark/sql/catalyst/optimizer/subquery.scala| 3 ++- .../org/apache/spark/sql/CachedTableSuite.scala| 15 +++--- .../scala/org/apache/spark/sql/SubquerySuite.scala | 4 ++-- 6 files changed, 52 insertions(+), 10 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org