[jira] [Updated] (SPARK-28587) JDBC data source's partition whereClause should support jdbc dialect

2019-07-31 Thread wyp (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wyp updated SPARK-28587:

Description: 
When we use the JDBC data source to read data from Phoenix and use a timestamp 
column as the partitionColumn, e.g.
{code:java}
val url = "jdbc:phoenix:thin:url=localhost:8765;serialization=PROTOBUF"
val driver = "org.apache.phoenix.queryserver.client.Driver"

val df = spark.read.format("jdbc")
.option("url", url)
.option("driver", driver)
.option("fetchsize", "1000")
.option("numPartitions", "6")
.option("partitionColumn", "times")
.option("lowerBound", "2019-07-31 00:00:00")
.option("upperBound", "2019-08-01 00:00:00")
.option("dbtable", "test")
.load().select("id")

println(df.count())
{code}
Phoenix throws an AvaticaSqlException:
{code:java}
org.apache.calcite.avatica.AvaticaSqlException: Error -1 (0) : while 
preparing SQL: SELECT 1 FROM search_info_test WHERE "TIMES" < '2019-07-31 
04:00:00' or "TIMES" is null
  at org.apache.calcite.avatica.Helper.createException(Helper.java:54)
  at org.apache.calcite.avatica.Helper.createException(Helper.java:41)
  at 
org.apache.calcite.avatica.AvaticaConnection.prepareStatement(AvaticaConnection.java:368)
  at 
org.apache.calcite.avatica.AvaticaConnection.prepareStatement(AvaticaConnection.java:299)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:300)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
  at org.apache.spark.scheduler.Task.run(Task.scala:121)
  at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
java.lang.RuntimeException: org.apache.phoenix.schema.TypeMismatchException: 
ERROR 203 (22005): Type mismatch. TIMESTAMP and VARCHAR for "TIMES" < 
'2019-07-31 04:00:00'
  at org.apache.calcite.avatica.jdbc.JdbcMeta.propagate(JdbcMeta.java:700)
  at 
org.apache.calcite.avatica.jdbc.PhoenixJdbcMeta.prepare(PhoenixJdbcMeta.java:67)
  at org.apache.calcite.avatica.remote.LocalService.apply(LocalService.java:195)
  at 
org.apache.calcite.avatica.remote.Service$PrepareRequest.accept(Service.java:1215)
  at 
org.apache.calcite.avatica.remote.Service$PrepareRequest.accept(Service.java:1186)
  at 
org.apache.calcite.avatica.remote.AbstractHandler.apply(AbstractHandler.java:94)
  at 
org.apache.calcite.avatica.remote.ProtobufHandler.apply(ProtobufHandler.java:46)
  at 
org.apache.calcite.avatica.server.AvaticaProtobufHandler.handle(AvaticaProtobufHandler.java:127)
  at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52)
  at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
  at org.eclipse.jetty.server.Server.handle(Server.java:534)
  at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
  at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
  at 
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
  at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
  at 
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
  at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
  at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
  at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
  at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
  at 
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
  at java.lang.Thread.run(Thread.java:834)
{code}
The reason is that the JDBC data source's partition whereClause does not go 
through the JDBC dialect. We should use the JDBC dialect to compile the literal 
'2019-07-31 04:00:00' into to_timestamp('2019-07-31 04:00:00').
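
For illustration only, a minimal sketch (an assumption, not the actual Spark 
change) of a Phoenix dialect whose compileValue renders timestamp values through 
to_timestamp(); it shows what routing the partition whereClause through the JDBC 
dialect could look like. The dialect object and its registration below are 
hypothetical, not part of Spark:
{code:java}
import java.sql.Timestamp
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Hypothetical dialect sketch: quote timestamp values as to_timestamp('...') so
// Phoenix compares TIMESTAMP with TIMESTAMP instead of TIMESTAMP with VARCHAR.
// This only helps where Spark routes literals through the dialect, which the
// partition whereClause in 2.4.3 does not do yet.
object PhoenixDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:phoenix")

  override def compileValue(value: Any): Any = value match {
    case ts: Timestamp => s"to_timestamp('$ts')"
    case other         => super.compileValue(other)
  }
}

// Register the dialect before building the DataFrame reader.
JdbcDialects.registerDialect(PhoenixDialect)
{code}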

  was:
When we use JDBC data source to search 

[jira] [Created] (SPARK-28587) JDBC data source's partition whereClause should support jdbc dialect

2019-07-31 Thread wyp (JIRA)
wyp created SPARK-28587:
---

 Summary: JDBC data source's partition whereClause should support 
jdbc dialect
 Key: SPARK-28587
 URL: https://issues.apache.org/jira/browse/SPARK-28587
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.3
Reporter: wyp


When we use the JDBC data source to read data from Phoenix and use a timestamp 
column as the partitionColumn, e.g.

{code:java}
val url = "jdbc:phoenix:thin:url=localhost:8765;serialization=PROTOBUF"
val driver = "org.apache.phoenix.queryserver.client.Driver"

val df = spark.read.format("jdbc")
.option("url", url)
.option("driver", driver)
.option("fetchsize", "1000")
.option("numPartitions", "6")
.option("partitionColumn", "times")
.option("lowerBound", "2019-07-31 00:00:00")
.option("upperBound", "2019-08-01 00:00:00")
.option("dbtable", "test")
.load().select("id")

println(df.count())
{code}
Phoenix throws an AvaticaSqlException:

{code:java}
org.apache.calcite.avatica.AvaticaSqlException: Error -1 (0) : while 
preparing SQL: SELECT 1 FROM search_info_test WHERE "TIMES" < '2019-07-31 
04:00:00' or "TIMES" is null
  at org.apache.calcite.avatica.Helper.createException(Helper.java:54)
  at org.apache.calcite.avatica.Helper.createException(Helper.java:41)
  at 
org.apache.calcite.avatica.AvaticaConnection.prepareStatement(AvaticaConnection.java:368)
  at 
org.apache.calcite.avatica.AvaticaConnection.prepareStatement(AvaticaConnection.java:299)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:300)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
  at org.apache.spark.scheduler.Task.run(Task.scala:121)
  at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
java.lang.RuntimeException: org.apache.phoenix.schema.TypeMismatchException: 
ERROR 203 (22005): Type mismatch. TIMESTAMP and VARCHAR for "TIMES" < 
'2019-07-31 04:00:00'
  at org.apache.calcite.avatica.jdbc.JdbcMeta.propagate(JdbcMeta.java:700)
  at 
org.apache.calcite.avatica.jdbc.PhoenixJdbcMeta.prepare(PhoenixJdbcMeta.java:67)
  at org.apache.calcite.avatica.remote.LocalService.apply(LocalService.java:195)
  at 
org.apache.calcite.avatica.remote.Service$PrepareRequest.accept(Service.java:1215)
  at 
org.apache.calcite.avatica.remote.Service$PrepareRequest.accept(Service.java:1186)
  at 
org.apache.calcite.avatica.remote.AbstractHandler.apply(AbstractHandler.java:94)
  at 
org.apache.calcite.avatica.remote.ProtobufHandler.apply(ProtobufHandler.java:46)
  at 
org.apache.calcite.avatica.server.AvaticaProtobufHandler.handle(AvaticaProtobufHandler.java:127)
  at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52)
  at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
  at org.eclipse.jetty.server.Server.handle(Server.java:534)
  at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
  at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
  at 
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
  at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
  at 
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
  at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
  at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
  at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
  at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
  at 
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
  at java.lang.Thread.run(Thread.java:834)
{code}

The reason is that the JDBC data source's 

[jira] [Resolved] (SPARK-28153) Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF)

2019-07-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28153.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25286
[https://github.com/apache/spark/pull/25286]

> Use AtomicReference at InputFileBlockHolder (to support input_file_name with 
> Python UDF)
> 
>
> Key: SPARK-28153
> URL: https://issues.apache.org/jira/browse/SPARK-28153
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.3, 3.0.0, 2.4.3
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> {code}
> from pyspark.sql.functions import udf, input_file_name
> spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
> spark.read.parquet("/tmp/foo").select(udf(lambda x: x, "long")("id"), 
> input_file_name()).show()
> {code}
> {code}
> +----+-----------------+
> |(id)|input_file_name()|
> +----+-----------------+
> |   8|                 |
> |   5|                 |
> |   0|                 |
> |   9|                 |
> |   6|                 |
> |   2|                 |
> |   3|                 |
> |   4|                 |
> |   7|                 |
> |   1|                 |
> +----+-----------------+
> {code}
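
The approach named in the title is not spelled out above. As a rough sketch (an 
assumption, not the actual Spark patch), holding the value in an AtomicReference 
inside an InheritableThreadLocal lets a thread spawned for the Python UDF runner 
observe updates made by the task thread, instead of inheriting a stale copy:
{code:java}
import java.util.concurrent.atomic.AtomicReference

// Sketch of the shared-reference pattern: child threads inherit the same
// AtomicReference, so a set() from the task thread is visible to them.
object FileNameHolderSketch {
  private val ref = new InheritableThreadLocal[AtomicReference[String]] {
    override def initialValue(): AtomicReference[String] = new AtomicReference("")
    // Hand the parent's reference to the child instead of copying the value.
    override def childValue(parent: AtomicReference[String]): AtomicReference[String] = parent
  }

  def set(name: String): Unit = ref.get().set(name)
  def get(): String = ref.get().get()
}
{code}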



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-28153) Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF)

2019-07-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28153:
-
Comment: was deleted

(was: Issue resolved by pull request 25321
[https://github.com/apache/spark/pull/25321])

> Use AtomicReference at InputFileBlockHolder (to support input_file_name with 
> Python UDF)
> 
>
> Key: SPARK-28153
> URL: https://issues.apache.org/jira/browse/SPARK-28153
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.3, 3.0.0, 2.4.3
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> {code}
> from pyspark.sql.functions import udf, input_file_name
> spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
> spark.read.parquet("/tmp/foo").select(udf(lambda x: x, "long")("id"), 
> input_file_name()).show()
> {code}
> {code}
> +----+-----------------+
> |(id)|input_file_name()|
> +----+-----------------+
> |   8|                 |
> |   5|                 |
> |   0|                 |
> |   9|                 |
> |   6|                 |
> |   2|                 |
> |   3|                 |
> |   4|                 |
> |   7|                 |
> |   1|                 |
> +----+-----------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-28153) Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF)

2019-07-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28153:
-
Comment: was deleted

(was: Issue resolved by pull request 25286
[https://github.com/apache/spark/pull/25286])

> Use AtomicReference at InputFileBlockHolder (to support input_file_name with 
> Python UDF)
> 
>
> Key: SPARK-28153
> URL: https://issues.apache.org/jira/browse/SPARK-28153
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.3, 3.0.0, 2.4.3
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> {code}
> from pyspark.sql.functions import udf, input_file_name
> spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
> spark.read.parquet("/tmp/foo").select(udf(lambda x: x, "long")("id"), 
> input_file_name()).show()
> {code}
> {code}
> +----+-----------------+
> |(id)|input_file_name()|
> +----+-----------------+
> |   8|                 |
> |   5|                 |
> |   0|                 |
> |   9|                 |
> |   6|                 |
> |   2|                 |
> |   3|                 |
> |   4|                 |
> |   7|                 |
> |   1|                 |
> +----+-----------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28153) Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF)

2019-07-31 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897779#comment-16897779
 ] 

Hyukjin Kwon commented on SPARK-28153:
--

Fixed in https://github.com/apache/spark/pull/24958

> Use AtomicReference at InputFileBlockHolder (to support input_file_name with 
> Python UDF)
> 
>
> Key: SPARK-28153
> URL: https://issues.apache.org/jira/browse/SPARK-28153
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.3, 3.0.0, 2.4.3
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> {code}
> from pyspark.sql.functions import udf, input_file_name
> spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
> spark.read.parquet("/tmp/foo").select(udf(lambda x: x, "long")("id"), 
> input_file_name()).show()
> {code}
> {code}
> +----+-----------------+
> |(id)|input_file_name()|
> +----+-----------------+
> |   8|                 |
> |   5|                 |
> |   0|                 |
> |   9|                 |
> |   6|                 |
> |   2|                 |
> |   3|                 |
> |   4|                 |
> |   7|                 |
> |   1|                 |
> +----+-----------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28153) Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF)

2019-07-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-28153:


Assignee: Hyukjin Kwon

> Use AtomicReference at InputFileBlockHolder (to support input_file_name with 
> Python UDF)
> 
>
> Key: SPARK-28153
> URL: https://issues.apache.org/jira/browse/SPARK-28153
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.3, 3.0.0, 2.4.3
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> {code}
> from pyspark.sql.functions import udf, input_file_name
> spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
> spark.read.parquet("/tmp/foo").select(udf(lambda x: x, "long")("id"), 
> input_file_name()).show()
> {code}
> {code}
> +----+-----------------+
> |(id)|input_file_name()|
> +----+-----------------+
> |   8|                 |
> |   5|                 |
> |   0|                 |
> |   9|                 |
> |   6|                 |
> |   2|                 |
> |   3|                 |
> |   4|                 |
> |   7|                 |
> |   1|                 |
> +----+-----------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-28153) Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF)

2019-07-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-28153:
--
  Assignee: (was: Hyukjin Kwon)

> Use AtomicReference at InputFileBlockHolder (to support input_file_name with 
> Python UDF)
> 
>
> Key: SPARK-28153
> URL: https://issues.apache.org/jira/browse/SPARK-28153
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.3, 3.0.0, 2.4.3
>Reporter: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> {code}
> from pyspark.sql.functions import udf, input_file_name
> spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
> spark.read.parquet("/tmp/foo").select(udf(lambda x: x, "long")("id"), 
> input_file_name()).show()
> {code}
> {code}
> +----+-----------------+
> |(id)|input_file_name()|
> +----+-----------------+
> |   8|                 |
> |   5|                 |
> |   0|                 |
> |   9|                 |
> |   6|                 |
> |   2|                 |
> |   3|                 |
> |   4|                 |
> |   7|                 |
> |   1|                 |
> +----+-----------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28153) Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF)

2019-07-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28153:
-
Fix Version/s: (was: 3.0.0)

> Use AtomicReference at InputFileBlockHolder (to support input_file_name with 
> Python UDF)
> 
>
> Key: SPARK-28153
> URL: https://issues.apache.org/jira/browse/SPARK-28153
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.3, 3.0.0, 2.4.3
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> from pyspark.sql.functions import udf, input_file_name
> spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
> spark.read.parquet("/tmp/foo").select(udf(lambda x: x, "long")("id"), 
> input_file_name()).show()
> {code}
> {code}
> +----+-----------------+
> |(id)|input_file_name()|
> +----+-----------------+
> |   8|                 |
> |   5|                 |
> |   0|                 |
> |   9|                 |
> |   6|                 |
> |   2|                 |
> |   3|                 |
> |   4|                 |
> |   7|                 |
> |   1|                 |
> +----+-----------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28153) Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF)

2019-07-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-28153:


Assignee: Hyukjin Kwon

> Use AtomicReference at InputFileBlockHolder (to support input_file_name with 
> Python UDF)
> 
>
> Key: SPARK-28153
> URL: https://issues.apache.org/jira/browse/SPARK-28153
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.3, 3.0.0, 2.4.3
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> {code}
> from pyspark.sql.functions import udf, input_file_name
> spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
> spark.read.parquet("/tmp/foo").select(udf(lambda x: x, "long")("id"), 
> input_file_name()).show()
> {code}
> {code}
> +----+-----------------+
> |(id)|input_file_name()|
> +----+-----------------+
> |   8|                 |
> |   5|                 |
> |   0|                 |
> |   9|                 |
> |   6|                 |
> |   2|                 |
> |   3|                 |
> |   4|                 |
> |   7|                 |
> |   1|                 |
> +----+-----------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28153) Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF)

2019-07-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28153:
-
Fix Version/s: (was: 3.0.0)

> Use AtomicReference at InputFileBlockHolder (to support input_file_name with 
> Python UDF)
> 
>
> Key: SPARK-28153
> URL: https://issues.apache.org/jira/browse/SPARK-28153
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.3, 3.0.0, 2.4.3
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> from pyspark.sql.functions import udf, input_file_name
> spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
> spark.read.parquet("/tmp/foo").select(udf(lambda x: x, "long")("id"), 
> input_file_name()).show()
> {code}
> {code}
> +----+-----------------+
> |(id)|input_file_name()|
> +----+-----------------+
> |   8|                 |
> |   5|                 |
> |   0|                 |
> |   9|                 |
> |   6|                 |
> |   2|                 |
> |   3|                 |
> |   4|                 |
> |   7|                 |
> |   1|                 |
> +----+-----------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28153) Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF)

2019-07-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28153.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25321
[https://github.com/apache/spark/pull/25321]

> Use AtomicReference at InputFileBlockHolder (to support input_file_name with 
> Python UDF)
> 
>
> Key: SPARK-28153
> URL: https://issues.apache.org/jira/browse/SPARK-28153
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.3, 3.0.0, 2.4.3
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> {code}
> from pyspark.sql.functions import udf, input_file_name
> spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
> spark.read.parquet("/tmp/foo").select(udf(lambda x: x, "long")("id"), 
> input_file_name()).show()
> {code}
> {code}
> +----+-----------------+
> |(id)|input_file_name()|
> +----+-----------------+
> |   8|                 |
> |   5|                 |
> |   0|                 |
> |   9|                 |
> |   6|                 |
> |   2|                 |
> |   3|                 |
> |   4|                 |
> |   7|                 |
> |   1|                 |
> +----+-----------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-28153) Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF)

2019-07-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-28153:
--
  Assignee: (was: Hyukjin Kwon)

> Use AtomicReference at InputFileBlockHolder (to support input_file_name with 
> Python UDF)
> 
>
> Key: SPARK-28153
> URL: https://issues.apache.org/jira/browse/SPARK-28153
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.3, 3.0.0, 2.4.3
>Reporter: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> {code}
> from pyspark.sql.functions import udf, input_file_name
> spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
> spark.read.parquet("/tmp/foo").select(udf(lambda x: x, "long")("id"), 
> input_file_name()).show()
> {code}
> {code}
> +----+-----------------+
> |(id)|input_file_name()|
> +----+-----------------+
> |   8|                 |
> |   5|                 |
> |   0|                 |
> |   9|                 |
> |   6|                 |
> |   2|                 |
> |   3|                 |
> |   4|                 |
> |   7|                 |
> |   1|                 |
> +----+-----------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28153) Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF)

2019-07-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28153.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25321
[https://github.com/apache/spark/pull/25321]

> Use AtomicReference at InputFileBlockHolder (to support input_file_name with 
> Python UDF)
> 
>
> Key: SPARK-28153
> URL: https://issues.apache.org/jira/browse/SPARK-28153
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.3, 3.0.0, 2.4.3
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> {code}
> from pyspark.sql.functions import udf, input_file_name
> spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
> spark.read.parquet("/tmp/foo").select(udf(lambda x: x, "long")("id"), 
> input_file_name()).show()
> {code}
> {code}
> +----+-----------------+
> |(id)|input_file_name()|
> +----+-----------------+
> |   8|                 |
> |   5|                 |
> |   0|                 |
> |   9|                 |
> |   6|                 |
> |   2|                 |
> |   3|                 |
> |   4|                 |
> |   7|                 |
> |   1|                 |
> +----+-----------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-28153) Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF)

2019-07-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28153:
-
Comment: was deleted

(was: Issue resolved by pull request 25321
[https://github.com/apache/spark/pull/25321])

> Use AtomicReference at InputFileBlockHolder (to support input_file_name with 
> Python UDF)
> 
>
> Key: SPARK-28153
> URL: https://issues.apache.org/jira/browse/SPARK-28153
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.3, 3.0.0, 2.4.3
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> {code}
> from pyspark.sql.functions import udf, input_file_name
> spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
> spark.read.parquet("/tmp/foo").select(udf(lambda x: x, "long")("id"), 
> input_file_name()).show()
> {code}
> {code}
> +----+-----------------+
> |(id)|input_file_name()|
> +----+-----------------+
> |   8|                 |
> |   5|                 |
> |   0|                 |
> |   9|                 |
> |   6|                 |
> |   2|                 |
> |   3|                 |
> |   4|                 |
> |   7|                 |
> |   1|                 |
> +----+-----------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28586) Make merge-spark-pr script compatible with Python 3

2019-07-31 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-28586:


 Summary: Make merge-spark-pr script compatible with Python 3
 Key: SPARK-28586
 URL: https://issues.apache.org/jira/browse/SPARK-28586
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


merge-spark-pr.py is not Python 3 compatible. Committers are the ones most used 
to this script, so I will handle it separately.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27888) Python 2->3 migration guide for PySpark users

2019-07-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27888.
--
Resolution: Won't Fix

> Python 2->3 migration guide for PySpark users
> -
>
> Key: SPARK-27888
> URL: https://issues.apache.org/jira/browse/SPARK-27888
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> We might need a short Python 2->3 migration guide for PySpark users. It 
> doesn't need to be comprehensive given the many Python 2->3 migration guides 
> already available. We just need some pointers and list items that are 
> specific to PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-28153) Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF)

2019-07-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-28153:
--
  Assignee: (was: Hyukjin Kwon)

> Use AtomicReference at InputFileBlockHolder (to support input_file_name with 
> Python UDF)
> 
>
> Key: SPARK-28153
> URL: https://issues.apache.org/jira/browse/SPARK-28153
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.3, 3.0.0, 2.4.3
>Reporter: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> {code}
> from pyspark.sql.functions import udf, input_file_name
> spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
> spark.read.parquet("/tmp/foo").select(udf(lambda x: x, "long")("id"), 
> input_file_name()).show()
> {code}
> {code}
> +----+-----------------+
> |(id)|input_file_name()|
> +----+-----------------+
> |   8|                 |
> |   5|                 |
> |   0|                 |
> |   9|                 |
> |   6|                 |
> |   2|                 |
> |   3|                 |
> |   4|                 |
> |   7|                 |
> |   1|                 |
> +----+-----------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28153) Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF)

2019-07-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28153:
-
Fix Version/s: (was: 3.0.0)

> Use AtomicReference at InputFileBlockHolder (to support input_file_name with 
> Python UDF)
> 
>
> Key: SPARK-28153
> URL: https://issues.apache.org/jira/browse/SPARK-28153
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.3, 3.0.0, 2.4.3
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> from pyspark.sql.functions import udf, input_file_name
> spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
> spark.read.parquet("/tmp/foo").select(udf(lambda x: x, "long")("id"), 
> input_file_name()).show()
> {code}
> {code}
> +----+-----------------+
> |(id)|input_file_name()|
> +----+-----------------+
> |   8|                 |
> |   5|                 |
> |   0|                 |
> |   9|                 |
> |   6|                 |
> |   2|                 |
> |   3|                 |
> |   4|                 |
> |   7|                 |
> |   1|                 |
> +----+-----------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28153) Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF)

2019-07-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-28153:


Assignee: Hyukjin Kwon

> Use AtomicReference at InputFileBlockHolder (to support input_file_name with 
> Python UDF)
> 
>
> Key: SPARK-28153
> URL: https://issues.apache.org/jira/browse/SPARK-28153
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.3, 3.0.0, 2.4.3
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> {code}
> from pyspark.sql.functions import udf, input_file_name
> spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
> spark.read.parquet("/tmp/foo").select(udf(lambda x: x, "long")("id"), 
> input_file_name()).show()
> {code}
> {code}
> +----+-----------------+
> |(id)|input_file_name()|
> +----+-----------------+
> |   8|                 |
> |   5|                 |
> |   0|                 |
> |   9|                 |
> |   6|                 |
> |   2|                 |
> |   3|                 |
> |   4|                 |
> |   7|                 |
> |   1|                 |
> +----+-----------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28153) Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF)

2019-07-31 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897758#comment-16897758
 ] 

Hyukjin Kwon commented on SPARK-28153:
--

I am testing the Python 3 compatibility of the merging script. Let me reopen and 
resolve this issue. Please ignore this noise.

> Use AtomicReference at InputFileBlockHolder (to support input_file_name with 
> Python UDF)
> 
>
> Key: SPARK-28153
> URL: https://issues.apache.org/jira/browse/SPARK-28153
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.3, 3.0.0, 2.4.3
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> {code}
> from pyspark.sql.functions import udf, input_file_name
> spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
> spark.read.parquet("/tmp/foo").select(udf(lambda x: x, "long")("id"), 
> input_file_name()).show()
> {code}
> {code}
> +----+-----------------+
> |(id)|input_file_name()|
> +----+-----------------+
> |   8|                 |
> |   5|                 |
> |   0|                 |
> |   9|                 |
> |   6|                 |
> |   2|                 |
> |   3|                 |
> |   4|                 |
> |   7|                 |
> |   1|                 |
> +----+-----------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28471) Formatting dates with negative years

2019-07-31 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-28471:

Labels:   (was: correctness)

> Formatting dates with negative years
> 
>
> Key: SPARK-28471
> URL: https://issues.apache.org/jira/browse/SPARK-28471
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> While converting dates with negative years to strings, Spark skips the era 
> sub-field by default. That can confuse users since years from the BC era are 
> mirrored into the current era. For example:
> {code}
> spark-sql> select make_date(-44, 3, 15);
> 0045-03-15
> {code}
> Even though negative years are outside the range supported by the DATE type, 
> it would be nice to indicate the era for such dates.
> PostgreSQL outputs the era for such inputs:
> {code}
> # select make_date(-44, 3, 15);
>    make_date   
> ---------------
>  0044-03-15 BC
> (1 row)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28471) Formatting dates with negative years

2019-07-31 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-28471:

Labels: correctness  (was: )

> Formatting dates with negative years
> 
>
> Key: SPARK-28471
> URL: https://issues.apache.org/jira/browse/SPARK-28471
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>  Labels: correctness
> Fix For: 3.0.0
>
>
> While converting dates with negative years to strings, Spark skips the era 
> sub-field by default. That can confuse users since years from the BC era are 
> mirrored into the current era. For example:
> {code}
> spark-sql> select make_date(-44, 3, 15);
> 0045-03-15
> {code}
> Even though negative years are outside the range supported by the DATE type, 
> it would be nice to indicate the era for such dates.
> PostgreSQL outputs the era for such inputs:
> {code}
> # select make_date(-44, 3, 15);
>    make_date   
> ---------------
>  0044-03-15 BC
> (1 row)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28153) Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF)

2019-07-31 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28153:
--
Summary: Use AtomicReference at InputFileBlockHolder (to support 
input_file_name with Python UDF)  (was: input_file_name doesn't work with 
Python UDF in the same project)

> Use AtomicReference at InputFileBlockHolder (to support input_file_name with 
> Python UDF)
> 
>
> Key: SPARK-28153
> URL: https://issues.apache.org/jira/browse/SPARK-28153
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.3, 3.0.0, 2.4.3
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> {code}
> from pyspark.sql.functions import udf, input_file_name
> spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
> spark.read.parquet("/tmp/foo").select(udf(lambda x: x, "long")("id"), 
> input_file_name()).show()
> {code}
> {code}
> +----+-----------------+
> |(id)|input_file_name()|
> +----+-----------------+
> |   8|                 |
> |   5|                 |
> |   0|                 |
> |   9|                 |
> |   6|                 |
> |   2|                 |
> |   3|                 |
> |   4|                 |
> |   7|                 |
> |   1|                 |
> +----+-----------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28153) input_file_name doesn't work with Python UDF in the same project

2019-07-31 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28153:
--
Affects Version/s: 2.4.3

> input_file_name doesn't work with Python UDF in the same project
> 
>
> Key: SPARK-28153
> URL: https://issues.apache.org/jira/browse/SPARK-28153
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> {code}
> from pyspark.sql.functions import udf, input_file_name
> spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
> spark.read.parquet("/tmp/foo").select(udf(lambda x: x, "long")("id"), 
> input_file_name()).show()
> {code}
> {code}
> +----+-----------------+
> |(id)|input_file_name()|
> +----+-----------------+
> |   8|                 |
> |   5|                 |
> |   0|                 |
> |   9|                 |
> |   6|                 |
> |   2|                 |
> |   3|                 |
> |   4|                 |
> |   7|                 |
> |   1|                 |
> +----+-----------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28153) input_file_name doesn't work with Python UDF in the same project

2019-07-31 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28153:
--
Affects Version/s: 2.3.3

> input_file_name doesn't work with Python UDF in the same project
> 
>
> Key: SPARK-28153
> URL: https://issues.apache.org/jira/browse/SPARK-28153
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.3, 3.0.0, 2.4.3
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> {code}
> from pyspark.sql.functions import udf, input_file_name
> spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
> spark.read.parquet("/tmp/foo").select(udf(lambda x: x, "long")("id"), 
> input_file_name()).show()
> {code}
> {code}
> +----+-----------------+
> |(id)|input_file_name()|
> +----+-----------------+
> |   8|                 |
> |   5|                 |
> |   0|                 |
> |   9|                 |
> |   6|                 |
> |   2|                 |
> |   3|                 |
> |   4|                 |
> |   7|                 |
> |   1|                 |
> +----+-----------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24352) Flaky test: StandaloneDynamicAllocationSuite

2019-07-31 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-24352.
---
   Resolution: Fixed
Fix Version/s: 2.4.4
   2.3.4
   3.0.0

Issue resolved by pull request 25318
[https://github.com/apache/spark/pull/25318]

> Flaky test: StandaloneDynamicAllocationSuite
> 
>
> Key: SPARK-24352
> URL: https://issues.apache.org/jira/browse/SPARK-24352
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Major
> Fix For: 3.0.0, 2.3.4, 2.4.4
>
>
> From jenkins:
> [https://amplab.cs.berkeley.edu/jenkins/user/vanzin/my-views/view/Spark/job/spark-branch-2.3-test-maven-hadoop-2.6/384/testReport/junit/org.apache.spark.deploy/StandaloneDynamicAllocationSuite/executor_registration_on_a_blacklisted_host_must_fail/]
>  
> {noformat}
> Error Message
> There is already an RpcEndpoint called CoarseGrainedScheduler
> Stacktrace
>   java.lang.IllegalArgumentException: There is already an RpcEndpoint 
> called CoarseGrainedScheduler
>   at 
> org.apache.spark.rpc.netty.Dispatcher.registerRpcEndpoint(Dispatcher.scala:71)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.setupEndpoint(NettyRpcEnv.scala:130)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.createDriverEndpointRef(CoarseGrainedSchedulerBackend.scala:396)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.start(CoarseGrainedSchedulerBackend.scala:391)
>   at 
> org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.start(StandaloneSchedulerBackend.scala:61)
>   at 
> org.apache.spark.deploy.StandaloneDynamicAllocationSuite$$anonfun$1.apply$mcV$sp(StandaloneDynamicAllocationSuite.scala:512)
>   at 
> org.apache.spark.deploy.StandaloneDynamicAllocationSuite$$anonfun$1.apply(StandaloneDynamicAllocationSuite.scala:495)
>   at 
> org.apache.spark.deploy.StandaloneDynamicAllocationSuite$$anonfun$1.apply(StandaloneDynamicAllocationSuite.scala:495)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
> {noformat}
> This actually looks like a previous test is leaving some stuff running and 
> making this one fail.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24352) Flaky test: StandaloneDynamicAllocationSuite

2019-07-31 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-24352:
-

Assignee: Marcelo Vanzin

> Flaky test: StandaloneDynamicAllocationSuite
> 
>
> Key: SPARK-24352
> URL: https://issues.apache.org/jira/browse/SPARK-24352
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Major
>
> From jenkins:
> [https://amplab.cs.berkeley.edu/jenkins/user/vanzin/my-views/view/Spark/job/spark-branch-2.3-test-maven-hadoop-2.6/384/testReport/junit/org.apache.spark.deploy/StandaloneDynamicAllocationSuite/executor_registration_on_a_blacklisted_host_must_fail/]
>  
> {noformat}
> Error Message
> There is already an RpcEndpoint called CoarseGrainedScheduler
> Stacktrace
>   java.lang.IllegalArgumentException: There is already an RpcEndpoint 
> called CoarseGrainedScheduler
>   at 
> org.apache.spark.rpc.netty.Dispatcher.registerRpcEndpoint(Dispatcher.scala:71)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.setupEndpoint(NettyRpcEnv.scala:130)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.createDriverEndpointRef(CoarseGrainedSchedulerBackend.scala:396)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.start(CoarseGrainedSchedulerBackend.scala:391)
>   at 
> org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.start(StandaloneSchedulerBackend.scala:61)
>   at 
> org.apache.spark.deploy.StandaloneDynamicAllocationSuite$$anonfun$1.apply$mcV$sp(StandaloneDynamicAllocationSuite.scala:512)
>   at 
> org.apache.spark.deploy.StandaloneDynamicAllocationSuite$$anonfun$1.apply(StandaloneDynamicAllocationSuite.scala:495)
>   at 
> org.apache.spark.deploy.StandaloneDynamicAllocationSuite$$anonfun$1.apply(StandaloneDynamicAllocationSuite.scala:495)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
> {noformat}
> This actually looks like a previous test is leaving some stuff running and 
> making this one fail.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28564) Access history application defaults to the last attempt id

2019-07-31 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-28564.

   Resolution: Fixed
Fix Version/s: 2.4.4
   3.0.0

Issue resolved by pull request 25301
[https://github.com/apache/spark/pull/25301]

> Access history application defaults to the last attempt id
> --
>
> Key: SPARK-28564
> URL: https://issues.apache.org/jira/browse/SPARK-28564
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Trivial
> Fix For: 3.0.0, 2.4.4
>
>
> When we set spark.history.ui.maxApplications to a small value, we can't find 
> some apps through the page search.
> If the URL is constructed by hand (http://localhost:18080/history/local-xxx), 
> the app can still be accessed as long as it has no attempt id.
> But for apps with multiple attempts, such a URL cannot be accessed, and the 
> page displays Not Found.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28564) Access history application defaults to the last attempt id

2019-07-31 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-28564:
--

Assignee: dzcxzl

> Access history application defaults to the last attempt id
> --
>
> Key: SPARK-28564
> URL: https://issues.apache.org/jira/browse/SPARK-28564
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Trivial
>
> When we set spark.history.ui.maxApplications to a small value, we can't find 
> some apps through the page search.
> If the URL is constructed by hand (http://localhost:18080/history/local-xxx), 
> the app can still be accessed as long as it has no attempt id.
> But for apps with multiple attempts, such a URL cannot be accessed, and the 
> page displays Not Found.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28585) Improve WebUI DAG information: Add extra info to rdd from spark plan

2019-07-31 Thread Pablo Langa Blanco (JIRA)
Pablo Langa Blanco created SPARK-28585:
--

 Summary: Improve WebUI DAG information: Add extra info to rdd from 
spark plan
 Key: SPARK-28585
 URL: https://issues.apache.org/jira/browse/SPARK-28585
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, Web UI
Affects Versions: 3.0.0
Reporter: Pablo Langa Blanco


The main improvement I want to achieve is to help developers explore the DAG 
information in the Web UI for complex flows.

Sometimes it is very difficult to know which part of your Spark plan corresponds 
to the DAG you are looking at.

This is an initial improvement for only one simple Spark plan type (UnionExec).

If you consider it a good idea, I want to extend it to other Spark plans to 
improve the visualization iteratively.

More info in the pull request.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28581) Replace _FUNC_ in UDF ExpressionInfo

2019-07-31 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-28581:
-

Assignee: Yuming Wang

> Replace _FUNC_ in UDF ExpressionInfo
> 
>
> Key: SPARK-28581
> URL: https://issues.apache.org/jira/browse/SPARK-28581
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> This issue aims to move {{replaceFunctionName(usage: String, functionName: 
> String)}} from {{DescribeFunctionCommand}} to {{ExpressionInfo}} in order to 
> make {{ExpressionInfo}} return the actual name instead of a placeholder. We 
> can get {{ExpressionInfo}}s directly through the 
> {{SessionCatalog.lookupFunctionInfo}} API and get the real names.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28581) Replace _FUNC_ in UDF ExpressionInfo

2019-07-31 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28581.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25314
[https://github.com/apache/spark/pull/25314]

> Replace _FUNC_ in UDF ExpressionInfo
> 
>
> Key: SPARK-28581
> URL: https://issues.apache.org/jira/browse/SPARK-28581
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> This issue aims to move {{replaceFunctionName(usage: String, functionName: 
> String)}} from {{DescribeFunctionCommand}} to {{ExpressionInfo}} in order to 
> make {{ExpressionInfo}} return the actual name instead of a placeholder. We 
> can get {{ExpressionInfo}}s directly through the 
> {{SessionCatalog.lookupFunctionInfo}} API and get the real names.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28581) Replace _FUNC_ in UDF ExpressionInfo

2019-07-31 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28581:
--
Description: This issue aims to move {{replaceFunctionName(usage: String, 
functionName: String)}} from {{DescribeFunctionCommand}} to {{ExpressionInfo}} 
in order to make {{ExpressionInfo}} return the actual name instead of a 
placeholder. We can get {{ExpressionInfo}}s directly through the 
{{SessionCatalog.lookupFunctionInfo}} API and get the real names.  (was: Moves 
{{replaceFunctionName(usage: String, functionName: String)}} from 
{{DescribeFunctionCommand}} to {{ExpressionInfo}}.)
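
As a minimal sketch (an assumption about the shape of the change, not the merged 
patch), the placeholder substitution that ExpressionInfo could perform on access 
looks roughly like this; the class and method names below are illustrative only:
{code:java}
// Illustrative sketch: keep the raw usage text with the _FUNC_ placeholder and
// substitute the registered function name when the usage string is requested.
class ExpressionInfoSketch(functionName: String, rawUsage: String) {
  def getUsage: String =
    if (rawUsage == null) "N/A." else rawUsage.replaceAll("_FUNC_", functionName)
}

// Example: prints "upper(str) - Returns str with all characters changed to uppercase."
object Demo extends App {
  println(new ExpressionInfoSketch(
    "upper",
    "_FUNC_(str) - Returns str with all characters changed to uppercase.").getUsage)
}
{code}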

> Replace _FUNC_ in UDF ExpressionInfo
> 
>
> Key: SPARK-28581
> URL: https://issues.apache.org/jira/browse/SPARK-28581
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> This issue aims to move {{replaceFunctionName(usage: String, functionName: 
> String)}} from {{DescribeFunctionCommand}} to {{ExpressionInfo}} in order to 
> make {{ExpressionInfo}} return the actual name instead of a placeholder. We 
> can get {{ExpressionInfo}}s directly through the 
> {{SessionCatalog.lookupFunctionInfo}} API and get the real names.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28333) NULLS FIRST for DESC and NULLS LAST for ASC

2019-07-31 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28333.
---
Resolution: Won't Do

Please see the discussion and review comments on the PR.

> NULLS FIRST for DESC and NULLS LAST for ASC
> ---
>
> Key: SPARK-28333
> URL: https://issues.apache.org/jira/browse/SPARK-28333
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:sql}
> spark-sql> create or replace temporary view t1 as select * from (values(1), 
> (2), (null), (3), (null)) as v (val);
> spark-sql> select * from t1 order by val asc;
> NULL
> NULL
> 1
> 2
> 3
> spark-sql> select * from t1 order by val desc;
> 3
> 2
> 1
> NULL
> NULL
> {code}
> {code:sql}
> postgres=# create or replace temporary view t1 as select * from (values(1), 
> (2), (null), (3), (null)) as v (val);
> CREATE VIEW
> postgres=# select * from t1 order by val asc;
>  val
> -----
>    1
>    2
>    3
> (5 rows)
> postgres=# select * from t1 order by val desc;
>  val
> -----
>    3
>    2
>    1
> (5 rows)
> {code}
> https://www.postgresql.org/docs/11/queries-order.html



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28584) Flaky test: org.apache.spark.scheduler.TaskSchedulerImplSuite

2019-07-31 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-28584:
--

 Summary: Flaky test: 
org.apache.spark.scheduler.TaskSchedulerImplSuite
 Key: SPARK-28584
 URL: https://issues.apache.org/jira/browse/SPARK-28584
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 3.0.0
Reporter: Marcelo Vanzin


This is another of those tests that don't seem to fail in PRs here, but fail 
more often than we'd like in our build machines. In this case it fails in 
several different ways, e.g.:

{noformat}
org.scalatest.exceptions.TestFailedException: 
Map(org.apache.spark.scheduler.TaskSetManager$$EnhancerByMockitoWithCGLIB$$c676cf51@412f9d43
 -> 1550579875956) did not contain key 
org.apache.spark.scheduler.TaskSetManager$$EnhancerByMockitoWithCGLIB$$c676cf51@1945f15f
  at 
org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
  at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
  at 
org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
  at 
org.apache.spark.scheduler.TaskSchedulerImplSuite$$anonfun$21.apply(TaskSchedulerImplSuite.scala:635)
  at 
org.apache.spark.scheduler.TaskSchedulerImplSuite$$anonfun$21.apply(TaskSchedulerImplSuite.scala:591)
{noformat}

Or:

{noformat}
The code passed to eventually never returned normally. Attempted 40 times over 
503.217543 milliseconds. Last failure message: tsm.isZombie was false.

Error message:
org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
eventually never returned normally. Attempted 40 times over 503.217543 
milliseconds. Last failure message: tsm.isZombie was false.
  at 
org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421)
  at 
org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439)
  at 
org.apache.spark.scheduler.TaskSchedulerImplSuite.eventually(TaskSchedulerImplSuite.scala:44)
  at 
org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:337)
  at 
org.apache.spark.scheduler.TaskSchedulerImplSuite.eventually(TaskSchedulerImplSuite.scala:44)
  at 
org.apache.spark.scheduler.TaskSchedulerImplSuite$$anonfun$18.apply(TaskSchedulerImplSuite.scala:543)
  at 
org.apache.spark.scheduler.TaskSchedulerImplSuite$$anonfun$18.apply(TaskSchedulerImplSuite.scala:511)
{noformat}

There's a race condition in the test that can cause these different failures.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25185) CBO rowcount statistics doesn't work for partitioned parquet external table

2019-07-31 Thread Amit (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897305#comment-16897305
 ] 

Amit commented on SPARK-25185:
--

Yes, it seems to work, but with a lot of caveats, like the one above, up until 
2.3; not sure about 2.4 though.

> CBO rowcount statistics doesn't work for partitioned parquet external table
> ---
>
> Key: SPARK-25185
> URL: https://issues.apache.org/jira/browse/SPARK-25185
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1, 2.3.0
> Environment:  
> Tried on Ubuntu, FreeBSD and Windows, running spark-shell in local mode, 
> reading data from the local file system
>Reporter: Amit
>Priority: Major
>
> Created dummy partitioned data with a string partition column (col1=a and 
> col1=b): added CSV data -> read it through Spark -> created a partitioned 
> external table -> ran msck repair table to load the partitions. Ran ANALYZE 
> on all columns, including the partition column.
> println(spark.sql("select * from test_p where e='1a'").queryExecution.toStringWithStats)
> val op = spark.sql("select * from test_p where e='1a'").queryExecution.optimizedPlan
> // e is the partition column
> val stat = op.stats(spark.sessionState.conf)
> print(stat.rowCount)
>  
> Created the same table in Parquet: the row count comes up correctly for the 
> CSV table, but for the Parquet table it shows as None.
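
For reference, a spark-shell style sketch that consolidates the reproduction steps described above (the table name {{test_p}} and partition column {{e}} come from the report; the CBO flag and exact statements are assumptions, and the stats accessor differs by version, {{stats(conf)}} on 2.2 vs {{stats}} on 2.3+):
{code:scala}
// Assumes the partitioned external table test_p already exists (see above).
spark.sql("SET spark.sql.cbo.enabled=true")
spark.sql("ANALYZE TABLE test_p COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE test_p COMPUTE STATISTICS FOR COLUMNS e")

val op = spark.sql("SELECT * FROM test_p WHERE e = '1a'").queryExecution.optimizedPlan
// On Spark 2.2 use op.stats(spark.sessionState.conf); on 2.3+ it is simply op.stats.
println(op.stats(spark.sessionState.conf).rowCount)
{code}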



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28583) Subqueries should not call `onUpdatePlan` in Adaptive Query Execution

2019-07-31 Thread Maryann Xue (JIRA)
Maryann Xue created SPARK-28583:
---

 Summary: Subqueries should not call `onUpdatePlan` in Adaptive 
Query Execution
 Key: SPARK-28583
 URL: https://issues.apache.org/jira/browse/SPARK-28583
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maryann Xue


Subqueries do not have their own execution id, so when 
{{AdaptiveSparkPlanExec.onUpdatePlan}} is called it actually gets the 
{{QueryExecution}} instance of the main query, which is wasteful and 
problematic. It can cause issues like stack overflows or deadlocks in some 
circumstances.
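
A hedged sketch of the direction a guard could take (the object and method below are illustrative, not the actual patch): skip plan-update callbacks when the current thread has no SQL execution id set, which is the case while a subquery is being executed.
{code:scala}
import org.apache.spark.sql.SparkSession

object PlanUpdateGuard {
  // Thread-local property under which Spark SQL tracks the current execution id.
  private val ExecutionIdKey = "spark.sql.execution.id"

  // True only for the main query, whose thread carries an execution id;
  // subqueries run without one, so UI plan-change notifications can be skipped.
  def shouldNotifyPlanChange(session: SparkSession): Boolean =
    Option(session.sparkContext.getLocalProperty(ExecutionIdKey)).isDefined
}
{code}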



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28582) Pyspark daemon exit failed when receive SIGTERM on py3.7

2019-07-31 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-28582:
--

 Summary: Pyspark daemon exit failed when receive SIGTERM on py3.7
 Key: SPARK-28582
 URL: https://issues.apache.org/jira/browse/SPARK-28582
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.3
Reporter: Weichen Xu


The PySpark daemon fails to exit when it receives SIGTERM on Python 3.7.

We can run the test on Python 3.7 like:
{code}
python/run-tests --python-executables=python3.7 --testname 
"pyspark.tests.test_daemon DaemonTests"
{code}

It will fail on the test "test_termination_sigterm", and we can see that the 
daemon process does not exit.

This issue happens on Python 3.7, but lower Python versions work fine.




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28153) input_file_name doesn't work with Python UDF in the same project

2019-07-31 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-28153.
-
   Resolution: Fixed
 Assignee: Hyukjin Kwon
Fix Version/s: 3.0.0

> input_file_name doesn't work with Python UDF in the same project
> 
>
> Key: SPARK-28153
> URL: https://issues.apache.org/jira/browse/SPARK-28153
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> {code}
> from pyspark.sql.functions import udf, input_file_name
> spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
> spark.read.parquet("/tmp/foo").select(udf(lambda x: x, "long")("id"), 
> input_file_name()).show()
> {code}
> {code}
> +------------+-----------------+
> |<lambda>(id)|input_file_name()|
> +------------+-----------------+
> |           8|                 |
> |           5|                 |
> |           0|                 |
> |           9|                 |
> |           6|                 |
> |           2|                 |
> |           3|                 |
> |           4|                 |
> |           7|                 |
> |           1|                 |
> +------------+-----------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28581) Replace _FUNC_ in UDF ExpressionInfo

2019-07-31 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28581:

Description: Moves {{replaceFunctionName(usage: String, functionName: 
String)}} from {{DescribeFunctionCommand}} to {{ExpressionInfo}}.  (was: This PR 
moves {{replaceFunctionName(usage: String, functionName: String)}}
from {{DescribeFunctionCommand}} to {{ExpressionInfo}}.)

> Replace _FUNC_ in UDF ExpressionInfo
> 
>
> Key: SPARK-28581
> URL: https://issues.apache.org/jira/browse/SPARK-28581
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Moves {{replaceFunctionName(usage: String, functionName: String)}} from 
> {{DescribeFunctionCommand}} to {{ExpressionInfo}}.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28581) Replace _FUNC_ in UDF ExpressionInfo

2019-07-31 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28581:

Description: 
This PR moves {{replaceFunctionName(usage: String, functionName: String)}}
from {{DescribeFunctionCommand}} to {{ExpressionInfo}}.

> Replace _FUNC_ in UDF ExpressionInfo
> 
>
> Key: SPARK-28581
> URL: https://issues.apache.org/jira/browse/SPARK-28581
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> This PR moves {{replaceFunctionName(usage: String, functionName: String)}}
> from {{DescribeFunctionCommand}} to {{ExpressionInfo}}.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28581) Replace _FUNC_ in UDF ExpressionInfo

2019-07-31 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-28581:
---

 Summary: Replace _FUNC_ in UDF ExpressionInfo
 Key: SPARK-28581
 URL: https://issues.apache.org/jira/browse/SPARK-28581
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-28547) Make it work for wide (> 10K columns data)

2019-07-31 Thread antonkulaga (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

antonkulaga reopened SPARK-28547:
-

I did not see any solutions. 

> Make it work for wide (> 10K columns data)
> --
>
> Key: SPARK-28547
> URL: https://issues.apache.org/jira/browse/SPARK-28547
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4, 2.4.3
> Environment: Ubuntu server, Spark 2.4.3 Scala with >64GB RAM per 
> node, 32 cores (tried different configurations of executors)
>Reporter: antonkulaga
>Priority: Critical
>
> Spark is extremely slow for all wide data (when there are >15k columns and 
> >15k rows). Most genomics/transcriptomics data is wide because the number of 
> genes is usually >20k and the number of samples as well. The very popular GTEx 
> dataset is a good example (see for instance the RNA-Seq data at 
> https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data, where a gct file 
> is just a .tsv file with two comment lines at the beginning). Everything done 
> on wide tables (even a simple "describe" applied to all the gene columns) 
> either takes hours or freezes (because of lost executors), irrespective of 
> memory and number of cores, while the same operations are fast (minutes) with 
> pure pandas (no Spark involved).



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28547) Make it work for wide (> 10K columns data)

2019-07-31 Thread antonkulaga (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897194#comment-16897194
 ] 

antonkulaga commented on SPARK-28547:
-

[~maropu] I think I was quite clear: even describe works slow as hell. So the 
easiest way to reproduce is just to run describe on all numeric columns in 
GTEX. 
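
For reference, a minimal sketch of that reproduction (the input path is a placeholder, and it assumes the two leading comment lines of the .gct file have been stripped so it reads as a plain tab-separated file with a header row):
{code:scala}
import org.apache.spark.sql.SparkSession

object WideDescribeRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("wide-describe-repro").getOrCreate()

    // Placeholder path to a GTEx RNA-Seq expression matrix saved as plain TSV
    // (tens of thousands of columns once loaded in wide form).
    val wide = spark.read
      .option("sep", "\t")
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/tmp/gtex_rnaseq_wide.tsv")

    // The reported slowdown shows up here: describe() over all columns of a
    // very wide DataFrame takes hours or loses executors.
    wide.describe().show()

    spark.stop()
  }
}
{code}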

> Make it work for wide (> 10K columns data)
> --
>
> Key: SPARK-28547
> URL: https://issues.apache.org/jira/browse/SPARK-28547
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4, 2.4.3
> Environment: Ubuntu server, Spark 2.4.3 Scala with >64GB RAM per 
> node, 32 cores (tried different configurations of executors)
>Reporter: antonkulaga
>Priority: Critical
>
> Spark is extremely slow for all wide data (when there are >15k columns and 
> >15k rows). Most genomics/transcriptomics data is wide because the number of 
> genes is usually >20k and the number of samples as well. The very popular GTEx 
> dataset is a good example (see for instance the RNA-Seq data at 
> https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data, where a gct file 
> is just a .tsv file with two comment lines at the beginning). Everything done 
> on wide tables (even a simple "describe" applied to all the gene columns) 
> either takes hours or freezes (because of lost executors), irrespective of 
> memory and number of cores, while the same operations are fast (minutes) with 
> pure pandas (no Spark involved).



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-27689) Error to execute hive views with spark

2019-07-31 Thread feiwang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-27689:

Comment: was deleted

(was: It seems that this failure is caused by the PR for SPARK-18801, 
https://github.com/apache/spark/pull/16233.)

> Error to execute hive views with spark
> --
>
> Key: SPARK-27689
> URL: https://issues.apache.org/jira/browse/SPARK-27689
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.3, 2.4.3
>Reporter: Lambda
>Priority: Major
>
> I get a Python error when I execute the following code using Hive views, but 
> it works correctly when I run it with Hive tables.
> *Hive databases:*
> {code:java}
> CREATE DATABASE schema_p LOCATION "hdfs:///tmp/schema_p";
> {code}
> *Hive tables:*
> {code:java}
> CREATE TABLE schema_p.product(
>  id_product string,
>  name string,
>  country string,
>  city string,
>  start_date string,
>  end_date string
>  )
>  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
>  STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
>  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
>  LOCATION 'hdfs:///tmp/schema_p/product';
> {code}
> {code:java}
> CREATE TABLE schema_p.person_product(
>  id_person string,
>  id_product string,
>  country string,
>  city string,
>  price string,
>  start_date string,
>  end_date string
>  )
>  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
>  STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
>  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
>  LOCATION 'hdfs:///tmp/schema_p/person_product';
> {code}
> *Hive views:*
> {code:java}
> CREATE VIEW schema_p.product_v AS SELECT CAST(id_product AS INT) AS 
> id_product, name AS name, country AS country, city AS city, CAST(start_date 
> AS DATE) AS start_date, CAST(end_date AS DATE) AS end_date FROM 
> schema_p.product;
>  
> CREATE VIEW schema_p.person_product_v AS SELECT CAST(id_person AS INT) AS 
> id_person, CAST(id_product AS INT) AS id_product, country AS country, city AS 
> city, CAST(price AS DECIMAL(38,8)) AS price, CAST(start_date AS DATE) AS 
> start_date, CAST(end_date AS DATE) AS end_date FROM schema_p.person_product;
> {code}
> *Code*:
> {code:java}
> def read_tables(sc):
>   in_dict = { 'product': 'product_v', 'person_product': 'person_product_v' }
>   data_dict = {}
>   for n, d in in_dict.iteritems():
> data_dict[n] = sc.read.table(d)
>   return data_dict
> def get_population(tables, ref_date_str):
>   product = tables['product']
>   person_product = tables['person_product']
>   count_prod 
> =person_product.groupBy('id_product').agg(F.count('id_product').alias('count_prod'))
>   person_product_join = person_product.join(product,'id_product')
>   person_count = person_product_join.join(count_prod,'id_product')
>   final = person_product_join.join(person_count, 'id_person', 'left')
>   return final
> import pyspark.sql.functions as F
> import functools
> from pyspark.sql.functions import col
> from pyspark.sql.functions import add_months, lit, count, coalesce
> spark.sql('use schema_p')
> data_dict = read_tables(spark)
> data_dict
> population = get_population(data_dict, '2019-04-30')
> population.show()
> {code}
> *Error:*
> {code:java}
> Traceback (most recent call last):
> File "", line 1, in 
> File "", line 10, in get_population
> File "/usr/hdp/current/spark2-client/python/pyspark/sql/dataframe.py", line 
> 931, in join
> jdf = self._jdf.join(other._jdf, on, how)
> File 
> "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py",
>  line 1160, in __call__
> File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 69, 
> in deco
> raise AnalysisException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.AnalysisException: u'Resolved attribute(s) 
> id_person#103,start_date#108,id_product#104,end_date#109,price#107,country#105,city#106
>  missing from 
> price#4,id_product#1,start_date#5,end_date#6,id_person#0,city#3,country#2 in 
> operator !Project [cast(id_person#103 as int) AS id_person#76, 
> cast(id_product#104 as int) AS id_product#77, cast(country#105 as string) AS 
> country#78, cast(city#106 as string) AS city#79, cast(price#107 as 
> decimal(38,8)) AS price#80, cast(start_date#108 as date) AS start_date#81, 
> cast(end_date#109 as date) AS end_date#82]. Attribute(s) with the same name 
> appear in the operation: 
> id_person,start_date,id_product,end_date,price,country,city. Please check if 
> the right attribute(s) are used.;;
> Project [id_person#0, id_product#1, country#2, city#3, price#4, start_date#5, 
> end_date#6, name#29, country#30, city#31, start_date#32, end_date#33, 
> id_product#104, country#105, city#106, price#107, 

[jira] [Issue Comment Deleted] (SPARK-27689) Error to execute hive views with spark

2019-07-31 Thread feiwang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-27689:

Comment: was deleted

(was: You can add a unit test into HiveSQLViewSuite.scala to reproduce it with 
the code below.
```
withTable("ta") {
  withView("va") {
    withView("vb") {
      withView("vc") {
        sql("CREATE TABLE ta (c1 STRING)")
        sql("CREATE VIEW va(c1) AS SELECT * FROM ta")
        sql("CREATE TEMPORARY VIEW vb AS SELECT a.c1 FROM va AS a")
        sql("CREATE TEMPORARY VIEW vc AS SELECT a.c1 FROM vb AS a JOIN vb as b ON a.c1 = b.c1")
        sql("SELECT a.c1 FROM vb as a JOIN vc as b ON a.c1 = b.c1")
      }
    }
  }
}
```)

> Error to execute hive views with spark
> --
>
> Key: SPARK-27689
> URL: https://issues.apache.org/jira/browse/SPARK-27689
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.3, 2.4.3
>Reporter: Lambda
>Priority: Major
>
> I get a Python error when I execute the following code using Hive views, but 
> it works correctly when I run it with Hive tables.
> *Hive databases:*
> {code:java}
> CREATE DATABASE schema_p LOCATION "hdfs:///tmp/schema_p";
> {code}
> *Hive tables:*
> {code:java}
> CREATE TABLE schema_p.product(
>  id_product string,
>  name string,
>  country string,
>  city string,
>  start_date string,
>  end_date string
>  )
>  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
>  STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
>  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
>  LOCATION 'hdfs:///tmp/schema_p/product';
> {code}
> {code:java}
> CREATE TABLE schema_p.person_product(
>  id_person string,
>  id_product string,
>  country string,
>  city string,
>  price string,
>  start_date string,
>  end_date string
>  )
>  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
>  STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
>  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
>  LOCATION 'hdfs:///tmp/schema_p/person_product';
> {code}
> *Hive views:*
> {code:java}
> CREATE VIEW schema_p.product_v AS SELECT CAST(id_product AS INT) AS 
> id_product, name AS name, country AS country, city AS city, CAST(start_date 
> AS DATE) AS start_date, CAST(end_date AS DATE) AS end_date FROM 
> schema_p.product;
>  
> CREATE VIEW schema_p.person_product_v AS SELECT CAST(id_person AS INT) AS 
> id_person, CAST(id_product AS INT) AS id_product, country AS country, city AS 
> city, CAST(price AS DECIMAL(38,8)) AS price, CAST(start_date AS DATE) AS 
> start_date, CAST(end_date AS DATE) AS end_date FROM schema_p.person_product;
> {code}
> *Code*:
> {code:java}
> def read_tables(sc):
>   in_dict = { 'product': 'product_v', 'person_product': 'person_product_v' }
>   data_dict = {}
>   for n, d in in_dict.iteritems():
> data_dict[n] = sc.read.table(d)
>   return data_dict
> def get_population(tables, ref_date_str):
>   product = tables['product']
>   person_product = tables['person_product']
>   count_prod 
> =person_product.groupBy('id_product').agg(F.count('id_product').alias('count_prod'))
>   person_product_join = person_product.join(product,'id_product')
>   person_count = person_product_join.join(count_prod,'id_product')
>   final = person_product_join.join(person_count, 'id_person', 'left')
>   return final
> import pyspark.sql.functions as F
> import functools
> from pyspark.sql.functions import col
> from pyspark.sql.functions import add_months, lit, count, coalesce
> spark.sql('use schema_p')
> data_dict = read_tables(spark)
> data_dict
> population = get_population(data_dict, '2019-04-30')
> population.show()
> {code}
> *Error:*
> {code:java}
> Traceback (most recent call last):
> File "", line 1, in 
> File "", line 10, in get_population
> File "/usr/hdp/current/spark2-client/python/pyspark/sql/dataframe.py", line 
> 931, in join
> jdf = self._jdf.join(other._jdf, on, how)
> File 
> "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py",
>  line 1160, in __call__
> File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 69, 
> in deco
> raise AnalysisException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.AnalysisException: u'Resolved attribute(s) 
> id_person#103,start_date#108,id_product#104,end_date#109,price#107,country#105,city#106
>  missing from 
> price#4,id_product#1,start_date#5,end_date#6,id_person#0,city#3,country#2 in 
> operator !Project [cast(id_person#103 as int) AS id_person#76, 
> cast(id_product#104 as int) AS id_product#77, cast(country#105 as string) AS 
> country#78, cast(city#106 as string) AS city#79, cast(price#107 as 
> decimal(38,8)) AS price#80, 

[jira] [Commented] (SPARK-27689) Error to execute hive views with spark

2019-07-31 Thread feiwang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897120#comment-16897120
 ] 

feiwang commented on SPARK-27689:
-

You can add a unit test into HiveSQLViewSuite.scala to reproduce it with the 
code below.
```
withTable("ta") {
  withView("va") {
    withView("vb") {
      withView("vc") {
        sql("CREATE TABLE ta (c1 STRING)")
        sql("CREATE VIEW va(c1) AS SELECT * FROM ta")
        sql("CREATE TEMPORARY VIEW vb AS SELECT a.c1 FROM va AS a")
        sql("CREATE TEMPORARY VIEW vc AS SELECT a.c1 FROM vb AS a JOIN vb as b ON a.c1 = b.c1")
        sql("SELECT a.c1 FROM vb as a JOIN vc as b ON a.c1 = b.c1")
      }
    }
  }
}
```

> Error to execute hive views with spark
> --
>
> Key: SPARK-27689
> URL: https://issues.apache.org/jira/browse/SPARK-27689
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.3, 2.4.3
>Reporter: Lambda
>Priority: Major
>
> I get a Python error when I execute the following code using Hive views, but 
> it works correctly when I run it with Hive tables.
> *Hive databases:*
> {code:java}
> CREATE DATABASE schema_p LOCATION "hdfs:///tmp/schema_p";
> {code}
> *Hive tables:*
> {code:java}
> CREATE TABLE schema_p.product(
>  id_product string,
>  name string,
>  country string,
>  city string,
>  start_date string,
>  end_date string
>  )
>  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
>  STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
>  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
>  LOCATION 'hdfs:///tmp/schema_p/product';
> {code}
> {code:java}
> CREATE TABLE schema_p.person_product(
>  id_person string,
>  id_product string,
>  country string,
>  city string,
>  price string,
>  start_date string,
>  end_date string
>  )
>  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
>  STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
>  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
>  LOCATION 'hdfs:///tmp/schema_p/person_product';
> {code}
> *Hive views:*
> {code:java}
> CREATE VIEW schema_p.product_v AS SELECT CAST(id_product AS INT) AS 
> id_product, name AS name, country AS country, city AS city, CAST(start_date 
> AS DATE) AS start_date, CAST(end_date AS DATE) AS end_date FROM 
> schema_p.product;
>  
> CREATE VIEW schema_p.person_product_v AS SELECT CAST(id_person AS INT) AS 
> id_person, CAST(id_product AS INT) AS id_product, country AS country, city AS 
> city, CAST(price AS DECIMAL(38,8)) AS price, CAST(start_date AS DATE) AS 
> start_date, CAST(end_date AS DATE) AS end_date FROM schema_p.person_product;
> {code}
> *Code*:
> {code:java}
> def read_tables(sc):
>   in_dict = { 'product': 'product_v', 'person_product': 'person_product_v' }
>   data_dict = {}
>   for n, d in in_dict.iteritems():
> data_dict[n] = sc.read.table(d)
>   return data_dict
> def get_population(tables, ref_date_str):
>   product = tables['product']
>   person_product = tables['person_product']
>   count_prod 
> =person_product.groupBy('id_product').agg(F.count('id_product').alias('count_prod'))
>   person_product_join = person_product.join(product,'id_product')
>   person_count = person_product_join.join(count_prod,'id_product')
>   final = person_product_join.join(person_count, 'id_person', 'left')
>   return final
> import pyspark.sql.functions as F
> import functools
> from pyspark.sql.functions import col
> from pyspark.sql.functions import add_months, lit, count, coalesce
> spark.sql('use schema_p')
> data_dict = read_tables(spark)
> data_dict
> population = get_population(data_dict, '2019-04-30')
> population.show()
> {code}
> *Error:*
> {code:java}
> Traceback (most recent call last):
> File "", line 1, in 
> File "", line 10, in get_population
> File "/usr/hdp/current/spark2-client/python/pyspark/sql/dataframe.py", line 
> 931, in join
> jdf = self._jdf.join(other._jdf, on, how)
> File 
> "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py",
>  line 1160, in __call__
> File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 69, 
> in deco
> raise AnalysisException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.AnalysisException: u'Resolved attribute(s) 
> id_person#103,start_date#108,id_product#104,end_date#109,price#107,country#105,city#106
>  missing from 
> price#4,id_product#1,start_date#5,end_date#6,id_person#0,city#3,country#2 in 
> operator !Project [cast(id_person#103 as int) AS id_person#76, 
> cast(id_product#104 as int) AS id_product#77, cast(country#105 as string) AS 
> country#78, cast(city#106 as string) AS city#79, cast(price#107 as 
> decimal(38,8)) AS price#80, 

[jira] [Updated] (SPARK-28580) ANSI SQL: unique predicate

2019-07-31 Thread jiaan.geng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-28580:
---
Description: 
Format
{code:java}
<unique predicate> ::=
  UNIQUE <table subquery>{code}

  was:
Format
<unique predicate> ::=
UNIQUE <table subquery>


> ANSI SQL: unique predicate
> --
>
> Key: SPARK-28580
> URL: https://issues.apache.org/jira/browse/SPARK-28580
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> Format
> {code:java}
> <unique predicate> ::=
>  UNIQUE <table subquery>{code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28580) ANSI SQL: unique predicate

2019-07-31 Thread jiaan.geng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-28580:
---
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-27764

> ANSI SQL: unique predicate
> --
>
> Key: SPARK-28580
> URL: https://issues.apache.org/jira/browse/SPARK-28580
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> Format
> <unique predicate> ::=
> UNIQUE <table subquery>



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28580) ANSI SQL: unique predicate

2019-07-31 Thread jiaan.geng (JIRA)
jiaan.geng created SPARK-28580:
--

 Summary: ANSI SQL: unique predicate
 Key: SPARK-28580
 URL: https://issues.apache.org/jira/browse/SPARK-28580
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: jiaan.geng


Format
<unique predicate> ::=
UNIQUE <table subquery>



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28580) ANSI SQL: unique predicate

2019-07-31 Thread jiaan.geng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896931#comment-16896931
 ] 

jiaan.geng commented on SPARK-28580:


I'm working on it.

> ANSI SQL: unique predicate
> --
>
> Key: SPARK-28580
> URL: https://issues.apache.org/jira/browse/SPARK-28580
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> Format
> <unique predicate> ::=
> UNIQUE <table subquery>



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28579) MaxAbsScaler avoids conversion to breeze.vector

2019-07-31 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-28579:


 Summary: MaxAbsScaler avoids conversion to breeze.vector
 Key: SPARK-28579
 URL: https://issues.apache.org/jira/browse/SPARK-28579
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.0.0
Reporter: zhengruifeng


In the current implementation, MaxAbsScaler converts each vector to a breeze 
vector during transformation.

This conversion should be skipped.
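
A rough sketch of the idea, assuming the per-feature maximum absolute values have already been collected (illustrative code, not the actual MaxAbsScaler patch):
{code:scala}
import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector, Vectors}

// Scale each value by the per-feature max |x| directly on ml vectors,
// without converting to breeze vectors first.
def scaleByMaxAbs(v: Vector, maxAbs: Array[Double]): Vector = v match {
  case dv: DenseVector =>
    val out = dv.values.clone()
    var i = 0
    while (i < out.length) {
      out(i) = if (maxAbs(i) != 0.0) out(i) / maxAbs(i) else 0.0
      i += 1
    }
    Vectors.dense(out)
  case sv: SparseVector =>
    val out = sv.values.clone()
    var k = 0
    while (k < out.length) {
      val j = sv.indices(k)
      out(k) = if (maxAbs(j) != 0.0) out(k) / maxAbs(j) else 0.0
      k += 1
    }
    Vectors.sparse(sv.size, sv.indices, out)
}
{code}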



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27924) ANSI SQL: Boolean Test

2019-07-31 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-27924:
-

Assignee: jiaan.geng

> ANSI SQL: Boolean Test
> --
>
> Key: SPARK-27924
> URL: https://issues.apache.org/jira/browse/SPARK-27924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: jiaan.geng
>Priority: Major
>
> {quote}<boolean test> ::=
>   <boolean primary> [ IS [ NOT ] <truth value> ]
> <truth value> ::=
>     TRUE
>   | FALSE
>   | UNKNOWN{quote}
>  
> Currently, the following DBMSs support the syntax:
>  * PostgreSQL: [https://www.postgresql.org/docs/9.1/functions-comparison.html]
>  * Hive: https://issues.apache.org/jira/browse/HIVE-13583
>  * Redshift: 
> [https://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html]
>  * Vertica: 
> [https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/LanguageElements/Predicates/Boolean-predicate.htm]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27924) Support ANSI SQL Boolean-Predicate syntax

2019-07-31 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27924:
--
Summary: Support ANSI SQL Boolean-Predicate syntax  (was: ANSI SQL: Boolean 
Test)

> Support ANSI SQL Boolean-Predicate syntax
> -
>
> Key: SPARK-27924
> URL: https://issues.apache.org/jira/browse/SPARK-27924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.0.0
>
>
> {quote}<boolean test> ::=
>   <boolean primary> [ IS [ NOT ] <truth value> ]
> <truth value> ::=
>     TRUE
>   | FALSE
>   | UNKNOWN{quote}
>  
> Currently, the following DBMSs support the syntax:
>  * PostgreSQL: [https://www.postgresql.org/docs/9.1/functions-comparison.html]
>  * Hive: https://issues.apache.org/jira/browse/HIVE-13583
>  * Redshift: 
> [https://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html]
>  * Vertica: 
> [https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/LanguageElements/Predicates/Boolean-predicate.htm]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27924) ANSI SQL: Boolean Test

2019-07-31 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27924.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25074
[https://github.com/apache/spark/pull/25074]

> ANSI SQL: Boolean Test
> --
>
> Key: SPARK-27924
> URL: https://issues.apache.org/jira/browse/SPARK-27924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.0.0
>
>
> {quote}<boolean test> ::=
>   <boolean primary> [ IS [ NOT ] <truth value> ]
> <truth value> ::=
>     TRUE
>   | FALSE
>   | UNKNOWN{quote}
>  
> Currently, the following DBMSs support the syntax:
>  * PostgreSQL: [https://www.postgresql.org/docs/9.1/functions-comparison.html]
>  * Hive: https://issues.apache.org/jira/browse/HIVE-13583
>  * Redshift: 
> [https://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html]
>  * Vertica: 
> [https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/LanguageElements/Predicates/Boolean-predicate.htm]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28578) Improve Github pull request template

2019-07-31 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-28578:


 Summary: Improve Github pull request template
 Key: SPARK-28578
 URL: https://issues.apache.org/jira/browse/SPARK-28578
 Project: Spark
  Issue Type: Test
  Components: Project Infra
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


See 
http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-New-sections-in-Github-Pull-Request-description-template-td27527.html

We should improve our PR template for better review iterations



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28577) Ensure executorMemoryOverhead requested value not less than MEMORY_OFFHEAP_SIZE when MEMORY_OFFHEAP_ENABLED is true

2019-07-31 Thread Yang Jie (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-28577:
-
Description: If MEMORY_OFFHEAP_ENABLED is true, we should ensure that 
executorOverheadMemory is not less than MEMORY_OFFHEAP_SIZE; otherwise the 
memory resource requested for the executor may not be enough.  (was: If 
MEMORY_OFFHEAP_ENABLED is true, we should ensure executorOverheadMemory not 
less than MEMORY_OFFHEAP_SIZE, otherwise the memory resource requested for 
executor is not enough.)
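
A hedged sketch of the proposed check (the function is illustrative, not the actual YARN allocation code; the settings involved correspond to spark.memory.offHeap.enabled, spark.memory.offHeap.size and spark.executor.memoryOverhead):
{code:scala}
// Illustrative only: when off-heap memory is enabled, the executor memory
// overhead requested from the cluster manager should at least cover the
// configured off-heap size, otherwise the container may be under-sized.
def effectiveOverheadMiB(offHeapEnabled: Boolean,
                         offHeapSizeMiB: Long,
                         configuredOverheadMiB: Long): Long = {
  if (offHeapEnabled) math.max(configuredOverheadMiB, offHeapSizeMiB)
  else configuredOverheadMiB
}
{code}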

> Ensure executorMemoryOverhead requested value not less than MEMORY_OFFHEAP_SIZE 
> when MEMORY_OFFHEAP_ENABLED is true
> ---
>
> Key: SPARK-28577
> URL: https://issues.apache.org/jira/browse/SPARK-28577
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.2.3, 2.3.3, 2.4.3
>Reporter: Yang Jie
>Priority: Major
>
> If MEMORY_OFFHEAP_ENABLED is true, we should ensure that executorOverheadMemory 
> is not less than MEMORY_OFFHEAP_SIZE; otherwise the memory resource requested 
> for the executor may not be enough.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28577) Ensure executorMemoryOverhead requested value not less than MEMORY_OFFHEAP_SIZE when MEMORY_OFFHEAP_ENABLED is true

2019-07-31 Thread Yang Jie (JIRA)
Yang Jie created SPARK-28577:


 Summary: Ensure executorMemoryOverhead requested value not less than 
MEMORY_OFFHEAP_SIZE when MEMORY_OFFHEAP_ENABLED is true
 Key: SPARK-28577
 URL: https://issues.apache.org/jira/browse/SPARK-28577
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 2.4.3, 2.3.3, 2.2.3
Reporter: Yang Jie


If MEMORY_OFFHEAP_ENABLED is true, we should ensure that executorOverheadMemory 
is not less than MEMORY_OFFHEAP_SIZE; otherwise the memory resource requested 
for the executor is not enough.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28529) Fix PullupCorrelatedPredicates optimizer rule to enforce idempotence

2019-07-31 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-28529.
-
Resolution: Duplicate

> Fix PullupCorrelatedPredicates optimizer rule to enforce idempotence
> 
>
> Key: SPARK-28529
> URL: https://issues.apache.org/jira/browse/SPARK-28529
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yesheng Ma
>Assignee: Dilip Biswal
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28529) Fix PullupCorrelatedPredicates optimizer rule to enforce idempotence

2019-07-31 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-28529:
---

Assignee: Dilip Biswal

> Fix PullupCorrelatedPredicates optimizer rule to enforce idempotence
> 
>
> Key: SPARK-28529
> URL: https://issues.apache.org/jira/browse/SPARK-28529
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yesheng Ma
>Assignee: Dilip Biswal
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org