[jira] [Resolved] (SPARK-29761) do not output leading 'interval' in CalendarInterval.toString
[ https://issues.apache.org/jira/browse/SPARK-29761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29761. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26401 [https://github.com/apache/spark/pull/26401] > do not output leading 'interval' in CalendarInterval.toString > - > > Key: SPARK-29761 > URL: https://issues.apache.org/jira/browse/SPARK-29761 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29625) Spark Structured Streaming Kafka Wrong Reset Offset twice
[ https://issues.apache.org/jira/browse/SPARK-29625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968987#comment-16968987 ] Hyukjin Kwon commented on SPARK-29625: -- [~sanysand...@gmail.com] can you at least make a minimized reproducer? > Spark Structure Streaming Kafka Wrong Reset Offset twice > > > Key: SPARK-29625 > URL: https://issues.apache.org/jira/browse/SPARK-29625 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Sandish Kumar HN >Priority: Major > > Spark Structure Streaming Kafka Reset Offset twice, once with right offsets > and second time with very old offsets > {code} > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-151 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-118 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-85 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-52 to offset 122677634. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-19 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-52 to offset 120504922.* > [2019-10-28 19:27:40,153] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO ContextCleaner: Cleaned accumulator 810 > {code} > which is causing a Data loss issue. > {code} > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > ERROR StreamExecution: Query [id = d62ca9e4-6650-454f-8691-a3d576d1e4ba, > runId = 3946389f-222b-495c-9ab2-832c0422cbbb] terminated with error > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - > java.lang.IllegalStateException: Partition topic-52's offset was changed from > 122677598 to 120504922, some data may have been missed. > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - Some data may have > been lost because they are not available in Kafka any more; either the > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - data was aged out > by Kafka or the topic may have been deleted before all the data in the > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - topic was > processed. If you don't want your streaming query to fail on such cases, set > the > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - source option > "failOnDataLoss" to "false". 
> [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > org.apache.spark.sql.kafka010.KafkaSource.org$apache$spark$sql$kafka010$KafkaSource$$reportDataLoss(KafkaSource.scala:329) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > org.apache.spark.sql.kafka010.KafkaSource$$anonfun$8.apply(KafkaSource.scala:283) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > org.apache.spark.sql.kafka010.KafkaSource$$anonfun$8.apply(KafkaSource.scala:281) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.TraversableLike$class.filter(TraversableLike.scala:259) > [2019-10-28 19:27:40,351]
[jira] [Resolved] (SPARK-27763) Port test cases from PostgreSQL to Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-27763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27763. -- Resolution: Done I am leaving this resolved for now. Thanks [~maropu] for finalizing this > Port test cases from PostgreSQL to Spark SQL > > > Key: SPARK-27763 > URL: https://issues.apache.org/jira/browse/SPARK-27763 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Yuming Wang >Priority: Major > > To improve the test coverage, we can port the regression tests from the other > popular open source projects to Spark SQL. PostgreSQL is one of the best SQL > systems. Below are the links to the test cases and results. > * Regression test cases: > [https://github.com/postgres/postgres/tree/master/src/test/regress/sql] > * Expected results: > [https://github.com/postgres/postgres/tree/master/src/test/regress/expected] > Spark SQL does not support all the feature sets of PostgreSQL. In the current > stage, we should first comment out these test cases and create the > corresponding JIRAs in SPARK-27764. We can discuss and prioritize which > features we should support. Also, these PostgreSQL regression tests could > also expose the existing bugs of Spark SQL. We should also create the JIRAs > and track them in SPARK-27764. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29786) Fix MetaException when dropping a partition that does not exist on HDFS.
Deegue created SPARK-29786: -- Summary: Fix MetaException when dropping a partition that does not exist on HDFS. Key: SPARK-29786 URL: https://issues.apache.org/jira/browse/SPARK-29786 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Deegue When we drop a partition that doesn't exist on HDFS, we receive a `MetaException`, even though the partition has actually been dropped. In Hive, no exception is thrown in this case. For example, if we execute alter table test.tmp drop partition(stat_day=20190516); (the partition stat_day=20190516 exists in the Hive metastore but doesn't exist on HDFS), we get: {code:java} Error: Error running query: MetaException(message:File does not exist: /user/hive/warehouse/test.db/tmp/stat_day=20190516 at org.apache.hadoop.hdfs.server.namenode.FSDirectory.getContentSummary(FSDirectory.java:2414) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getContentSummary(FSNamesystem.java:4719) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getContentSummary(NameNodeRpcServer.java:1237) at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getContentSummary(AuthorizationProviderProxyClientProtocol.java:568) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getContentSummary(ClientNamenodeProtocolServerSideTranslatorPB.java:896) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2278) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2274) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2274) ) (state=,code=0) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
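A minimal reproduction sketch of the scenario above, assuming a Hive-enabled SparkSession named `spark` and the default warehouse location; the database, table, partition and path names are only the hypothetical ones taken from the example in the description:

{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}

// Create a partitioned table and register one partition in the metastore.
spark.sql("CREATE TABLE test.tmp (id INT) PARTITIONED BY (stat_day STRING)")
spark.sql("ALTER TABLE test.tmp ADD PARTITION (stat_day = '20190516')")

// Delete the partition directory directly on HDFS so the metastore and HDFS disagree.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.delete(new Path("/user/hive/warehouse/test.db/tmp/stat_day=20190516"), true)

// With the reported behavior this raises MetaException, even though the partition
// is actually removed from the metastore; Hive itself does not throw here.
spark.sql("ALTER TABLE test.tmp DROP PARTITION (stat_day = '20190516')")
{code}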
[jira] [Resolved] (SPARK-29709) Structured Streaming: the offset in the checkpoint is suddenly reset to the earliest
[ https://issues.apache.org/jira/browse/SPARK-29709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29709. -- Resolution: Incomplete > Structured Streaming: the offset in the checkpoint is suddenly reset to the > earliest > --- > > Key: SPARK-29709 > URL: https://issues.apache.org/jira/browse/SPARK-29709 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: robert >Priority: Major > > In Structured Streaming, the offset in the checkpoint is suddenly reset to the > earliest for one of the partitions in the Kafka topic. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29785) Optimize opening a new session of Spark Thrift Server
Deegue created SPARK-29785: -- Summary: Optimize opening a new session of Spark Thrift Server Key: SPARK-29785 URL: https://issues.apache.org/jira/browse/SPARK-29785 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Deegue When we open a new session of Spark Thrift Server, `use default` is called and a free executor is needed to execute the SQL. This behavior adds ~5 seconds to opening a new session which should only cost ~100ms. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29784) Built-in function trim is not compatible in 3.0 with previous versions
[ https://issues.apache.org/jira/browse/SPARK-29784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968949#comment-16968949 ] pavithra ramachandran commented on SPARK-29784: --- i shall work on this > Built in function trim is not compatible in 3.0 with previous version > - > > Key: SPARK-29784 > URL: https://issues.apache.org/jira/browse/SPARK-29784 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > SELECT trim('SL', 'SSparkSQLS'); returns empty in Spark 3.0 where as in 2.4 > and 2.3.2 is returning after leading and trailing character removed. > Spark 3.0 – Not correct > jdbc:hive2://10.18.19.208:23040/default> SELECT trim('SL', 'SSparkSQLS'); > +---+ > | trim(SL, SSparkSQLS) | > +---+ > | | > +--- > Spark 2.4 – Correct > jdbc:hive2://10.18.18.214:23040/default> SELECT trim('SL', 'SSparkSQLS'); > +---+--+ > | trim(SSparkSQLS, SL) | > +---+--+ > | parkSQ | > +---+--+ > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29784) Built-in function trim is not compatible in 3.0 with previous versions
ABHISHEK KUMAR GUPTA created SPARK-29784: Summary: Built-in function trim is not compatible in 3.0 with previous versions Key: SPARK-29784 URL: https://issues.apache.org/jira/browse/SPARK-29784 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: ABHISHEK KUMAR GUPTA SELECT trim('SL', 'SSparkSQLS'); returns an empty string in Spark 3.0, whereas 2.4 and 2.3.2 return the string with the leading and trailing characters removed. Spark 3.0 – Not correct jdbc:hive2://10.18.19.208:23040/default> SELECT trim('SL', 'SSparkSQLS'); +---+ | trim(SL, SSparkSQLS) | +---+ | | +--- Spark 2.4 – Correct jdbc:hive2://10.18.18.214:23040/default> SELECT trim('SL', 'SSparkSQLS'); +---+--+ | trim(SSparkSQLS, SL) | +---+--+ | parkSQ | +---+--+ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
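For what it's worth, a small sketch (assuming a SparkSession named `spark`) of the SQL-standard TRIM form, which names the trim characters and the source string explicitly and therefore does not depend on the positional argument order that differs between these releases:

{code:scala}
// Explicit SQL-standard syntax: strip the characters 'S' and 'L' from both ends
// of 'SSparkSQLS'. Under the 2.4 semantics shown above this yields "parkSQ".
spark.sql("SELECT trim(BOTH 'SL' FROM 'SSparkSQLS') AS trimmed").show()
{code}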
[jira] [Comment Edited] (SPARK-29771) Limit executor max failures before failing the application
[ https://issues.apache.org/jira/browse/SPARK-29771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968933#comment-16968933 ] Jackey Lee edited comment on SPARK-29771 at 11/7/19 5:12 AM: - This patch mainly targets the scenario where the executor fails to start. Executor runtime failures, which are caused by task errors, are controlled by spark.executor.maxFailures. As another example, adding `--conf spark.executor.extraJavaOptions=-Xmse` to spark-submit also makes the executor retry endlessly. was (Author: jackey lee): This patch is mainly used in the scenario where the executor started failed. The executor runtime failurem, which is caused by task errors is controlled by spark.executor.maxFailures. Another Example, add `--conf spark.executor.extraJavaOptions=-Xmse` after spark-submit, which can also appear executor crazy retry. > Limit executor max failures before failing the application > -- > > Key: SPARK-29771 > URL: https://issues.apache.org/jira/browse/SPARK-29771 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Jackey Lee >Priority: Major > > At present, K8S scheduling does not limit the number of failures of the > executors, which may cause executor retried continuously without failing. > A simple example, we add a resource limit on default namespace. After the > driver is started, if the quota is full, the executor will retry the creation > continuously, resulting in a large amount of pod information accumulation. > When many applications encounter such situations, they will affect the K8S > cluster. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29771) Limit executor max failures before failing the application
[ https://issues.apache.org/jira/browse/SPARK-29771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968933#comment-16968933 ] Jackey Lee commented on SPARK-29771: This patch is mainly used in the scenario where the executor started failed. The executor runtime failurem, which is caused by task errors is controlled by spark.executor.maxFailures. Another Example, add `--conf spark.executor.extraJavaOptions=-Xmse` after spark-submit, which can also appear executor crazy retry. > Limit executor max failures before failing the application > -- > > Key: SPARK-29771 > URL: https://issues.apache.org/jira/browse/SPARK-29771 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Jackey Lee >Priority: Major > > At present, K8S scheduling does not limit the number of failures of the > executors, which may cause executor retried continuously without failing. > A simple example, we add a resource limit on default namespace. After the > driver is started, if the quota is full, the executor will retry the creation > continuously, resulting in a large amount of pod information accumulation. > When many applications encounter such situations, they will affect the K8S > cluster. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27986) Support Aggregate Expressions with filter
[ https://issues.apache.org/jira/browse/SPARK-27986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-27986: --- Description: An aggregate expression represents the application of an aggregate function across the rows selected by a query. An aggregate function reduces multiple inputs to a single output value, such as the sum or average of the inputs. The syntax of an aggregate expression is one of the following: {noformat} aggregate_name (expression [ , ... ] [ order_by_clause ] ) [ FILTER ( WHERE filter_clause ) ] aggregate_name (ALL expression [ , ... ] [ order_by_clause ] ) [ FILTER ( WHERE filter_clause ) ] aggregate_name (DISTINCT expression [ , ... ] [ order_by_clause ] ) [ FILTER ( WHERE filter_clause ) ] aggregate_name ( * ) [ FILTER ( WHERE filter_clause ) ] aggregate_name ( [ expression [ , ... ] ] ) WITHIN GROUP ( order_by_clause ) [ FILTER ( WHERE filter_clause ) ]{noformat} [https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-AGGREGATES] I find this feature is also an ANSI SQL standard. {code:java} ::= COUNT[ ] | [ ] | [ ] | [ ] | [ ] | [ ]{code} was: An aggregate expression represents the application of an aggregate function across the rows selected by a query. An aggregate function reduces multiple inputs to a single output value, such as the sum or average of the inputs. The syntax of an aggregate expression is one of the following: {noformat} aggregate_name (expression [ , ... ] [ order_by_clause ] ) [ FILTER ( WHERE filter_clause ) ] aggregate_name (ALL expression [ , ... ] [ order_by_clause ] ) [ FILTER ( WHERE filter_clause ) ] aggregate_name (DISTINCT expression [ , ... ] [ order_by_clause ] ) [ FILTER ( WHERE filter_clause ) ] aggregate_name ( * ) [ FILTER ( WHERE filter_clause ) ] aggregate_name ( [ expression [ , ... ] ] ) WITHIN GROUP ( order_by_clause ) [ FILTER ( WHERE filter_clause ) ]{noformat} [https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-AGGREGATES] > Support Aggregate Expressions with filter > - > > Key: SPARK-27986 > URL: https://issues.apache.org/jira/browse/SPARK-27986 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > An aggregate expression represents the application of an aggregate function > across the rows selected by a query. An aggregate function reduces multiple > inputs to a single output value, such as the sum or average of the inputs. > The syntax of an aggregate expression is one of the following: > {noformat} > aggregate_name (expression [ , ... ] [ order_by_clause ] ) [ FILTER ( WHERE > filter_clause ) ] > aggregate_name (ALL expression [ , ... ] [ order_by_clause ] ) [ FILTER ( > WHERE filter_clause ) ] > aggregate_name (DISTINCT expression [ , ... ] [ order_by_clause ] ) [ FILTER > ( WHERE filter_clause ) ] > aggregate_name ( * ) [ FILTER ( WHERE filter_clause ) ] > aggregate_name ( [ expression [ , ... ] ] ) WITHIN GROUP ( order_by_clause ) > [ FILTER ( WHERE filter_clause ) ]{noformat} > [https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-AGGREGATES] > > I find this feature is also an ANSI SQL standard. > {code:java} > ::= > COUNT[ ] > | [ ] > | [ ] > | [ ] > | [ ] > | [ ]{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
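An illustrative sketch of the syntax this ticket asks for, assuming a SparkSession named `spark` and a hypothetical `people` view; whether the query parses at all depends on the FILTER clause support actually landing:

{code:scala}
// Hypothetical data, only for illustration.
spark.range(10).selectExpr("id", "id % 2 AS flag").createOrReplaceTempView("people")

// An aggregate with a FILTER clause: count only rows where flag = 1,
// alongside the unfiltered count, in a single aggregation.
spark.sql(
  """SELECT count(*) AS total,
    |       count(*) FILTER (WHERE flag = 1) AS flagged
    |  FROM people""".stripMargin
).show()
{code}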
[jira] [Assigned] (SPARK-29605) Optimize string to interval casting
[ https://issues.apache.org/jira/browse/SPARK-29605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29605: --- Assignee: Maxim Gekk > Optimize string to interval casting > --- > > Key: SPARK-29605 > URL: https://issues.apache.org/jira/browse/SPARK-29605 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > > Implement new function stringToInterval in IntervalUtils to cast a value of > UTF8String to an instance of CalendarInterval that should be faster than > existing implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29605) Optimize string to interval casting
[ https://issues.apache.org/jira/browse/SPARK-29605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29605. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26256 [https://github.com/apache/spark/pull/26256] > Optimize string to interval casting > --- > > Key: SPARK-29605 > URL: https://issues.apache.org/jira/browse/SPARK-29605 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > Implement new function stringToInterval in IntervalUtils to cast a value of > UTF8String to an instance of CalendarInterval that should be faster than > existing implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
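For context, a small sketch (assuming a SparkSession named `spark`) of the cast this ticket optimizes; the exact string formats accepted by the cast vary across versions, so the literal below is only an assumption:

{code:scala}
// Cast a string to a CalendarInterval value; this is the code path that the
// new IntervalUtils.stringToInterval is meant to speed up.
spark.sql("SELECT CAST('1 year 2 months 3 days' AS INTERVAL) AS i").show(truncate = false)
{code}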
[jira] [Commented] (SPARK-29600) array_contains built in function is not backward compatible in 3.0
[ https://issues.apache.org/jira/browse/SPARK-29600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968909#comment-16968909 ] ABHISHEK KUMAR GUPTA commented on SPARK-29600: -- [~Udbhav Agrawal]: I went through the migration guide, and it mainly talks about "In Spark 2.4, AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner.", which is not the case for the above query. > array_contains built in function is not backward compatible in 3.0 > -- > > Key: SPARK-29600 > URL: https://issues.apache.org/jira/browse/SPARK-29600 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > SELECT array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2); throws > Exception in 3.0 where as in 2.3.2 is working fine. > Spark 3.0 output: > 0: jdbc:hive2://10.18.19.208:23040/default> SELECT > array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2); > Error: org.apache.spark.sql.AnalysisException: cannot resolve > 'array_contains(array(CAST(0 AS DECIMAL(13,3)), CAST(0.1BD AS DECIMAL(13,3)), > CAST(0.2BD AS DECIMAL(13,3)), CAST(0.3BD AS DECIMAL(13,3)), CAST(0.5BD AS > DECIMAL(13,3)), CAST(0.02BD AS DECIMAL(13,3)), CAST(0.033BD AS > DECIMAL(13,3))), 0.2BD)' due to data type mismatch: Input to function > array_contains should have been array followed by a value with same element > type, but it's [array, decimal(1,1)].; line 1 pos 7; > 'Project [unresolvedalias(array_contains(array(cast(0 as decimal(13,3)), > cast(0.1 as decimal(13,3)), cast(0.2 as decimal(13,3)), cast(0.3 as > decimal(13,3)), cast(0.5 as decimal(13,3)), cast(0.02 as decimal(13,3)), > cast(0.033 as decimal(13,3))), 0.2), None)] > Spark 2.3.2 output > 0: jdbc:hive2://10.18.18.214:23040/default> SELECT > array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2); > |array_contains(array(CAST(0 AS DECIMAL(13,3)), CAST(0.1 AS DECIMAL(13,3)), > CAST(0.2 AS DECIMAL(13,3)), CAST(0.3 AS DECIMAL(13,3)), CAST(0.5 AS > DECIMAL(13,3)), CAST(0.02 AS DECIMAL(13,3)), CAST(0.033 AS DECIMAL(13,3))), > CAST(0.2 AS DECIMAL(13,3)))| > |true| > 1 row selected (0.18 seconds) > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
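A hedged workaround sketch (assuming a SparkSession named `spark`): casting the searched-for value to the array's element type makes both sides of array_contains agree, which avoids relying on the implicit coercion whose behavior differs between 2.x and 3.0 as described above:

{code:scala}
// Cast the search literal to the same decimal type that the array elements
// are coerced to, so array_contains sees matching element types.
spark.sql(
  "SELECT array_contains(array(0, 0.1, 0.2, 0.3, 0.5, 0.02, 0.033), CAST(0.2 AS DECIMAL(13,3)))"
).show()
{code}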
[jira] [Comment Edited] (SPARK-29625) Spark Structured Streaming Kafka Wrong Reset Offset twice
[ https://issues.apache.org/jira/browse/SPARK-29625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968907#comment-16968907 ] Sandish Kumar HN edited comment on SPARK-29625 at 11/7/19 4:04 AM: --- [~hyukjin.kwon] It happened again, As I can't share the code here, I will just post the flow of the code, Look at partition-32 at [2019-11-06 04:33:40,013] And at [2019-11-06 04:33:40,015] Why is Spark is trying to reset twice within the same batch? one with right offset and another with the wrong offset. df-reader=spark.read df.transfer=spark.map (spark transforms functions) df-stream-writer=spark.readStream df-stream-writer=spark.writeStream SourceCode Flow -- analytics-session: type: spark name: etl-flow-correlate options: ( "spark.sql.shuffle.partitions" : "2" ; "spark.sql.windowExec.buffer.spill.threshold" : "98304" ) df-reader: name: ipmac format: jdbc options: ( "url" : "postgres.jdbc.url" ; "driver" : "org.postgresql.Driver" ; "dbtable" : "tablename" ) df-transform: name: ColumnRename transform: alias input: ipmac options: ( "id" : "key" ; "ip" : "src_ip" ) df-stream-reader: name: flow format: kafka options: ( "kafka.bootstrap.servers": "kafka.server" ; "subscribe": "topic name" ; "maxOffsetsPerTrigger": "40" ; "startingOffsets": "latest" ; "failOnDataLoss": "true" ) df-transform: name: decodedFlow transform: EventDecode input: flow options: ( "schema" : "flow" ) df-transform: name: flowWithDate transform: convertTimestampCols input: decodedFlow options: ( "timestampcols" : "start_time,end_time" ; "resolution" : "microsec" ) df-transform: name: correlatedFlow transform: correlateOverWindow input: flowWithDate options: ( "rightDF" : "ipmacColumnRename" ; "timestampcol" : "start_time" ; "timewindow" : "1 minute" ; "join-col" : "key,src_ip" ; "expires-min" : "expires_in_min" ) df-transform: name: nonNullFlow transform: filter input: correlatedFlow options: ( "notnull" : "mac" ) df-transform: name: flowColumnRename transform: alias input: nonNullFlow options: ( "key" : "tid" ) df-transform: name: flowEncoded transform: EventEncode input: flowColumnRename options: ( "key" : "id" ) df-stream-writer: name: flowEncoded format: kafka options: ( "kafka.bootstrap.servers" : "kafka.server" ; "checkpointLocation" : "spark.checkpoint.flow-correlation" ; "trigger-processing" : "10 seconds" ; "query-duration" : "60 minutes" ; "topic" : "topicname" ) Logs- --- 2019-11-06 04:33:40,013] \{bash_operator.py:128} INFO - 19/11/06 04:33:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-57ea8170-0f95-4724-8168-1f9347b74661--1740056283-driver-0] Resetting offset for partition collector.endpoint.flow-185 to offset 0. [2019-11-06 04:33:40,013] \{bash_operator.py:128} INFO - 19/11/06 04:33:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-57ea8170-0f95-4724-8168-1f9347b74661--1740056283-driver-0] Resetting offset for partition collector.endpoint.flow-32 to offset 16395511. [2019-11-06 04:33:40,013] \{bash_operator.py:128} INFO - 19/11/06 04:33:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-57ea8170-0f95-4724-8168-1f9347b74661--1740056283-driver-0] Resetting offset for partition collector.endpoint.flow-65 to offset 0. [2019-11-06 04:33:40,013] \{bash_operator.py:128} INFO - 19/11/06 04:33:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-57ea8170-0f95-4724-8168-1f9347b74661--1740056283-driver-0] Resetting offset for partition collector.endpoint.flow-98 to offset 0. 
[2019-11-06 04:33:40,013] \{bash_operator.py:128} INFO - 19/11/06 04:33:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-57ea8170-0f95-4724-8168-1f9347b74661--1740056283-driver-0] Resetting offset for partition collector.endpoint.flow-131 to offset 0. [2019-11-06 04:33:40,013] \{bash_operator.py:128} INFO - 19/11/06 04:33:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-57ea8170-0f95-4724-8168-1f9347b74661--1740056283-driver-0] Resetting offset for partition collector.endpoint.flow-164 to offset 1684812. [2019-11-06 04:33:40,013] \{bash_operator.py:128} INFO - 19/11/06 04:33:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-57ea8170-0f95-4724-8168-1f9347b74661--1740056283-driver-0] Resetting offset for partition collector.endpoint.flow-197 to offset 0. [2019-11-06 04:33:40,013] \{bash_operator.py:128} INFO - 19/11/06 04:33:40 INFO Fetcher: [Consumer clientId=consumer-1,
[jira] [Reopened] (SPARK-29625) Spark Structured Streaming Kafka Wrong Reset Offset twice
[ https://issues.apache.org/jira/browse/SPARK-29625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandish Kumar HN reopened SPARK-29625: -- It happened again, As I can't share the code here, I will just post the flow of the code, Look at partition-32 at [2019-11-06 04:33:40,013] And at [2019-11-06 04:33:40,015] Why is Spark is trying to reset twice within the same batch? one with right offset and another with the wrong offset. df-reader=spark.read df.transfer=spark.map (spark transforms functions) df-stream-writer=spark.readStream df-stream-writer=spark.writeStream SourceCode Flow -- analytics-session: type: spark name: etl-flow-correlate options: ( "spark.sql.shuffle.partitions" : "2" ; "spark.sql.windowExec.buffer.spill.threshold" : "98304" ) df-reader: name: ipmac format: jdbc options: ( "url" : "postgres.jdbc.url" ; "driver" : "org.postgresql.Driver" ; "dbtable" : "tablename" ) df-transform: name: ColumnRename transform: alias input: ipmac options: ( "id" : "key" ; "ip" : "src_ip" ) df-stream-reader: name: flow format: kafka options: ( "kafka.bootstrap.servers": "kafka.server" ; "subscribe": "topic name" ; "maxOffsetsPerTrigger": "40" ; "startingOffsets": "latest" ; "failOnDataLoss": "true" ) df-transform: name: decodedFlow transform: EventDecode input: flow options: ( "schema" : "flow" ) df-transform: name: flowWithDate transform: convertTimestampCols input: decodedFlow options: ( "timestampcols" : "start_time,end_time" ; "resolution" : "microsec" ) df-transform: name: correlatedFlow transform: correlateOverWindow input: flowWithDate options: ( "rightDF" : "ipmacColumnRename" ; "timestampcol" : "start_time" ; "timewindow" : "1 minute" ; "join-col" : "key,src_ip" ; "expires-min" : "expires_in_min" ) df-transform: name: nonNullFlow transform: filter input: correlatedFlow options: ( "notnull" : "mac" ) df-transform: name: flowColumnRename transform: alias input: nonNullFlow options: ( "key" : "tid" ) df-transform: name: flowEncoded transform: EventEncode input: flowColumnRename options: ( "key" : "id" ) df-stream-writer: name: flowEncoded format: kafka options: ( "kafka.bootstrap.servers" : "kafka.server" ; "checkpointLocation" : "spark.checkpoint.flow-correlation" ; "trigger-processing" : "10 seconds" ; "query-duration" : "60 minutes" ; "topic" : "topicname" ) Logs 2019-11-06 04:33:40,013] \{bash_operator.py:128} INFO - 19/11/06 04:33:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-57ea8170-0f95-4724-8168-1f9347b74661--1740056283-driver-0] Resetting offset for partition collector.endpoint.flow-185 to offset 0. [2019-11-06 04:33:40,013] \{bash_operator.py:128} INFO - 19/11/06 04:33:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-57ea8170-0f95-4724-8168-1f9347b74661--1740056283-driver-0] Resetting offset for partition collector.endpoint.flow-32 to offset 16395511. [2019-11-06 04:33:40,013] \{bash_operator.py:128} INFO - 19/11/06 04:33:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-57ea8170-0f95-4724-8168-1f9347b74661--1740056283-driver-0] Resetting offset for partition collector.endpoint.flow-65 to offset 0. [2019-11-06 04:33:40,013] \{bash_operator.py:128} INFO - 19/11/06 04:33:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-57ea8170-0f95-4724-8168-1f9347b74661--1740056283-driver-0] Resetting offset for partition collector.endpoint.flow-98 to offset 0. 
[2019-11-06 04:33:40,013] \{bash_operator.py:128} INFO - 19/11/06 04:33:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-57ea8170-0f95-4724-8168-1f9347b74661--1740056283-driver-0] Resetting offset for partition collector.endpoint.flow-131 to offset 0. [2019-11-06 04:33:40,013] \{bash_operator.py:128} INFO - 19/11/06 04:33:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-57ea8170-0f95-4724-8168-1f9347b74661--1740056283-driver-0] Resetting offset for partition collector.endpoint.flow-164 to offset 1684812. [2019-11-06 04:33:40,013] \{bash_operator.py:128} INFO - 19/11/06 04:33:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-57ea8170-0f95-4724-8168-1f9347b74661--1740056283-driver-0] Resetting offset for partition collector.endpoint.flow-197 to offset 0. [2019-11-06 04:33:40,013] \{bash_operator.py:128} INFO - 19/11/06 04:33:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-57ea8170-0f95-4724-8168-1f9347b74661--1740056283-driver-0] Resetting offset for partition collector.endpoint.flow-11 to offset 10372315. [2019-11-06
[jira] [Comment Edited] (SPARK-29767) Core dump happening on executors while doing simple union of Data Frames
[ https://issues.apache.org/jira/browse/SPARK-29767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968895#comment-16968895 ] Udit Mehrotra edited comment on SPARK-29767 at 11/7/19 3:41 AM: [~hyukjin.kwon] was finally able to get the core dump of crashing executors. Attached *hs_err_pid13885.log* the error report written along with core dump. In that I notice the following trace: {noformat} RAX= [error occurred during error reporting (printing register info), id 0xb]Stack: [0x7fbe8850f000,0x7fbe8861], sp=0x7fbe8860dad0, free space=1018k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0xa9ae92] J 4331 sun.misc.Unsafe.getLong(Ljava/lang/Object;J)J (0 bytes) @ 0x7fbea94ffabe [0x7fbea94ffa00+0xbe] j org.apache.spark.unsafe.Platform.getLong(Ljava/lang/Object;J)J+5 j org.apache.spark.unsafe.bitset.BitSetMethods.isSet(Ljava/lang/Object;JI)Z+66 j org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(I)Z+14 j org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.fieldToString_0_2$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/expressions/codegen/UTF8StringBuilder;)V+160 j org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_1$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V+76 j org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;+25 j org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Ljava/lang/Object;)Ljava/lang/Object;+5 j scala.collection.Iterator$$anon$11.next()Ljava/lang/Object;+13 j scala.collection.Iterator$$anon$10.next()Ljava/lang/Object;+22 j org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(Lscala/collection/Iterator;)Lscala/collection/Iterator;+78 j org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(Ljava/lang/Object;)Ljava/lang/Object;+5 j org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(Lorg/apache/spark/TaskContext;ILscala/collection/Iterator;)Lscala/collection/Iterator;+8 j org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+13 j org.apache.spark.rdd.MapPartitionsRDD.compute(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+27 j org.apache.spark.rdd.RDD.computeOrReadCheckpoint(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+26 j org.apache.spark.rdd.RDD.iterator(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+33 j org.apache.spark.rdd.MapPartitionsRDD.compute(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+24 j org.apache.spark.rdd.RDD.computeOrReadCheckpoint(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+26 j org.apache.spark.rdd.RDD.iterator(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+33 j org.apache.spark.scheduler.ResultTask.runTask(Lorg/apache/spark/TaskContext;)Ljava/lang/Object;+187 j org.apache.spark.scheduler.Task.run(JILorg/apache/spark/metrics/MetricsSystem;)Ljava/lang/Object;+210 j 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply()Ljava/lang/Object;+37 j org.apache.spark.util.Utils$.tryWithSafeFinally(Lscala/Function0;Lscala/Function0;)Ljava/lang/Object;+3 j org.apache.spark.executor.Executor$TaskRunner.run()V+383 j java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V+95 j java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5 j java.lang.Thread.run()V+11 v ~StubRoutines::call_stub V [libjvm.so+0x680c5e] V [libjvm.so+0x67e024] V [libjvm.so+0x67e639] V [libjvm.so+0x6c3d41] V [libjvm.so+0xa77c22] V [libjvm.so+0x8c3b12] C [libpthread.so.0+0x7de5] start_thread+0xc5{noformat} Also attached the core dump file *coredump.zip* was (Author: uditme): [~hyukjin.kwon] was finally able to get the core dump of crashing executors. Attached *hs_err_pid13885.log* the error report written along with core dump. In that I notice the following trace: {noformat} RAX= [error occurred during error reporting (printing register info), id 0xb]Stack: [0x7fbe8850f000,0x7fbe8861], sp=0x7fbe8860dad0, free space=1018k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0xa9ae92] J 4331 sun.misc.Unsafe.getLong(Ljava/lang/Object;J)J (0 bytes) @
[jira] [Updated] (SPARK-29767) Core dump happening on executors while doing simple union of Data Frames
[ https://issues.apache.org/jira/browse/SPARK-29767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Udit Mehrotra updated SPARK-29767: -- Attachment: coredump.zip > Core dump happening on executors while doing simple union of Data Frames > > > Key: SPARK-29767 > URL: https://issues.apache.org/jira/browse/SPARK-29767 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.4.4 > Environment: AWS EMR 5.27.0, Spark 2.4.4 >Reporter: Udit Mehrotra >Priority: Major > Attachments: coredump.zip, hs_err_pid13885.log, > part-0-0189b5c2-7f7b-4d0e-bdb8-506380253597-c000.snappy.parquet > > > Running a union operation on two DataFrames through both Scala Spark Shell > and PySpark, resulting in executor contains doing a *core dump* and existing > with Exit code 134. > The trace from the *Driver*: > {noformat} > Container exited with a non-zero exit code 134 > . > 19/11/06 02:21:35 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; > aborting job > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 > (TID 5, ip-172-30-6-79.ec2.internal, executor 11): ExecutorLostFailure > (executor 11 exited caused by one of the running tasks) Reason: Container > from a bad node: container_1572981097605_0021_01_77 on host: > ip-172-30-6-79.ec2.internal. Exit status: 134. Diagnostics: Exception from > container-launch. > Container id: container_1572981097605_0021_01_77 > Exit code: 134 > Exception message: /bin/bash: line 1: 12611 Aborted > LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native" > /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' > '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' > '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' > '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' > -Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp > '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' > -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77 > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 > --executor-id 11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id > application_1572981097605_0021 --user-class-path > file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar > > > /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stdout > 2> > /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stderrStack > trace: ExitCodeException exitCode=134: /bin/bash: line 1: 12611 Aborted > > LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native" > /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' > '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' > 
'-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' > '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' > -Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp > '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' > -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77 > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 > --executor-id 11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id > application_1572981097605_0021 --user-class-path > file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar > > > /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stdout > 2> > /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stderr > at
[jira] [Commented] (SPARK-29767) Core dump happening on executors while doing simple union of Data Frames
[ https://issues.apache.org/jira/browse/SPARK-29767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968895#comment-16968895 ] Udit Mehrotra commented on SPARK-29767: --- [~hyukjin.kwon] was finally able to get the core dump of crashing executors. Attached *hs_err_pid13885.log* the error report written along with core dump. In that I notice the following trace: {noformat} RAX= [error occurred during error reporting (printing register info), id 0xb]Stack: [0x7fbe8850f000,0x7fbe8861], sp=0x7fbe8860dad0, free space=1018k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0xa9ae92] J 4331 sun.misc.Unsafe.getLong(Ljava/lang/Object;J)J (0 bytes) @ 0x7fbea94ffabe [0x7fbea94ffa00+0xbe] j org.apache.spark.unsafe.Platform.getLong(Ljava/lang/Object;J)J+5 j org.apache.spark.unsafe.bitset.BitSetMethods.isSet(Ljava/lang/Object;JI)Z+66 j org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(I)Z+14 j org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.fieldToString_0_2$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/expressions/codegen/UTF8StringBuilder;)V+160 j org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_1$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V+76 j org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;+25 j org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Ljava/lang/Object;)Ljava/lang/Object;+5 j scala.collection.Iterator$$anon$11.next()Ljava/lang/Object;+13 j scala.collection.Iterator$$anon$10.next()Ljava/lang/Object;+22 j org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(Lscala/collection/Iterator;)Lscala/collection/Iterator;+78 j org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(Ljava/lang/Object;)Ljava/lang/Object;+5 j org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(Lorg/apache/spark/TaskContext;ILscala/collection/Iterator;)Lscala/collection/Iterator;+8 j org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+13 j org.apache.spark.rdd.MapPartitionsRDD.compute(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+27 j org.apache.spark.rdd.RDD.computeOrReadCheckpoint(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+26 j org.apache.spark.rdd.RDD.iterator(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+33 j org.apache.spark.rdd.MapPartitionsRDD.compute(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+24 j org.apache.spark.rdd.RDD.computeOrReadCheckpoint(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+26 j org.apache.spark.rdd.RDD.iterator(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+33 j org.apache.spark.scheduler.ResultTask.runTask(Lorg/apache/spark/TaskContext;)Ljava/lang/Object;+187 j org.apache.spark.scheduler.Task.run(JILorg/apache/spark/metrics/MetricsSystem;)Ljava/lang/Object;+210 j 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply()Ljava/lang/Object;+37 j org.apache.spark.util.Utils$.tryWithSafeFinally(Lscala/Function0;Lscala/Function0;)Ljava/lang/Object;+3 j org.apache.spark.executor.Executor$TaskRunner.run()V+383 j java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V+95 j java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5 j java.lang.Thread.run()V+11 v ~StubRoutines::call_stub V [libjvm.so+0x680c5e] V [libjvm.so+0x67e024] V [libjvm.so+0x67e639] V [libjvm.so+0x6c3d41] V [libjvm.so+0xa77c22] V [libjvm.so+0x8c3b12] C [libpthread.so.0+0x7de5] start_thread+0xc5{noformat} > Core dump happening on executors while doing simple union of Data Frames > > > Key: SPARK-29767 > URL: https://issues.apache.org/jira/browse/SPARK-29767 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.4.4 > Environment: AWS EMR 5.27.0, Spark 2.4.4 >Reporter: Udit Mehrotra >Priority: Major > Attachments: hs_err_pid13885.log, > part-0-0189b5c2-7f7b-4d0e-bdb8-506380253597-c000.snappy.parquet > > > Running a union operation on two DataFrames
[jira] [Updated] (SPARK-29767) Core dump happening on executors while doing simple union of Data Frames
[ https://issues.apache.org/jira/browse/SPARK-29767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Udit Mehrotra updated SPARK-29767: -- Attachment: hs_err_pid13885.log > Core dump happening on executors while doing simple union of Data Frames > > > Key: SPARK-29767 > URL: https://issues.apache.org/jira/browse/SPARK-29767 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.4.4 > Environment: AWS EMR 5.27.0, Spark 2.4.4 >Reporter: Udit Mehrotra >Priority: Major > Attachments: hs_err_pid13885.log, > part-0-0189b5c2-7f7b-4d0e-bdb8-506380253597-c000.snappy.parquet > > > Running a union operation on two DataFrames through both Scala Spark Shell > and PySpark, resulting in executor contains doing a *core dump* and existing > with Exit code 134. > The trace from the *Driver*: > {noformat} > Container exited with a non-zero exit code 134 > . > 19/11/06 02:21:35 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; > aborting job > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 > (TID 5, ip-172-30-6-79.ec2.internal, executor 11): ExecutorLostFailure > (executor 11 exited caused by one of the running tasks) Reason: Container > from a bad node: container_1572981097605_0021_01_77 on host: > ip-172-30-6-79.ec2.internal. Exit status: 134. Diagnostics: Exception from > container-launch. > Container id: container_1572981097605_0021_01_77 > Exit code: 134 > Exception message: /bin/bash: line 1: 12611 Aborted > LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native" > /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' > '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' > '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' > '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' > -Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp > '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' > -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77 > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 > --executor-id 11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id > application_1572981097605_0021 --user-class-path > file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar > > > /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stdout > 2> > /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stderrStack > trace: ExitCodeException exitCode=134: /bin/bash: line 1: 12611 Aborted > > LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native" > /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' > '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' > 
'-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' > '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' > -Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp > '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' > -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77 > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 > --executor-id 11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id > application_1572981097605_0021 --user-class-path > file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar > > > /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stdout > 2> > /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stderr > at
[jira] [Comment Edited] (SPARK-28794) Document CREATE TABLE in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-28794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16966433#comment-16966433 ] pavithra ramachandran edited comment on SPARK-28794 at 11/7/19 3:08 AM: I shall raise a PR by the weekend. was (Author: pavithraramachandran): i shall raise PR by tomorrow. > Document CREATE TABLE in SQL Reference. > --- > > Key: SPARK-28794 > URL: https://issues.apache.org/jira/browse/SPARK-28794 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.4.3 >Reporter: Dilip Biswal >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29783) Support SQL Standard output style for interval type
Kent Yao created SPARK-29783: Summary: Support SQL Standard output style for interval type Key: SPARK-29783 URL: https://issues.apache.org/jira/browse/SPARK-29783 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Kent Yao Support the SQL standard interval style for output.
||Style||Conf||Year-Month Interval||Day-Time Interval||Mixed Interval||
|{{sql_standard}}|ANSI enabled|1-2|3 4:05:06|-1-2 3 -4:05:06|
|{{spark's current}}|ANSI disabled|1 year 2 mons|1 days 2 hours 3 minutes 4.123456 seconds|interval 1 days 2 hours 3 minutes 4.123456 seconds|
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
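As a point of reference, a small sketch (assuming a SparkSession named `spark`) that produces interval values of the kinds listed in the table; how they would be rendered under the proposed sql_standard style, and which configuration selects it, is the subject of this ticket and is an assumption until the change lands:

{code:scala}
// A year-month interval and a day-time interval whose textual output style
// (verbose Spark style vs. SQL-standard "1-2" / "3 4:05:06" style) is at issue here.
spark.sql(
  "SELECT INTERVAL 1 YEAR 2 MONTH AS ym, INTERVAL 3 DAY 4 HOUR 5 MINUTE 6 SECOND AS dt"
).show(truncate = false)
{code}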
[jira] [Commented] (SPARK-29778) saveAsTable append mode is not passing writer options
[ https://issues.apache.org/jira/browse/SPARK-29778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968881#comment-16968881 ] yueqi commented on SPARK-29778: --- Does v1 table support append? just checking > saveAsTable append mode is not passing writer options > - > > Key: SPARK-29778 > URL: https://issues.apache.org/jira/browse/SPARK-29778 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Priority: Critical > > There was an oversight where AppendData is not getting the WriterOptions in > saveAsTable. > [https://github.com/apache/spark/blob/782992c7ed652400e33bc4b1da04c8155b7b3866/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L530] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29782) Spark broadcast cannot be destroyed in some versions
[ https://issues.apache.org/jira/browse/SPARK-29782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tzxxh updated SPARK-29782: -- Description: In Spark versions 2.3.3, 2.4.1, 2.4.2, 2.4.3 and 2.4.4, the Broadcast.destroy() method cannot destroy the broadcast data; the driver and executor storage memory shown in the Spark UI keeps increasing. {code:java} val batch = Seq(1 to : _*) val strSeq = batch.map(i => s"xxh-$i") val rdd = sc.parallelize(strSeq) rdd.cache() batch.foreach(_ => { val broc = sc.broadcast(strSeq) rdd.map(id => broc.value.contains(id)).collect() broc.destroy() }) {code} was: In Spark versions 2.3.3, 2.4.1, 2.4.2, 2.4.3 and 2.4.4, the Broadcast.destroy() method cannot destroy the broadcast data; the driver and executor storage memory shown in the Spark UI keeps increasing. {code:java} // code placeholder val batch = Seq(1 to : _*) val strSeq = batch.map(i => s"xxh-$i") val rdd = sc.parallelize(strSeq) rdd.cache() batch.foreach(_ => { val broc = sc.broadcast(strSeq) rdd.map(id => broc.value.contains(id)).collect() broc.destroy() }) {code} > Spark broadcast cannot be destroyed in some versions > - > > Key: SPARK-29782 > URL: https://issues.apache.org/jira/browse/SPARK-29782 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.3.3, 2.4.1, 2.4.2, 2.4.3, 2.4.4 >Reporter: tzxxh >Priority: Major > Attachments: correct version.png, problem versions.png > > > In Spark versions 2.3.3, 2.4.1, 2.4.2, 2.4.3 and 2.4.4, the Broadcast.destroy() > method cannot destroy the broadcast data; the driver and > executor storage memory shown in the Spark UI keeps increasing. > {code:java} > val batch = Seq(1 to : _*) > val strSeq = batch.map(i => s"xxh-$i") > val rdd = sc.parallelize(strSeq) > rdd.cache() > batch.foreach(_ => { > val broc = sc.broadcast(strSeq) > rdd.map(id => broc.value.contains(id)).collect() > broc.destroy() > }) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29760) Document VALUES statement in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-29760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968853#comment-16968853 ] Sean R. Owen commented on SPARK-29760: -- I mean, what would you add where? you could open a PR to propose it. It may not be worth mentioning separately unless there's a clear place for it. > Document VALUES statement in SQL Reference. > --- > > Key: SPARK-29760 > URL: https://issues.apache.org/jira/browse/SPARK-29760 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.4.4 >Reporter: jobit mathew >Priority: Minor > > spark-sql also supports *VALUES *. > {code:java} > spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three'); > 1 one > 2 two > 3 three > Time taken: 0.015 seconds, Fetched 3 row(s) > spark-sql> > spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three') limit 2; > 1 one > 2 two > Time taken: 0.014 seconds, Fetched 2 row(s) > spark-sql> > spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three') order by 2; > 1 one > 3 three > 2 two > Time taken: 0.153 seconds, Fetched 3 row(s) > spark-sql> > {code} > or even *values *can be used along with INSERT INTO or select. > refer: https://www.postgresql.org/docs/current/sql-values.html > So please confirm VALUES also can be documented or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
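[Editor's note] In case it helps decide where VALUES would fit in the SQL reference, a small sketch of the standalone, SELECT, and INSERT INTO usages the ticket mentions; the table name "t" is made up.
{code:java}
// Hedged sketch: VALUES as a standalone query, as an inline table in a SELECT,
// and as the source of an INSERT. Table name "t" is a placeholder.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("values-examples").getOrCreate()

spark.sql("VALUES (1, 'one'), (2, 'two'), (3, 'three')").show()
spark.sql("SELECT * FROM VALUES (1, 'one'), (2, 'two') AS v(id, name)").show()
spark.sql("CREATE TABLE IF NOT EXISTS t (id INT, name STRING) USING parquet")
spark.sql("INSERT INTO t VALUES (1, 'one'), (2, 'two')")
{code}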
[jira] [Commented] (SPARK-29760) Document VALUES statement in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-29760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968850#comment-16968850 ] jobit mathew commented on SPARK-29760: -- [~srowen],it can be added as a part of Build a SQL reference doc https://issues.apache.org/jira/browse/SPARK-28588 > Document VALUES statement in SQL Reference. > --- > > Key: SPARK-29760 > URL: https://issues.apache.org/jira/browse/SPARK-29760 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.4.4 >Reporter: jobit mathew >Priority: Minor > > spark-sql also supports *VALUES *. > {code:java} > spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three'); > 1 one > 2 two > 3 three > Time taken: 0.015 seconds, Fetched 3 row(s) > spark-sql> > spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three') limit 2; > 1 one > 2 two > Time taken: 0.014 seconds, Fetched 2 row(s) > spark-sql> > spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three') order by 2; > 1 one > 3 three > 2 two > Time taken: 0.153 seconds, Fetched 3 row(s) > spark-sql> > {code} > or even *values *can be used along with INSERT INTO or select. > refer: https://www.postgresql.org/docs/current/sql-values.html > So please confirm VALUES also can be documented or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29782) spark broadcast can not be destoryed in some versions
[ https://issues.apache.org/jira/browse/SPARK-29782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tzxxh updated SPARK-29782: -- Attachment: problem versions.png > spark broadcast can not be destoryed in some versions > - > > Key: SPARK-29782 > URL: https://issues.apache.org/jira/browse/SPARK-29782 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.3.3, 2.4.1, 2.4.2, 2.4.3, 2.4.4 >Reporter: tzxxh >Priority: Major > Attachments: correct version.png, problem versions.png > > > In spark version (2.3.3 , 2.4.1 , 2.4.2 , 2.4.3 , 2.4.4) use > Broadcast.destroy() method can not destroy the broadcast data, the driver and > executor storage memory in spark ui is continuous increase。 > {code:java} > //代码占位符 > val batch = Seq(1 to : _*) > val strSeq = batch.map(i => s"xxh-$i") > val rdd = sc.parallelize(strSeq) > rdd.cache() > batch.foreach(_ => { > val broc = sc.broadcast(strSeq) > rdd.map(id => broc.value.contains(id)).collect() > broc.destroy() > }) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29782) spark broadcast can not be destoryed in some versions
[ https://issues.apache.org/jira/browse/SPARK-29782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tzxxh updated SPARK-29782: -- Attachment: correct version.png > spark broadcast can not be destoryed in some versions > - > > Key: SPARK-29782 > URL: https://issues.apache.org/jira/browse/SPARK-29782 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.3.3, 2.4.1, 2.4.2, 2.4.3, 2.4.4 >Reporter: tzxxh >Priority: Major > Attachments: correct version.png, problem versions.png > > > In spark version (2.3.3 , 2.4.1 , 2.4.2 , 2.4.3 , 2.4.4) use > Broadcast.destroy() method can not destroy the broadcast data, the driver and > executor storage memory in spark ui is continuous increase。 > {code:java} > //代码占位符 > val batch = Seq(1 to : _*) > val strSeq = batch.map(i => s"xxh-$i") > val rdd = sc.parallelize(strSeq) > rdd.cache() > batch.foreach(_ => { > val broc = sc.broadcast(strSeq) > rdd.map(id => broc.value.contains(id)).collect() > broc.destroy() > }) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29782) spark broadcast can not be destoryed in some versions
[ https://issues.apache.org/jira/browse/SPARK-29782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tzxxh updated SPARK-29782: -- Attachment: (was: 33.png) > spark broadcast can not be destoryed in some versions > - > > Key: SPARK-29782 > URL: https://issues.apache.org/jira/browse/SPARK-29782 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.3.3, 2.4.1, 2.4.2, 2.4.3, 2.4.4 >Reporter: tzxxh >Priority: Major > Attachments: correct version.png, problem versions.png > > > In spark version (2.3.3 , 2.4.1 , 2.4.2 , 2.4.3 , 2.4.4) use > Broadcast.destroy() method can not destroy the broadcast data, the driver and > executor storage memory in spark ui is continuous increase。 > {code:java} > //代码占位符 > val batch = Seq(1 to : _*) > val strSeq = batch.map(i => s"xxh-$i") > val rdd = sc.parallelize(strSeq) > rdd.cache() > batch.foreach(_ => { > val broc = sc.broadcast(strSeq) > rdd.map(id => broc.value.contains(id)).collect() > broc.destroy() > }) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29782) spark broadcast can not be destoryed in some versions
[ https://issues.apache.org/jira/browse/SPARK-29782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tzxxh updated SPARK-29782: -- Attachment: (was: 44.png) > spark broadcast can not be destoryed in some versions > - > > Key: SPARK-29782 > URL: https://issues.apache.org/jira/browse/SPARK-29782 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.3.3, 2.4.1, 2.4.2, 2.4.3, 2.4.4 >Reporter: tzxxh >Priority: Major > Attachments: correct version.png, problem versions.png > > > In spark version (2.3.3 , 2.4.1 , 2.4.2 , 2.4.3 , 2.4.4) use > Broadcast.destroy() method can not destroy the broadcast data, the driver and > executor storage memory in spark ui is continuous increase。 > {code:java} > //代码占位符 > val batch = Seq(1 to : _*) > val strSeq = batch.map(i => s"xxh-$i") > val rdd = sc.parallelize(strSeq) > rdd.cache() > batch.foreach(_ => { > val broc = sc.broadcast(strSeq) > rdd.map(id => broc.value.contains(id)).collect() > broc.destroy() > }) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29782) spark broadcast can not be destoryed in some versions
[ https://issues.apache.org/jira/browse/SPARK-29782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tzxxh updated SPARK-29782: -- Description: In spark version (2.3.3 , 2.4.1 , 2.4.2 , 2.4.3 , 2.4.4) use Broadcast.destroy() method can not destroy the broadcast data, the driver and executor storage memory in spark ui is continuous increase。 {code:java} //代码占位符 val batch = Seq(1 to : _*) val strSeq = batch.map(i => s"xxh-$i") val rdd = sc.parallelize(strSeq) rdd.cache() batch.foreach(_ => { val broc = sc.broadcast(strSeq) rdd.map(id => broc.value.contains(id)).collect() broc.destroy() }) {code} was: In spark version (2.3.3 , 2.4.1 , 2.4.2 , 2.4.3 , 2.4.4) use Broadcast.destroy() method can not destroy the broadcast data, the driver and executor storage memory in spark ui is continuous increase。 {code:java} //代码占位符 val batch = Seq(1 to : _*) val strSeq = batch.map(i => s"xxh-$i") v al rdd = sc.parallelize(strSeq) rdd.cache() batch.foreach(_ => { val broc = sc.broadcast(strSeq) rdd.map(id => broc.value.contains(id)).collect() broc.destroy() }) {code} > spark broadcast can not be destoryed in some versions > - > > Key: SPARK-29782 > URL: https://issues.apache.org/jira/browse/SPARK-29782 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.3.3, 2.4.1, 2.4.2, 2.4.3, 2.4.4 >Reporter: tzxxh >Priority: Major > Attachments: 33.png, 44.png > > > In spark version (2.3.3 , 2.4.1 , 2.4.2 , 2.4.3 , 2.4.4) use > Broadcast.destroy() method can not destroy the broadcast data, the driver and > executor storage memory in spark ui is continuous increase。 > {code:java} > //代码占位符 > val batch = Seq(1 to : _*) > val strSeq = batch.map(i => s"xxh-$i") > val rdd = sc.parallelize(strSeq) > rdd.cache() > batch.foreach(_ => { > val broc = sc.broadcast(strSeq) > rdd.map(id => broc.value.contains(id)).collect() > broc.destroy() > }) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29782) spark broadcast can not be destoryed in some versions
[ https://issues.apache.org/jira/browse/SPARK-29782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tzxxh updated SPARK-29782: -- Attachment: 44.png > spark broadcast can not be destoryed in some versions > - > > Key: SPARK-29782 > URL: https://issues.apache.org/jira/browse/SPARK-29782 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.3.3, 2.4.1, 2.4.2, 2.4.3, 2.4.4 >Reporter: tzxxh >Priority: Major > Attachments: 33.png, 44.png > > > In spark version (2.3.3 , 2.4.1 , 2.4.2 , 2.4.3 , 2.4.4) use > Broadcast.destroy() method can not destroy the broadcast data, the driver and > executor storage memory in spark ui is continuous increase。 > {code:java} > //代码占位符 > val batch = Seq(1 to : _*) > val strSeq = batch.map(i => s"xxh-$i") v > al rdd = sc.parallelize(strSeq) > rdd.cache() > batch.foreach(_ => { > val broc = sc.broadcast(strSeq) > rdd.map(id => broc.value.contains(id)).collect() > broc.destroy() > }) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29782) spark broadcast can not be destoryed in some versions
[ https://issues.apache.org/jira/browse/SPARK-29782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tzxxh updated SPARK-29782: -- Attachment: 33.png > spark broadcast can not be destoryed in some versions > - > > Key: SPARK-29782 > URL: https://issues.apache.org/jira/browse/SPARK-29782 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.3.3, 2.4.1, 2.4.2, 2.4.3, 2.4.4 >Reporter: tzxxh >Priority: Major > Attachments: 33.png > > > In spark version (2.3.3 , 2.4.1 , 2.4.2 , 2.4.3 , 2.4.4) use > Broadcast.destroy() method can not destroy the broadcast data, the driver and > executor storage memory in spark ui is continuous increase。 > {code:java} > //代码占位符 > val batch = Seq(1 to : _*) > val strSeq = batch.map(i => s"xxh-$i") v > al rdd = sc.parallelize(strSeq) > rdd.cache() > batch.foreach(_ => { > val broc = sc.broadcast(strSeq) > rdd.map(id => broc.value.contains(id)).collect() > broc.destroy() > }) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29782) spark broadcast can not be destoryed in some versions
tzxxh created SPARK-29782: - Summary: spark broadcast can not be destoryed in some versions Key: SPARK-29782 URL: https://issues.apache.org/jira/browse/SPARK-29782 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 2.4.4, 2.4.3, 2.4.2, 2.4.1, 2.3.3 Reporter: tzxxh In Spark versions 2.3.3, 2.4.1, 2.4.2, 2.4.3 and 2.4.4, calling the Broadcast.destroy() method does not destroy the broadcast data; the driver and executor storage memory shown in the Spark UI keeps increasing. {code:java} // code placeholder val batch = Seq(1 to : _*) val strSeq = batch.map(i => s"xxh-$i") val rdd = sc.parallelize(strSeq) rdd.cache() batch.foreach(_ => { val broc = sc.broadcast(strSeq) rdd.map(id => broc.value.contains(id)).collect() broc.destroy() }) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29782) spark broadcast can not be destoryed in some versions
[ https://issues.apache.org/jira/browse/SPARK-29782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tzxxh updated SPARK-29782: -- Attachment: 1.png > spark broadcast can not be destoryed in some versions > - > > Key: SPARK-29782 > URL: https://issues.apache.org/jira/browse/SPARK-29782 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.3.3, 2.4.1, 2.4.2, 2.4.3, 2.4.4 >Reporter: tzxxh >Priority: Major > > In spark version (2.3.3 , 2.4.1 , 2.4.2 , 2.4.3 , 2.4.4) use > Broadcast.destroy() method can not destroy the broadcast data, the driver and > executor storage memory in spark ui is continuous increase。 > {code:java} > //代码占位符 > val batch = Seq(1 to : _*) > val strSeq = batch.map(i => s"xxh-$i") v > al rdd = sc.parallelize(strSeq) > rdd.cache() > batch.foreach(_ => { > val broc = sc.broadcast(strSeq) > rdd.map(id => broc.value.contains(id)).collect() > broc.destroy() > }) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29782) spark broadcast can not be destoryed in some versions
[ https://issues.apache.org/jira/browse/SPARK-29782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tzxxh updated SPARK-29782: -- Attachment: (was: 1.png) > spark broadcast can not be destoryed in some versions > - > > Key: SPARK-29782 > URL: https://issues.apache.org/jira/browse/SPARK-29782 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.3.3, 2.4.1, 2.4.2, 2.4.3, 2.4.4 >Reporter: tzxxh >Priority: Major > > In spark version (2.3.3 , 2.4.1 , 2.4.2 , 2.4.3 , 2.4.4) use > Broadcast.destroy() method can not destroy the broadcast data, the driver and > executor storage memory in spark ui is continuous increase。 > {code:java} > //代码占位符 > val batch = Seq(1 to : _*) > val strSeq = batch.map(i => s"xxh-$i") v > al rdd = sc.parallelize(strSeq) > rdd.cache() > batch.foreach(_ => { > val broc = sc.broadcast(strSeq) > rdd.map(id => broc.value.contains(id)).collect() > broc.destroy() > }) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29635) Deduplicate test suites between Kafka micro-batch sink and Kafka continuous sink
[ https://issues.apache.org/jira/browse/SPARK-29635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-29635. Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26292 [https://github.com/apache/spark/pull/26292] > Deduplicate test suites between Kafka micro-batch sink and Kafka continuous > sink > > > Key: SPARK-29635 > URL: https://issues.apache.org/jira/browse/SPARK-29635 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Minor > Fix For: 3.0.0 > > > There's a comment in KafkaContinuousSinkSuite which is most likely explaining > TODO: > https://github.com/apache/spark/blob/37690dea107623ebca1e47c64db59196ee388f2f/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSinkSuite.scala#L35-L39 > {noformat} > /** > * This is a temporary port of KafkaSinkSuite, since we do not yet have a V2 > memory stream. > * Once we have one, this will be changed to a specialization of > KafkaSinkSuite and we won't have > * to duplicate all the code. > */ > {noformat} > Given latest master branch has V2 memory stream now, it is a good time to > deduplicate two suites into one, via having base class and let these suites > override necessary things. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29635) Deduplicate test suites between Kafka micro-batch sink and Kafka continuous sink
[ https://issues.apache.org/jira/browse/SPARK-29635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin reassigned SPARK-29635: -- Assignee: Jungtaek Lim > Deduplicate test suites between Kafka micro-batch sink and Kafka continuous > sink > > > Key: SPARK-29635 > URL: https://issues.apache.org/jira/browse/SPARK-29635 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Minor > > There's a comment in KafkaContinuousSinkSuite which is most likely explaining > TODO: > https://github.com/apache/spark/blob/37690dea107623ebca1e47c64db59196ee388f2f/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSinkSuite.scala#L35-L39 > {noformat} > /** > * This is a temporary port of KafkaSinkSuite, since we do not yet have a V2 > memory stream. > * Once we have one, this will be changed to a specialization of > KafkaSinkSuite and we won't have > * to duplicate all the code. > */ > {noformat} > Given latest master branch has V2 memory stream now, it is a good time to > deduplicate two suites into one, via having base class and let these suites > override necessary things. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29720) Add linux condition to make ProcfsMetricsGetter more complete
[ https://issues.apache.org/jira/browse/SPARK-29720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29720. -- Resolution: Not A Problem > Add linux condition to make ProcfsMetricsGetter more complete > - > > Key: SPARK-29720 > URL: https://issues.apache.org/jira/browse/SPARK-29720 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: ulysses you >Priority: Minor > > Currently, the decision about whether proc stats can be gathered into executor metrics is: > {code:java} > procDirExists.get && shouldLogStageExecutorProcessTreeMetrics && > shouldLogStageExecutorMetrics > {code} > But /proc is only supported on Linux, so an isLinux condition should be added. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
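[Editor's note] For reference, a sketch of the extra check the ticket proposed (the issue was resolved as Not A Problem). The three flags below are local placeholders standing in for the conditions quoted above, and the OS check assumes Apache Commons Lang's SystemUtils rather than whatever Spark itself would use.
{code:java}
// Sketch only. The placeholder flags mirror the conditions quoted in the ticket;
// the added piece is the Linux check, since /proc exists only there.
import java.nio.file.{Files, Paths}
import org.apache.commons.lang3.SystemUtils

val procDirExists: Boolean = Files.isDirectory(Paths.get("/proc"))
val shouldLogStageExecutorProcessTreeMetrics: Boolean = true  // placeholder
val shouldLogStageExecutorMetrics: Boolean = true             // placeholder

val isProcfsAvailable: Boolean =
  SystemUtils.IS_OS_LINUX &&
  procDirExists &&
  shouldLogStageExecutorProcessTreeMetrics &&
  shouldLogStageExecutorMetrics
{code}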
[jira] [Commented] (SPARK-17814) spark submit arguments are truncated in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-17814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968816#comment-16968816 ] Juan Ramos Fuentes commented on SPARK-17814: Looks like I'm about 3 years too late, but I just encountered this same issue. I'm also passing a string of JSON as an arg to spark-submit and seeing the last two curly braces being replaced with empty string. My temporary solution is to avoid having two curly braces next to each other, but I was wondering if you have any thoughts for how to solve this issue [~jerryshao] [~shreyass123] > spark submit arguments are truncated in yarn-cluster mode > - > > Key: SPARK-17814 > URL: https://issues.apache.org/jira/browse/SPARK-17814 > Project: Spark > Issue Type: Bug > Components: Spark Submit, YARN >Affects Versions: 1.6.1 >Reporter: shreyas subramanya >Priority: Minor > > {noformat} > One of our customers is trying to pass in json through spark-submit as > follows: > spark-submit --verbose --class SimpleClass --master yarn-cluster ./simple.jar > "{\"mode\":\"wf\", \"arrays\":{\"array\":[1]}}" > The application reports the passed arguments as: {"mode":"wf", > "arrays":{"array":[1] > If the same application is submitted in yarn-client mode, as follows: > spark-submit --verbose --class SimpleClass --master yarn-client ./simple.jar > "{\"mode\":\"wf\", \"arrays\":{\"array\":[1]}}" > The application reports the passed args as: {"mode":"wf", > "arrays":{"array":[1]}} > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
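[Editor's note] A minimal sketch of the kind of check the report describes: have the application print the argument it actually received, so the yarn-cluster and yarn-client results can be compared. The class name SimpleClass comes from the report; the body is an assumption.
{code:java}
object SimpleClass {
  def main(args: Array[String]): Unit = {
    // Print exactly what arrived, so truncation of trailing braces is visible.
    println(s"argument count: ${args.length}")
    args.zipWithIndex.foreach { case (a, i) => println(s"arg($i) = $a") }
  }
}
{code}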
[jira] [Updated] (SPARK-29781) Override SBT Jackson-databind dependency like Maven
[ https://issues.apache.org/jira/browse/SPARK-29781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29781: -- Summary: Override SBT Jackson-databind dependency like Maven (was: Override SBT Jackson dependency like Maven) > Override SBT Jackson-databind dependency like Maven > --- > > Key: SPARK-29781 > URL: https://issues.apache.org/jira/browse/SPARK-29781 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.4 >Reporter: Dongjoon Hyun >Priority: Major > > This is `branch-2.4` only issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29781) Override SBT Jackson dependency like Maven
Dongjoon Hyun created SPARK-29781: - Summary: Override SBT Jackson dependency like Maven Key: SPARK-29781 URL: https://issues.apache.org/jira/browse/SPARK-29781 Project: Spark Issue Type: Bug Components: Build, Spark Core Affects Versions: 2.4.4 Reporter: Dongjoon Hyun This is `branch-2.4` only issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29781) Override SBT Jackson dependency like Maven
[ https://issues.apache.org/jira/browse/SPARK-29781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29781: -- Component/s: (was: Spark Core) > Override SBT Jackson dependency like Maven > -- > > Key: SPARK-29781 > URL: https://issues.apache.org/jira/browse/SPARK-29781 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.4 >Reporter: Dongjoon Hyun >Priority: Major > > This is `branch-2.4` only issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29780) The UI can access into the ResourceAllocator, whose data structures are being updated from scheduler threads
Alessandro Bellina created SPARK-29780: -- Summary: The UI can access into the ResourceAllocator, whose data structures are being updated from scheduler threads Key: SPARK-29780 URL: https://issues.apache.org/jira/browse/SPARK-29780 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Alessandro Bellina A class extending ResourceAllocator (WorkerResourceInfo), has some potential issues (raised here: https://github.com/apache/spark/pull/26078#discussion_r340342820) The WorkerInfo class is calling into availableAddrs and assignedAddrs but those calls appear to be coming from the UI (looking at the resourcesInfo* functions), e.g. JsonProtocol and MasterPage call this. Since the datastructures in ResourceAllocator are not concurrent, we could end up with bad data or potentially crashes, depending on when the calls are made. Note that there are other calls to the resourceInfo* functions, but those are from the event loop. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29759) LocalShuffleReaderExec.outputPartitioning should use the corrected attributes
[ https://issues.apache.org/jira/browse/SPARK-29759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-29759. - Fix Version/s: 3.0.0 Resolution: Fixed > LocalShuffleReaderExec.outputPartitioning should use the corrected attributes > - > > Key: SPARK-29759 > URL: https://issues.apache.org/jira/browse/SPARK-29759 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29765) Monitoring UI throws IndexOutOfBoundsException when accessing metrics of attempt in stage
[ https://issues.apache.org/jira/browse/SPARK-29765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968760#comment-16968760 ] shahid commented on SPARK-29765: Event drop can also happen that if we don't provide sufficient memory. That case increasing the configs won't help I think. If the event drop happens the entire UI behaves weirdly. > Monitoring UI throws IndexOutOfBoundsException when accessing metrics of > attempt in stage > - > > Key: SPARK-29765 > URL: https://issues.apache.org/jira/browse/SPARK-29765 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 > Environment: Amazon EMR 5.27 >Reporter: Viacheslav Tradunsky >Priority: Major > > When clicking on one of the largest tasks by input, I get to > [http://:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0|http://10.207.110.207:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0] > with 500 error > {code:java} > java.lang.IndexOutOfBoundsException: 95745 at > scala.collection.immutable.Vector.checkRangeConvert(Vector.scala:132) at > scala.collection.immutable.Vector.apply(Vector.scala:122) at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply$mcDJ$sp(AppStatusStore.scala:255) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofLong.foreach(ArrayOps.scala:246) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at > scala.collection.mutable.ArrayOps$ofLong.map(ArrayOps.scala:246) at > org.apache.spark.status.AppStatusStore.scanTasks$1(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore.taskSummary(AppStatusStore.scala:287) > at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:321) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:166) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > 
org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:539) at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333) at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108) > at > org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) >
[jira] [Commented] (SPARK-29767) Core dump happening on executors while doing simple union of Data Frames
[ https://issues.apache.org/jira/browse/SPARK-29767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968757#comment-16968757 ] Udit Mehrotra commented on SPARK-29767: --- [~hyukjin.kwon] As I have mentioned in the description and you an see from the *stdout* logs it fails to write the *core dump*. Any idea how I can get around it ? > Core dump happening on executors while doing simple union of Data Frames > > > Key: SPARK-29767 > URL: https://issues.apache.org/jira/browse/SPARK-29767 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.4.4 > Environment: AWS EMR 5.27.0, Spark 2.4.4 >Reporter: Udit Mehrotra >Priority: Major > Attachments: > part-0-0189b5c2-7f7b-4d0e-bdb8-506380253597-c000.snappy.parquet > > > Running a union operation on two DataFrames through both Scala Spark Shell > and PySpark, resulting in executor contains doing a *core dump* and existing > with Exit code 134. > The trace from the *Driver*: > {noformat} > Container exited with a non-zero exit code 134 > . > 19/11/06 02:21:35 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; > aborting job > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 > (TID 5, ip-172-30-6-79.ec2.internal, executor 11): ExecutorLostFailure > (executor 11 exited caused by one of the running tasks) Reason: Container > from a bad node: container_1572981097605_0021_01_77 on host: > ip-172-30-6-79.ec2.internal. Exit status: 134. Diagnostics: Exception from > container-launch. > Container id: container_1572981097605_0021_01_77 > Exit code: 134 > Exception message: /bin/bash: line 1: 12611 Aborted > LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native" > /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' > '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' > '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' > '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' > -Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp > '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' > -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77 > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 > --executor-id 11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id > application_1572981097605_0021 --user-class-path > file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar > > > /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stdout > 2> > /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stderrStack > trace: ExitCodeException exitCode=134: /bin/bash: line 1: 12611 Aborted > > LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native" > 
/usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' > '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' > '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' > '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' > -Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp > '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' > -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77 > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 > --executor-id 11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id > application_1572981097605_0021 --user-class-path > file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar > > > /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stdout >
[jira] [Updated] (SPARK-29777) SparkR::cleanClosure aggressively removes a function required by user function
[ https://issues.apache.org/jira/browse/SPARK-29777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-29777: --- Description: Following code block reproduces the issue: {code} df <- createDataFrame(data.frame(x=1)) f1 <- function(x) x + 1 f2 <- function(x) f1(x) + 2 dapplyCollect(df, function(x) { f1(x); f2(x) }) {code} We get following error message: {code} org.apache.spark.SparkException: R computation failed with Error in f1(x) : could not find function "f1" Calls: compute -> computeFunc -> f2 {code} Compare that to this code block with succeeds: {code} dapplyCollect(df, function(x) { f2(x) }) {code} was: Following code block reproduces the issue: {code:java} library(SparkR) sparkR.session() spark_df <- createDataFrame(na.omit(airquality)) cody_local2 <- function(param2) { 10 + param2 } cody_local1 <- function(param1) { cody_local2(param1) } result <- cody_local2(5) calc_df <- dapplyCollect(spark_df, function(x) { cody_local2(20) cody_local1(5) }) print(result) {code} We get following error message: {code:java} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 (TID 27, 10.0.174.239, executor 0): org.apache.spark.SparkException: R computation failed with Error in cody_local2(param1) : could not find function "cody_local2" Calls: compute -> computeFunc -> cody_local1 {code} Compare that to this code block that succeeds: {code:java} calc_df <- dapplyCollect(spark_df, function(x) { cody_local2(20) #cody_local1(5) }) {code} > SparkR::cleanClosure aggressively removes a function required by user function > -- > > Key: SPARK-29777 > URL: https://issues.apache.org/jira/browse/SPARK-29777 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.4 >Reporter: Hossein Falaki >Priority: Major > > Following code block reproduces the issue: > {code} > df <- createDataFrame(data.frame(x=1)) > f1 <- function(x) x + 1 > f2 <- function(x) f1(x) + 2 > dapplyCollect(df, function(x) { f1(x); f2(x) }) > {code} > We get following error message: > {code} > org.apache.spark.SparkException: R computation failed with > Error in f1(x) : could not find function "f1" > Calls: compute -> computeFunc -> f2 > {code} > Compare that to this code block with succeeds: > {code} > dapplyCollect(df, function(x) { f2(x) }) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29765) Monitoring UI throws IndexOutOfBoundsException when accessing metrics of attempt in stage
[ https://issues.apache.org/jira/browse/SPARK-29765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968753#comment-16968753 ] Viacheslav Tradunsky commented on SPARK-29765: -- What's in particularly interesting, that the error doesn't reproduce when this stage is in completed list. > Monitoring UI throws IndexOutOfBoundsException when accessing metrics of > attempt in stage > - > > Key: SPARK-29765 > URL: https://issues.apache.org/jira/browse/SPARK-29765 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 > Environment: Amazon EMR 5.27 >Reporter: Viacheslav Tradunsky >Priority: Major > > When clicking on one of the largest tasks by input, I get to > [http://:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0|http://10.207.110.207:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0] > with 500 error > {code:java} > java.lang.IndexOutOfBoundsException: 95745 at > scala.collection.immutable.Vector.checkRangeConvert(Vector.scala:132) at > scala.collection.immutable.Vector.apply(Vector.scala:122) at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply$mcDJ$sp(AppStatusStore.scala:255) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofLong.foreach(ArrayOps.scala:246) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at > scala.collection.mutable.ArrayOps$ofLong.map(ArrayOps.scala:246) at > org.apache.spark.status.AppStatusStore.scanTasks$1(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore.taskSummary(AppStatusStore.scala:287) > at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:321) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:166) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > 
org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:539) at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333) at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108) > at > org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) > at java.lang.Thread.run(Thread.java:748){code}
[jira] [Commented] (SPARK-29765) Monitoring UI throws IndexOutOfBoundsException when accessing metrics of attempt in stage
[ https://issues.apache.org/jira/browse/SPARK-29765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968751#comment-16968751 ] Viacheslav Tradunsky commented on SPARK-29765: -- Increased the capacity to 2 as pointed out in docs: [https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/configuration.html#scheduling] But the error still happens. > Monitoring UI throws IndexOutOfBoundsException when accessing metrics of > attempt in stage > - > > Key: SPARK-29765 > URL: https://issues.apache.org/jira/browse/SPARK-29765 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 > Environment: Amazon EMR 5.27 >Reporter: Viacheslav Tradunsky >Priority: Major > > When clicking on one of the largest tasks by input, I get to > [http://:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0|http://10.207.110.207:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0] > with 500 error > {code:java} > java.lang.IndexOutOfBoundsException: 95745 at > scala.collection.immutable.Vector.checkRangeConvert(Vector.scala:132) at > scala.collection.immutable.Vector.apply(Vector.scala:122) at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply$mcDJ$sp(AppStatusStore.scala:255) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofLong.foreach(ArrayOps.scala:246) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at > scala.collection.mutable.ArrayOps$ofLong.map(ArrayOps.scala:246) at > org.apache.spark.status.AppStatusStore.scanTasks$1(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore.taskSummary(AppStatusStore.scala:287) > at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:321) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:166) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) > at > 
org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:539) at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333) at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108) > at > org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at >
[jira] [Commented] (SPARK-28502) Error with struct conversion while using pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-28502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968744#comment-16968744 ] Bryan Cutler commented on SPARK-28502: -- Ahh, so Arrow 0.15.0+ had a change in the IPC format that requires pyspark to set an env var. See [https://github.com/apache/spark/blob/master/docs/sql-pyspark-pandas-with-arrow.md#compatibiliy-setting-for-pyarrow--0150-and-spark-23x-24x,] that should fix the problem with the Spark preview and once SPARK-29376 is merged in 3.0, you won't need to do this. {quote} I have to manually add window to returning dataframe. Is there a way to automatically concatenate results of udf? {quote} I don't believe there is a way to add the key/window in the DataFrame automatically, you will have to manually add it in the udf. > Error with struct conversion while using pandas_udf > --- > > Key: SPARK-28502 > URL: https://issues.apache.org/jira/browse/SPARK-28502 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.3 > Environment: OS: Ubuntu > Python: 3.6 >Reporter: Nasir Ali >Priority: Minor > Fix For: 3.0.0 > > > What I am trying to do: Group data based on time intervals (e.g., 15 days > window) and perform some operations on dataframe using (pandas) UDFs. I don't > know if there is a better/cleaner way to do it. > Below is the sample code that I tried and error message I am getting. > > {code:java} > df = sparkSession.createDataFrame([(17.00, "2018-03-10T15:27:18+00:00"), > (13.00, "2018-03-11T12:27:18+00:00"), > (25.00, "2018-03-12T11:27:18+00:00"), > (20.00, "2018-03-13T15:27:18+00:00"), > (17.00, "2018-03-14T12:27:18+00:00"), > (99.00, "2018-03-15T11:27:18+00:00"), > (156.00, "2018-03-22T11:27:18+00:00"), > (17.00, "2018-03-31T11:27:18+00:00"), > (25.00, "2018-03-15T11:27:18+00:00"), > (25.00, "2018-03-16T11:27:18+00:00") > ], >["id", "ts"]) > df = df.withColumn('ts', df.ts.cast('timestamp')) > schema = StructType([ > StructField("id", IntegerType()), > StructField("ts", TimestampType()) > ]) > @pandas_udf(schema, PandasUDFType.GROUPED_MAP) > def some_udf(df): > # some computation > return df > df.groupby('id', F.window("ts", "15 days")).apply(some_udf).show() > {code} > This throws following exception: > {code:java} > TypeError: Unsupported type in conversion from Arrow: struct timestamp[us, tz=America/Chicago], end: timestamp[us, tz=America/Chicago]> > {code} > > However, if I use builtin agg method then it works all fine. For example, > {code:java} > df.groupby('id', F.window("ts", "15 days")).mean().show(truncate=False) > {code} > Output > {code:java} > +-+--+---+ > |id |window|avg(id)| > +-+--+---+ > |13.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|13.0 | > |17.0 |[2018-03-20 00:00:00, 2018-04-04 00:00:00]|17.0 | > |156.0|[2018-03-20 00:00:00, 2018-04-04 00:00:00]|156.0 | > |99.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|99.0 | > |20.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|20.0 | > |17.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|17.0 | > |25.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|25.0 | > +-+--+---+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29765) Monitoring UI throws IndexOutOfBoundsException when accessing metrics of attempt in stage
[ https://issues.apache.org/jira/browse/SPARK-29765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968739#comment-16968739 ] Viacheslav Tradunsky commented on SPARK-29765: -- I think it is capacity after code review ;) spark.scheduler.listenerbus.eventqueue.capacity Thanks a lot! > Monitoring UI throws IndexOutOfBoundsException when accessing metrics of > attempt in stage > - > > Key: SPARK-29765 > URL: https://issues.apache.org/jira/browse/SPARK-29765 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 > Environment: Amazon EMR 5.27 >Reporter: Viacheslav Tradunsky >Priority: Major > > When clicking on one of the largest tasks by input, I get to > [http://:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0|http://10.207.110.207:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0] > with 500 error > {code:java} > java.lang.IndexOutOfBoundsException: 95745 at > scala.collection.immutable.Vector.checkRangeConvert(Vector.scala:132) at > scala.collection.immutable.Vector.apply(Vector.scala:122) at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply$mcDJ$sp(AppStatusStore.scala:255) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofLong.foreach(ArrayOps.scala:246) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at > scala.collection.mutable.ArrayOps$ofLong.map(ArrayOps.scala:246) at > org.apache.spark.status.AppStatusStore.scanTasks$1(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore.taskSummary(AppStatusStore.scala:287) > at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:321) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:166) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > 
org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:539) at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333) at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108) > at > org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) > at
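[Editor's note] A hedged sketch of raising the listener-bus queue capacity mentioned in the comment above. The value 20000 is an arbitrary example (the default is 10000), and, as noted elsewhere in the comments, a larger queue only helps when events are dropped because the queue fills up, not when memory itself is the bottleneck.
{code:java}
// Sketch: set the listener-bus event queue capacity when building the session.
// 20000 is only an example value; tune it to the workload.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("larger-listener-queue")
  .config("spark.scheduler.listenerbus.eventqueue.capacity", "20000")
  .getOrCreate()
{code}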
[jira] [Updated] (SPARK-29779) Compact old event log files and clean up
[ https://issues.apache.org/jira/browse/SPARK-29779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-29779: - Parent: SPARK-28594 Issue Type: Sub-task (was: Task) > Compact old event log files and clean up > > > Key: SPARK-29779 > URL: https://issues.apache.org/jira/browse/SPARK-29779 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Major > > This issue is to track the effort on compacting old event logs (and cleaning > up after compaction) without breaking guaranteeing of compatibility. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29765) Monitoring UI throws IndexOutOfBoundsException when accessing metrics of attempt in stage
[ https://issues.apache.org/jira/browse/SPARK-29765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968737#comment-16968737 ] shahid commented on SPARK-29765: Thanks. Yes, you can try increasing `spark.scheduler.listenerbus.eventqueue.size`. > Monitoring UI throws IndexOutOfBoundsException when accessing metrics of > attempt in stage > - > > Key: SPARK-29765 > URL: https://issues.apache.org/jira/browse/SPARK-29765 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 > Environment: Amazon EMR 5.27 >Reporter: Viacheslav Tradunsky >Priority: Major > > When clicking on one of the largest tasks by input, I get to > [http://:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0|http://10.207.110.207:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0] > with 500 error > {code:java} > java.lang.IndexOutOfBoundsException: 95745 at > scala.collection.immutable.Vector.checkRangeConvert(Vector.scala:132) at > scala.collection.immutable.Vector.apply(Vector.scala:122) at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply$mcDJ$sp(AppStatusStore.scala:255) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofLong.foreach(ArrayOps.scala:246) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at > scala.collection.mutable.ArrayOps$ofLong.map(ArrayOps.scala:246) at > org.apache.spark.status.AppStatusStore.scanTasks$1(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore.taskSummary(AppStatusStore.scala:287) > at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:321) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:166) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at 
org.spark_project.jetty.server.Server.handle(Server.java:539) at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333) at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108) > at > org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) > at java.lang.Thread.run(Thread.java:748){code} -- This message was sent by Atlassian Jira
[jira] [Created] (SPARK-29779) Compact old event log files and clean up
Jungtaek Lim created SPARK-29779: Summary: Compact old event log files and clean up Key: SPARK-29779 URL: https://issues.apache.org/jira/browse/SPARK-29779 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 3.0.0 Reporter: Jungtaek Lim This issue is to track the effort on compacting old event logs (and cleaning up after compaction) without breaking the compatibility guarantee. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29777) SparkR::cleanClosure aggressively removes a function required by user function
[ https://issues.apache.org/jira/browse/SPARK-29777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-29777: --- Description: Following code block reproduces the issue: {code:java} library(SparkR) sparkR.session() spark_df <- createDataFrame(na.omit(airquality)) cody_local2 <- function(param2) { 10 + param2 } cody_local1 <- function(param1) { cody_local2(param1) } result <- cody_local2(5) calc_df <- dapplyCollect(spark_df, function(x) { cody_local2(20) cody_local1(5) }) print(result) {code} We get following error message: {code:java} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 (TID 27, 10.0.174.239, executor 0): org.apache.spark.SparkException: R computation failed with Error in cody_local2(param1) : could not find function "cody_local2" Calls: compute -> computeFunc -> cody_local1 {code} Compare that to this code block that succeeds: {code:java} calc_df <- dapplyCollect(spark_df, function(x) { cody_local2(20) #cody_local1(5) }) {code} was: Following code block reproduces the issue: {code:java} library(SparkR) sparkR.session() spark_df <- createDataFrame(na.omit(airquality)) cody_local2 <- function(param2) { 10 + param2 } cody_local1 <- function(param1) { cody_local2(param1) } result <- cody_local2(5) calc_df <- dapplyCollect(spark_df, function(x) { cody_local2(20) cody_local1(5) }) print(result) {code} We get following error message: {code:java} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 (TID 27, 10.0.174.239, executor 0): org.apache.spark.SparkException: R computation failed with Error in cody_local2(param1) : could not find function "cody_local2" Calls: compute -> computeFunc -> cody_local1 {code} Compare that to this code block with succeeds: {code:java} calc_df <- dapplyCollect(spark_df, function(x) { cody_local2(20) #cody_local1(5) }) {code} > SparkR::cleanClosure aggressively removes a function required by user function > -- > > Key: SPARK-29777 > URL: https://issues.apache.org/jira/browse/SPARK-29777 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.4 >Reporter: Hossein Falaki >Priority: Major > > Following code block reproduces the issue: > {code:java} > library(SparkR) > sparkR.session() > spark_df <- createDataFrame(na.omit(airquality)) > cody_local2 <- function(param2) { > 10 + param2 > } > cody_local1 <- function(param1) { > cody_local2(param1) > } > result <- cody_local2(5) > calc_df <- dapplyCollect(spark_df, function(x) { > cody_local2(20) > cody_local1(5) > }) > print(result) > {code} > We get following error message: > {code:java} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 > (TID 27, 10.0.174.239, executor 0): org.apache.spark.SparkException: R > computation failed with > Error in cody_local2(param1) : could not find function "cody_local2" > Calls: compute -> computeFunc -> cody_local1 > {code} > Compare that to this code block that succeeds: > {code:java} > calc_df <- dapplyCollect(spark_df, function(x) { > cody_local2(20) > #cody_local1(5) > }) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29777) SparkR::cleanClosure aggressively removes a function required by user function
[ https://issues.apache.org/jira/browse/SPARK-29777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-29777: --- Description: Following code block reproduces the issue: {code:java} library(SparkR) sparkR.session() spark_df <- createDataFrame(na.omit(airquality)) cody_local2 <- function(param2) { 10 + param2 } cody_local1 <- function(param1) { cody_local2(param1) } result <- cody_local2(5) calc_df <- dapplyCollect(spark_df, function(x) { cody_local2(20) cody_local1(5) }) print(result) {code} We get following error message: {code:java} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 (TID 27, 10.0.174.239, executor 0): org.apache.spark.SparkException: R computation failed with Error in cody_local2(param1) : could not find function "cody_local2" Calls: compute -> computeFunc -> cody_local1 {code} Compare that to this code block with succeeds: {code:java} calc_df <- dapplyCollect(spark_df, function(x) { cody_local2(20) #cody_local1(5) }) {code} was: Following code block reproduces the issue: {code} library(SparkR) spark_df <- createDataFrame(na.omit(airquality)) cody_local2 <- function(param2) { 10 + param2 } cody_local1 <- function(param1) { cody_local2(param1) } result <- cody_local2(5) calc_df <- dapplyCollect(spark_df, function(x) { cody_local2(20) cody_local1(5) }) print(result) {code} We get following error message: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 (TID 27, 10.0.174.239, executor 0): org.apache.spark.SparkException: R computation failed with Error in cody_local2(param1) : could not find function "cody_local2" Calls: compute -> computeFunc -> cody_local1 {code} Compare that to this code block with succeeds: {code} calc_df <- dapplyCollect(spark_df, function(x) { cody_local2(20) #cody_local1(5) }) {code} > SparkR::cleanClosure aggressively removes a function required by user function > -- > > Key: SPARK-29777 > URL: https://issues.apache.org/jira/browse/SPARK-29777 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.4 >Reporter: Hossein Falaki >Priority: Major > > Following code block reproduces the issue: > {code:java} > library(SparkR) > sparkR.session() > spark_df <- createDataFrame(na.omit(airquality)) > cody_local2 <- function(param2) { > 10 + param2 > } > cody_local1 <- function(param1) { > cody_local2(param1) > } > result <- cody_local2(5) > calc_df <- dapplyCollect(spark_df, function(x) { > cody_local2(20) > cody_local1(5) > }) > print(result) > {code} > We get following error message: > {code:java} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 > (TID 27, 10.0.174.239, executor 0): org.apache.spark.SparkException: R > computation failed with > Error in cody_local2(param1) : could not find function "cody_local2" > Calls: compute -> computeFunc -> cody_local1 > {code} > Compare that to this code block with succeeds: > {code:java} > calc_df <- dapplyCollect(spark_df, function(x) { > cody_local2(20) > #cody_local1(5) > }) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29765) Monitoring UI throws IndexOutOfBoundsException when accessing metrics of attempt in stage
[ https://issues.apache.org/jira/browse/SPARK-29765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968720#comment-16968720 ] Viacheslav Tradunsky commented on SPARK-29765: -- Exactly as you said: {code:java} ERROR AsyncEventQueue: Dropping event from queue appStatus. This likely means one of the listeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler{code} Will increase the capacity and go back with results. Thanks! > Monitoring UI throws IndexOutOfBoundsException when accessing metrics of > attempt in stage > - > > Key: SPARK-29765 > URL: https://issues.apache.org/jira/browse/SPARK-29765 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 > Environment: Amazon EMR 5.27 >Reporter: Viacheslav Tradunsky >Priority: Major > > When clicking on one of the largest tasks by input, I get to > [http://:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0|http://10.207.110.207:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0] > with 500 error > {code:java} > java.lang.IndexOutOfBoundsException: 95745 at > scala.collection.immutable.Vector.checkRangeConvert(Vector.scala:132) at > scala.collection.immutable.Vector.apply(Vector.scala:122) at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply$mcDJ$sp(AppStatusStore.scala:255) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofLong.foreach(ArrayOps.scala:246) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at > scala.collection.mutable.ArrayOps$ofLong.map(ArrayOps.scala:246) at > org.apache.spark.status.AppStatusStore.scanTasks$1(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore.taskSummary(AppStatusStore.scala:287) > at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:321) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:166) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > 
org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:539) at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333) at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108) > at > org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at >
[jira] [Comment Edited] (SPARK-28502) Error with struct conversion while using pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-28502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968703#comment-16968703 ] Nasir Ali edited comment on SPARK-28502 at 11/6/19 9:17 PM: [~bryanc] If I perform any agg (e.g., _df.groupBy("id",F.window("ts", "15 days")).agg(\{"id":"avg"}).show()_ ) on grouped data, pyspark returns me the key (e.g., id, window) with the avg for each group. However, in the above example, when udf returns the struct, it does not automatically return the key. I have to manually add window to returning dataframe. Is there a way to automatically concatenate results of udf? was (Author: nasirali): [~bryanc] If I perform any agg (e.g., avg) on grouped data, pyspark returns me the key (e.g., window etc.) with the avg for each row. However, in the above example, when udf returns the struct, it does not automatically return the key. I have to manually add window to returning dataframe. Is there a way to automatically concatenate results of udf with key? > Error with struct conversion while using pandas_udf > --- > > Key: SPARK-28502 > URL: https://issues.apache.org/jira/browse/SPARK-28502 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.3 > Environment: OS: Ubuntu > Python: 3.6 >Reporter: Nasir Ali >Priority: Minor > Fix For: 3.0.0 > > > What I am trying to do: Group data based on time intervals (e.g., 15 days > window) and perform some operations on dataframe using (pandas) UDFs. I don't > know if there is a better/cleaner way to do it. > Below is the sample code that I tried and error message I am getting. > > {code:java} > df = sparkSession.createDataFrame([(17.00, "2018-03-10T15:27:18+00:00"), > (13.00, "2018-03-11T12:27:18+00:00"), > (25.00, "2018-03-12T11:27:18+00:00"), > (20.00, "2018-03-13T15:27:18+00:00"), > (17.00, "2018-03-14T12:27:18+00:00"), > (99.00, "2018-03-15T11:27:18+00:00"), > (156.00, "2018-03-22T11:27:18+00:00"), > (17.00, "2018-03-31T11:27:18+00:00"), > (25.00, "2018-03-15T11:27:18+00:00"), > (25.00, "2018-03-16T11:27:18+00:00") > ], >["id", "ts"]) > df = df.withColumn('ts', df.ts.cast('timestamp')) > schema = StructType([ > StructField("id", IntegerType()), > StructField("ts", TimestampType()) > ]) > @pandas_udf(schema, PandasUDFType.GROUPED_MAP) > def some_udf(df): > # some computation > return df > df.groupby('id', F.window("ts", "15 days")).apply(some_udf).show() > {code} > This throws following exception: > {code:java} > TypeError: Unsupported type in conversion from Arrow: struct timestamp[us, tz=America/Chicago], end: timestamp[us, tz=America/Chicago]> > {code} > > However, if I use builtin agg method then it works all fine. For example, > {code:java} > df.groupby('id', F.window("ts", "15 days")).mean().show(truncate=False) > {code} > Output > {code:java} > +-+--+---+ > |id |window|avg(id)| > +-+--+---+ > |13.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|13.0 | > |17.0 |[2018-03-20 00:00:00, 2018-04-04 00:00:00]|17.0 | > |156.0|[2018-03-20 00:00:00, 2018-04-04 00:00:00]|156.0 | > |99.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|99.0 | > |20.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|20.0 | > |17.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|17.0 | > |25.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|25.0 | > +-+--+---+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
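Since the answer to the question above is not captured in this archive, here is only a hedged sketch of the manual approach the commenter describes: flatten the window struct into plain timestamp columns before grouping, then have the GROUPED_MAP UDF echo those key columns back alongside its result. It assumes the `df` defined in the reproducer above, Spark 2.4.x with PyArrow installed, and made-up column names (`win_start`, `win_end`, `id_avg`); it sidesteps the struct-conversion error by never passing the window struct itself into the UDF, and it groups by the window alone to keep the sketch short.

{code:python}
# Hedged sketch, not an official workaround: carry the grouping window through
# a GROUPED_MAP pandas UDF by flattening it to ordinary timestamp columns.
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, DoubleType, TimestampType

out_schema = StructType([
    StructField("win_start", TimestampType()),
    StructField("win_end", TimestampType()),
    StructField("id_avg", DoubleType()),
])

@pandas_udf(out_schema, PandasUDFType.GROUPED_MAP)
def avg_per_window(pdf):
    # pdf holds one window's rows; the flattened key columns are echoed back
    # so the result keeps the window bounds next to the computed average.
    return pd.DataFrame({
        "win_start": [pdf["win_start"].iloc[0]],
        "win_end": [pdf["win_end"].iloc[0]],
        "id_avg": [pdf["id"].mean()],
    })

flat = (df
        .withColumn("win", F.window("ts", "15 days"))
        .withColumn("win_start", F.col("win.start"))
        .withColumn("win_end", F.col("win.end"))
        .drop("win"))

result = flat.groupby("win_start", "win_end").apply(avg_per_window)
result.show(truncate=False)
{code}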
[jira] [Commented] (SPARK-29765) Monitoring UI throws IndexOutOfBoundsException when accessing metrics of attempt in stage
[ https://issues.apache.org/jira/browse/SPARK-29765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968707#comment-16968707 ] shahid commented on SPARK-29765: Yes, then it seems some event drop has happened. Could you check the logs related to the event drop? To prevent this, I think you can increase the Event queue capacity which is by default 1000 I think > Monitoring UI throws IndexOutOfBoundsException when accessing metrics of > attempt in stage > - > > Key: SPARK-29765 > URL: https://issues.apache.org/jira/browse/SPARK-29765 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 > Environment: Amazon EMR 5.27 >Reporter: Viacheslav Tradunsky >Priority: Major > > When clicking on one of the largest tasks by input, I get to > [http://:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0|http://10.207.110.207:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0] > with 500 error > {code:java} > java.lang.IndexOutOfBoundsException: 95745 at > scala.collection.immutable.Vector.checkRangeConvert(Vector.scala:132) at > scala.collection.immutable.Vector.apply(Vector.scala:122) at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply$mcDJ$sp(AppStatusStore.scala:255) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofLong.foreach(ArrayOps.scala:246) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at > scala.collection.mutable.ArrayOps$ofLong.map(ArrayOps.scala:246) at > org.apache.spark.status.AppStatusStore.scanTasks$1(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore.taskSummary(AppStatusStore.scala:287) > at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:321) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:166) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) > at > 
org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:539) at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333) at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108) > at > org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at >
[jira] [Commented] (SPARK-29765) Monitoring UI throws IndexOutOfBoundsException when accessing metrics of attempt in stage
[ https://issues.apache.org/jira/browse/SPARK-29765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968704#comment-16968704 ] Viacheslav Tradunsky commented on SPARK-29765: -- [~shahid] Reproduced. An interesting thing is that despite the stage was marked as completed, on UI it showed tasks 106632/109889 (3 running). > Monitoring UI throws IndexOutOfBoundsException when accessing metrics of > attempt in stage > - > > Key: SPARK-29765 > URL: https://issues.apache.org/jira/browse/SPARK-29765 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 > Environment: Amazon EMR 5.27 >Reporter: Viacheslav Tradunsky >Priority: Major > > When clicking on one of the largest tasks by input, I get to > [http://:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0|http://10.207.110.207:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0] > with 500 error > {code:java} > java.lang.IndexOutOfBoundsException: 95745 at > scala.collection.immutable.Vector.checkRangeConvert(Vector.scala:132) at > scala.collection.immutable.Vector.apply(Vector.scala:122) at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply$mcDJ$sp(AppStatusStore.scala:255) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofLong.foreach(ArrayOps.scala:246) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at > scala.collection.mutable.ArrayOps$ofLong.map(ArrayOps.scala:246) at > org.apache.spark.status.AppStatusStore.scanTasks$1(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore.taskSummary(AppStatusStore.scala:287) > at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:321) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:166) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > 
org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:539) at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333) at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108) > at > org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) > at
[jira] [Commented] (SPARK-28502) Error with struct conversion while using pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-28502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968703#comment-16968703 ] Nasir Ali commented on SPARK-28502: --- [~bryanc] If I perform any agg (e.g., avg) on grouped data, pyspark returns me the key (e.g., window etc.) with the avg for each row. However, in the above example, when udf returns the struct, it does not automatically return the key. I have to manually add window to returning dataframe. Is there a way to automatically concatenate results of udf with key? > Error with struct conversion while using pandas_udf > --- > > Key: SPARK-28502 > URL: https://issues.apache.org/jira/browse/SPARK-28502 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.3 > Environment: OS: Ubuntu > Python: 3.6 >Reporter: Nasir Ali >Priority: Minor > Fix For: 3.0.0 > > > What I am trying to do: Group data based on time intervals (e.g., 15 days > window) and perform some operations on dataframe using (pandas) UDFs. I don't > know if there is a better/cleaner way to do it. > Below is the sample code that I tried and error message I am getting. > > {code:java} > df = sparkSession.createDataFrame([(17.00, "2018-03-10T15:27:18+00:00"), > (13.00, "2018-03-11T12:27:18+00:00"), > (25.00, "2018-03-12T11:27:18+00:00"), > (20.00, "2018-03-13T15:27:18+00:00"), > (17.00, "2018-03-14T12:27:18+00:00"), > (99.00, "2018-03-15T11:27:18+00:00"), > (156.00, "2018-03-22T11:27:18+00:00"), > (17.00, "2018-03-31T11:27:18+00:00"), > (25.00, "2018-03-15T11:27:18+00:00"), > (25.00, "2018-03-16T11:27:18+00:00") > ], >["id", "ts"]) > df = df.withColumn('ts', df.ts.cast('timestamp')) > schema = StructType([ > StructField("id", IntegerType()), > StructField("ts", TimestampType()) > ]) > @pandas_udf(schema, PandasUDFType.GROUPED_MAP) > def some_udf(df): > # some computation > return df > df.groupby('id', F.window("ts", "15 days")).apply(some_udf).show() > {code} > This throws following exception: > {code:java} > TypeError: Unsupported type in conversion from Arrow: struct timestamp[us, tz=America/Chicago], end: timestamp[us, tz=America/Chicago]> > {code} > > However, if I use builtin agg method then it works all fine. For example, > {code:java} > df.groupby('id', F.window("ts", "15 days")).mean().show(truncate=False) > {code} > Output > {code:java} > +-+--+---+ > |id |window|avg(id)| > +-+--+---+ > |13.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|13.0 | > |17.0 |[2018-03-20 00:00:00, 2018-04-04 00:00:00]|17.0 | > |156.0|[2018-03-20 00:00:00, 2018-04-04 00:00:00]|156.0 | > |99.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|99.0 | > |20.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|20.0 | > |17.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|17.0 | > |25.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|25.0 | > +-+--+---+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17495) Hive hash implementation
[ https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968657#comment-16968657 ] L. C. Hsieh commented on SPARK-17495: - This looks like it should be resolved now? The last comments from [~tejasp] are about time-related datatypes. It seems they were added a long time ago, and there are no TODOs in the code. > Hive hash implementation > > > Key: SPARK-17495 > URL: https://issues.apache.org/jira/browse/SPARK-17495 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Tejas Patil >Assignee: Tejas Patil >Priority: Minor > Labels: bulk-closed > > Spark internally uses Murmur3Hash for partitioning. This is different from > the one used by Hive. For queries which use bucketing, this leads to different > results if one tries the same query on both engines. For us, we want users to > have backward compatibility so that one can switch parts of applications > across the engines without observing regressions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29778) saveAsTable append mode is not passing writer options
Burak Yavuz created SPARK-29778: --- Summary: saveAsTable append mode is not passing writer options Key: SPARK-29778 URL: https://issues.apache.org/jira/browse/SPARK-29778 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Burak Yavuz There was an oversight where AppendData is not getting the WriterOptions in saveAsTable. [https://github.com/apache/spark/blob/782992c7ed652400e33bc4b1da04c8155b7b3866/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L530] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
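For context, a hedged sketch of the call pattern the ticket describes; the option name and table identifier below are hypothetical, not taken from the report. The point is only that, per the description, options attached to the writer fail to reach the AppendData plan on the append path.

{code:python}
# Hypothetical illustration of the affected pattern (names are made up):
# per the ticket, options set here are not forwarded to AppendData when
# saveAsTable takes the append path for a v2 table.
(df.write
   .option("someWriterOption", "some-value")            # hypothetical option
   .mode("append")
   .saveAsTable("some_catalog.some_db.target_table"))   # hypothetical table
{code}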
[jira] [Commented] (SPARK-29765) Monitoring UI throws IndexOutOfBoundsException when accessing metrics of attempt in stage
[ https://issues.apache.org/jira/browse/SPARK-29765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968652#comment-16968652 ] shahid commented on SPARK-29765: I still not sure about the root cause, as I am not able to reproduce with small data. From the number I can see that it is something related to cleaning up the store, when the number of tasks exceeds the threshold. If you can still reproduce with the same data even after increasing the threshold, then it might be due to some other issue. > Monitoring UI throws IndexOutOfBoundsException when accessing metrics of > attempt in stage > - > > Key: SPARK-29765 > URL: https://issues.apache.org/jira/browse/SPARK-29765 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 > Environment: Amazon EMR 5.27 >Reporter: Viacheslav Tradunsky >Priority: Major > > When clicking on one of the largest tasks by input, I get to > [http://:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0|http://10.207.110.207:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0] > with 500 error > {code:java} > java.lang.IndexOutOfBoundsException: 95745 at > scala.collection.immutable.Vector.checkRangeConvert(Vector.scala:132) at > scala.collection.immutable.Vector.apply(Vector.scala:122) at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply$mcDJ$sp(AppStatusStore.scala:255) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofLong.foreach(ArrayOps.scala:246) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at > scala.collection.mutable.ArrayOps$ofLong.map(ArrayOps.scala:246) at > org.apache.spark.status.AppStatusStore.scanTasks$1(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore.taskSummary(AppStatusStore.scala:287) > at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:321) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:166) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > 
org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:539) at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333) at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108) > at > org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at >
[jira] [Commented] (SPARK-29765) Monitoring UI throws IndexOutOfBoundsException when accessing metrics of attempt in stage
[ https://issues.apache.org/jira/browse/SPARK-29765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968649#comment-16968649 ] Viacheslav Tradunsky commented on SPARK-29765: -- [~shahid] sure, in case we have more than 200 000 tasks shall we set this to our maximum? Just trying to understand how GC of objects could influence access in immutable collection. Do you allow somehow the elements to be collected when they are referenced by that vector? > Monitoring UI throws IndexOutOfBoundsException when accessing metrics of > attempt in stage > - > > Key: SPARK-29765 > URL: https://issues.apache.org/jira/browse/SPARK-29765 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 > Environment: Amazon EMR 5.27 >Reporter: Viacheslav Tradunsky >Priority: Major > > When clicking on one of the largest tasks by input, I get to > [http://:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0|http://10.207.110.207:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0] > with 500 error > {code:java} > java.lang.IndexOutOfBoundsException: 95745 at > scala.collection.immutable.Vector.checkRangeConvert(Vector.scala:132) at > scala.collection.immutable.Vector.apply(Vector.scala:122) at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply$mcDJ$sp(AppStatusStore.scala:255) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofLong.foreach(ArrayOps.scala:246) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at > scala.collection.mutable.ArrayOps$ofLong.map(ArrayOps.scala:246) at > org.apache.spark.status.AppStatusStore.scanTasks$1(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore.taskSummary(AppStatusStore.scala:287) > at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:321) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:166) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) > at > 
org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:539) at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333) at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108) > at > org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at >
[jira] [Created] (SPARK-29777) SparkR::cleanClosure aggressively removes a function required by user function
Hossein Falaki created SPARK-29777: -- Summary: SparkR::cleanClosure aggressively removes a function required by user function Key: SPARK-29777 URL: https://issues.apache.org/jira/browse/SPARK-29777 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.4.4 Reporter: Hossein Falaki Following code block reproduces the issue: {code} library(SparkR) spark_df <- createDataFrame(na.omit(airquality)) cody_local2 <- function(param2) { 10 + param2 } cody_local1 <- function(param1) { cody_local2(param1) } result <- cody_local2(5) calc_df <- dapplyCollect(spark_df, function(x) { cody_local2(20) cody_local1(5) }) print(result) {code} We get following error message: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 (TID 27, 10.0.174.239, executor 0): org.apache.spark.SparkException: R computation failed with Error in cody_local2(param1) : could not find function "cody_local2" Calls: compute -> computeFunc -> cody_local1 {code} Compare that to this code block with succeeds: {code} calc_df <- dapplyCollect(spark_df, function(x) { cody_local2(20) #cody_local1(5) }) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29765) Monitoring UI throws IndexOutOfBoundsException when accessing metrics of attempt in stage
[ https://issues.apache.org/jira/browse/SPARK-29765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968642#comment-16968642 ] shahid edited comment on SPARK-29765 at 11/6/19 7:39 PM: - Just to narrow down the problem, can you please increase `spark.ui.retainedTasks` from 10 to 20 and check if the issue still exist ? Thanks was (Author: shahid): Just to narrow down the problem, can you please increase `spark.ui.retainedTasks` from 10 to 20 ? Thanks > Monitoring UI throws IndexOutOfBoundsException when accessing metrics of > attempt in stage > - > > Key: SPARK-29765 > URL: https://issues.apache.org/jira/browse/SPARK-29765 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 > Environment: Amazon EMR 5.27 >Reporter: Viacheslav Tradunsky >Priority: Major > > When clicking on one of the largest tasks by input, I get to > [http://:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0|http://10.207.110.207:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0] > with 500 error > {code:java} > java.lang.IndexOutOfBoundsException: 95745 at > scala.collection.immutable.Vector.checkRangeConvert(Vector.scala:132) at > scala.collection.immutable.Vector.apply(Vector.scala:122) at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply$mcDJ$sp(AppStatusStore.scala:255) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofLong.foreach(ArrayOps.scala:246) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at > scala.collection.mutable.ArrayOps$ofLong.map(ArrayOps.scala:246) at > org.apache.spark.status.AppStatusStore.scanTasks$1(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore.taskSummary(AppStatusStore.scala:287) > at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:321) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:166) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) > at > 
org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:539) at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333) at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108) > at > org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at >
[jira] [Commented] (SPARK-29765) Monitoring UI throws IndexOutOfBoundsException when accessing metrics of attempt in stage
[ https://issues.apache.org/jira/browse/SPARK-29765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968642#comment-16968642 ] shahid commented on SPARK-29765: Just to narrow down the problem, can you please increase `spark.ui.retainedTasks` from 10 to 20 ? Thanks > Monitoring UI throws IndexOutOfBoundsException when accessing metrics of > attempt in stage > - > > Key: SPARK-29765 > URL: https://issues.apache.org/jira/browse/SPARK-29765 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 > Environment: Amazon EMR 5.27 >Reporter: Viacheslav Tradunsky >Priority: Major > > When clicking on one of the largest tasks by input, I get to > [http://:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0|http://10.207.110.207:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0] > with 500 error > {code:java} > java.lang.IndexOutOfBoundsException: 95745 at > scala.collection.immutable.Vector.checkRangeConvert(Vector.scala:132) at > scala.collection.immutable.Vector.apply(Vector.scala:122) at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply$mcDJ$sp(AppStatusStore.scala:255) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofLong.foreach(ArrayOps.scala:246) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at > scala.collection.mutable.ArrayOps$ofLong.map(ArrayOps.scala:246) at > org.apache.spark.status.AppStatusStore.scanTasks$1(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore.taskSummary(AppStatusStore.scala:287) > at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:321) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:166) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > 
org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:539) at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333) at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108) > at > org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) > at java.lang.Thread.run(Thread.java:748){code} -- This message was
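The retained-tasks threshold mentioned in this exchange is likewise an ordinary Spark configuration. The sketch below is only an illustration; the value shown is an arbitrary example rather than a figure taken from the thread.

{code:python}
# Hedged sketch: raise the number of task entries the UI/status store retains
# per stage. 200000 is an arbitrary example value (the default is 100000).
from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.ui.retainedTasks", "200000")
sc = SparkContext(conf=conf)
{code}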
[jira] [Assigned] (SPARK-29642) ContinuousMemoryStream throws error on String type
[ https://issues.apache.org/jira/browse/SPARK-29642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin reassigned SPARK-29642: -- Assignee: Jungtaek Lim > ContinuousMemoryStream throws error on String type > -- > > Key: SPARK-29642 > URL: https://issues.apache.org/jira/browse/SPARK-29642 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > > While we can set String as a generic type of ContinuousMemoryStream, it > doesn't really work, because it doesn't convert String to UTF8String, and > accessing it from the Row interface throws an error. > We should encode the input and convert it to Row properly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29642) ContinuousMemoryStream throws error on String type
[ https://issues.apache.org/jira/browse/SPARK-29642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-29642. Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26300 [https://github.com/apache/spark/pull/26300] > ContinuousMemoryStream throws error on String type > -- > > Key: SPARK-29642 > URL: https://issues.apache.org/jira/browse/SPARK-29642 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.0.0 > > > While we can set String as a generic type of ContinuousMemoryStream, it > doesn't work really because it doesn't convert String to UTFString and > accessing it from Row interface would throw error. > We should encode the input and convert the input to Row properly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
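For context on the fix described above: Spark's internal rows store string data as UTF8String (the "UTFString" mentioned in the description) rather than java.lang.String. A minimal sketch of the conversion involved, illustrative only and not the code of the linked pull request:
{code:scala}
import org.apache.spark.unsafe.types.UTF8String

// Internal rows (InternalRow / UnsafeRow) expect UTF8String, so a plain
// java.lang.String supplied by the caller has to be encoded first.
val external: String = "hello"
val internal: UTF8String = UTF8String.fromString(external)

// Going back to an external value is the reverse conversion.
val roundTripped: String = internal.toString
{code}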
[jira] [Closed] (SPARK-29571) Fix UT in AllExecutionsPageSuite class
[ https://issues.apache.org/jira/browse/SPARK-29571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankit Raj Boudh closed SPARK-29571. --- The issue has been fixed. > Fix UT in AllExecutionsPageSuite class > --- > > Key: SPARK-29571 > URL: https://issues.apache.org/jira/browse/SPARK-29571 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: Ankit Raj Boudh >Assignee: Ankit Raj Boudh >Priority: Minor > Fix For: 3.0.0 > > > The "sorting should be successful" UT in the AllExecutionsPageSuite class is failing due > to an invalid assert condition. This needs to be handled. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29752) make AdaptiveQueryExecSuite more robust
[ https://issues.apache.org/jira/browse/SPARK-29752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-29752. - Fix Version/s: 3.0.0 Resolution: Fixed > make AdaptiveQueryExecSuite more robust > --- > > Key: SPARK-29752 > URL: https://issues.apache.org/jira/browse/SPARK-29752 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29603) Support application priority for spark on yarn
[ https://issues.apache.org/jira/browse/SPARK-29603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin reassigned SPARK-29603: -- Assignee: Kent Yao > Support application priority for spark on yarn > -- > > Key: SPARK-29603 > URL: https://issues.apache.org/jira/browse/SPARK-29603 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > > We can set priority to an application for YARN to define pending applications > ordering policy, those with higher priority have a better opportunity to be > activated. YARN CapacityScheduler only. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29603) Support application priority for spark on yarn
[ https://issues.apache.org/jira/browse/SPARK-29603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-29603. Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26255 [https://github.com/apache/spark/pull/26255] > Support application priority for spark on yarn > -- > > Key: SPARK-29603 > URL: https://issues.apache.org/jira/browse/SPARK-29603 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > We can set priority to an application for YARN to define pending applications > ordering policy, those with higher priority have a better opportunity to be > activated. YARN CapacityScheduler only. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
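As a hedged sketch of how the feature is typically used (assuming the configuration key introduced by the pull request above is spark.yarn.priority; check the merged documentation for the authoritative name):
{code:scala}
import org.apache.spark.SparkConf

// Assumption: the YARN application priority is read from "spark.yarn.priority".
// It only takes effect with the YARN CapacityScheduler and must be set before
// the application is submitted.
val conf = new SparkConf()
  .setAppName("high-priority-job")
  .set("spark.yarn.priority", "10")
{code}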
[jira] [Updated] (SPARK-29588) Improvements in WebUI JDBC/ODBC server page
[ https://issues.apache.org/jira/browse/SPARK-29588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shahid updated SPARK-29588: --- Priority: Major (was: Minor) > Improvements in WebUI JDBC/ODBC server page > --- > > Key: SPARK-29588 > URL: https://issues.apache.org/jira/browse/SPARK-29588 > Project: Spark > Issue Type: Umbrella > Components: Web UI >Affects Versions: 2.4.4, 3.0.0 >Reporter: shahid >Priority: Major > > Improvements in JDBC/ODBC server page -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25466) Documentation does not specify how to set Kafka consumer cache capacity for SS
[ https://issues.apache.org/jira/browse/SPARK-25466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968435#comment-16968435 ] Shyam commented on SPARK-25466: --- [~gsomogyi] I am still facing the same issue how to fix it ? I tried these things mentioned here in SOF [https://stackoverflow.com/questions/58456939/how-to-set-spark-consumer-cache-to-fix-kafkaconsumer-cache-hitting-max-capaci] > Documentation does not specify how to set Kafka consumer cache capacity for SS > -- > > Key: SPARK-25466 > URL: https://issues.apache.org/jira/browse/SPARK-25466 > Project: Spark > Issue Type: Improvement > Components: Documentation, Structured Streaming >Affects Versions: 2.3.0 >Reporter: Patrick McGloin >Priority: Minor > > When hitting this warning with SS: > 19-09-2018 12:05:27 WARN CachedKafkaConsumer:66 - KafkaConsumer cache > hitting max capacity of 64, removing consumer for > CacheKey(spark-kafka-source-e06c9676-32c6-49c4-80a9-2d0ac4590609--694285871-executor,MyKafkaTopic-30) > If you Google you get to this page: > https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html > Which is for Spark Streaming and says to use this config item to adjust the > capacity: "spark.streaming.kafka.consumer.cache.maxCapacity". > This is a bit confusing as SS uses a different config item: > "spark.sql.kafkaConsumerCache.capacity" > Perhaps the SS Kafka documentation should talk about the consumer cache > capacity? Perhaps here? > https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html > Or perhaps the warning message should reference the config item. E.g > 19-09-2018 12:05:27 WARN CachedKafkaConsumer:66 - KafkaConsumer cache > hitting max capacity of 64, removing consumer for > CacheKey(spark-kafka-source-e06c9676-32c6-49c4-80a9-2d0ac4590609--694285871-executor,MyKafkaTopic-30). > *The cache size can be adjusted with the setting > "spark.sql.kafkaConsumerCache.capacity".* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
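A minimal sketch of the Structured Streaming setting discussed above (the key is the one named in the description; the broker, topic, and capacity values are placeholders):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-consumer-cache-example")
  // Structured Streaming reads this key, not the DStream-era
  // spark.streaming.kafka.consumer.cache.maxCapacity.
  .config("spark.sql.kafkaConsumerCache.capacity", "128")
  .getOrCreate()

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092") // hypothetical broker
  .option("subscribe", "MyKafkaTopic")
  .load()
{code}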
[jira] [Resolved] (SPARK-28725) Spark ML not able to de-serialize Logistic Regression model saved in previous version of Spark
[ https://issues.apache.org/jira/browse/SPARK-28725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-28725. -- Resolution: Invalid > Spark ML not able to de-serialize Logistic Regression model saved in previous > version of Spark > -- > > Key: SPARK-28725 > URL: https://issues.apache.org/jira/browse/SPARK-28725 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.4.3 > Environment: PROD >Reporter: Sharad Varshney >Priority: Major > > Logistic Regression model saved using Spark version 2.3.0 in HDI is not > correctly de-serialized with Spark 2.4.3 version. It loads into the memory > but probabilities it emits on inference is like 1.45 e-44(to 44th decimal > place approx equal to 0) > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29325) approxQuantile() results are incorrect and vary significantly for small changes in relativeError
[ https://issues.apache.org/jira/browse/SPARK-29325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29325. -- Resolution: Duplicate > approxQuantile() results are incorrect and vary significantly for small > changes in relativeError > > > Key: SPARK-29325 > URL: https://issues.apache.org/jira/browse/SPARK-29325 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2, 2.4.4 > Environment: I was using OSX 10.14.6. > I was using Scala 2.11.12 and Spark 2.4.4. > I also verified the bug exists for Scala 2.11.8 and Spark 2.3.2. >Reporter: James Verbus >Priority: Major > Labels: correctness > Attachments: 20191001_example_data_approx_quantile_bug.zip > > > The [approxQuantile() > method|https://github.com/apache/spark/blob/3b1674cb1f244598463e879477d89632b0817578/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L40] > returns sometimes incorrect results that are sensitively dependent upon the > choice of the relativeError. > Below is an example in the latest Spark version (2.4.4). You can see the > result varies significantly for modest changes in the specified relativeError > parameter. The result varies much more than would be expected based upon the > relativeError parameter. > > {code:java} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 2.4.4 > /_/ > > Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java > 1.8.0_212) > Type in expressions to have them evaluated. > Type :help for more information. > scala> val df = spark.read.format("csv").option("header", > "true").option("inferSchema", > "true").load("./20191001_example_data_approx_quantile_bug") > df: org.apache.spark.sql.DataFrame = [value: double] > scala> df.stat.approxQuantile("value", Array(0.9), 0) > res0: Array[Double] = Array(0.5929591082174609) > scala> df.stat.approxQuantile("value", Array(0.9), 0.001) > res1: Array[Double] = Array(0.67621027121925) > scala> df.stat.approxQuantile("value", Array(0.9), 0.002) > res2: Array[Double] = Array(0.5926195654486178) > scala> df.stat.approxQuantile("value", Array(0.9), 0.003) > res3: Array[Double] = Array(0.5924693999048418) > scala> df.stat.approxQuantile("value", Array(0.9), 0.004) > res4: Array[Double] = Array(0.67621027121925) > scala> df.stat.approxQuantile("value", Array(0.9), 0.005) > res5: Array[Double] = Array(0.5923925937051544) > {code} > I attached a zip file containing the data used for the above example > demonstrating the bug. > Also, the following demonstrates that there is data for intermediate quantile > values between the 0.5926195654486178 and 0.67621027121925 values observed > above. 
> {code:java} > scala> df.stat.approxQuantile("value", Array(0.9), 0.0) > res10: Array[Double] = Array(0.5929591082174609) > scala> df.stat.approxQuantile("value", Array(0.91), 0.0) > res11: Array[Double] = Array(0.5966354540849995) > scala> df.stat.approxQuantile("value", Array(0.92), 0.0) > res12: Array[Double] = Array(0.6015676591185595) > scala> df.stat.approxQuantile("value", Array(0.93), 0.0) > res13: Array[Double] = Array(0.6029240823799614) > scala> df.stat.approxQuantile("value", Array(0.94), 0.0) > res14: Array[Double] = Array(0.611764547134) > scala> df.stat.approxQuantile("value", Array(0.95), 0.0) > res15: Array[Double] = Array(0.6185162204274052) > scala> df.stat.approxQuantile("value", Array(0.96), 0.0) > res16: Array[Double] = Array(0.625983000807062) > scala> df.stat.approxQuantile("value", Array(0.97), 0.0) > res17: Array[Double] = Array(0.6306892943412258) > scala> df.stat.approxQuantile("value", Array(0.98), 0.0) > res18: Array[Double] = Array(0.6365567375994333) > scala> df.stat.approxQuantile("value", Array(0.99), 0.0) > res19: Array[Double] = Array(0.6554479197566019) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
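For readers reproducing the report above, a small sketch comparing approxQuantile against an exact quantile computed with the SQL percentile aggregate can help confirm that the approximation, not the data, is at fault. This is illustrative only and assumes the same single-column DataFrame df loaded from the attached CSV:
{code:scala}
import org.apache.spark.sql.functions.expr

// Approximate 90th percentile with a non-zero relative error.
val approx = df.stat.approxQuantile("value", Array(0.9), 0.001).head

// Exact 90th percentile via the SQL percentile aggregate (slower, since it
// buffers all values, but it gives a reference point for the comparison).
val exact = df.agg(expr("percentile(value, 0.9)")).head.getDouble(0)

println(s"approx = $approx, exact = $exact")
{code}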
[jira] [Resolved] (SPARK-25880) user set some hadoop configurations can not work
[ https://issues.apache.org/jira/browse/SPARK-25880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-25880. -- Resolution: Not A Problem > user set some hadoop configurations can not work > > > Key: SPARK-25880 > URL: https://issues.apache.org/jira/browse/SPARK-25880 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.0 >Reporter: guojh >Priority: Major > > When user set some hadoop configuration in spark-defaults.conf, for instance: > spark.hadoop.mapreduce.input.fileinputformat.split.maxsize 10 > and then user use the spark-sql and use set command to overwrite this > configuration, but it can not cover the value which set in the file of > spark-defaults.conf. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
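As a hedged illustration of the behaviour discussed above: a spark.hadoop.* key from spark-defaults.conf can still be changed at runtime by writing to the Hadoop configuration directly. This is a sketch of a workaround, not a statement about how the SET command should behave; the split size below is an example value.
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Keys prefixed with spark.hadoop. in spark-defaults.conf are copied into this
// Hadoop Configuration when the context starts; mutating it afterwards affects
// inputs read from that point on.
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.input.fileinputformat.split.maxsize", "134217728")
{code}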
[jira] [Resolved] (SPARK-29453) Improve tooltip information for SQL tab
[ https://issues.apache.org/jira/browse/SPARK-29453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29453. -- Resolution: Not A Problem > Improve tooltip information for SQL tab > --- > > Key: SPARK-29453 > URL: https://issues.apache.org/jira/browse/SPARK-29453 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Sandeep Katta >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29585) Duration in stagePage does not match Duration in Summary Metrics for Completed Tasks
[ https://issues.apache.org/jira/browse/SPARK-29585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29585. -- Resolution: Not A Problem > Duration in stagePage does not match Duration in Summary Metrics for > Completed Tasks > > > Key: SPARK-29585 > URL: https://issues.apache.org/jira/browse/SPARK-29585 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: teeyog >Priority: Major > > Summary Metrics for Completed Tasks and Duration in Task details both use executorRunTime, > but Duration in Completed Stages uses > {code:java} > stageData.completionTime - stageData.firstTaskLaunchedTime{code} > , which produces different values. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29548) Redirect system print stream to log4j and improve robustness
[ https://issues.apache.org/jira/browse/SPARK-29548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29548. -- Resolution: Won't Fix > Redirect system print stream to log4j and improve robustness > > > Key: SPARK-29548 > URL: https://issues.apache.org/jira/browse/SPARK-29548 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Priority: Major > > In a production environment, user behavior is highly random and uncertain. > For example, users call `System.out` or `System.err` to print information. > These system print streams can cause trouble, such as the output file on disk > growing too large. > In my production environment this filled the disk and left the [NodeManager] unhealthy. > One option is to forbid the use of `System.out` or `System.err`, but that is > unfriendly to users. > A better option is to redirect the system print streams to `Log4j`, so that Spark > can take advantage of `Log4j`'s log splitting (rolling). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
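A minimal sketch of the redirection idea described in the ticket, showing one common way to do it rather than the implementation the reporter proposed: wrap a logger in an OutputStream and install it with System.setOut.
{code:scala}
import java.io.{OutputStream, PrintStream}
import org.apache.log4j.Logger

// Buffers bytes until a newline, then forwards the completed line to log4j.
// Simplified: treats each byte as a character, which is fine for ASCII output.
class LoggingOutputStream(logger: Logger) extends OutputStream {
  private val buffer = new StringBuilder
  override def write(b: Int): Unit = {
    if (b == '\n') { logger.info(buffer.toString); buffer.clear() }
    else buffer.append(b.toChar)
  }
}

// Redirect stdout so stray println calls end up in the (rolling) log files
// instead of an unbounded stdout file on the NodeManager.
System.setOut(new PrintStream(new LoggingOutputStream(Logger.getLogger("stdout")), true))
{code}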
[jira] [Assigned] (SPARK-29751) Scalers use Summarizer instead of MultivariateOnlineSummarizer
[ https://issues.apache.org/jira/browse/SPARK-29751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-29751: Assignee: zhengruifeng > Scalers use Summarizer instead of MultivariateOnlineSummarizer > -- > > Key: SPARK-29751 > URL: https://issues.apache.org/jira/browse/SPARK-29751 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > > I did performance tests and found that using Summarizer instead of > MultivariateOnlineSummarizer will speed up the fitting. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29751) Scalers use Summarizer instead of MultivariateOnlineSummarizer
[ https://issues.apache.org/jira/browse/SPARK-29751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29751. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26393 [https://github.com/apache/spark/pull/26393] > Scalers use Summarizer instead of MultivariateOnlineSummarizer > -- > > Key: SPARK-29751 > URL: https://issues.apache.org/jira/browse/SPARK-29751 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > Fix For: 3.0.0 > > > I did performance tests and found that using Summarizer instead of > MultivariateOnlineSummarizer will speed up the fitting. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
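For context, a small sketch of the ml.stat.Summarizer API that the change above moves the scalers onto; this is illustrative usage, not the patched scaler code:
{code:scala}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val df = Seq(
  (Vectors.dense(1.0, 2.0), 1.0),
  (Vectors.dense(3.0, 4.0), 1.0)
).toDF("features", "weight")

// Summarizer computes only the requested metrics in a single pass, which is
// what makes it cheaper than MultivariateOnlineSummarizer for the scalers.
val (mean, variance) = df
  .select(Summarizer.metrics("mean", "variance").summary($"features").as("summary"))
  .select("summary.mean", "summary.variance")
  .as[(Vector, Vector)]
  .first()
{code}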
[jira] [Comment Edited] (SPARK-29764) Error on Serializing POJO with java datetime property to a Parquet file
[ https://issues.apache.org/jira/browse/SPARK-29764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968411#comment-16968411 ] Felix Kizhakkel Jose edited comment on SPARK-29764 at 11/6/19 2:46 PM: --- [~hyukjin.kwon] Sorry, I didn't know critical is for committers. Please find the attached code sample for the issue I am talking [where I have dob and startDateTime field in Employee object - which is causing Spark fail to persist into parquet file. [^SparkParquetSampleCode.docx] . Please let me know if anything else is required. I am totally stuck with this issue. was (Author: felixkjose): [~hyukjin.kwon] Sorry, I didn't know critical is for committers. Please find the attached code sample for the issue I am talking [where I have dob and startDateTime field in Employee object - which is causing Spark fail to persist into parquet file. [^SparkParquetSampleCode.docx] > Error on Serializing POJO with java datetime property to a Parquet file > --- > > Key: SPARK-29764 > URL: https://issues.apache.org/jira/browse/SPARK-29764 > Project: Spark > Issue Type: Bug > Components: Java API, Spark Core, SQL >Affects Versions: 2.4.4 >Reporter: Felix Kizhakkel Jose >Priority: Major > Attachments: SparkParquetSampleCode.docx > > > Hello, > I have been doing a proof of concept for data lake structure and analytics > using Apache Spark. > When I add a java java.time.LocalDateTime/java.time.LocalDate properties in > my data model, the serialization to Parquet start failing. > *My Data Model:* > @Data > public class Employee > { private UUID id = UUID.randomUUID(); private String name; private int age; > private LocalDate dob; private LocalDateTime startDateTime; private String > phone; private Address address; } > > *Serialization Snippet* > {color:#0747a6}public void serialize(){color} > {color:#0747a6}{ List inputDataToSerialize = > getInputDataToSerialize(); // this creates 100,000 employee objects > Encoder employeeEncoder = Encoders.bean(Employee.class); > Dataset employeeDataset = sparkSession.createDataset( > inputDataToSerialize, employeeEncoder ); employeeDataset.write() > .mode(SaveMode.Append) .parquet("/Users/felix/Downloads/spark.parquet"); > }{color} > +*Exception Stack Trace:* > + > *java.lang.IllegalStateException: Failed to execute > CommandLineRunnerjava.lang.IllegalStateException: Failed to execute > CommandLineRunner at > org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:803) > at > org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:784) > at > org.springframework.boot.SpringApplication.afterRefresh(SpringApplication.java:771) > at > org.springframework.boot.SpringApplication.run(SpringApplication.java:316) at > org.springframework.boot.SpringApplication.run(SpringApplication.java:1186) > at > org.springframework.boot.SpringApplication.run(SpringApplication.java:1175) > at com.felix.Application.main(Application.java:45)Caused by: > org.apache.spark.SparkException: Job aborted. 
at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) at >
[jira] [Comment Edited] (SPARK-29764) Error on Serializing POJO with java datetime property to a Parquet file
[ https://issues.apache.org/jira/browse/SPARK-29764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968411#comment-16968411 ] Felix Kizhakkel Jose edited comment on SPARK-29764 at 11/6/19 2:45 PM: --- [~hyukjin.kwon] Sorry, I didn't know critical is for committers. Please find the attached code sample for the issue I am talking [where I have dob and startDateTime field in Employee object - which is causing Spark fail to persist into parquet file. [^SparkParquetSampleCode.docx] was (Author: felixkjose): [~hyukjin.kwon] Sorry, I didn't know critical is for committers. Please find the attached code sample for the issue I am talking [where I have dob and startDateTime field in Employee object - which is causing Spark fail to persist into parquet file. [^SparkParquetSampleCode.docx] > Error on Serializing POJO with java datetime property to a Parquet file > --- > > Key: SPARK-29764 > URL: https://issues.apache.org/jira/browse/SPARK-29764 > Project: Spark > Issue Type: Bug > Components: Java API, Spark Core, SQL >Affects Versions: 2.4.4 >Reporter: Felix Kizhakkel Jose >Priority: Major > Attachments: SparkParquetSampleCode.docx > > > Hello, > I have been doing a proof of concept for data lake structure and analytics > using Apache Spark. > When I add a java java.time.LocalDateTime/java.time.LocalDate properties in > my data model, the serialization to Parquet start failing. > *My Data Model:* > @Data > public class Employee > { private UUID id = UUID.randomUUID(); private String name; private int age; > private LocalDate dob; private LocalDateTime startDateTime; private String > phone; private Address address; } > > *Serialization Snippet* > {color:#0747a6}public void serialize(){color} > {color:#0747a6}{ List inputDataToSerialize = > getInputDataToSerialize(); // this creates 100,000 employee objects > Encoder employeeEncoder = Encoders.bean(Employee.class); > Dataset employeeDataset = sparkSession.createDataset( > inputDataToSerialize, employeeEncoder ); employeeDataset.write() > .mode(SaveMode.Append) .parquet("/Users/felix/Downloads/spark.parquet"); > }{color} > +*Exception Stack Trace:* > + > *java.lang.IllegalStateException: Failed to execute > CommandLineRunnerjava.lang.IllegalStateException: Failed to execute > CommandLineRunner at > org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:803) > at > org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:784) > at > org.springframework.boot.SpringApplication.afterRefresh(SpringApplication.java:771) > at > org.springframework.boot.SpringApplication.run(SpringApplication.java:316) at > org.springframework.boot.SpringApplication.run(SpringApplication.java:1186) > at > org.springframework.boot.SpringApplication.run(SpringApplication.java:1175) > at com.felix.Application.main(Application.java:45)Caused by: > org.apache.spark.SparkException: Job aborted. 
at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290) > at
[jira] [Updated] (SPARK-29764) Error on Serializing POJO with java datetime property to a Parquet file
[ https://issues.apache.org/jira/browse/SPARK-29764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Kizhakkel Jose updated SPARK-29764: - Attachment: SparkParquetSampleCode.docx > Error on Serializing POJO with java datetime property to a Parquet file > --- > > Key: SPARK-29764 > URL: https://issues.apache.org/jira/browse/SPARK-29764 > Project: Spark > Issue Type: Bug > Components: Java API, Spark Core, SQL >Affects Versions: 2.4.4 >Reporter: Felix Kizhakkel Jose >Priority: Major > Attachments: SparkParquetSampleCode.docx > > > Hello, > I have been doing a proof of concept for data lake structure and analytics > using Apache Spark. > When I add a java java.time.LocalDateTime/java.time.LocalDate properties in > my data model, the serialization to Parquet start failing. > *My Data Model:* > @Data > public class Employee > { private UUID id = UUID.randomUUID(); private String name; private int age; > private LocalDate dob; private LocalDateTime startDateTime; private String > phone; private Address address; } > > *Serialization Snippet* > {color:#0747a6}public void serialize(){color} > {color:#0747a6}{ List inputDataToSerialize = > getInputDataToSerialize(); // this creates 100,000 employee objects > Encoder employeeEncoder = Encoders.bean(Employee.class); > Dataset employeeDataset = sparkSession.createDataset( > inputDataToSerialize, employeeEncoder ); employeeDataset.write() > .mode(SaveMode.Append) .parquet("/Users/felix/Downloads/spark.parquet"); > }{color} > +*Exception Stack Trace:* > + > *java.lang.IllegalStateException: Failed to execute > CommandLineRunnerjava.lang.IllegalStateException: Failed to execute > CommandLineRunner at > org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:803) > at > org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:784) > at > org.springframework.boot.SpringApplication.afterRefresh(SpringApplication.java:771) > at > org.springframework.boot.SpringApplication.run(SpringApplication.java:316) at > org.springframework.boot.SpringApplication.run(SpringApplication.java:1186) > at > org.springframework.boot.SpringApplication.run(SpringApplication.java:1175) > at com.felix.Application.main(Application.java:45)Caused by: > org.apache.spark.SparkException: Job aborted. 
at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271) at > org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229) at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:566) at > com.felix.SparkParquetSerializer.serialize(SparkParquetSerializer.java:24) at > com.felix.Application.run(Application.java:63) at > org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:800) > ... 6 moreCaused by: org.apache.spark.SparkException: Job aborted due to > stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost > task 0.0 in stage 0.0 (TID 0, localhost, executor driver): >
[jira] [Commented] (SPARK-29764) Error on Serializing POJO with java datetime property to a Parquet file
[ https://issues.apache.org/jira/browse/SPARK-29764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968411#comment-16968411 ] Felix Kizhakkel Jose commented on SPARK-29764: -- [~hyukjin.kwon] Sorry, I didn't know critical is for committers. Please find the attached code sample for the issue I am talking [where I have dob and startDateTime field in Employee object - which is causing Spark fail to persist into parquet file. [^SparkParquetSampleCode.docx] > Error on Serializing POJO with java datetime property to a Parquet file > --- > > Key: SPARK-29764 > URL: https://issues.apache.org/jira/browse/SPARK-29764 > Project: Spark > Issue Type: Bug > Components: Java API, Spark Core, SQL >Affects Versions: 2.4.4 >Reporter: Felix Kizhakkel Jose >Priority: Major > Attachments: SparkParquetSampleCode.docx > > > Hello, > I have been doing a proof of concept for data lake structure and analytics > using Apache Spark. > When I add a java java.time.LocalDateTime/java.time.LocalDate properties in > my data model, the serialization to Parquet start failing. > *My Data Model:* > @Data > public class Employee > { private UUID id = UUID.randomUUID(); private String name; private int age; > private LocalDate dob; private LocalDateTime startDateTime; private String > phone; private Address address; } > > *Serialization Snippet* > {color:#0747a6}public void serialize(){color} > {color:#0747a6}{ List inputDataToSerialize = > getInputDataToSerialize(); // this creates 100,000 employee objects > Encoder employeeEncoder = Encoders.bean(Employee.class); > Dataset employeeDataset = sparkSession.createDataset( > inputDataToSerialize, employeeEncoder ); employeeDataset.write() > .mode(SaveMode.Append) .parquet("/Users/felix/Downloads/spark.parquet"); > }{color} > +*Exception Stack Trace:* > + > *java.lang.IllegalStateException: Failed to execute > CommandLineRunnerjava.lang.IllegalStateException: Failed to execute > CommandLineRunner at > org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:803) > at > org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:784) > at > org.springframework.boot.SpringApplication.afterRefresh(SpringApplication.java:771) > at > org.springframework.boot.SpringApplication.run(SpringApplication.java:316) at > org.springframework.boot.SpringApplication.run(SpringApplication.java:1186) > at > org.springframework.boot.SpringApplication.run(SpringApplication.java:1175) > at com.felix.Application.main(Application.java:45)Caused by: > org.apache.spark.SparkException: Job aborted. 
at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271) at > org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229) at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:566) at > com.felix.SparkParquetSerializer.serialize(SparkParquetSerializer.java:24) at > com.felix.Application.run(Application.java:63) at >
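Not part of the original report: a commonly used workaround on Spark 2.4 is to model temporal fields with java.sql.Date / java.sql.Timestamp, which the built-in encoders understand. A hedged Scala sketch with a hypothetical EmployeeRow case class and example path; the Java bean in the report would need the equivalent change:
{code:scala}
import java.sql.{Date, Timestamp}
import java.time.{LocalDate, LocalDateTime}
import org.apache.spark.sql.{SaveMode, SparkSession}

// Hypothetical case class mirroring the Employee bean, with the java.time
// fields replaced by the SQL types that Spark 2.4 encoders support.
case class EmployeeRow(name: String, age: Int, dob: Date, startDateTime: Timestamp)

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val rows = Seq(
  EmployeeRow("a", 30,
    Date.valueOf(LocalDate.of(1989, 1, 1)),
    Timestamp.valueOf(LocalDateTime.of(2019, 11, 6, 9, 0)))
)

spark.createDataset(rows)
  .write
  .mode(SaveMode.Append)
  .parquet("/tmp/employee.parquet") // example path
{code}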
[jira] [Commented] (SPARK-29776) rpad returning invalid value when parameter is empty
[ https://issues.apache.org/jira/browse/SPARK-29776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968406#comment-16968406 ] Ankit raj boudh commented on SPARK-29776: - i will start checking this issue. > rpad returning invalid value when parameter is empty > > > Key: SPARK-29776 > URL: https://issues.apache.org/jira/browse/SPARK-29776 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > As per rpad definition > rpad > rpad(str, len, pad) - Returns str, right-padded with pad to a length of len > If str is longer than len, the return value is shortened to len characters. > *In case of empty pad string, the return value is null.* > Below is Example > In Spark: > 0: jdbc:hive2://10.18.19.208:23040/default> SELECT rpad('hi', 5, ''); > ++ > | rpad(hi, 5, ) | > ++ > | hi | > ++ > It should return NULL as per definition. > > Hive behavior is correct as per definition it returns NULL when pad is empty > String > INFO : Concurrency mode is disabled, not creating a lock manager > +---+ > | _c0 | > +---+ > | NULL | > +---+ > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29776) rpad returning invalid value when parameter is empty
ABHISHEK KUMAR GUPTA created SPARK-29776: Summary: rpad returning invalid value when parameter is empty Key: SPARK-29776 URL: https://issues.apache.org/jira/browse/SPARK-29776 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: ABHISHEK KUMAR GUPTA As per the rpad definition: rpad(str, len, pad) - Returns str, right-padded with pad to a length of len. If str is longer than len, the return value is shortened to len characters. *In case of empty pad string, the return value is null.* Below is an example. In Spark:
0: jdbc:hive2://10.18.19.208:23040/default> SELECT rpad('hi', 5, '');
+----------------+
| rpad(hi, 5, )  |
+----------------+
| hi             |
+----------------+
It should return NULL as per the definition. Hive behavior is correct: as per the definition it returns NULL when pad is an empty string.
INFO : Concurrency mode is disabled, not creating a lock manager
+-------+
|  _c0  |
+-------+
| NULL  |
+-------+
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
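To make the reported behaviour easy to reproduce from code, a small sketch (illustrative only; per the definition quoted above, the expected result would be NULL):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Spark currently returns "hi" here; per the rpad definition quoted in the
// ticket, an empty pad string should yield NULL (which is what Hive returns).
spark.sql("SELECT rpad('hi', 5, '')").show()
{code}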
[jira] [Commented] (SPARK-29760) Document VALUES statement in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-29760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968403#comment-16968403 ] Sean R. Owen commented on SPARK-29760: -- That's standard SQL, no? and it works. I don't know if it needs to be documented specifically. Where would you mention it? > Document VALUES statement in SQL Reference. > --- > > Key: SPARK-29760 > URL: https://issues.apache.org/jira/browse/SPARK-29760 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.4.4 >Reporter: jobit mathew >Priority: Minor > > spark-sql also supports *VALUES *. > {code:java} > spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three'); > 1 one > 2 two > 3 three > Time taken: 0.015 seconds, Fetched 3 row(s) > spark-sql> > spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three') limit 2; > 1 one > 2 two > Time taken: 0.014 seconds, Fetched 2 row(s) > spark-sql> > spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three') order by 2; > 1 one > 3 three > 2 two > Time taken: 0.153 seconds, Fetched 3 row(s) > spark-sql> > {code} > or even *values *can be used along with INSERT INTO or select. > refer: https://www.postgresql.org/docs/current/sql-values.html > So please confirm VALUES also can be documented or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
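For the documentation discussion above, a couple of additional VALUES forms that already work in Spark and might be worth mentioning; this is a sketch, not a proposed documentation page:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// VALUES as a standalone statement.
spark.sql("VALUES (1, 'one'), (2, 'two'), (3, 'three')").show()

// VALUES as an inline table in a FROM clause, with column aliases.
spark.sql("SELECT id, name FROM VALUES (1, 'one'), (2, 'two') AS t(id, name)").show()
{code}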
[jira] [Created] (SPARK-29775) Support truncate multiple tables
jobit mathew created SPARK-29775: Summary: Support truncate multiple tables Key: SPARK-29775 URL: https://issues.apache.org/jira/browse/SPARK-29775 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.4 Reporter: jobit mathew Spark SQL supports truncating a single table, like TRUNCATE TABLE t1; but PostgreSQL supports truncating multiple tables in one statement, like TRUNCATE bigtable, fattable; so Spark could also support truncating multiple tables. [https://www.postgresql.org/docs/12/sql-truncate.html] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
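A short sketch contrasting what Spark SQL accepts today with the requested syntax; the multi-table form is the feature being proposed, so it is shown only as a commented-out target:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Works today: truncating a single table.
spark.sql("TRUNCATE TABLE t1")

// Requested here (PostgreSQL-style), not yet supported by Spark's parser:
// spark.sql("TRUNCATE TABLE t1, t2")
{code}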
[jira] [Comment Edited] (SPARK-29760) Document VALUES statement in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-29760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967500#comment-16967500 ] jobit mathew edited comment on SPARK-29760 at 11/6/19 1:43 PM: --- [~LI,Xiao] and [~dkbiswal] could you please confirm this. [~srowen] any comment on this? was (Author: jobitmathew): [~LI,Xiao] and [~dkbiswal] could you please confirm this > Document VALUES statement in SQL Reference. > --- > > Key: SPARK-29760 > URL: https://issues.apache.org/jira/browse/SPARK-29760 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.4.4 >Reporter: jobit mathew >Priority: Minor > > spark-sql also supports *VALUES *. > {code:java} > spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three'); > 1 one > 2 two > 3 three > Time taken: 0.015 seconds, Fetched 3 row(s) > spark-sql> > spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three') limit 2; > 1 one > 2 two > Time taken: 0.014 seconds, Fetched 2 row(s) > spark-sql> > spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three') order by 2; > 1 one > 3 three > 2 two > Time taken: 0.153 seconds, Fetched 3 row(s) > spark-sql> > {code} > or even *values *can be used along with INSERT INTO or select. > refer: https://www.postgresql.org/docs/current/sql-values.html > So please confirm VALUES also can be documented or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29746) implement validateInputType in Normalizer
[ https://issues.apache.org/jira/browse/SPARK-29746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29746. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26388 [https://github.com/apache/spark/pull/26388] > implement validateInputType in Normalizer > - > > Key: SPARK-29746 > URL: https://issues.apache.org/jira/browse/SPARK-29746 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.0.0 > > > UnaryTransformer.transformSchema calls validateInputType method, but this > method is not implemented in ElementwiseProduct, Normalizer and > PolynomialExpansion. Will implement validateInput in ElementwiseProduct, > Normalizer and PolynomialExpansion. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29746) implement validateInputType in Normalizer
[ https://issues.apache.org/jira/browse/SPARK-29746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-29746: Assignee: Huaxin Gao > implement validateInputType in Normalizer > - > > Key: SPARK-29746 > URL: https://issues.apache.org/jira/browse/SPARK-29746 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > > UnaryTransformer.transformSchema calls validateInputType method, but this > method is not implemented in ElementwiseProduct, Normalizer and > PolynomialExpansion. Will implement validateInput in ElementwiseProduct, > Normalizer and PolynomialExpansion. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27763) Port test cases from PostgreSQL to Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-27763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968346#comment-16968346 ] Takeshi Yamamuro commented on SPARK-27763: -- I've finished checking `create_table.sql` and `alter_table.sql` query-by-query and I think most parts of the queries in them are less meaning for Spark and the most issues there already have been filed in SPARK-27764. So, I think we could close this as resolved for now. WDYT? - create_table.sql: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/sql/create_table.sql - alter_table.sql: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/sql/alter_table.sql > Port test cases from PostgreSQL to Spark SQL > > > Key: SPARK-27763 > URL: https://issues.apache.org/jira/browse/SPARK-27763 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Yuming Wang >Priority: Major > > To improve the test coverage, we can port the regression tests from the other > popular open source projects to Spark SQL. PostgreSQL is one of the best SQL > systems. Below are the links to the test cases and results. > * Regression test cases: > [https://github.com/postgres/postgres/tree/master/src/test/regress/sql] > * Expected results: > [https://github.com/postgres/postgres/tree/master/src/test/regress/expected] > Spark SQL does not support all the feature sets of PostgreSQL. In the current > stage, we should first comment out these test cases and create the > corresponding JIRAs in SPARK-27764. We can discuss and prioritize which > features we should support. Also, these PostgreSQL regression tests could > also expose the existing bugs of Spark SQL. We should also create the JIRAs > and track them in SPARK-27764. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28296) Improved VALUES support
[ https://issues.apache.org/jira/browse/SPARK-28296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968332#comment-16968332 ] jobit mathew commented on SPARK-28296: -- [~petertoth] there are some more valid commands supporting in postgresql, but which are not supporting in spark sql. VALUES (1, 'one'), (2, 'two'), (3, 'three') FETCH FIRST 3 rows only; VALUES (1, 'one'), (2, 'two'), (3, 'three') OFFSET 1 row; || ||column1||column2|| |1|1|one| |2|2|two| |3|3|three| || ||column1||column2|| |1|2|two| |2|3|three| but same in spark-sql spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three') FETCH FIRST 3 rows only; Error in query: mismatched input 'FIRST' expecting (line 1, pos 50) == SQL == VALUES (1, 'one'), (2, 'two'), (3, 'three') FETCH FIRST 3 rows only --^^^ spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three') OFFSET 1 row; Error in query: mismatched input '1' expecting (line 1, pos 51) == SQL == VALUES (1, 'one'), (2, 'two'), (3, 'three') OFFSET 1 row ---^^^ spark-sql> > Improved VALUES support > --- > > Key: SPARK-28296 > URL: https://issues.apache.org/jira/browse/SPARK-28296 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Peter Toth >Priority: Major > > These are valid queries in PostgreSQL, but they don't work in Spark SQL: > {noformat} > values ((select 1)); > values ((select c from test1)); > select (values(c)) from test10; > with cte(foo) as ( values(42) ) values((select foo from cte)); > {noformat} > where test1 and test10: > {noformat} > CREATE TABLE test1 (c INTEGER); > INSERT INTO test1 VALUES(1); > CREATE TABLE test10 (c INTEGER); > INSERT INTO test10 SELECT generate_sequence(1, 10); > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29773) Unable to process empty ORC files in Hive Table using Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-29773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968329#comment-16968329 ] Alexander Ermakov commented on SPARK-29773: --- We will try on 2.4.4 > Unable to process empty ORC files in Hive Table using Spark SQL > --- > > Key: SPARK-29773 > URL: https://issues.apache.org/jira/browse/SPARK-29773 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 > Environment: Centos 7, Spark 2.3.1, Hive 2.3.0 >Reporter: Alexander Ermakov >Priority: Major > > Unable to process empty ORC files in Hive Table using Spark SQL. It seems > that a problem with class > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits() > Stack trace: > {code:java} > 19/10/30 22:29:54 ERROR SparkSQLDriver: Failed in [select distinct > _tech_load_dt from dl_raw.tpaccsieee_ut_data_address] > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: > Exchange hashpartitioning(_tech_load_dt#1374, 200) > +- *(1) HashAggregate(keys=[_tech_load_dt#1374], functions=[], > output=[_tech_load_dt#1374]) >+- HiveTableScan [_tech_load_dt#1374], HiveTableRelation > `dl_raw`.`tpaccsieee_ut_data_address`, > org.apache.hadoop.hive.ql.io.orc.OrcSerde, [address#1307, address_9zp#1308, > address_adm#1309, address_md#1310, adress_doc#1311, building#1312, > change_date_addr_el#1313, change_date_okato#1314, change_date_окато#1315, > city#1316, city_id#1317, cnv_cont_id#1318, code_intercity#1319, > code_kladr#1320, code_plan1#1321, date_act#1322, date_change#1323, > date_prz_incorrect_code_kladr#1324, date_record#1325, district#1326, > district_id#1327, etaj#1328, e_plan#1329, fax#1330, ... 44 more fields] > at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:119) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:371) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:150) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:324) > at > 
org.apache.spark.sql.execution.QueryExecution.hiveResultString(QueryExecution.scala:122) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver$$anonfun$run$1.apply(SparkSQLDriver.scala:64) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver$$anonfun$run$1.apply(SparkSQLDriver.scala:64) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:364) > at > org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:272) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at >
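Not from the original thread: when Hive-written ORC splits misbehave, one knob that is sometimes worth trying is letting Spark's native ORC reader handle the metastore table instead of the Hive SerDe path. A hedged sketch only; whether it avoids this particular OrcInputFormat.getSplits() failure is unverified.
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-empty-files-example")
  .enableHiveSupport()
  // Route ORC-backed Hive tables through Spark's native ORC data source
  // instead of org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.
  .config("spark.sql.hive.convertMetastoreOrc", "true")
  .getOrCreate()

spark.sql("SELECT DISTINCT _tech_load_dt FROM dl_raw.tpaccsieee_ut_data_address").show()
{code}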
[jira] [Updated] (SPARK-16872) Impl Gaussian Naive Bayes Classifier
[ https://issues.apache.org/jira/browse/SPARK-16872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-16872: - Summary: Impl Gaussian Naive Bayes Classifier (was: Include Gaussian Naive Bayes Classifier) > Impl Gaussian Naive Bayes Classifier > > > Key: SPARK-16872 > URL: https://issues.apache.org/jira/browse/SPARK-16872 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Major > > I implemented Gaussian NB according to scikit-learn's {{GaussianNB}}. > In GaussianNB model, the {{theta}} matrix is used to store means and there is > a extra {{sigma}} matrix storing the variance of each feature. > GaussianNB in spark > {code} > scala> import org.apache.spark.ml.classification.GaussianNaiveBayes > import org.apache.spark.ml.classification.GaussianNaiveBayes > scala> val path = > "/Users/zrf/.dev/spark-2.1.0-bin-hadoop2.7/data/mllib/sample_multiclass_classification_data.txt" > path: String = > /Users/zrf/.dev/spark-2.1.0-bin-hadoop2.7/data/mllib/sample_multiclass_classification_data.txt > scala> val data = spark.read.format("libsvm").load(path).persist() > data: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: > double, features: vector] > scala> val gnb = new GaussianNaiveBayes() > gnb: org.apache.spark.ml.classification.GaussianNaiveBayes = gnb_54c50467306c > scala> val model = gnb.fit(data) > 17/01/03 14:25:48 INFO Instrumentation: > GaussianNaiveBayes-gnb_54c50467306c-720112035-1: training: numPartitions=1 > storageLevel=StorageLevel(1 replicas) > 17/01/03 14:25:48 INFO Instrumentation: > GaussianNaiveBayes-gnb_54c50467306c-720112035-1: {} > 17/01/03 14:25:49 INFO Instrumentation: > GaussianNaiveBayes-gnb_54c50467306c-720112035-1: {"numFeatures":4} > 17/01/03 14:25:49 INFO Instrumentation: > GaussianNaiveBayes-gnb_54c50467306c-720112035-1: {"numClasses":3} > 17/01/03 14:25:49 INFO Instrumentation: > GaussianNaiveBayes-gnb_54c50467306c-720112035-1: training finished > model: org.apache.spark.ml.classification.GaussianNaiveBayesModel = > GaussianNaiveBayesModel (uid=gnb_54c50467306c) with 3 classes > scala> model.pi > res0: org.apache.spark.ml.linalg.Vector = > [-1.0986122886681098,-1.0986122886681098,-1.0986122886681098] > scala> model.pi.toArray.map(math.exp) > res1: Array[Double] = Array(0., 0., > 0.) 
> scala> model.theta > res2: org.apache.spark.ml.linalg.Matrix = > 0.270067018001 -0.188540006 0.543050720001 0.60546 > -0.60779998 0.18172 -0.842711740006 > -0.88139998 > -0.091425964 -0.35858001 0.105084738 > 0.021666701507102017 > scala> model.sigma > res3: org.apache.spark.ml.linalg.Matrix = > 0.1223012510889361 0.07078051983960698 0.0343595243976 > 0.051336071297393815 > 0.03758145300924998 0.09880280046403413 0.003390296940069426 > 0.007822241779598893 > 0.08058763609659315 0.06701386661293329 0.024866409227781675 > 0.02661391644759426 > scala> model.transform(data).select("probability").take(10) > [rdd_68_0] > res4: Array[org.apache.spark.sql.Row] = > Array([[1.0627410543476422E-21,0.9938,6.2765233965353945E-15]], > [[7.254521422345374E-26,1.0,1.3849442153180895E-18]], > [[1.9629244119173135E-24,0.9998,1.9424765181237926E-16]], > [[6.061218297948492E-22,0.9902,9.853216073401884E-15]], > [[0.9972225671942837,8.844241161578932E-165,0.002777432805716399]], > [[5.361683970373604E-26,1.0,2.3004604508982183E-18]], > [[0.01062850630038623,3.3102617689978775E-100,0.9893714936996136]], > [[1.9297314618271785E-4,2.124922209137708E-71,0.9998070268538172]], > [[3.118816393732361E-27,1.0,6.5310299615983584E-21]], > [[0.926009854522,8.734773657627494E-206,7.399014547943611E-6]]) > scala> model.transform(data).select("prediction").take(10) > [rdd_68_0] > res5: Array[org.apache.spark.sql.Row] = Array([1.0], [1.0], [1.0], [1.0], > [0.0], [1.0], [2.0], [2.0], [1.0], [0.0]) > {code} > GaussianNB in scikit-learn > {code} > import numpy as np > from sklearn.naive_bayes import GaussianNB > from sklearn.datasets import load_svmlight_file > path = > '/Users/zrf/.dev/spark-2.1.0-bin-hadoop2.7/data/mllib/sample_multiclass_classification_data.txt' > X, y = load_svmlight_file(path) > X = X.toarray() > clf = GaussianNB() > clf.fit(X, y) > >>> clf.class_prior_ > array([ 0., 0., 0.]) > >>> clf.theta_ > array([[ 0.2701, -0.1885, 0.54305072, 0.6055], >[-0.6078, 0.1817, -0.84271174, -0.8814], >[-0.0914, -0.3586, 0.10508474, 0.0216667 ]]) > > >>> clf.sigma_ > array([[ 0.12230125, 0.07078052, 0.03430001, 0.05133607],