[jira] [Created] (SPARK-30421) Dropped columns still available for filtering

2020-01-05 Thread Tobias Hermann (Jira)
Tobias Hermann created SPARK-30421:
--

 Summary: Dropped columns still available for filtering
 Key: SPARK-30421
 URL: https://issues.apache.org/jira/browse/SPARK-30421
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.4
Reporter: Tobias Hermann


The following minimal example:
{quote}val df = Seq((0, "a"), (1, "b")).toDF("foo", "bar")
df.select("foo").where($"bar" === "a").show
df.drop("bar").where($"bar" === "a").show
{quote}
should result in an error like the following:
{quote}org.apache.spark.sql.AnalysisException: cannot resolve '`bar`' given 
input columns: [foo];
{quote}
However, it does not; instead, the query runs without error, as if the column 
"bar" still existed.
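
A quick way to see what is happening (a diagnostic sketch, reflecting my reading of the behavior rather than a confirmed analysis): printing the query plans shows that "bar" is still available below the projection, so the filter gets resolved even though the column was dropped or not selected.
{code:scala}
// Assumes a spark-shell / SparkSession named `spark`.
import spark.implicits._

val df = Seq((0, "a"), (1, "b")).toDF("foo", "bar")
// `bar` is resolved against the projection's child, so no AnalysisException is thrown.
df.drop("bar").where($"bar" === "a").explain(true)
df.select("foo").where($"bar" === "a").explain(true)
{code}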



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30421) Dropped columns still available for filtering

2020-01-05 Thread Tobias Hermann (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17008301#comment-17008301
 ] 

Tobias Hermann commented on SPARK-30421:


see: 
[https://stackoverflow.com/questions/59597678/why-does-filtering-on-a-non-existing-non-selected-column-work]

> Dropped columns still available for filtering
> -
>
> Key: SPARK-30421
> URL: https://issues.apache.org/jira/browse/SPARK-30421
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Tobias Hermann
>Priority: Minor
>
> The following minimal example:
> {quote}val df = Seq((0, "a"), (1, "b")).toDF("foo", "bar")
> df.select("foo").where($"bar" === "a").show
> df.drop("bar").where($"bar" === "a").show
> {quote}
> should result in an error like the following:
> {quote}org.apache.spark.sql.AnalysisException: cannot resolve '`bar`' given 
> input columns: [foo];
> {quote}
> However, it does not; instead, the query runs without error, as if the column "bar" 
> still existed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30422) deprecate UserDefinedAggregateFunction in favor of SPARK-27296

2020-01-05 Thread Erik Erlandson (Jira)
Erik Erlandson created SPARK-30422:
--

 Summary: deprecate UserDefinedAggregateFunction in favor of 
SPARK-27296
 Key: SPARK-30422
 URL: https://issues.apache.org/jira/browse/SPARK-30422
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Erik Erlandson






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30423) Deprecate UserDefinedAggregateFunction

2020-01-05 Thread Erik Erlandson (Jira)
Erik Erlandson created SPARK-30423:
--

 Summary: Deprecate UserDefinedAggregateFunction
 Key: SPARK-30423
 URL: https://issues.apache.org/jira/browse/SPARK-30423
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Erik Erlandson
Assignee: Erik Erlandson


Anticipating the merging of SPARK-27296, the legacy methodology for 
implementing custom user-defined aggregators over untyped DataFrames, based on 
UserDefinedAggregateFunction, will become obsolete. That class should be 
annotated as deprecated once the new capability is officially merged.
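
For reference, a minimal sketch of the intended replacement path, assuming the Aggregator-based registration proposed in SPARK-27296 lands as described there (the functions.udaf helper is taken from that proposal):
{code:scala}
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.udaf

// A typed Aggregator that sums a Long column; the intended replacement for a
// UserDefinedAggregateFunction used over an untyped DataFrame.
object LongSum extends Aggregator[Long, Long, Long] {
  def zero: Long = 0L
  def reduce(buffer: Long, value: Long): Long = buffer + value
  def merge(b1: Long, b2: Long): Long = b1 + b2
  def finish(reduction: Long): Long = reduction
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

val spark = SparkSession.builder.appName("udaf-sketch").master("local[*]").getOrCreate()
spark.udf.register("long_sum", udaf(LongSum)) // register for untyped DataFrame/SQL use
spark.sql("SELECT long_sum(id) FROM range(10)").show()
{code}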



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30424) Change ExpressionEncoder toRow method to return UnsafeRow

2020-01-05 Thread Erik Erlandson (Jira)
Erik Erlandson created SPARK-30424:
--

 Summary: Change ExpressionEncoder toRow method to return UnsafeRow
 Key: SPARK-30424
 URL: https://issues.apache.org/jira/browse/SPARK-30424
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Erik Erlandson


[~wenchen] observed that the toRow() method on ExpressionEncoder can have its 
return type specified as UnsafeRow. See discussion on 
[https://github.com/apache/spark/pull/25024] 

 

Not a high priority, but it could be done for 3.0.0.
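
A small illustration of the observation (a sketch against the Spark 2.4-era internal API; ExpressionEncoder is internal, so the exact signatures here are an assumption):
{code:scala}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.catalyst.expressions.UnsafeRow

val enc = ExpressionEncoder[(Int, String)]()
val row: InternalRow = enc.toRow((1, "a"))   // declared return type is InternalRow
val unsafe = row.asInstanceOf[UnsafeRow]     // in practice it is already an UnsafeRow,
                                             // hence the proposal to narrow the declared type
{code}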



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25603) Generalize Nested Column Pruning

2020-01-05 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17008438#comment-17008438
 ] 

Dongjoon Hyun commented on SPARK-25603:
---

[~maropu], thank you for pinging me. Let's consider this at the end of this 
month.

> Generalize Nested Column Pruning
> 
>
> Key: SPARK-25603
> URL: https://issues.apache.org/jira/browse/SPARK-25603
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25603) Generalize Nested Column Pruning

2020-01-05 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17008447#comment-17008447
 ] 

Takeshi Yamamuro commented on SPARK-25603:
--

Looks nice, thanks, [~dongjoon]

> Generalize Nested Column Pruning
> 
>
> Key: SPARK-25603
> URL: https://issues.apache.org/jira/browse/SPARK-25603
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25464) Dropping database can remove the hive warehouse directory contents

2020-01-05 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-25464.
--
Resolution: Not A Problem

> Dropping database can remove the hive warehouse directory contents
> --
>
> Key: SPARK-25464
> URL: https://issues.apache.org/jira/browse/SPARK-25464
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Sushanta Sen
>Priority: Major
>
> Create Database.
> CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] db_name [COMMENT comment_text] 
> [LOCATION path] [WITH DBPROPERTIES (key1=val1, key2=val2, ...)]
> LOCATION: if the specified path does not already exist in the underlying 
> file system, this command tries to create a directory with that path. Per the 
> documentation, when the database is dropped later this directory is not 
> deleted, but currently Spark deletes the directory as well.
> Please refer to the [Databricks 
> documentation|https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-database.html].
> If I create the database as below:
> create database db1 location '/user/hive/warehouse'; // this is the hive 
> warehouse directory
> On dropping this db, it will also delete the warehouse directory, which 
> contains the other databases' information.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27258) The value of "spark.app.name" or "--name" starts with number , which causes resourceName does not match regular expression

2020-01-05 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-27258.
--
Resolution: Won't Fix

> The value of "spark.app.name" or "--name" starts with number , which causes 
> resourceName does not match regular expression
> --
>
> Key: SPARK-27258
> URL: https://issues.apache.org/jira/browse/SPARK-27258
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: hehuiyuan
>Priority: Minor
>
> {code:java}
> Exception in thread "main" 
> io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: 
> POST at: https://xxx:xxx/api/v1/namespaces/xxx/services. Message: Service 
> "1min-machinereg-yf-1544604108931-driver-svc" is invalid: metadata.name: 
> Invalid value: "1min-machinereg-yf-1544604108931-driver-svc": a DNS-1035 
> label must consist of lower case alphanumeric characters or '-', start with 
> an alphabetic character, and end with an alphanumeric character (e.g. 
> 'my-name',  or 'abc-123', regex used for validation is 
> '[a-z]([-a-z0-9]*[a-z0-9])?'). Received status: Status(apiVersion=v1, 
> code=422, details=StatusDetails(causes=[StatusCause(field=metadata.name, 
> message=Invalid value: "1min-machinereg-yf-1544604108931-driver-svc": a 
> DNS-1035 label must consist of lower case alphanumeric characters or '-', 
> start with an alphabetic character, and end with an alphanumeric character 
> (e.g. 'my-name',  or 'abc-123', regex used for validation is 
> '[a-z]([-a-z0-9]*[a-z0-9])?'), reason=FieldValueInvalid, 
> additionalProperties={})], group=null, kind=Service, 
> name=1min-machinereg-yf-1544604108931-driver-svc, retryAfterSeconds=null, 
> uid=null, additionalProperties={}).
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30418) make FM call super class method extractLabeledPoints

2020-01-05 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-30418:


Assignee: Huaxin Gao

> make FM call super class method extractLabeledPoints
> 
>
> Key: SPARK-30418
> URL: https://issues.apache.org/jira/browse/SPARK-30418
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
>
> make FMClassifier/Regressor call super class method extractLabeledPoints



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30418) make FM call super class method extractLabeledPoints

2020-01-05 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30418.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27093
[https://github.com/apache/spark/pull/27093]

> make FM call super class method extractLabeledPoints
> 
>
> Key: SPARK-30418
> URL: https://issues.apache.org/jira/browse/SPARK-30418
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.0.0
>
>
> make FMClassifier/Regressor call super class method extractLabeledPoints



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23432) Expose executor memory metrics in the web UI for executors

2020-01-05 Thread Zhongwei Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17008493#comment-17008493
 ] 

Zhongwei Zhu commented on SPARK-23432:
--

I'll work on this. 

> Expose executor memory metrics in the web UI for executors
> --
>
> Key: SPARK-23432
> URL: https://issues.apache.org/jira/browse/SPARK-23432
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edward Lu
>Priority: Major
>
> Add the new memory metrics (jvmUsedMemory, executionMemory, storageMemory, 
> and unifiedMemory, etc.) to the executors tab, in the summary and for each 
> executor.
> This is a subtask for SPARK-23206. Please refer to the design doc for that 
> ticket for more details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30425) FileScan of Data Source V2 doesn't implement Partition Pruning

2020-01-05 Thread Haifeng Chen (Jira)
Haifeng Chen created SPARK-30425:


 Summary: FileScan of Data Source V2 doesn't implement Partition 
Pruning
 Key: SPARK-30425
 URL: https://issues.apache.org/jira/browse/SPARK-30425
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Haifeng Chen


I was trying to understand how Data Source V2 handles partition pruning, but I 
could not find any code in the current Data Source V2 implementation that 
filters out the unnecessary files. For a file data source, the base class 
FileScan of Data Source V2 should probably handle this in its "partitions" 
method, but the current implementation looks like the following:

protected def partitions: Seq[FilePartition] = {
  val selectedPartitions = fileIndex.listFiles(Seq.empty, Seq.empty)

listFiles is passed two empty sequences, so no files are filtered out by the 
partition filter.
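
One way to observe this from the outside (a sketch; the path is just an example, and forcing the V2 file source via spark.sql.sources.useV1SourceList is my assumption for the 3.0.0 preview builds):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("dsv2-pruning-check").getOrCreate()
import spark.implicits._

// Write a small partitioned dataset.
Seq((1, "2020-01-01"), (2, "2020-01-02")).toDF("id", "dt")
  .write.partitionBy("dt").mode("overwrite").parquet("/tmp/dsv2_pruning_demo")

// Route parquet reads through the Data Source V2 FileScan path.
spark.conf.set("spark.sql.sources.useV1SourceList", "")
spark.read.parquet("/tmp/dsv2_pruning_demo")
  .where($"dt" === "2020-01-01")
  .explain(true) // check whether the partition filter reaches the file scan
{code}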



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30426) Fix disorder the structured-streaming-kafka-integration page

2020-01-05 Thread Yuanjian Li (Jira)
Yuanjian Li created SPARK-30426:
---

 Summary: Fix disorder the structured-streaming-kafka-integration 
page
 Key: SPARK-30426
 URL: https://issues.apache.org/jira/browse/SPARK-30426
 Project: Spark
  Issue Type: Bug
  Components: Documentation, Structured Streaming
Affects Versions: 3.0.0
Reporter: Yuanjian Li






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30426) Fix the disorder of structured-streaming-kafka-integration page

2020-01-05 Thread Yuanjian Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanjian Li updated SPARK-30426:

Summary: Fix the disorder of structured-streaming-kafka-integration page  
(was: Fix disorder the structured-streaming-kafka-integration page)

> Fix the disorder of structured-streaming-kafka-integration page
> ---
>
> Key: SPARK-30426
> URL: https://issues.apache.org/jira/browse/SPARK-30426
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29596) Task duration not updating for running tasks

2020-01-05 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17008536#comment-17008536
 ] 

Hyukjin Kwon commented on SPARK-29596:
--

[~726575...@qq.com] have you made any progress on this?

> Task duration not updating for running tasks
> 
>
> Key: SPARK-29596
> URL: https://issues.apache.org/jira/browse/SPARK-29596
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.2
>Reporter: Bharati Jadhav
>Priority: Major
> Attachments: Screenshot_Spark_live_WebUI.png
>
>
> When looking at the task metrics for running tasks in the task table for the 
> related stage, the duration column is not updated until the task has 
> succeeded. The duration values are reported empty or 0 ms until the task has 
> completed. This is a change in behavior, from earlier versions, when the task 
> duration was continuously updated while the task was running. The missing 
> duration values can be observed for both short and long running tasks and for 
> multiple applications.
>  
> To reproduce this, one can run any code from the spark-shell and observe the 
> missing duration values for any running task. Only when the task succeeds is 
> the duration value populated in the UI.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30196) Bump lz4-java version to 1.7.0

2020-01-05 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17008537#comment-17008537
 ] 

Hyukjin Kwon commented on SPARK-30196:
--

[~larsfrancke] does that happen after this upgrade?

> Bump lz4-java version to 1.7.0
> --
>
> Key: SPARK-30196
> URL: https://issues.apache.org/jira/browse/SPARK-30196
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30426) Fix the disorder of structured-streaming-kafka-integration page

2020-01-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30426.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27098
[https://github.com/apache/spark/pull/27098]

> Fix the disorder of structured-streaming-kafka-integration page
> ---
>
> Key: SPARK-30426
> URL: https://issues.apache.org/jira/browse/SPARK-30426
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30426) Fix the disorder of structured-streaming-kafka-integration page

2020-01-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30426:
---

Assignee: Yuanjian Li

> Fix the disorder of structured-streaming-kafka-integration page
> ---
>
> Key: SPARK-30426
> URL: https://issues.apache.org/jira/browse/SPARK-30426
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30422) deprecate UserDefinedAggregateFunction in favor of SPARK-27296

2020-01-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30422.
--
Resolution: Duplicate

> deprecate UserDefinedAggregateFunction in favor of SPARK-27296
> --
>
> Key: SPARK-30422
> URL: https://issues.apache.org/jira/browse/SPARK-30422
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Erik Erlandson
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30364) The spark-streaming-kafka-0-10_2.11 test cases are failing on ppc64le

2020-01-05 Thread AK97 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AK97 updated SPARK-30364:
-
Description: 
I have been trying to build Apache Spark on rhel_7.6/ppc64le; however, the 
spark-streaming-kafka-0-10_2.11 test cases are failing with the following error:

[ERROR] 
/opt/spark/external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaDataConsumerSuite.scala:85:
 Symbol 'term org.eclipse' is missing from the classpath.
This symbol is required by 'method 
org.apache.spark.metrics.MetricsSystem.getServletHandlers'.
Make sure that term eclipse is in your classpath and check for conflicting 
dependencies with `-Ylog-classpath`.
A full rebuild may help if 'MetricsSystem.class' was compiled against an 
incompatible version of org.
[ERROR] testUtils.sendMessages(topic, data.toArray) 
 ^


I would like some help understanding the cause. I am running it on a high-end 
VM with good connectivity.
This error is also seen on x86.

  was:
I have been trying to build the Apache Spark on rhel_7.6/ppc64le; however, the 
spark-streaming-kafka-0-10_2.11 test cases are failing with following error :

[ERROR] 
/opt/spark/external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaDataConsumerSuite.scala:85:
 Symbol 'term org.eclipse' is missing from the classpath.
This symbol is required by 'method 
org.apache.spark.metrics.MetricsSystem.getServletHandlers'.
Make sure that term eclipse is in your classpath and check for conflicting 
dependencies with `-Ylog-classpath`.
A full rebuild may help if 'MetricsSystem.class' was compiled against an 
incompatible version of org.
[ERROR] testUtils.sendMessages(topic, data.toArray) 
 ^


Would like some help on understanding the cause for the same . I am running it 
on a High end VM with good connectivity.


> The spark-streaming-kafka-0-10_2.11 test cases are failing on ppc64le
> -
>
> Key: SPARK-30364
> URL: https://issues.apache.org/jira/browse/SPARK-30364
> Project: Spark
>  Issue Type: Test
>  Components: Build
>Affects Versions: 2.4.0
> Environment: os: rhel 7.6
> arch: ppc64le
>Reporter: AK97
>Priority: Major
>
> I have been trying to build the Apache Spark on rhel_7.6/ppc64le; however, 
> the spark-streaming-kafka-0-10_2.11 test cases are failing with following 
> error :
> [ERROR] 
> /opt/spark/external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaDataConsumerSuite.scala:85:
>  Symbol 'term org.eclipse' is missing from the classpath.
> This symbol is required by 'method 
> org.apache.spark.metrics.MetricsSystem.getServletHandlers'.
> Make sure that term eclipse is in your classpath and check for conflicting 
> dependencies with `-Ylog-classpath`.
> A full rebuild may help if 'MetricsSystem.class' was compiled against an 
> incompatible version of org.
> [ERROR] testUtils.sendMessages(topic, data.toArray)   
>^
> Would like some help on understanding the cause for the same . I am running 
> it on a High end VM with good connectivity.
> This error is also seen on x86 as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30427) Add config item for limiting partition number when calculating statistics through HDFS

2020-01-05 Thread Hu Fuwang (Jira)
Hu Fuwang created SPARK-30427:
-

 Summary: Add config item for limiting partition number when 
calculating statistics through HDFS
 Key: SPARK-30427
 URL: https://issues.apache.org/jira/browse/SPARK-30427
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Hu Fuwang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29596) Task duration not updating for running tasks

2020-01-05 Thread daile (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17008552#comment-17008552
 ] 

daile commented on SPARK-29596:
---

[~hyukjin.kwon] The task detail list uses the task.taskMetrics info, but 
task.taskMetrics is only updated when the task finishes. Is it feasible to 
get the task duration while it is still running?
https://github.com/apache/spark/pull/27026
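
A sketch of the idea (field names modeled on the public status/REST API task data, as an illustration rather than the actual UI code): for a task that is still running, the duration can be derived from its launch time instead of waiting for taskMetrics to be filled in on completion.
{code:scala}
// Hypothetical helper: live duration for a possibly still-running task.
def liveDuration(launchTime: Long,
                 finishTime: Option[Long],
                 now: Long = System.currentTimeMillis()): Long =
  finishTime.getOrElse(now) - launchTime
{code}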

> Task duration not updating for running tasks
> 
>
> Key: SPARK-29596
> URL: https://issues.apache.org/jira/browse/SPARK-29596
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.2
>Reporter: Bharati Jadhav
>Priority: Major
> Attachments: Screenshot_Spark_live_WebUI.png
>
>
> When looking at the task metrics for running tasks in the task table for the 
> related stage, the duration column is not updated until the task has 
> succeeded. The duration values are reported empty or 0 ms until the task has 
> completed. This is a change in behavior, from earlier versions, when the task 
> duration was continuously updated while the task was running. The missing 
> duration values can be observed for both short and long running tasks and for 
> multiple applications.
>  
> To reproduce this, one can run any code from the spark-shell and observe the 
> missing duration values for any running task. Only when the task succeeds is 
> the duration value populated in the UI.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30411) saveAsTable does not honor spark.hadoop.hive.warehouse.subdir.inherit.perms

2020-01-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30411:
-
Description: 
{code}
-bash-4.2$ hdfs dfs -ls /tmp | grep my_databases
 drwxr-x--T - redsanket users 0 2019-12-04 20:15 /tmp/my_databases
{code}

{code}
>>> spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) STORED 
>>> AS orc");
{code}

{code}
-bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
 drwxr-x--T - redsanket users 0 2019-12-04 20:20 /tmp/my_databases/example
{code}

Now after saveAsTable

{code}
 >>> data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), 
 >>> ('Fifth', 5)]
 >>> df = spark.createDataFrame(data)
 >>> df.write.format("orc").mode('overwrite').saveAsTable('redsanket_db.example')
 {code}

{code}
-bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
 drwx-- - redsanket users 0 2019-12-04 20:23 /tmp/my_databases/example
{code}

This overwrites the permissions.

Insert into (insertInto), by contrast, honors and preserves the parent directory permissions:

{code}
 >>> spark.sql("DROP table redsanket_db.example");
 DataFrame[]
 >>> spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) 
 >>> STORED AS orc");
 DataFrame[]
 >>> df.write.format("orc").insertInto('redsanket_db.example')
{code}

{code}
-bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
 drwxr-x--T - redsanket users 0 2019-12-04 20:43 /tmp/my_databases/example
{code}

Either this is a limitation of the API based on the write mode, in which case 
the behavior has to be documented, or it needs to be fixed.

  was:
-bash-4.2$ hdfs dfs -ls /tmp | grep my_databases
 drwxr-x--T - redsanket users 0 2019-12-04 20:15 /tmp/my_databases

>>>spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) STORED 
>>>AS orc");

-bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
 drwxr-x--T - redsanket users 0 2019-12-04 20:20 /tmp/my_databases/example

Now after saveAsTable
 >>> data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), 
 >>> ('Fifth', 5)]
 >>> df = spark.createDataFrame(data)
 >>> df.write.format("orc").mode('overwrite').saveAsTable('redsanket_db.example')
 -bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
 drwx-- - redsanket users 0 2019-12-04 20:23 /tmp/my_databases/example
 Overwrites the permissions

Insert into honors preserving parent directory permissions.
 >>> spark.sql("DROP table redsanket_db.example");
 DataFrame[]
 >>> spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) 
 >>> STORED AS orc");
 DataFrame[]
 >>> df.write.format("orc").insertInto('redsanket_db.example')

-bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
 drwxr-x--T - redsanket users 0 2019-12-04 20:43 /tmp/my_databases/example
 It is either limitation of the API based on the mode and the behavior has to 
be documented or needs to be fixed


> saveAsTable does not honor spark.hadoop.hive.warehouse.subdir.inherit.perms
> ---
>
> Key: SPARK-30411
> URL: https://issues.apache.org/jira/browse/SPARK-30411
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Sanket Reddy
>Priority: Minor
>
> {code}
> -bash-4.2$ hdfs dfs -ls /tmp | grep my_databases
>  drwxr-x--T - redsanket users 0 2019-12-04 20:15 /tmp/my_databases
> {code}
> {code}
> >>> spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) 
> >>> STORED AS orc");
> {code}
> {code}
> -bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
>  drwxr-x--T - redsanket users 0 2019-12-04 20:20 /tmp/my_databases/example
> {code}
> Now after saveAsTable
> {code}
>  >>> data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), 
> ('Fifth', 5)]
>  >>> df = spark.createDataFrame(data)
>  >>> 
> df.write.format("orc").mode('overwrite').saveAsTable('redsanket_db.example')
>  {code}
> {code}
> -bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
>  drwx-- - redsanket users 0 2019-12-04 20:23 /tmp/my_databases/example
> {code}
>  Overwrites the permissions
> Insert into honors preserving parent directory permissions.
> {code}
>  >>> spark.sql("DROP table redsanket_db.example");
>  DataFrame[]
>  >>> spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) 
> STORED AS orc");
>  DataFrame[]
>  >>> df.write.format("orc").insertInto('redsanket_db.example')
> {code}
> {code}
> -bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
>  drwxr-x--T - redsanket users 0 2019-12-04 20:43 /tmp/my_databases/example
> {code}
>  It is either limitation of the API based on the mode and the behavior has to 
> be documented or needs to be fixed



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional 

[jira] [Updated] (SPARK-30411) saveAsTable does not honor spark.hadoop.hive.warehouse.subdir.inherit.perms

2020-01-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30411:
-
Description: 
{code}
-bash-4.2$ hdfs dfs -ls /tmp | grep my_databases
 drwxr-x--T - redsanket users 0 2019-12-04 20:15 /tmp/my_databases
{code}

{code}
>>> spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) STORED 
>>> AS orc");
{code}

{code}
-bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
 drwxr-x--T - redsanket users 0 2019-12-04 20:20 /tmp/my_databases/example
{code}

Now after {{saveAsTable}}

{code}
 >>> data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), 
 >>> ('Fifth', 5)]
 >>> df = spark.createDataFrame(data)
 >>> df.write.format("orc").mode('overwrite').saveAsTable('redsanket_db.example')
 {code}

{code}
-bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
 drwx-- - redsanket users 0 2019-12-04 20:23 /tmp/my_databases/example
{code}

This overwrites the permissions.

Insert into (insertInto), by contrast, honors and preserves the parent directory permissions:

{code}
 >>> spark.sql("DROP table redsanket_db.example");
 DataFrame[]
 >>> spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) 
 >>> STORED AS orc");
 DataFrame[]
 >>> df.write.format("orc").insertInto('redsanket_db.example')
{code}

{code}
-bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
 drwxr-x--T - redsanket users 0 2019-12-04 20:43 /tmp/my_databases/example
{code}

Either this is a limitation of the API based on the write mode, in which case 
the behavior has to be documented, or it needs to be fixed.

  was:
{code}
-bash-4.2$ hdfs dfs -ls /tmp | grep my_databases
 drwxr-x--T - redsanket users 0 2019-12-04 20:15 /tmp/my_databases
{code}

{code}
>>> spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) STORED 
>>> AS orc");
{code}

{code}
-bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
 drwxr-x--T - redsanket users 0 2019-12-04 20:20 /tmp/my_databases/example
{code}

Now after saveAsTable

{code}
 >>> data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), 
 >>> ('Fifth', 5)]
 >>> df = spark.createDataFrame(data)
 >>> df.write.format("orc").mode('overwrite').saveAsTable('redsanket_db.example')
 {code}

{code}
-bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
 drwx-- - redsanket users 0 2019-12-04 20:23 /tmp/my_databases/example
{code}

 Overwrites the permissions

Insert into honors preserving parent directory permissions.

{code}
 >>> spark.sql("DROP table redsanket_db.example");
 DataFrame[]
 >>> spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) 
 >>> STORED AS orc");
 DataFrame[]
 >>> df.write.format("orc").insertInto('redsanket_db.example')
{code}

{code}
-bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
 drwxr-x--T - redsanket users 0 2019-12-04 20:43 /tmp/my_databases/example
{code}

 It is either limitation of the API based on the mode and the behavior has to 
be documented or needs to be fixed


> saveAsTable does not honor spark.hadoop.hive.warehouse.subdir.inherit.perms
> ---
>
> Key: SPARK-30411
> URL: https://issues.apache.org/jira/browse/SPARK-30411
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Sanket Reddy
>Priority: Minor
>
> {code}
> -bash-4.2$ hdfs dfs -ls /tmp | grep my_databases
>  drwxr-x--T - redsanket users 0 2019-12-04 20:15 /tmp/my_databases
> {code}
> {code}
> >>> spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) 
> >>> STORED AS orc");
> {code}
> {code}
> -bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
>  drwxr-x--T - redsanket users 0 2019-12-04 20:20 /tmp/my_databases/example
> {code}
> Now after {{saveAsTable}}
> {code}
>  >>> data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), 
> ('Fifth', 5)]
>  >>> df = spark.createDataFrame(data)
>  >>> 
> df.write.format("orc").mode('overwrite').saveAsTable('redsanket_db.example')
>  {code}
> {code}
> -bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
>  drwx-- - redsanket users 0 2019-12-04 20:23 /tmp/my_databases/example
> {code}
>  Overwrites the permissions
> Insert into honors preserving parent directory permissions.
> {code}
>  >>> spark.sql("DROP table redsanket_db.example");
>  DataFrame[]
>  >>> spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) 
> STORED AS orc");
>  DataFrame[]
>  >>> df.write.format("orc").insertInto('redsanket_db.example')
> {code}
> {code}
> -bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
>  drwxr-x--T - redsanket users 0 2019-12-04 20:43 /tmp/my_databases/example
> {code}
>  It is either limitation of the API based on the mode and the behavior has to 
> be documented or needs to be fixed



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (SPARK-30411) saveAsTable does not honor spark.hadoop.hive.warehouse.subdir.inherit.perms

2020-01-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30411:
-
Description: 
{code}
-bash-4.2$ hdfs dfs -ls /tmp | grep my_databases
 drwxr-x--T - redsanket users 0 2019-12-04 20:15 /tmp/my_databases
{code}

{code}
>>> spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) STORED 
>>> AS orc");
{code}

{code}
-bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
 drwxr-x--T - redsanket users 0 2019-12-04 20:20 /tmp/my_databases/example
{code}

Now after {{saveAsTable}}

{code}
 >>> data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), 
 >>> ('Fifth', 5)]
 >>> df = spark.createDataFrame(data)
 >>> df.write.format("orc").mode('overwrite').saveAsTable('redsanket_db.example')
{code}

{code}
-bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
 drwx-- - redsanket users 0 2019-12-04 20:23 /tmp/my_databases/example
{code}

This overwrites the permissions.

Insert into (insertInto), by contrast, honors and preserves the parent directory permissions:

{code}
 >>> spark.sql("DROP table redsanket_db.example");
 DataFrame[]
 >>> spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) 
 >>> STORED AS orc");
 DataFrame[]
 >>> df.write.format("orc").insertInto('redsanket_db.example')
{code}

{code}
-bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
 drwxr-x--T - redsanket users 0 2019-12-04 20:43 /tmp/my_databases/example
{code}

Either this is a limitation of the API based on the write mode, in which case 
the behavior has to be documented, or it needs to be fixed.

  was:
{code}
-bash-4.2$ hdfs dfs -ls /tmp | grep my_databases
 drwxr-x--T - redsanket users 0 2019-12-04 20:15 /tmp/my_databases
{code}

{code}
>>> spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) STORED 
>>> AS orc");
{code}

{code}
-bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
 drwxr-x--T - redsanket users 0 2019-12-04 20:20 /tmp/my_databases/example
{code}

Now after {{saveAsTable}}

{code}
 >>> data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), 
 >>> ('Fifth', 5)]
 >>> df = spark.createDataFrame(data)
 >>> df.write.format("orc").mode('overwrite').saveAsTable('redsanket_db.example')
 {code}

{code}
-bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
 drwx-- - redsanket users 0 2019-12-04 20:23 /tmp/my_databases/example
{code}

 Overwrites the permissions

Insert into honors preserving parent directory permissions.

{code}
 >>> spark.sql("DROP table redsanket_db.example");
 DataFrame[]
 >>> spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) 
 >>> STORED AS orc");
 DataFrame[]
 >>> df.write.format("orc").insertInto('redsanket_db.example')
{code}

{code}
-bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
 drwxr-x--T - redsanket users 0 2019-12-04 20:43 /tmp/my_databases/example
{code}

 It is either limitation of the API based on the mode and the behavior has to 
be documented or needs to be fixed


> saveAsTable does not honor spark.hadoop.hive.warehouse.subdir.inherit.perms
> ---
>
> Key: SPARK-30411
> URL: https://issues.apache.org/jira/browse/SPARK-30411
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Sanket Reddy
>Priority: Minor
>
> {code}
> -bash-4.2$ hdfs dfs -ls /tmp | grep my_databases
>  drwxr-x--T - redsanket users 0 2019-12-04 20:15 /tmp/my_databases
> {code}
> {code}
> >>> spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) 
> >>> STORED AS orc");
> {code}
> {code}
> -bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
>  drwxr-x--T - redsanket users 0 2019-12-04 20:20 /tmp/my_databases/example
> {code}
> Now after {{saveAsTable}}
> {code}
>  >>> data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), 
> ('Fifth', 5)]
>  >>> df = spark.createDataFrame(data)
>  >>> 
> df.write.format("orc").mode('overwrite').saveAsTable('redsanket_db.example')
> {code}
> {code}
> -bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
>  drwx-- - redsanket users 0 2019-12-04 20:23 /tmp/my_databases/example
> {code}
>  Overwrites the permissions
> Insert into honors preserving parent directory permissions.
> {code}
>  >>> spark.sql("DROP table redsanket_db.example");
>  DataFrame[]
>  >>> spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) 
> STORED AS orc");
>  DataFrame[]
>  >>> df.write.format("orc").insertInto('redsanket_db.example')
> {code}
> {code}
> -bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
>  drwxr-x--T - redsanket users 0 2019-12-04 20:43 /tmp/my_databases/example
> {code}
>  It is either limitation of the API based on the mode and the behavior has to 
> be documented or needs to be fixed



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

--

[jira] [Commented] (SPARK-30411) saveAsTable does not honor spark.hadoop.hive.warehouse.subdir.inherit.perms

2020-01-05 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17008556#comment-17008556
 ] 

Hyukjin Kwon commented on SPARK-30411:
--

I think it's because it uses Spark's native ORC implementation, so it makes 
sense that Hive properties are not respected.
Can you try setting `spark.sql.orc.impl` to `hive`?
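
A minimal sketch of that suggestion in Scala (database/table names taken from the report above; whether this actually preserves the inherited permissions is the open question, not a confirmed fix):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.enableHiveSupport().getOrCreate()
// Switch from Spark's native ORC implementation to the Hive one before writing.
spark.conf.set("spark.sql.orc.impl", "hive")

val df = spark.range(5).selectExpr("cast(id as string) as bcookie", "cast(id as int) as ip")
df.write.format("orc").mode("overwrite").saveAsTable("redsanket_db.example")
{code}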

> saveAsTable does not honor spark.hadoop.hive.warehouse.subdir.inherit.perms
> ---
>
> Key: SPARK-30411
> URL: https://issues.apache.org/jira/browse/SPARK-30411
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Sanket Reddy
>Priority: Minor
>
> {code}
> -bash-4.2$ hdfs dfs -ls /tmp | grep my_databases
>  drwxr-x--T - redsanket users 0 2019-12-04 20:15 /tmp/my_databases
> {code}
> {code}
> >>> spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) 
> >>> STORED AS orc");
> {code}
> {code}
> -bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
>  drwxr-x--T - redsanket users 0 2019-12-04 20:20 /tmp/my_databases/example
> {code}
> Now after {{saveAsTable}}
> {code}
>  >>> data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), 
> ('Fifth', 5)]
>  >>> df = spark.createDataFrame(data)
>  >>> 
> df.write.format("orc").mode('overwrite').saveAsTable('redsanket_db.example')
> {code}
> {code}
> -bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
>  drwx-- - redsanket users 0 2019-12-04 20:23 /tmp/my_databases/example
> {code}
>  Overwrites the permissions
> Insert into honors preserving parent directory permissions.
> {code}
>  >>> spark.sql("DROP table redsanket_db.example");
>  DataFrame[]
>  >>> spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) 
> STORED AS orc");
>  DataFrame[]
>  >>> df.write.format("orc").insertInto('redsanket_db.example')
> {code}
> {code}
> -bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
>  drwxr-x--T - redsanket users 0 2019-12-04 20:43 /tmp/my_databases/example
> {code}
>  It is either limitation of the API based on the mode and the behavior has to 
> be documented or needs to be fixed



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30400) Test failure in SQL module on ppc64le

2020-01-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30400:
-
Description: 
I have been trying to build Apache Spark on rhel_7.6/ppc64le; however, the 
test cases are failing in the SQL module with the following error:

{code}
- CREATE TABLE USING AS SELECT based on the file without write permission *** 
FAILED ***
  Expected exception org.apache.spark.SparkException to be thrown, but no 
exception was thrown (CreateTableAsSelectSuite.scala:92)

- create a table, drop it and create another one with the same name *** FAILED 
***
  org.apache.spark.sql.AnalysisException: Table default.jsonTable already 
exists. You need to drop it first.;
at 
org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:159)
  at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
  at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
  at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:115)
  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
  at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365)
  at 
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
  at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
{code}

I would like some help understanding the cause. I am running it on a high-end 
VM with good connectivity.

  was:
I have been trying to build the Apache Spark on rhel_7.6/ppc64le; however, the 
test cases are failing in SQL module with following error :

- CREATE TABLE USING AS SELECT based on the file without write permission *** 
FAILED ***
  Expected exception org.apache.spark.SparkException to be thrown, but no 
exception was thrown (CreateTableAsSelectSuite.scala:92)

- create a table, drop it and create another one with the same name *** FAILED 
***
  org.apache.spark.sql.AnalysisException: Table default.jsonTable already 
exists. You need to drop it first.;
at 
org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:159)
  at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
  at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
  at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:115)
  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
  at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365)
  at 
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
  at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)


Would like some help on understanding the cause for the same . I am running it 
on a High end VM with good connectivity.


> Test failure in SQL module on ppc64le
> -
>
> Key: SPARK-30400
> URL: https://issues.apache.org/jira/browse/SPARK-30400
> Project: Spark
>  Issue Type: Test
>  Components: Build
>Affects Versions: 2.4.0
> Environment: os: rhel 7.6
> arch: ppc64le
>Reporter: AK97
>Priority: Major
>
> I have been trying to build the Apache Spark on rhel_7.6/ppc64le; however, 
> the test cases are failing in SQL module with following error :
> {code}
> - CREATE TABLE USING AS SELECT based on the file without write permission *** 
> FAILED ***
>   Expected exception org.apache.spark.SparkException to be thrown, but no 
> exception was thrown (CreateTableAsSelectSuite.scala:92)
> - create a table, drop it and create another one with the same name *** 
> FAILED ***
>   org.apache.spark.sql.AnalysisException: Table default.jsonTable already 
> exists. You need to drop it first.;
> at 
> org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:159)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:115)
>   at org.apache.spark.sql.Da

[jira] [Resolved] (SPARK-30399) Bucketing does not compatible with partitioning in practice

2020-01-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30399.
--
Resolution: Invalid

[~shay_elbaz], please ask questions on the mailing lists (see 
https://spark.apache.org/community.html). I think you will get a better 
answer there.

> Bucketing does not compatible with partitioning in practice
> ---
>
> Key: SPARK-30399
> URL: https://issues.apache.org/jira/browse/SPARK-30399
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: HDP 2.7
>Reporter: Shay Elbaz
>Priority: Minor
>
> When using Spark Bucketed table, Spark would use as many partitions as the 
> number of buckets for the map-side join 
> (_FileSourceScanExec.createBucketedReadRDD_). This works great for "static" 
> tables, but quite disastrous for _time-partitioned_ tables. In our use case, 
> a daily partitioned key-value table is added 100GB of data every day. So in 
> 100 days there are 10TB of data we want to join with. Aiming to this 
> scenario, we need thousands of buckets if we want every task to successfully 
> *read and sort* all of its data in a map-side join. But in that case, every 
> daily increment would emit thousands of small files, leading to other big 
> issues.
> In practice, and with a hope for some hidden optimization, we set the number 
> of buckets to 1000 and backfilled such a table with 10TB. When trying to join 
> with the smallest input, every executor was killed by Yarn due to over 
> allocating memory in the sorting phase. Even without such failures, it would 
> take every executor an unreasonable amount of time to locally sort all its data.
> A question on SO remained unanswered for a while, so I thought asking here - 
> is it by design that buckets cannot be used in time-partitioned table, or am 
> I doing something wrong?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30397) [pyspark] Writer applied to custom model changes type of keys' dict from int to str

2020-01-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30397:
-
Component/s: ML

> [pyspark] Writer applied to custom model changes type of keys' dict from int 
> to str
> ---
>
> Key: SPARK-30397
> URL: https://issues.apache.org/jira/browse/SPARK-30397
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.4
>Reporter: Jean-Marc Montanier
>Priority: Major
>
> Hello,
>  
> I have a custom model that I'm trying to persist. Within this custom model 
> there is a python dict mapping from int to int. When the model is saved (with 
> write().save('path')), the keys of the dict are modified from int to str.
>  
> You can find below some code to reproduce the issue:
> {code:python}
> #!/usr/bin/env python3
> # -*- coding: utf-8 -*-
> """
> @author: Jean-Marc Montanier
> @date: 2019/12/31
> """
> from pyspark.sql import SparkSession
> from pyspark import keyword_only
> from pyspark.ml import Pipeline, PipelineModel
> from pyspark.ml import Estimator, Model
> from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
> from pyspark.ml.param import Param, Params
> from pyspark.ml.param.shared import HasInputCol, HasOutputCol
> from pyspark.sql.types import IntegerType
> from pyspark.sql.functions import udf
> spark = SparkSession \
> .builder \
> .appName("ImputeNormal") \
> .getOrCreate()
> class CustomFit(Estimator,
> HasInputCol,
> HasOutputCol,
> DefaultParamsReadable,
> DefaultParamsWritable,
> ):
> @keyword_only
> def __init__(self, inputCol="inputCol", outputCol="outputCol"):
> super(CustomFit, self).__init__()
> self._setDefault(inputCol="inputCol", outputCol="outputCol")
> kwargs = self._input_kwargs
> self.setParams(**kwargs)
> @keyword_only
> def setParams(self, inputCol="inputCol", outputCol="outputCol"):
> """
> setParams(self, inputCol="inputCol", outputCol="outputCol")
> """
> kwargs = self._input_kwargs
> self._set(**kwargs)
> return self
> def _fit(self, data):
> inputCol = self.getInputCol()
> outputCol = self.getOutputCol()
> categories = data.where(data[inputCol].isNotNull()) \
> .groupby(inputCol) \
> .count() \
> .orderBy("count", ascending=False) \
> .limit(2)
> categories = dict(categories.toPandas().set_index(inputCol)["count"])
> for cat in categories:
> categories[cat] = int(categories[cat])
> return CustomModel(categories=categories,
>input_col=inputCol,
>output_col=outputCol)
> class CustomModel(Model,
>   DefaultParamsReadable,
>   DefaultParamsWritable):
> input_col = Param(Params._dummy(), "input_col", "Name of the input 
> column")
> output_col = Param(Params._dummy(), "output_col", "Name of the output 
> column")
> categories = Param(Params._dummy(), "categories", "Top categories")
> def __init__(self, categories: dict = None, input_col="input_col", 
> output_col="output_col"):
> super(CustomModel, self).__init__()
> self._set(categories=categories, input_col=input_col, 
> output_col=output_col)
> def get_output_col(self) -> str:
> """
> output_col getter
> :return:
> """
> return self.getOrDefault(self.output_col)
> def get_input_col(self) -> str:
> """
> input_col getter
> :return:
> """
> return self.getOrDefault(self.input_col)
> def get_categories(self):
> """
> categories getter
> :return:
> """
> return self.getOrDefault(self.categories)
> def _transform(self, data):
> input_col = self.get_input_col()
> output_col = self.get_output_col()
> categories = self.get_categories()
> def get_cat(val):
> if val is None:
> return -1
> if val not in categories:
> return -1
> return int(categories[val])
> get_cat_udf = udf(get_cat, IntegerType())
> df = data.withColumn(output_col,
>  get_cat_udf(input_col))
> return df
> def test_without_write():
> fit_df = spark.createDataFrame([[10]] * 5 + [[11]] * 4 + [[12]] * 3 + 
> [[None]] * 2, ['input'])
> custom_fit = CustomFit(inputCol='input', outputCol='output')
> pipeline = Pipeline(stages=[custom_fit])
> pipeline_model = pipeline.fit(fit_df)
> print("Categories: {}".format(pipeline_mod

[jira] [Resolved] (SPARK-30397) [pyspark] Writer applied to custom model changes type of keys' dict from int to str

2020-01-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30397.
--
Resolution: Not A Problem

> [pyspark] Writer applied to custom model changes type of keys' dict from int 
> to str
> ---
>
> Key: SPARK-30397
> URL: https://issues.apache.org/jira/browse/SPARK-30397
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.4
>Reporter: Jean-Marc Montanier
>Priority: Major
>
> Hello,
>  
> I have a custom model that I'm trying to persist. Within this custom model 
> there is a python dict mapping from int to int. When the model is saved (with 
> write().save('path')), the keys of the dict are modified from int to str.
>  
> You can find below some code to reproduce the issue:
> {code:python}
> #!/usr/bin/env python3
> # -*- coding: utf-8 -*-
> """
> @author: Jean-Marc Montanier
> @date: 2019/12/31
> """
> from pyspark.sql import SparkSession
> from pyspark import keyword_only
> from pyspark.ml import Pipeline, PipelineModel
> from pyspark.ml import Estimator, Model
> from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
> from pyspark.ml.param import Param, Params
> from pyspark.ml.param.shared import HasInputCol, HasOutputCol
> from pyspark.sql.types import IntegerType
> from pyspark.sql.functions import udf
> spark = SparkSession \
> .builder \
> .appName("ImputeNormal") \
> .getOrCreate()
> class CustomFit(Estimator,
> HasInputCol,
> HasOutputCol,
> DefaultParamsReadable,
> DefaultParamsWritable,
> ):
> @keyword_only
> def __init__(self, inputCol="inputCol", outputCol="outputCol"):
> super(CustomFit, self).__init__()
> self._setDefault(inputCol="inputCol", outputCol="outputCol")
> kwargs = self._input_kwargs
> self.setParams(**kwargs)
> @keyword_only
> def setParams(self, inputCol="inputCol", outputCol="outputCol"):
> """
> setParams(self, inputCol="inputCol", outputCol="outputCol")
> """
> kwargs = self._input_kwargs
> self._set(**kwargs)
> return self
> def _fit(self, data):
> inputCol = self.getInputCol()
> outputCol = self.getOutputCol()
> categories = data.where(data[inputCol].isNotNull()) \
> .groupby(inputCol) \
> .count() \
> .orderBy("count", ascending=False) \
> .limit(2)
> categories = dict(categories.toPandas().set_index(inputCol)["count"])
> for cat in categories:
> categories[cat] = int(categories[cat])
> return CustomModel(categories=categories,
>input_col=inputCol,
>output_col=outputCol)
> class CustomModel(Model,
>   DefaultParamsReadable,
>   DefaultParamsWritable):
> input_col = Param(Params._dummy(), "input_col", "Name of the input 
> column")
> output_col = Param(Params._dummy(), "output_col", "Name of the output 
> column")
> categories = Param(Params._dummy(), "categories", "Top categories")
> def __init__(self, categories: dict = None, input_col="input_col", 
> output_col="output_col"):
> super(CustomModel, self).__init__()
> self._set(categories=categories, input_col=input_col, 
> output_col=output_col)
> def get_output_col(self) -> str:
> """
> output_col getter
> :return:
> """
> return self.getOrDefault(self.output_col)
> def get_input_col(self) -> str:
> """
> input_col getter
> :return:
> """
> return self.getOrDefault(self.input_col)
> def get_categories(self):
> """
> categories getter
> :return:
> """
> return self.getOrDefault(self.categories)
> def _transform(self, data):
> input_col = self.get_input_col()
> output_col = self.get_output_col()
> categories = self.get_categories()
> def get_cat(val):
> if val is None:
> return -1
> if val not in categories:
> return -1
> return int(categories[val])
> get_cat_udf = udf(get_cat, IntegerType())
> df = data.withColumn(output_col,
>  get_cat_udf(input_col))
> return df
> def test_without_write():
> fit_df = spark.createDataFrame([[10]] * 5 + [[11]] * 4 + [[12]] * 3 + 
> [[None]] * 2, ['input'])
> custom_fit = CustomFit(inputCol='input', outputCol='output')
> pipeline = Pipeline(stages=[custom_fit])
> pipeline_model = pipeline.fit(fit_df)
> print("Categories: {}".format(
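
A note on why this was resolved as Not A Problem: DefaultParamsWritable 
persists param values in JSON metadata, and JSON object keys are always 
strings, so a dict keyed by ints comes back with str keys after a save/load 
round trip. A minimal sketch of the effect and a workaround (not taken from 
the original report):

{code:python}
# Minimal sketch (not from the report): JSON object keys are always strings,
# so a dict persisted as JSON loses the int type of its keys.
import json

categories = {10: 5, 11: 4, 12: 3}
restored = json.loads(json.dumps(categories))
print(restored)   # {'10': 5, '11': 4, '12': 3} -- keys are now str

# One workaround is to normalize the keys after loading the model:
restored = {int(k): v for k, v in restored.items()}
print(restored)   # {10: 5, 11: 4, 12: 3}
{code}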

[jira] [Resolved] (SPARK-30393) Too many ProvisionedThroughputExceededExceptions while recovering from checkpoint

2020-01-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30393.
--
Resolution: Invalid

Please ask questions on the mailing list or Stack Overflow (see 
https://spark.apache.org/community.html); you are likely to get a better answer there.

> Too many ProvisionedThroughputExceededExceptions while recovering from checkpoint
> -
>
> Key: SPARK-30393
> URL: https://issues.apache.org/jira/browse/SPARK-30393
> Project: Spark
>  Issue Type: Question
>  Components: DStreams
>Affects Versions: 2.4.3
> Environment: I am using EMR 5.25.0, Spark 2.4.3, 
> spark-streaming-kinesis-asl 2.4.3 I have 6 r5.4xLarge in my cluster, plenty 
> of memory. 6 kinesis shards, I even increased to 12 shards but still see the 
> kinesis error
>Reporter: Stephen
>Priority: Major
> Attachments: kinesisexceedreadlimit.png, 
> kinesisusagewhilecheckpointrecoveryerror.png, 
> sparkuiwhilecheckpointrecoveryerror.png
>
>
> I have a Spark application which consumes from Kinesis with 6 shards. Data is 
> produced to Kinesis at, at most, 2000 records/second. At non-peak times data 
> only comes in at 200 records/second. Each record is 0.5 KB, so 6 shards are 
> enough to handle that.
> I use reduceByKeyAndWindow and mapWithState in the program, and the sliding 
> window is one hour long.
> Recently I have been trying to checkpoint the application to S3. I am testing 
> this at non-peak time, so the incoming data rate is very low, around 200 
> records/sec. I run the Spark application by creating a new context, and the 
> checkpoint is created at S3, but when I kill the app and restart it, it fails 
> to recover from the checkpoint. The error message is the following, my SparkUI 
> shows all the batches are stuck, and the checkpoint recovery takes a long time 
> to complete, from 15 minutes to over an hour.
> I found lots of error messages in the log related to Kinesis exceeding the 
> read limit:
> {quote}19/12/24 00:15:21 WARN TaskSetManager: Lost task 571.0 in stage 33.0 
> (TID 4452, ip-172-17-32-11.ec2.internal, executor 9): 
> org.apache.spark.SparkException: Gave up after 3 retries while getting shard 
> iterator from sequence number 
> 49601654074184110438492229476281538439036626028298502210, last exception:
> bq. at 
> org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$retryOrTimeout$2.apply(KinesisBackedBlockRDD.scala:288)
> bq. at scala.Option.getOrElse(Option.scala:121)
> bq. at 
> org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.retryOrTimeout(KinesisBackedBlockRDD.scala:282)
> bq. at 
> org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getKinesisIterator(KinesisBackedBlockRDD.scala:246)
> bq. at 
> org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getRecords(KinesisBackedBlockRDD.scala:206)
> bq. at 
> org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:162)
> bq. at 
> org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:133)
> bq. at 
> org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> bq. at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> bq. at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
> bq. at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> bq. at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
> bq. at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> bq. at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
> bq. at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> bq. at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
> bq. at org.apache.spark.scheduler.Task.run(Task.scala:121)
> bq. at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
> bq. at 
> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> bq. at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
> bq. at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> bq. at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> bq. at java.lang.Thread.run(Thread.java:748)
> bq. Caused by: 
> com.amazonaws.services.kinesis.model.ProvisionedThroughputExceededException: 
> Rate exceeded for shard shardId-0004 in stream my-stream-name under 
> account my-account-number. (Service: AmazonKinesis; Status Code: 400; Error 
> Code: Provi
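
For readers hitting the same throttling during checkpoint recovery, a hedged 
sketch of one mitigation is to back off the Kinesis read retries. The keys 
below are the retry settings exposed by spark-streaming-kinesis-asl; treat the 
exact names and defaults as assumptions to verify against your Spark version:

{code:python}
# Hedged sketch, not from the report: back off harder on Kinesis reads so that
# checkpoint recovery does not trip the per-shard read limit as quickly.
from pyspark import SparkConf

conf = (SparkConf()
        .setAppName("kinesis-checkpoint-recovery")
        # wait longer between GetShardIterator/GetRecords retries
        .set("spark.streaming.kinesis.retry.waitTime", "2000ms")
        # allow more attempts before giving up (the report shows 3 retries)
        .set("spark.streaming.kinesis.retry.maxAttempts", "10"))
{code}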

[jira] [Resolved] (SPARK-30372) Modify spark-redshift to work with iam on aws

2020-01-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30372.
--
Resolution: Invalid

It's not a Spark issue. please report at 
https://github.com/databricks/spark-redshift

> Modify spark-redshift to work with iam on aws
> -
>
> Key: SPARK-30372
> URL: https://issues.apache.org/jira/browse/SPARK-30372
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Sveta
>Priority: Trivial
>
> Based on the AWS documentation:
> [https://docs.aws.amazon.com/redshift/latest/mgmt/generating-iam-credentials-configure-jdbc-odbc.html]
> when connecting to Redshift over JDBC with IAM credentials, the username and 
> password may be omitted from the URL.
> However, on the Spark side there are mandatory checks that the URL contains a 
> username and password.
> Examples may be found in: _+com.databricks.spark.redshift.Parameters.scala+_ 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30365) When deploy mode is a client, why doesn't it support remote "spark.files" download?

2020-01-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30365.
--
Resolution: Invalid

Please ask questions on the mailing list or Stack Overflow (see 
https://spark.apache.org/community.html); you are likely to get a better answer there.

> When deploy mode is a client, why doesn't it support remote "spark.files" 
> download?
> ---
>
> Key: SPARK-30365
> URL: https://issues.apache.org/jira/browse/SPARK-30365
> Project: Spark
>  Issue Type: Question
>  Components: Spark Submit
>Affects Versions: 2.3.2
> Environment: {code:java}
>  ./bin/spark-submit \
> --master yarn  \
> --deploy-mode client \
> ..{code}
>Reporter: wangzhun
>Priority: Major
>
> {code:java}
> // In client mode, download remote files.
> var localPrimaryResource: String = null
> var localJars: String = null
> var localPyFiles: String = null
> if (deployMode == CLIENT) {
>   localPrimaryResource = Option(args.primaryResource).map {
> downloadFile(_, targetDir, sparkConf, hadoopConf, secMgr)
>   }.orNull
>   localJars = Option(args.jars).map {
> downloadFileList(_, targetDir, sparkConf, hadoopConf, secMgr)
>   }.orNull
>   localPyFiles = Option(args.pyFiles).map {
> downloadFileList(_, targetDir, sparkConf, hadoopConf, secMgr)
>   }.orNull
> }
> {code}
> The above Spark 2.3 SparkSubmit code does not download the files configured 
> via "spark.files".
> I think it would be possible to download remote files locally and add them to 
> the classpath.
> For example, it could support a remote hive-site.xml passed via the --files 
> configuration (see the sketch below).
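
A hedged workaround sketch (not part of the SparkSubmit code above): in client 
mode a remote file can still be pulled onto the driver's local filesystem 
explicitly with SparkContext.addFile and then resolved with SparkFiles.get. 
The HDFS path below is hypothetical:

{code:python}
# Hedged sketch: fetch a remote file to the driver without relying on
# spark.files localization. addFile accepts local, HDFS, and HTTP(S) URIs and
# also ships the file to executors.
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
sc = spark.sparkContext

sc.addFile("hdfs:///config/hive-site.xml")    # hypothetical remote path
local_path = SparkFiles.get("hive-site.xml")  # local copy on the driver
print(local_path)
{code}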



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30364) The spark-streaming-kafka-0-10_2.11 test cases are failing on ppc64le

2020-01-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30364:
-
Description: 
I have been trying to build the Apache Spark on rhel_7.6/ppc64le; however, the 
spark-streaming-kafka-0-10_2.11 test cases are failing with following error :

{code}
[ERROR] 
/opt/spark/external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaDataConsumerSuite.scala:85:
 Symbol 'term org.eclipse' is missing from the classpath.
This symbol is required by 'method 
org.apache.spark.metrics.MetricsSystem.getServletHandlers'.
Make sure that term eclipse is in your classpath and check for conflicting 
dependencies with `-Ylog-classpath`.
A full rebuild may help if 'MetricsSystem.class' was compiled against an 
incompatible version of org.
[ERROR] testUtils.sendMessages(topic, data.toArray) 
 ^
{code}


I would like some help understanding the cause. I am running it on a high-end 
VM with good connectivity.

  was:
I have been trying to build the Apache Spark on rhel_7.6/ppc64le; however, the 
spark-streaming-kafka-0-10_2.11 test cases are failing with following error :

[ERROR] 
/opt/spark/external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaDataConsumerSuite.scala:85:
 Symbol 'term org.eclipse' is missing from the classpath.
This symbol is required by 'method 
org.apache.spark.metrics.MetricsSystem.getServletHandlers'.
Make sure that term eclipse is in your classpath and check for conflicting 
dependencies with `-Ylog-classpath`.
A full rebuild may help if 'MetricsSystem.class' was compiled against an 
incompatible version of org.
[ERROR] testUtils.sendMessages(topic, data.toArray) 
 ^


I would like some help understanding the cause. I am running it on a high-end 
VM with good connectivity.
This error is also seen on x86.


> The spark-streaming-kafka-0-10_2.11 test cases are failing on ppc64le
> -
>
> Key: SPARK-30364
> URL: https://issues.apache.org/jira/browse/SPARK-30364
> Project: Spark
>  Issue Type: Test
>  Components: Build
>Affects Versions: 2.4.0
> Environment: os: rhel 7.6
> arch: ppc64le
>Reporter: AK97
>Priority: Major
>
> I have been trying to build the Apache Spark on rhel_7.6/ppc64le; however, 
> the spark-streaming-kafka-0-10_2.11 test cases are failing with following 
> error :
> {code}
> [ERROR] 
> /opt/spark/external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaDataConsumerSuite.scala:85:
>  Symbol 'term org.eclipse' is missing from the classpath.
> This symbol is required by 'method 
> org.apache.spark.metrics.MetricsSystem.getServletHandlers'.
> Make sure that term eclipse is in your classpath and check for conflicting 
> dependencies with `-Ylog-classpath`.
> A full rebuild may help if 'MetricsSystem.class' was compiled against an 
> incompatible version of org.
> [ERROR] testUtils.sendMessages(topic, data.toArray)   
>^
> {code}
> I would like some help understanding the cause. I am running it on a high-end 
> VM with good connectivity.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30364) The spark-streaming-kafka-0-10_2.11 test cases are failing on ppc64le

2020-01-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30364:
-
Component/s: DStreams

> The spark-streaming-kafka-0-10_2.11 test cases are failing on ppc64le
> -
>
> Key: SPARK-30364
> URL: https://issues.apache.org/jira/browse/SPARK-30364
> Project: Spark
>  Issue Type: Test
>  Components: Build, DStreams
>Affects Versions: 2.4.0
> Environment: os: rhel 7.6
> arch: ppc64le
>Reporter: AK97
>Priority: Major
>
> I have been trying to build the Apache Spark on rhel_7.6/ppc64le; however, 
> the spark-streaming-kafka-0-10_2.11 test cases are failing with following 
> error :
> {code}
> [ERROR] 
> /opt/spark/external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaDataConsumerSuite.scala:85:
>  Symbol 'term org.eclipse' is missing from the classpath.
> This symbol is required by 'method 
> org.apache.spark.metrics.MetricsSystem.getServletHandlers'.
> Make sure that term eclipse is in your classpath and check for conflicting 
> dependencies with `-Ylog-classpath`.
> A full rebuild may help if 'MetricsSystem.class' was compiled against an 
> incompatible version of org.
> [ERROR] testUtils.sendMessages(topic, data.toArray)   
>^
> {code}
> I would like some help understanding the cause. I am running it on a high-end 
> VM with good connectivity.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30357) SparkContext: Invoking stop() from shutdown hook

2020-01-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30357:
-
Description: 
I'm getting the output below while running a spark-submit job in Kubernetes; I 
didn't get any specific error message

{code}
19/12/26 07:38:18 INFO SparkContext: Invoking stop() from shutdown hook
19/12/26 07:38:18 INFO SparkUI: Stopped Spark web UI at 
http://spark-ml-test-1577345808987-driver-svc.spark-jobs.svc:4040
19/12/26 07:38:18 INFO KubernetesClusterSchedulerBackend: Shutting down all 
executors
19/12/26 07:38:18 INFO 
KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each 
executor to shut down
19/12/26 07:38:18 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has 
been closed (this is expected if the application is shutting down.)
19/12/26 07:38:18 INFO MapOutputTrackerMasterEndpoint: 
MapOutputTrackerMasterEndpoint stopped!
19/12/26 07:38:18 INFO MemoryStore: MemoryStore cleared
19/12/26 07:38:18 INFO BlockManager: BlockManager stopped
19/12/26 07:38:18 INFO BlockManagerMaster: BlockManagerMaster stopped
19/12/26 07:38:18 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
OutputCommitCoordinator stopped!
19/12/26 07:38:18 INFO SparkContext: Successfully stopped SparkContext
19/12/26 07:38:18 INFO ShutdownHookManager: Shutdown hook called
19/12/26 07:38:18 INFO ShutdownHookManager: Deleting directory 
/var/data/spark-ec98f5d3-022d-4eae-87aa-c23a96c77555/spark-b757b85f-3709-4215-8208-95db840a294a/pyspark-665955c9-0dc6-49ee-9b57-0392be430f10
19/12/26 07:38:18 INFO ShutdownHookManager: Deleting directory 
/tmp/spark-fa047fa8-b573-44aa-ab65-e235fb0d5b09
19/12/26 07:38:18 INFO ShutdownHookManager: Deleting directory 
/var/data/spark-ec98f5d3-022d-4eae-87aa-c23a96c77555/spark-b757b85f-3709-4215-8208-95db840a294a
{code}

 

 

 

 

  was:
I'm getting the output below while running a spark-submit job in Kubernetes; I 
didn't get any specific error message

 

19/12/26 07:38:18 INFO SparkContext: Invoking stop() from shutdown hook
19/12/26 07:38:18 INFO SparkUI: Stopped Spark web UI at 
http://spark-ml-test-1577345808987-driver-svc.spark-jobs.svc:4040
19/12/26 07:38:18 INFO KubernetesClusterSchedulerBackend: Shutting down all 
executors
19/12/26 07:38:18 INFO 
KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each 
executor to shut down
19/12/26 07:38:18 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has 
been closed (this is expected if the application is shutting down.)
19/12/26 07:38:18 INFO MapOutputTrackerMasterEndpoint: 
MapOutputTrackerMasterEndpoint stopped!
19/12/26 07:38:18 INFO MemoryStore: MemoryStore cleared
19/12/26 07:38:18 INFO BlockManager: BlockManager stopped
19/12/26 07:38:18 INFO BlockManagerMaster: BlockManagerMaster stopped
19/12/26 07:38:18 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
OutputCommitCoordinator stopped!
19/12/26 07:38:18 INFO SparkContext: Successfully stopped SparkContext
19/12/26 07:38:18 INFO ShutdownHookManager: Shutdown hook called
19/12/26 07:38:18 INFO ShutdownHookManager: Deleting directory 
/var/data/spark-ec98f5d3-022d-4eae-87aa-c23a96c77555/spark-b757b85f-3709-4215-8208-95db840a294a/pyspark-665955c9-0dc6-49ee-9b57-0392be430f10
19/12/26 07:38:18 INFO ShutdownHookManager: Deleting directory 
/tmp/spark-fa047fa8-b573-44aa-ab65-e235fb0d5b09
19/12/26 07:38:18 INFO ShutdownHookManager: Deleting directory 
/var/data/spark-ec98f5d3-022d-4eae-87aa-c23a96c77555/spark-b757b85f-3709-4215-8208-95db840a294a

 

 

 

 


> SparkContext: Invoking stop() from shutdown hook
> 
>
> Key: SPARK-30357
> URL: https://issues.apache.org/jira/browse/SPARK-30357
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Kubernetes, MLlib, PySpark
>Affects Versions: 2.4.4
> Environment: pivotal kubernetes container service PKS  having 3master 
> nodes and 3 worker nodes
>  
> OS : ubuntu for k8-s cluster
>  
>  
>Reporter: Veera
>Priority: Major
>  Labels: newbie
>
> I'm getting the output below while running a spark-submit job in Kubernetes; 
> I didn't get any specific error message
> {code}
> 19/12/26 07:38:18 INFO SparkContext: Invoking stop() from shutdown hook
> 19/12/26 07:38:18 INFO SparkUI: Stopped Spark web UI at 
> http://spark-ml-test-1577345808987-driver-svc.spark-jobs.svc:4040
> 19/12/26 07:38:18 INFO KubernetesClusterSchedulerBackend: Shutting down all 
> executors
> 19/12/26 07:38:18 INFO 
> KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each 
> executor to shut down
> 19/12/26 07:38:18 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has 
> been closed (this is expected if the application is shutting down.)
> 19/12/26 07:38:18 INFO MapOutputTrackerMasterEndpoint: 
> MapOutputTrackerMasterEndpoint stopped!
> 19/12/26 07:38:18 INFO Memory

[jira] [Updated] (SPARK-30357) SparkContext: Invoking stop() from shutdown hook

2020-01-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30357:
-
Target Version/s:   (was: 2.4.4)

> SparkContext: Invoking stop() from shutdown hook
> 
>
> Key: SPARK-30357
> URL: https://issues.apache.org/jira/browse/SPARK-30357
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Kubernetes, MLlib, PySpark
>Affects Versions: 2.4.4
> Environment: pivotal kubernetes container service PKS  having 3master 
> nodes and 3 worker nodes
>  
> OS : ubuntu for k8-s cluster
>  
>  
>Reporter: Veera
>Priority: Major
>  Labels: newbie
>
> I'm getting the output below while running a spark-submit job in Kubernetes; 
> I didn't get any specific error message
> {code}
> 19/12/26 07:38:18 INFO SparkContext: Invoking stop() from shutdown hook
> 19/12/26 07:38:18 INFO SparkUI: Stopped Spark web UI at 
> http://spark-ml-test-1577345808987-driver-svc.spark-jobs.svc:4040
> 19/12/26 07:38:18 INFO KubernetesClusterSchedulerBackend: Shutting down all 
> executors
> 19/12/26 07:38:18 INFO 
> KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each 
> executor to shut down
> 19/12/26 07:38:18 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has 
> been closed (this is expected if the application is shutting down.)
> 19/12/26 07:38:18 INFO MapOutputTrackerMasterEndpoint: 
> MapOutputTrackerMasterEndpoint stopped!
> 19/12/26 07:38:18 INFO MemoryStore: MemoryStore cleared
> 19/12/26 07:38:18 INFO BlockManager: BlockManager stopped
> 19/12/26 07:38:18 INFO BlockManagerMaster: BlockManagerMaster stopped
> 19/12/26 07:38:18 INFO 
> OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
> OutputCommitCoordinator stopped!
> 19/12/26 07:38:18 INFO SparkContext: Successfully stopped SparkContext
> 19/12/26 07:38:18 INFO ShutdownHookManager: Shutdown hook called
> 19/12/26 07:38:18 INFO ShutdownHookManager: Deleting directory 
> /var/data/spark-ec98f5d3-022d-4eae-87aa-c23a96c77555/spark-b757b85f-3709-4215-8208-95db840a294a/pyspark-665955c9-0dc6-49ee-9b57-0392be430f10
> 19/12/26 07:38:18 INFO ShutdownHookManager: Deleting directory 
> /tmp/spark-fa047fa8-b573-44aa-ab65-e235fb0d5b09
> 19/12/26 07:38:18 INFO ShutdownHookManager: Deleting directory 
> /var/data/spark-ec98f5d3-022d-4eae-87aa-c23a96c77555/spark-b757b85f-3709-4215-8208-95db840a294a
> {code}
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30357) SparkContext: Invoking stop() from shutdown hook

2020-01-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30357:
-
Labels:   (was: newbie)

> SparkContext: Invoking stop() from shutdown hook
> 
>
> Key: SPARK-30357
> URL: https://issues.apache.org/jira/browse/SPARK-30357
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Kubernetes, MLlib, PySpark
>Affects Versions: 2.4.4
> Environment: pivotal kubernetes container service PKS  having 3master 
> nodes and 3 worker nodes
>  
> OS : ubuntu for k8-s cluster
>  
>  
>Reporter: Veera
>Priority: Major
>
> I'm getting the output below while running a spark-submit job in Kubernetes; 
> I didn't get any specific error message
> {code}
> 19/12/26 07:38:18 INFO SparkContext: Invoking stop() from shutdown hook
> 19/12/26 07:38:18 INFO SparkUI: Stopped Spark web UI at 
> http://spark-ml-test-1577345808987-driver-svc.spark-jobs.svc:4040
> 19/12/26 07:38:18 INFO KubernetesClusterSchedulerBackend: Shutting down all 
> executors
> 19/12/26 07:38:18 INFO 
> KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each 
> executor to shut down
> 19/12/26 07:38:18 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has 
> been closed (this is expected if the application is shutting down.)
> 19/12/26 07:38:18 INFO MapOutputTrackerMasterEndpoint: 
> MapOutputTrackerMasterEndpoint stopped!
> 19/12/26 07:38:18 INFO MemoryStore: MemoryStore cleared
> 19/12/26 07:38:18 INFO BlockManager: BlockManager stopped
> 19/12/26 07:38:18 INFO BlockManagerMaster: BlockManagerMaster stopped
> 19/12/26 07:38:18 INFO 
> OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
> OutputCommitCoordinator stopped!
> 19/12/26 07:38:18 INFO SparkContext: Successfully stopped SparkContext
> 19/12/26 07:38:18 INFO ShutdownHookManager: Shutdown hook called
> 19/12/26 07:38:18 INFO ShutdownHookManager: Deleting directory 
> /var/data/spark-ec98f5d3-022d-4eae-87aa-c23a96c77555/spark-b757b85f-3709-4215-8208-95db840a294a/pyspark-665955c9-0dc6-49ee-9b57-0392be430f10
> 19/12/26 07:38:18 INFO ShutdownHookManager: Deleting directory 
> /tmp/spark-fa047fa8-b573-44aa-ab65-e235fb0d5b09
> 19/12/26 07:38:18 INFO ShutdownHookManager: Deleting directory 
> /var/data/spark-ec98f5d3-022d-4eae-87aa-c23a96c77555/spark-b757b85f-3709-4215-8208-95db840a294a
> {code}
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30340) Python tests failed on arm64/x86

2020-01-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30340:
-
Description: 
Jenkins job spark-master-test-python-arm failed after the commit 
c6ab7165dd11a0a7b8aea4c805409088e9a41a74:

{code}
File 
"/home/jenkins/workspace/spark-master-test-python-arm/python/pyspark/ml/classification.py",
 line 2790, in __main__.FMClassifier
 Failed example:
 model.transform(test0).select("features", "probability").show(10, False)
 Expected:
 +--++
|features|probability|

+--++
|[-1.0]|[0.97574736,2.425264676902229E-10]|
|[0.5]|[0.47627851732981163,0.5237214826701884]|
|[1.0]|[5.491554426243495E-4,0.9994508445573757]|
|[2.0]|[2.00573870645E-10,0.97994233]|

+--++
 Got:
 +--++
|features|probability|

+--++
|[-1.0]|[0.97574736,2.425264676902229E-10]|
|[0.5]|[0.47627851732981163,0.5237214826701884]|
|[1.0]|[5.491554426243495E-4,0.9994508445573757]|
|[2.0]|[2.00573870645E-10,0.97994233]|

+--++
 
 **
 File 
"/home/jenkins/workspace/spark-master-test-python-arm/python/pyspark/ml/classification.py",
 line 2803, in __main__.FMClassifier
 Failed example:
 model.factors
 Expected:
 DenseMatrix(1, 2, [0.0028, 0.0048], 1)
 Got:
 DenseMatrix(1, 2, [-0.0122, 0.0106], 1)
 **
 2 of 10 in __main__.FMClassifier
 ***Test Failed*** 2 failures.
{code}
 

For the details, see 
[https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-python-arm/91/console]

And it seems the tests also failed on x86:

[https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115668/console]

[https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115665/console]

  was:
Jenkins job spark-master-test-python-arm failed after the commit 
c6ab7165dd11a0a7b8aea4c805409088e9a41a74:

File 
"/home/jenkins/workspace/spark-master-test-python-arm/python/pyspark/ml/classification.py",
 line 2790, in __main__.FMClassifier
 Failed example:
 model.transform(test0).select("features", "probability").show(10, False)
 Expected:
 +--++
|features|probability|

+--++
|[-1.0]|[0.97574736,2.425264676902229E-10]|
|[0.5]|[0.47627851732981163,0.5237214826701884]|
|[1.0]|[5.491554426243495E-4,0.9994508445573757]|
|[2.0]|[2.00573870645E-10,0.97994233]|

+--++
 Got:
 +--++
|features|probability|

+--++
|[-1.0]|[0.97574736,2.425264676902229E-10]|
|[0.5]|[0.47627851732981163,0.5237214826701884]|
|[1.0]|[5.491554426243495E-4,0.9994508445573757]|
|[2.0]|[2.00573870645E-10,0.97994233]|

+--++
 
 **
 File 
"/home/jenkins/workspace/spark-master-test-python-arm/python/pyspark/ml/classification.py",
 line 2803, in __main__.FMClassifier
 Failed example:
 model.factors
 Expected:
 DenseMatrix(1, 2, [0.0028, 0.0048], 1)
 Got:
 DenseMatrix(1, 2, [-0.0122, 0.0106], 1)
 **
 2 of 10 in __main__.FMClassifier
 ***Test Failed*** 2 failures.

 

For the details, see 
[https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-python-arm/91/console]

And it seems the tests also failed on x86:

[https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115668/console]

[https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115665/console]


> Python tests failed on arm64/x86
> 
>
> Key: SPARK-30340
> URL: https://issues.apache.org/jira/browse/SPARK-30340
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: huangtianhua
>Priority: Major
>
> Jenkins job spark-master-test-python-arm failed after the commit 
> c6ab7165dd11a0a7b8aea4c805409088e9a41a74:
> {code}
> File 
> "/home/jenkins/workspace/spark-master-test-python-arm/python/pyspark/ml/classification.py",
>  line 2790, in __main__.FMClassifier
>  Failed example:
>  model.transform(test0).select("features", "probability").show(10, False)
>  Expected:
>  +--++
> |features|probability|
> +--++
> |[-1.0]|[0.97574736,2.425264676902229E-10]
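
As an aside, architecture-sensitive doctest outputs like the ones above are 
commonly stabilized with Python's doctest ELLIPSIS directive so that only the 
stable leading digits are compared. The toy example below is illustrative only, 
not the actual change made in Spark:

{code:python}
# Hedged, illustrative sketch: compare only the leading digits of a float in a
# doctest so tiny numeric differences across architectures do not fail the test.
import doctest

def probability():
    """
    >>> probability()  # doctest: +ELLIPSIS
    0.47627...
    """
    return 0.47627851732981163

if __name__ == "__main__":
    doctest.testmod()
{code}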

[jira] [Commented] (SPARK-30357) SparkContext: Invoking stop() from shutdown hook

2020-01-05 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17008564#comment-17008564
 ] 

Hyukjin Kwon commented on SPARK-30357:
--

Please don't just copy and paste the logs and ask why; that is essentially 
asking for an investigation rather than filing a bug report.
Please describe the analysis you have done and the steps to reproduce.

> SparkContext: Invoking stop() from shutdown hook
> 
>
> Key: SPARK-30357
> URL: https://issues.apache.org/jira/browse/SPARK-30357
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Kubernetes, MLlib, PySpark
>Affects Versions: 2.4.4
> Environment: pivotal kubernetes container service PKS  having 3master 
> nodes and 3 worker nodes
>  
> OS : ubuntu for k8-s cluster
>  
>  
>Reporter: Veera
>Priority: Major
>
> I'm getting the output below while running a spark-submit job in Kubernetes; 
> I didn't get any specific error message
> {code}
> 19/12/26 07:38:18 INFO SparkContext: Invoking stop() from shutdown hook
> 19/12/26 07:38:18 INFO SparkUI: Stopped Spark web UI at 
> http://spark-ml-test-1577345808987-driver-svc.spark-jobs.svc:4040
> 19/12/26 07:38:18 INFO KubernetesClusterSchedulerBackend: Shutting down all 
> executors
> 19/12/26 07:38:18 INFO 
> KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each 
> executor to shut down
> 19/12/26 07:38:18 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has 
> been closed (this is expected if the application is shutting down.)
> 19/12/26 07:38:18 INFO MapOutputTrackerMasterEndpoint: 
> MapOutputTrackerMasterEndpoint stopped!
> 19/12/26 07:38:18 INFO MemoryStore: MemoryStore cleared
> 19/12/26 07:38:18 INFO BlockManager: BlockManager stopped
> 19/12/26 07:38:18 INFO BlockManagerMaster: BlockManagerMaster stopped
> 19/12/26 07:38:18 INFO 
> OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
> OutputCommitCoordinator stopped!
> 19/12/26 07:38:18 INFO SparkContext: Successfully stopped SparkContext
> 19/12/26 07:38:18 INFO ShutdownHookManager: Shutdown hook called
> 19/12/26 07:38:18 INFO ShutdownHookManager: Deleting directory 
> /var/data/spark-ec98f5d3-022d-4eae-87aa-c23a96c77555/spark-b757b85f-3709-4215-8208-95db840a294a/pyspark-665955c9-0dc6-49ee-9b57-0392be430f10
> 19/12/26 07:38:18 INFO ShutdownHookManager: Deleting directory 
> /tmp/spark-fa047fa8-b573-44aa-ab65-e235fb0d5b09
> 19/12/26 07:38:18 INFO ShutdownHookManager: Deleting directory 
> /var/data/spark-ec98f5d3-022d-4eae-87aa-c23a96c77555/spark-b757b85f-3709-4215-8208-95db840a294a
> {code}
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30357) SparkContext: Invoking stop() from shutdown hook

2020-01-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30357.
--
Resolution: Incomplete

> SparkContext: Invoking stop() from shutdown hook
> 
>
> Key: SPARK-30357
> URL: https://issues.apache.org/jira/browse/SPARK-30357
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Kubernetes, MLlib, PySpark
>Affects Versions: 2.4.4
> Environment: pivotal kubernetes container service PKS  having 3master 
> nodes and 3 worker nodes
>  
> OS : ubuntu for k8-s cluster
>  
>  
>Reporter: Veera
>Priority: Major
>
> I'm getting the output below while running a spark-submit job in Kubernetes; 
> I didn't get any specific error message
> {code}
> 19/12/26 07:38:18 INFO SparkContext: Invoking stop() from shutdown hook
> 19/12/26 07:38:18 INFO SparkUI: Stopped Spark web UI at 
> http://spark-ml-test-1577345808987-driver-svc.spark-jobs.svc:4040
> 19/12/26 07:38:18 INFO KubernetesClusterSchedulerBackend: Shutting down all 
> executors
> 19/12/26 07:38:18 INFO 
> KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each 
> executor to shut down
> 19/12/26 07:38:18 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has 
> been closed (this is expected if the application is shutting down.)
> 19/12/26 07:38:18 INFO MapOutputTrackerMasterEndpoint: 
> MapOutputTrackerMasterEndpoint stopped!
> 19/12/26 07:38:18 INFO MemoryStore: MemoryStore cleared
> 19/12/26 07:38:18 INFO BlockManager: BlockManager stopped
> 19/12/26 07:38:18 INFO BlockManagerMaster: BlockManagerMaster stopped
> 19/12/26 07:38:18 INFO 
> OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
> OutputCommitCoordinator stopped!
> 19/12/26 07:38:18 INFO SparkContext: Successfully stopped SparkContext
> 19/12/26 07:38:18 INFO ShutdownHookManager: Shutdown hook called
> 19/12/26 07:38:18 INFO ShutdownHookManager: Deleting directory 
> /var/data/spark-ec98f5d3-022d-4eae-87aa-c23a96c77555/spark-b757b85f-3709-4215-8208-95db840a294a/pyspark-665955c9-0dc6-49ee-9b57-0392be430f10
> 19/12/26 07:38:18 INFO ShutdownHookManager: Deleting directory 
> /tmp/spark-fa047fa8-b573-44aa-ab65-e235fb0d5b09
> 19/12/26 07:38:18 INFO ShutdownHookManager: Deleting directory 
> /var/data/spark-ec98f5d3-022d-4eae-87aa-c23a96c77555/spark-b757b85f-3709-4215-8208-95db840a294a
> {code}
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30335) Clarify behavior of FIRST and LAST without OVER clause.

2020-01-05 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17008566#comment-17008566
 ] 

Hyukjin Kwon commented on SPARK-30335:
--

They are not deterministic.

> Clarify behavior of FIRST and LAST without OVER clause.
> ---
>
> Key: SPARK-30335
> URL: https://issues.apache.org/jira/browse/SPARK-30335
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: xqods9o5ekm3
>Priority: Major
>
> Unlike many databases, Spark SQL allows usage of {{FIRST}} and {{LAST}} in 
> non-analytic contexts.
>  
> At the moment {{FIRST}}
>  
> > first(expr[, isIgnoreNull]) - Returns the first value of {{expr}} for a 
> > group of rows. If {{isIgnoreNull}} is true, returns only non-null values.
>  
> and {{LAST}}
>  
> > last(expr[, isIgnoreNull]) - Returns the last value of {{expr}} for a group 
> > of rows. If {{isIgnoreNull}} is true, returns only non-null values.
>  
> descriptions suggest that their behavior is deterministic, and many users 
> assume that they return specific values, for example for a query like 
>  
> {code:sql}
> SELECT first(foo)
> FROM (
> SELECT * FROM table ORDER BY bar
> )
> {code}
> That, however, doesn't seem to be the case.
> To make the situation worse, it often appears to work (for example on small 
> samples in local mode).
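
To make the non-determinism concrete, a hedged sketch (not from the ticket) of 
a deterministic alternative: rank rows inside a window with an explicit 
ordering and keep the first row per group, instead of relying on first() over 
an unordered group:

{code:python}
# Hedged sketch: a deterministic "first per group" using an explicit ordering,
# instead of first()/last() over an unordered group of rows.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.createDataFrame(
    [("a", 3, "x"), ("a", 1, "y"), ("b", 2, "z")], ["key", "bar", "foo"])

w = Window.partitionBy("key").orderBy("bar")
first_per_key = (df.withColumn("rn", F.row_number().over(w))
                   .where(F.col("rn") == 1)
                   .drop("rn"))
first_per_key.show()
{code}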



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30332) When running a SQL query with LIMIT, catalyst throws a StackOverflow exception

2020-01-05 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17008571#comment-17008571
 ] 

Hyukjin Kwon commented on SPARK-30332:
--

Can you narrow down the problem? Looks impossible to reproduce.

> When running a SQL query with LIMIT, catalyst throws a StackOverflow exception
> -
>
> Key: SPARK-30332
> URL: https://issues.apache.org/jira/browse/SPARK-30332
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: spark version 3.0.0-preview
>Reporter: Izek Greenfield
>Priority: Major
>
> Running that SQL:
> {code:sql}
> SELECT  BT_capital.asof_date,
> BT_capital.run_id,
> BT_capital.v,
> BT_capital.id,
> BT_capital.entity,
> BT_capital.level_1,
> BT_capital.level_2,
> BT_capital.level_3,
> BT_capital.level_4,
> BT_capital.level_5,
> BT_capital.level_6,
> BT_capital.path_bt_capital,
> BT_capital.line_item,
> t0.target_line_item,
> t0.line_description,
> BT_capital.col_item,
> BT_capital.rep_amount,
> root.orgUnitId,
> root.cptyId,
> root.instId,
> root.startDate,
> root.maturityDate,
> root.amount,
> root.nominalAmount,
> root.quantity,
> root.lkupAssetLiability,
> root.lkupCurrency,
> root.lkupProdType,
> root.interestResetDate,
> root.interestResetTerm,
> root.noticePeriod,
> root.historicCostAmount,
> root.dueDate,
> root.lkupResidence,
> root.lkupCountryOfUltimateRisk,
> root.lkupSector,
> root.lkupIndustry,
> root.lkupAccountingPortfolioType,
> root.lkupLoanDepositTerm,
> root.lkupFixedFloating,
> root.lkupCollateralType,
> root.lkupRiskType,
> root.lkupEligibleRefinancing,
> root.lkupHedging,
> root.lkupIsOwnIssued,
> root.lkupIsSubordinated,
> root.lkupIsQuoted,
> root.lkupIsSecuritised,
> root.lkupIsSecuritisedServiced,
> root.lkupIsSyndicated,
> root.lkupIsDeRecognised,
> root.lkupIsRenegotiated,
> root.lkupIsTransferable,
> root.lkupIsNewBusiness,
> root.lkupIsFiduciary,
> root.lkupIsNonPerforming,
> root.lkupIsInterGroup,
> root.lkupIsIntraGroup,
> root.lkupIsRediscounted,
> root.lkupIsCollateral,
> root.lkupIsExercised,
> root.lkupIsImpaired,
> root.facilityId,
> root.lkupIsOTC,
> root.lkupIsDefaulted,
> root.lkupIsSavingsPosition,
> root.lkupIsForborne,
> root.lkupIsDebtRestructuringLoan,
> root.interestRateAAR,
> root.interestRateAPRC,
> root.custom1,
> root.custom2,
> root.custom3,
> root.lkupSecuritisationType,
> root.lkupIsCashPooling,
> root.lkupIsEquityParticipationGTE10,
> root.lkupIsConvertible,
> root.lkupEconomicHedge,
> root.lkupIsNonCurrHeldForSale,
> root.lkupIsEmbeddedDerivative,
> root.lkupLoanPurpose,
> root.lkupRegulated,
> root.lkupRepaymentType,
> root.glAccount,
> root.lkupIsRecourse,
> root.lkupIsNotFullyGuaranteed,
> root.lkupImpairmentStage,
> root.lkupIsEntireAmountWrittenOff,
> root.lkupIsLowCreditRisk,
> root.lkupIsOBSWithinIFRS9,
> root.lkupIsUnderSpecialSurveillance,
> root.lkupProtection,
> root.lkupIsGeneralAllowance,
> root.lkupSectorUltimateRisk,
> root.cptyOrgUnitId,
> root.name,
> root.lkupNationality,
> root.lkupSize,
> root.lkupIsSPV,
> root.lkupIsCentralCounterparty,
> root.lkupIsMMRMFI,
> root.lkupIsKeyManagement,
> root.lkupIsOtherRelatedParty,
> root.lkupResidenceProvince,
> root.lkupIsTradingBook,
> root.entityHierarchy_entityId,
> root.entityHierarchy_Residence,
> root.lkupLocalCurrency,
> root.cpty_entityhierarchy_entityId,
> root.lkupRelationship,
> root.cpty_lkupRelationship,
> root.entityNationality,
> root.lkupRepCurrency,
> root.startDateFinancialYear,
> root.numEmployees,
> root.numEmployeesTotal,
> root.collateralAmount,
> root.guaranteeAmount,
> root.impairmentSpecificIndividual,
> root.impairmentSpecificCollective,
> root.impairmentGeneral,
> root.creditRiskAmount,
> root.provisionSpecificIndividual,
> root.provisionSpecificCollective,
> root.provisionGeneral,
> root.writeOffAmount,
> root.interest,
> root.fairValueAmount,
> root.grossCarryingAmount,
> root.carryingAmount,
> root.code,
> root.lkupInstrumentType,
> root.price,
> root.amountAtIssue,
> root.yield,
> root.totalFacilityAmount,
> root.facility_rate,
> root.spec_indiv_est,
> root.spec_coll_est,
> root.coll_inc_loss,
> root.impairment_amount,
> root.provision_amount,
> root.accumulated_impairment,
> root.exclusionFlag,
> root.lkupIsHoldingCompany,
> root.instrument_startDate,
> root.entityResidence,
> fxRate.enumerator,
> fxRate.lkupFromCurrency,
> fxRate.rate,
> fxRate.custom1,
> fxRate.custom2,
> fxRate.custom3,
> GB_position.lkupIsECGDGuaranteed,
> GB_position.lkupIsMultiAcctOffsetMortgage,
> GB_position.lkupIsIndexLinked,
> GB_position.lkupIsRetail,
> GB_position.lkupCollateralLocation,
> GB_position.percentAboveBBR,
> GB_position.lkupIsMoreInArrears,
> GB_position.lkupIsArrearsCapitalised,
> GB_position.lkupCollateralPossession,
> GB_position.lkupIsLifetimeMort

[jira] [Commented] (SPARK-30328) Fail to write local files with RDD.saveTextFile when setting the incorrect Hadoop configuration files

2020-01-05 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17008572#comment-17008572
 ] 

Hyukjin Kwon commented on SPARK-30328:
--

Why don't you set the Hadoop configuration correctly? I think failing fast 
isn't a horrible idea.

> Fail to write local files with RDD.saveTextFile when setting the incorrect 
> Hadoop configuration files
> -
>
> Key: SPARK-30328
> URL: https://issues.apache.org/jira/browse/SPARK-30328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: chendihao
>Priority: Major
>
> We find that incorrect Hadoop configuration files cause a failure when saving 
> an RDD to the local file system. This is not expected, because we have 
> specified a local URL, and the DataFrame.write.text API does not have this 
> issue. It is easy to reproduce and verify with Spark 2.3.0.
> 1. Do not set the `HADOOP_CONF_DIR` environment variable.
> 2. Install pyspark and run the local Python script below. This should work and 
> save the files to the local file system.
> {code:java}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.master("local").getOrCreate()
> sc = spark.sparkContext
> rdd = sc.parallelize([1, 2, 3])
> rdd.saveAsTextFile("file:///tmp/rdd.text")
> {code}
> 3. Set the `HADOOP_CONF_DIR` environment variable and put the Hadoop 
> configuration files there. Make sure the format of `core-site.xml` is right 
> but that it contains an unresolvable host name.
> 4. Run the same Python script again. It tries to connect to HDFS, finds the 
> unresolvable host name, and a Java exception happens.
> We think `saveAsTextFile` with a `file:///` URL should not attempt to connect 
> to HDFS whether `HADOOP_CONF_DIR` is set or not. In fact, the following 
> DataFrame code works with the same incorrect Hadoop configuration files.
> {code:java}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.master("local").getOrCreate()
> rows = [("a", "1"), ("b", "2")]  # example data
> df = spark.createDataFrame(rows, ["attribute", "value"])
> df.write.parquet("file:///tmp/df.parquet")
> {code}
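
A hedged workaround sketch for readers who hit this (it is not a fix in Spark 
itself): pin the default filesystem for the session through the spark.hadoop.* 
passthrough so that writes to file:// URLs do not go through the unreachable 
HDFS from core-site.xml. Whether this covers every code path is an assumption 
to verify:

{code:python}
# Hedged sketch: override fs.defaultFS for this session only. Values set with
# the spark.hadoop.* prefix are copied into the Hadoop Configuration that
# saveAsTextFile uses.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local")
         .config("spark.hadoop.fs.defaultFS", "file:///")
         .getOrCreate())

rdd = spark.sparkContext.parallelize([1, 2, 3])
rdd.saveAsTextFile("file:///tmp/rdd.text")
{code}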



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30316) data size boom after shuffle writing dataframe save as parquet

2020-01-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30316.
--
Resolution: Invalid

Please show reproducible steps and output files (at least {{ls -al}}). 
Otherwise, no one knows what issue you faced.

> data size boom after shuffle writing dataframe save as parquet
> --
>
> Key: SPARK-30316
> URL: https://issues.apache.org/jira/browse/SPARK-30316
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, SQL
>Affects Versions: 2.4.4
>Reporter: Cesc 
>Priority: Major
>
> When I read the same Parquet file and then save it in two ways, with a shuffle 
> and without a shuffle, I find that the sizes of the output Parquet files are 
> quite different. For example, for an original Parquet file of 800 MB: if I 
> save it without a shuffle, the size is still 800 MB, whereas if I repartition 
> it and then save it in Parquet format, the data size increases to 2.5 GB. The 
> row count, column count, and content of the two output files are all the same.
> I wonder:
> firstly, why does the data size increase after a repartition/shuffle?
> secondly, if I need to shuffle the input dataframe, how can I save it as a 
> Parquet file efficiently and avoid the size blow-up?
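
On the first question, the usual explanation is that Parquet's dictionary and 
run-length encodings compress well when similar values sit next to each other, 
and a shuffle scatters them, so the same rows can take much more space 
afterwards. A hedged sketch of the common workaround, with hypothetical column 
names and paths, is to re-cluster rows inside each partition before the write:

{code:python}
# Hedged sketch (column names and paths are hypothetical): sort within
# partitions after the shuffle so Parquet's dictionary/run-length encoding can
# compress the data again.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()
df = spark.read.parquet("file:///tmp/input.parquet")

(df.repartition(16, "user_id")           # the shuffle that inflated the output
   .sortWithinPartitions("user_id")      # restore clustering inside each file
   .write.mode("overwrite")
   .parquet("file:///tmp/output.parquet"))
{code}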



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30275) Add gitlab-ci.yml file for reproducible builds

2020-01-05 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17008576#comment-17008576
 ] 

Hyukjin Kwon commented on SPARK-30275:
--

What's the benefit of adding it?

> Add gitlab-ci.yml file for reproducible builds
> --
>
> Key: SPARK-30275
> URL: https://issues.apache.org/jira/browse/SPARK-30275
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Jim Kleckner
>Priority: Minor
>
> It would be desirable to have public reproducible builds such as provided by 
> gitlab or others.
>  
> Here is a candidate patch set to build spark using gitlab-ci:
> * https://gitlab.com/jkleckner/spark/tree/add-gitlab-ci-yml
> Let me know if there is interest in a PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30428) File source V2: support partition pruning

2020-01-05 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-30428:
--

 Summary: File source V2: support partition pruning
 Key: SPARK-30428
 URL: https://issues.apache.org/jira/browse/SPARK-30428
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29800) Rewrite non-correlated EXISTS subquery use ScalaSubquery to optimize perf

2020-01-05 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-29800:
--
Summary: Rewrite non-correlated EXISTS subquery use ScalaSubquery to 
optimize perf  (was: Rewrite non-correlated subquery use ScalaSubquery to 
optimize perf)

> Rewrite non-correlated EXISTS subquery use ScalaSubquery to optimize perf
> -
>
> Key: SPARK-29800
> URL: https://issues.apache.org/jira/browse/SPARK-29800
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30429) WideSchemaBenchmark fails with OOM

2020-01-05 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-30429:
--

 Summary: WideSchemaBenchmark fails with OOM
 Key: SPARK-30429
 URL: https://issues.apache.org/jira/browse/SPARK-30429
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


Run WideSchemaBenchmark on the master (commit 
bc16bb1dd095c9e1c8deabf6ac0d528441a81d88) via:
{code}
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain 
org.apache.spark.sql.execution.benchmark.WideSchemaBenchmark"
{code}
This fails with:
{code}
Caused by: java.lang.reflect.InvocationTargetException
[error] at 
sun.reflect.GeneratedConstructorAccessor8.newInstance(Unknown Source)
[error] at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
[error] at 
java.lang.reflect.Constructor.newInstance(Constructor.java:423)
[error] at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$makeCopy$7(TreeNode.scala:468)
[error] at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
[error] at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$makeCopy$1(TreeNode.scala:467)
[error] at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
[error] ... 132 more
[error] Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
[error] at java.util.Arrays.copyOfRange(Arrays.java:3664)
[error] at java.lang.String.(String.java:207)
[error] at java.lang.StringBuilder.toString(StringBuilder.java:407)
[error] at 
org.apache.spark.sql.types.StructType.catalogString(StructType.scala:411)
[error] at 
org.apache.spark.sql.types.StructType.$anonfun$catalogString$1(StructType.scala:410)
[error] at 
org.apache.spark.sql.types.StructType$$Lambda$2441/1040526643.apply(Unknown 
Source)
{code}
Full stack dump is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30429) WideSchemaBenchmark fails with OOM

2020-01-05 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-30429:
---
Attachment: WideSchemaBenchmark_console.txt

> WideSchemaBenchmark fails with OOM
> --
>
> Key: SPARK-30429
> URL: https://issues.apache.org/jira/browse/SPARK-30429
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
> Attachments: WideSchemaBenchmark_console.txt
>
>
> Run WideSchemaBenchmark on the master (commit 
> bc16bb1dd095c9e1c8deabf6ac0d528441a81d88) via:
> {code}
> SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain 
> org.apache.spark.sql.execution.benchmark.WideSchemaBenchmark"
> {code}
> This fails with:
> {code}
> Caused by: java.lang.reflect.InvocationTargetException
> [error]   at 
> sun.reflect.GeneratedConstructorAccessor8.newInstance(Unknown Source)
> [error]   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> [error]   at 
> java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> [error]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$makeCopy$7(TreeNode.scala:468)
> [error]   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
> [error]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$makeCopy$1(TreeNode.scala:467)
> [error]   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
> [error]   ... 132 more
> [error] Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
> [error]   at java.util.Arrays.copyOfRange(Arrays.java:3664)
> [error]   at java.lang.String.(String.java:207)
> [error]   at java.lang.StringBuilder.toString(StringBuilder.java:407)
> [error]   at 
> org.apache.spark.sql.types.StructType.catalogString(StructType.scala:411)
> [error]   at 
> org.apache.spark.sql.types.StructType.$anonfun$catalogString$1(StructType.scala:410)
> [error]   at 
> org.apache.spark.sql.types.StructType$$Lambda$2441/1040526643.apply(Unknown 
> Source)
> {code}
> Full stack dump is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org