[jira] [Commented] (SPARK-23488) Add other missing Catalog methods to Python API

2018-02-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372430#comment-16372430
 ] 

Apache Spark commented on SPARK-23488:
--

User 'drboyer' has created a pull request for this issue:
https://github.com/apache/spark/pull/20658

> Add other missing Catalog methods to Python API
> ---
>
> Key: SPARK-23488
> URL: https://issues.apache.org/jira/browse/SPARK-23488
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.2.1
>Reporter: Devin Boyer
>Priority: Minor
>
> I noticed the Python Catalog API was missing some methods that are present in 
> the Scala API. These would be handy to have in the Python API as well, 
> especially the databaseExists()/tableExists() methods.
> I have a PR ready to open that adds these. All methods added:
>  * databaseExists()
>  * tableExists()
>  * functionExists()
>  * getDatabase()
>  * getTable()
>  * getFunction()
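
For orientation, here is a minimal Scala sketch of the corresponding Catalog calls that 
already exist on the JVM side and that the Python wrappers would mirror (local-mode 
session; the database/table/function names are placeholders):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]").appName("catalog-demo").getOrCreate()

// Existence checks return a Boolean instead of forcing a full
// listDatabases()/listTables()/listFunctions() scan.
spark.catalog.databaseExists("default")            // true
spark.catalog.tableExists("default", "some_table")
spark.catalog.functionExists("my_udf")

// The get* variants return the metadata objects (Database, Table, Function),
// or raise an AnalysisException if the object does not exist.
spark.catalog.getDatabase("default").locationUri

spark.stop()
{code}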



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23488) Add other missing Catalog methods to Python API

2018-02-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23488:


Assignee: Apache Spark

> Add other missing Catalog methods to Python API
> ---
>
> Key: SPARK-23488
> URL: https://issues.apache.org/jira/browse/SPARK-23488
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.2.1
>Reporter: Devin Boyer
>Assignee: Apache Spark
>Priority: Minor
>
> I noticed the Python Catalog API was missing some methods that are present in 
> the Scala API. These would be handy to have in the Python API as well, 
> especially the databaseExists()/tableExists() methods.
> I have a PR ready to open that adds these. All methods added:
>  * databaseExists()
>  * tableExists()
>  * functionExists()
>  * getDatabase()
>  * getTable()
>  * getFunction()



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23488) Add other missing Catalog methods to Python API

2018-02-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23488:


Assignee: (was: Apache Spark)

> Add other missing Catalog methods to Python API
> ---
>
> Key: SPARK-23488
> URL: https://issues.apache.org/jira/browse/SPARK-23488
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.2.1
>Reporter: Devin Boyer
>Priority: Minor
>
> I noticed the Python Catalog API was missing some methods that are present in 
> the Scala API. These would be handy to have in the Python API as well, 
> especially the databaseExists()/tableExists() methods.
> I have a PR ready to open that adds these. All methods added:
>  * databaseExists()
>  * tableExists()
>  * functionExists()
>  * getDatabase()
>  * getTable()
>  * getFunction()



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23488) Add other missing Catalog methods to Python API

2018-02-21 Thread Devin Boyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devin Boyer updated SPARK-23488:

Target Version/s:   (was: 2.2.2, 2.3.1)

> Add other missing Catalog methods to Python API
> ---
>
> Key: SPARK-23488
> URL: https://issues.apache.org/jira/browse/SPARK-23488
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.2.1
>Reporter: Devin Boyer
>Priority: Minor
>
> I noticed the Python Catalog API was missing some methods that are present in 
> the Scala API. These would be handy to have in the Python API as well, 
> especially the databaseExists()/tableExists() methods.
> I have a PR ready to open that adds these. All methods added:
>  * databaseExists()
>  * tableExists()
>  * functionExists()
>  * getDatabase()
>  * getTable()
>  * getFunction()



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23488) Add other missing Catalog methods to Python API

2018-02-21 Thread Devin Boyer (JIRA)
Devin Boyer created SPARK-23488:
---

 Summary: Add other missing Catalog methods to Python API
 Key: SPARK-23488
 URL: https://issues.apache.org/jira/browse/SPARK-23488
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 2.2.1
Reporter: Devin Boyer


I noticed the Python Catalog API was missing some methods that are present in 
the Scala API. These would be handy to have in the Python API as well, 
especially the databaseExists()/tableExists() methods.

I have a PR ready to open that adds these. All methods added:
 * databaseExists()
 * tableExists()
 * functionExists()
 * getDatabase()
 * getTable()
 * getFunction()



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23475) The "stages" page doesn't show any completed stages

2018-02-21 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23475.
-
   Resolution: Fixed
 Assignee: Shixiong Zhu
Fix Version/s: 2.3.0

> The "stages" page doesn't show any completed stages
> ---
>
> Key: SPARK-23475
> URL: https://issues.apache.org/jira/browse/SPARK-23475
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
> Fix For: 2.3.0
>
> Attachments: Screen Shot 2018-02-21 at 12.39.39 AM.png
>
>
> Run "bin/spark-shell --conf spark.ui.retainedJobs=10 --conf 
> spark.ui.retainedStages=10", type the following codes and click the "stages" 
> page, it will not show completed stages:
> {code}
> val rdd = sc.parallelize(0 to 100, 100).repartition(10).cache()
> (1 to 20).foreach { i =>
>rdd.repartition(10).count()
> }
> {code}
> Please see the attached screenshots.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23390) Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7

2018-02-21 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372360#comment-16372360
 ] 

Liang-Chi Hsieh commented on SPARK-23390:
-

{{FileBasedDataSourceSuite}} still seems to be flaky.

 

[https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87603/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/]

 

[https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87600/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/]

 

> Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7
> --
>
> Key: SPARK-23390
> URL: https://issues.apache.org/jira/browse/SPARK-23390
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Sameer Agarwal
>Assignee: Wenchen Fan
>Priority: Major
>
> We're seeing multiple failures in {{FileBasedDataSourceSuite}} in 
> {{spark-branch-2.3-test-sbt-hadoop-2.7}}:
> {code}
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 
> 10.01215805999 seconds. Last failure message: There are 1 possibly leaked 
> file streams..
> {code}
> Here's the full history: 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/
> From a very quick look, these failures seem to be correlated with 
> https://github.com/apache/spark/pull/20479 (cc [~dongjoon]) as evident from 
> the following stack trace (full logs 
> [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]):
>  
> {code}
> [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds)
> 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled)
> 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem 
> connection created at:
> java.lang.Throwable
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173)
>   at 
> org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:254)
>   at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138)
> {code}
> Also, while this might just be a false correlation, the frequency of these 
> test failures has increased considerably in 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/
>  after https://github.com/apache/spark/pull/20562 (cc 
> [~feng...@databricks.com]) was merged.
> The following is a Parquet leak:
> {code}
> Caused by: sbt.ForkMain$ForkError: java.lang.Throwable: null
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:538)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:149)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:400)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:125)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-23404) When the underlying buffers are already direct, we should copy them to the heap memory

2018-02-21 Thread liuxian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian resolved SPARK-23404.
-
Resolution: Invalid

> When the underlying buffers are already direct, we should copy them to the 
> heap memory
> --
>
> Key: SPARK-23404
> URL: https://issues.apache.org/jira/browse/SPARK-23404
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Minor
>
> If the memory mode is _ON_HEAP_, when the underlying buffers are direct, we 
> should copy them to the heap memory.
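
As a side note, a minimal, Spark-independent sketch of what copying a direct buffer onto 
the heap looks like with plain java.nio (illustrative only, not the internal code path this 
ticket referred to):

{code:scala}
import java.nio.ByteBuffer

// Fill a direct (off-heap) buffer and flip it for reading.
val direct = ByteBuffer.allocateDirect(16)
direct.putInt(42).putInt(7)
direct.flip()

// Copy its readable contents into a buffer backed by a byte[] on the JVM heap.
// duplicate() keeps the source buffer's position untouched.
val heap = ByteBuffer.allocate(direct.remaining())
heap.put(direct.duplicate())
heap.flip()

assert(!heap.isDirect && heap.getInt() == 42)
{code}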



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23487) Fix the failure in spark-branch-2.2-lint

2018-02-21 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372337#comment-16372337
 ] 

Xiao Li commented on SPARK-23487:
-

This was likely introduced by https://github.com/apache/spark/pull/19481

> Fix the failure in spark-branch-2.2-lint
> 
>
> Key: SPARK-23487
> URL: https://issues.apache.org/jira/browse/SPARK-23487
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.2-lint/855/
> Fix the style check failure in branch-2.2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23487) Fix the failure in spark-branch-2.2-lint

2018-02-21 Thread Xiao Li (JIRA)
Xiao Li created SPARK-23487:
---

 Summary: Fix the failure in spark-branch-2.2-lint
 Key: SPARK-23487
 URL: https://issues.apache.org/jira/browse/SPARK-23487
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.1
Reporter: Xiao Li


https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.2-lint/855/

Fix the style check failure in branch-2.2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23361) Driver restart fails if it happens after 7 days from app submission

2018-02-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372334#comment-16372334
 ] 

Apache Spark commented on SPARK-23361:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/20657

> Driver restart fails if it happens after 7 days from app submission
> ---
>
> Key: SPARK-23361
> URL: https://issues.apache.org/jira/browse/SPARK-23361
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> If you submit an app that is supposed to run for > 7 days (so using 
> --principal / --keytab in cluster mode), and there's a failure that causes 
> the driver to restart after 7 days (that being the default token lifetime for 
> HDFS), the new driver will fail with an error like the following:
> {noformat}
> Exception in thread "main" 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (lots of uninteresting token info) can't be found in cache
>   at org.apache.hadoop.ipc.Client.call(Client.java:1472)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1409)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
>   at com.sun.proxy.$Proxy16.getFileInfo(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>   at com.sun.proxy.$Proxy17.getFileInfo(Unknown Source)
>   at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2123)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1253)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1249)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1249)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$6.apply(ApplicationMaster.scala:160)
> {noformat}
> Note: lines may not align with actual Apache code because that's our internal 
> build.
> This happens because as part of the app submission, the launcher provides 
> delegation tokens to be used by the AM (=driver in this case), and those are 
> expired at that point in time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23361) Driver restart fails if it happens after 7 days from app submission

2018-02-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23361:


Assignee: Apache Spark

> Driver restart fails if it happens after 7 days from app submission
> ---
>
> Key: SPARK-23361
> URL: https://issues.apache.org/jira/browse/SPARK-23361
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Major
>
> If you submit an app that is supposed to run for > 7 days (so using 
> --principal / --keytab in cluster mode), and there's a failure that causes 
> the driver to restart after 7 days (that being the default token lifetime for 
> HDFS), the new driver will fail with an error like the following:
> {noformat}
> Exception in thread "main" 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (lots of uninteresting token info) can't be found in cache
>   at org.apache.hadoop.ipc.Client.call(Client.java:1472)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1409)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
>   at com.sun.proxy.$Proxy16.getFileInfo(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>   at com.sun.proxy.$Proxy17.getFileInfo(Unknown Source)
>   at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2123)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1253)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1249)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1249)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$6.apply(ApplicationMaster.scala:160)
> {noformat}
> Note: lines may not align with actual Apache code because that's our internal 
> build.
> This happens because as part of the app submission, the launcher provides 
> delegation tokens to be used by the AM (=driver in this case), and those are 
> expired at that point in time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23361) Driver restart fails if it happens after 7 days from app submission

2018-02-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23361:


Assignee: (was: Apache Spark)

> Driver restart fails if it happens after 7 days from app submission
> ---
>
> Key: SPARK-23361
> URL: https://issues.apache.org/jira/browse/SPARK-23361
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> If you submit an app that is supposed to run for > 7 days (so using 
> --principal / --keytab in cluster mode), and there's a failure that causes 
> the driver to restart after 7 days (that being the default token lifetime for 
> HDFS), the new driver will fail with an error like the following:
> {noformat}
> Exception in thread "main" 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (lots of uninteresting token info) can't be found in cache
>   at org.apache.hadoop.ipc.Client.call(Client.java:1472)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1409)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
>   at com.sun.proxy.$Proxy16.getFileInfo(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>   at com.sun.proxy.$Proxy17.getFileInfo(Unknown Source)
>   at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2123)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1253)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1249)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1249)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$6.apply(ApplicationMaster.scala:160)
> {noformat}
> Note: lines may not align with actual Apache code because that's our internal 
> build.
> This happens because as part of the app submission, the launcher provides 
> delegation tokens to be used by the AM (=driver in this case), and those are 
> expired at that point in time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23474) mapWithState + async operations = no checkpointing

2018-02-21 Thread Gabor Somogyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372293#comment-16372293
 ] 

Gabor Somogyi commented on SPARK-23474:
---

[~DLanza] could you share your smallest reproducing example? My app works like 
a charm in local mode.

> mapWithState + async operations = no checkpointing
> --
>
> Key: SPARK-23474
> URL: https://issues.apache.org/jira/browse/SPARK-23474
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.1
>Reporter: Daniel Lanza García
>Priority: Major
>
> In my Spark Streaming job I use mapWithState, which obliges me to enable 
> checkpointing. A job is triggered in each batch by the operation: 
> stream.foreachRDD(rdd.foreachPartition()).
> Under this setup the job was checkpointing every 10 minutes (batches of 1 
> minute).
> Now, I have changed the output operation to async: 
> stream.foreachRDD(rdd.foreachPartitionAsync()).
> But checkpointing is not taking place... I tried checkpointing the RDD which 
> I map with state; it gets checkpointed but does not break the lineage, so the 
> number of tasks keeps growing with every batch.
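
For reference, a self-contained local-mode sketch of the two output patterns contrasted 
above (socket source, running word counts; an illustration of the described setup, not the 
reporter's actual job):

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object MapWithStateOutputSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("mapWithState-sketch")
    val ssc = new StreamingContext(conf, Seconds(60))
    ssc.checkpoint("/tmp/mapwithstate-checkpoint")   // required by mapWithState

    val words = ssc.socketTextStream("localhost", 9999).map(w => (w, 1))

    // Running count per word, kept in the state store.
    val spec = StateSpec.function((word: String, one: Option[Int], state: State[Int]) => {
      val total = state.getOption.getOrElse(0) + one.getOrElse(0)
      state.update(total)
      (word, total)
    })
    val counts = words.mapWithState(spec)

    // Synchronous output operation (the original setup from the report):
    counts.foreachRDD(rdd => rdd.foreachPartition(_.foreach(println)))

    // Asynchronous variant from the report: foreachPartitionAsync submits the job
    // and returns a FutureAction immediately instead of blocking until completion.
    // counts.foreachRDD(rdd => rdd.foreachPartitionAsync(_.foreach(println)))

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}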



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22700) Bucketizer.transform incorrectly drops row containing NaN

2018-02-21 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372292#comment-16372292
 ] 

Joseph K. Bradley commented on SPARK-22700:
---

Resolved for branch-2.2 via https://github.com/apache/spark/pull/20539

> Bucketizer.transform incorrectly drops row containing NaN
> -
>
> Key: SPARK-22700
> URL: https://issues.apache.org/jira/browse/SPARK-22700
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0, 2.3.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
> Fix For: 2.2.2, 2.3.0
>
>
> {code}
> import org.apache.spark.ml.feature._
> val df = spark.createDataFrame(Seq((2.3, 3.0), (Double.NaN, 3.0), (6.7, 
> Double.NaN))).toDF("a", "b")
> val splits = Array(Double.NegativeInfinity, 3.0, Double.PositiveInfinity)
> val bucketizer: Bucketizer = new 
> Bucketizer().setInputCol("a").setOutputCol("aa").setSplits(splits)
> bucketizer.setHandleInvalid("skip")
> scala> df.show
> +---+---+
> |  a|  b|
> +---+---+
> |2.3|3.0|
> |NaN|3.0|
> |6.7|NaN|
> +---+---+
> scala> bucketizer.transform(df).show
> +---+---+---+
> |  a|  b| aa|
> +---+---+---+
> |2.3|3.0|0.0|
> +---+---+---+
> {code}
> When {{handleInvalid}} is set to {{skip}}, the last input row is incorrectly 
> dropped, even though column 'b' is not an input column



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22700) Bucketizer.transform incorrectly drops row containing NaN

2018-02-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-22700:
--
Fix Version/s: 2.2.2

> Bucketizer.transform incorrectly drops row containing NaN
> -
>
> Key: SPARK-22700
> URL: https://issues.apache.org/jira/browse/SPARK-22700
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0, 2.3.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
> Fix For: 2.2.2, 2.3.0
>
>
> {code}
> import org.apache.spark.ml.feature._
> val df = spark.createDataFrame(Seq((2.3, 3.0), (Double.NaN, 3.0), (6.7, 
> Double.NaN))).toDF("a", "b")
> val splits = Array(Double.NegativeInfinity, 3.0, Double.PositiveInfinity)
> val bucketizer: Bucketizer = new 
> Bucketizer().setInputCol("a").setOutputCol("aa").setSplits(splits)
> bucketizer.setHandleInvalid("skip")
> scala> df.show
> +---+---+
> |  a|  b|
> +---+---+
> |2.3|3.0|
> |NaN|3.0|
> |6.7|NaN|
> +---+---+
> scala> bucketizer.transform(df).show
> +---+---+---+
> |  a|  b| aa|
> +---+---+---+
> |2.3|3.0|0.0|
> +---+---+---+
> {code}
> When {{handleInvalid}} is set to {{skip}}, the last input row is incorrectly 
> dropped, even though column 'b' is not an input column



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19737) New analysis rule for reporting unregistered functions without relying on relation resolution

2018-02-21 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372289#comment-16372289
 ] 

Cheng Lian commented on SPARK-19737:


[~LANDAIS Christophe], I filed SPARK-23486 for this. It should be relatively 
straightforward to fix, and I'd like a new contributor to try it as a starter 
task. Thanks for reporting!

> New analysis rule for reporting unregistered functions without relying on 
> relation resolution
> -
>
> Key: SPARK-19737
> URL: https://issues.apache.org/jira/browse/SPARK-19737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Major
> Fix For: 2.2.0
>
>
> Let's consider the following simple SQL query that references an undefined 
> function {{foo}} that is never registered in the function registry:
> {code:sql}
> SELECT foo(a) FROM t
> {code}
> Assuming table {{t}} is a partitioned  temporary view consisting of a large 
> number of files stored on S3, it may take the analyzer a long time before 
> realizing that {{foo}} is not registered yet.
> The reason is that the existing analysis rule {{ResolveFunctions}} requires 
> all child expressions to be resolved first. Therefore, {{ResolveRelations}} 
> has to be executed first to resolve all columns referenced by the unresolved 
> function invocation. This further leads to partition discovery for {{t}}, 
> which may take a long time.
> To address this case, we propose a new lightweight analysis rule 
> {{LookupFunctions}} that
> # Matches all unresolved function invocations
> # Looks up the function names in the function registry
> # Reports an analysis error for any unregistered function
> Since this rule doesn't actually try to resolve the unresolved functions, it 
> doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
> discovery.
> We may put this analysis rule in a separate {{Once}} rule batch that sits 
> between the "Substitution" batch and the "Resolution" batch to avoid running 
> it repeatedly and make sure it gets executed before {{ResolveRelations}}.
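
A rough Scala sketch of the rule described above, against the Catalyst APIs as of Spark 
2.x (simplified and illustrative; the rule actually committed for this ticket differs in 
details such as positioning and error reporting):

{code:scala}
import org.apache.spark.sql.catalyst.analysis.UnresolvedFunction
import org.apache.spark.sql.catalyst.catalog.SessionCatalog
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Walk every expression in the plan and fail fast on any function name that is not
// registered, without waiting for columns or relations to be resolved first.
class LookupFunctions(catalog: SessionCatalog) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.transformAllExpressions {
    case f: UnresolvedFunction if !catalog.functionExists(f.name) =>
      // The real rule raises an AnalysisException here.
      sys.error(s"Undefined function: '${f.name.funcName}'")
  }
}
{code}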



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23486) LookupFunctions should not check the same function name more than once

2018-02-21 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-23486:
---
Labels: starter  (was: )

> LookupFunctions should not check the same function name more than once
> --
>
> Key: SPARK-23486
> URL: https://issues.apache.org/jira/browse/SPARK-23486
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Cheng Lian
>Priority: Major
>  Labels: starter
>
> For a query invoking the same function multiple times, the current 
> {{LookupFunctions}} rule performs a check for each invocation. For users 
> using a Hive metastore as the external catalog, this issues unnecessary 
> metastore accesses and can slow down the analysis phase quite a bit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23486) LookupFunctions should not check the same function name more than once

2018-02-21 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372285#comment-16372285
 ] 

Cheng Lian commented on SPARK-23486:


Please refer to [this 
comment|https://issues.apache.org/jira/browse/SPARK-19737?focusedCommentId=16371377=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16371377]
 for more details.

> LookupFunctions should not check the same function name more than once
> --
>
> Key: SPARK-23486
> URL: https://issues.apache.org/jira/browse/SPARK-23486
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Cheng Lian
>Priority: Major
>
> For a query invoking the same function multiple times, the current 
> {{LookupFunctions}} rule performs a check for each invocation. For users 
> using a Hive metastore as the external catalog, this issues unnecessary 
> metastore accesses and can slow down the analysis phase quite a bit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23486) LookupFunctions should not check the same function name more than once

2018-02-21 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-23486:
--

 Summary: LookupFunctions should not check the same function name 
more than once
 Key: SPARK-23486
 URL: https://issues.apache.org/jira/browse/SPARK-23486
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.1, 2.3.0
Reporter: Cheng Lian


For a query invoking the same function multiple times, the current 
{{LookupFunctions}} rule performs a check for each invocation. For users using 
a Hive metastore as the external catalog, this issues unnecessary metastore 
accesses and can slow down the analysis phase quite a bit.
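
The gist of the intended fix, as a tiny standalone sketch (hypothetical names and a 
stubbed metastore check; the real change operates on Catalyst's FunctionIdentifier 
objects inside the LookupFunctions rule):

{code:scala}
// Function names invoked by a query; duplicates are common in real workloads.
val invoked = Seq("foo", "bar", "foo", "foo", "bar")

// Stand-in for an expensive Hive metastore round trip.
def existsInMetastore(name: String): Boolean = {
  println(s"metastore lookup for '$name'")
  Set("bar").contains(name)
}

// Before: one metastore access per invocation (5 lookups here).
// After:  one access per distinct name (2 lookups here).
val missing = invoked.distinct.filterNot(existsInMetastore)
println(s"undefined functions: $missing")   // List(foo)
{code}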



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23483) Feature parity for Python vs Scala APIs

2018-02-21 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372269#comment-16372269
 ] 

Hyukjin Kwon commented on SPARK-23483:
--

Sure.

> Feature parity for Python vs Scala APIs
> ---
>
> Key: SPARK-23483
> URL: https://issues.apache.org/jira/browse/SPARK-23483
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Investigate the feature parity for Python vs Scala APIs and address them



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23476) Spark will not start in local mode with authentication on

2018-02-21 Thread Gabor Somogyi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-23476:
--
Component/s: Spark Core

> Spark will not start in local mode with authentication on
> -
>
> Key: SPARK-23476
> URL: https://issues.apache.org/jira/browse/SPARK-23476
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell
>Affects Versions: 2.3.0
>Reporter: Gabor Somogyi
>Priority: Minor
>
> If spark is run with "spark.authenticate=true", then it will fail to start in 
> local mode.
> {noformat}
> 17/02/03 12:09:39 ERROR spark.SparkContext: Error initializing SparkContext.
> java.lang.IllegalArgumentException: Error: a secret key must be specified via 
> the spark.authenticate.secret config
>   at 
> org.apache.spark.SecurityManager.generateSecretKey(SecurityManager.scala:401)
>   at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:221)
>   at org.apache.spark.SparkEnv$.create(SparkEnv.scala:258)
>   at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:199)
>   at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:290)
> ...
> {noformat}
> It can be confusing when authentication is turned on by default in a cluster, 
> and one tries to start spark in local mode for a simple test.
> *Workaround*: If {{spark.authenticate=true}} is specified as a cluster wide 
> config, then the following has to be added
> {{--conf "spark.authenticate=false" --conf 
> "spark.shuffle.service.enabled=false" --conf 
> "spark.dynamicAllocation.enabled=false" --conf 
> "spark.network.crypto.enabled=false" --conf 
> "spark.authenticate.enableSaslEncryption=false"}}
> in the spark-submit command.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23475) The "stages" page doesn't show any completed stages

2018-02-21 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-23475:
-
Target Version/s: 2.3.0

> The "stages" page doesn't show any completed stages
> ---
>
> Key: SPARK-23475
> URL: https://issues.apache.org/jira/browse/SPARK-23475
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Priority: Blocker
> Attachments: Screen Shot 2018-02-21 at 12.39.39 AM.png
>
>
> Run "bin/spark-shell --conf spark.ui.retainedJobs=10 --conf 
> spark.ui.retainedStages=10", type the following codes and click the "stages" 
> page, it will not show completed stages:
> {code}
> val rdd = sc.parallelize(0 to 100, 100).repartition(10).cache()
> (1 to 20).foreach { i =>
>rdd.repartition(10).count()
> }
> {code}
> Please see the attached screenshots.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23481) The job page shows wrong stages when some of stages are evicted

2018-02-21 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-23481.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20654
[https://github.com/apache/spark/pull/20654]

> The job page shows wrong stages when some of stages are evicted
> ---
>
> Key: SPARK-23481
> URL: https://issues.apache.org/jira/browse/SPARK-23481
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
> Fix For: 2.3.0
>
> Attachments: Screen Shot 2018-02-21 at 12.39.46 AM.png
>
>
> Run "bin/spark-shell --conf spark.ui.retainedJobs=10 --conf 
> spark.ui.retainedStages=10", type the following codes and click the job 19 
> page, it will show wrong stage ids:
> {code}
> val rdd = sc.parallelize(0 to 100, 100).repartition(10).cache()
> (1 to 20).foreach { i =>
>rdd.repartition(10).count()
> }
> {code}
> Please see the attached screenshots.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23484) Fix possible race condition in KafkaContinuousReader

2018-02-21 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-23484.
---
   Resolution: Fixed
Fix Version/s: 2.3.0
   3.0.0

Issue resolved by pull request 20655
[https://github.com/apache/spark/pull/20655]

> Fix possible race condition in KafkaContinuousReader
> 
>
> Key: SPARK-23484
> URL: https://issues.apache.org/jira/browse/SPARK-23484
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Minor
> Fix For: 3.0.0, 2.3.0
>
>
> var `KafkaContinuousReader.knownPartitions` should be threadsafe as it is 
> accessed from multiple threads - the query thread at the time of reader 
> factory creation, and the epoch tracking thread at the time of 
> `needsReconfiguration`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2018-02-21 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372138#comment-16372138
 ] 

Michael Armbrust commented on SPARK-18057:
--

We generally tend towards "don't break things that are working for people" 
rather than "clean".  See the RDD API for an example :).

I'm increasingly pro just keeping the name and upgrading the client.  If they 
ever break compatibility again we can have yet another artifact name, but I 
hope it doesn't come to that.

> Update structured streaming kafka from 10.0.1 to 10.2.0
> ---
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Priority: Major
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23475) The "stages" page doesn't show any completed stages

2018-02-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372135#comment-16372135
 ] 

Apache Spark commented on SPARK-23475:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/20656

> The "stages" page doesn't show any completed stages
> ---
>
> Key: SPARK-23475
> URL: https://issues.apache.org/jira/browse/SPARK-23475
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Priority: Blocker
> Attachments: Screen Shot 2018-02-21 at 12.39.39 AM.png
>
>
> Run "bin/spark-shell --conf spark.ui.retainedJobs=10 --conf 
> spark.ui.retainedStages=10", type the following codes and click the "stages" 
> page, it will not show completed stages:
> {code}
> val rdd = sc.parallelize(0 to 100, 100).repartition(10).cache()
> (1 to 20).foreach { i =>
>rdd.repartition(10).count()
> }
> {code}
> Please see the attached screenshots.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23485) Kubernetes should support node blacklist

2018-02-21 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372134#comment-16372134
 ] 

Imran Rashid commented on SPARK-23485:
--

Also related to SPARK-16630 ... if that is solved before this for other cluster 
managers, then we should probably roll similar behavior into this for 
Kubernetes too.

> Kubernetes should support node blacklist
> 
>
> Key: SPARK-23485
> URL: https://issues.apache.org/jira/browse/SPARK-23485
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, Scheduler
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Priority: Major
>
> Spark's BlacklistTracker maintains a list of "bad nodes" which it will not 
> use for running tasks (e.g., because of bad hardware).  When running on YARN, 
> this blacklist is used to avoid ever allocating resources on blacklisted 
> nodes: 
> https://github.com/apache/spark/blob/e836c27ce011ca9aef822bef6320b4a7059ec343/resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala#L128
> I'm just beginning to poke around the Kubernetes code, so apologies if this 
> is incorrect -- but I didn't see any references to 
> {{scheduler.nodeBlacklist()}} in {{KubernetesClusterSchedulerBackend}}, so it 
> seems this is missing.  Thought of this while looking at SPARK-19755, a 
> similar issue on Mesos.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23485) Kubernetes should support node blacklist

2018-02-21 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-23485:


 Summary: Kubernetes should support node blacklist
 Key: SPARK-23485
 URL: https://issues.apache.org/jira/browse/SPARK-23485
 Project: Spark
  Issue Type: New Feature
  Components: Kubernetes, Scheduler
Affects Versions: 2.3.0
Reporter: Imran Rashid


Spark's BlacklistTracker maintains a list of "bad nodes" which it will not use 
for running tasks (e.g., because of bad hardware).  When running on YARN, this 
blacklist is used to avoid ever allocating resources on blacklisted nodes: 
https://github.com/apache/spark/blob/e836c27ce011ca9aef822bef6320b4a7059ec343/resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala#L128

I'm just beginning to poke around the Kubernetes code, so apologies if this is 
incorrect -- but I didn't see any references to {{scheduler.nodeBlacklist()}} 
in {{KubernetesClusterSchedulerBackend}}, so it seems this is missing.  Thought 
of this while looking at SPARK-19755, a similar issue on Mesos.
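
To make the requested behavior concrete, a toy sketch of the filtering step (hypothetical 
names; the real change would feed the blacklist into the Kubernetes backend's pod 
placement, analogous to the YARN code linked above):

{code:scala}
// Toy illustration only: exclude blacklisted nodes from the set we are willing to run on.
val nodeBlacklist: Set[String] = Set("node-3")                  // e.g. from scheduler.nodeBlacklist()
val candidateNodes = Seq("node-1", "node-2", "node-3", "node-4")

val usableNodes = candidateNodes.filterNot(nodeBlacklist)       // Set[String] is also String => Boolean
println(usableNodes)                                            // List(node-1, node-2, node-4)
{code}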



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23484) Fix possible race condition in KafkaContinuousReader

2018-02-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23484:


Assignee: Tathagata Das  (was: Apache Spark)

> Fix possible race condition in KafkaContinuousReader
> 
>
> Key: SPARK-23484
> URL: https://issues.apache.org/jira/browse/SPARK-23484
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Minor
>
> var `KafkaContinuousReader.knownPartitions` should be threadsafe as it is 
> accessed from multiple threads - the query thread at the time of reader 
> factory creation, and the epoch tracking thread at the time of 
> `needsReconfiguration`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23484) Fix possible race condition in KafkaContinuousReader

2018-02-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372109#comment-16372109
 ] 

Apache Spark commented on SPARK-23484:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/20655

> Fix possible race condition in KafkaContinuousReader
> 
>
> Key: SPARK-23484
> URL: https://issues.apache.org/jira/browse/SPARK-23484
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Minor
>
> var `KafkaContinuousReader.knownPartitions` should be threadsafe as it is 
> accessed from multiple threads - the query thread at the time of reader 
> factory creation, and the epoch tracking thread at the time of 
> `needsReconfiguration`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23484) Fix possible race condition in KafkaContinuousReader

2018-02-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23484:


Assignee: Apache Spark  (was: Tathagata Das)

> Fix possible race condition in KafkaContinuousReader
> 
>
> Key: SPARK-23484
> URL: https://issues.apache.org/jira/browse/SPARK-23484
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Apache Spark
>Priority: Minor
>
> var `KafkaContinuousReader.knownPartitions` should be threadsafe as it is 
> accessed from multiple threads - the query thread at the time of reader 
> factory creation, and the epoch tracking thread at the time of 
> `needsReconfiguration`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23484) Fix possible race condition in KafkaContinuousReader

2018-02-21 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-23484:
-

 Summary: Fix possible race condition in KafkaContinuousReader
 Key: SPARK-23484
 URL: https://issues.apache.org/jira/browse/SPARK-23484
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Tathagata Das
Assignee: Tathagata Das


var `KafkaContinuousReader.knownPartitions` should be threadsafe as it is 
accessed from multiple threads - the query thread at the time of reader factory 
creation, and the epoch tracking thread at the time of `needsReconfiguration`.
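
A minimal sketch of the visibility concern (illustrative only, not the actual patch): when 
one thread replaces an immutable value and another thread only reads it, marking the 
field @volatile is the lightest-weight fix.

{code:scala}
class PartitionTracker {
  // @volatile guarantees that a write from one thread is visible to reads from another.
  @volatile private var knownPartitions: Set[Int] = Set.empty

  // Called from the thread that creates reader factories.
  def setPartitions(partitions: Set[Int]): Unit = { knownPartitions = partitions }

  // Called from the epoch-tracking thread.
  def needsReconfiguration(latest: Set[Int]): Boolean = latest != knownPartitions
}
{code}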



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15526) Shade JPMML

2018-02-21 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372091#comment-16372091
 ] 

Joseph K. Bradley edited comment on SPARK-15526 at 2/21/18 10:09 PM:
-

[~srowen] I just found that PMML is now listed as a dependency in a lot (all?) 
of the Spark modules: catalyst, mllib-local, etc.  I haven't yet tested to see 
if these 2 PRs caused that change, but it seems likely.

Update: Maybe I built it incorrectly...I don't see this issue in the artifacts 
from RC4.


was (Author: josephkb):
[~srowen] I just found that PMML is now listed as a dependency in a lot (all?) 
of the Spark modules: catalyst, mllib-local, etc.  I haven't yet tested to see 
if these 2 PRs caused that change, but it seems likely.

> Shade JPMML
> ---
>
> Key: SPARK-15526
> URL: https://issues.apache.org/jira/browse/SPARK-15526
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: ML, MLlib
>Affects Versions: 2.0.0
>Reporter: Villu Ruusmann
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.3.0
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> The Spark-MLlib module depends on the JPMML-Model library 
> (org.jpmml:pmml-model:1.2.7) for its PMML export capabilities. The 
> JPMML-Model library is included in the Apache Spark assembly, which makes it 
> very difficult to build and deploy competing PMML exporters that may wish to 
> depend on different versions (typically much newer) of the same library.
> JPMML-Model library classes are not part of Apache Spark public APIs, so it 
> shouldn't be a problem if they are relocated by prepending a prefix 
> "org.spark_project" to their package names using Maven Shade Plugin. The 
> requested treatment is identical to how Google Guava and Jetty dependencies 
> are shaded in the final assembly.
> This issue is raised in relation to the JPMML-SparkML project 
> (https://github.com/jpmml/jpmml-sparkml), which provides PMML export 
> capabilities for Spark ML Pipelines. Currently, application developers who 
> wish to use it must tweak their application classpath, which assumes 
> familiarity with build internals.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-15526) Shade JPMML

2018-02-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15526:
--
Comment: was deleted

(was: [~srowen] I just found that PMML is now listed as a dependency in a lot 
(all?) of the Spark modules: catalyst, mllib-local, etc.  I haven't yet tested 
to see if these 2 PRs caused that change, but it seems likely.

Update: Maybe I built it incorrectly...I don't see this issue in the artifacts 
from RC4.)

> Shade JPMML
> ---
>
> Key: SPARK-15526
> URL: https://issues.apache.org/jira/browse/SPARK-15526
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: ML, MLlib
>Affects Versions: 2.0.0
>Reporter: Villu Ruusmann
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.3.0
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> The Spark-MLlib module depends on the JPMML-Model library 
> (org.jpmml:pmml-model:1.2.7) for its PMML export capabilities. The 
> JPMML-Model library is included in the Apache Spark assembly, which makes it 
> very difficult to build and deploy competing PMML exporters that may wish to 
> depend on different versions (typically much newer) of the same library.
> JPMML-Model library classes are not part of Apache Spark public APIs, so it 
> shouldn't be a problem if they are relocated by prepending a prefix 
> "org.spark_project" to their package names using Maven Shade Plugin. The 
> requested treatment is identical to how Google Guava and Jetty dependencies 
> are shaded in the final assembly.
> This issue is raised in relation to the JPMML-SparkML project 
> (https://github.com/jpmml/jpmml-sparkml), which provides PMML export 
> capabilities for Spark ML Pipelines. Currently, application developers who 
> wish to use it must tweak their application classpath, which assumes 
> familiarity with build internals.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15526) Shade JPMML

2018-02-21 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372091#comment-16372091
 ] 

Joseph K. Bradley commented on SPARK-15526:
---

[~srowen] I just found that PMML is now listed as a dependency in a lot (all?) 
of the Spark modules: catalyst, mllib-local, etc.  I haven't yet tested to see 
if these 2 PRs caused that change, but it seems likely.

> Shade JPMML
> ---
>
> Key: SPARK-15526
> URL: https://issues.apache.org/jira/browse/SPARK-15526
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: ML, MLlib
>Affects Versions: 2.0.0
>Reporter: Villu Ruusmann
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.3.0
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> The Spark-MLlib module depends on the JPMML-Model library 
> (org.jpmml:pmml-model:1.2.7) for its PMML export capabilities. The 
> JPMML-Model library is included in the Apache Spark assembly, which makes it 
> very difficult to build and deploy competing PMML exporters that may wish to 
> depend on different versions (typically much newer) of the same library.
> JPMML-Model library classes are not part of Apache Spark public APIs, so it 
> shouldn't be a problem if they are relocated by prepending a prefix 
> "org.spark_project" to their package names using Maven Shade Plugin. The 
> requested treatment is identical to how Google Guava and Jetty dependencies 
> are shaded in the final assembly.
> This issue is raised in relation to the JPMML-SparkML project 
> (https://github.com/jpmml/jpmml-sparkml), which provides PMML export 
> capabilities for Spark ML Pipelines. Currently, application developers who 
> wish to use it must tweak their application classpath, which assumes 
> familiarity with build internals.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23482) R support for robust regression with Huber loss

2018-02-21 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-23482:
-

 Summary: R support for robust regression with Huber loss
 Key: SPARK-23482
 URL: https://issues.apache.org/jira/browse/SPARK-23482
 Project: Spark
  Issue Type: New Feature
  Components: ML, SparkR
Affects Versions: 2.4.0
Reporter: Joseph K. Bradley


Add support for Huber loss for linear regression in the R API. See the linked JIRA 
for the corresponding Scala/Java change.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23427) spark.sql.autoBroadcastJoinThreshold causing OOM exception in the driver

2018-02-21 Thread Pratik Dhumal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372000#comment-16372000
 ] 

Pratik Dhumal commented on SPARK-23427:
---

Sorry for the very late reply.

I am facing the issue when the autoBroadcastJoinThreshold value is not -1.

Somehow, I couldn't reproduce the same for autoBroadcastJoinThreshold = -1. One 
thing I have noticed is that it goes through more than double the iterations when 
the threshold is -1, but at a certain point it stops iterating and gets stuck 
(with no errors, and only info messages such as *ContextCleaner: Cleaned 
accumulator*).

Also,

This is the *stack trace* I'm getting. 
{code:java}
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at 
java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
at java.lang.StringBuilder.append(StringBuilder.java:136)
at java.lang.StringBuilder.append(StringBuilder.java:131)
at scala.StringContext.standardInterpolator(StringContext.scala:125)
at scala.StringContext.s(StringContext.scala:95)
at 
org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:220)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:54)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2546)
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2192)
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2199)
at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2227)
at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2226)
at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2559)
at org.apache.spark.sql.Dataset.count(Dataset.scala:2226).
{code}
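
For reference, the "-1" (disabled) setting discussed here can also be applied per 
session rather than cluster-wide. A minimal sketch, not taken from the original 
report, assuming the spark-shell's default SparkSession {{spark}}:

{code:scala}
// Sketch only: disable the automatic broadcast-join threshold for the current
// session, i.e. the "autoBroadcastJoinThreshold = -1" case the report says keeps
// driver memory flat.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// The equivalent at submit time would be:
//   --conf spark.sql.autoBroadcastJoinThreshold=-1
{code}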
 

Hope this helps.

Thank you.

 

> spark.sql.autoBroadcastJoinThreshold causing OOM exception in the driver 
> -
>
> Key: SPARK-23427
> URL: https://issues.apache.org/jira/browse/SPARK-23427
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: SPARK 2.0 version
>Reporter: Dhiraj
>Priority: Critical
>
> We are facing an issue around the value of spark.sql.autoBroadcastJoinThreshold.
> With spark.sql.autoBroadcastJoinThreshold set to -1 (disabled), driver memory 
> usage stays flat.
> With any other value (10MB, 5MB, 2MB, 1MB, 10K, 1K), driver memory usage grows 
> at a rate that depends on the size of the autoBroadcastJoinThreshold, and we get 
> an OOM exception. The problem is that the memory used by the auto-broadcast is 
> not being freed up in the driver.
> The application imports Oracle tables as master dataframes, which are persisted. 
> Each job applies a filter to these tables and then registers them as temp view 
> tables. SQL queries are then used to process the data further. At the end, all 
> the intermediate dataframes are unpersisted.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23483) Feature parity for Python vs Scala APIs

2018-02-21 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371905#comment-16371905
 ] 

Xiao Li commented on SPARK-23483:
-

[~hyukjin.kwon] [~bryanc] Could you lead this effort and help the community to 
address the related issues? Thanks!

> Feature parity for Python vs Scala APIs
> ---
>
> Key: SPARK-23483
> URL: https://issues.apache.org/jira/browse/SPARK-23483
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Investigate the feature parity between the Python and Scala APIs and address the gaps.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23475) The "stages" page doesn't show any completed stages

2018-02-21 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-23475:
-
Attachment: (was: Screen Shot 2018-02-21 at 12.39.46 AM.png)

> The "stages" page doesn't show any completed stages
> ---
>
> Key: SPARK-23475
> URL: https://issues.apache.org/jira/browse/SPARK-23475
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Priority: Blocker
> Attachments: Screen Shot 2018-02-21 at 12.39.39 AM.png
>
>
> Run "bin/spark-shell --conf spark.ui.retainedJobs=10 --conf 
> spark.ui.retainedStages=10", type the following codes and click the "stages" 
> page, it will not show completed stages:
> {code}
> val rdd = sc.parallelize(0 to 100, 100).repartition(10).cache()
> (1 to 20).foreach { i =>
>rdd.repartition(10).count()
> }
> {code}
> The stages in the job page is also wrong. Please see the attached screenshots.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23481) The job page shows wrong stages when some of stages are evicted

2018-02-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23481:


Assignee: Apache Spark  (was: Shixiong Zhu)

> The job page shows wrong stages when some of stages are evicted
> ---
>
> Key: SPARK-23481
> URL: https://issues.apache.org/jira/browse/SPARK-23481
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Blocker
> Attachments: Screen Shot 2018-02-21 at 12.39.46 AM.png
>
>
> Run "bin/spark-shell --conf spark.ui.retainedJobs=10 --conf 
> spark.ui.retainedStages=10", type the following codes and click the job 19 
> page, it will show wrong stage ids:
> {code}
> val rdd = sc.parallelize(0 to 100, 100).repartition(10).cache()
> (1 to 20).foreach { i =>
>rdd.repartition(10).count()
> }
> {code}
> Please see the attached screenshots.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23481) The job page shows wrong stages when some of stages are evicted

2018-02-21 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371902#comment-16371902
 ] 

Shixiong Zhu commented on SPARK-23481:
--

[~vanzin] Yep. Here is the fix with a regression test: 
https://github.com/apache/spark/pull/20654

> The job page shows wrong stages when some of stages are evicted
> ---
>
> Key: SPARK-23481
> URL: https://issues.apache.org/jira/browse/SPARK-23481
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
> Attachments: Screen Shot 2018-02-21 at 12.39.46 AM.png
>
>
> Run "bin/spark-shell --conf spark.ui.retainedJobs=10 --conf 
> spark.ui.retainedStages=10", type the following codes and click the job 19 
> page, it will show wrong stage ids:
> {code}
> val rdd = sc.parallelize(0 to 100, 100).repartition(10).cache()
> (1 to 20).foreach { i =>
>rdd.repartition(10).count()
> }
> {code}
> Please see the attached screenshots.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23481) The job page shows wrong stages when some of stages are evicted

2018-02-21 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-23481:


 Summary: The job page shows wrong stages when some of stages are 
evicted
 Key: SPARK-23481
 URL: https://issues.apache.org/jira/browse/SPARK-23481
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.3.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23475) The "stages" page doesn't show any completed stages

2018-02-21 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371897#comment-16371897
 ] 

Shixiong Zhu commented on SPARK-23475:
--

The job page issue is a separate one. Created SPARK-23481 to track it instead.

> The "stages" page doesn't show any completed stages
> ---
>
> Key: SPARK-23475
> URL: https://issues.apache.org/jira/browse/SPARK-23475
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Priority: Blocker
> Attachments: Screen Shot 2018-02-21 at 12.39.39 AM.png, Screen Shot 
> 2018-02-21 at 12.39.46 AM.png
>
>
> Run "bin/spark-shell --conf spark.ui.retainedJobs=10 --conf 
> spark.ui.retainedStages=10", type the following codes and click the "stages" 
> page, it will not show completed stages:
> {code}
> val rdd = sc.parallelize(0 to 100, 100).repartition(10).cache()
> (1 to 20).foreach { i =>
>rdd.repartition(10).count()
> }
> {code}
> The stages in the job page is also wrong. Please see the attached screenshots.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23481) The job page shows wrong stages when some of stages are evicted

2018-02-21 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371901#comment-16371901
 ] 

Marcelo Vanzin commented on SPARK-23481:


[~zsxwing] you assigned this to yourself so I suppose you were working on a 
patch?

Anyway, this seems to fix it:

{code}
diff --git a/core/src/main/scala/org/apache/spark/status/AppStatusStore.scala 
b/core/src/main/scala/org/apache/spark/status/AppStatusStore.scala
index efc2853..3990f9c 100644
--- a/core/src/main/scala/org/apache/spark/status/AppStatusStore.scala
+++ b/core/src/main/scala/org/apache/spark/status/AppStatusStore.scala
@@ -96,7 +96,7 @@ private[spark] class AppStatusStore(
 
   def lastStageAttempt(stageId: Int): v1.StageData = {
 val it = 
store.view(classOf[StageDataWrapper]).index("stageId").reverse().first(stageId)
-  .closeableIterator()
+  .last(stageId).closeableIterator()
 try {
   if (it.hasNext()) {
 it.next().info
{code}


> The job page shows wrong stages when some of stages are evicted
> ---
>
> Key: SPARK-23481
> URL: https://issues.apache.org/jira/browse/SPARK-23481
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
> Attachments: Screen Shot 2018-02-21 at 12.39.46 AM.png
>
>
> Run "bin/spark-shell --conf spark.ui.retainedJobs=10 --conf 
> spark.ui.retainedStages=10", type the following codes and click the job 19 
> page, it will show wrong stage ids:
> {code}
> val rdd = sc.parallelize(0 to 100, 100).repartition(10).cache()
> (1 to 20).foreach { i =>
>rdd.repartition(10).count()
> }
> {code}
> Please see the attached screenshots.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23459) Improve the error message when unknown column is specified in partition columns

2018-02-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371891#comment-16371891
 ] 

Apache Spark commented on SPARK-23459:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/20653

> Improve the error message when unknown column is specified in partition 
> columns
> ---
>
> Key: SPARK-23459
> URL: https://issues.apache.org/jira/browse/SPARK-23459
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> {noformat}
>   test("save with an unknown partition column") {
> withTempDir { dir =>
>   val path = dir.getCanonicalPath
> Seq(1L -> "a").toDF("i", "j").write
>   .format("parquet")
>   .partitionBy("unknownColumn")
>   .save(path)
> }
>   }
> {noformat}
> We got the following error message:
> {noformat}
> Partition column unknownColumn not found in schema 
> StructType(StructField(i,LongType,false), StructField(j,StringType,true));
> {noformat}
> We should not call toString, but catalogString in the function 
> `partitionColumnsSchema` of `PartitioningUtils.scala`
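
For illustration, a minimal sketch of the two schema renderings the description 
contrasts (the commented output is indicative, not verbatim):

{code:scala}
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("i", LongType, nullable = false),
  StructField("j", StringType)))

// toString is what the current error message embeds:
//   StructType(StructField(i,LongType,false), StructField(j,StringType,true))
println(schema.toString)

// catalogString is the shorter form the description proposes:
//   struct<i:bigint,j:string>
println(schema.catalogString)
{code}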



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23459) Improve the error message when unknown column is specified in partition columns

2018-02-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23459:


Assignee: (was: Apache Spark)

> Improve the error message when unknown column is specified in partition 
> columns
> ---
>
> Key: SPARK-23459
> URL: https://issues.apache.org/jira/browse/SPARK-23459
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> {noformat}
>   test("save with an unknown partition column") {
> withTempDir { dir =>
>   val path = dir.getCanonicalPath
> Seq(1L -> "a").toDF("i", "j").write
>   .format("parquet")
>   .partitionBy("unknownColumn")
>   .save(path)
> }
>   }
> {noformat}
> We got the following error message:
> {noformat}
> Partition column unknownColumn not found in schema 
> StructType(StructField(i,LongType,false), StructField(j,StringType,true));
> {noformat}
> We should not call toString, but catalogString in the function 
> `partitionColumnsSchema` of `PartitioningUtils.scala`



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23463) Filter operation fails to handle blank values and evicts rows that even satisfy the filtering condition

2018-02-21 Thread Manan Bakshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manan Bakshi resolved SPARK-23463.
--
Resolution: Not A Problem

> Filter operation fails to handle blank values and evicts rows that even 
> satisfy the filtering condition
> ---
>
> Key: SPARK-23463
> URL: https://issues.apache.org/jira/browse/SPARK-23463
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.1
>Reporter: Manan Bakshi
>Priority: Critical
> Attachments: sample
>
>
> Filter operations were updated in Spark 2.2.0: the Cost-Based Optimizer was 
> introduced to look at the table stats and decide filter selectivity. Since then, 
> however, filter has behaved unexpectedly for blank values. The operation not 
> only drops the rows with blank values but also filters out rows that actually 
> meet the filter criteria.
> Steps to repro
> Consider a simple dataframe with some blank values as below:
> ||dev||val||
> |ALL|0.01|
> |ALL|0.02|
> |ALL|0.004|
> |ALL| |
> |ALL|2.5|
> |ALL|4.5|
> |ALL|45|
> Running a simple filter operation over the val column in this dataframe yields 
> unexpected results. For example, the following query returned an empty dataframe:
> df.filter(df["val"] > 0)
> ||dev||val||
> However, the filter operation works as expected if the 0 in the filter condition 
> is replaced by the float 0.0:
> df.filter(df["val"] > 0.0)
> ||dev||val||
> |ALL|0.01|
> |ALL|0.02|
> |ALL|0.004|
> |ALL|2.5|
> |ALL|4.5|
> |ALL|45|
>  
> Note that this bug only exists in Spark 2.2.0 and later. The previous 
> versions filter as expected for both int (0) and float (0.0) values in the 
> filter condition.
> Also, if there are no blank values, the filter operation works as expected 
> for all versions.
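
Independent of the resolution, one way to make the comparison unambiguous is to 
cast the column explicitly before filtering. A minimal Scala sketch, assuming the 
"val" column was read as a string; the PySpark equivalent would be 
df.filter(df["val"].cast("double") > 0):

{code:scala}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Sketch only: casting first makes the comparison numeric regardless of whether
// the literal is written as 0 or 0.0; blank values become null and are dropped.
def positiveVals(df: DataFrame): DataFrame =
  df.filter(col("val").cast("double") > 0)
{code}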



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23481) The job page shows wrong stages when some of stages are evicted

2018-02-21 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-23481:
-
Description: 
Run "bin/spark-shell --conf spark.ui.retainedJobs=10 --conf 
spark.ui.retainedStages=10", type the following codes and click the job 19 
page, it will show wrong stage ids:

{code}
val rdd = sc.parallelize(0 to 100, 100).repartition(10).cache()

(1 to 20).foreach { i =>
   rdd.repartition(10).count()
}
{code}

Please see the attached screenshots.

  was:
Run "bin/spark-shell --conf spark.ui.retainedJobs=10 --conf 
spark.ui.retainedStages=10", type the following codes and click the job 19 
page, it will not wrong stage ids:

{code}
val rdd = sc.parallelize(0 to 100, 100).repartition(10).cache()

(1 to 20).foreach { i =>
   rdd.repartition(10).count()
}
{code}

Please see the attached screenshots.


> The job page shows wrong stages when some of stages are evicted
> ---
>
> Key: SPARK-23481
> URL: https://issues.apache.org/jira/browse/SPARK-23481
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
> Attachments: Screen Shot 2018-02-21 at 12.39.46 AM.png
>
>
> Run "bin/spark-shell --conf spark.ui.retainedJobs=10 --conf 
> spark.ui.retainedStages=10", type the following codes and click the job 19 
> page, it will show wrong stage ids:
> {code}
> val rdd = sc.parallelize(0 to 100, 100).repartition(10).cache()
> (1 to 20).foreach { i =>
>rdd.repartition(10).count()
> }
> {code}
> Please see the attached screenshots.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23475) The "stages" page doesn't show any completed stages

2018-02-21 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-23475:
-
Description: 
Run "bin/spark-shell --conf spark.ui.retainedJobs=10 --conf 
spark.ui.retainedStages=10", type the following codes and click the "stages" 
page, it will not show completed stages:

{code}
val rdd = sc.parallelize(0 to 100, 100).repartition(10).cache()

(1 to 20).foreach { i =>
   rdd.repartition(10).count()
}
{code}

Please see the attached screenshots.

  was:
Run "bin/spark-shell --conf spark.ui.retainedJobs=10 --conf 
spark.ui.retainedStages=10", type the following codes and click the "stages" 
page, it will not show completed stages:

{code}
val rdd = sc.parallelize(0 to 100, 100).repartition(10).cache()

(1 to 20).foreach { i =>
   rdd.repartition(10).count()
}
{code}

The stages in the job page is also wrong. Please see the attached screenshots.


> The "stages" page doesn't show any completed stages
> ---
>
> Key: SPARK-23475
> URL: https://issues.apache.org/jira/browse/SPARK-23475
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Priority: Blocker
> Attachments: Screen Shot 2018-02-21 at 12.39.39 AM.png
>
>
> Run "bin/spark-shell --conf spark.ui.retainedJobs=10 --conf 
> spark.ui.retainedStages=10", type the following codes and click the "stages" 
> page, it will not show completed stages:
> {code}
> val rdd = sc.parallelize(0 to 100, 100).repartition(10).cache()
> (1 to 20).foreach { i =>
>rdd.repartition(10).count()
> }
> {code}
> Please see the attached screenshots.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23463) Filter operation fails to handle blank values and evicts rows that even satisfy the filtering condition

2018-02-21 Thread Manan Bakshi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371864#comment-16371864
 ] 

Manan Bakshi commented on SPARK-23463:
--

I will go ahead and resolve this. Thanks! 

> Filter operation fails to handle blank values and evicts rows that even 
> satisfy the filtering condition
> ---
>
> Key: SPARK-23463
> URL: https://issues.apache.org/jira/browse/SPARK-23463
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.1
>Reporter: Manan Bakshi
>Priority: Critical
> Attachments: sample
>
>
> Filter operations were updated in Spark 2.2.0: the Cost-Based Optimizer was 
> introduced to look at the table stats and decide filter selectivity. Since then, 
> however, filter has behaved unexpectedly for blank values. The operation not 
> only drops the rows with blank values but also filters out rows that actually 
> meet the filter criteria.
> Steps to repro
> Consider a simple dataframe with some blank values as below:
> ||dev||val||
> |ALL|0.01|
> |ALL|0.02|
> |ALL|0.004|
> |ALL| |
> |ALL|2.5|
> |ALL|4.5|
> |ALL|45|
> Running a simple filter operation over the val column in this dataframe yields 
> unexpected results. For example, the following query returned an empty dataframe:
> df.filter(df["val"] > 0)
> ||dev||val||
> However, the filter operation works as expected if the 0 in the filter condition 
> is replaced by the float 0.0:
> df.filter(df["val"] > 0.0)
> ||dev||val||
> |ALL|0.01|
> |ALL|0.02|
> |ALL|0.004|
> |ALL|2.5|
> |ALL|4.5|
> |ALL|45|
>  
> Note that this bug only exists in Spark 2.2.0 and later. The previous 
> versions filter as expected for both int (0) and float (0.0) values in the 
> filter condition.
> Also, if there are no blank values, the filter operation works as expected 
> for all versions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23481) The job page shows wrong stages when some of stages are evicted

2018-02-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23481:


Assignee: Shixiong Zhu  (was: Apache Spark)

> The job page shows wrong stages when some of stages are evicted
> ---
>
> Key: SPARK-23481
> URL: https://issues.apache.org/jira/browse/SPARK-23481
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
> Attachments: Screen Shot 2018-02-21 at 12.39.46 AM.png
>
>
> Run "bin/spark-shell --conf spark.ui.retainedJobs=10 --conf 
> spark.ui.retainedStages=10", type the following codes and click the job 19 
> page, it will show wrong stage ids:
> {code}
> val rdd = sc.parallelize(0 to 100, 100).repartition(10).cache()
> (1 to 20).foreach { i =>
>rdd.repartition(10).count()
> }
> {code}
> Please see the attached screenshots.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23481) The job page shows wrong stages when some of stages are evicted

2018-02-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371904#comment-16371904
 ] 

Apache Spark commented on SPARK-23481:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/20654

> The job page shows wrong stages when some of stages are evicted
> ---
>
> Key: SPARK-23481
> URL: https://issues.apache.org/jira/browse/SPARK-23481
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
> Attachments: Screen Shot 2018-02-21 at 12.39.46 AM.png
>
>
> Run "bin/spark-shell --conf spark.ui.retainedJobs=10 --conf 
> spark.ui.retainedStages=10", type the following codes and click the job 19 
> page, it will show wrong stage ids:
> {code}
> val rdd = sc.parallelize(0 to 100, 100).repartition(10).cache()
> (1 to 20).foreach { i =>
>rdd.repartition(10).count()
> }
> {code}
> Please see the attached screenshots.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23459) Improve the error message when unknown column is specified in partition columns

2018-02-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23459:


Assignee: Apache Spark

> Improve the error message when unknown column is specified in partition 
> columns
> ---
>
> Key: SPARK-23459
> URL: https://issues.apache.org/jira/browse/SPARK-23459
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>  Labels: starter
>
> {noformat}
>   test("save with an unknown partition column") {
> withTempDir { dir =>
>   val path = dir.getCanonicalPath
> Seq(1L -> "a").toDF("i", "j").write
>   .format("parquet")
>   .partitionBy("unknownColumn")
>   .save(path)
> }
>   }
> {noformat}
> We got the following error message:
> {noformat}
> Partition column unknownColumn not found in schema 
> StructType(StructField(i,LongType,false), StructField(j,StringType,true));
> {noformat}
> We should not call toString, but catalogString in the function 
> `partitionColumnsSchema` of `PartitioningUtils.scala`



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23483) Feature parity for Python vs Scala APIs

2018-02-21 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23483:

Component/s: (was: SQL)
 PySpark

> Feature parity for Python vs Scala APIs
> ---
>
> Key: SPARK-23483
> URL: https://issues.apache.org/jira/browse/SPARK-23483
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Investigate the feature parity between the Python and Scala APIs and address the gaps.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23483) Feature parity for Python vs Scala APIs

2018-02-21 Thread Xiao Li (JIRA)
Xiao Li created SPARK-23483:
---

 Summary: Feature parity for Python vs Scala APIs
 Key: SPARK-23483
 URL: https://issues.apache.org/jira/browse/SPARK-23483
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Affects Versions: 2.3.0
Reporter: Xiao Li


Investigate the feature parity between the Python and Scala APIs and address the gaps.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23481) The job page shows wrong stages when some of stages are evicted

2018-02-21 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-23481:
-
Description: 
Run "bin/spark-shell --conf spark.ui.retainedJobs=10 --conf 
spark.ui.retainedStages=10", type the following codes and click the job 19 
page, it will not wrong stage ids:

{code}
val rdd = sc.parallelize(0 to 100, 100).repartition(10).cache()

(1 to 20).foreach { i =>
   rdd.repartition(10).count()
}
{code}

Please see the attached screenshots.

> The job page shows wrong stages when some of stages are evicted
> ---
>
> Key: SPARK-23481
> URL: https://issues.apache.org/jira/browse/SPARK-23481
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
> Attachments: Screen Shot 2018-02-21 at 12.39.46 AM.png
>
>
> Run "bin/spark-shell --conf spark.ui.retainedJobs=10 --conf 
> spark.ui.retainedStages=10", type the following codes and click the job 19 
> page, it will not wrong stage ids:
> {code}
> val rdd = sc.parallelize(0 to 100, 100).repartition(10).cache()
> (1 to 20).foreach { i =>
>rdd.repartition(10).count()
> }
> {code}
> Please see the attached screenshots.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23481) The job page shows wrong stages when some of stages are evicted

2018-02-21 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-23481:
-
Attachment: Screen Shot 2018-02-21 at 12.39.46 AM.png

> The job page shows wrong stages when some of stages are evicted
> ---
>
> Key: SPARK-23481
> URL: https://issues.apache.org/jira/browse/SPARK-23481
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
> Attachments: Screen Shot 2018-02-21 at 12.39.46 AM.png
>
>
> Run "bin/spark-shell --conf spark.ui.retainedJobs=10 --conf 
> spark.ui.retainedStages=10", type the following codes and click the job 19 
> page, it will not wrong stage ids:
> {code}
> val rdd = sc.parallelize(0 to 100, 100).repartition(10).cache()
> (1 to 20).foreach { i =>
>rdd.repartition(10).count()
> }
> {code}
> Please see the attached screenshots.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21302) history server WebUI show HTTP ERROR 500

2018-02-21 Thread HARIKRISHNAN Ck (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371628#comment-16371628
 ] 

HARIKRISHNAN Ck commented on SPARK-21302:
-

The Spark driver logs and the RM logs should give some pointers to the root cause 
of the issue.

> history server WebUI show HTTP ERROR 500
> 
>
> Key: SPARK-21302
> URL: https://issues.apache.org/jira/browse/SPARK-21302
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.1
>Reporter: Jason Pan
>Priority: Major
> Attachments: npe.PNG, nullpointer.PNG
>
>
> When navigate to history server WebUI, and check incomplete applications, 
> show http 500
> Error logs:
> 17/07/05 20:17:44 INFO ApplicationCacheCheckFilter: Application Attempt 
> app-20170705201715-0005-0ce78623-38db-4d23-a2b2-8cb45bb3f505/None updated; 
> refreshing
> 17/07/05 20:17:44 WARN ServletHandler: 
> /history/app-20170705201715-0005-0ce78623-38db-4d23-a2b2-8cb45bb3f505/executors/
> java.lang.NullPointerException
> at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
> at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
> at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> at org.spark_project.jetty.server.Server.handle(Server.java:499)
> at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
> at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
> at 
> org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
> at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
> at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
> at java.lang.Thread.run(Thread.java:785)
> 17/07/05 20:18:00 WARN ServletHandler: /
> java.lang.NullPointerException
> at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
> at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at 
> org.spark_project.jetty.servlets.gzip.GzipHandler.handle(GzipHandler.java:479)
> at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
> at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> at org.spark_project.jetty.server.Server.handle(Server.java:499)
> at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
> at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
> at 
> org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
> at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
> at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
> at java.lang.Thread.run(Thread.java:785)
> 17/07/05 20:18:17 WARN ServletHandler: /
> java.lang.NullPointerException
> at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23459) Improve the error message when unknown column is specified in partition columns

2018-02-21 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371474#comment-16371474
 ] 

Kazuaki Ishizaki commented on SPARK-23459:
--

I am working for this.

> Improve the error message when unknown column is specified in partition 
> columns
> ---
>
> Key: SPARK-23459
> URL: https://issues.apache.org/jira/browse/SPARK-23459
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> {noformat}
>   test("save with an unknown partition column") {
> withTempDir { dir =>
>   val path = dir.getCanonicalPath
> Seq(1L -> "a").toDF("i", "j").write
>   .format("parquet")
>   .partitionBy("unknownColumn")
>   .save(path)
> }
>   }
> {noformat}
> We got the following error message:
> {noformat}
> Partition column unknownColumn not found in schema 
> StructType(StructField(i,LongType,false), StructField(j,StringType,true));
> {noformat}
> We should not call toString, but catalogString in the function 
> `partitionColumnsSchema` of `PartitioningUtils.scala`



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23476) Spark will not start in local mode with authentication on

2018-02-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371473#comment-16371473
 ] 

Apache Spark commented on SPARK-23476:
--

User 'gaborgsomogyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/20652

> Spark will not start in local mode with authentication on
> -
>
> Key: SPARK-23476
> URL: https://issues.apache.org/jira/browse/SPARK-23476
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.0
>Reporter: Gabor Somogyi
>Priority: Minor
>
> If spark is run with "spark.authenticate=true", then it will fail to start in 
> local mode.
> {noformat}
> 17/02/03 12:09:39 ERROR spark.SparkContext: Error initializing SparkContext.
> java.lang.IllegalArgumentException: Error: a secret key must be specified via 
> the spark.authenticate.secret config
>   at 
> org.apache.spark.SecurityManager.generateSecretKey(SecurityManager.scala:401)
>   at org.apache.spark.SecurityManager.(SecurityManager.scala:221)
>   at org.apache.spark.SparkEnv$.create(SparkEnv.scala:258)
>   at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:199)
>   at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:290)
> ...
> {noformat}
> It can be confusing when authentication is turned on by default in a cluster, 
> and one tries to start spark in local mode for a simple test.
> *Workaround*: If {{spark.authenticate=true}} is specified as a cluster wide 
> config, then the following has to be added
> {{--conf "spark.authenticate=false" --conf 
> "spark.shuffle.service.enabled=false" --conf 
> "spark.dynamicAllocation.enabled=false" --conf 
> "spark.network.crypto.enabled=false" --conf 
> "spark.authenticate.enableSaslEncryption=false"}}
> in the spark-submit command.
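
Going by the error message itself, another way to unblock a quick local test is 
to supply the secret explicitly. A minimal sketch with placeholder values, not a 
recommended configuration:

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: keep spark.authenticate on and provide the secret the error
// message asks for. The app name and secret below are placeholders.
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("auth-local-test")
  .set("spark.authenticate", "true")
  .set("spark.authenticate.secret", "dev-only-secret")
val sc = new SparkContext(conf)
{code}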



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23476) Spark will not start in local mode with authentication on

2018-02-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23476:


Assignee: Apache Spark

> Spark will not start in local mode with authentication on
> -
>
> Key: SPARK-23476
> URL: https://issues.apache.org/jira/browse/SPARK-23476
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.0
>Reporter: Gabor Somogyi
>Assignee: Apache Spark
>Priority: Minor
>
> If spark is run with "spark.authenticate=true", then it will fail to start in 
> local mode.
> {noformat}
> 17/02/03 12:09:39 ERROR spark.SparkContext: Error initializing SparkContext.
> java.lang.IllegalArgumentException: Error: a secret key must be specified via 
> the spark.authenticate.secret config
>   at 
> org.apache.spark.SecurityManager.generateSecretKey(SecurityManager.scala:401)
>   at org.apache.spark.SecurityManager.(SecurityManager.scala:221)
>   at org.apache.spark.SparkEnv$.create(SparkEnv.scala:258)
>   at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:199)
>   at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:290)
> ...
> {noformat}
> It can be confusing when authentication is turned on by default in a cluster, 
> and one tries to start spark in local mode for a simple test.
> *Workaround*: If {{spark.authenticate=true}} is specified as a cluster wide 
> config, then the following has to be added
> {{--conf "spark.authenticate=false" --conf 
> "spark.shuffle.service.enabled=false" --conf 
> "spark.dynamicAllocation.enabled=false" --conf 
> "spark.network.crypto.enabled=false" --conf 
> "spark.authenticate.enableSaslEncryption=false"}}
> in the spark-submit command.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23476) Spark will not start in local mode with authentication on

2018-02-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23476:


Assignee: (was: Apache Spark)

> Spark will not start in local mode with authentication on
> -
>
> Key: SPARK-23476
> URL: https://issues.apache.org/jira/browse/SPARK-23476
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.0
>Reporter: Gabor Somogyi
>Priority: Minor
>
> If spark is run with "spark.authenticate=true", then it will fail to start in 
> local mode.
> {noformat}
> 17/02/03 12:09:39 ERROR spark.SparkContext: Error initializing SparkContext.
> java.lang.IllegalArgumentException: Error: a secret key must be specified via 
> the spark.authenticate.secret config
>   at 
> org.apache.spark.SecurityManager.generateSecretKey(SecurityManager.scala:401)
>   at org.apache.spark.SecurityManager.(SecurityManager.scala:221)
>   at org.apache.spark.SparkEnv$.create(SparkEnv.scala:258)
>   at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:199)
>   at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:290)
> ...
> {noformat}
> It can be confusing when authentication is turned on by default in a cluster, 
> and one tries to start spark in local mode for a simple test.
> *Workaround*: If {{spark.authenticate=true}} is specified as a cluster wide 
> config, then the following has to be added
> {{--conf "spark.authenticate=false" --conf 
> "spark.shuffle.service.enabled=false" --conf 
> "spark.dynamicAllocation.enabled=false" --conf 
> "spark.network.crypto.enabled=false" --conf 
> "spark.authenticate.enableSaslEncryption=false"}}
> in the spark-submit command.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23480) NullPointerException in AppendOnlyMap.growTable

2018-02-21 Thread Erik LaBianca (JIRA)
Erik LaBianca created SPARK-23480:
-

 Summary: NullPointerException in AppendOnlyMap.growTable
 Key: SPARK-23480
 URL: https://issues.apache.org/jira/browse/SPARK-23480
 Project: Spark
  Issue Type: Bug
  Components: Block Manager
Affects Versions: 2.1.1
Reporter: Erik LaBianca


I'm finding a rather strange NPE in AppendOnlyMap. The stack trace is as follows:

 
{noformat}
java.lang.NullPointerException
at 
org.apache.spark.util.collection.AppendOnlyMap.growTable(AppendOnlyMap.scala:248)
at 
org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.growTable(SizeTrackingAppendOnlyMap.scala:38)
at 
org.apache.spark.util.collection.AppendOnlyMap.incrementSize(AppendOnlyMap.scala:204)
at 
org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:147)
at 
org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:194)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748){noformat}
The code in question, according to GitHub, is the following, particularly the 
last line starting with `growThreshold`:

 
{code:java}
/** Double the table's size and re-hash everything */
protected def growTable() {
  // capacity < MAXIMUM_CAPACITY (2 ^ 29) so capacity * 2 won't overflow
  val newCapacity = capacity * 2
  require(newCapacity <= MAXIMUM_CAPACITY, s"Can't contain more than ${growThreshold} elements")
  val newData = new Array[AnyRef](2 * newCapacity)
  val newMask = newCapacity - 1
  // Insert all our old values into the new array. Note that because our old keys are
  // unique, there's no need to check for equality here when we insert.
  var oldPos = 0
  while (oldPos < capacity) {
    if (!data(2 * oldPos).eq(null)) {
      val key = data(2 * oldPos)
      val value = data(2 * oldPos + 1)
      var newPos = rehash(key.hashCode) & newMask
      var i = 1
      var keepGoing = true
      while (keepGoing) {
        val curKey = newData(2 * newPos)
        if (curKey.eq(null)) {
          newData(2 * newPos) = key
          newData(2 * newPos + 1) = value
          keepGoing = false
        } else {
          val delta = i
          newPos = (newPos + delta) & newMask
          i += 1
        }
      }
    }
    oldPos += 1
  }
  data = newData
  capacity = newCapacity
  mask = newMask
  growThreshold = (LOAD_FACTOR * newCapacity).toInt
}
{code}
 

Unfortunately I haven't got a simple repro case for this; it's coming from 
production logs as I try to track down seemingly random executor failures. 
Offhand, I suspect the issue is an OOM condition that isn't being reported well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23475) The "stages" page doesn't show any completed stages

2018-02-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23475:


Assignee: (was: Apache Spark)

> The "stages" page doesn't show any completed stages
> ---
>
> Key: SPARK-23475
> URL: https://issues.apache.org/jira/browse/SPARK-23475
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Priority: Blocker
> Attachments: Screen Shot 2018-02-21 at 12.39.39 AM.png, Screen Shot 
> 2018-02-21 at 12.39.46 AM.png
>
>
> Run "bin/spark-shell --conf spark.ui.retainedJobs=10 --conf 
> spark.ui.retainedStages=10", type the following codes and click the "stages" 
> page, it will not show completed stages:
> {code}
> val rdd = sc.parallelize(0 to 100, 100).repartition(10).cache()
> (1 to 20).foreach { i =>
>rdd.repartition(10).count()
> }
> {code}
> The stages in the job page is also wrong. Please see the attached screenshots.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23475) The "stages" page doesn't show any completed stages

2018-02-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23475:


Assignee: Apache Spark

> The "stages" page doesn't show any completed stages
> ---
>
> Key: SPARK-23475
> URL: https://issues.apache.org/jira/browse/SPARK-23475
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Blocker
> Attachments: Screen Shot 2018-02-21 at 12.39.39 AM.png, Screen Shot 
> 2018-02-21 at 12.39.46 AM.png
>
>
> Run "bin/spark-shell --conf spark.ui.retainedJobs=10 --conf 
> spark.ui.retainedStages=10", type the following codes and click the "stages" 
> page, it will not show completed stages:
> {code}
> val rdd = sc.parallelize(0 to 100, 100).repartition(10).cache()
> (1 to 20).foreach { i =>
>rdd.repartition(10).count()
> }
> {code}
> The stages in the job page is also wrong. Please see the attached screenshots.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23475) The "stages" page doesn't show any completed stages

2018-02-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371378#comment-16371378
 ] 

Apache Spark commented on SPARK-23475:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/20651

> The "stages" page doesn't show any completed stages
> ---
>
> Key: SPARK-23475
> URL: https://issues.apache.org/jira/browse/SPARK-23475
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Priority: Blocker
> Attachments: Screen Shot 2018-02-21 at 12.39.39 AM.png, Screen Shot 
> 2018-02-21 at 12.39.46 AM.png
>
>
> Run "bin/spark-shell --conf spark.ui.retainedJobs=10 --conf 
> spark.ui.retainedStages=10", type the following codes and click the "stages" 
> page, it will not show completed stages:
> {code}
> val rdd = sc.parallelize(0 to 100, 100).repartition(10).cache()
> (1 to 20).foreach { i =>
>rdd.repartition(10).count()
> }
> {code}
> The stages in the job page is also wrong. Please see the attached screenshots.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19737) New analysis rule for reporting unregistered functions without relying on relation resolution

2018-02-21 Thread LANDAIS Christophe (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371377#comment-16371377
 ] 

LANDAIS Christophe commented on SPARK-19737:


Hello,

Migrating our application from Spark 2.1.1 to Spark 2.2.1, we see a major 
degradation in Spark SQL timing. One insert takes 5 seconds in 2.1.1 and 75 
seconds in 2.2.1. Looking at the executor traces (I forced the configuration to 
one executor), we see that the time is spent between the call to 
spark.sql("insert into") and the moment the task is submitted to the executor.

My application traces :

2018-02-21 06:30:53 - Executor[1] Going to execute request …

2018-02-21 06:32:08 - Executor[1] request executed (tag: NO_TAG) (table: 
ca4mn.sys_4g_pcmd_mme_15min) (date: 20180221061500) - duration (s)  74.846

 

Executor trace :

18/02/21 06:30:52 INFO Executor: Finished task 0.0 in stage 3.0 (TID 1). 4675 
bytes result sent to driver  (landais note: this is the previous task that is 
terminated)

18/02/21 06:32:06 INFO CoarseGrainedExecutorBackend: Got assigned task 2

 

What is Spark doing between 06:30:53 and 06:32:06? I took several thread dumps 
in the container while execution was in progress, with a delay of 2 seconds 
between dumps. They are identical. A thread dump is included at the end of this 
comment.

The thread dump shows that the time is spent verifying that functions exist: this 
is the SPARK-19737 modification.

My SQL request contains 1000 function calls because we are aggregating over many 
columns. The functions are things like MAX, MIN, etc.

 

Please, can you make a modification that improves this check? For example, doing 
only one check per distinct function (a rough sketch follows), or introducing a 
Spark parameter to bypass the check?
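
A hypothetical sketch of that first idea; this is not Spark's actual analyzer 
code, and functionExists stands in for whatever lookup the rule performs against 
the metastore:

{code:scala}
import scala.collection.mutable

// Hypothetical sketch only: memoise the existence check per distinct function
// name, so 1000 occurrences of MAX/MIN/etc. cost one lookup each rather than
// one metastore round trip per occurrence.
val existsCache = mutable.Map.empty[String, Boolean]

def cachedFunctionExists(name: String)(functionExists: String => Boolean): Boolean =
  existsCache.getOrElseUpdate(name.toLowerCase, functionExists(name))
{code}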



Thread dump:

{noformat}
"Executor[1]" #95 prio=5 os_prio=0 tid=0x7f587f355800 nid=0x7c runnable [0x7f57549f7000]
   java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    - locked <0x8913b110> (a java.io.BufferedInputStream)
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
    at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
    at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
    at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77)
    at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_database(ThriftHiveMetastore.java:654)
    at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_database(ThriftHiveMetastore.java:641)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDatabase(HiveMetaStoreClient.java:1158)
    at sun.reflect.GeneratedMethodAccessor43.invoke(Unknown Source)  **
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
    at com.sun.proxy.$Proxy31.getDatabase(Unknown Source)
    at org.apache.hadoop.hive.ql.metadata.Hive.getDatabase(Hive.java:1301)
    at org.apache.hadoop.hive.ql.metadata.Hive.databaseExists(Hive.java:1290)  **
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$databaseExists$1.apply$mcZ$sp(HiveClientImpl.scala:358)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$databaseExists$1.apply(HiveClientImpl.scala:358)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$databaseExists$1.apply(HiveClientImpl.scala:358)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:290)
    at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:231)
    at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:230)
    - locked <0x8900dd88> (a org.apache.spark.sql.hive.client.IsolatedClientLoader)
    at ...
{noformat}

[jira] [Commented] (SPARK-23475) The "stages" page doesn't show any completed stages

2018-02-21 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371347#comment-16371347
 ] 

Marco Gaido commented on SPARK-23475:
-

The reason for this behavior is that SKIPPED stages, which were previously shown 
in the PENDING table, are no longer shown. This was introduced by SPARK-20648. 
I will submit a fix soon.

> The "stages" page doesn't show any completed stages
> ---
>
> Key: SPARK-23475
> URL: https://issues.apache.org/jira/browse/SPARK-23475
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Priority: Blocker
> Attachments: Screen Shot 2018-02-21 at 12.39.39 AM.png, Screen Shot 
> 2018-02-21 at 12.39.46 AM.png
>
>
> Run "bin/spark-shell --conf spark.ui.retainedJobs=10 --conf 
> spark.ui.retainedStages=10", type the following codes and click the "stages" 
> page, it will not show completed stages:
> {code}
> val rdd = sc.parallelize(0 to 100, 100).repartition(10).cache()
> (1 to 20).foreach { i =>
>rdd.repartition(10).count()
> }
> {code}
> The stages in the job page is also wrong. Please see the attached screenshots.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23473) spark.catalog.listTables error when database name starts with a number

2018-02-21 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371279#comment-16371279
 ] 

Marco Gaido edited comment on SPARK-23473 at 2/21/18 11:53 AM:
---

Your stack trace points out the real issue:

{code}
18/02/21 15:47:45 ERROR log: error in initSerDe: 
java.lang.ClassNotFoundException Class org.apache.hadoop.hive.hbase.HBaseSerDe 
not found
java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.hbase.HBaseSerDe 
not found
{code}

There is no problem with databases starting with a number. The problem is that 
you have an external table pointing to HBase in that database and you have not 
added the HBase jars to Spark.

Therefore I am closing this JIRA as Invalid.


was (Author: mgaido):
Your stack trace points out the real issue:

{code}
18/02/21 15:47:45 ERROR log: error in initSerDe: 
java.lang.ClassNotFoundException Class org.apache.hadoop.hive.hbase.HBaseSerDe 
not found
java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.hbase.HBaseSerDe 
not found
{code}

There is no problem with databases starting with a number. The problem is that 
you have an external table pointing to HBase in that database and you have not 
added the HBase jars to Spark.

> spark.catalog.listTables error when database name starts with a number
> --
>
> Key: SPARK-23473
> URL: https://issues.apache.org/jira/browse/SPARK-23473
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Goun Na
>Priority: Trivial
> Attachments: spark_catalog_err.txt
>
>
> Errors when Hive database name starts with a number such as 11st. 
> 
>   
> scala> spark.catalog.setCurrentDatabase("11st")
> scala> spark.catalog.listTables
> scala> spark.catalog.listTables
>  18/02/21 15:47:44 ERROR log: error in initSerDe: 
> java.lang.ClassNotFoundException Class 
> org.apache.hadoop.hive.contrib.serde2.RegexSerDe not found
>  java.lang.ClassNotFoundException: Class 
> org.apache.hadoop.hive.contrib.serde2.RegexSerDe not found
>  at 
> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2105)
>  at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:385)
>  at 
> org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
>  at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:258)
>  at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:605)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$10.apply(HiveClientImpl.scala:365)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23473) spark.catalog.listTables error when database name starts with a number

2018-02-21 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371279#comment-16371279
 ] 

Marco Gaido commented on SPARK-23473:
-

Your stack trace points out the real issue:

{code}
18/02/21 15:47:45 ERROR log: error in initSerDe: 
java.lang.ClassNotFoundException Class org.apache.hadoop.hive.hbase.HBaseSerDe 
not found
java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.hbase.HBaseSerDe 
not found
{code}

There is no problem with databases starting with a number. The problem is that 
you have an external table pointing to HBase in that database and you have not 
added the HBase jars to Spark.

> spark.catalog.listTables error when database name starts with a number
> --
>
> Key: SPARK-23473
> URL: https://issues.apache.org/jira/browse/SPARK-23473
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Goun Na
>Priority: Trivial
> Attachments: spark_catalog_err.txt
>
>
> Errors when Hive database name starts with a number such as 11st. 
> 
>   
> scala> spark.catalog.setCurrentDatabase("11st")
> scala> spark.catalog.listTables
> scala> spark.catalog.listTables
>  18/02/21 15:47:44 ERROR log: error in initSerDe: 
> java.lang.ClassNotFoundException Class 
> org.apache.hadoop.hive.contrib.serde2.RegexSerDe not found
>  java.lang.ClassNotFoundException: Class 
> org.apache.hadoop.hive.contrib.serde2.RegexSerDe not found
>  at 
> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2105)
>  at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:385)
>  at 
> org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
>  at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:258)
>  at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:605)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$10.apply(HiveClientImpl.scala:365)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23473) spark.catalog.listTables error when database name starts with a number

2018-02-21 Thread Marco Gaido (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved SPARK-23473.
-
Resolution: Invalid

> spark.catalog.listTables error when database name starts with a number
> --
>
> Key: SPARK-23473
> URL: https://issues.apache.org/jira/browse/SPARK-23473
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Goun Na
>Priority: Trivial
> Attachments: spark_catalog_err.txt
>
>
> Errors when Hive database name starts with a number such as 11st. 
> 
>   
> scala> spark.catalog.setCurrentDatabase("11st")
> scala> spark.catalog.listTables
> scala> spark.catalog.listTables
>  18/02/21 15:47:44 ERROR log: error in initSerDe: 
> java.lang.ClassNotFoundException Class 
> org.apache.hadoop.hive.contrib.serde2.RegexSerDe not found
>  java.lang.ClassNotFoundException: Class 
> org.apache.hadoop.hive.contrib.serde2.RegexSerDe not found
>  at 
> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2105)
>  at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:385)
>  at 
> org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
>  at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:258)
>  at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:605)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$10.apply(HiveClientImpl.scala:365)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23477) Misleading exception message when union fails due to metadata

2018-02-21 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371278#comment-16371278
 ] 

Marco Gaido commented on SPARK-23477:
-

[~kretes] yes. I think we can close this, do you agree?

> Misleading exception message when union fails due to metadata 
> --
>
> Key: SPARK-23477
> URL: https://issues.apache.org/jira/browse/SPARK-23477
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Tomasz Bartczak
>Priority: Minor
>
> When I have two DF's that are different only in terms of metadata in fields 
> inside a struct - I cannot union them but the error message shows that they 
> are the same:
> {code:java}
> df = spark.createDataFrame([{'a':1}])
> a = df.select(struct('a').alias('x'))
> b = 
> df.select(col('a').alias('a',metadata={'description':'xxx'})).select(struct(col('a')).alias('x'))
> a.union(b).printSchema(){code}
> gives:
> {code:java}
> An error occurred while calling o1076.union.
> : org.apache.spark.sql.AnalysisException: Union can only be performed on 
> tables with the compatible column types. struct <> struct 
> at the first column of the second table{code}
> and this part:
> {code:java}
> struct <> struct{code}
> does not make any sense because those are the same.
>  
> Since metadata must be the same for a union to succeed, it should be included 
> in the error message



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23478) Inconsistent behaviour of union when columns have conflicting metadata

2018-02-21 Thread Tomasz Bartczak (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomasz Bartczak updated SPARK-23478:

Priority: Minor  (was: Major)

> Inconsistent behaviour of union when columns have conflicting metadata
> --
>
> Key: SPARK-23478
> URL: https://issues.apache.org/jira/browse/SPARK-23478
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Tomasz Bartczak
>Priority: Minor
>
> When columns have different metadata and we union dataframes with them - the 
> end result of metadata depends on union ordering:
> {code:java}
> df = spark.createDataFrame([{'a':1}])
> a = df
> b = df.select(col('a').alias('a',metadata={'description':'xxx'}))
> print("a.union(b) gives {}".format(a.union(b).schema.fields[0].metadata))
> print("b.union(a) gives {}".format(b.union(a).schema.fields[0].metadata))
> {code}
> gives:
> {code:java}
> a.union(b) gives {}
> b.union(a) gives {'description': 'xxx'}{code}
>  
> And I wonder if this kind of union should be allowed at all - when fields 
> with different metadata are inside a struct - union fails, which can be seen 
> in https://issues.apache.org/jira/projects/SPARK/issues/SPARK-23477



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23477) Misleading exception message when union fails due to metadata

2018-02-21 Thread Tomasz Bartczak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371271#comment-16371271
 ] 

Tomasz Bartczak commented on SPARK-23477:
-

So what is the effect on master? No exception at all, and you can get a 
successful union result?

> Misleading exception message when union fails due to metadata 
> --
>
> Key: SPARK-23477
> URL: https://issues.apache.org/jira/browse/SPARK-23477
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Tomasz Bartczak
>Priority: Minor
>
> When I have two DF's that are different only in terms of metadata in fields 
> inside a struct - I cannot union them but the error message shows that they 
> are the same:
> {code:java}
> df = spark.createDataFrame([{'a':1}])
> a = df.select(struct('a').alias('x'))
> b = 
> df.select(col('a').alias('a',metadata={'description':'xxx'})).select(struct(col('a')).alias('x'))
> a.union(b).printSchema(){code}
> gives:
> {code:java}
> An error occurred while calling o1076.union.
> : org.apache.spark.sql.AnalysisException: Union can only be performed on 
> tables with the compatible column types. struct <> struct 
> at the first column of the second table{code}
> and this part:
> {code:java}
> struct <> struct{code}
> does not make any sense because those are the same.
>  
> Since metadata must be the same for a union to succeed, it should be included 
> in the error message



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23479) struct() cannot be combined with alias(metadata={})

2018-02-21 Thread Tomasz Bartczak (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomasz Bartczak updated SPARK-23479:

Priority: Minor  (was: Major)

> struct() cannot be combined with alias(metadata={})
> ---
>
> Key: SPARK-23479
> URL: https://issues.apache.org/jira/browse/SPARK-23479
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Tomasz Bartczak
>Priority: Minor
>
> when creating a struct with 'struct()' function - metadata added with alias() 
> is lost:
> {code:java}
> df =spark.createDataFrame([{'a':1}])
> df.select(struct(col('a').alias('a',metadata={'description':'xxx'}))).schema.fields[0].dataType.fields[0].metadata{code}
> gives:
> {code:java}
> {}{code}
> workaround is to create the column with metadata before adding it to the 
> struct, but this is obviously bad behaviour:
> {code:java}
> df =spark.createDataFrame([{'a':1}])
> df.select(col('a').alias('a',metadata={'description':'xxx'})).select(struct('a')).schema.fields[0].dataType.fields[0].metadata{code}
> keeps metadata:
> {code:java}
> {'description': 'xxx'}{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23408) Flaky test: StreamingOuterJoinSuite.left outer early state exclusion on right

2018-02-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371247#comment-16371247
 ] 

Apache Spark commented on SPARK-23408:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/20650

> Flaky test: StreamingOuterJoinSuite.left outer early state exclusion on right
> -
>
> Key: SPARK-23408
> URL: https://issues.apache.org/jira/browse/SPARK-23408
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> Seen on an unrelated PR.
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87386/testReport/org.apache.spark.sql.streaming/StreamingOuterJoinSuite/left_outer_early_state_exclusion_on_right/
> {noformat}
> sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 
> Assert on query failed: Check total state rows = List(4), updated state rows 
> = List(4): Array(1) did not equal List(4) incorrect updates rows
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
>   org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
>   
> org.apache.spark.sql.streaming.StateStoreMetricsTest$$anonfun$assertNumStateRows$1.apply(StateStoreMetricsTest.scala:28)
>   
> org.apache.spark.sql.streaming.StateStoreMetricsTest$$anonfun$assertNumStateRows$1.apply(StateStoreMetricsTest.scala:23)
>   
> org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1$$anonfun$apply$14.apply$mcZ$sp(StreamTest.scala:568)
>   
> org.apache.spark.sql.streaming.StreamTest$class.verify$1(StreamTest.scala:371)
>   
> org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:568)
>   
> org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:432)
>   
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> == Progress ==
>AddData to MemoryStream[value#19652]: 3,4,5
>AddData to MemoryStream[value#19662]: 1,2,3
>CheckLastBatch: [3,10,6,9]
> => AssertOnQuery(, Check total state rows = List(4), updated state 
> rows = List(4))
>AddData to MemoryStream[value#19652]: 20
>AddData to MemoryStream[value#19662]: 21
>CheckLastBatch: 
>AddData to MemoryStream[value#19662]: 20
>CheckLastBatch: [20,30,40,60],[4,10,8,null],[5,10,10,null]
> == Stream ==
> Output Mode: Append
> Stream state: {MemoryStream[value#19652]: 0,MemoryStream[value#19662]: 0}
> Thread state: alive
> Thread stack trace: java.lang.Thread.sleep(Native Method)
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:152)
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:120)
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
> {noformat}
> No other failures in the history, though.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23477) Misleading exception message when union fails due to metadata

2018-02-21 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371238#comment-16371238
 ] 

Marco Gaido commented on SPARK-23477:
-

I cannot reproduce this on master.

> Misleading exception message when union fails due to metadata 
> --
>
> Key: SPARK-23477
> URL: https://issues.apache.org/jira/browse/SPARK-23477
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Tomasz Bartczak
>Priority: Minor
>
> When I have two DF's that are different only in terms of metadata in fields 
> inside a struct - I cannot union them but the error message shows that they 
> are the same:
> {code:java}
> df = spark.createDataFrame([{'a':1}])
> a = df.select(struct('a').alias('x'))
> b = 
> df.select(col('a').alias('a',metadata={'description':'xxx'})).select(struct(col('a')).alias('x'))
> a.union(b).printSchema(){code}
> gives:
> {code:java}
> An error occurred while calling o1076.union.
> : org.apache.spark.sql.AnalysisException: Union can only be performed on 
> tables with the compatible column types. struct <> struct 
> at the first column of the second table{code}
> and this part:
> {code:java}
> struct <> struct{code}
> does not make any sense because those are the same.
>  
> Since metadata must be the same for a union to succeed, it should be included 
> in the error message



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23479) struct() cannot be combined with alias(metadata={})

2018-02-21 Thread Tomasz Bartczak (JIRA)
Tomasz Bartczak created SPARK-23479:
---

 Summary: struct() cannot be combined with alias(metadata={})
 Key: SPARK-23479
 URL: https://issues.apache.org/jira/browse/SPARK-23479
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.1
Reporter: Tomasz Bartczak


When creating a struct with the 'struct()' function, metadata added with alias() 
is lost:
{code:java}
df =spark.createDataFrame([{'a':1}])
df.select(struct(col('a').alias('a',metadata={'description':'xxx'}))).schema.fields[0].dataType.fields[0].metadata{code}
gives:
{code:java}
{}{code}
A workaround is to create the column with metadata before adding it to the 
struct, but this is obviously bad behaviour:
{code:java}
df =spark.createDataFrame([{'a':1}])
df.select(col('a').alias('a',metadata={'description':'xxx'})).select(struct('a')).schema.fields[0].dataType.fields[0].metadata{code}
keeps metadata:
{code:java}
{'description': 'xxx'}{code}
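
For readers working from Scala, the equivalent check looks roughly like the 
hedged sketch below (Column.as with an explicit Metadata is the Scala-side way 
to attach metadata). Whether the Scala path shows the same loss is not stated in 
this report, and the spark-shell session and variable names are assumptions.

{code}
// Hedged sketch, assuming a spark-shell style SparkSession named `spark`.
import spark.implicits._
import org.apache.spark.sql.functions.{col, struct}
import org.apache.spark.sql.types.{MetadataBuilder, StructType}

val df = Seq(1L).toDF("a")
val meta = new MetadataBuilder().putString("description", "xxx").build()

// Metadata attached via alias, then wrapped in struct(): the issue reports the
// PySpark equivalent of this lookup comes back empty.
val nestedMeta = df.select(struct(col("a").as("a", meta)))
  .schema.fields(0).dataType.asInstanceOf[StructType].fields(0).metadata
println(nestedMeta)
{code}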



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23478) Inconsistent behaviour of union when columns have conflicting metadata

2018-02-21 Thread Tomasz Bartczak (JIRA)
Tomasz Bartczak created SPARK-23478:
---

 Summary: Inconsistent behaviour of union when columns have 
conflicting metadata
 Key: SPARK-23478
 URL: https://issues.apache.org/jira/browse/SPARK-23478
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.1
Reporter: Tomasz Bartczak


When columns have different metadata and we union dataframes containing them, 
the resulting metadata depends on the union ordering:
{code:java}
df = spark.createDataFrame([{'a':1}])
a = df
b = df.select(col('a').alias('a',metadata={'description':'xxx'}))
print("a.union(b) gives {}".format(a.union(b).schema.fields[0].metadata))
print("b.union(a) gives {}".format(b.union(a).schema.fields[0].metadata))

{code}
gives:
{code:java}
a.union(b) gives {}
b.union(a) gives {'description': 'xxx'}{code}
 

I also wonder whether this kind of union should be allowed at all: when fields 
with different metadata are inside a struct, the union fails, as can be seen in 
https://issues.apache.org/jira/projects/SPARK/issues/SPARK-23477
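
For completeness, a hedged Scala rendering of the same ordering check is 
sketched below; the spark-shell session (`spark`) and the variable names are 
assumptions, and only the PySpark behaviour is demonstrated in this report.

{code}
// Hedged sketch, assuming a spark-shell style SparkSession named `spark`.
import spark.implicits._
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.MetadataBuilder

val df = Seq(1L).toDF("a")
val a = df
val b = df.select(col("a").as("a",
  new MetadataBuilder().putString("description", "xxx").build()))

// The report says the surviving metadata depends on which side comes first.
println(s"a.union(b) gives ${a.union(b).schema.fields(0).metadata}")
println(s"b.union(a) gives ${b.union(a).schema.fields(0).metadata}")
{code}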



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23477) Misleading exception message when union fails due to metadata

2018-02-21 Thread Tomasz Bartczak (JIRA)
Tomasz Bartczak created SPARK-23477:
---

 Summary: Misleading exception message when union fails due to 
metadata 
 Key: SPARK-23477
 URL: https://issues.apache.org/jira/browse/SPARK-23477
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.1
Reporter: Tomasz Bartczak


When I have two DataFrames that differ only in the metadata of fields inside a 
struct, I cannot union them, but the error message makes them look the same:
{code:java}
df = spark.createDataFrame([{'a':1}])
a = df.select(struct('a').alias('x'))
b = 
df.select(col('a').alias('a',metadata={'description':'xxx'})).select(struct(col('a')).alias('x'))
a.union(b).printSchema(){code}
gives:
{code:java}
An error occurred while calling o1076.union.
: org.apache.spark.sql.AnalysisException: Union can only be performed on tables 
with the compatible column types. struct <> struct at the 
first column of the second table{code}
and this part:
{code:java}
struct <> struct{code}
does not make any sense because those are the same.

 

Since metadata must be the same for a union to succeed, it should be included in 
the error message.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23473) spark.catalog.listTables error when database name starts with a number

2018-02-21 Thread Marco Gaido (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-23473:

Component/s: (was: Spark Core)
 SQL

> spark.catalog.listTables error when database name starts with a number
> --
>
> Key: SPARK-23473
> URL: https://issues.apache.org/jira/browse/SPARK-23473
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Goun Na
>Priority: Trivial
> Attachments: spark_catalog_err.txt
>
>
> Errors when Hive database name starts with a number such as 11st. 
> 
>   
> scala> spark.catalog.setCurrentDatabase("11st")
> scala> spark.catalog.listTables
> scala> spark.catalog.listTables
>  18/02/21 15:47:44 ERROR log: error in initSerDe: 
> java.lang.ClassNotFoundException Class 
> org.apache.hadoop.hive.contrib.serde2.RegexSerDe not found
>  java.lang.ClassNotFoundException: Class 
> org.apache.hadoop.hive.contrib.serde2.RegexSerDe not found
>  at 
> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2105)
>  at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:385)
>  at 
> org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
>  at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:258)
>  at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:605)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$10.apply(HiveClientImpl.scala:365)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23463) Filter operation fails to handle blank values and evicts rows that even satisfy the filtering condition

2018-02-21 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371186#comment-16371186
 ] 

Marco Gaido commented on SPARK-23463:
-

It changed Spark's implicit casting. Probably in 2.1.1, comparing a string and 
an integer resulted in automatically casting both to string, while now both are 
cast to integers. I think it is arguable which implicit cast is the right one in 
this case, but relying on automatic casts is very dangerous and should be 
avoided, because it can lead to unexpected behaviors like this one. I think we 
can close this JIRA, do you agree [~m.bakshi11]?
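
To make the point about explicit casts concrete, here is a small hedged 
illustration; the column values and types are assumptions modelled on the 
description below (where "val" behaves like a string column with blanks), and 
the per-version casting behaviour is not re-verified here.

{code}
// Hedged illustration only, assuming a spark-shell style SparkSession `spark`.
import spark.implicits._
import org.apache.spark.sql.functions.col

val df = Seq(("ALL", "0.01"), ("ALL", ""), ("ALL", "45")).toDF("dev", "val")

// Relies on implicit casting between the string column and an int literal;
// this is the behaviour that changed between releases.
df.filter(col("val") > 0).show()

// Explicit: cast the column yourself so the comparison type does not depend on
// the analyzer's implicit rules.
df.filter(col("val").cast("double") > 0.0).show()
{code}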

> Filter operation fails to handle blank values and evicts rows that even 
> satisfy the filtering condition
> ---
>
> Key: SPARK-23463
> URL: https://issues.apache.org/jira/browse/SPARK-23463
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.1
>Reporter: Manan Bakshi
>Priority: Critical
> Attachments: sample
>
>
> Filter operations were updated in Spark 2.2.0. Cost Based Optimizer was 
> introduced to look at the table stats and decide filter selectivity. However, 
> since then, filter has started behaving unexpectedly for blank values. The 
> operation would not only drop columns with blank values but also filter out 
> rows that actually meet the filter criteria.
> Steps to repro
> Consider a simple dataframe with some blank values as below:
> ||dev||val||
> |ALL|0.01|
> |ALL|0.02|
> |ALL|0.004|
> |ALL| |
> |ALL|2.5|
> |ALL|4.5|
> |ALL|45|
> Running a simple filter operation over the val column in this dataframe yields 
> unexpected results. For example, the following query returned an empty dataframe:
> df.filter(df["val"] > 0)
> ||dev||val||
> However, the filter operation works as expected if 0 in filter condition is 
> replaced by float 0.0
> df.filter(df["val"] > 0.0)
> ||dev||val||
> |ALL|0.01|
> |ALL|0.02|
> |ALL|0.004|
> |ALL|2.5|
> |ALL|4.5|
> |ALL|45|
>  
> Note that this bug only exists in Spark 2.2.0 and later. The previous 
> versions filter as expected for both int (0) and float (0.0) values in the 
> filter condition.
> Also, if there are no blank values, the filter operation works as expected 
> for all versions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2018-02-21 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371174#comment-16371174
 ] 

Sönke Liebau edited comment on SPARK-18057 at 2/21/18 10:10 AM:


I agree that we should not rename the existing kafka_10 package, as that would 
probably cause people loads of pain.

It is however tempting to have "clean" naming - would it be an option to add a 
package simply called kafka and update the kafka version in that? We could keep 
the kafka_10 package around for now but deprecate it at some point in time.

I am a bit on the fence about this, as in principle the current Kafka story is: 
"Any client with a version of 0.10.2.0 or later will support brokers of version 
0.10.x or later", so simply upgrading the Kafka version in the existing package 
should not break anything, as 0.9.x is not currently supported anyway. 
However, there is a caveat:
{code}
If the burden of backwards compatibility becomes too large, at some point we 
may need to break it.
{code}
So there is the possibility of the kafka_10 package becoming relevant again if 
later Kafka versions stop supporting 0.10.x brokers.


was (Author: sliebau):
I agree that we should not rename the existing kafka_10 package, as that would 
probably cause people loads of pain.

It is however tempting to have "clean" naming - would it be an option to add a 
package simply called kafka and update the kafka version in that? We could keep 
the kafka_10 package around for now but deprecate it at some point in time.

I am a bit on the fence about this, as in principle the current Kafka story is: 
"Any client with a version of 0.10.2.0 or later will support brokers of version 
0.10.x or later", so simply upgrading the Kafka version in the existing package 
should not break anything, as 0.9.x is not currently supported anyway. 
However, there is a caveat:
{code:java}
If the burden of backwards compatibility becomes too large, at some point we 
may need to break it.
{code}
So there is the possibility of the kafka_10 package becoming relevant again if 
later Kafka versions stop supporting 0.10.x brokers.

> Update structured streaming kafka from 10.0.1 to 10.2.0
> ---
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Priority: Major
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2018-02-21 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371174#comment-16371174
 ] 

Sönke Liebau commented on SPARK-18057:
--

I agree that we should not rename the existing kafka_10 package, as that would 
probably cause people loads of pain.

It is however tempting to have "clean" naming - would it be an option to add a 
package simply called kafka and update the kafka version in that? We could keep 
the kafka_10 package around for now but deprecate it at some point in time.

I am a bit on the fence about this, as in principle the current Kafka story is: 
"Any client with a version of 0.10.2.0 or later will support brokers of version 
0.10.x or later", so simply upgrading the Kafka version in the existing package 
should not break anything, as 0.9.x is not currently supported anyway. 
However, there is a caveat:
{code:java}
If the burden of backwards compatibility becomes too large, at some point we 
may need to break it.
{code}
So there is the possibility of the kafka_10 package becoming relevant again if 
later Kafka versions stop supporting 0.10.x brokers.

> Update structured streaming kafka from 10.0.1 to 10.2.0
> ---
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Priority: Major
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23462) Improve the error message in `StructType`

2018-02-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371150#comment-16371150
 ] 

Apache Spark commented on SPARK-23462:
--

User 'xysun' has created a pull request for this issue:
https://github.com/apache/spark/pull/20649

> Improve the error message in `StructType`
> -
>
> Key: SPARK-23462
> URL: https://issues.apache.org/jira/browse/SPARK-23462
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> The error message {{s"""Field "$name" does not exist."""}} is thrown when 
> looking up an unknown field in StructType. The error message should also 
> include information about which columns/fields exist in this struct. 
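
As a rough, hedged sketch of what such a message could look like (not the 
wording of the actual pull request), in Scala:

{code}
// Hedged sketch only; `fields` stands in for StructType's existing field array
// and the wording is illustrative, not the text adopted by the PR.
import org.apache.spark.sql.types.StructField

def fieldNotFound(name: String, fields: Array[StructField]): Nothing =
  throw new IllegalArgumentException(
    s"""Field "$name" does not exist. Available fields: ${fields.map(_.name).mkString(", ")}""")
{code}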



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23462) Improve the error message in `StructType`

2018-02-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23462:


Assignee: (was: Apache Spark)

> Improve the error message in `StructType`
> -
>
> Key: SPARK-23462
> URL: https://issues.apache.org/jira/browse/SPARK-23462
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> The error message {{s"""Field "$name" does not exist."""}} is thrown when 
> looking up an unknown field in StructType. The error message should also 
> include information about which columns/fields exist in this struct. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23462) Improve the error message in `StructType`

2018-02-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23462:


Assignee: Apache Spark

> Improve the error message in `StructType`
> -
>
> Key: SPARK-23462
> URL: https://issues.apache.org/jira/browse/SPARK-23462
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>  Labels: starter
>
> The error message {{s"""Field "$name" does not exist."""}} is thrown when 
> looking up an unknown field in StructType. The error message should also 
> include information about which columns/fields exist in this struct. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23462) Improve the error message in `StructType`

2018-02-21 Thread Xiayun Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371148#comment-16371148
 ] 

Xiayun Sun commented on SPARK-23462:


PR created: [https://github.com/apache/spark/pull/20649] 

> Improve the error message in `StructType`
> -
>
> Key: SPARK-23462
> URL: https://issues.apache.org/jira/browse/SPARK-23462
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> The error message {{s"""Field "$name" does not exist."""}} is thrown when 
> looking up an unknown field in StructType. The error message should also 
> include information about which columns/fields exist in this struct. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23452) Extend test coverage to all ORC readers

2018-02-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23452:


Assignee: Dongjoon Hyun  (was: Apache Spark)

> Extend test coverage to all ORC readers
> ---
>
> Key: SPARK-23452
> URL: https://issues.apache.org/jira/browse/SPARK-23452
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.3.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>
> We have five ORC readers. We had better have a test coverage for all ORC 
> readers.
> - Hive Serde
> - Hive OrcFileFormat
> - Apache ORC Vectorized Wrapper
> - Apache ORC Vectorized Copy
> - Apache ORC MR



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23452) Extend test coverage to all ORC readers

2018-02-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23452:


Assignee: Apache Spark  (was: Dongjoon Hyun)

> Extend test coverage to all ORC readers
> ---
>
> Key: SPARK-23452
> URL: https://issues.apache.org/jira/browse/SPARK-23452
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.3.1
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>
> We have five ORC readers. We had better have a test coverage for all ORC 
> readers.
> - Hive Serde
> - Hive OrcFileFormat
> - Apache ORC Vectorized Wrapper
> - Apache ORC Vectorized Copy
> - Apache ORC MR



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23452) Extend test coverage to all ORC readers

2018-02-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371144#comment-16371144
 ] 

Apache Spark commented on SPARK-23452:
--

User 'xysun' has created a pull request for this issue:
https://github.com/apache/spark/pull/20649

> Extend test coverage to all ORC readers
> ---
>
> Key: SPARK-23452
> URL: https://issues.apache.org/jira/browse/SPARK-23452
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.3.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>
> We have five ORC readers. We had better have a test coverage for all ORC 
> readers.
> - Hive Serde
> - Hive OrcFileFormat
> - Apache ORC Vectorized Wrapper
> - Apache ORC Vectorized Copy
> - Apache ORC MR



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23476) Spark will not start in local mode with authentication on

2018-02-21 Thread Gabor Somogyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371128#comment-16371128
 ] 

Gabor Somogyi commented on SPARK-23476:
---

I'm working on it.

> Spark will not start in local mode with authentication on
> -
>
> Key: SPARK-23476
> URL: https://issues.apache.org/jira/browse/SPARK-23476
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.0
>Reporter: Gabor Somogyi
>Priority: Minor
>
> If spark is run with "spark.authenticate=true", then it will fail to start in 
> local mode.
> {noformat}
> 17/02/03 12:09:39 ERROR spark.SparkContext: Error initializing SparkContext.
> java.lang.IllegalArgumentException: Error: a secret key must be specified via 
> the spark.authenticate.secret config
>   at 
> org.apache.spark.SecurityManager.generateSecretKey(SecurityManager.scala:401)
>   at org.apache.spark.SecurityManager.(SecurityManager.scala:221)
>   at org.apache.spark.SparkEnv$.create(SparkEnv.scala:258)
>   at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:199)
>   at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:290)
> ...
> {noformat}
> It can be confusing when authentication is turned on by default in a cluster, 
> and one tries to start spark in local mode for a simple test.
> *Workaround*: If {{spark.authenticate=true}} is specified as a cluster wide 
> config, then the following has to be added
> {{--conf "spark.authenticate=false" --conf 
> "spark.shuffle.service.enabled=false" --conf 
> "spark.dynamicAllocation.enabled=false" --conf 
> "spark.network.crypto.enabled=false" --conf 
> "spark.authenticate.enableSaslEncryption=false"}}
> in the spark-submit command.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23476) Spark will not start in local mode with authentication on

2018-02-21 Thread Gabor Somogyi (JIRA)
Gabor Somogyi created SPARK-23476:
-

 Summary: Spark will not start in local mode with authentication on
 Key: SPARK-23476
 URL: https://issues.apache.org/jira/browse/SPARK-23476
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 2.3.0
Reporter: Gabor Somogyi


If spark is run with "spark.authenticate=true", then it will fail to start in 
local mode.
{noformat}
17/02/03 12:09:39 ERROR spark.SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: Error: a secret key must be specified via 
the spark.authenticate.secret config
at 
org.apache.spark.SecurityManager.generateSecretKey(SecurityManager.scala:401)
at org.apache.spark.SecurityManager.(SecurityManager.scala:221)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:258)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:199)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:290)
...
{noformat}
It can be confusing when authentication is turned on by default in a cluster, 
and one tries to start spark in local mode for a simple test.

*Workaround*: If {{spark.authenticate=true}} is specified as a cluster wide 
config, then the following has to be added
{{--conf "spark.authenticate=false" --conf 
"spark.shuffle.service.enabled=false" --conf 
"spark.dynamicAllocation.enabled=false" --conf 
"spark.network.crypto.enabled=false" --conf 
"spark.authenticate.enableSaslEncryption=false"}}
in the spark-submit command.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23462) Improve the error message in `StructType`

2018-02-21 Thread Xiayun Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371122#comment-16371122
 ] 

Xiayun Sun commented on SPARK-23462:


I will take this. 

> Improve the error message in `StructType`
> -
>
> Key: SPARK-23462
> URL: https://issues.apache.org/jira/browse/SPARK-23462
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> The error message {{s"""Field "$name" does not exist."""}} is thrown when 
> looking up an unknown field in StructType. The error message should also 
> include information about which columns/fields exist in this struct. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23448) Dataframe returns wrong result when column don't respect datatype

2018-02-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23448:


Assignee: Apache Spark

> Dataframe returns wrong result when column don't respect datatype
> -
>
> Key: SPARK-23448
> URL: https://issues.apache.org/jira/browse/SPARK-23448
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: Local
>Reporter: Ahmed ZAROUI
>Assignee: Apache Spark
>Priority: Major
>
> I have the following JSON file that contains some noisy data (String instead 
> of Array):
>  
> {code:java}
> {"attr1":"val1","attr2":"[\"val2\"]"}
> {"attr1":"val1","attr2":["val2"]}
> {code}
> And I need to specify the schema programmatically like this:
>  
> {code:java}
> implicit val spark = SparkSession
>   .builder()
>   .master("local[*]")
>   .config("spark.ui.enabled", false)
>   .config("spark.sql.caseSensitive", "True")
>   .getOrCreate()
> import spark.implicits._
> val schema = StructType(
>   Seq(StructField("attr1", StringType, true),
>   StructField("attr2", ArrayType(StringType, true), true)))
> spark.read.schema(schema).json(input).collect().foreach(println)
> {code}
> The result given by this code is:
> {code:java}
> [null,null]
> [val1,WrappedArray(val2)]
> {code}
> Instead of putting null in the corrupted column, all columns of the first 
> message are null
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23448) Dataframe returns wrong result when column don't respect datatype

2018-02-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23448:


Assignee: (was: Apache Spark)

> Dataframe returns wrong result when column don't respect datatype
> -
>
> Key: SPARK-23448
> URL: https://issues.apache.org/jira/browse/SPARK-23448
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: Local
>Reporter: Ahmed ZAROUI
>Priority: Major
>
> I have the following JSON file that contains some noisy data (String instead 
> of Array):
>  
> {code:java}
> {"attr1":"val1","attr2":"[\"val2\"]"}
> {"attr1":"val1","attr2":["val2"]}
> {code}
> And I need to specify the schema programmatically like this:
>  
> {code:java}
> implicit val spark = SparkSession
>   .builder()
>   .master("local[*]")
>   .config("spark.ui.enabled", false)
>   .config("spark.sql.caseSensitive", "True")
>   .getOrCreate()
> import spark.implicits._
> val schema = StructType(
>   Seq(StructField("attr1", StringType, true),
>   StructField("attr2", ArrayType(StringType, true), true)))
> spark.read.schema(schema).json(input).collect().foreach(println)
> {code}
> The result given by this code is:
> {code:java}
> [null,null]
> [val1,WrappedArray(val2)]
> {code}
> Instead of putting null in the corrupted column, all columns of the first 
> message are null
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23448) Dataframe returns wrong result when column don't respect datatype

2018-02-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371120#comment-16371120
 ] 

Apache Spark commented on SPARK-23448:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/20648

> Dataframe returns wrong result when column don't respect datatype
> -
>
> Key: SPARK-23448
> URL: https://issues.apache.org/jira/browse/SPARK-23448
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: Local
>Reporter: Ahmed ZAROUI
>Priority: Major
>
> I have the following JSON file that contains some noisy data (String instead 
> of Array):
>  
> {code:java}
> {"attr1":"val1","attr2":"[\"val2\"]"}
> {"attr1":"val1","attr2":["val2"]}
> {code}
> And I need to specify the schema programmatically like this:
>  
> {code:java}
> implicit val spark = SparkSession
>   .builder()
>   .master("local[*]")
>   .config("spark.ui.enabled", false)
>   .config("spark.sql.caseSensitive", "True")
>   .getOrCreate()
> import spark.implicits._
> val schema = StructType(
>   Seq(StructField("attr1", StringType, true),
>   StructField("attr2", ArrayType(StringType, true), true)))
> spark.read.schema(schema).json(input).collect().foreach(println)
> {code}
> The result given by this code is:
> {code:java}
> [null,null]
> [val1,WrappedArray(val2)]
> {code}
> Instead of putting null in the corrupted column, all columns of the first 
> message are null
>  
>  
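
A hedged workaround sketch (not part of this report, and not a statement about 
what the linked pull request does): the standard JSON reader options let you 
declare a corrupt-record column so malformed rows stay visible. `schema`, 
`spark` and `input` reuse the names from the snippet above.

{code}
// Hedged sketch only: declare a corrupt-record column and read in PERMISSIVE
// mode; the malformed line should land in _corrupt_record rather than silently
// nulling out the row's data columns.
import org.apache.spark.sql.types.StringType

val schemaWithCorrupt = schema.add("_corrupt_record", StringType, nullable = true)

spark.read
  .schema(schemaWithCorrupt)
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .json(input)
  .show(truncate = false)
{code}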



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23406) Stream-stream self joins does not work

2018-02-21 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-23406:
-
Fix Version/s: (was: 3.0.0)
   2.4.0

> Stream-stream self joins does not work
> --
>
> Key: SPARK-23406
> URL: https://issues.apache.org/jira/browse/SPARK-23406
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently stream-stream self join throws the following error
> {code}
> val df = spark.readStream.format("rate").option("numRowsPerSecond", 
> "1").option("numPartitions", "1").load()
> display(df.withColumn("key", $"value" / 10).join(df.withColumn("key", 
> $"value" / 5), "key"))
> {code}
> error:
> {code}
> Failure when resolving conflicting references in Join:
> 'Join UsingJoin(Inner,List(key))
> :- Project [timestamp#850, value#851L, (cast(value#851L as double) / cast(10 
> as double)) AS key#855]
> : +- StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@7f1d2a68,rate,List(),None,List(),None,Map(numPartitions
>  -> 1, numRowsPerSecond -> 1),None), rate, [timestamp#850, value#851L]
> +- Project [timestamp#850, value#851L, (cast(value#851L as double) / cast(5 
> as double)) AS key#860]
>  +- StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@7f1d2a68,rate,List(),None,List(),None,Map(numPartitions
>  -> 1, numRowsPerSecond -> 1),None), rate, [timestamp#850, value#851L]
> Conflicting attributes: timestamp#850,value#851L
> ;;
> 'Join UsingJoin(Inner,List(key))
> :- Project [timestamp#850, value#851L, (cast(value#851L as double) / cast(10 
> as double)) AS key#855]
> : +- StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@7f1d2a68,rate,List(),None,List(),None,Map(numPartitions
>  -> 1, numRowsPerSecond -> 1),None), rate, [timestamp#850, value#851L]
> +- Project [timestamp#850, value#851L, (cast(value#851L as double) / cast(5 
> as double)) AS key#860]
>  +- StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@7f1d2a68,rate,List(),None,List(),None,Map(numPartitions
>  -> 1, numRowsPerSecond -> 1),None), rate, [timestamp#850, value#851L]
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
>  at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:101)
>  at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:378)
>  at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:98)
>  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:148)
>  at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:98)
>  at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:101)
>  at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:71)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:73)
>  at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:3063)
>  at org.apache.spark.sql.Dataset.join(Dataset.scala:787)
>  at org.apache.spark.sql.Dataset.join(Dataset.scala:756)
>  at org.apache.spark.sql.Dataset.join(Dataset.scala:731)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23475) The "stages" page doesn't show any completed stages

2018-02-21 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371098#comment-16371098
 ] 

Shixiong Zhu commented on SPARK-23475:
--

cc [~vanzin]

> The "stages" page doesn't show any completed stages
> ---
>
> Key: SPARK-23475
> URL: https://issues.apache.org/jira/browse/SPARK-23475
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Priority: Blocker
> Attachments: Screen Shot 2018-02-21 at 12.39.39 AM.png, Screen Shot 
> 2018-02-21 at 12.39.46 AM.png
>
>
> Run "bin/spark-shell --conf spark.ui.retainedJobs=10 --conf 
> spark.ui.retainedStages=10", type the following codes and click the "stages" 
> page, it will not show completed stages:
> {code}
> val rdd = sc.parallelize(0 to 100, 100).repartition(10).cache()
> (1 to 20).foreach { i =>
>rdd.repartition(10).count()
> }
> {code}
> The stages in the job page is also wrong. Please see the attached screenshots.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


