[jira] [Updated] (SPARK-30675) Spark Streaming Job stopped reading events from Queue upon Deregister Exception

2020-01-29 Thread Mullaivendhan Ariaputhri (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mullaivendhan Ariaputhri updated SPARK-30675:
-
Attachment: Instance-Config-P2.JPG
Instance-Config-P1.JPG
Cluster-Config-P2.JPG
Cluster-Config-P1.JPG

> Spark Streaming Job stopped reading events from Queue upon Deregister 
> Exception
> ---
>
> Key: SPARK-30675
> URL: https://issues.apache.org/jira/browse/SPARK-30675
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, DStreams
>Affects Versions: 2.4.3
>Reporter: Mullaivendhan Ariaputhri
>Priority: Major
> Attachments: Cluster-Config-P1.JPG, Cluster-Config-P2.JPG, 
> Instance-Config-P1.JPG, Instance-Config-P2.JPG
>
>
>  
> *+Stream+*
> We have observed a discrepancy in the Kinesis stream: the stream has 
> continuous incoming records, but the GetRecords.Records metric is not available.
>  
> Upon analysis, we understood that the Spark job made no GetRecords calls 
> during that window, which is why the GetRecords count is unavailable. The 
> stream itself should be healthy, since messages were still being received.
> *+Spark/EMR+*
> From the driver logs, it has been found that the driver de-registered the 
> receiver for the stream.
> +*_Driver Logs_*+
> 2020-01-03 11:11:40 ERROR ReceiverTracker:70 - *{color:#de350b}Deregistered 
> receiver for stream 0: Error while storing block into Spark - 
> java.util.concurrent.TimeoutException: Futures timed out after [30 
> seconds]{color}*
>     at 
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>     at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>     at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
>     at 
> org.apache.spark.streaming.receiver.{color:#de350b}*WriteAheadLogBasedBlockHandler.storeBlock*{color}(ReceivedBlockHandler.scala:210)
>     at 
> org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:158)
>     at 
> org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:129)
>     at 
> org.apache.spark.streaming.receiver.Receiver.store(Receiver.scala:133)
>     at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.org$apache$spark$streaming$kinesis$KinesisReceiver$$storeBlockWithRanges(KinesisReceiver.scala:293)
>     at 
> org.apache.spark.streaming.kinesis.KinesisReceiver$GeneratedBlockHandler.onPushBlock(KinesisReceiver.scala:344)
>     at 
> org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:297)
>     at 
> org.apache.spark.streaming.receiver.BlockGenerator.org$apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:269)
>     at 
> org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:110)
>     ...
> *Up to this point, no receiver had been started or registered. From the 
> executor logs (below), it has been observed that one of the executors was 
> still running on the container.*
>  
> +*_Executor Logs_*+
> 2020-01-03 11:11:30 INFO  BlockManager:54 - Removing RDD 2851002
> 2020-01-03 11:11:31 INFO  ReceiverSupervisorImpl:54 - 
> {color:#de350b}*Stopping receiver with message: Error while storing block 
> into Spark: java.util.concurrent.TimeoutException: Futures timed out after 
> [30 seconds]*{color}
> 2020-01-03 11:11:31 INFO  Worker:593 - Worker shutdown requested.
> 2020-01-03 11:11:31 INFO  LeaseCoordinator:298 - Worker 
> ip-10-61-71-29.ap-southeast-2.compute.internal:a7567f14-16be-4aca-8f64-401b0b29aea2
>  has successfully stopped lease-tracking threads
> 2020-01-03 11:11:31 INFO  KinesisRecordProcessor:54 - Shutdown:  Shutting 
> down workerId 
> ip-10-61-71-29.ap-southeast-2.compute.internal:a7567f14-16be-4aca-8f64-401b0b29aea2
>  with reason ZOMBIE
> 2020-01-03 11:11:32 INFO  MemoryStore:54 - Block input-0-1575374565339 stored 
> as bytes in memory (estimated size 7.3 KB, free 3.4 GB)
> 2020-01-03 11:11:33 INFO  Worker:634 - All record processors have been shut 
> down successfully.
>  
> *After this point, the Kinesis KCL worker that had been reading the queue 
> appears to have been terminated, which explains the gap in GetRecords.*
>  
> +*Mitigation*+
> Increased the timeouts:
>  * 'spark.streaming.receiver.blockStoreTimeout' to 59 seconds (from the 
> default of 30 seconds)
>  * 'spark.streaming.driver.writeAheadLog.batchingTimeout' to 30 seconds (from 
> the default of 5 seconds)
>  
> Note:
>  1. Write-ahead logs and checkpoints are maintained in an AWS S3 bucket
> 2. Spark submit configuration as below:
> spark-submit --deploy-mode cluster --executor-memory 4608M --driver-memory 4608M 
> --conf spark.yarn.driver.memoryOverhead=710M 
> --conf spark.yarn.executor.memoryOverhead=710M --driver-cores 3 --executor-cores 3 
> --conf spark.dynamicAllocation.minExecutors=1 
> --conf spark.dynamicAllocation.maxExecutors=2 
> --conf spark.dynamicAllocation.initialExecutors=2 
> --conf spark.locality.wait.node=0 
> --conf spark.dynamicAllocation.enabled=true 
> --conf maximizeResourceAllocation=false --class  
> --conf spark.streaming.driver.writeAheadLog.closeFileAfterWrite=true 
> --conf spark.scheduler.mode=FAIR 
> --conf spark.metrics.conf=.properties --files=s3:///.properties 
> --conf spark.streaming.receiver.writeAheadLog.closeFileAfterWrite=true 
> --conf spark.streaming.receiver.writeAheadLog.enable=true 
> --conf spark.streaming.receiver.blockStoreTimeout=59 
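The TimeoutException in the driver log above comes from a bounded wait on the block-store future. A minimal, self-contained sketch of that failure mode, using plain Python stand-ins rather than Spark internals:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

def store_block():
    # Stand-in for a slow write-ahead-log write (e.g. to S3).
    time.sleep(2)
    return "stored"

# The receiver-side wait is bounded (analogous to
# spark.streaming.receiver.blockStoreTimeout). If the write outlives the
# bound, the wait raises TimeoutError; in Spark this surfaces as
# "Futures timed out" and the receiver is stopped and deregistered.
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(store_block)
    try:
        outcome = future.result(timeout=0.5)
    except TimeoutError:
        outcome = "timed out; receiver would be deregistered"

print(outcome)
```

Raising the timeout (as in the mitigation above) widens the bound in `future.result(...)`; it does not make the underlying S3 write any faster.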

[jira] [Resolved] (SPARK-30665) Eliminate pypandoc dependency

2020-01-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30665.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27376
[https://github.com/apache/spark/pull/27376]

> Eliminate pypandoc dependency
> -
>
> Key: SPARK-30665
> URL: https://issues.apache.org/jira/browse/SPARK-30665
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Documentation, PySpark
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
> Fix For: 3.0.0
>
>
> PyPI now supports Markdown project descriptions, so we no longer need to 
> convert the Spark README into ReStructuredText and thus no longer need 
> pypandoc.
> Removing pypandoc has the added benefit of eliminating the failure mode 
> described in [this PR|https://github.com/apache/spark/pull/18981].
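For context, the mechanism this change relies on is that setuptools lets a package declare the content type of its long description, so PyPI renders Markdown directly. A hypothetical sketch with illustrative names, not Spark's actual setup.py:

```python
# Hypothetical sketch: declaring a Markdown long_description so PyPI
# renders it as-is, with no ReStructuredText conversion (and no pypandoc).
metadata = dict(
    name="example-package",  # illustrative name, not Spark's
    version="0.1.0",
    long_description="# Example\n\nMarkdown body, e.g. read from README.md",
    long_description_content_type="text/markdown",
)
# In a real setup.py this dict would be passed to setuptools.setup(**metadata).
print(metadata["long_description_content_type"])
```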



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30665) Eliminate pypandoc dependency

2020-01-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30665:


Assignee: Nicholas Chammas

> Eliminate pypandoc dependency
> -
>
> Key: SPARK-30665
> URL: https://issues.apache.org/jira/browse/SPARK-30665
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Documentation, PySpark
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
>
> PyPI now supports Markdown project descriptions, so we no longer need to 
> convert the Spark README into ReStructuredText and thus no longer need 
> pypandoc.
> Removing pypandoc has the added benefit of eliminating the failure mode 
> described in [this PR|https://github.com/apache/spark/pull/18981].






[jira] [Updated] (SPARK-30675) Spark Streaming Job stopped reading events from Queue upon Deregister Exception

2020-01-29 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-30675:
-
Component/s: (was: Structured Streaming)
 (was: Spark Submit)
 DStreams

> Spark Streaming Job stopped reading events from Queue upon Deregister 
> Exception
> ---
>
> Key: SPARK-30675
> URL: https://issues.apache.org/jira/browse/SPARK-30675
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, DStreams
>Affects Versions: 2.4.3
>Reporter: Mullaivendhan Ariaputhri
>Priority: Major
>
> *+Spark/EMR+*
> From the driver logs, it has been found that the driver de-registered the 
> receiver for the stream
>  
>  
> *# Driver Logs*
> 2020-01-03 11:11:40 ERROR ReceiverTracker:70 - *{color:#de350b}Deregistered 
> receiver for stream 0: Error while storing block into Spark - 
> java.util.concurrent.TimeoutException: Futures timed out after [30 
> seconds]{color}*
>     at 
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>     at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>     at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
>     at 
> org.apache.spark.streaming.receiver.{color:#de350b}*WriteAheadLogBasedBlockHandler.storeBlock*{color}(ReceivedBlockHandler.scala:210)
>     at 
> org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:158)
>     at 
> org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:129)
>     at 
> org.apache.spark.streaming.receiver.Receiver.store(Receiver.scala:133)
>     at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.org$apache$spark$streaming$kinesis$KinesisReceiver$$storeBlockWithRanges(KinesisReceiver.scala:293)
>     at 
> org.apache.spark.streaming.kinesis.KinesisReceiver$GeneratedBlockHandler.onPushBlock(KinesisReceiver.scala:344)
>     at 
> org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:297)
>     at 
> org.apache.spark.streaming.receiver.BlockGenerator.org$apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:269)
>     at 
> org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:110)
>     ...
> *Up to this point, no receiver had been started or registered again. From 
> the executor logs (below), it has been observed that one of the executors 
> was running on the container.*
>  
> *# Executor Logs* 
> 2020-01-03 11:11:30 INFO  BlockManager:54 - Removing RDD 2851002
> 2020-01-03 11:11:31 INFO  ReceiverSupervisorImpl:54 - 
> {color:#de350b}*Stopping receiver with message: Error while storing block 
> into Spark: java.util.concurrent.TimeoutException: Futures timed out after 
> [30 seconds]*{color}
> 2020-01-03 11:11:31 INFO  Worker:593 - Worker shutdown requested.
> 2020-01-03 11:11:31 INFO  LeaseCoordinator:298 - Worker 
> ip-10-61-71-29.ap-southeast-2.compute.internal:a7567f14-16be-4aca-8f64-401b0b29aea2
>  has successfully stopped lease-tracking threads
> 2020-01-03 11:11:31 INFO  KinesisRecordProcessor:54 - Shutdown:  Shutting 
> down workerId 
> ip-10-61-71-29.ap-southeast-2.compute.internal:a7567f14-16be-4aca-8f64-401b0b29aea2
>  with reason ZOMBIE
> 2020-01-03 11:11:32 INFO  MemoryStore:54 - Block input-0-1575374565339 stored 
> as bytes in memory (estimated size 7.3 KB, free 3.4 GB)
> 2020-01-03 11:11:33 INFO  Worker:634 - All record processors have been shut 
> down successfully.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30675) Spark Streaming Job stopped reading events from Queue upon Deregister Exception

2020-01-29 Thread Mullaivendhan Ariaputhri (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mullaivendhan Ariaputhri updated SPARK-30675:
-
Description: 
*+Spark/EMR+*

From the driver logs, it has been found that the driver de-registered the 
receiver for the stream

 

 

*# Driver Logs*

2020-01-03 11:11:40 ERROR ReceiverTracker:70 - *{color:#de350b}Deregistered 
receiver for stream 0: Error while storing block into Spark - 
java.util.concurrent.TimeoutException: Futures timed out after [30 
seconds]{color}*

    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)

    at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)

    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)

    at 
org.apache.spark.streaming.receiver.{color:#de350b}*WriteAheadLogBasedBlockHandler.storeBlock*{color}(ReceivedBlockHandler.scala:210)

    at 
org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:158)

    at 
org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:129)

    at 
org.apache.spark.streaming.receiver.Receiver.store(Receiver.scala:133)

    at 
org.apache.spark.streaming.kinesis.KinesisReceiver.org$apache$spark$streaming$kinesis$KinesisReceiver$$storeBlockWithRanges(KinesisReceiver.scala:293)

    at 
org.apache.spark.streaming.kinesis.KinesisReceiver$GeneratedBlockHandler.onPushBlock(KinesisReceiver.scala:344)

    at 
org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:297)

    at 
org.apache.spark.streaming.receiver.BlockGenerator.org$apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:269)

    at 
org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:110)

    ...

*Up to this point, no receiver had been started or registered again. From the 
executor logs (below), it has been observed that one of the executors was 
running on the container.*

 

*# Executor Logs* 

2020-01-03 11:11:30 INFO  BlockManager:54 - Removing RDD 2851002

2020-01-03 11:11:31 INFO  ReceiverSupervisorImpl:54 - 
{color:#de350b}*Stopping receiver with message: Error while storing block 
into Spark: java.util.concurrent.TimeoutException: Futures timed out after [30 
seconds]*{color}

2020-01-03 11:11:31 INFO  Worker:593 - Worker shutdown requested.

2020-01-03 11:11:31 INFO  LeaseCoordinator:298 - Worker 
ip-10-61-71-29.ap-southeast-2.compute.internal:a7567f14-16be-4aca-8f64-401b0b29aea2
 has successfully stopped lease-tracking threads

2020-01-03 11:11:31 INFO  KinesisRecordProcessor:54 - Shutdown:  Shutting down 
workerId 
ip-10-61-71-29.ap-southeast-2.compute.internal:a7567f14-16be-4aca-8f64-401b0b29aea2
 with reason ZOMBIE

2020-01-03 11:11:32 INFO  MemoryStore:54 - Block input-0-1575374565339 stored 
as bytes in memory (estimated size 7.3 KB, free 3.4 GB)

2020-01-03 11:11:33 INFO  Worker:634 - All record processors have been shut 
down successfully.

> Spark Streaming Job stopped reading events from Queue upon Deregister 
> Exception
> ---
>
> Key: SPARK-30675
> URL: https://issues.apache.org/jira/browse/SPARK-30675
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Submit, Structured Streaming
>Affects Versions: 2.4.3
>Reporter: Mullaivendhan Ariaputhri
>Priority: Major
>
> *+Spark/EMR+*
> From the driver logs, it has been found that the driver de-registered the 
> receiver for the stream
>  
>  
> *# Driver Logs*
> 2020-01-03 11:11:40 ERROR ReceiverTracker:70 - *{color:#de350b}Deregistered 
> receiver for stream 0: Error while storing block into Spark - 
> java.util.concurrent.TimeoutException: Futures timed out after [30 
> seconds]{color}*
>     at 
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>     at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>     at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
>     at 
> org.apache.spark.streaming.receiver.{color:#de350b}*WriteAheadLogBasedBlockHandler.storeBlock*{color}(ReceivedBlockHandler.scala:210)
>     at 
> org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:158)
>     at 
> org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:129)
>     at 
> org.apache.spark.streaming.receiver.Receiver.store(Receiver.scala:133)
>     at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.org$apache$spark$streaming$kinesis$KinesisReceiver$$storeBlockWithRanges(KinesisReceiver.scala:293)
>     

[jira] [Created] (SPARK-30675) Spark Streaming Job stopped reading events from Queue upon Deregister Exception

2020-01-29 Thread Mullaivendhan Ariaputhri (Jira)
Mullaivendhan Ariaputhri created SPARK-30675:


 Summary: Spark Streaming Job stopped reading events from Queue 
upon Deregister Exception
 Key: SPARK-30675
 URL: https://issues.apache.org/jira/browse/SPARK-30675
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Spark Submit, Structured Streaming
Affects Versions: 2.4.3
Reporter: Mullaivendhan Ariaputhri









[jira] [Updated] (SPARK-30674) Use python3 in dev/lint-python

2020-01-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30674:
--
Description: `lint-python` fails at python2. We had better use python3 
explicitly.  (was: `lint-python` fails at python2. We had better migrate.)

> Use python3 in dev/lint-python
> --
>
> Key: SPARK-30674
> URL: https://issues.apache.org/jira/browse/SPARK-30674
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> `lint-python` fails at python2. We had better use python3 explicitly.






[jira] [Created] (SPARK-30674) Use python3 in dev/lint-python

2020-01-29 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-30674:
-

 Summary: Use python3 in dev/lint-python
 Key: SPARK-30674
 URL: https://issues.apache.org/jira/browse/SPARK-30674
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun


`lint-python` fails at python2. We had better migrate.






[jira] [Created] (SPARK-30673) Test cases in HiveShowCreateTableSuite should create Hive table instead of Datasource table

2020-01-29 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-30673:
---

 Summary: Test cases in HiveShowCreateTableSuite should create Hive 
table instead of Datasource table
 Key: SPARK-30673
 URL: https://issues.apache.org/jira/browse/SPARK-30673
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 3.0.0
Reporter: L. C. Hsieh
Assignee: L. C. Hsieh


Because Spark SQL now creates a data source table when no provider is 
specified in the SQL command, some test cases in HiveShowCreateTableSuite 
don't create Hive tables, but data source tables.

It is confusing and not good for the purpose of this test suite.






[jira] [Updated] (SPARK-30665) Eliminate pypandoc dependency

2020-01-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30665:
-
Component/s: Build

> Eliminate pypandoc dependency
> -
>
> Key: SPARK-30665
> URL: https://issues.apache.org/jira/browse/SPARK-30665
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Documentation, PySpark
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> PyPI now supports Markdown project descriptions, so we no longer need to 
> convert the Spark README into ReStructuredText and thus no longer need 
> pypandoc.
> Removing pypandoc has the added benefit of eliminating the failure mode 
> described in [this PR|https://github.com/apache/spark/pull/18981].






[jira] [Updated] (SPARK-30665) Eliminate pypandoc dependency

2020-01-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30665:
-
Component/s: (was: Build)
 Documentation

> Eliminate pypandoc dependency
> -
>
> Key: SPARK-30665
> URL: https://issues.apache.org/jira/browse/SPARK-30665
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> PyPI now supports Markdown project descriptions, so we no longer need to 
> convert the Spark README into ReStructuredText and thus no longer need 
> pypandoc.
> Removing pypandoc has the added benefit of eliminating the failure mode 
> described in [this PR|https://github.com/apache/spark/pull/18981].






[jira] [Comment Edited] (SPARK-21823) ALTER TABLE table statements such as RENAME and CHANGE columns should raise error if there are any dependent constraints.

2020-01-29 Thread sakshi chourasia (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-21823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025611#comment-17025611
 ] 

sakshi chourasia edited comment on SPARK-21823 at 1/30/20 5:20 AM:
---

Hi [~ksunitha]

I work with spark sql. Need this command to alter the column name. I have some 
100+ table with 4 columns to be renamed in each, present in Prod and Dev env. 
May I know when we are planning to let this command out.


was (Author: sakshi49):
Hi [~ksunitha]

I work with spark sql. Need this command to alter the column name. I have some 
100+ table with 4 columns to be renamed in each present in Prod and Dev env. 
May I know when we are planning to let this command out.

> ALTER TABLE table statements  such as RENAME and CHANGE columns should  raise 
>  error if there are any dependent constraints. 
> -
>
> Key: SPARK-21823
> URL: https://issues.apache.org/jira/browse/SPARK-21823
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Suresh Thalamati
>Priority: Major
>
> Following ALTER TABLE DDL statements will impact  the  informational 
> constraints defined on a table:
> {code:sql}
> ALTER TABLE name RENAME TO new_name
> ALTER TABLE name CHANGE column_name new_name new_type
> {code}
> Spark SQL should raise errors if there are informational constraints 
> defined on the columns affected by the ALTER and let the user drop 
> constraints before proceeding with the DDL. In the future we can enhance 
> ALTER to automatically fix up the constraint definition in the catalog when 
> possible, and not raise an error.
> When spark adds support for DROP/REPLACE of columns they will impact 
> informational constraints.
> {code:sql}
> ALTER TABLE name DROP [COLUMN] column_name
> ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])
> {code}






[jira] [Assigned] (SPARK-30435) update Spark SQL guide of Supported Hive Features

2020-01-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30435:
-

Assignee: angerszhu

> update Spark SQL guide of Supported Hive Features
> -
>
> Key: SPARK-30435
> URL: https://issues.apache.org/jira/browse/SPARK-30435
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>







[jira] [Resolved] (SPARK-30435) update Spark SQL guide of Supported Hive Features

2020-01-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30435.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27106
[https://github.com/apache/spark/pull/27106]

> update Spark SQL guide of Supported Hive Features
> -
>
> Key: SPARK-30435
> URL: https://issues.apache.org/jira/browse/SPARK-30435
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Resolved] (SPARK-30672) numpy is a dependency for building PySpark API docs

2020-01-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30672.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27390
[https://github.com/apache/spark/pull/27390]

> numpy is a dependency for building PySpark API docs
> ---
>
> Key: SPARK-30672
> URL: https://issues.apache.org/jira/browse/SPARK-30672
> Project: Spark
>  Issue Type: Bug
>  Components: Build, PySpark
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
> Fix For: 3.0.0
>
>
> As described here: 
> https://github.com/apache/spark/pull/27376#discussion_r372550656






[jira] [Assigned] (SPARK-30672) numpy is a dependency for building PySpark API docs

2020-01-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30672:


Assignee: Nicholas Chammas

> numpy is a dependency for building PySpark API docs
> ---
>
> Key: SPARK-30672
> URL: https://issues.apache.org/jira/browse/SPARK-30672
> Project: Spark
>  Issue Type: Bug
>  Components: Build, PySpark
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
>
> As described here: 
> https://github.com/apache/spark/pull/27376#discussion_r372550656






[jira] [Comment Edited] (SPARK-21823) ALTER TABLE table statements such as RENAME and CHANGE columns should raise error if there are any dependent constraints.

2020-01-29 Thread sakshi chourasia (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-21823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025611#comment-17025611
 ] 

sakshi chourasia edited comment on SPARK-21823 at 1/30/20 3:56 AM:
---

Hi [~ksunitha]

I work with spark sql. Need this command to alter the column name. I have some 
100+ table with 4 columns to be renamed in each present in Prod and Dev env. 
May I know when we are planning to let this command out.


was (Author: sakshi49):
Hi [~ksunitha]

I work with spark sql. Need this command to alter the table name. I have some 
100+ table present in Prod and Dev env. May I know when we are planning to let 
this command out.

> ALTER TABLE table statements  such as RENAME and CHANGE columns should  raise 
>  error if there are any dependent constraints. 
> -
>
> Key: SPARK-21823
> URL: https://issues.apache.org/jira/browse/SPARK-21823
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Suresh Thalamati
>Priority: Major
>
> Following ALTER TABLE DDL statements will impact  the  informational 
> constraints defined on a table:
> {code:sql}
> ALTER TABLE name RENAME TO new_name
> ALTER TABLE name CHANGE column_name new_name new_type
> {code}
> Spark SQL should raise errors if there are informational constraints 
> defined on the columns affected by the ALTER and let the user drop 
> constraints before proceeding with the DDL. In the future we can enhance 
> ALTER to automatically fix up the constraint definition in the catalog when 
> possible, and not raise an error.
> When spark adds support for DROP/REPLACE of columns they will impact 
> informational constraints.
> {code:sql}
> ALTER TABLE name DROP [COLUMN] column_name
> ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])
> {code}






[jira] [Updated] (SPARK-30665) Eliminate pypandoc dependency

2020-01-29 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-30665:
-
Summary: Eliminate pypandoc dependency  (was: Remove Pandoc dependency in 
PySpark setup.py)

> Eliminate pypandoc dependency
> -
>
> Key: SPARK-30665
> URL: https://issues.apache.org/jira/browse/SPARK-30665
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, PySpark
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> PyPI now supports Markdown project descriptions, so we no longer need to 
> convert the Spark README into ReStructuredText and thus no longer need 
> pypandoc.
> Removing pypandoc has the added benefit of eliminating the failure mode 
> described in [this PR|https://github.com/apache/spark/pull/18981].






[jira] [Created] (SPARK-30672) numpy is a dependency for building PySpark API docs

2020-01-29 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-30672:


 Summary: numpy is a dependency for building PySpark API docs
 Key: SPARK-30672
 URL: https://issues.apache.org/jira/browse/SPARK-30672
 Project: Spark
  Issue Type: Bug
  Components: Build, PySpark
Affects Versions: 3.0.0
Reporter: Nicholas Chammas


As described here: 
https://github.com/apache/spark/pull/27376#discussion_r372550656






[jira] [Commented] (SPARK-30665) Remove Pandoc dependency in PySpark setup.py

2020-01-29 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026399#comment-17026399
 ] 

Nicholas Chammas commented on SPARK-30665:
--



> Remove Pandoc dependency in PySpark setup.py
> 
>
> Key: SPARK-30665
> URL: https://issues.apache.org/jira/browse/SPARK-30665
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, PySpark
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> PyPI now supports Markdown project descriptions, so we no longer need to 
> convert the Spark README into ReStructuredText and thus no longer need 
> pypandoc.
> Removing pypandoc has the added benefit of eliminating the failure mode 
> described in [this PR|https://github.com/apache/spark/pull/18981].






[jira] [Resolved] (SPARK-30618) Why does SparkSQL allow `WHERE` to be table alias?

2020-01-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30618.
--
Resolution: Invalid

Please ask questions on the mailing list. See 
[https://spark.apache.org/contributing.html]

> Why does SparkSQL allow `WHERE` to be table alias?
> --
>
> Key: SPARK-30618
> URL: https://issues.apache.org/jira/browse/SPARK-30618
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Chunjun Xiao
>Priority: Minor
>
> An empty `WHERE` expression is valid in Spark SQL, as: `SELECT * FROM XXX 
> WHERE`. Here `WHERE` is parsed as the table alias.
> I think this surprises most SQL users, as this is an invalid statement in 
> some SQL engines like MySQL.  
> I checked the source code, and found that more keywords (in most SQL 
> systems) are treated as `noReserved` and allowed to be table aliases.  
> Could anyone please explain the rationale behind this decision?






[jira] [Commented] (SPARK-30619) org.slf4j.Logger and org.apache.commons.collections classes not built as part of hadoop-provided profile

2020-01-29 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026391#comment-17026391
 ] 

Hyukjin Kwon commented on SPARK-30619:
--

[~abhisrao] can you share a reproducer and the error messages?

> org.slf4j.Logger and org.apache.commons.collections classes not built as part 
> of hadoop-provided profile
> 
>
> Key: SPARK-30619
> URL: https://issues.apache.org/jira/browse/SPARK-30619
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.2, 2.4.4
> Environment: Spark on kubernetes
>Reporter: Abhishek Rao
>Priority: Major
>
> We're using spark-2.4.4-bin-without-hadoop.tgz and executing Java Word count 
> (org.apache.spark.examples.JavaWordCount) example on local files.
> But we're seeing that it is expecting org.slf4j.Logger and 
> org.apache.commons.collections classes to be available for executing this.
> We expected the binary to work as it is for local files. Is there anything 
> which we're missing?






[jira] [Resolved] (SPARK-30643) Add support for embedding Hive 3

2020-01-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30643.
--
Resolution: Later

> Add support for embedding Hive 3
> 
>
> Key: SPARK-30643
> URL: https://issues.apache.org/jira/browse/SPARK-30643
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Igor Dvorzhak
>Priority: Major
>
> Currently Spark can be compiled only against Hive 1.2.1 and Hive 2.3, 
> compilation fails against Hive 3.






[jira] [Commented] (SPARK-30643) Add support for embedding Hive 3

2020-01-29 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026389#comment-17026389
 ] 

Hyukjin Kwon commented on SPARK-30643:
--

Yeah, it would take a huge effort and a lot of changes to upgrade this. Unless 
there are some strong reasons to do it, I wouldn't do it.
There were specific reasons why we had to upgrade to Hive 2.3 (e.g., Hadoop 3 
support, removing our own fork, etc.). Hive 3 doesn't seem to share those reasons.

> Add support for embedding Hive 3
> 
>
> Key: SPARK-30643
> URL: https://issues.apache.org/jira/browse/SPARK-30643
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Igor Dvorzhak
>Priority: Major
>
> Currently Spark can be compiled only against Hive 1.2.1 and Hive 2.3, 
> compilation fails against Hive 3.






[jira] [Commented] (SPARK-30646) transform_keys function throws exception as "Cannot use null as map key", but there isn't any null key in the map

2020-01-29 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026385#comment-17026385
 ] 

Hyukjin Kwon commented on SPARK-30646:
--

Use `concat` to concatenate strings:


{code:java}
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> val df = Seq(Map("EID_1"->1,"EID_2"->25000)).toDF("employees")
df: org.apache.spark.sql.DataFrame = [employees: map]

scala> df.withColumn("employees", transform_keys($"employees", (k, v) => 
concat(k, lit("XYX")))).show()
+--------------------+
|           employees|
+--------------------+
|[EID_1XYX -> 1000...|
+--------------------+
{code}

> transform_keys function throws exception as "Cannot use null as map key", but 
> there isn't any null key in the map 
> --
>
> Key: SPARK-30646
> URL: https://issues.apache.org/jira/browse/SPARK-30646
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Arun Jijo
>Priority: Major
>
> Have started experimenting with Spark 3.0's new SQL functions and along the 
> way found an issue with the *transform_keys* function. It is raising a 
> "Cannot use null as map key" exception, but the map actually doesn't hold 
> any null values.
> Find my spark code below to reproduce the error.
> {code:java}
> val df = Seq(Map("EID_1"->1,"EID_2"->25000)).toDF("employees")
> df.withColumn("employees",transform_keys($"employees",(k,v)=>lit(k.+("XYX"
>.show
> {code}
> Exception in thread "main" java.lang.RuntimeException: *Cannot use null as 
> map key*.






[jira] [Resolved] (SPARK-30646) transform_keys function throws exception as "Cannot use null as map key", but there isn't any null key in the map

2020-01-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30646.
--
Resolution: Invalid

> transform_keys function throws exception as "Cannot use null as map key", but 
> there isn't any null key in the map 
> --
>
> Key: SPARK-30646
> URL: https://issues.apache.org/jira/browse/SPARK-30646
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Arun Jijo
>Priority: Major
>
> I have started experimenting with Spark 3.0's new SQL functions and along the 
> way found an issue with the *transform_keys* function. It raises a "Cannot use 
> null as map key" exception, but the map doesn't actually hold any null keys.
> My Spark code to reproduce the error is below.
> {code:java}
> val df = Seq(Map("EID_1"->1,"EID_2"->25000)).toDF("employees")
> df.withColumn("employees", transform_keys($"employees", (k, v) => lit(k.+("XYX")))).show
> {code}
> Exception in thread "main" java.lang.RuntimeException: *Cannot use null as 
> map key*.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30647) When creating a custom datasource a FileNotFoundException happens

2020-01-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30647.
--
Resolution: Incomplete

> When creating a custom datasource a FileNotFoundException happens
> 
>
> Key: SPARK-30647
> URL: https://issues.apache.org/jira/browse/SPARK-30647
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: Jorge Machado
>Priority: Major
>
> Hello, I'm creating a datasource based on FileFormat and DataSourceRegister. 
> When I pass a path or a file that has a white space, it seems to fail with 
> the error: 
> {code:java}
>  org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 
> (TID 213, localhost, executor driver): java.io.FileNotFoundException: File 
> file:somePath/0019_leftImg8%20bit.png does not exist It is possible the 
> underlying files have been updated. You can explicitly invalidate the cache 
> in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating 
> the Dataset/DataFrame involved. at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
>  at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
>  at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source) at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>  at 
> org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:125)
>  at 
> org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
>  at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
> {code}
> I'm happy to fix this if someone tells me where I need to look.  
> I think it is in org.apache.spark.rdd.InputFileBlockHolder: 
> {code:java}
> inputBlock.set(new FileBlock(UTF8String.fromString(filePath), startOffset, 
> length))
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30647) When creating a custom datasource a FileNotFoundException happens

2020-01-29 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026374#comment-17026374
 ] 

Hyukjin Kwon commented on SPARK-30647:
--

I think this is fixed in the latest versions of Spark. Spark 2.3.x is EOL 
anyway, so please verify whether this still exists in at least Spark 2.4.x or 
in the master branch. I am resolving this until then.
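The {{%20}} in the failing path suggests a percent-encoding mix-up: the space in the file name appears to have been URL-encoded once and the encoded form then treated as a literal path. A minimal, framework-free sketch of that mechanism, reusing the path from the stack trace:

```python
from urllib.parse import quote, unquote

# A file name containing a space, like the one in the reported stack trace.
original = "somePath/0019_leftImg8 bit.png"

# Percent-encoding turns the space into %20 ...
encoded = quote(original)
print(encoded)  # somePath/0019_leftImg8%20bit.png

# ... and looking up the encoded string as a literal path would then fail to
# find the file, matching the FileNotFoundException on the %20 variant.
assert "%20" in encoded
assert unquote(encoded) == original
```

Whether Spark's file listing or the custom datasource does the extra encoding would need to be confirmed against the actual code path; this only illustrates how a space becomes {{%20}}.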

> When creating a custom datasource a FileNotFoundException happens
> 
>
> Key: SPARK-30647
> URL: https://issues.apache.org/jira/browse/SPARK-30647
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: Jorge Machado
>Priority: Major
>
> Hello, I'm creating a datasource based on FileFormat and DataSourceRegister. 
> When I pass a path or a file that has a white space, it seems to fail with 
> the error: 
> {code:java}
>  org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 
> (TID 213, localhost, executor driver): java.io.FileNotFoundException: File 
> file:somePath/0019_leftImg8%20bit.png does not exist It is possible the 
> underlying files have been updated. You can explicitly invalidate the cache 
> in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating 
> the Dataset/DataFrame involved. at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
>  at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
>  at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source) at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>  at 
> org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:125)
>  at 
> org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
>  at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
> {code}
> I'm happy to fix this if someone tells me where I need to look.  
> I think it is in org.apache.spark.rdd.InputFileBlockHolder: 
> {code:java}
> inputBlock.set(new FileBlock(UTF8String.fromString(filePath), startOffset, 
> length))
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30649) Azure Spark read : ContentMD5 header is missing in the response

2020-01-29 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026372#comment-17026372
 ] 

Hyukjin Kwon commented on SPARK-30649:
--

Firstly, please don't just copy and paste error messages; JIRA is for filing 
issues, not for requesting investigation, so please show a full reproducer if 
possible.
Secondly, Spark 2.3.x is EOL. I am going to resolve this until it's verified 
at least in 2.4.x or in the master.

> Azure Spark read :  ContentMD5 header is missing in the response
> 
>
> Key: SPARK-30649
> URL: https://issues.apache.org/jira/browse/SPARK-30649
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Nilesh Patil
>Priority: Major
>
> When we read a CSV file from Azure, we get an exception saying the 
> ContentMD5 header is missing in the response.
>  
>  
> 2020-01-27 11:03:12.255 ERROR 8 --- [r for task 1030] 
> o.a.h.fs.azure.NativeAzureFileSystem : Encountered Storage Exception for read 
> on Blob : 
> Performance_Dataset/PR_DS_1cr.csv/part-0-1af0b4b3-018e-4847-9441-3e5239c94e33-c000.csv
>  Exception details: java.io.IOException Error Code : MissingContentMD5Header 
> 2020-01-27 11:03:12.258 ERROR 8 --- [r for task 1030] 
> org.apache.spark.executor.Executor : Exception in task 0.0 in stage 626.0 
> (TID 1030)
> java.io.IOException: null at 
> com.microsoft.azure.storage.core.Utility.initIOException(Utility.java:737) 
> ~[azure-storage-5.0.0.jar!/:na] at 
> com.microsoft.azure.storage.blob.BlobInputStream.dispatchRead(BlobInputStream.java:264)
>  ~[azure-storage-5.0.0.jar!/:na] at 
> com.microsoft.azure.storage.blob.BlobInputStream.readInternal(BlobInputStream.java:448)
>  ~[azure-storage-5.0.0.jar!/:na] at 
> com.microsoft.azure.storage.blob.BlobInputStream.read(BlobInputStream.java:420)
>  ~[azure-storage-5.0.0.jar!/:na] at 
> org.apache.hadoop.fs.azure.BlockBlobInputStream.read(BlockBlobInputStream.java:281)
>  ~[hadoop-azure-2.9.0.jar!/:na] at 
> org.apache.hadoop.fs.azure.NativeAzureFileSystem$NativeAzureFsInputStream.read(NativeAzureFileSystem.java:882)
>  ~[hadoop-azure-2.9.0.jar!/:na] at 
> java.io.BufferedInputStream.read1(BufferedInputStream.java:284) 
> ~[na:1.8.0_212] at 
> java.io.BufferedInputStream.read(BufferedInputStream.java:345) 
> ~[na:1.8.0_212] at java.io.DataInputStream.read(DataInputStream.java:149) 
> ~[na:1.8.0_212] at 
> org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.fillBuffer(UncompressedSplitLineReader.java:59)
>  ~[hadoop-mapreduce-client-core-2.7.2.jar!/:na] at 
> org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:218) 
> ~[hadoop-common-2.9.0.jar!/:na] at 
> org.apache.hadoop.util.LineReader.readLine(LineReader.java:176) 
> ~[hadoop-common-2.9.0.jar!/:na] at 
> org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.readLine(UncompressedSplitLineReader.java:91)
>  ~[hadoop-mapreduce-client-core-2.7.2.jar!/:na] at 
> org.apache.hadoop.mapreduce.lib.input.LineRecordReader.skipUtfByteOrderMark(LineRecordReader.java:144)
>  ~[hadoop-mapreduce-client-core-2.7.2.jar!/:na] at 
> org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:184)
>  ~[hadoop-mapreduce-client-core-2.7.2.jar!/:na] at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>  ~[spark-sql_2.11-2.3.1.jar!/:2.3.1] at 
> org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.hasNext(HadoopFileLinesReader.scala:50)
>  ~[spark-sql_2.11-2.3.1.jar!/:2.3.1] at 
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) 
> ~[scala-library-2.11.12.jar!/:na] at 
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) 
> ~[scala-library-2.11.12.jar!/:na] at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
>  ~[spark-sql_2.11-2.3.1.jar!/:2.3.1] at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:186)
>  ~[spark-sql_2.11-2.3.1.jar!/:2.3.1] at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
>  ~[spark-sql_2.11-2.3.1.jar!/:2.3.1] at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source) ~[na:na] at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  ~[spark-sql_2.11-2.3.1.jar!/:2.3.1] at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
>  ~[spark-sql_2.11-2.3.1.jar!/:2.3.1] at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
>  ~[spark-sql_2.11-2.3.1.jar!/:2.3.1] at 
> 

[jira] [Resolved] (SPARK-30649) Azure Spark read : ContentMD5 header is missing in the response

2020-01-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30649.
--
Resolution: Incomplete

> Azure Spark read :  ContentMD5 header is missing in the response
> 
>
> Key: SPARK-30649
> URL: https://issues.apache.org/jira/browse/SPARK-30649
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Nilesh Patil
>Priority: Major
>
> When we read a CSV file from Azure, we get an exception saying the 
> ContentMD5 header is missing in the response.
>  
>  
> 2020-01-27 11:03:12.255 ERROR 8 --- [r for task 1030] 
> o.a.h.fs.azure.NativeAzureFileSystem : Encountered Storage Exception for read 
> on Blob : 
> Performance_Dataset/PR_DS_1cr.csv/part-0-1af0b4b3-018e-4847-9441-3e5239c94e33-c000.csv
>  Exception details: java.io.IOException Error Code : MissingContentMD5Header 
> 2020-01-27 11:03:12.258 ERROR 8 --- [r for task 1030] 
> org.apache.spark.executor.Executor : Exception in task 0.0 in stage 626.0 
> (TID 1030)
> java.io.IOException: null at 
> com.microsoft.azure.storage.core.Utility.initIOException(Utility.java:737) 
> ~[azure-storage-5.0.0.jar!/:na] at 
> com.microsoft.azure.storage.blob.BlobInputStream.dispatchRead(BlobInputStream.java:264)
>  ~[azure-storage-5.0.0.jar!/:na] at 
> com.microsoft.azure.storage.blob.BlobInputStream.readInternal(BlobInputStream.java:448)
>  ~[azure-storage-5.0.0.jar!/:na] at 
> com.microsoft.azure.storage.blob.BlobInputStream.read(BlobInputStream.java:420)
>  ~[azure-storage-5.0.0.jar!/:na] at 
> org.apache.hadoop.fs.azure.BlockBlobInputStream.read(BlockBlobInputStream.java:281)
>  ~[hadoop-azure-2.9.0.jar!/:na] at 
> org.apache.hadoop.fs.azure.NativeAzureFileSystem$NativeAzureFsInputStream.read(NativeAzureFileSystem.java:882)
>  ~[hadoop-azure-2.9.0.jar!/:na] at 
> java.io.BufferedInputStream.read1(BufferedInputStream.java:284) 
> ~[na:1.8.0_212] at 
> java.io.BufferedInputStream.read(BufferedInputStream.java:345) 
> ~[na:1.8.0_212] at java.io.DataInputStream.read(DataInputStream.java:149) 
> ~[na:1.8.0_212] at 
> org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.fillBuffer(UncompressedSplitLineReader.java:59)
>  ~[hadoop-mapreduce-client-core-2.7.2.jar!/:na] at 
> org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:218) 
> ~[hadoop-common-2.9.0.jar!/:na] at 
> org.apache.hadoop.util.LineReader.readLine(LineReader.java:176) 
> ~[hadoop-common-2.9.0.jar!/:na] at 
> org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.readLine(UncompressedSplitLineReader.java:91)
>  ~[hadoop-mapreduce-client-core-2.7.2.jar!/:na] at 
> org.apache.hadoop.mapreduce.lib.input.LineRecordReader.skipUtfByteOrderMark(LineRecordReader.java:144)
>  ~[hadoop-mapreduce-client-core-2.7.2.jar!/:na] at 
> org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:184)
>  ~[hadoop-mapreduce-client-core-2.7.2.jar!/:na] at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>  ~[spark-sql_2.11-2.3.1.jar!/:2.3.1] at 
> org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.hasNext(HadoopFileLinesReader.scala:50)
>  ~[spark-sql_2.11-2.3.1.jar!/:2.3.1] at 
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) 
> ~[scala-library-2.11.12.jar!/:na] at 
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) 
> ~[scala-library-2.11.12.jar!/:na] at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
>  ~[spark-sql_2.11-2.3.1.jar!/:2.3.1] at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:186)
>  ~[spark-sql_2.11-2.3.1.jar!/:2.3.1] at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
>  ~[spark-sql_2.11-2.3.1.jar!/:2.3.1] at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source) ~[na:na] at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  ~[spark-sql_2.11-2.3.1.jar!/:2.3.1] at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
>  ~[spark-sql_2.11-2.3.1.jar!/:2.3.1] at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
>  ~[spark-sql_2.11-2.3.1.jar!/:2.3.1] at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>  ~[spark-sql_2.11-2.3.1.jar!/:2.3.1] at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>  ~[spark-core_2.11-2.3.1.jar!/:2.3.1] at 
> 

[jira] [Updated] (SPARK-30649) Azure Spark read : ContentMD5 header is missing in the response

2020-01-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30649:
-
Priority: Major  (was: Blocker)

> Azure Spark read :  ContentMD5 header is missing in the response
> 
>
> Key: SPARK-30649
> URL: https://issues.apache.org/jira/browse/SPARK-30649
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Nilesh Patil
>Priority: Major
>
> When we read a CSV file from Azure, we get an exception saying the 
> ContentMD5 header is missing in the response.
>  
>  
> 2020-01-27 11:03:12.255 ERROR 8 --- [r for task 1030] 
> o.a.h.fs.azure.NativeAzureFileSystem : Encountered Storage Exception for read 
> on Blob : 
> Performance_Dataset/PR_DS_1cr.csv/part-0-1af0b4b3-018e-4847-9441-3e5239c94e33-c000.csv
>  Exception details: java.io.IOException Error Code : MissingContentMD5Header 
> 2020-01-27 11:03:12.258 ERROR 8 --- [r for task 1030] 
> org.apache.spark.executor.Executor : Exception in task 0.0 in stage 626.0 
> (TID 1030)
> java.io.IOException: null at 
> com.microsoft.azure.storage.core.Utility.initIOException(Utility.java:737) 
> ~[azure-storage-5.0.0.jar!/:na] at 
> com.microsoft.azure.storage.blob.BlobInputStream.dispatchRead(BlobInputStream.java:264)
>  ~[azure-storage-5.0.0.jar!/:na] at 
> com.microsoft.azure.storage.blob.BlobInputStream.readInternal(BlobInputStream.java:448)
>  ~[azure-storage-5.0.0.jar!/:na] at 
> com.microsoft.azure.storage.blob.BlobInputStream.read(BlobInputStream.java:420)
>  ~[azure-storage-5.0.0.jar!/:na] at 
> org.apache.hadoop.fs.azure.BlockBlobInputStream.read(BlockBlobInputStream.java:281)
>  ~[hadoop-azure-2.9.0.jar!/:na] at 
> org.apache.hadoop.fs.azure.NativeAzureFileSystem$NativeAzureFsInputStream.read(NativeAzureFileSystem.java:882)
>  ~[hadoop-azure-2.9.0.jar!/:na] at 
> java.io.BufferedInputStream.read1(BufferedInputStream.java:284) 
> ~[na:1.8.0_212] at 
> java.io.BufferedInputStream.read(BufferedInputStream.java:345) 
> ~[na:1.8.0_212] at java.io.DataInputStream.read(DataInputStream.java:149) 
> ~[na:1.8.0_212] at 
> org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.fillBuffer(UncompressedSplitLineReader.java:59)
>  ~[hadoop-mapreduce-client-core-2.7.2.jar!/:na] at 
> org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:218) 
> ~[hadoop-common-2.9.0.jar!/:na] at 
> org.apache.hadoop.util.LineReader.readLine(LineReader.java:176) 
> ~[hadoop-common-2.9.0.jar!/:na] at 
> org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.readLine(UncompressedSplitLineReader.java:91)
>  ~[hadoop-mapreduce-client-core-2.7.2.jar!/:na] at 
> org.apache.hadoop.mapreduce.lib.input.LineRecordReader.skipUtfByteOrderMark(LineRecordReader.java:144)
>  ~[hadoop-mapreduce-client-core-2.7.2.jar!/:na] at 
> org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:184)
>  ~[hadoop-mapreduce-client-core-2.7.2.jar!/:na] at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>  ~[spark-sql_2.11-2.3.1.jar!/:2.3.1] at 
> org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.hasNext(HadoopFileLinesReader.scala:50)
>  ~[spark-sql_2.11-2.3.1.jar!/:2.3.1] at 
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) 
> ~[scala-library-2.11.12.jar!/:na] at 
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) 
> ~[scala-library-2.11.12.jar!/:na] at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
>  ~[spark-sql_2.11-2.3.1.jar!/:2.3.1] at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:186)
>  ~[spark-sql_2.11-2.3.1.jar!/:2.3.1] at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
>  ~[spark-sql_2.11-2.3.1.jar!/:2.3.1] at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source) ~[na:na] at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  ~[spark-sql_2.11-2.3.1.jar!/:2.3.1] at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
>  ~[spark-sql_2.11-2.3.1.jar!/:2.3.1] at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
>  ~[spark-sql_2.11-2.3.1.jar!/:2.3.1] at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>  ~[spark-sql_2.11-2.3.1.jar!/:2.3.1] at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>  ~[spark-core_2.11-2.3.1.jar!/:2.3.1] at 
> 

[jira] [Commented] (SPARK-30650) The parquet file written by spark often incurs corrupted footer and hence not readable

2020-01-29 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026369#comment-17026369
 ] 

Hyukjin Kwon commented on SPARK-30650:
--

Spark versions before 2.3 are EOL. Can you verify whether similar issues exist 
in at least Spark 2.4.x? I am leaving this resolved until that's verified.

> The parquet file written by spark often incurs corrupted footer and hence not 
> readable 
> ---
>
> Key: SPARK-30650
> URL: https://issues.apache.org/jira/browse/SPARK-30650
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Input/Output, Optimizer
>Affects Versions: 1.6.1
>Reporter: DILIP KUMAR MOHAPATRO
>Priority: Major
>
> This issue is similar to an archived one,
> [https://mail-archives.apache.org/mod_mbox/spark-issues/201501.mbox/%3cjira.12767358.1421214067000.78480.1421214094...@atlassian.jira%3E]
> Parquet files written by Spark often end up with a corrupted footer and hence 
> are not readable by Spark.
> The issue is more consistent as the granularity of a field increases, i.e. 
> when the redundancy of values in the dataset is reduced (more unique values).
> Coalesce doesn't help here either. It automatically generates a certain 
> number of parquet files, each with a definite size controlled by Spark 
> internals, but a few of them are written with a corrupted footer, even though 
> the writing job ends with success status.
> Here are a few examples:
> There are files (267.2 M each) that Spark 1.6.x generated, but a few of them 
> have a corrupted footer and hence are not readable. This happens more 
> frequently when the input file size exceeds a certain limit; the level of 
> redundancy of the data also matters. For the same file size, the lower the 
> redundancy, the higher the probability of the footer being corrupted.
> Hence, in later iterations of the job, when those files must be read for 
> processing, it ends up with
> {{{color:#FF}*Can not read value 0 in block _n_ in file *{color}}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30650) The parquet file written by spark often incurs corrupted footer and hence not readable

2020-01-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30650.
--
Resolution: Incomplete

> The parquet file written by spark often incurs corrupted footer and hence not 
> readable 
> ---
>
> Key: SPARK-30650
> URL: https://issues.apache.org/jira/browse/SPARK-30650
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Input/Output, Optimizer
>Affects Versions: 1.6.1
>Reporter: DILIP KUMAR MOHAPATRO
>Priority: Major
>
> This issue is similar to an archived one,
> [https://mail-archives.apache.org/mod_mbox/spark-issues/201501.mbox/%3cjira.12767358.1421214067000.78480.1421214094...@atlassian.jira%3E]
> Parquet files written by Spark often end up with a corrupted footer and hence 
> are not readable by Spark.
> The issue is more consistent as the granularity of a field increases, i.e. 
> when the redundancy of values in the dataset is reduced (more unique values).
> Coalesce doesn't help here either. It automatically generates a certain 
> number of parquet files, each with a definite size controlled by Spark 
> internals, but a few of them are written with a corrupted footer, even though 
> the writing job ends with success status.
> Here are a few examples:
> There are files (267.2 M each) that Spark 1.6.x generated, but a few of them 
> have a corrupted footer and hence are not readable. This happens more 
> frequently when the input file size exceeds a certain limit; the level of 
> redundancy of the data also matters. For the same file size, the lower the 
> redundancy, the higher the probability of the footer being corrupted.
> Hence, in later iterations of the job, when those files must be read for 
> processing, it ends up with
> {{{color:#FF}*Can not read value 0 in block _n_ in file *{color}}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29578) JDK 1.8.0_232 timezone updates cause "Kwajalein" test failures again

2020-01-29 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026365#comment-17026365
 ] 

Dongjoon Hyun commented on SPARK-29578:
---

This landed in `branch-2.4` via [https://github.com/apache/spark/pull/27386].

> JDK 1.8.0_232 timezone updates cause "Kwajalein" test failures again
> 
>
> Key: SPARK-29578
> URL: https://issues.apache.org/jira/browse/SPARK-29578
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Minor
> Fix For: 2.4.5, 3.0.0
>
>
> I have a report that tests fail in JDK 1.8.0_232 because of timezone changes 
> in (I believe) tzdata2018i or later, per 
> https://www.oracle.com/technetwork/java/javase/tzdata-versions-138805.html:
> {{*** FAILED *** with 8634 did not equal 8633 Round trip of 8633 did not work 
> in tz}}
> See also https://issues.apache.org/jira/browse/SPARK-24950
> I say "I've heard" because I can't get this version easily on my Mac. However 
> the fix is so inconsequential that I think we can just make it, to allow this 
> additional variation just as before.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29578) JDK 1.8.0_232 timezone updates cause "Kwajalein" test failures again

2020-01-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29578:
--
Fix Version/s: 2.4.5

> JDK 1.8.0_232 timezone updates cause "Kwajalein" test failures again
> 
>
> Key: SPARK-29578
> URL: https://issues.apache.org/jira/browse/SPARK-29578
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Minor
> Fix For: 2.4.5, 3.0.0
>
>
> I have a report that tests fail in JDK 1.8.0_232 because of timezone changes 
> in (I believe) tzdata2018i or later, per 
> https://www.oracle.com/technetwork/java/javase/tzdata-versions-138805.html:
> {{*** FAILED *** with 8634 did not equal 8633 Round trip of 8633 did not work 
> in tz}}
> See also https://issues.apache.org/jira/browse/SPARK-24950
> I say "I've heard" because I can't get this version easily on my Mac. However 
> the fix is so inconsequential that I think we can just make it, to allow this 
> additional variation just as before.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30670) Pipes for PySpark

2020-01-29 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026354#comment-17026354
 ] 

Hyukjin Kwon commented on SPARK-30670:
--

There is already {{transform}}. 
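
The proposed {{pipe}} (quoted below) is plain reversed function application; a minimal, framework-free sketch of the pattern, with illustrative function names standing in for the proposal's {{add_session}} / {{remove_outliers}}:

```python
def pipe(obj, func, *args, **kwargs):
    """Apply func to obj, forwarding any extra arguments."""
    return func(obj, *args, **kwargs)

# Two small "verbs" to chain, in place of real DataFrame transformations.
def add_one(xs):
    return [x + 1 for x in xs]

def drop_below(xs, threshold=0):
    return [x for x in xs if x >= threshold]

# Chained application instead of nested calls:
result = pipe(pipe([0, 1, 2], add_one), drop_below, threshold=2)
print(result)  # [2, 3]
```

Spark's {{transform}} (mentioned above) gives the same single-function chaining on DataFrames; extra arguments can be bound with a lambda or {{functools.partial}}.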

> Pipes for PySpark
> -
>
> Key: SPARK-30670
> URL: https://issues.apache.org/jira/browse/SPARK-30670
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Vincent
>Priority: Major
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I would propose to add a `pipe` method to a Spark DataFrame. It allows for a 
> functional programming pattern, inspired by the tidyverse, that is currently 
> missing. The pandas community also recently adopted this pattern, documented 
> [here|https://tomaugspurger.github.io/method-chaining.html].
> This is the idea. Suppose you had this;
> {code:java}
> # file that has [user, date, timestamp, eventtype]
> ddf = spark.read.parquet("")
> w_user = Window().partitionBy("user")
> w_user_date = Window().partitionBy("user", "date")
> w_user_time = Window().partitionBy("user").orderBy("timestamp")
> thres_sesstime = 60 * 15 
> min_n_rows = 10
> min_n_sessions = 5
> clean_ddf = (ddf
>   .withColumn("delta", sf.col("timestamp") - sf.lag("timestamp").over(w_user))
>   .withColumn("new_session", (sf.col("delta") > 
> thres_sesstime).cast("integer"))
>   .withColumn("session", sf.sum(sf.col("new_session")).over(w_user))
>   .drop("new_session")
>   .drop("delta")
>   .withColumn("nrow_user", sf.count(sf.col("timestamp")))
>   .withColumn("nrow_user_date", sf.approx_count_distinct(sf.col("date")))
>   .filter(sf.col("nrow_user") > min_n_rows)
>   .filter(sf.col("nrow_user_date") > min_n_sessions)
>   .drop("nrow_user")
>   .drop("nrow_user_date"))
> {code}
> The code works and it is somewhat clear. We add a session to the dataframe 
> and then we use this to remove outliers. The issue is that this chain of 
> commands can get quite long so instead it might be better to turn this into 
> functions.
> {code:java}
> def add_session(dataf, session_threshold=60*15):
>     w_user_time = Window().partitionBy("user").orderBy("timestamp")
> 
>     return (dataf
>         .withColumn("delta", sf.col("timestamp") - sf.lag("timestamp").over(w_user_time))
>         .withColumn("new_session", (sf.col("delta") > session_threshold).cast("integer"))
>         .withColumn("session", sf.sum(sf.col("new_session")).over(w_user_time))
>         .drop("new_session")
>         .drop("delta"))
> 
> 
> def remove_outliers(dataf, min_n_rows=10, min_n_sessions=5):
>     w_user = Window().partitionBy("user")
> 
>     return (dataf
>         .withColumn("nrow_user", sf.count(sf.col("timestamp")).over(w_user))
>         .withColumn("nrow_user_date", sf.approx_count_distinct(sf.col("date")).over(w_user))
>         .filter(sf.col("nrow_user") > min_n_rows)
>         .filter(sf.col("nrow_user_date") > min_n_sessions)
>         .drop("nrow_user")
>         .drop("nrow_user_date"))
> {code}
> The issue does not lie in these functions. These functions are great! You can 
> unit test them, and they give you nice verbs that serve as an abstraction. 
> The issue is in how you now need to apply them.
> {code:java}
> remove_outliers(add_session(ddf, session_threshold=1000), min_n_rows=11)
> {code}
> It'd be much nicer to allow for this:
> {code:java}
> (ddf
>   .pipe(add_session, session_threshold=900)
>   .pipe(remove_outliers, min_n_rows=11))
> {code}
> The cool thing about this is that it enables method chaining, and it also 
> gives you a clean way to separate high-level from low-level code. You still 
> allow tweaking at the high level by exposing keyword arguments, but the 
> lower-level details are easy to find while debugging because they are 
> contained in their own functions.
> For code maintenance, I've relied on this pattern a lot personally. But so 
> far, I've always monkey-patched Spark to be able to do this.
> {code:java}
> from pyspark.sql import DataFrame
> 
> def pipe(self, func, *args, **kwargs):
>     return func(self, *args, **kwargs)
> 
> DataFrame.pipe = pipe
> {code}
> Could I perhaps add these few lines of code to the codebase?
>  
>  
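The chaining mechanics of the proposed pipe method can be demonstrated outside Spark with a plain Python class. This is a minimal sketch; the `Frame` class and the two helper functions are hypothetical stand-ins for illustration, not Spark APIs:

```python
# Minimal sketch of the pipe pattern, with a hypothetical Frame class
# standing in for a Spark DataFrame.
class Frame:
    def __init__(self, rows):
        self.rows = rows

    def pipe(self, func, *args, **kwargs):
        # pipe simply calls func(self, ...), which lets free functions
        # that take a frame and return a frame participate in chaining.
        return func(self, *args, **kwargs)

def add_doubled(frame, factor=2):
    # stand-in for add_session: a pure frame -> frame transformation
    return Frame([r * factor for r in frame.rows])

def remove_small(frame, min_value=0):
    # stand-in for remove_outliers: filters rows
    return Frame([r for r in frame.rows if r >= min_value])

result = (Frame([1, -2, 3])
          .pipe(add_doubled, factor=10)
          .pipe(remove_small, min_value=0))
print(result.rows)  # [10, 30]
```

The chained form reads top-to-bottom, while the keyword arguments keep the high-level tuning knobs visible at the call site.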



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30670) Pipes for PySpark

2020-01-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30670.
--
Resolution: Duplicate

> Pipes for PySpark
> -
>
> Key: SPARK-30670
> URL: https://issues.apache.org/jira/browse/SPARK-30670
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Vincent
>Priority: Major
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>






[jira] [Resolved] (SPARK-30529) Improve error messages when Executor dies before registering with driver

2020-01-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30529.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27385
[https://github.com/apache/spark/pull/27385]

> Improve error messages when Executor dies before registering with driver
> 
>
> Key: SPARK-30529
> URL: https://issues.apache.org/jira/browse/SPARK-30529
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, when you give the executor a bad configuration for 
> accelerator-aware scheduling, the executors can die, but it's hard for the 
> user to know why. The executor logs what went wrong in its log files before 
> dying, but those logs are often hard to find because the executor hasn't 
> registered yet. Since it hasn't registered, the executor doesn't show up in 
> the UI, so there is no link to its log files.
> One specific example is a discovery script that doesn't find all the GPUs:
> {code}
> 20/01/16 08:59:24 INFO YarnCoarseGrainedExecutorBackend: Connecting to 
> driver: spark://CoarseGrainedScheduler@10.28.9.112:44403
> 20/01/16 08:59:24 ERROR Inbox: Ignoring error
> java.lang.IllegalArgumentException: requirement failed: Resource: gpu, with 
> addresses: 0 is less than what the user requested: 2)
>  at scala.Predef$.require(Predef.scala:281)
>  at 
> org.apache.spark.resource.ResourceUtils$.$anonfun$assertAllResourceAllocationsMatchResourceProfile$1(ResourceUtils.scala:251)
>  at 
> org.apache.spark.resource.ResourceUtils$.$anonfun$assertAllResourceAllocationsMatchResourceProfile$1$adapted(ResourceUtils.scala:248)
> {code}
>  
> We should figure out a better way of logging, or of letting the user know 
> what error occurred, when the executor dies before registering.






[jira] [Assigned] (SPARK-30529) Improve error messages when Executor dies before registering with driver

2020-01-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30529:
-

Assignee: Thomas Graves

> Improve error messages when Executor dies before registering with driver
> 
>
> Key: SPARK-30529
> URL: https://issues.apache.org/jira/browse/SPARK-30529
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
>






[jira] [Commented] (SPARK-29419) Seq.toDS / spark.createDataset(Seq) is not thread-safe

2020-01-29 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026313#comment-17026313
 ] 

Dongjoon Hyun commented on SPARK-29419:
---

I switched this to `Bug` because it is marked as `correctness`, which makes it 
a blocker for 3.0.0.

> Seq.toDS / spark.createDataset(Seq) is not thread-safe
> --
>
> Key: SPARK-29419
> URL: https://issues.apache.org/jira/browse/SPARK-29419
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
>  Labels: correctness
>
> The {{Seq.toDS}} and {{spark.createDataset(Seq)}} code is not thread-safe: if 
> the caller-supplied {{Encoder}} is used in multiple threads, then 
> {{createDataset}}'s usage of the encoder may lead to incorrect answers, 
> because the encoder's internal mutable state will be updated from multiple 
> threads.
> Here is an example demonstrating the problem:
> {code:java}
> import org.apache.spark.sql._
> val enc = implicitly[Encoder[(Int, Int)]]
> val datasets = (1 to 100).par.map { _ =>
>   val pairs = (1 to 100).map(x => (x, x))
>   spark.createDataset(pairs)(enc)
> }
> datasets.reduce(_ union _).collect().foreach {
>   pair => require(pair._1 == pair._2, s"Pair elements are mismatched: $pair")
> }{code}
> Due to the thread-safety issue, the above example results in the creation of 
> corrupted records where different input records' fields are co-mingled.
> This bug is similar to SPARK-22355, a related problem in 
> {{Dataset.collect()}} (fixed in Spark 2.2.1+).
> Fortunately, this has a simple one-line fix (copy the encoder); I'll submit a 
> patch for this shortly.
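The shape of the fix (give each thread its own copy of the stateful encoder) can be sketched in plain Python. The `StatefulEncoder` below is a hypothetical stand-in for Spark's `ExpressionEncoder`, not actual Spark code:

```python
import copy
import threading

class StatefulEncoder:
    """Encoder with internal mutable scratch space, mimicking the
    hazard in ExpressionEncoder (illustrative only)."""
    def __init__(self):
        self.buffer = None

    def encode(self, pair):
        self.buffer = pair   # step 1: stash the input
        a, b = self.buffer   # step 2: read it back; if the encoder were
        return (a, b)        # shared, another thread could overwrite
                             # buffer between these two steps

shared = StatefulEncoder()
results = []
lock = threading.Lock()

def worker(x):
    # The one-line fix mirrors the patch's idea: copy the encoder so
    # each thread mutates only its own instance.
    enc = copy.deepcopy(shared)
    out = enc.encode((x, x))
    with lock:
        results.append(out)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# With per-thread copies, no record ends up with mismatched fields.
assert all(a == b for a, b in results)
```

Sharing the single `shared` instance across the workers instead would reintroduce the race that co-mingles fields from different records.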






[jira] [Updated] (SPARK-29419) Seq.toDS / spark.createDataset(Seq) is not thread-safe

2020-01-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29419:
--
Issue Type: Bug  (was: Improvement)

> Seq.toDS / spark.createDataset(Seq) is not thread-safe
> --
>
> Key: SPARK-29419
> URL: https://issues.apache.org/jira/browse/SPARK-29419
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
>  Labels: correctness
>






[jira] [Updated] (SPARK-29492) SparkThriftServer can't support jar class as table serde class when executestatement in sync mode

2020-01-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29492:
--
Affects Version/s: (was: 2.4.0)

> SparkThriftServer  can't support jar class as table serde class when 
> executestatement in sync mode
> --
>
> Key: SPARK-29492
> URL: https://issues.apache.org/jira/browse/SPARK-29492
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Add a UT in HiveThriftBinaryServerSuite:
> {code}
>   test("jar in sync mode") {
>     withCLIServiceClient { client =>
>       val user = System.getProperty("user.name")
>       val sessionHandle = client.openSession(user, "")
>       val confOverlay = new java.util.HashMap[java.lang.String, java.lang.String]
>       val jarFile = HiveTestJars.getHiveHcatalogCoreJar().getCanonicalPath
>       Seq(s"ADD JAR $jarFile",
>         "CREATE TABLE smallKV(key INT, val STRING)",
>         s"LOAD DATA LOCAL INPATH '${TestData.smallKv}' OVERWRITE INTO TABLE smallKV")
>         .foreach(query => client.executeStatement(sessionHandle, query, confOverlay))
>       client.executeStatement(sessionHandle,
>         """CREATE TABLE addJar(key string)
>           |ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
>         """.stripMargin, confOverlay)
>       client.executeStatement(sessionHandle,
>         "INSERT INTO TABLE addJar SELECT 'k1' as key FROM smallKV limit 1",
>         confOverlay)
>       val operationHandle = client.executeStatement(
>         sessionHandle,
>         "SELECT key FROM addJar",
>         confOverlay)
>       // Fetch result first time
>       assertResult(1, "Fetching result first time from next row") {
>         val rows_next = client.fetchResults(
>           operationHandle,
>           FetchOrientation.FETCH_NEXT,
>           1000,
>           FetchType.QUERY_OUTPUT)
>         rows_next.numRows()
>       }
>     }
>   }
> {code}
> Running it results in a ClassNotFound error.






[jira] [Updated] (SPARK-30538) A not very elegant way to control output small files

2020-01-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30538:
--
Affects Version/s: (was: 2.4.0)

> A not very elegant way to control output small files
> 
>
> Key: SPARK-30538
> URL: https://issues.apache.org/jira/browse/SPARK-30538
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
>







[jira] [Updated] (SPARK-29995) Structured Streaming file-sink log grow indefinitely

2020-01-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29995:
--
Issue Type: Bug  (was: Improvement)

> Structured Streaming file-sink log grow indefinitely
> 
>
> Key: SPARK-29995
> URL: https://issues.apache.org/jira/browse/SPARK-29995
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: zhang liming
>Priority: Major
> Attachments: file.png, task.png
>
>
> When I use the Structured Streaming parquet sink, I've noticed that the 
> file-sink log files in {{$checkpoint/_spark_metadata/}} keep getting bigger; 
> I don't think this is reasonable.
> And when the log files are compacted, task batches take longer to run, as 
> shown in the attached screenshots.






[jira] [Updated] (SPARK-29799) Split a kafka partition into multiple KafkaRDD partitions in the kafka external plugin for Spark Streaming

2020-01-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29799:
--
Affects Version/s: (was: 2.4.3)
   (was: 2.1.0)
   3.0.0

> Split a kafka partition into multiple KafkaRDD partitions in the kafka 
> external plugin for Spark Streaming
> --
>
> Key: SPARK-29799
> URL: https://issues.apache.org/jira/browse/SPARK-29799
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: zengrui
>Priority: Major
> Attachments: 0001-add-implementation-for-issue-SPARK-29799.patch
>
>
> When we use Spark Streaming to consume records from Kafka, the generated 
> KafkaRDD's partition count equals the Kafka topic's partition count, so we 
> cannot use more CPU cores to execute the streaming task unless we change the 
> topic's partition count, and we cannot increase the topic's partition count 
> indefinitely.
> I propose that we split a Kafka partition into multiple KafkaRDD partitions, 
> and make this configurable, so that more CPU cores can be used to execute the 
> streaming task.
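One way to realize such a split is to divide each Kafka partition's offset range for a batch into N contiguous sub-ranges, one per KafkaRDD partition. A sketch under that assumption (the helper name is illustrative, not from the attached patch):

```python
def split_offset_range(from_offset, until_offset, num_splits):
    """Split the half-open offset range [from_offset, until_offset)
    into num_splits contiguous sub-ranges whose sizes differ by at
    most one record, so work is balanced across RDD partitions."""
    total = until_offset - from_offset
    base, extra = divmod(total, num_splits)
    ranges, start = [], from_offset
    for i in range(num_splits):
        # the first `extra` sub-ranges absorb the remainder
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

# A batch covering offsets [100, 110) spread over 3 RDD partitions:
print(split_offset_range(100, 110, 3))  # [(100, 104), (104, 107), (107, 110)]
```

Because the sub-ranges are contiguous and non-overlapping, each RDD partition can issue its own bounded fetch to the same Kafka partition without duplicating or skipping records.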






[jira] [Updated] (SPARK-28817) Support standard Javadoc packaging to allow automatic javadoc location settings

2020-01-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28817:
--
Affects Version/s: (was: 2.4.3)
   3.0.0

> Support standard Javadoc packaging to allow automatic javadoc location 
> settings
> ---
>
> Key: SPARK-28817
> URL: https://issues.apache.org/jira/browse/SPARK-28817
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: Sure, if you can figure out how to enable deploying of 
> javadoc artifacts with the release, that'd do it. I think they're already 
> built of course, just not part of what gets pushed. From a quick look at the 
> build, I don't see what if anything would disable those artifacts, and I see 
> them built locally. The copy script also seems to take all .jar files. It's a 
> legit enhancement just not something I immediately see how to fix.
>Reporter: Marc Le Bihan
>Priority: Minor
>
> Currently Spark javadoc is only accessible here : 
> [http://spark.apache.org/docs/latest/api/java]
> Maven isn't able to find it through a _mvn dependency:resolve 
> -Dclassifier=javadoc_, and Eclipse, for example, isn't able to set the 
> javadoc location automatically.
> Therefore each developer, in each sub-project of a project that uses Spark, 
> must edit each jar related to Spark (about ten of them) and set the HTTP 
> location of the javadoc manually.
> Now 99% of the APIs available on Maven respect the standard of delivering a 
> separate downloadable javadoc through the javadoc classifier.
> Can Spark respect this standard too?






[jira] [Updated] (SPARK-28452) CSV datasource writer do not support maxCharsPerColumn option

2020-01-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28452:
--
Affects Version/s: (was: 2.4.3)
   3.0.0

> CSV datasource writer do not support maxCharsPerColumn option
> -
>
> Key: SPARK-28452
> URL: https://issues.apache.org/jira/browse/SPARK-28452
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Weichen Xu
>Priority: Minor
>
> The CSV datasource reader supports the maxCharsPerColumn option, but the CSV 
> datasource writer does not.
> Should we make the CSV datasource writer also support maxCharsPerColumn, so 
> that the reader and writer behave consistently on this option? Otherwise a 
> user may write a DataFrame to CSV successfully but then fail to load it back.






[jira] [Updated] (SPARK-29952) Pandas UDFs do not support vectors as input

2020-01-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29952:
--
Affects Version/s: (was: 2.4.3)
   3.0.0

> Pandas UDFs do not support vectors as input
> ---
>
> Key: SPARK-29952
> URL: https://issues.apache.org/jira/browse/SPARK-29952
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: koba
>Priority: Minor
>
> Currently, pandas UDFs do not support columns of vectors as input, only 
> columns of arrays. This means that feature columns containing Dense- or 
> SparseVectors, as generated by CountVectorizer for example, are not supported 
> by pandas UDFs out of the box; one needs to convert the vectors into arrays 
> first. This was not documented anywhere, and I had to find out by trial and 
> error. Below is an example.
>  
> {code:java}
> from pyspark.sql.functions import udf, pandas_udf
> import pyspark.sql.functions as F
> from pyspark.ml.linalg import DenseVector, Vectors, VectorUDT
> from pyspark.sql.types import *
> import numpy as np
> columns = ['features','id']
> vals = [
>  (DenseVector([1, 2, 1, 3]),1),
>  (DenseVector([2, 2, 1, 3]),2)
> ]
> sdf = spark.createDataFrame(vals,columns)
> sdf.show()
> +-----------------+---+
> |         features| id|
> +-----------------+---+
> |[1.0,2.0,1.0,3.0]|  1|
> |[2.0,2.0,1.0,3.0]|  2|
> +-----------------+---+
> {code}
> {code:java}
> @udf(returnType=ArrayType(FloatType()))
> def vector_to_array(v):
> # convert column of vectors into column of arrays
> a = v.values.tolist()
> return a
> sdf = sdf.withColumn('features_array',vector_to_array('features'))
> sdf.show()
> sdf.dtypes
> +-----------------+---+--------------------+
> |         features| id|      features_array|
> +-----------------+---+--------------------+
> |[1.0,2.0,1.0,3.0]|  1|[1.0, 2.0, 1.0, 3.0]|
> |[2.0,2.0,1.0,3.0]|  2|[2.0, 2.0, 1.0, 3.0]|
> +-----------------+---+--------------------+
> [('features', 'vector'), ('id', 'bigint'), ('features_array', 'array<float>')]
> {code}
> {code:java}
> import pandas as pd
> @pandas_udf(LongType())
> def _pandas_udf(v):
> res = []
> for i in v:
> res.append(i.mean())
> return pd.Series(res)
> sdf.select(_pandas_udf('features_array')).show()
> +---------------------------+
> |_pandas_udf(features_array)|
> +---------------------------+
> |                          1|
> |                          2|
> +---------------------------+
> {code}
> But If I use the vector column I get the following error.
> {code:java}
> sdf.select(_pandas_udf('features')).show()
> ---
> Py4JJavaError Traceback (most recent call last)
>  in 
>  13 
>  14 
> ---> 15 sdf.select(_pandas_udf('features')).show()
> ~/.pyenv/versions/anaconda3-5.3.1/lib/python3.7/site-packages/pyspark/sql/dataframe.py
>  in show(self, n, truncate, vertical)
> 376 """
> 377 if isinstance(truncate, bool) and truncate:
> --> 378 print(self._jdf.showString(n, 20, vertical))
> 379 else:
> 380 print(self._jdf.showString(n, int(truncate), vertical))
> ~/.pyenv/versions/3.4.4/lib/python3.4/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
>1255 answer = self.gateway_client.send_command(command)
>1256 return_value = get_return_value(
> -> 1257 answer, self.gateway_client, self.target_id, self.name)
>1258 
>1259 for temp_arg in temp_args:
> ~/.pyenv/versions/anaconda3-5.3.1/lib/python3.7/site-packages/pyspark/sql/utils.py
>  in deco(*a, **kw)
>  61 def deco(*a, **kw):
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
>  65 s = e.java_exception.toString()
> ~/.pyenv/versions/3.4.4/lib/python3.4/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py
>  in get_return_value(answer, gateway_client, target_id, name)
> 326 raise Py4JJavaError(
> 327 "An error occurred while calling {0}{1}{2}.\n".
> --> 328 format(target_id, ".", name), value)
> 329 else:
> 330 raise Py4JError(
> Py4JJavaError: An error occurred while calling o2635.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 156.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 156.0 (TID 606, localhost, executor driver): 
> java.lang.UnsupportedOperationException: Unsupported data type: 
> struct<type:tinyint,size:int,indices:array<int>,values:array<double>>
>   at 
> 

[jira] [Updated] (SPARK-30669) Introduce AdmissionControl API to Structured Streaming

2020-01-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30669:
--
Affects Version/s: (was: 2.4.4)
   3.0.0

> Introduce AdmissionControl API to Structured Streaming
> --
>
> Key: SPARK-30669
> URL: https://issues.apache.org/jira/browse/SPARK-30669
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Priority: Major
>
> In Structured Streaming, we have the concept of Triggers. With a trigger like 
> Trigger.Once(), the semantics are to process all the data available to the 
> datasource in a single micro-batch. However, this semantic can be broken when 
> data source options such as `maxOffsetsPerTrigger` (in the Kafka source) rate 
> limit the amount of data read for that micro-batch.
> We propose to add a new interface `SupportsAdmissionControl` and `ReadLimit`. 
> A ReadLimit defines how much data should be read in the next micro-batch. 
> `SupportsAdmissionControl` specifies that a source can rate limit its ingest 
> into the system. The source can tell the system what the user specified as a 
> read limit, and the system can enforce this limit within each micro-batch, or 
> impose its own limit if the Trigger is Trigger.Once(), for example.
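The admission-control idea can be sketched in a few lines: the source declares a read limit, and the engine clamps each micro-batch's end offset to it. The names below (`MaxRows`, `AllAvailable`, `next_batch_end`) are illustrative stand-ins, not the actual proposed API:

```python
from dataclasses import dataclass

@dataclass
class MaxRows:
    """Read-limit saying: at most max_rows records per micro-batch
    (analogous to a maxOffsetsPerTrigger-style source option)."""
    max_rows: int

@dataclass
class AllAvailable:
    """Read-limit saying: take everything available (Trigger.Once-like)."""
    pass

def next_batch_end(start_offset, available_offset, limit):
    """Decide where the next micro-batch ends, given the latest
    available offset and the source's declared read limit."""
    if isinstance(limit, AllAvailable):
        return available_offset
    return min(available_offset, start_offset + limit.max_rows)

# Kafka-like source, 10 000 records available, limit of 500 per batch:
print(next_batch_end(0, 10_000, MaxRows(500)))    # 500
print(next_batch_end(0, 10_000, AllAvailable()))  # 10000
```

Under this shape, the engine (not the source) owns the decision, so it can override the source's limit, e.g. forcing `AllAvailable` semantics for a Trigger.Once() run.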






[jira] [Commented] (SPARK-30665) Remove Pandoc dependency in PySpark setup.py

2020-01-29 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026309#comment-17026309
 ] 

Dongjoon Hyun commented on SPARK-30665:
---

Hi, [~nchammas]. 

For `Improvement`, `Affected Version` should be the version of the master 
branch, because we don't allow backporting of improvements.

> Remove Pandoc dependency in PySpark setup.py
> 
>
> Key: SPARK-30665
> URL: https://issues.apache.org/jira/browse/SPARK-30665
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, PySpark
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> PyPI now supports Markdown project descriptions, so we no longer need to 
> convert the Spark README into ReStructuredText and thus no longer need 
> pypandoc.
> Removing pypandoc has the added benefit of eliminating the failure mode 
> described in [this PR|https://github.com/apache/spark/pull/18981].






[jira] [Updated] (SPARK-30665) Remove Pandoc dependency in PySpark setup.py

2020-01-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30665:
--
Affects Version/s: (was: 2.4.4)
   (was: 2.4.3)
   3.0.0

> Remove Pandoc dependency in PySpark setup.py
> 
>
> Key: SPARK-30665
> URL: https://issues.apache.org/jira/browse/SPARK-30665
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, PySpark
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>






[jira] [Commented] (SPARK-28594) Allow event logs for running streaming apps to be rolled over.

2020-01-29 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026307#comment-17026307
 ] 

Jungtaek Lim commented on SPARK-28594:
--

While I commented on some remaining tasks for improvement, technically this 
issue is resolved, as all sub-tasks are resolved. I may file new JIRA issues 
for those items; there aren't many of them, so it shouldn't be a bother.

> Allow event logs for running streaming apps to be rolled over.
> --
>
> Key: SPARK-28594
> URL: https://issues.apache.org/jira/browse/SPARK-28594
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: This has been reported on 2.0.2.22 but affects all 
> currently available versions.
>Reporter: Stephen Levett
>Priority: Major
>
> In all current Spark releases, when event logging is enabled for Spark 
> Streaming applications, the event logs grow massively.  The files continue to 
> grow until the application is stopped or killed.
> The Spark history server then has difficulty processing the files.
> https://issues.apache.org/jira/browse/SPARK-8617 addresses .inprogress files, 
> but not event log files for applications that are still running.
> Can we identify a mechanism to set a "max file" size so that the file is 
> rolled over when it reaches this size?
>  
>  
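The "max file size" mechanism asked for above can be sketched in a few lines of 
plain Python. This is illustrative only: the class and file names are made up, 
and it is not Spark's actual event log writer.

```python
import os
import tempfile

class RollingLogWriter:
    """Toy size-based rollover writer: start a new file once the
    current one reaches max_bytes (the 'max file size' policy)."""

    def __init__(self, directory, max_bytes):
        self.directory = directory
        self.max_bytes = max_bytes
        self.index = 0
        self._open_next()

    def _open_next(self):
        self.path = os.path.join(self.directory, "events_%d.log" % self.index)
        self.file = open(self.path, "w")
        self.index += 1

    def write_event(self, line):
        self.file.write(line + "\n")
        self.file.flush()
        if self.file.tell() >= self.max_bytes:
            # Roll over: the closed file can be compacted or deleted
            # while the application keeps logging to a fresh file.
            self.file.close()
            self._open_next()

    def close(self):
        self.file.close()

log_dir = tempfile.mkdtemp()
writer = RollingLogWriter(log_dir, max_bytes=50)
for i in range(20):
    writer.write_event('{"Event":"tick","i":%d}' % i)
writer.close()
rolled_files = sorted(os.listdir(log_dir))
```

A history server pointed at such a directory could then process bounded files 
instead of one ever-growing log.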






[jira] [Resolved] (SPARK-28594) Allow event logs for running streaming apps to be rolled over.

2020-01-29 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-28594.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

> Allow event logs for running streaming apps to be rolled over.
> --
>
> Key: SPARK-28594
> URL: https://issues.apache.org/jira/browse/SPARK-28594
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: This has been reported on 2.0.2.22 but affects all 
> currently available versions.
>Reporter: Stephen Levett
>Priority: Major
> Fix For: 3.0.0
>
>






[jira] [Commented] (SPARK-29367) pandas udf not working with latest pyarrow release (0.15.0)

2020-01-29 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026292#comment-17026292
 ] 

Dongjoon Hyun commented on SPARK-29367:
---

The adjusted documentation was added to `branch-2.4` via 
[https://github.com/apache/spark/pull/27383].
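For anyone hitting this on 2.4.x, the workaround described in the adjusted 
documentation is, as I understand it, to have pyarrow >= 0.15 write the legacy 
Arrow IPC format that Spark 2.4's reader expects. A minimal sketch (verify the 
variable name against the branch-2.4 docs before relying on it):

```python
import os

# Sketch of the documented 2.4.x workaround: ask pyarrow >= 0.15 to
# emit the legacy Arrow IPC format that Spark 2.4's Arrow reader
# expects. The variable must be visible to the Python driver process...
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

# ...and to the executors as well, e.g. (hypothetical launch command):
#   spark-submit --conf spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT=1 app.py
legacy_ipc = os.environ["ARROW_PRE_0_15_IPC_FORMAT"]
```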

> pandas udf not working with latest pyarrow release (0.15.0)
> ---
>
> Key: SPARK-29367
> URL: https://issues.apache.org/jira/browse/SPARK-29367
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.4.0, 2.4.1, 2.4.3
>Reporter: Julien Peloton
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> Hi,
> I recently upgraded pyarrow from 0.14 to 0.15 (released on Oct 5th), and my 
> pyspark jobs using pandas udf are failing with 
> java.lang.IllegalArgumentException (tested with Spark 2.4.0, 2.4.1, and 
> 2.4.3). Here is a full example to reproduce the failure with pyarrow 0.15:
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> from pyspark.sql.types import BooleanType
> import pandas as pd
>
> @pandas_udf(BooleanType(), PandasUDFType.SCALAR)
> def qualitycuts(nbad: int, rb: float, magdiff: float) -> pd.Series:
>     """ Apply simple quality cuts
>
>     Returns
>     -------
>     out: pandas.Series of booleans
>         Return a Pandas Series with the appropriate flag: false for bad
>         alert, and true for good alert.
>     """
>     mask = nbad.values == 0
>     mask *= rb.values >= 0.55
>     mask *= abs(magdiff.values) <= 0.1
>     return pd.Series(mask)
>
> spark = SparkSession.builder.getOrCreate()
>
> # Create dummy DF
> colnames = ["nbad", "rb", "magdiff"]
> df = spark.sparkContext.parallelize(
>     zip(
>         [0, 1, 0, 0],
>         [0.01, 0.02, 0.6, 0.01],
>         [0.02, 0.05, 0.1, 0.01]
>     )
> ).toDF(colnames)
> df.show()
>
> # Apply cuts
> df = df\
>     .withColumn("toKeep", qualitycuts(*colnames))\
>     .filter("toKeep == true")\
>     .drop("toKeep")
>
> # This will fail if latest pyarrow 0.15.0 is used
> df.show()
> {code}
> and the log is:
> {code}
> Driver stacktrace:
> 19/10/07 09:37:49 INFO DAGScheduler: Job 3 failed: showString at 
> NativeMethodAccessorImpl.java:0, took 0.660523 s
> Traceback (most recent call last):
>   File 
> "/Users/julien/Documents/workspace/myrepos/fink-broker/test_pyarrow.py", line 
> 44, in 
> df.show()
>   File 
> "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py",
>  line 378, in show
>   File 
> "/Users/julien/Documents/workspace/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>   File 
> "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py",
>  line 63, in deco
>   File 
> "/Users/julien/Documents/workspace/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o64.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 
> (TID 5, localhost, executor driver): java.lang.IllegalArgumentException
>   at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
>   at 
> org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
>   at 
> org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
>   at 
> org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65)
>   at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:162)
>   at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
>   at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>   at 
> org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.(ArrowEvalPythonExec.scala:98)
>   at 
> org.apache.spark.sql.execution.python.ArrowEvalPythonExec.evaluate(ArrowEvalPythonExec.scala:96)
>   at 
> org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1.apply(EvalPythonExec.scala:127)
>   at 
> org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1.apply(EvalPythonExec.scala:89)

[jira] [Updated] (SPARK-29367) pandas udf not working with latest pyarrow release (0.15.0)

2020-01-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29367:
--
Fix Version/s: 2.4.5

> pandas udf not working with latest pyarrow release (0.15.0)
> ---
>
> Key: SPARK-29367
> URL: https://issues.apache.org/jira/browse/SPARK-29367
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.4.0, 2.4.1, 2.4.3
>Reporter: Julien Peloton
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>

[jira] [Commented] (SPARK-12312) JDBC connection to Kerberos secured databases fails on remote executors

2020-01-29 Thread Foster Langbein (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026287#comment-17026287
 ] 

Foster Langbein commented on SPARK-12312:
-

Yes, we followed the same approach referred to by nabacg, wrapping the MS JDBC 
driver as explained at 
[https://datamountaineer.com/2016/01/15/spark-jdbc-sql-server-kerberos/].

It tends to be a bit brittle, though, as Microsoft updates the driver.

> JDBC connection to Kerberos secured databases fails on remote executors
> ---
>
> Key: SPARK-12312
> URL: https://issues.apache.org/jira/browse/SPARK-12312
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 2.4.2
>Reporter: nabacg
>Priority: Minor
>
> When loading DataFrames from JDBC datasource with Kerberos authentication, 
> remote executors (yarn-client/cluster etc. modes) fail to establish a 
> connection due to lack of Kerberos ticket or ability to generate it. 
> This is a real issue when trying to ingest data from kerberized data sources 
> (SQL Server, Oracle) in enterprise environments where exposing simple 
> authentication access is not an option due to IT policy.






[jira] [Updated] (SPARK-30310) SparkUncaughtExceptionHandler halts running process unexpectedly

2020-01-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30310:
--
Fix Version/s: 2.4.5

> SparkUncaughtExceptionHandler halts running process unexpectedly
> 
>
> Key: SPARK-30310
> URL: https://issues.apache.org/jira/browse/SPARK-30310
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Tin Hang To
>Assignee: Tin Hang To
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> During 2.4.x testing, we have seen many occasions where the Worker process 
> would just go DEAD unexpectedly, with the Worker log ending with:
>  
> {{ERROR SparkUncaughtExceptionHandler: scala.MatchError:  <...callstack...>}}
>  
> We get the same callstack during our 2.3.x testing but the Worker process 
> stays up.
> Upon looking at the 2.4.x SparkUncaughtExceptionHandler.scala compared to the 
> 2.3.x version,  we found out SPARK-24294 introduced the following change:
> {{exception catch {}}
> {{  case _: OutOfMemoryError =>}}
> {{    System.exit(SparkExitCode.OOM)}}
> {{  case e: SparkFatalException if e.throwable.isInstanceOf[OutOfMemoryError] 
> =>}}
> {{    // SPARK-24294: This is defensive code, in case that 
> SparkFatalException is}}
> {{    // misused and uncaught.}}
> {{    System.exit(SparkExitCode.OOM)}}
> {{  case _ if exitOnUncaughtException =>}}
> {{    System.exit(SparkExitCode.UNCAUGHT_EXCEPTION)}}
> {{}}}
>  
> This code has the _ if exitOnUncaughtException case, but not the other _ 
> cases.  As a result, when exitOnUncaughtException is false (Master and 
> Worker) and exception doesn't match any of the match cases (e.g., 
> IllegalStateException), Scala throws MatchError(exception) ("MatchError" 
> wrapper of the original exception).  Then the other catch block down below 
> thinks we have another uncaught exception, and halts the entire process with 
> SparkExitCode.UNCAUGHT_EXCEPTION_TWICE.
>  
> {{catch {}}
> {{  case oom: OutOfMemoryError => Runtime.getRuntime.halt(SparkExitCode.OOM)}}
> {{  case t: Throwable => 
> Runtime.getRuntime.halt(SparkExitCode.UNCAUGHT_EXCEPTION_TWICE)}}
> {{}}}
>  
> Therefore, even when exitOnUncaughtException is false, the process will halt.
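The failure mode can be modeled outside Spark with a few lines of plain Python 
(an analogy only; all names below are invented): a non-exhaustive first-stage 
handler raises its own error, which the second-stage catch-all then mistakes 
for a second uncaught exception and halts.

```python
OOM = "OOM"
UNCAUGHT = "UNCAUGHT_EXCEPTION"
UNCAUGHT_TWICE = "UNCAUGHT_EXCEPTION_TWICE"

class NonExhaustiveMatch(Exception):
    """Stand-in for Scala's MatchError."""

def handle(exc, exit_on_uncaught_exception):
    try:
        # First stage mirrors the 2.4.x match expression: there is no
        # wildcard case when exit_on_uncaught_exception is False, so an
        # unmatched exception raises a MatchError-like error of its own.
        if isinstance(exc, MemoryError):
            return OOM
        if exit_on_uncaught_exception:
            return UNCAUGHT
        raise NonExhaustiveMatch(exc)
    except MemoryError:
        return OOM
    except Exception:
        # Second stage believes the handler itself blew up and halts
        # with UNCAUGHT_EXCEPTION_TWICE.
        return UNCAUGHT_TWICE

# Master and Worker run with exitOnUncaughtException=False, yet an
# IllegalStateException-like error still "halts" the process:
result = handle(RuntimeError("illegal state"), exit_on_uncaught_exception=False)
```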






[jira] [Commented] (SPARK-30310) SparkUncaughtExceptionHandler halts running process unexpectedly

2020-01-29 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026286#comment-17026286
 ] 

Dongjoon Hyun commented on SPARK-30310:
---

This is backported to branch-2.4 via https://github.com/apache/spark/pull/27384

> SparkUncaughtExceptionHandler halts running process unexpectedly
> 
>
> Key: SPARK-30310
> URL: https://issues.apache.org/jira/browse/SPARK-30310
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Tin Hang To
>Assignee: Tin Hang To
>Priority: Major
> Fix For: 3.0.0
>
>






[jira] [Created] (SPARK-30671) SparkSession emptyDataFrame should not create an RDD

2020-01-29 Thread Jira
Herman van Hövell created SPARK-30671:
-

 Summary: SparkSession emptyDataFrame should not create an RDD
 Key: SPARK-30671
 URL: https://issues.apache.org/jira/browse/SPARK-30671
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.4.4
Reporter: Herman van Hövell
Assignee: Herman van Hövell


SparkSession.emptyDataFrame uses an empty RDD in the background. This is a bit 
of a pity because the optimizer can't recognize this as being empty and it 
won't apply things like {{PropagateEmptyRelation}}.
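To illustrate the point, here is a toy sketch in Python (not Spark's Catalyst 
code; the class names are invented): a rule like {{PropagateEmptyRelation}} can 
only fire when emptiness is visible at planning time, which an opaque RDD scan 
never is.

```python
from dataclasses import dataclass

@dataclass
class LocalRelation:
    """Rows known at planning time; emptiness is statically visible."""
    rows: list

@dataclass
class RDDScan:
    """Opaque RDD-backed source; emptiness is unknown until runtime."""

@dataclass
class Join:
    left: object
    right: object

def propagate_empty_relation(plan):
    """Toy version of the rule: a join with a provably empty side
    collapses to an empty relation."""
    if isinstance(plan, Join):
        for side in (plan.left, plan.right):
            if isinstance(side, LocalRelation) and not side.rows:
                return LocalRelation([])
    return plan

# An emptyDataFrame backed by a statically empty relation lets the rule fire...
optimized = propagate_empty_relation(Join(LocalRelation([]), RDDScan()))

# ...while the RDD-backed empty plan stays a join, because the optimizer
# cannot prove the RDD side is empty.
unoptimized = propagate_empty_relation(Join(RDDScan(), RDDScan()))
```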






[jira] [Resolved] (SPARK-29543) Support Structured Streaming UI

2020-01-29 Thread Shixiong Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-29543.
--
Fix Version/s: 3.0.0
 Assignee: Genmao Yu
   Resolution: Done

> Support Structured Streaming UI
> ---
>
> Key: SPARK-29543
> URL: https://issues.apache.org/jira/browse/SPARK-29543
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming, Web UI
>Affects Versions: 3.0.0
>Reporter: Genmao Yu
>Assignee: Genmao Yu
>Priority: Major
> Fix For: 3.0.0
>
>
> Opening this JIRA to track support for a Structured Streaming UI.






[jira] [Commented] (SPARK-12312) JDBC connection to Kerberos secured databases fails on remote executors

2020-01-29 Thread John Lonergan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026245#comment-17026245
 ] 

John Lonergan commented on SPARK-12312:
---

@nabacg funnily enough, we came up with almost the same solution, but for 
Oracle, which also has this issue.

By the way, the Oracle driver is painful because it tramples all over the 
static JAAS configuration, wiping it out; Microsoft's driver also used to do 
this but fixed it recently. At least Microsoft open-sourced its JDBC driver, 
which encourages healthy debate and fixes. Come on, Oracle.

> JDBC connection to Kerberos secured databases fails on remote executors
> ---
>
> Key: SPARK-12312
> URL: https://issues.apache.org/jira/browse/SPARK-12312
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 2.4.2
>Reporter: nabacg
>Priority: Minor
>






[jira] [Commented] (SPARK-27733) Upgrade to Avro 1.9.x

2020-01-29 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-27733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026240#comment-17026240
 ] 

Ismaël Mejía commented on SPARK-27733:
--

Oh, I was not aware that Spark will still rely on Hive 1.2.1; then it will be 
tough. My idea was to get HIVE-21737 fixed, eventually backported into version 
2.3.7, and then into Spark 3.0 or maybe 3.1.

So we are in a really blocking situation. Any ideas on how this could be 
unblocked, [~dongjoon]? It is a pity in particular because Avro 1.9.x has lots 
of improvements that would be worth catching up on.

> Upgrade to Avro 1.9.x
> -
>
> Key: SPARK-27733
> URL: https://issues.apache.org/jira/browse/SPARK-27733
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 3.0.0
>Reporter: Ismaël Mejía
>Priority: Minor
>
> Avro 1.9.0 was released with many nice features, including a reduced size 
> (1 MB smaller), removed dependencies (no paranamer, no shaded Guava), and 
> security updates, so it is probably a worthwhile upgrade.






[jira] [Updated] (SPARK-30512) Use a dedicated boss event group loop in the netty pipeline for external shuffle service

2020-01-29 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-30512:
--
Fix Version/s: 2.4.5

> Use a dedicated boss event group loop in the netty pipeline for external 
> shuffle service
> 
>
> Key: SPARK-30512
> URL: https://issues.apache.org/jira/browse/SPARK-30512
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> We have been seeing a large number of SASL authentication (RPC requests) 
> timing out with the external shuffle service.
>  The issue and all the analysis we did is described here:
>  [https://github.com/netty/netty/issues/9890]
> I added a {{LoggingHandler}} to the netty pipeline and realized that even 
> channel registration is delayed by 30 seconds.
>  In the Spark External Shuffle service, the boss event group and the worker 
> event group are the same, which causes this delay.
> {code:java}
> EventLoopGroup bossGroup =
>   NettyUtils.createEventLoop(ioMode, conf.serverThreads(), 
> conf.getModuleName() + "-server");
> EventLoopGroup workerGroup = bossGroup;
> bootstrap = new ServerBootstrap()
>   .group(bossGroup, workerGroup)
>   .channel(NettyUtils.getServerChannelClass(ioMode))
>   .option(ChannelOption.ALLOCATOR, allocator)
>   .childOption(ChannelOption.ALLOCATOR, allocator);
> {code}
> When the load at the shuffle service increases, since the worker threads are 
> busy with existing channels, registering new channels gets delayed.
> The fix is simple. I created a dedicated boss thread event loop group with 1 
> thread.
> {code:java}
> EventLoopGroup bossGroup = NettyUtils.createEventLoop(ioMode, 1,
>   conf.getModuleName() + "-boss");
> EventLoopGroup workerGroup =  NettyUtils.createEventLoop(ioMode, 
> conf.serverThreads(),
> conf.getModuleName() + "-server");
> bootstrap = new ServerBootstrap()
>   .group(bossGroup, workerGroup)
>   .channel(NettyUtils.getServerChannelClass(ioMode))
>   .option(ChannelOption.ALLOCATOR, allocator)
> {code}
> This fixed the issue.
>  We just need 1 thread in the boss group because there is only a single 
> server bootstrap.
>  
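The shared-loop delay can be mimicked with plain Python threads (an analogy to 
the Netty setup above, not Netty itself; all names are invented): a 
"registration" queued on the busy worker loop waits behind existing channel 
work, while a dedicated boss thread handles it immediately.

```python
import queue
import threading
import time

def registration_delay(dedicated_boss):
    """How long does a new-channel 'registration' wait when the single
    worker loop is already busy with an existing channel?"""
    tasks = queue.Queue()
    done = threading.Event()
    delay = {}

    def loop():
        # One event-loop thread serving both accepts and channel work,
        # like the shared bossGroup/workerGroup in the snippet above.
        while not done.is_set() or not tasks.empty():
            try:
                fn = tasks.get(timeout=0.05)
            except queue.Empty:
                continue
            fn()

    worker = threading.Thread(target=loop)
    worker.start()
    tasks.put(lambda: time.sleep(0.3))  # existing channel keeps the loop busy

    start = time.monotonic()
    def register():
        delay["t"] = time.monotonic() - start

    if dedicated_boss:
        threading.Thread(target=register).start()  # dedicated boss thread
    else:
        tasks.put(register)  # registration queues behind channel work
    time.sleep(0.5)
    done.set()
    worker.join()
    return delay["t"]

shared = registration_delay(dedicated_boss=False)    # waits ~0.3s
dedicated = registration_delay(dedicated_boss=True)  # near-immediate
```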






[jira] [Assigned] (SPARK-30512) Use a dedicated boss event group loop in the netty pipeline for external shuffle service

2020-01-29 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-30512:
-

Assignee: Chandni Singh

> Use a dedicated boss event group loop in the netty pipeline for external 
> shuffle service
> 
>
> Key: SPARK-30512
> URL: https://issues.apache.org/jira/browse/SPARK-30512
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
> Fix For: 3.0.0
>
>






[jira] [Resolved] (SPARK-30512) Use a dedicated boss event group loop in the netty pipeline for external shuffle service

2020-01-29 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-30512.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

This could be backported into branch-2.x as well.

> Use a dedicated boss event group loop in the netty pipeline for external 
> shuffle service
> 
>
> Key: SPARK-30512
> URL: https://issues.apache.org/jira/browse/SPARK-30512
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: Chandni Singh
>Priority: Major
> Fix For: 3.0.0
>
>






[jira] [Commented] (SPARK-16452) basic INFORMATION_SCHEMA support

2020-01-29 Thread Aaron Steers (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-16452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026184#comment-17026184
 ] 

Aaron Steers commented on SPARK-16452:
--

Hello, everyone. I would like to revitalize this thread. Is there any plan to 
add support for INFORMATION_SCHEMA, which is part of SQL92?

Any possibility of revitalizing SPARK-16492, or of implementing this another 
way? That PR (marked stale/closed in 2016) references a possible future 
dependency on something called "Catalog Federation" and a goal to pick this up 
again after that was delivered. Do we know whether that federation has been 
implemented or is still planned, and if not, is there another viable path 
forward?

I think this is a hugely important feature for Spark, and I'm willing to help 
contribute if I can get direction on a possible path forward.

 

> basic INFORMATION_SCHEMA support
> 
>
> Key: SPARK-16452
> URL: https://issues.apache.org/jira/browse/SPARK-16452
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Major
> Attachments: INFORMATION_SCHEMAsupport.pdf
>
>
> INFORMATION_SCHEMA is part of SQL92 support. This ticket proposes adding a 
> few tables as defined in SQL92 standard to Spark SQL.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28556) Error should also be sent to QueryExecutionListener.onFailure

2020-01-29 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026147#comment-17026147
 ] 

Dongjoon Hyun commented on SPARK-28556:
---

That sounds great!

> Error should also be sent to QueryExecutionListener.onFailure
> -
>
> Key: SPARK-28556
> URL: https://issues.apache.org/jira/browse/SPARK-28556
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> Right now Error is not sent to QueryExecutionListener.onFailure. If there is 
> any Error when running a query, QueryExecutionListener.onFailure cannot be 
> triggered.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30670) Pipes for PySpark

2020-01-29 Thread Vincent (Jira)
Vincent created SPARK-30670:
---

 Summary: Pipes for PySpark
 Key: SPARK-30670
 URL: https://issues.apache.org/jira/browse/SPARK-30670
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.4.4
Reporter: Vincent


I propose adding a `pipe` method to the Spark DataFrame. It enables a 
functional programming pattern, inspired by the tidyverse, that is currently 
missing. The pandas community also recently adopted this pattern, documented 
[here](https://tomaugspurger.github.io/method-chaining.html).

This is the idea. Suppose you had this:


{code:java}
# file that has [user, date, timestamp, eventtype]
ddf = spark.read.parquet("")

w_user = Window().partitionBy("user")
w_user_date = Window().partitionBy("user", "date")
w_user_time = Window().partitionBy("user").orderBy("timestamp")

thres_sesstime = 60 * 15 
min_n_rows = 10
min_n_sessions = 5

clean_ddf = (ddf
  .withColumn("delta", sf.col("timestamp") - sf.lag("timestamp").over(w_user))
  .withColumn("new_session", (sf.col("delta") > thres_sesstime).cast("integer"))
  .withColumn("session", sf.sum(sf.col("new_session")).over(w_user))
  .drop("new_session")
  .drop("delta")
  .withColumn("nrow_user", sf.count(sf.col("timestamp")).over(w_user))
  .withColumn("nrow_user_date", sf.approx_count_distinct(sf.col("date")).over(w_user))
  .filter(sf.col("nrow_user") > min_n_rows)
  .filter(sf.col("nrow_user_date") > min_n_sessions)
  .drop("nrow_user")
  .drop("nrow_user_date"))
{code}
The code works and it is somewhat clear. We add a session to the dataframe and 
then use it to remove outliers. The issue is that this chain of commands can 
get quite long, so it might be better to turn it into functions.
{code:java}
def add_session(dataf, session_threshold=60 * 15):
    w_user = Window().partitionBy("user")

    return (dataf
        .withColumn("delta", sf.col("timestamp") - sf.lag("timestamp").over(w_user))
        .withColumn("new_session", (sf.col("delta") > session_threshold).cast("integer"))
        .withColumn("session", sf.sum(sf.col("new_session")).over(w_user))
        .drop("new_session")
        .drop("delta"))

def remove_outliers(dataf, min_n_rows=10, min_n_sessions=5):
    w_user = Window().partitionBy("user")

    return (dataf
        .withColumn("nrow_user", sf.count(sf.col("timestamp")).over(w_user))
        .withColumn("nrow_user_date", sf.approx_count_distinct(sf.col("date")).over(w_user))
        .filter(sf.col("nrow_user") > min_n_rows)
        .filter(sf.col("nrow_user_date") > min_n_sessions)
        .drop("nrow_user")
        .drop("nrow_user_date"))
{code}
The issue does not lie in these functions; they are great! You can unit-test 
them, and they give you nice verbs that serve as an abstraction. The issue is 
in how you now need to apply them.
{code:java}
remove_outliers(add_session(ddf, session_threshold=1000), min_n_rows=11)
{code}
It'd be much nicer to allow for this:
{code:java}
(ddf
  .pipe(add_session, session_threshold=900)
  .pipe(remove_outliers, min_n_rows=11))
{code}
The cool thing about this is that it easily allows method chaining, and it 
gives you a clean way to separate high-level code from low-level code. You 
still allow customization at a high level by exposing keyword arguments, but 
you can easily find the lower-level code when debugging because the details 
are contained in their functions.

For code maintenance, I've relied on this pattern a lot personally. But so 
far, I've always monkey-patched Spark to be able to do this.
{code:java}
from pyspark.sql import DataFrame 

def pipe(self, func, *args, **kwargs):
    return func(self, *args, **kwargs)

DataFrame.pipe = pipe
{code}
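As a self-contained sketch of the same pattern (plain Python, no Spark required; the `Frame` class and the helper functions below are illustrative stand-ins, not Spark APIs):

```python
class Frame:
    """Illustrative stand-in for a DataFrame (not a Spark API)."""
    def __init__(self, rows):
        self.rows = rows

    def pipe(self, func, *args, **kwargs):
        # Apply func to self, forwarding extra arguments; enables chaining.
        return func(self, *args, **kwargs)

def remove_small(frame, threshold=10):
    # Hypothetical transformation: keep rows above the threshold.
    return Frame([r for r in frame.rows if r > threshold])

def double(frame):
    # Hypothetical transformation: double every row value.
    return Frame([r * 2 for r in frame.rows])

result = Frame([5, 12, 30]).pipe(remove_small, threshold=10).pipe(double)
print(result.rows)  # [24, 60]
```

Each `pipe` call keeps the keyword arguments visible at the call site while the implementation details stay inside the named functions.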
Could I perhaps add these few lines of code to the codebase?

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency

2020-01-29 Thread Tyson Condie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026134#comment-17026134
 ] 

Tyson Condie commented on SPARK-30602:
--

The design looks like it brings in some good optimizations from prior work 
like Riffle. I've also seen that your performance numbers look great. It would 
be good to explore adding the necessary extension points so that shuffle 
implementations like this one can be easily incorporated into future Spark 
versions. Looking forward to following the development of this SPIP!

> SPIP: Support push-based shuffle to improve shuffle efficiency
> --
>
> Key: SPARK-30602
> URL: https://issues.apache.org/jira/browse/SPARK-30602
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: Min Shen
>Priority: Major
>
> In a large deployment of a Spark compute infrastructure, Spark shuffle is 
> becoming a potential scaling bottleneck and a source of inefficiency in the 
> cluster. When doing Spark on YARN for a large-scale deployment, people 
> usually enable Spark external shuffle service and store the intermediate 
> shuffle files on HDD. Because the number of blocks generated for a particular 
> shuffle grows quadratically compared to the size of shuffled data (# mappers 
> and reducers grows linearly with the size of shuffled data, but # blocks is # 
> mappers * # reducers), one general trend we have observed is that the more 
> data a Spark application processes, the smaller the block size becomes. In a 
> few production clusters we have seen, the average shuffle block size is only 
> 10s of KBs. Because of the inefficiency of performing random reads on HDD for 
> small amount of data, the overall efficiency of the Spark external shuffle 
> services serving the shuffle blocks degrades as we see an increasing # of 
> Spark applications processing an increasing amount of data. In addition, 
> because Spark external shuffle service is a shared service in a multi-tenancy 
> cluster, the inefficiency with one Spark application could propagate to other 
> applications as well.
> In this ticket, we propose a solution to improve Spark shuffle efficiency in 
> the above-mentioned environments with push-based shuffle. With push-based 
> shuffle, shuffle is performed at the end of mappers and blocks get pre-merged 
> and move towards reducers. In our prototype implementation, we have seen 
> significant efficiency improvements when performing large shuffles. We take a 
> Spark-native approach to achieve this, i.e., extending Spark’s existing 
> shuffle netty protocol, and the behaviors of Spark mappers, reducers and 
> drivers. This way, we can bring the benefits of more efficient shuffle in 
> Spark without incurring the dependency or overhead of either specialized 
> storage layer or external infrastructure pieces.
>  
> Link to dev mailing list discussion: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html
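The block-count arithmetic in the quoted description (# blocks = # mappers * # reducers) can be made concrete with illustrative numbers (my own, not from the ticket):

```python
# Illustrative: block count grows quadratically with cluster scale,
# so the per-block size shrinks as the shuffled data grows.
mappers = reducers = 10_000
shuffle_bytes = 1024**4          # assume 1 TiB of shuffled data
blocks = mappers * reducers      # one block per (mapper, reducer) pair
print(blocks)                    # 100000000
print(shuffle_bytes // blocks)   # 10995 -> roughly 10 KiB per block
```

At this scale the average block lands in the 10s-of-KBs range the ticket describes, which is where random HDD reads become the bottleneck.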



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28556) Error should also be sent to QueryExecutionListener.onFailure

2020-01-29 Thread Shixiong Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026098#comment-17026098
 ] 

Shixiong Zhu commented on SPARK-28556:
--

This reminds me that we should also review all public APIs to see if there is 
any similar issue, since 3.0.0 is a good chance to fix API bugs.

> Error should also be sent to QueryExecutionListener.onFailure
> -
>
> Key: SPARK-28556
> URL: https://issues.apache.org/jira/browse/SPARK-28556
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> Right now Error is not sent to QueryExecutionListener.onFailure. If there is 
> any Error when running a query, QueryExecutionListener.onFailure cannot be 
> triggered.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30662) ALS/MLP extend HasBlockSize

2020-01-29 Thread Huaxin Gao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026095#comment-17026095
 ] 

Huaxin Gao commented on SPARK-30662:


I will work on this.

> ALS/MLP extend HasBlockSize
> ---
>
> Key: SPARK-30662
> URL: https://issues.apache.org/jira/browse/SPARK-30662
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27733) Upgrade to Avro 1.9.x

2020-01-29 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026079#comment-17026079
 ] 

Dongjoon Hyun commented on SPARK-27733:
---

[~iemejia], I'm wondering why you think that. To me, it looks like the opposite.
{quote}Upgrading on the Spark side should be relatively straightforward, but I 
am not sure if the Hive upgrade could be easily aligned. For reference: HIVE-21737
{quote}
Apache Spark 3.0 will ship with both Hive 1.2.1 and Hive 2.3.6 built in. 
For HIVE-21737, the Hive community may consider Avro 1.9.1 only with Hive 3.x, 
but here the Apache Spark community needs to consider it with both Hive 1.2.1 
and Hive 2.3.6.

> Upgrade to Avro 1.9.x
> -
>
> Key: SPARK-27733
> URL: https://issues.apache.org/jira/browse/SPARK-27733
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 3.0.0
>Reporter: Ismaël Mejía
>Priority: Minor
>
> Avro 1.9.0 was released with many nice features, including reduced size (1 MB 
> less), removed dependencies (no paranamer, no shaded Guava), and security 
> updates, so it is probably a worthwhile upgrade.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28556) Error should also be sent to QueryExecutionListener.onFailure

2020-01-29 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026068#comment-17026068
 ] 

Dongjoon Hyun commented on SPARK-28556:
---

Got it. Thanks for the confirmation, [~zsxwing].

> Error should also be sent to QueryExecutionListener.onFailure
> -
>
> Key: SPARK-28556
> URL: https://issues.apache.org/jira/browse/SPARK-28556
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> Right now Error is not sent to QueryExecutionListener.onFailure. If there is 
> any Error when running a query, QueryExecutionListener.onFailure cannot be 
> triggered.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27733) Upgrade to Avro 1.9.x

2020-01-29 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-27733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026043#comment-17026043
 ] 

Ismaël Mejía commented on SPARK-27733:
--

Upgrading on the Spark side should be relatively straightforward, but I am not 
sure if the Hive upgrade could be easily aligned. For reference: HIVE-21737.

> Upgrade to Avro 1.9.x
> -
>
> Key: SPARK-27733
> URL: https://issues.apache.org/jira/browse/SPARK-27733
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 3.0.0
>Reporter: Ismaël Mejía
>Priority: Minor
>
> Avro 1.9.0 was released with many nice features, including reduced size (1 MB 
> less), removed dependencies (no paranamer, no shaded Guava), and security 
> updates, so it is probably a worthwhile upgrade.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26346) Upgrade parquet to 1.11.1

2020-01-29 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-26346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026041#comment-17026041
 ] 

Ismaël Mejía commented on SPARK-26346:
--

Since Parquet depends on Avro 1.9.1, shouldn't SPARK-27733 be done first?

> Upgrade parquet to 1.11.1
> -
>
> Key: SPARK-26346
> URL: https://issues.apache.org/jira/browse/SPARK-26346
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30582) Spark UI is not showing Aggregated Metrics by Executor in stage page

2020-01-29 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30582.
--
Resolution: Fixed

Issue resolved by pull request 27292
[https://github.com/apache/spark/pull/27292]

> Spark UI is not showing Aggregated Metrics by Executor in stage page
> 
>
> Key: SPARK-30582
> URL: https://issues.apache.org/jira/browse/SPARK-30582
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Saurabh Chawla
>Assignee: Saurabh Chawla
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: SparkUIStagePage.mov
>
>
> There are scenarios where the Spark History Server is located behind a VPC, 
> so API calls made to fetch the executor summary (allexecutors) can be 
> delayed. In the meantime, "stage-page-template.html" is loaded, and the 
> executor summary response is never added to it.
> As a result, Aggregated Metrics by Executor on the stage page shows blank.
> This scenario is easily encountered when a proxy server is responsible for 
> relaying requests and responses to the Spark History Server.
> It can be reproduced with Knox or in-house proxy servers that send and 
> receive responses to the Spark History Server.
> As an alternative way to test this, open the Spark UI with the browser's 
> developer tools and add a breakpoint in stagepage.js; this delays the 
> response, and Aggregated Metrics by Executor on the stage page shows blank.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30582) Spark UI is not showing Aggregated Metrics by Executor in stage page

2020-01-29 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-30582:


Assignee: Saurabh Chawla

> Spark UI is not showing Aggregated Metrics by Executor in stage page
> 
>
> Key: SPARK-30582
> URL: https://issues.apache.org/jira/browse/SPARK-30582
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Saurabh Chawla
>Assignee: Saurabh Chawla
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: SparkUIStagePage.mov
>
>
> There are scenarios where the Spark History Server is located behind a VPC, 
> so API calls made to fetch the executor summary (allexecutors) can be 
> delayed. In the meantime, "stage-page-template.html" is loaded, and the 
> executor summary response is never added to it.
> As a result, Aggregated Metrics by Executor on the stage page shows blank.
> This scenario is easily encountered when a proxy server is responsible for 
> relaying requests and responses to the Spark History Server.
> It can be reproduced with Knox or in-house proxy servers that send and 
> receive responses to the Spark History Server.
> As an alternative way to test this, open the Spark UI with the browser's 
> developer tools and add a breakpoint in stagepage.js; this delays the 
> response, and Aggregated Metrics by Executor on the stage page shows blank.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30582) Spark UI is not showing Aggregated Metrics by Executor in stage page

2020-01-29 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025944#comment-17025944
 ] 

Sean R. Owen commented on SPARK-30582:
--

[~saurabhc100] it's OK now as it's resolved, but generally do not set Target or 
Fix version.

> Spark UI is not showing Aggregated Metrics by Executor in stage page
> 
>
> Key: SPARK-30582
> URL: https://issues.apache.org/jira/browse/SPARK-30582
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Saurabh Chawla
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: SparkUIStagePage.mov
>
>
> There are scenarios where the Spark History Server is located behind a VPC, 
> so API calls made to fetch the executor summary (allexecutors) can be 
> delayed. In the meantime, "stage-page-template.html" is loaded, and the 
> executor summary response is never added to it.
> As a result, Aggregated Metrics by Executor on the stage page shows blank.
> This scenario is easily encountered when a proxy server is responsible for 
> relaying requests and responses to the Spark History Server.
> It can be reproduced with Knox or in-house proxy servers that send and 
> receive responses to the Spark History Server.
> As an alternative way to test this, open the Spark UI with the browser's 
> developer tools and add a breakpoint in stagepage.js; this delays the 
> response, and Aggregated Metrics by Executor on the stage page shows blank.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30582) Spark UI is not showing Aggregated Metrics by Executor in stage page

2020-01-29 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-30582:
-
Priority: Minor  (was: Major)

> Spark UI is not showing Aggregated Metrics by Executor in stage page
> 
>
> Key: SPARK-30582
> URL: https://issues.apache.org/jira/browse/SPARK-30582
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Saurabh Chawla
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: SparkUIStagePage.mov
>
>
> There are scenarios where the Spark History Server is located behind a VPC, 
> so API calls made to fetch the executor summary (allexecutors) can be 
> delayed. In the meantime, "stage-page-template.html" is loaded, and the 
> executor summary response is never added to it.
> As a result, Aggregated Metrics by Executor on the stage page shows blank.
> This scenario is easily encountered when a proxy server is responsible for 
> relaying requests and responses to the Spark History Server.
> It can be reproduced with Knox or in-house proxy servers that send and 
> receive responses to the Spark History Server.
> As an alternative way to test this, open the Spark UI with the browser's 
> developer tools and add a breakpoint in stagepage.js; this delays the 
> response, and Aggregated Metrics by Executor on the stage page shows blank.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12312) JDBC connection to Kerberos secured databases fails on remote executors

2020-01-29 Thread nabacg (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025860#comment-17025860
 ] 

nabacg commented on SPARK-12312:


Sure, will do [~gsomogyi]

> JDBC connection to Kerberos secured databases fails on remote executors
> ---
>
> Key: SPARK-12312
> URL: https://issues.apache.org/jira/browse/SPARK-12312
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 2.4.2
>Reporter: nabacg
>Priority: Minor
>
> When loading DataFrames from JDBC datasource with Kerberos authentication, 
> remote executors (yarn-client/cluster etc. modes) fail to establish a 
> connection due to lack of Kerberos ticket or ability to generate it. 
> This is a real issue when trying to ingest data from kerberized data sources 
> (SQL Server, Oracle) in enterprise environment where exposing simple 
> authentication access is not an option due to IT policy issues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12312) JDBC connection to Kerberos secured databases fails on remote executors

2020-01-29 Thread nabacg (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025254#comment-17025254
 ] 

nabacg edited comment on SPARK-12312 at 1/29/20 1:13 PM:
-

My suggestion was for people who need a working solution now and can't wait 
till there is a new Spark release out and/or can't easily upgrade their cluster 
installations (which happens in corporate multi-tenant situations, I've been 
there.. ).  

My approach avoids those problems by patching the JDBC driver instead of 
Spark. It's not a long-term solution, but perhaps it will save someone's skin. 
It certainly worked for me and one of my clients. 


was (Author: nabacg):
My suggestion was for people who need a working solution now and can't wait 
till there is a new Spark release out and/or can't easily upgrade their cluster 
installations (which happens in corporate multi-tenant situations, I've been 
there.. ).  

My approach avoids those problems by patching the JDBC driver, instead of 
Spark. It's not a long term solution, but perhaps will save someone some skin. 
I certainly worked for me and one of my clients. 

> JDBC connection to Kerberos secured databases fails on remote executors
> ---
>
> Key: SPARK-12312
> URL: https://issues.apache.org/jira/browse/SPARK-12312
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 2.4.2
>Reporter: nabacg
>Priority: Minor
>
> When loading DataFrames from JDBC datasource with Kerberos authentication, 
> remote executors (yarn-client/cluster etc. modes) fail to establish a 
> connection due to lack of Kerberos ticket or ability to generate it. 
> This is a real issue when trying to ingest data from kerberized data sources 
> (SQL Server, Oracle) in enterprise environment where exposing simple 
> authentication access is not an option due to IT policy issues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12312) JDBC connection to Kerberos secured databases fails on remote executors

2020-01-29 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025822#comment-17025822
 ] 

Gabor Somogyi commented on SPARK-12312:
---

[~nabacg] thanks for sharing; I'm pretty sure there are people keen on using it.
My intention is to provide something for the long term. Since you have quite a 
lot of experience in this area, I'd be happy to hear your opinion once I've 
created the PR (and of course, if you see gaps that can be filled, just share).

> JDBC connection to Kerberos secured databases fails on remote executors
> ---
>
> Key: SPARK-12312
> URL: https://issues.apache.org/jira/browse/SPARK-12312
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 2.4.2
>Reporter: nabacg
>Priority: Minor
>
> When loading DataFrames from JDBC datasource with Kerberos authentication, 
> remote executors (yarn-client/cluster etc. modes) fail to establish a 
> connection due to lack of Kerberos ticket or ability to generate it. 
> This is a real issue when trying to ingest data from kerberized data sources 
> (SQL Server, Oracle) in enterprise environment where exposing simple 
> authentication access is not an option due to IT policy issues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30668) to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern "yyyy-MM-dd'T'HH:mm:ss.SSSz"

2020-01-29 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025729#comment-17025729
 ] 

Maxim Gekk commented on SPARK-30668:


We can try to revert this [https://github.com/apache/spark/pull/23495] 

> to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern 
> "yyyy-MM-dd'T'HH:mm:ss.SSSz"
> 
>
> Key: SPARK-30668
> URL: https://issues.apache.org/jira/browse/SPARK-30668
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Blocker
>
> {code:java}
> SELECT to_timestamp("2020-01-27T20:06:11.847-0800", 
> "yyyy-MM-dd'T'HH:mm:ss.SSSz")
> {code}
> This returns a valid value in Spark 2.4 but returns NULL in the latest 
> master.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30668) to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern "yyyy-MM-dd'T'HH:mm:ss.SSSz"

2020-01-29 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-30668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025698#comment-17025698
 ] 

Herman van Hövell commented on SPARK-30668:
---

I don't think we should revert the proleptic Gregorian patch; the previous 
behavior was kind of broken.

[~maxgekk] can we move back to the previous behavior by using the old parser? 
And perhaps feature-flag that bit, or make it configurable.

> to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern 
> "yyyy-MM-dd'T'HH:mm:ss.SSSz"
> 
>
> Key: SPARK-30668
> URL: https://issues.apache.org/jira/browse/SPARK-30668
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Blocker
>
> {code:java}
> SELECT to_timestamp("2020-01-27T20:06:11.847-0800", 
> "yyyy-MM-dd'T'HH:mm:ss.SSSz")
> {code}
> This returns a valid value in Spark 2.4 but returns NULL in the latest 
> master.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30668) to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern "yyyy-MM-dd'T'HH:mm:ss.SSSz"

2020-01-29 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025681#comment-17025681
 ] 

Maxim Gekk edited comment on SPARK-30668 at 1/29/20 8:43 AM:
-

[~marmbrus] The doc for `to_timestamp` points to DateTimeFormatter: 
https://github.com/apache/spark/blob/d69ed9afdf2bd8d03aaf835292b92692ec8189e9/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L2964.
Please have a look at its documentation 
(https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html)
to see which patterns it supports for time zones and zone offsets. /cc [~srowen] 
[~dongjoon]


was (Author: maxgekk):
If [~marmbrus] develops something new, he could use correct pattern for zone 
offsets as it is pointed out in the Java docs: 
https://github.com/apache/spark/blob/d69ed9afdf2bd8d03aaf835292b92692ec8189e9/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L2964
 /cc [~srowen] [~dongjoon]







[jira] [Commented] (SPARK-30668) to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern "yyyy-MM-dd'T'HH:mm:ss.SSSz"

2020-01-29 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025681#comment-17025681
 ] 

Maxim Gekk commented on SPARK-30668:


If [~marmbrus] develops something new, he could use the correct pattern for zone 
offsets, as pointed out in the Java docs: 
https://github.com/apache/spark/blob/d69ed9afdf2bd8d03aaf835292b92692ec8189e9/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L2964
 /cc [~srowen] [~dongjoon]







[jira] [Commented] (SPARK-30668) to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern "yyyy-MM-dd'T'HH:mm:ss.SSSz"

2020-01-29 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025673#comment-17025673
 ] 

Xiao Li commented on SPARK-30668:
-

[~hvanhovell] Making it configurable looks necessary. Michael hit this today 
when trying the master branch. 



