[jira] [Updated] (SPARK-21869) A cached Kafka producer should not be closed if any task is using it.

2019-12-10 Thread Shixiong Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-21869:
-
Fix Version/s: (was: 3.0.0)

> A cached Kafka producer should not be closed if any task is using it.
> -
>
> Key: SPARK-21869
> URL: https://issues.apache.org/jira/browse/SPARK-21869
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Shixiong Zhu
>Assignee: Gabor Somogyi
>Priority: Major
>
> Right now a cached Kafka producer may be closed if a large task uses it for 
> more than 10 minutes.
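
Below is a minimal sketch of the reference-counting idea behind keeping an in-use
producer open: the cache may mark an entry expired after the timeout, but the
producer is only closed once no task still holds it. Class and method names here are
illustrative, not Spark's actual implementation.
{code}
// Illustrative sketch: close a cached producer only once the cache has expired it
// AND no task still holds a reference to it.
class RefCountedProducer[P](val producer: P, closeProducer: P => Unit) {
  private var refCount = 0
  private var expired = false
  private var closed = false

  def acquire(): P = synchronized { refCount += 1; producer }

  def release(): Unit = synchronized { refCount -= 1; maybeClose() }

  // Called by the cache when the timeout (e.g. 10 minutes) fires.
  def markExpired(): Unit = synchronized { expired = true; maybeClose() }

  private def maybeClose(): Unit = {
    if (expired && refCount == 0 && !closed) {
      closed = true
      closeProducer(producer)
    }
  }
}
{code}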






[jira] [Commented] (SPARK-21869) A cached Kafka producer should not be closed if any task is using it.

2019-12-10 Thread Shixiong Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-21869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16992975#comment-16992975
 ] 

Shixiong Zhu commented on SPARK-21869:
--

Reopened this. https://github.com/apache/spark/pull/25853 has been reverted.

> A cached Kafka producer should not be closed if any task is using it.
> -
>
> Key: SPARK-21869
> URL: https://issues.apache.org/jira/browse/SPARK-21869
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Shixiong Zhu
>Assignee: Gabor Somogyi
>Priority: Major
> Fix For: 3.0.0
>
>
> Right now a cached Kafka producer may be closed if a large task uses it for 
> more than 10 minutes.






[jira] [Reopened] (SPARK-21869) A cached Kafka producer should not be closed if any task is using it.

2019-12-10 Thread Shixiong Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reopened SPARK-21869:
--

> A cached Kafka producer should not be closed if any task is using it.
> -
>
> Key: SPARK-21869
> URL: https://issues.apache.org/jira/browse/SPARK-21869
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Shixiong Zhu
>Assignee: Gabor Somogyi
>Priority: Major
> Fix For: 3.0.0
>
>
> Right now a cached Kafka producer may be closed if a large task uses it for 
> more than 10 minutes.






[jira] [Created] (SPARK-30208) A race condition when reading from Kafka in PySpark

2019-12-10 Thread Shixiong Zhu (Jira)
Shixiong Zhu created SPARK-30208:


 Summary: A race condition when reading from Kafka in PySpark
 Key: SPARK-30208
 URL: https://issues.apache.org/jira/browse/SPARK-30208
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.4
Reporter: Jiawen Zhu


When using PySpark to read from Kafka, there is a race condition where Spark may 
use a KafkaConsumer from multiple threads at the same time and throw the following 
error:

{code}
java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
multi-threaded access
at 
kafkashaded.org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:2215)
at 
kafkashaded.org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:2104)
at 
kafkashaded.org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:2059)
at 
org.apache.spark.sql.kafka010.InternalKafkaConsumer.close(KafkaDataConsumer.scala:451)
at 
org.apache.spark.sql.kafka010.KafkaDataConsumer$NonCachedKafkaDataConsumer.release(KafkaDataConsumer.scala:508)
at 
org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.close(KafkaSourceRDD.scala:126)
at 
org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:66)
at 
org.apache.spark.sql.kafka010.KafkaSourceRDD$$anonfun$compute$3.apply(KafkaSourceRDD.scala:131)
at 
org.apache.spark.sql.kafka010.KafkaSourceRDD$$anonfun$compute$3.apply(KafkaSourceRDD.scala:130)
at 
org.apache.spark.TaskContext$$anon$1.onTaskCompletion(TaskContext.scala:162)
at 
org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:131)
at 
org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:131)
at 
org.apache.spark.TaskContextImpl$$anonfun$invokeListeners$1.apply(TaskContextImpl.scala:144)
at 
org.apache.spark.TaskContextImpl$$anonfun$invokeListeners$1.apply(TaskContextImpl.scala:142)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:142)
at 
org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:130)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:155)
at org.apache.spark.scheduler.Task.run(Task.scala:112)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1526)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{code}

When using PySpark, reading from Kafka actually happens in a separate writer 
thread rather than the task thread. When a task is terminated early (e.g., because 
there is a limit operator), the task thread may close the KafkaConsumer while the 
writer thread is still using it.
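
A minimal sketch of the kind of guard that avoids this race, assuming the
task-completion listener and the writer thread share one wrapper around the
consumer; the names are hypothetical, not Spark's actual fix.
{code}
// Serialize "read" and "close" on a single lock so the task-completion listener
// cannot close the consumer while the writer thread is still reading from it.
trait KafkaConsumerLike { def read(): Unit; def close(): Unit }

class GuardedConsumer(underlying: KafkaConsumerLike) {
  private var closed = false

  def read(): Unit = synchronized {
    require(!closed, "consumer already closed")
    underlying.read()
  }

  def close(): Unit = synchronized {
    if (!closed) { closed = true; underlying.close() }
  }
}
{code}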






[jira] [Resolved] (SPARK-29953) File stream source cleanup options may break a file sink output

2019-12-05 Thread Shixiong Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-29953.
--
Fix Version/s: 3.0.0
 Assignee: Jungtaek Lim
   Resolution: Fixed

> File stream source cleanup options may break a file sink output
> ---
>
> Key: SPARK-29953
> URL: https://issues.apache.org/jira/browse/SPARK-29953
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Shixiong Zhu
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0
>
>
> SPARK-20568 added options to file streaming source to clean up processed 
> files. However, when applying these options to a directory that was written 
> by a file streaming sink, it will make the directory not queryable any more 
> because we delete files from the directory but they are still tracked by file 
> sink logs.
> I think we should block the options if the input source is a file streaming 
> sink path (has "_spark_metadata" folder).
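
A hedged sketch of the proposed guard, assuming the check simply looks for a
"_spark_metadata" directory under the input path; this illustrates the idea, not the
change that was merged.
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Does the input directory look like the output of a file streaming sink?
def isFileSinkOutput(dir: String, hadoopConf: Configuration): Boolean = {
  val metadataDir = new Path(dir, "_spark_metadata")
  val fs = metadataDir.getFileSystem(hadoopConf)
  fs.exists(metadataDir) && fs.getFileStatus(metadataDir).isDirectory
}

// Usage sketch: fail fast before starting a query that enables the cleanup options.
// if (cleanupOptionEnabled && isFileSinkOutput(inputPath, spark.sessionState.newHadoopConf())) {
//   throw new IllegalArgumentException("source cleanup cannot be used on a file sink output directory")
// }
{code}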






[jira] [Updated] (SPARK-29953) File stream source cleanup options may break a file sink output

2019-11-18 Thread Shixiong Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-29953:
-
Description: 
SPARK-20568 added options to file streaming source to clean up processed files. 
However, when applying these options to a directory that was written by a file 
streaming sink, it will make the directory not queryable any more because we 
delete files from the directory but they are still tracked by file sink logs.

I think we should block the options if the input source is a file streaming 
sink path (has "_spark_metadata" folder).

  was:
SPARK-20568 added options to file streaming source to clean up processed files. 
However, when applying these options to a directory that was written by a file 
streaming sink, it will make the directory not queryable any more.

I think we should block the options if the input source is a file streaming 
sink path (has "_spark_metadata" folder).


> File stream source cleanup options may break a file sink output
> ---
>
> Key: SPARK-29953
> URL: https://issues.apache.org/jira/browse/SPARK-29953
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Shixiong Zhu
>Priority: Major
>
> SPARK-20568 added options to file streaming source to clean up processed 
> files. However, when applying these options to a directory that was written 
> by a file streaming sink, it will make the directory not queryable any more 
> because we delete files from the directory but they are still tracked by file 
> sink logs.
> I think we should block the options if the input source is a file streaming 
> sink path (has "_spark_metadata" folder).






[jira] [Commented] (SPARK-29953) File stream source cleanup options may break a file sink output

2019-11-18 Thread Shixiong Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16976971#comment-16976971
 ] 

Shixiong Zhu commented on SPARK-29953:
--

cc [~kabhwan]

> File stream source cleanup options may break a file sink output
> ---
>
> Key: SPARK-29953
> URL: https://issues.apache.org/jira/browse/SPARK-29953
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Shixiong Zhu
>Priority: Major
>
> SPARK-20568 added options to file streaming source to clean up processed 
> files. However, when applying these options to a directory that was written 
> by a file streaming sink, it will make the directory not queryable any more.
> I think we should block the options if the input source is a file streaming 
> sink path (has "_spark_metadata" folder).






[jira] [Created] (SPARK-29953) File stream source cleanup options may break a file sink output

2019-11-18 Thread Shixiong Zhu (Jira)
Shixiong Zhu created SPARK-29953:


 Summary: File stream source cleanup options may break a file sink 
output
 Key: SPARK-29953
 URL: https://issues.apache.org/jira/browse/SPARK-29953
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Shixiong Zhu


SPARK-20568 added options to file streaming source to clean up processed files. 
However, when applying these options to a directory that was written by a file 
streaming sink, it will make the directory not queryable any more.

I think we should block the options if the input source is a file streaming 
sink path (has "_spark_metadata" folder).






[jira] [Commented] (SPARK-28841) Spark cannot read a relative path containing ":"

2019-10-28 Thread Shixiong Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961412#comment-16961412
 ] 

Shixiong Zhu commented on SPARK-28841:
--

[~srowen] Yep. ":" is not a valid character in an HDFS path. But it is valid in some 
other file systems, such as the local file system or S3AFileSystem, and they allow 
the user to create such files. I pointed out the code that hits this issue in 
HDFS-14762.

> Spark cannot read a relative path containing ":"
> 
>
> Key: SPARK-28841
> URL: https://issues.apache.org/jira/browse/SPARK-28841
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Shixiong Zhu
>Priority: Major
>
> Reproducer:
> {code}
> spark.read.parquet("test:test")
> {code}
> Error:
> {code}
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: test:test
> {code}
> This is actually a Hadoop issue since the error is thrown from "new 
> Path("test:test")". I'm creating this ticket to see if we can work around 
> this issue in Spark.






[jira] [Commented] (SPARK-28841) Spark cannot read a relative path containing ":"

2019-10-28 Thread Shixiong Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961402#comment-16961402
 ] 

Shixiong Zhu commented on SPARK-28841:
--

[~srowen] "?" is to trigger glob pattern codes. It matches any chars. That's 
not a typo. The code I post here is trying to read files that match the pattern 
"/tmp/test/foo?bar". But if "/tmp/test/foo:bar" exists, it will fail because of 
HDFS-14762.

> Spark cannot read a relative path containing ":"
> 
>
> Key: SPARK-28841
> URL: https://issues.apache.org/jira/browse/SPARK-28841
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Shixiong Zhu
>Priority: Major
>
> Reproducer:
> {code}
> spark.read.parquet("test:test")
> {code}
> Error:
> {code}
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: test:test
> {code}
> This is actually a Hadoop issue since the error is thrown from "new 
> Path("test:test")". I'm creating this ticket to see if we can work around 
> this issue in Spark.






[jira] [Resolved] (SPARK-27254) Cleanup complete but becoming invalid output files in ManifestFileCommitProtocol if job is aborted

2019-09-27 Thread Shixiong Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-27254.
--
Fix Version/s: 3.0.0
 Assignee: Jungtaek Lim
   Resolution: Fixed

> Cleanup complete but becoming invalid output files in 
> ManifestFileCommitProtocol if job is aborted
> --
>
> Key: SPARK-27254
> URL: https://issues.apache.org/jira/browse/SPARK-27254
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Minor
> Fix For: 3.0.0
>
>
> ManifestFileCommitProtocol doesn't clean up complete (but soon-to-be invalid) 
> output files when a job is aborted.
> ManifestFileCommitProtocol doesn't do any cleanup when a job is aborted; it just 
> maintains the metadata listing which complete output files have been written. 
> SPARK-27210 addressed task-level cleanup, but it still doesn't clean up at the 
> job level.
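
A minimal sketch of the job-level cleanup idea, assuming the protocol already tracks
the paths of files written by completed tasks (as the manifest metadata does); this
is illustrative only, not the actual ManifestFileCommitProtocol code.
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Best-effort deletion of output files that were fully written by tasks but will
// never be registered in the sink log because the job aborted.
def cleanupOnJobAbort(committedFiles: Seq[String], hadoopConf: Configuration): Unit = {
  committedFiles.foreach { file =>
    val path = new Path(file)
    val fs = path.getFileSystem(hadoopConf)
    try fs.delete(path, false)
    catch { case _: java.io.IOException => () } // best effort; ignore failures
  }
}
{code}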






[jira] [Resolved] (SPARK-29099) org.apache.spark.sql.catalyst.catalog.CatalogTable.lastAccessTime is not set

2019-09-20 Thread Shixiong Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-29099.
--
Resolution: Duplicate

> org.apache.spark.sql.catalyst.catalog.CatalogTable.lastAccessTime is not set
> 
>
> Key: SPARK-29099
> URL: https://issues.apache.org/jira/browse/SPARK-29099
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Shixiong Zhu
>Priority: Major
>
> I noticed that 
> "org.apache.spark.sql.catalyst.catalog.CatalogTable.lastAccessTime" is always 
> 0 in my environment. It looks like Spark never updates this field in the 
> metastore when reading a table. This is fine, considering that updating it on 
> every table read would be expensive.
> However, "Last Access" in "describe extended" always shows "Thu Jan 01 
> 00:00:00 UTC 1970", which is confusing. Can we show something else to 
> indicate that it's not set?






[jira] [Created] (SPARK-29099) org.apache.spark.sql.catalyst.catalog.CatalogTable.lastAccessTime is not set

2019-09-16 Thread Shixiong Zhu (Jira)
Shixiong Zhu created SPARK-29099:


 Summary: 
org.apache.spark.sql.catalyst.catalog.CatalogTable.lastAccessTime is not set
 Key: SPARK-29099
 URL: https://issues.apache.org/jira/browse/SPARK-29099
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.4
Reporter: Shixiong Zhu


I noticed that 
"org.apache.spark.sql.catalyst.catalog.CatalogTable.lastAccessTime" is always 0 
in my environment. It looks like Spark never updates this field in the metastore 
when reading a table. This is fine, considering that updating it on every table 
read would be expensive.

However, "Last Access" in "describe extended" always shows "Thu Jan 01 00:00:00 
UTC 1970", which is confusing. Can we show something else to indicate that it's 
not set?
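
A minimal sketch of the suggested display change, assuming the formatter simply
treats 0 as "not recorded" (illustrative, not the merged fix).
{code}
// Render lastAccessTime = 0 as "UNKNOWN" instead of the Unix epoch.
def formatLastAccess(lastAccessTimeMs: Long): String =
  if (lastAccessTimeMs <= 0) "UNKNOWN" else new java.util.Date(lastAccessTimeMs).toString
{code}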







[jira] [Resolved] (SPARK-28976) Use KeyLock to simplify MapOutputTracker.getStatuses

2019-09-04 Thread Shixiong Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-28976.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

> Use KeyLock to simplify MapOutputTracker.getStatuses
> 
>
> Key: SPARK-28976
> URL: https://issues.apache.org/jira/browse/SPARK-28976
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Assigned] (SPARK-28976) Use KeyLock to simplify MapOutputTracker.getStatuses

2019-09-04 Thread Shixiong Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-28976:


Assignee: Shixiong Zhu

> Use KeyLock to simplify MapOutputTracker.getStatuses
> 
>
> Key: SPARK-28976
> URL: https://issues.apache.org/jira/browse/SPARK-28976
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>







[jira] [Created] (SPARK-28976) Use KeyLock to simplify MapOutputTracker.getStatuses

2019-09-04 Thread Shixiong Zhu (Jira)
Shixiong Zhu created SPARK-28976:


 Summary: Use KeyLock to simplify MapOutputTracker.getStatuses
 Key: SPARK-28976
 URL: https://issues.apache.org/jira/browse/SPARK-28976
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Shixiong Zhu









[jira] [Resolved] (SPARK-3137) Use finer grained locking in TorrentBroadcast.readObject

2019-09-03 Thread Shixiong Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-3137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-3137.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

> Use finer grained locking in TorrentBroadcast.readObject
> 
>
> Key: SPARK-3137
> URL: https://issues.apache.org/jira/browse/SPARK-3137
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 3.0.0
>
>
> TorrentBroadcast.readObject uses a global lock so only one task can be 
> fetching the blocks at the same time.
> This is not optimal if we are running multiple stages concurrently because 
> they should be able to independently fetch their own blocks.
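
A hedged sketch of the per-key locking pattern (the idea behind KeyLock in
SPARK-28976), assuming a simple map of monitor objects; it illustrates the pattern,
not Spark's actual KeyLock implementation.
{code}
import java.util.concurrent.ConcurrentHashMap

// Per-key locking: callers working on different keys (e.g. different broadcast ids)
// proceed concurrently, while callers on the same key still serialize.
class PerKeyLock[K] {
  private val locks = new ConcurrentHashMap[K, Object]()

  def withLock[T](key: K)(body: => T): T = {
    val lock = locks.computeIfAbsent(key, _ => new Object)
    lock.synchronized { body }
  }
}

// Usage sketch: perKeyLock.withLock(broadcastId) { fetchBlocks(broadcastId) }
// (A real implementation would also evict unused lock entries.)
{code}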






[jira] [Commented] (SPARK-28883) Fix a flaky test: ThriftServerQueryTestSuite

2019-09-03 Thread Shixiong Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16921413#comment-16921413
 ] 

Shixiong Zhu commented on SPARK-28883:
--

Marked this as a 3.0.0 blocker. We should either fix the issue and re-enable 
the test or just revert the changes before 3.0.0.

> Fix a flaky test: ThriftServerQueryTestSuite
> 
>
> Key: SPARK-28883
> URL: https://issues.apache.org/jira/browse/SPARK-28883
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Blocker
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109764/testReport/org.apache.spark.sql.hive.thriftserver/ThriftServerQueryTestSuite/
>  (2 failures)
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109768/testReport/org.apache.spark.sql.hive.thriftserver/ThriftServerQueryTestSuite/
>  (4 failures)
> Error message:
> {noformat}
> java.sql.SQLException: Could not open client transport with JDBC Uri: 
> jdbc:hive2://localhost:14431: java.net.ConnectException: Connection refused 
> (Connection refused)
> {noformat}






[jira] [Updated] (SPARK-28883) Fix a flaky test: ThriftServerQueryTestSuite

2019-09-03 Thread Shixiong Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-28883:
-
Priority: Blocker  (was: Major)

> Fix a flaky test: ThriftServerQueryTestSuite
> 
>
> Key: SPARK-28883
> URL: https://issues.apache.org/jira/browse/SPARK-28883
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Blocker
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109764/testReport/org.apache.spark.sql.hive.thriftserver/ThriftServerQueryTestSuite/
>  (2 failures)
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109768/testReport/org.apache.spark.sql.hive.thriftserver/ThriftServerQueryTestSuite/
>  (4 failures)
> Error message:
> {noformat}
> java.sql.SQLException: Could not open client transport with JDBC Uri: 
> jdbc:hive2://localhost:14431: java.net.ConnectException: Connection refused 
> (Connection refused)
> {noformat}






[jira] [Updated] (SPARK-28883) Fix a flaky test: ThriftServerQueryTestSuite

2019-09-03 Thread Shixiong Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-28883:
-
Target Version/s: 3.0.0

> Fix a flaky test: ThriftServerQueryTestSuite
> 
>
> Key: SPARK-28883
> URL: https://issues.apache.org/jira/browse/SPARK-28883
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Blocker
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109764/testReport/org.apache.spark.sql.hive.thriftserver/ThriftServerQueryTestSuite/
>  (2 failures)
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109768/testReport/org.apache.spark.sql.hive.thriftserver/ThriftServerQueryTestSuite/
>  (4 failures)
> Error message:
> {noformat}
> java.sql.SQLException: Could not open client transport with JDBC Uri: 
> jdbc:hive2://localhost:14431: java.net.ConnectException: Connection refused 
> (Connection refused)
> {noformat}






[jira] [Reopened] (SPARK-3137) Use finer grained locking in TorrentBroadcast.readObject

2019-08-28 Thread Shixiong Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-3137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reopened SPARK-3137:
-
  Assignee: Shixiong Zhu

> Use finer grained locking in TorrentBroadcast.readObject
> 
>
> Key: SPARK-3137
> URL: https://issues.apache.org/jira/browse/SPARK-3137
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Shixiong Zhu
>Priority: Major
>  Labels: bulk-closed
>
> TorrentBroadcast.readObject uses a global lock so only one task can be 
> fetching the blocks at the same time.
> This is not optimal if we are running multiple stages concurrently because 
> they should be able to independently fetch their own blocks.






[jira] [Assigned] (SPARK-28025) HDFSBackedStateStoreProvider should not leak .crc files

2019-08-22 Thread Shixiong Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-28025:


Assignee: Jungtaek Lim

> HDFSBackedStateStoreProvider should not leak .crc files 
> 
>
> Key: SPARK-28025
> URL: https://issues.apache.org/jira/browse/SPARK-28025
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
> Environment: Spark 2.4.3
> Kubernetes 1.11(?) (OpenShift)
> StateStore storage on a mounted PVC. Viewed as a local filesystem by the 
> `FileContextBasedCheckpointFileManager` : 
> {noformat}
> scala> glusterfm.isLocal
> res17: Boolean = true{noformat}
>Reporter: Gerard Maas
>Assignee: Jungtaek Lim
>Priority: Major
>
> The HDFSBackedStateStoreProvider when using the default CheckpointFileManager 
> is leaving '.crc' files behind. There's a .crc file created for each 
> `atomicFile` operation of the CheckpointFileManager.
> Over time, the number of files becomes very large. It makes the state store 
> file system constantly increase in size and, in our case, deteriorates the 
> file system performance.
> Here's a sample of one of our spark storage volumes after 2 days of execution 
> (4 stateful streaming jobs, each on a different sub-dir):
>  # 
> {noformat}
> Total files in PVC (used for checkpoints and state store)
> $find . | wc -l
> 431796
> # .crc files
> $find . -name "*.crc" | wc -l
> 418053{noformat}
> With each .crc file taking one storage block, the used storage runs into the 
> GBs of data.
> These jobs are running on Kubernetes. Our shared storage provider, GlusterFS, 
> shows serious performance deterioration with this large number of files:
> {noformat}
> DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms{noformat}
>  
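
A hedged sketch of the cleanup idea, assuming the fix is to delete the hidden
".<name>.crc" sibling that the checksummed local file system leaves behind
(illustrative only, not the merged patch).
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// On a ChecksumFileSystem (e.g. the local file system), writing "<name>" also creates
// a hidden ".<name>.crc" file. After the checkpoint manager renames its temp file into
// place, the temp file's .crc sibling is orphaned; a best-effort delete stops the build-up.
def deleteCrcSibling(file: Path, hadoopConf: Configuration): Unit = {
  val crc = new Path(file.getParent, s".${file.getName}.crc")
  val fs = crc.getFileSystem(hadoopConf)
  if (fs.exists(crc)) fs.delete(crc, false)
}
{code}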






[jira] [Resolved] (SPARK-28025) HDFSBackedStateStoreProvider should not leak .crc files

2019-08-22 Thread Shixiong Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-28025.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

> HDFSBackedStateStoreProvider should not leak .crc files 
> 
>
> Key: SPARK-28025
> URL: https://issues.apache.org/jira/browse/SPARK-28025
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
> Environment: Spark 2.4.3
> Kubernetes 1.11(?) (OpenShift)
> StateStore storage on a mounted PVC. Viewed as a local filesystem by the 
> `FileContextBasedCheckpointFileManager` : 
> {noformat}
> scala> glusterfm.isLocal
> res17: Boolean = true{noformat}
>Reporter: Gerard Maas
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0
>
>
> The HDFSBackedStateStoreProvider when using the default CheckpointFileManager 
> is leaving '.crc' files behind. There's a .crc file created for each 
> `atomicFile` operation of the CheckpointFileManager.
> Over time, the number of files becomes very large. It makes the state store 
> file system constantly increase in size and, in our case, deteriorates the 
> file system performance.
> Here's a sample of one of our spark storage volumes after 2 days of execution 
> (4 stateful streaming jobs, each on a different sub-dir):
>  # 
> {noformat}
> Total files in PVC (used for checkpoints and state store)
> $find . | wc -l
> 431796
> # .crc files
> $find . -name "*.crc" | wc -l
> 418053{noformat}
> With each .crc file taking one storage block, the used storage runs into the 
> GBs of data.
> These jobs are running on Kubernetes. Our shared storage provider, GlusterFS, 
> shows serious performance deterioration with this large number of files:
> {noformat}
> DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms{noformat}
>  






[jira] [Comment Edited] (SPARK-28841) Spark cannot read a relative path containing ":"

2019-08-21 Thread Shixiong Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16912719#comment-16912719
 ] 

Shixiong Zhu edited comment on SPARK-28841 at 8/21/19 10:35 PM:


Okay, at least the following code is legit but failing.
{code}
scala> :paste
// Entering paste mode (ctrl-D to finish)

import sys.process._
"touch /tmp/test/foo:bar".!!
spark.read.text("/tmp/test/foo?bar")

// Exiting paste mode, now interpreting.

java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path 
in absolute URI: foo:bar
  at org.apache.hadoop.fs.Path.initialize(Path.java:205)
  at org.apache.hadoop.fs.Path.(Path.java:171)
  at org.apache.hadoop.fs.Path.(Path.java:93)
  at org.apache.hadoop.fs.Globber.glob(Globber.java:241)
  at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1657)
  at org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:245)
  at 
org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:255)
  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:549)
  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545)
  at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.immutable.List.flatMap(List.scala:355)
  at 
org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:545)
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:359)
  at 
org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
  at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:711)
  at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:683)
  ... 57 elided
Caused by: java.net.URISyntaxException: Relative path in absolute URI: foo:bar
  at java.net.URI.checkPath(URI.java:1823)
  at java.net.URI.(URI.java:745)
  at org.apache.hadoop.fs.Path.initialize(Path.java:202)
  ... 76 more
{code}


was (Author: zsxwing):
Okey, at least the following codes are legit but failing.
{code}
scala> :paste
// Entering paste mode (ctrl-D to finish)

import org.apache.hadoop.fs._
val fs = new Path("/").getFileSystem(spark.sessionState.newHadoopConf())
fs.create(new Path("/tmp/test/foo:bar")).close()

// Exiting paste mode, now interpreting.

java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path 
in absolute URI: .foo:bar.crc
  at org.apache.hadoop.fs.Path.initialize(Path.java:205)
  at org.apache.hadoop.fs.Path.(Path.java:171)
  at org.apache.hadoop.fs.Path.(Path.java:93)
  at 
org.apache.hadoop.fs.ChecksumFileSystem.getChecksumFile(ChecksumFileSystem.java:90)
  at 
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.(ChecksumFileSystem.java:402)
  at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:461)
  at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:440)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:789)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:778)
  ... 59 elided
Caused by: java.net.URISyntaxException: Relative path in absolute URI: 
.foo:bar.crc
  at java.net.URI.checkPath(URI.java:1823)
  at java.net.URI.(URI.java:745)
  at org.apache.hadoop.fs.Path.initialize(Path.java:202)
  ... 69 more
{code}

> Spark cannot read a relative path containing ":"
> 
>
> Key: SPARK-28841
> URL: https://issues.apache.org/jira/browse/SPARK-28841
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Shixiong Zhu
>Priority: Major
>
> Reproducer:
> {code}
> spark.read.parquet("test:test")
> {code}
> Error:
> {code}
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: test:test
> {code}
> This is actually a Hadoop issue since the error is thrown from "new 
> Path("test:test")". I'm creating this ticket to see if we can work around 
> this issue in Spark.




[jira] [Commented] (SPARK-28841) Spark cannot read a relative path containing ":"

2019-08-21 Thread Shixiong Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16912719#comment-16912719
 ] 

Shixiong Zhu commented on SPARK-28841:
--

Okay, at least the following code is legit but failing.
{code}
scala> :paste
// Entering paste mode (ctrl-D to finish)

import org.apache.hadoop.fs._
val fs = new Path("/").getFileSystem(spark.sessionState.newHadoopConf())
fs.create(new Path("/tmp/test/foo:bar")).close()
spark.read.text("/tmp/test/foo?bar")

// Exiting paste mode, now interpreting.

java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path 
in absolute URI: .foo:bar.crc
  at org.apache.hadoop.fs.Path.initialize(Path.java:205)
  at org.apache.hadoop.fs.Path.(Path.java:171)
  at org.apache.hadoop.fs.Path.(Path.java:93)
  at 
org.apache.hadoop.fs.ChecksumFileSystem.getChecksumFile(ChecksumFileSystem.java:90)
  at 
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.(ChecksumFileSystem.java:402)
  at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:461)
  at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:440)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:789)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:778)
  ... 59 elided
Caused by: java.net.URISyntaxException: Relative path in absolute URI: 
.foo:bar.crc
  at java.net.URI.checkPath(URI.java:1823)
  at java.net.URI.(URI.java:745)
  at org.apache.hadoop.fs.Path.initialize(Path.java:202)
  ... 69 more
{code}

> Spark cannot read a relative path containing ":"
> 
>
> Key: SPARK-28841
> URL: https://issues.apache.org/jira/browse/SPARK-28841
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Shixiong Zhu
>Priority: Major
>
> Reproducer:
> {code}
> spark.read.parquet("test:test")
> {code}
> Error:
> {code}
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: test:test
> {code}
> This is actually a Hadoop issue since the error is thrown from "new 
> Path("test:test")". I'm creating this ticket to see if we can work around 
> this issue in Spark.






[jira] [Comment Edited] (SPARK-28841) Spark cannot read a relative path containing ":"

2019-08-21 Thread Shixiong Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16912719#comment-16912719
 ] 

Shixiong Zhu edited comment on SPARK-28841 at 8/21/19 10:00 PM:


Okay, at least the following code is legit but failing.
{code}
scala> :paste
// Entering paste mode (ctrl-D to finish)

import org.apache.hadoop.fs._
val fs = new Path("/").getFileSystem(spark.sessionState.newHadoopConf())
fs.create(new Path("/tmp/test/foo:bar")).close()

// Exiting paste mode, now interpreting.

java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path 
in absolute URI: .foo:bar.crc
  at org.apache.hadoop.fs.Path.initialize(Path.java:205)
  at org.apache.hadoop.fs.Path.(Path.java:171)
  at org.apache.hadoop.fs.Path.(Path.java:93)
  at 
org.apache.hadoop.fs.ChecksumFileSystem.getChecksumFile(ChecksumFileSystem.java:90)
  at 
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.(ChecksumFileSystem.java:402)
  at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:461)
  at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:440)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:789)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:778)
  ... 59 elided
Caused by: java.net.URISyntaxException: Relative path in absolute URI: 
.foo:bar.crc
  at java.net.URI.checkPath(URI.java:1823)
  at java.net.URI.(URI.java:745)
  at org.apache.hadoop.fs.Path.initialize(Path.java:202)
  ... 69 more
{code}


was (Author: zsxwing):
Okey, at least the following codes are legit but failing.
{code}
scala> :paste
// Entering paste mode (ctrl-D to finish)

import org.apache.hadoop.fs._
val fs = new Path("/").getFileSystem(spark.sessionState.newHadoopConf())
fs.create(new Path("/tmp/test/foo:bar")).close()
spark.read.text("/tmp/test/foo?bar")

// Exiting paste mode, now interpreting.

java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path 
in absolute URI: .foo:bar.crc
  at org.apache.hadoop.fs.Path.initialize(Path.java:205)
  at org.apache.hadoop.fs.Path.(Path.java:171)
  at org.apache.hadoop.fs.Path.(Path.java:93)
  at 
org.apache.hadoop.fs.ChecksumFileSystem.getChecksumFile(ChecksumFileSystem.java:90)
  at 
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.(ChecksumFileSystem.java:402)
  at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:461)
  at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:440)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:789)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:778)
  ... 59 elided
Caused by: java.net.URISyntaxException: Relative path in absolute URI: 
.foo:bar.crc
  at java.net.URI.checkPath(URI.java:1823)
  at java.net.URI.(URI.java:745)
  at org.apache.hadoop.fs.Path.initialize(Path.java:202)
  ... 69 more
{code}

> Spark cannot read a relative path containing ":"
> 
>
> Key: SPARK-28841
> URL: https://issues.apache.org/jira/browse/SPARK-28841
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Shixiong Zhu
>Priority: Major
>
> Reproducer:
> {code}
> spark.read.parquet("test:test")
> {code}
> Error:
> {code}
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: test:test
> {code}
> This is actually a Hadoop issue since the error is thrown from "new 
> Path("test:test")". I'm creating this ticket to see if we can work around 
> this issue in Spark.






[jira] [Created] (SPARK-28841) Spark cannot read a relative path containing ":"

2019-08-21 Thread Shixiong Zhu (Jira)
Shixiong Zhu created SPARK-28841:


 Summary: Spark cannot read a relative path containing ":"
 Key: SPARK-28841
 URL: https://issues.apache.org/jira/browse/SPARK-28841
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.3
Reporter: Shixiong Zhu


Reproducer:
{code}
spark.read.parquet("test:test")
{code}

Error:
{code}
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path 
in absolute URI: test:test
{code}

This is actually a Hadoop issue since the error is thrown from "new 
Path("test:test")". I'm creating this ticket to see if we can work around this 
issue in Spark.
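
As noted above, the failure can be reproduced directly against Hadoop's Path, with
no Spark involved:
{code}
import org.apache.hadoop.fs.Path

// Hadoop's Path treats "test" as a URI scheme and rejects the relative remainder.
// Throws: java.lang.IllegalArgumentException: java.net.URISyntaxException:
//         Relative path in absolute URI: test:test
val p = new Path("test:test")
{code}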






[jira] [Commented] (SPARK-28841) Spark cannot read a relative path containing ":"

2019-08-21 Thread Shixiong Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16912572#comment-16912572
 ] 

Shixiong Zhu commented on SPARK-28841:
--

On second thought, this is probably a limitation of the "Path" API since it 
cannot tell whether "test" is a scheme or just part of a relative path.

> Spark cannot read a relative path containing ":"
> 
>
> Key: SPARK-28841
> URL: https://issues.apache.org/jira/browse/SPARK-28841
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Shixiong Zhu
>Priority: Major
>
> Reproducer:
> {code}
> spark.read.parquet("test:test")
> {code}
> Error:
> {code}
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: test:test
> {code}
> This is actually a Hadoop issue since the error is thrown from "new 
> Path("test:test")". I'm creating this ticket to see if we can work around 
> this issue in Spark.






[jira] [Comment Edited] (SPARK-28841) Spark cannot read a relative path containing ":"

2019-08-21 Thread Shixiong Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16912572#comment-16912572
 ] 

Shixiong Zhu edited comment on SPARK-28841 at 8/21/19 6:04 PM:
---

On second thought, this is probably a limitation of the "Path" API since it 
cannot tell whether "test" is a scheme or just part of a relative path.


was (Author: zsxwing):
In a second thought, this is probably a limitation of "Path" API since it 
cannot tell whether "test" is a schema or just a part of a relate path 

> Spark cannot read a relative path containing ":"
> 
>
> Key: SPARK-28841
> URL: https://issues.apache.org/jira/browse/SPARK-28841
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Shixiong Zhu
>Priority: Major
>
> Reproducer:
> {code}
> spark.read.parquet("test:test")
> {code}
> Error:
> {code}
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: test:test
> {code}
> This is actually a Hadoop issue since the error is thrown from "new 
> Path("test:test")". I'm creating this ticket to see if we can work around 
> this issue in Spark.






[jira] [Commented] (SPARK-28605) Performance regression in SS's foreach

2019-08-20 Thread Shixiong Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16911635#comment-16911635
 ] 

Shixiong Zhu commented on SPARK-28605:
--

[~kabhwan] Thanks for pointing it out. Yes, we can close this ticket.

> Performance regression in SS's foreach
> --
>
> Key: SPARK-28605
> URL: https://issues.apache.org/jira/browse/SPARK-28605
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3
>Reporter: Shixiong Zhu
>Priority: Major
>  Labels: regresssion
>
> When "ForeachWriter.open" return "false", ForeachSink v1 will skip the whole 
> partition without reading data. But in ForeachSink v2, due to the API 
> limitation, it needs to read the whole partition even if all data just gets 
> dropped.
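
For context, here is a sketch of a ForeachWriter whose open() returns false for some
partitions; the skip condition is only an illustrative placeholder.
{code}
import org.apache.spark.sql.ForeachWriter

// With the v1 sink, a partition whose open() returns false is skipped without being
// read; with the v2 sink the partition is still read and then thrown away.
class SkippingWriter extends ForeachWriter[String] {
  override def open(partitionId: Long, epochId: Long): Boolean =
    !alreadyCommitted(partitionId, epochId) // false => skip this partition

  override def process(value: String): Unit = {
    // write the record to the external system
  }

  override def close(errorOrNull: Throwable): Unit = {
    // flush / commit, or roll back if errorOrNull != null
  }

  private def alreadyCommitted(partitionId: Long, epochId: Long): Boolean = false // placeholder
}
{code}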






[jira] [Resolved] (SPARK-28605) Performance regression in SS's foreach

2019-08-20 Thread Shixiong Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-28605.
--
Resolution: Invalid

> Performance regression in SS's foreach
> --
>
> Key: SPARK-28605
> URL: https://issues.apache.org/jira/browse/SPARK-28605
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3
>Reporter: Shixiong Zhu
>Priority: Major
>  Labels: regresssion
>
> When "ForeachWriter.open" return "false", ForeachSink v1 will skip the whole 
> partition without reading data. But in ForeachSink v2, due to the API 
> limitation, it needs to read the whole partition even if all data just gets 
> dropped.






[jira] [Resolved] (SPARK-28650) Fix the guarantee of ForeachWriter

2019-08-20 Thread Shixiong Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-28650.
--
Fix Version/s: 2.4.5
 Assignee: Jungtaek Lim
   Resolution: Fixed

> Fix the guarantee of ForeachWriter
> --
>
> Key: SPARK-28650
> URL: https://issues.apache.org/jira/browse/SPARK-28650
> Project: Spark
>  Issue Type: Documentation
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: Shixiong Zhu
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 2.4.5
>
>
> Right now ForeachWriter has the following guarantee:
> {code}
> If the streaming query is being executed in the micro-batch mode, then every 
> partition
> represented by a unique tuple (partitionId, epochId) is guaranteed to have 
> the same data.
> Hence, (partitionId, epochId) can be used to deduplicate and/or 
> transactionally commit data
> and achieve exactly-once guarantees.
> {code}
>  
> But we can actually break this guarantee easily when restarting a query and a 
> batch is re-run (e.g., after upgrading Spark):
>  * The source returns a different DataFrame that has a different number of 
> partitions (e.g., we start to not create empty partitions in Kafka Source V2).
>  * A newly added optimization rule may change the number of partitions in the 
> new run.
>  * The file split size changes in the new run.
> Since we cannot guarantee that the same (partitionId, epochId) has the same 
> data, we should update the documentation for "ForeachWriter".
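
A sketch of the deduplication pattern that the quoted guarantee is meant to enable;
the in-memory "committed" set stands in for an external transactional store and is
purely illustrative.
{code}
import org.apache.spark.sql.ForeachWriter
import scala.collection.mutable

// Buffer a partition's rows and commit them under the key (partitionId, epochId),
// skipping keys that were already committed. The pattern only holds if the same
// (partitionId, epochId) always carries the same data, which is exactly the
// guarantee this ticket says can be broken.
class IdempotentWriter extends ForeachWriter[String] {
  private val buffer = mutable.ArrayBuffer.empty[String]
  private var key: (Long, Long) = _

  override def open(partitionId: Long, epochId: Long): Boolean = {
    key = (partitionId, epochId)
    !IdempotentWriter.committed.contains(key) // returning false skips an already-committed batch
  }

  override def process(value: String): Unit = buffer += value

  override def close(errorOrNull: Throwable): Unit = {
    if (errorOrNull == null) {
      // "transactionally" write the buffer and record the key as committed
      IdempotentWriter.committed.add(key)
    }
  }
}

object IdempotentWriter {
  val committed: mutable.Set[(Long, Long)] = mutable.Set.empty
}
{code}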






[jira] [Commented] (SPARK-28650) Fix the guarantee of ForeachWriter

2019-08-09 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16904085#comment-16904085
 ] 

Shixiong Zhu commented on SPARK-28650:
--

Go ahead. I'm not working on this. As for the signature of "open", I don't think 
it's worth changing since that would break binary compatibility.

> Fix the guarantee of ForeachWriter
> --
>
> Key: SPARK-28650
> URL: https://issues.apache.org/jira/browse/SPARK-28650
> Project: Spark
>  Issue Type: Documentation
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: Shixiong Zhu
>Priority: Major
>
> Right now ForeachWriter has the following guarantee:
> {code}
> If the streaming query is being executed in the micro-batch mode, then every 
> partition
> represented by a unique tuple (partitionId, epochId) is guaranteed to have 
> the same data.
> Hence, (partitionId, epochId) can be used to deduplicate and/or 
> transactionally commit data
> and achieve exactly-once guarantees.
> {code}
>  
> But we can actually break this guarantee easily when restarting a query and a 
> batch is re-run (e.g., after upgrading Spark):
>  * The source returns a different DataFrame that has a different number of 
> partitions (e.g., we start to not create empty partitions in Kafka Source V2).
>  * A newly added optimization rule may change the number of partitions in the 
> new run.
>  * The file split size changes in the new run.
> Since we cannot guarantee that the same (partitionId, epochId) has the same 
> data, we should update the documentation for "ForeachWriter".






[jira] [Updated] (SPARK-28651) Streaming file source doesn't change the schema to nullable automatically

2019-08-09 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-28651:
-
Docs Text: Since Spark 3.0.0, all fields of Structured Streaming's file source 
schema are forced to be nullable. This protects users from corruption when the 
specified or inferred schema is not compatible with the actual data. If you would 
like the original behavior, you can set the SQL conf 
"spark.sql.streaming.fileSource.schema.forceNullable" to "false". This flag was 
added to reduce the migration work when upgrading to Spark 3.0.0 and will be 
removed in the future. Please update your code to work with the new behavior as 
soon as possible.
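
For reference, opting back into the pre-3.0.0 behavior described above is a one-line
conf change:
{code}
// Restore the original behavior (the flag is a temporary migration aid and may be removed later).
spark.conf.set("spark.sql.streaming.fileSource.schema.forceNullable", "false")
{code}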

> Streaming file source doesn't change the schema to nullable automatically
> -
>
> Key: SPARK-28651
> URL: https://issues.apache.org/jira/browse/SPARK-28651
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: Tomasz Magdanski
>Assignee: Shixiong Zhu
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> Right now, batch DataFrame always changes the schema to nullable 
> automatically (See this line: 
> https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).
> However, streaming DataFrame's schema is read in this line 
> https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
>  which doesn't change the schema to nullable automatically.
> We should make streaming DataFrame consistent with batch.
> It can cause corrupted parquet files due to the schema mismatch.
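
A minimal sketch of the "force nullable" idea, covering only top-level fields for
brevity (the real change also handles nested fields):
{code}
import org.apache.spark.sql.types.{StructField, StructType}

// Rewrite every top-level field of a user-specified schema as nullable, mirroring
// what the batch path effectively does.
def forceNullable(schema: StructType): StructType =
  StructType(schema.fields.map(f => f.copy(nullable = true)))
{code}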






[jira] [Assigned] (SPARK-28651) Streaming file source doesn't change the schema to nullable automatically

2019-08-07 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-28651:


Assignee: Shixiong Zhu

> Streaming file source doesn't change the schema to nullable automatically
> -
>
> Key: SPARK-28651
> URL: https://issues.apache.org/jira/browse/SPARK-28651
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: Tomasz
>Assignee: Shixiong Zhu
>Priority: Major
>  Labels: release-notes
>
> Right now, batch DataFrame always changes the schema to nullable 
> automatically (See this line: 
> https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).
> However, streaming DataFrame's schema is read in this line 
> https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
>  which doesn't change the schema to nullable automatically.
> We should make streaming DataFrame consistent with batch.
> It can cause corrupted parquet files due to the schema mismatch.






[jira] [Updated] (SPARK-28651) Streaming file source doesn't change the schema to nullable automatically

2019-08-07 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-28651:
-
Labels: release-notes  (was: )

> Streaming file source doesn't change the schema to nullable automatically
> -
>
> Key: SPARK-28651
> URL: https://issues.apache.org/jira/browse/SPARK-28651
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: Tomasz
>Priority: Major
>  Labels: release-notes
>
> Right now, batch DataFrame always changes the schema to nullable 
> automatically (See this line: 
> https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).
> However, streaming DataFrame's schema is read in this line 
> https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
>  which doesn't change the schema to nullable automatically.
> We should make streaming DataFrame consistent with batch.
> It can cause corrupted parquet files due to the schema mismatch.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28651) Streaming file source doesn't change the schema to nullable automatically

2019-08-07 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-28651:
-
Description: 
Right now, batch DataFrame always changes the schema to nullable automatically 
(See this line: 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).

However, streaming DataFrame's schema is read in this line 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
 which doesn't change the schema to nullable automatically.

We should make streaming DataFrame consistent with batch.

It can cause corrupted parquet files due to the schema mismatch.

  was:
Right now, batch DataFrame always changes the schema to nullable automatically 
(See this line: 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).

However, streaming DataFrame's schema is read in this line 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
 which doesn't change the schema to nullable automatically.

We should make streaming DataFrame consistent with batch.

Tcaused corrupted parquet files due to the schema mismatch.


> Streaming file source doesn't change the schema to nullable automatically
> -
>
> Key: SPARK-28651
> URL: https://issues.apache.org/jira/browse/SPARK-28651
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: Tomasz
>Priority: Major
>
> Right now, batch DataFrame always changes the schema to nullable 
> automatically (See this line: 
> https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).
> However, streaming DataFrame's schema is read in this line 
> https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
>  which doesn't change the schema to nullable automatically.
> We should make streaming DataFrame consistent with batch.
> It can cause corrupted parquet files due to the schema mismatch.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28651) Streaming file source doesn't change the schema to nullable automatically

2019-08-07 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-28651:
-
Description: 
Right now, batch DataFrame always changes the schema to nullable automatically 
(See this line: 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).

However, streaming DataFrame's schema is read in this line 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
 which doesn't change the schema to nullable automatically.

We should make streaming DataFrame consistent with batch.

Tcaused corrupted parquet files due to the schema mismatch.

  was:
Right now, batch DataFrame always changes the schema to nullable automatically 
(See this line: 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).

However, streaming DataFrame's schema is read in this line 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
 which doesn't change the schema to nullable automatically.

We should make streaming DataFrame consistent with batch.

This issue was reported by Tomasz Magdanski. He found it caused corrupted 
parquet files due to the schema mismatch.


> Streaming file source doesn't change the schema to nullable automatically
> -
>
> Key: SPARK-28651
> URL: https://issues.apache.org/jira/browse/SPARK-28651
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: Tomasz
>Priority: Major
>
> Right now, batch DataFrame always changes the schema to nullable 
> automatically (See this line: 
> https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).
> However, streaming DataFrame's schema is read in this line 
> https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
>  which doesn't change the schema to nullable automatically.
> We should make streaming DataFrame consistent with batch.
> Tcaused corrupted parquet files due to the schema mismatch.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28651) Streaming file source doesn't change the schema to nullable automatically

2019-08-07 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-28651:
-
Reporter: Tomasz  (was: Shixiong Zhu)

> Streaming file source doesn't change the schema to nullable automatically
> -
>
> Key: SPARK-28651
> URL: https://issues.apache.org/jira/browse/SPARK-28651
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: Tomasz
>Priority: Major
>
> Right now, batch DataFrame always changes the schema to nullable 
> automatically (See this line: 
> https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).
> However, streaming DataFrame's schema is read in this line 
> https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
>  which doesn't change the schema to nullable automatically.
> We should make streaming DataFrame consistent with batch.
> This issue was reported by Tomasz Magdanski. He found it caused corrupted 
> parquet files due to the schema mismatch.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28651) Streaming file source doesn't change the schema to nullable automatically

2019-08-07 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-28651:
-
Description: 
Right now, batch DataFrame always changes the schema to nullable automatically 
(See this line: 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).

However, streaming DataFrame's schema is read in this line 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
 which doesn't change the schema to nullable automatically.

We should make streaming DataFrame consistent with batch.

This issue was reported by Tomasz Magdanski. He found it caused corrupted 
parquet files due to the schema mismatch.

  was:
Right now, batch DataFrame always changes the schema to nullable automatically 
(See this line: 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).

However, streaming DataFrame's schema is read in this line 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
 which doesn't change the schema to nullable automatically.

We should make streaming DataFrame consistent with batch.


> Streaming file source doesn't change the schema to nullable automatically
> -
>
> Key: SPARK-28651
> URL: https://issues.apache.org/jira/browse/SPARK-28651
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: Shixiong Zhu
>Priority: Major
>
> Right now, batch DataFrame always changes the schema to nullable 
> automatically (See this line: 
> https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).
> However, streaming DataFrame's schema is read in this line 
> https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
>  which doesn't change the schema to nullable automatically.
> We should make streaming DataFrame consistent with batch.
> This issue was reported by Tomasz Magdanski. He found it caused corrupted 
> parquet files due to the schema mismatch.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28651) Streaming file source doesn't change the schema to nullable automatically

2019-08-07 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-28651:
-
Description: 
Right now, batch DataFrame always changes the schema to nullable automatically 
(See this line: 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).

However, streaming DataFrame's schema is read in this line 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
 which doesn't change the schema to nullable automatically.

We should make streaming DataFrame consistent with batch.

  was:
Right now, batch DataFrame always changes the schema to nullable automatically 
(See this line: 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).

However, streaming DataFrame's schema is read in this line 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
 which doesn't change the schema to nullable automatically.

We should make streaming DataFrame consistent with batch.

This issue was rpo


> Streaming file source doesn't change the schema to nullable automatically
> -
>
> Key: SPARK-28651
> URL: https://issues.apache.org/jira/browse/SPARK-28651
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: Shixiong Zhu
>Priority: Major
>
> Right now, batch DataFrame always changes the schema to nullable 
> automatically (See this line: 
> https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).
> However, streaming DataFrame's schema is read in this line 
> https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
>  which doesn't change the schema to nullable automatically.
> We should make streaming DataFrame consistent with batch.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28651) Streaming file source doesn't change the schema to nullable automatically

2019-08-07 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-28651:
-
Description: 
Right now, batch DataFrame always changes the schema to nullable automatically 
(See this line: 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).

However, streaming DataFrame's schema is read in this line 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
 which doesn't change the schema to nullable automatically.

We should make streaming DataFrame consistent with batch.

This issue was rpo

  was:
Right now, batch DataFrame always changes the schema to nullable automatically 
(See this line: 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).

However, streaming DataFrame's schema is read in this line 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
 which doesn't change the schema to nullable automatically.

We should make streaming DataFrame consistent with batch.


> Streaming file source doesn't change the schema to nullable automatically
> -
>
> Key: SPARK-28651
> URL: https://issues.apache.org/jira/browse/SPARK-28651
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: Shixiong Zhu
>Priority: Major
>
> Right now, batch DataFrame always changes the schema to nullable 
> automatically (See this line: 
> https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).
> However, streaming DataFrame's schema is read in this line 
> https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
>  which doesn't change the schema to nullable automatically.
> We should make streaming DataFrame consistent with batch.
> This issue was rpo



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28651) Streaming file source doesn't change the schema to nullable automatically

2019-08-07 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-28651:


 Summary: Streaming file source doesn't change the schema to 
nullable automatically
 Key: SPARK-28651
 URL: https://issues.apache.org/jira/browse/SPARK-28651
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.3
Reporter: Shixiong Zhu


Right now, batch DataFrame always changes the schema to nullable automatically 
(See this line: 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).

However, streaming DataFrame's schema is read in this line 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
 which doesn't change the schema to nullable automatically.

We should make streaming DataFrame consistent with batch.
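As a hedged illustration of the inconsistency (the path, schema, and Spark shell 
session are assumptions, not part of this report), the batch reader relaxes a 
user-specified schema to nullable while the streaming reader keeps it as-is:

{code}
import org.apache.spark.sql.types._

// Hypothetical user-specified schema with a non-nullable field.
val userSchema = StructType(Seq(StructField("id", LongType, nullable = false)))

// Batch path: DataSource relaxes the schema, so "id" comes back nullable.
val batchSchema = spark.read.schema(userSchema).parquet("/tmp/data").schema

// Streaming path: the user-specified nullability is kept, so "id" stays non-nullable.
val streamSchema = spark.readStream.schema(userSchema).parquet("/tmp/data").schema

println(batchSchema("id").nullable)   // true
println(streamSchema("id").nullable)  // false before this change
{code}

Writing both DataFrames to the same parquet location can then produce files with 
mismatched schemas, which matches the corruption described above.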



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26152) Synchronize Worker Cleanup with Worker Shutdown

2019-08-07 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902477#comment-16902477
 ] 

Shixiong Zhu edited comment on SPARK-26152 at 8/7/19 9:07 PM:
--

[~ajithshetty] does your PR fix the flaky test? If I read it correctly, the 
failure is caused by an OOM. The RejectedExecutionException is just a side effect 
because the OOM triggers an unexpected executor shutdown. Should we try to figure 
out why the OOM happens? Maybe we have a memory leak.


was (Author: zsxwing):
[~ajithshetty] does your PR fix the flaky test? If I read it correctly, the 
failure is caused by an OOM. The RejectedExecutionException is just a side effect 
because the OOM triggers the executor shutdown. Should we try to figure out why 
the OOM happens? Maybe we have a memory leak.

> Synchronize Worker Cleanup with Worker Shutdown
> ---
>
> Key: SPARK-26152
> URL: https://issues.apache.org/jira/browse/SPARK-26152
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3
>Reporter: Dongjoon Hyun
>Assignee: Ajith S
>Priority: Critical
> Fix For: 2.4.4, 3.0.0
>
> Attachments: Screenshot from 2019-03-11 17-03-40.png
>
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5627
>  (2018-11-16)
> {code}
> BroadcastSuite:
> - Using TorrentBroadcast locally
> - Accessing TorrentBroadcast variables from multiple threads
> - Accessing TorrentBroadcast variables in a local cluster (encryption = off)
> java.util.concurrent.RejectedExecutionException: Task 
> scala.concurrent.impl.CallbackRunnable@59428a1 rejected from 
> java.util.concurrent.ThreadPoolExecutor@4096a677[Shutting down, pool size = 
> 1, active threads = 1, queued tasks = 0, completed tasks = 0]
>   at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
>   at 
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
>   at 
> java.util.concurrent.Executors$DelegatedExecutorService.execute(Executors.java:668)
>   at 
> scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:134)
>   at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:284)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:284)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:284)
>   at scala.concurrent.Promise.complete(Promise.scala:49)
>   at scala.concurrent.Promise.complete$(Promise.scala:48)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:183)
>   at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:29)
>   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
>   at 
> scala.concurrent.BatchingExecutor$Batch.processBatch$1(BatchingExecutor.scala:63)
>   at 
> scala.concurrent.BatchingExecutor$Batch.$anonfun$run$1(BatchingExecutor.scala:78)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
>   at 
> scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81)
>   at 
> scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:55)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:870)
>   at scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:106)
>   at 
> scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:103)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:868)
>   at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:284)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:284)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:284)
>   at scala.concurrent.Promise.complete(Promise.scala:49)
>   at scala.concurrent.Promise.complete$(Promise.scala:48)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:183)
>   at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:29)
>   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.Threa

[jira] [Commented] (SPARK-26152) Synchronize Worker Cleanup with Worker Shutdown

2019-08-07 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902477#comment-16902477
 ] 

Shixiong Zhu commented on SPARK-26152:
--

[~ajithshetty] does your PR fix the flaky test? If I read it correctly, the 
failure is caused by an OOM. The RejectedExecutionException is just a side effect 
because the OOM triggers the executor shutdown. Should we try to figure out why 
the OOM happens? Maybe we have a memory leak.

> Synchronize Worker Cleanup with Worker Shutdown
> ---
>
> Key: SPARK-26152
> URL: https://issues.apache.org/jira/browse/SPARK-26152
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3
>Reporter: Dongjoon Hyun
>Assignee: Ajith S
>Priority: Critical
> Fix For: 2.4.4, 3.0.0
>
> Attachments: Screenshot from 2019-03-11 17-03-40.png
>
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5627
>  (2018-11-16)
> {code}
> BroadcastSuite:
> - Using TorrentBroadcast locally
> - Accessing TorrentBroadcast variables from multiple threads
> - Accessing TorrentBroadcast variables in a local cluster (encryption = off)
> java.util.concurrent.RejectedExecutionException: Task 
> scala.concurrent.impl.CallbackRunnable@59428a1 rejected from 
> java.util.concurrent.ThreadPoolExecutor@4096a677[Shutting down, pool size = 
> 1, active threads = 1, queued tasks = 0, completed tasks = 0]
>   at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
>   at 
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
>   at 
> java.util.concurrent.Executors$DelegatedExecutorService.execute(Executors.java:668)
>   at 
> scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:134)
>   at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:284)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:284)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:284)
>   at scala.concurrent.Promise.complete(Promise.scala:49)
>   at scala.concurrent.Promise.complete$(Promise.scala:48)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:183)
>   at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:29)
>   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
>   at 
> scala.concurrent.BatchingExecutor$Batch.processBatch$1(BatchingExecutor.scala:63)
>   at 
> scala.concurrent.BatchingExecutor$Batch.$anonfun$run$1(BatchingExecutor.scala:78)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
>   at 
> scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81)
>   at 
> scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:55)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:870)
>   at scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:106)
>   at 
> scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:103)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:868)
>   at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:284)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:284)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:284)
>   at scala.concurrent.Promise.complete(Promise.scala:49)
>   at scala.concurrent.Promise.complete$(Promise.scala:48)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:183)
>   at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:29)
>   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> java.util.concurrent.RejectedExecutionException: Task 
> scala.concurrent.impl.CallbackRunnable@40a5bf17 rejected from 
> java.util.concurrent.ThreadPoolExecutor@5a73967[Shutting down, pool size = 1, 
> active threads = 1, queued tasks = 0, complete

[jira] [Created] (SPARK-28650) Fix the guarantee of ForeachWriter

2019-08-07 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-28650:


 Summary: Fix the guarantee of ForeachWriter
 Key: SPARK-28650
 URL: https://issues.apache.org/jira/browse/SPARK-28650
 Project: Spark
  Issue Type: Documentation
  Components: Structured Streaming
Affects Versions: 2.4.3
Reporter: Shixiong Zhu


Right now ForeachWriter has the following guarantee:

{code}

If the streaming query is being executed in the micro-batch mode, then every 
partition
represented by a unique tuple (partitionId, epochId) is guaranteed to have the 
same data.
Hence, (partitionId, epochId) can be used to deduplicate and/or transactionally 
commit data
and achieve exactly-once guarantees.

{code}

 

However, this guarantee can actually be broken easily when a query is restarted 
and a batch is re-run (e.g., after upgrading Spark):
 * The source may return a DataFrame with a different number of partitions 
(e.g., we start to not create empty partitions in Kafka Source V2).
 * A newly added optimization rule may change the number of partitions in the new 
run.
 * The file split size may change in the new run.

Since we cannot guarantee that the same (partitionId, epochId) always carries the 
same data, we should update the documentation for "ForeachWriter".
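A minimal sketch (class name and the commit/buffer logic are assumptions) of the 
kind of writer the current wording encourages. Once a restart changes the 
partitioning, the same (partitionId, epochId) key is no longer guaranteed to 
cover the same rows, so this deduplication scheme can silently lose or duplicate 
data:

{code}
import org.apache.spark.sql.ForeachWriter

class IdempotentWriter extends ForeachWriter[String] {
  private var batchKey: (Long, Long) = _

  override def open(partitionId: Long, epochId: Long): Boolean = {
    batchKey = (partitionId, epochId)
    // Hypothetical: return false to skip the partition if this key was already
    // committed to the sink.
    true
  }

  override def process(value: String): Unit = {
    // Hypothetical: buffer `value` under batchKey.
  }

  override def close(errorOrNull: Throwable): Unit = {
    // Hypothetical: atomically commit the buffered rows keyed by batchKey.
  }
}
{code}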



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28605) Performance regression in SS's foreach

2019-08-04 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899789#comment-16899789
 ] 

Shixiong Zhu commented on SPARK-28605:
--

By the way, this is not a critical regression. It's not easy to hit this issue.

> Performance regression in SS's foreach
> --
>
> Key: SPARK-28605
> URL: https://issues.apache.org/jira/browse/SPARK-28605
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3
>Reporter: Shixiong Zhu
>Priority: Major
>  Labels: regresssion
>
> When "ForeachWriter.open" return "false", ForeachSink v1 will skip the whole 
> partition without reading data. But in ForeachSink v2, due to the API 
> limitation, it needs to read the whole partition even if all data just gets 
> dropped.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28605) Performance regression in SS's foreach

2019-08-04 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899787#comment-16899787
 ] 

Shixiong Zhu commented on SPARK-28605:
--

This is a regression in all 2.4.x releases. It was caused by SPARK-23099.

> Performance regression in SS's foreach
> --
>
> Key: SPARK-28605
> URL: https://issues.apache.org/jira/browse/SPARK-28605
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3
>Reporter: Shixiong Zhu
>Priority: Major
>  Labels: regresssion
>
> When "ForeachWriter.open" return "false", ForeachSink v1 will skip the whole 
> partition without reading data. But in ForeachSink v2, due to the API 
> limitation, it needs to read the whole partition even if all data just gets 
> dropped.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28605) Performance regression in SS's foreach

2019-08-04 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-28605:
-
Affects Version/s: 2.4.0
   2.4.1
   2.4.2

> Performance regression in SS's foreach
> --
>
> Key: SPARK-28605
> URL: https://issues.apache.org/jira/browse/SPARK-28605
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3
>Reporter: Shixiong Zhu
>Priority: Major
>  Labels: regresssion
>
> When "ForeachWriter.open" return "false", ForeachSink v1 will skip the whole 
> partition without reading data. But in ForeachSink v2, due to the API 
> limitation, it needs to read the whole partition even if all data just gets 
> dropped.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28574) Allow to config different sizes for event queues

2019-08-02 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-28574.
--
Resolution: Fixed

> Allow to config different sizes for event queues
> 
>
> Key: SPARK-28574
> URL: https://issues.apache.org/jira/browse/SPARK-28574
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Yun Zou
>Assignee: Yun Zou
>Priority: Major
> Fix For: 3.0.0
>
>
> Right now all event queues share the same size config. We should allow the 
> user to set a size for a specific event queue.
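A hedged sketch of what per-queue sizing looks like after this change (the 
per-queue config keys and queue names below follow the Spark 3.0 documentation; 
treat them as illustrative and verify against your build):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Shared default that every listener event queue was previously limited to.
  .set("spark.scheduler.listenerbus.eventqueue.capacity", "10000")
  // Per-queue override enabled by this change; "appStatus" is one of the
  // built-in queue names (others include executorManagement, eventLog, streams).
  .set("spark.scheduler.listenerbus.eventqueue.appStatus.capacity", "20000")
{code}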



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28574) Allow to config different sizes for event queues

2019-08-02 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-28574:


Assignee: Yun Zou

> Allow to config different sizes for event queues
> 
>
> Key: SPARK-28574
> URL: https://issues.apache.org/jira/browse/SPARK-28574
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Yun Zou
>Assignee: Yun Zou
>Priority: Major
> Fix For: 3.0.0
>
>
> Right now all event queues share the same size config. We should allow the 
> user to set a size for a specific event queue.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28605) Performance regression in SS's foreach

2019-08-02 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-28605:
-
Labels: regresssion  (was: )

> Performance regression in SS's foreach
> --
>
> Key: SPARK-28605
> URL: https://issues.apache.org/jira/browse/SPARK-28605
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: Shixiong Zhu
>Priority: Major
>  Labels: regresssion
>
> When "ForeachWriter.open" return "false", ForeachSink v1 will skip the whole 
> partition without reading data. But in ForeachSink v2, due to the API 
> limitation, it needs to read the whole partition even if all data just gets 
> dropped.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28605) Performance regression in SS's foreach

2019-08-02 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-28605:


 Summary: Performance regression in SS's foreach
 Key: SPARK-28605
 URL: https://issues.apache.org/jira/browse/SPARK-28605
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.3
Reporter: Shixiong Zhu


When "ForeachWriter.open" return "false", ForeachSink v1 will skip the whole 
partition without reading data. But in ForeachSink v2, due to the API 
limitation, it needs to read the whole partition even if all data just gets 
dropped.
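A minimal sketch (writer class name is an assumption) of the pattern that hits 
this regression; attached via df.writeStream.foreach(...), the v1 sink never 
reads a partition once open() returns false, while the v2 path still scans it:

{code}
import org.apache.spark.sql.ForeachWriter

class DropAllWriter extends ForeachWriter[String] {
  // Returning false tells the sink to skip this (partitionId, epochId) entirely.
  override def open(partitionId: Long, epochId: Long): Boolean = false

  // Never invoked when open() returns false.
  override def process(value: String): Unit = ()

  override def close(errorOrNull: Throwable): Unit = ()
}
{code}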



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28556) Error should also be sent to QueryExecutionListener.onFailure

2019-07-29 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-28556:
-
Docs Text: In Spark 3.0, the type of "error" parameter in the 
"org.apache.spark.sql.util.QueryExecutionListener.onFailure" method is changed 
to "java.lang.Throwable" from "java.lang.Exception" to accept more types of 
failures such as "java.lang.Error" and its subclasses.  (was: In Spark 3.0, the 
type of "error" parameter in the 
"org.apache.spark.sql.util.QueryExecutionListener.onFailure" method is changed 
to "Throwable" from "Exception" to accept more types of failures such as 
java.lang.Error and its subclasses.)

> Error should also be sent to QueryExecutionListener.onFailure
> -
>
> Key: SPARK-28556
> URL: https://issues.apache.org/jira/browse/SPARK-28556
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> Right now Error is not sent to QueryExecutionListener.onFailure. If there is 
> any Error when running a query, QueryExecutionListener.onFailure cannot be 
> triggered.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28556) Error should also be sent to QueryExecutionListener.onFailure

2019-07-29 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-28556:
-
Docs Text: In Spark 3.0, the type of "error" parameter in the 
"org.apache.spark.sql.util.QueryExecutionListener.onFailure" method is changed 
to "Throwable" from "Exception" to accept more types of failures such as 
java.lang.Error and its subclasses.

> Error should also be sent to QueryExecutionListener.onFailure
> -
>
> Key: SPARK-28556
> URL: https://issues.apache.org/jira/browse/SPARK-28556
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> Right now Error is not sent to QueryExecutionListener.onFailure. If there is 
> any Error when running a query, QueryExecutionListener.onFailure cannot be 
> triggered.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28556) Error should also be sent to QueryExecutionListener.onFailure

2019-07-29 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-28556:
-
Labels: release-notes  (was: )

> Error should also be sent to QueryExecutionListener.onFailure
> -
>
> Key: SPARK-28556
> URL: https://issues.apache.org/jira/browse/SPARK-28556
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>  Labels: release-notes
>
> Right now Error is not sent to QueryExecutionListener.onFailure. If there is 
> any Error when running a query, QueryExecutionListener.onFailure cannot be 
> triggered.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28556) Error should also be sent to QueryExecutionListener.onFailure

2019-07-29 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-28556:


 Summary: Error should also be sent to 
QueryExecutionListener.onFailure
 Key: SPARK-28556
 URL: https://issues.apache.org/jira/browse/SPARK-28556
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.3
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


Right now Error is not sent to QueryExecutionListener.onFailure. If there is 
any Error when running a query, QueryExecutionListener.onFailure cannot be 
triggered.
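For context, a minimal sketch (listener name and log messages are assumptions) of 
a listener affected by this. In Spark 2.4 the onFailure signature only accepts an 
Exception, so a query that dies with a java.lang.Error never reaches it; Spark 3.0 
widens the parameter to Throwable:

{code}
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

class LoggingListener extends QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    println(s"$funcName succeeded in ${durationNs / 1e6} ms")

  // Spark 2.4 signature: an Error thrown during execution cannot be passed here.
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
    println(s"$funcName failed: ${exception.getMessage}")
}

// Registered via spark.listenerManager.register(new LoggingListener)
{code}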



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16754) NPE when defining case class and searching Encoder in the same line

2019-07-25 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16893125#comment-16893125
 ] 

Shixiong Zhu commented on SPARK-16754:
--

I think prepending 
`org.apache.spark.sql.catalyst.encoders.OuterScopes.addOuterScope(this)` to 
every command should fix the problem. I have already forgotten why we didn't do 
this for Scala 2.11.
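A sketch of that workaround applied to the reproducer below (assuming a Spark 
shell session); addOuterScope registers the REPL line object so the generated 
deserializer can find the outer instance:

{code}
// Paste together with (or before) the failing snippet in spark-shell:
org.apache.spark.sql.catalyst.encoders.OuterScopes.addOuterScope(this)

case class TestCaseClass(value: Int)
import spark.implicits._
Seq(TestCaseClass(1)).toDS().collect()
{code}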

> NPE when defining case class and searching Encoder in the same line
> ---
>
> Key: SPARK-16754
> URL: https://issues.apache.org/jira/browse/SPARK-16754
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.3.0
> Environment: Spark Shell for Scala 2.11
>Reporter: Shixiong Zhu
>Priority: Minor
>
> Reproducer:
> {code}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> case class TestCaseClass(value: Int)
> import spark.implicits._
> Seq(TestCaseClass(1)).toDS().collect()
> // Exiting paste mode, now interpreting.
> java.lang.RuntimeException: baseClassName: $line14.$read
>   at 
> org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:62)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$12.apply(objects.scala:251)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$12.apply(objects.scala:251)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.NewInstance.doGenCode(objects.scala:251)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:101)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:101)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$$anonfun$3.apply(GenerateSafeProjection.scala:145)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$$anonfun$3.apply(GenerateSafeProjection.scala:142)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:142)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:36)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:821)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.constructProjection$lzycompute(ExpressionEncoder.scala:258)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.constructProjection(ExpressionEncoder.scala:258)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(ExpressionEncoder.scala:289)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1$$anonfun$apply$15.apply(Dataset.scala:2218)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1$$anonfun$apply$15.apply(Dataset.scala:2218)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2218)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>   at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2568)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2217)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:)
>   at org.apache.spark.sql.Dataset.withCallback(Dataset.

[jira] [Updated] (SPARK-28489) KafkaOffsetRangeCalculator.getRanges may drop offsets

2019-07-24 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-28489:
-
Affects Version/s: 2.4.0
   2.4.1
   2.4.2

> KafkaOffsetRangeCalculator.getRanges may drop offsets
> -
>
> Key: SPARK-28489
> URL: https://issues.apache.org/jira/browse/SPARK-28489
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
>  Labels: correctness, dataloss
>
> KafkaOffsetRangeCalculator.getRanges may drop offsets due to round off errors.
>  
> This only affects queries using "minPartitions" option. A workaround is just 
> removing the "minPartitions" option from the query.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28489) KafkaOffsetRangeCalculator.getRanges may drop offsets

2019-07-24 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-28489:
-
Description: 
KafkaOffsetRangeCalculator.getRanges may drop offsets due to round off errors.

 

This only affects queries using "minPartitions" option. A workaround is just 
removing the "minPartitions" option from the query.

  was:KafkaOffsetRangeCalculator.getRanges may drop offsets due to round off 
errors.


> KafkaOffsetRangeCalculator.getRanges may drop offsets
> -
>
> Key: SPARK-28489
> URL: https://issues.apache.org/jira/browse/SPARK-28489
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
>  Labels: correctness, dataloss
>
> KafkaOffsetRangeCalculator.getRanges may drop offsets due to round off errors.
>  
> This only affects queries using "minPartitions" option. A workaround is just 
> removing the "minPartitions" option from the query.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28489) KafkaOffsetRangeCalculator.getRanges may drop offsets

2019-07-23 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-28489:


 Summary: KafkaOffsetRangeCalculator.getRanges may drop offsets
 Key: SPARK-28489
 URL: https://issues.apache.org/jira/browse/SPARK-28489
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.3
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


KafkaOffsetRangeCalculator.getRanges may drop offsets due to round off errors.
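A hedged illustration of the failure mode (this is not the actual 
KafkaOffsetRangeCalculator code): splitting an offset range with a truncated 
per-piece size, as "minPartitions" can force, leaves trailing offsets unassigned:

{code}
val from = 0L
val until = 10L        // offsets 0..9 exist in this topic-partition
val pieces = 3
val sizePerPiece = (until - from) / pieces   // integer division: 3

val ranges = (0 until pieces).map { i =>
  (from + i * sizePerPiece, from + (i + 1) * sizePerPiece)
}
// ranges == Vector((0,3), (3,6), (6,9)); offset 9 is silently dropped.
{code}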



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28486) PythonBroadcast may delete the broadcast file while a Python worker still needs it

2019-07-23 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-28486:
-
Issue Type: Bug  (was: New Feature)

> PythonBroadcast may delete the broadcast file while a Python worker still 
> needs it
> --
>
> Key: SPARK-28486
> URL: https://issues.apache.org/jira/browse/SPARK-28486
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Shixiong Zhu
>Priority: Major
>
> Steps to reproduce:
>  * Run "bin/pyspark --master local[1,1] --conf spark.memory.fraction=0.0001" 
> to start PySpark
>  * Run the following code:
> {code:java}
> b = sc.broadcast([100])
> sc.parallelize([0],1).map(lambda x: 0 if x == 0 else b.value[0]).collect()
> sc._jvm.java.lang.System.gc()
> import time
> time.sleep(5)
> sc._jvm.java.lang.System.gc()
> time.sleep(5)
> sc.parallelize([1],1).map(lambda x: 0 if x == 0 else b.value[0]).collect()
> {code}
> * Error:
> {code}
> IOError: [Errno 2] No such file or directory: 
> u'.../spark-ee2a0da1-7d2e-48fd-be9a-fdcc89c5076c/broadcast4970491472715621982'
>   at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:588)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:571)
>   at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:891)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>   at 
> org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>   at 
> org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>   at 
> org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:121)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   ... 1 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28486) PythonBroadcast may delete the broadcast file while a Python worker still needs it

2019-07-23 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-28486:


 Summary: PythonBroadcast may delete the broadcast file while a 
Python worker still needs it
 Key: SPARK-28486
 URL: https://issues.apache.org/jira/browse/SPARK-28486
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Affects Versions: 2.4.3
Reporter: Shixiong Zhu


Steps to reproduce:
 * Run "bin/pyspark --master local[1,1] --conf spark.memory.fraction=0.0001" to 
start PySpark
 * Run the following code:

{code:java}
b = sc.broadcast([100])
sc.parallelize([0],1).map(lambda x: 0 if x == 0 else b.value[0]).collect()
sc._jvm.java.lang.System.gc()
import time
time.sleep(5)
sc._jvm.java.lang.System.gc()
time.sleep(5)
sc.parallelize([1],1).map(lambda x: 0 if x == 0 else b.value[0]).collect()
{code}
* Error:

{code}
IOError: [Errno 2] No such file or directory: 
u'.../spark-ee2a0da1-7d2e-48fd-be9a-fdcc89c5076c/broadcast4970491472715621982'

at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:588)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:571)
at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at 
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at 
org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at 
org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at 
org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
at 
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
at 
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
{code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28456) Add a public API `Encoder.makeCopy` to allow creating Encoder without touching Scala reflections

2019-07-21 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-28456:
-
Description: 
Because `Encoder` is not thread safe, the user cannot reuse an `Encoder` in 
multiple `Dataset`s. However, creating an `Encoder` for a complicated class is 
slow due to Scala reflections. To reduce the cost of Encoder creation, right 
now I usually use the private API `ExpressionEncoder.copy` as follows:
{code:java}
object FooEncoder {
 private lazy val _encoder: ExpressionEncoder[Foo] = ExpressionEncoder[Foo]()
 implicit def encoder: ExpressionEncoder[Foo] = _encoder.copy()
}
{code}
This PR proposes a new method `makeCopy` in `Encoder` so that the above code 
can be rewritten using public APIs.
{code:java}
object FooEncoder {
 private lazy val _encoder: Encoder[Foo] = Encoders.product[Foo]()
 implicit def encoder: Encoder[Foo] = _encoder.makeCopy
}
{code}

  was:
Because `Encoder` is not thread safe, the user cannot reuse an `Encoder` in 
multiple `Dataset`s. However, creating an `Encoder` for a complicated class is 
slow due to Scala reflections. To reduce the cost of Encoder creation, right 
now I usually use the private API `ExpressionEncoder.copy` as follows:

{code}
object FooEncoder {
 private lazy val _encoder: ExpressionEncoder[Foo] = ExpressionEncoder[Foo]()
 implicit def encoder: ExpressionEncoder[Foo] = _encoder.copy()
}
{code}

This PR proposes a new method `copyEncoder` in `Encoder` so that the above 
code can be rewritten using public APIs.

{code}
object FooEncoder {
 private lazy val _encoder: Encoder[Foo] = Encoders.product[Foo]()
 implicit def encoder: Encoder[Foo] = _encoder.copyEncoder()
}
{code}

Regarding the method name, 
- Why not use `copy`? It conflicts with `case class`'s copy.
- Why not use `clone`? It conflicts with `Object.clone`.

Summary: Add a public API `Encoder.makeCopy` to allow creating Encoder 
without touching Scala reflections  (was: Add a public API 
`Encoder.copyEncoder` to allow creating Encoder without touching Scala 
reflections)

> Add a public API `Encoder.makeCopy` to allow creating Encoder without 
> touching Scala reflections
> 
>
> Key: SPARK-28456
> URL: https://issues.apache.org/jira/browse/SPARK-28456
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> Because `Encoder` is not thread safe, the user cannot reuse an `Encoder` in 
> multiple `Dataset`s. However, creating an `Encoder` for a complicated class 
> is slow due to Scala reflections. To reduce the cost of Encoder creation, 
> right now I usually use the private API `ExpressionEncoder.copy` as follows:
> {code:java}
> object FooEncoder {
>  private lazy val _encoder: ExpressionEncoder[Foo] = ExpressionEncoder[Foo]()
>  implicit def encoder: ExpressionEncoder[Foo] = _encoder.copy()
> }
> {code}
> This PR proposes a new method `makeCopy` in `Encoder` so that the above code 
> can be rewritten using public APIs.
> {code:java}
> object FooEncoder {
>  private lazy val _encoder: Encoder[Foo] = Encoders.product[Foo]()
>  implicit def encoder: Encoder[Foo] = _encoder.makeCopy
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28456) Add a public API `Encoder.copyEncoder` to allow creating Encoder without touching Scala reflections

2019-07-19 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-28456:


 Summary: Add a public API `Encoder.copyEncoder` to allow creating 
Encoder without touching Scala reflections
 Key: SPARK-28456
 URL: https://issues.apache.org/jira/browse/SPARK-28456
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.4.3
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


Because `Encoder` is not thread safe, the user cannot reuse an `Encoder` in 
multiple `Dataset`s. However, creating an `Encoder` for a complicated class is 
slow due to Scala reflections. To reduce the cost of Encoder creation, right 
now I usually use the private API `ExpressionEncoder.copy` as follows:

{code}
object FooEncoder {
 private lazy val _encoder: ExpressionEncoder[Foo] = ExpressionEncoder[Foo]()
 implicit def encoder: ExpressionEncoder[Foo] = _encoder.copy()
}
{code}

This PR proposes a new method `copyEncoder` in `Encoder` so that the above 
code can be rewritten using public APIs.

{code}
object FooEncoder {
 private lazy val _encoder: Encoder[Foo] = Encoders.product[Foo]()
 implicit def encoder: Encoder[Foo] = _encoder.copyEncoder()
}
{code}

Regarding the method name, 
- Why not use `copy`? It conflicts with `case class`'s copy.
- Why not use `clone`? It conflicts with `Object.clone`.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20547) ExecutorClassLoader's findClass may not work correctly when a task is cancelled.

2019-05-28 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-20547.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

> ExecutorClassLoader's findClass may not work correctly when a task is 
> cancelled.
> 
>
> Key: SPARK-20547
> URL: https://issues.apache.org/jira/browse/SPARK-20547
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 3.0.0
>
>
> ExecutorClassLoader's findClass may throw some transient exception. For 
> example, when a task is cancelled, if ExecutorClassLoader is running, you may 
> see InterruptedException or IOException, even if this class can be loaded. 
> Then the result of findClass will be cached by JVM, and later when the same 
> class is being loaded (note: in this case, this class may be still loadable), 
> it will just throw NoClassDefFoundError.
> We should make ExecutorClassLoader retry on transient exceptions.
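
A rough sketch of the retry idea (the names below are illustrative, not the actual 
ExecutorClassLoader code):
{code}
// Sketch only: a transient IOException must not surface as ClassNotFoundException,
// otherwise the JVM caches the negative result and later loads of the same
// (loadable) class fail with NoClassDefFoundError.
import java.io.IOException

abstract class RetryingClassLoader(parent: ClassLoader) extends ClassLoader(parent) {
  // Fetches the class bytes (e.g. over the network); may fail transiently.
  protected def fetchClassBytes(name: String): Array[Byte]

  private val maxAttempts = 3

  override def findClass(name: String): Class[_] = {
    def attempt(remaining: Int): Class[_] =
      try {
        val bytes = fetchClassBytes(name)
        defineClass(name, bytes, 0, bytes.length)
      } catch {
        case _: IOException if remaining > 1 =>
          attempt(remaining - 1)   // retry instead of failing the load
      }
    attempt(maxAttempts)
  }
}
{code}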



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27711) InputFileBlockHolder should be unset at the end of tasks

2019-05-23 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-27711:
-
Component/s: PySpark

> InputFileBlockHolder should be unset at the end of tasks
> 
>
> Key: SPARK-27711
> URL: https://issues.apache.org/jira/browse/SPARK-27711
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.3
>Reporter: Jose Torres
>Priority: Major
>
> InputFileBlockHolder should be unset at the end of each task. Otherwise the 
> value of input_file_name() can leak over to other tasks instead of beginning 
> as empty string.
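
A tiny sketch of the thread-local pattern involved (illustrative names, not Spark's 
actual InputFileBlockHolder):
{code}
// The fix described above amounts to calling something like unset() when the task
// finishes, so the next task reusing the same thread starts from "".
object InputFileHolderSketch {
  private val current = new ThreadLocal[String] {
    override def initialValue(): String = ""   // input_file_name() defaults to ""
  }
  def set(file: String): Unit = current.set(file)
  def get(): String = current.get()
  def unset(): Unit = current.remove()         // must run at task completion
}
{code}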



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20547) ExecutorClassLoader's findClass may not work correctly when a task is cancelled.

2019-05-22 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-20547:
-
Labels:   (was: bulk-closed)

> ExecutorClassLoader's findClass may not work correctly when a task is 
> cancelled.
> 
>
> Key: SPARK-20547
> URL: https://issues.apache.org/jira/browse/SPARK-20547
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>
> ExecutorClassLoader's findClass may throw some transient exception. For 
> example, when a task is cancelled, if ExecutorClassLoader is running, you may 
> see InterruptedException or IOException, even if this class can be loaded. 
> Then the result of findClass will be cached by JVM, and later when the same 
> class is being loaded (note: in this case, this class may be still loadable), 
> it will just throw NoClassDefFoundError.
> We should make ExecutorClassLoader retry on transient exceptions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-20547) ExecutorClassLoader's findClass may not work correctly when a task is cancelled.

2019-05-21 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reopened SPARK-20547:
--
  Assignee: Shixiong Zhu

> ExecutorClassLoader's findClass may not work correctly when a task is 
> cancelled.
> 
>
> Key: SPARK-20547
> URL: https://issues.apache.org/jira/browse/SPARK-20547
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>  Labels: bulk-closed
>
> ExecutorClassLoader's findClass may throw some transient exception. For 
> example, when a task is cancelled, if ExecutorClassLoader is running, you may 
> see InterruptedException or IOException, even if this class can be loaded. 
> Then the result of findClass will be cached by JVM, and later when the same 
> class is being loaded (note: in this case, this class may be still loadable), 
> it will just throw NoClassDefFoundError.
> We should make ExecutorClassLoader retry on transient exceptions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-11095) Simplify Netty RPC implementation by using a separate thread pool for each endpoint

2019-05-21 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-11095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reopened SPARK-11095:
--

> Simplify Netty RPC implementation by using a separate thread pool for each 
> endpoint
> ---
>
> Key: SPARK-11095
> URL: https://issues.apache.org/jira/browse/SPARK-11095
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Shixiong Zhu
>Priority: Major
>  Labels: bulk-closed
>
> The dispatcher class and the inbox class of the current Netty-based RPC 
> implementation is fairly complicated. It uses a single, shared thread pool to 
> execute all the endpoints. This is similar to how Akka does actor message 
> dispatching. The benefit of this design is that this RPC implementation can 
> support a very large number of endpoints, as they are all multiplexed into a 
> single thread pool for execution. The downside is the complexity resulting 
> from synchronization and coordination.
> An alternative implementation is to have a separate message queue and thread 
> pool for each endpoint. The dispatcher simply routes the messages to the 
> appropriate message queue, and the threads poll the queue for messages to 
> process.
> If the endpoint is single threaded, then the thread pool should contain only 
> a single thread. If the endpoint supports concurrent execution, then the 
> thread pool should contain more threads.
> Two additional things we need to be careful with are:
> 1. An endpoint should only process normal messages after OnStart is called. 
> This can be done by having the thread that starts the endpoint processing 
> OnStart.
> 2. An endpoint should process OnStop after all normal messages have been 
> processed. I think this can be done by having a busy loop to spin until the 
> size of the message queue is 0.
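
A rough sketch of the per-endpoint alternative described above (assumed names, not 
Spark's actual RPC classes):
{code}
// Each endpoint owns its own queue and thread pool; the dispatcher only routes.
import java.util.concurrent.{ConcurrentHashMap, Executors, ExecutorService}

trait Endpoint { def receive(msg: Any): Unit }

final class PerEndpointDispatcher {
  private val endpoints = new ConcurrentHashMap[String, Endpoint]()
  private val executors = new ConcurrentHashMap[String, ExecutorService]()

  def register(name: String, endpoint: Endpoint, threads: Int = 1): Unit = {
    endpoints.put(name, endpoint)
    // Single-threaded endpoints get a one-thread pool; concurrent ones get more.
    executors.put(name, Executors.newFixedThreadPool(threads))
  }

  def post(name: String, msg: Any): Unit = {
    val endpoint = endpoints.get(name)
    val exec = executors.get(name)
    if (endpoint != null && exec != null) {
      exec.execute(new Runnable { def run(): Unit = endpoint.receive(msg) })
    }
  }
}
{code}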



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11095) Simplify Netty RPC implementation by using a separate thread pool for each endpoint

2019-05-21 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-11095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-11095.
--
Resolution: Won't Do

> Simplify Netty RPC implementation by using a separate thread pool for each 
> endpoint
> ---
>
> Key: SPARK-11095
> URL: https://issues.apache.org/jira/browse/SPARK-11095
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Shixiong Zhu
>Priority: Major
>  Labels: bulk-closed
>
> The dispatcher class and the inbox class of the current Netty-based RPC 
> implementation is fairly complicated. It uses a single, shared thread pool to 
> execute all the endpoints. This is similar to how Akka does actor message 
> dispatching. The benefit of this design is that this RPC implementation can 
> support a very large number of endpoints, as they are all multiplexed into a 
> single thread pool for execution. The downside is the complexity resulting 
> from synchronization and coordination.
> An alternative implementation is to have a separate message queue and thread 
> pool for each endpoint. The dispatcher simply routes the messages to the 
> appropriate message queue, and the threads poll the queue for messages to 
> process.
> If the endpoint is single threaded, then the thread pool should contain only 
> a single thread. If the endpoint supports concurrent execution, then the 
> thread pool should contain more threads.
> Two additional things we need to be careful with are:
> 1. An endpoint should only process normal messages after OnStart is called. 
> This can be done by having the thread that starts the endpoint processing 
> OnStart.
> 2. An endpoint should process OnStop after all normal messages have been 
> processed. I think this can be done by having a busy loop to spin until the 
> size of the message queue is 0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-17858) Provide option for Spark SQL to skip corrupt files

2019-05-21 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-17858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reopened SPARK-17858:
--
  Assignee: Shixiong Zhu

> Provide option for Spark SQL to skip corrupt files
> --
>
> Key: SPARK-17858
> URL: https://issues.apache.org/jira/browse/SPARK-17858
> Project: Spark
>  Issue Type: Improvement
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>  Labels: bulk-closed
>
> In Spark 2.0, corrupt files will fail a SQL query. However, the user may just 
> want to skip corrupt files and still run the query.
> Another pain point is that the current exception doesn't contain the paths of 
> the corrupt files, which makes it hard for the user to fix them. It's better to 
> include the paths in the error message.
> Note: In Spark 1.6, Spark SQL always skips corrupt files because of 
> SPARK-17850.
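
For reference, the kind of opt-in behaviour being asked for, expressed as a conf 
(later Spark releases expose this as `spark.sql.files.ignoreCorruptFiles`; the path 
below is a placeholder):
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("skip-corrupt-files")
  .config("spark.sql.files.ignoreCorruptFiles", "true")  // skip unreadable files instead of failing
  .getOrCreate()

spark.read.parquet("/data/events").count()  // corrupt files are skipped, the query still completes
{code}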



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17858) Provide option for Spark SQL to skip corrupt files

2019-05-21 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-17858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-17858.
--
Resolution: Duplicate

> Provide option for Spark SQL to skip corrupt files
> --
>
> Key: SPARK-17858
> URL: https://issues.apache.org/jira/browse/SPARK-17858
> Project: Spark
>  Issue Type: Improvement
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>  Labels: bulk-closed
>
> In Spark 2.0, corrupt files will fail a SQL query. However, the user may just 
> want to skip corrupt files and still run the query.
> Another pain point is that the current exception doesn't contain the paths of 
> the corrupt files, which makes it hard for the user to fix them. It's better to 
> include the paths in the error message.
> Note: In Spark 1.6, Spark SQL always skips corrupt files because of 
> SPARK-17850.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10719) SQLImplicits.rddToDataFrameHolder is not thread safe when using Scala 2.10

2019-05-21 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-10719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-10719:
-
Fix Version/s: 2.3.0

> SQLImplicits.rddToDataFrameHolder is not thread safe when using Scala 2.10
> --
>
> Key: SPARK-10719
> URL: https://issues.apache.org/jira/browse/SPARK-10719
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1, 1.4.1, 1.5.0, 1.6.0
> Environment: Scala 2.10
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>  Labels: bulk-closed
> Fix For: 2.3.0
>
>
> Sometimes the following code fails
> {code}
> val conf = new SparkConf().setAppName("sql-memory-leak")
> val sc = new SparkContext(conf)
> val sqlContext = new SQLContext(sc)
> import sqlContext.implicits._
> (1 to 1000).par.foreach { _ =>
>   sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count()
> }
> {code}
> The stack trace is 
> {code}
> Exception in thread "main" java.lang.UnsupportedOperationException: tail of 
> empty list
>   at scala.collection.immutable.Nil$.tail(List.scala:339)
>   at scala.collection.immutable.Nil$.tail(List.scala:334)
>   at scala.reflect.internal.SymbolTable.popPhase(SymbolTable.scala:172)
>   at 
> scala.reflect.internal.Symbols$Symbol.unsafeTypeParams(Symbols.scala:1477)
>   at scala.reflect.internal.Symbols$TypeSymbol.tpe(Symbols.scala:2777)
>   at scala.reflect.internal.Mirrors$RootsBase.init(Mirrors.scala:235)
>   at 
> scala.reflect.runtime.JavaMirrors$class.createMirror(JavaMirrors.scala:34)
>   at 
> scala.reflect.runtime.JavaMirrors$class.runtimeMirror(JavaMirrors.scala:61)
>   at 
> scala.reflect.runtime.JavaUniverse.runtimeMirror(JavaUniverse.scala:12)
>   at 
> scala.reflect.runtime.JavaUniverse.runtimeMirror(JavaUniverse.scala:12)
>   at SparkApp$$anonfun$main$1.apply$mcJI$sp(SparkApp.scala:16)
>   at SparkApp$$anonfun$main$1.apply(SparkApp.scala:15)
>   at SparkApp$$anonfun$main$1.apply(SparkApp.scala:15)
>   at scala.Function1$class.apply$mcVI$sp(Function1.scala:39)
>   at 
> scala.runtime.AbstractFunction1.apply$mcVI$sp(AbstractFunction1.scala:12)
>   at 
> scala.collection.parallel.immutable.ParRange$ParRangeIterator.foreach(ParRange.scala:91)
>   at 
> scala.collection.parallel.ParIterableLike$Foreach.leaf(ParIterableLike.scala:975)
>   at 
> scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
>   at 
> scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
>   at 
> scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
>   at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
>   at 
> scala.collection.parallel.ParIterableLike$Foreach.tryLeaf(ParIterableLike.scala:972)
>   at 
> scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:172)
>   at 
> scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:514)
>   at 
> scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:162)
>   at 
> scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
>   at 
> scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}
> Finally, I found the problem. The code generated by the Scala compiler to find 
> the implicit TypeTag is not thread safe because of an issue in Scala 2.10: 
> https://issues.scala-lang.org/browse/SI-6240
> This issue was fixed in Scala 2.11 but not backported to 2.10.
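
A hedged workaround sketch for the Scala 2.10 case (same setup as the reproduction 
above; serializing the toDF call keeps the non-thread-safe reflection code single 
threaded):
{code}
val lock = new Object
(1 to 1000).par.foreach { _ =>
  val rdd = sc.parallelize(1 to 5).map { i => (i, i) }
  val df = lock.synchronized { rdd.toDF("a", "b") }  // implicit TypeTag materialization happens here
  df.count()                                         // the action itself can run concurrently
}
{code}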



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10719) SQLImplicits.rddToDataFrameHolder is not thread safe when using Scala 2.10

2019-05-21 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-10719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu closed SPARK-10719.

Assignee: Shixiong Zhu

We can close this since Scala 2.10 has been dropped in Spark 2.3.0.

> SQLImplicits.rddToDataFrameHolder is not thread safe when using Scala 2.10
> --
>
> Key: SPARK-10719
> URL: https://issues.apache.org/jira/browse/SPARK-10719
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1, 1.4.1, 1.5.0, 1.6.0
> Environment: Scala 2.10
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>  Labels: bulk-closed
>
> Sometimes the following code fails
> {code}
> val conf = new SparkConf().setAppName("sql-memory-leak")
> val sc = new SparkContext(conf)
> val sqlContext = new SQLContext(sc)
> import sqlContext.implicits._
> (1 to 1000).par.foreach { _ =>
>   sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count()
> }
> {code}
> The stack trace is 
> {code}
> Exception in thread "main" java.lang.UnsupportedOperationException: tail of 
> empty list
>   at scala.collection.immutable.Nil$.tail(List.scala:339)
>   at scala.collection.immutable.Nil$.tail(List.scala:334)
>   at scala.reflect.internal.SymbolTable.popPhase(SymbolTable.scala:172)
>   at 
> scala.reflect.internal.Symbols$Symbol.unsafeTypeParams(Symbols.scala:1477)
>   at scala.reflect.internal.Symbols$TypeSymbol.tpe(Symbols.scala:2777)
>   at scala.reflect.internal.Mirrors$RootsBase.init(Mirrors.scala:235)
>   at 
> scala.reflect.runtime.JavaMirrors$class.createMirror(JavaMirrors.scala:34)
>   at 
> scala.reflect.runtime.JavaMirrors$class.runtimeMirror(JavaMirrors.scala:61)
>   at 
> scala.reflect.runtime.JavaUniverse.runtimeMirror(JavaUniverse.scala:12)
>   at 
> scala.reflect.runtime.JavaUniverse.runtimeMirror(JavaUniverse.scala:12)
>   at SparkApp$$anonfun$main$1.apply$mcJI$sp(SparkApp.scala:16)
>   at SparkApp$$anonfun$main$1.apply(SparkApp.scala:15)
>   at SparkApp$$anonfun$main$1.apply(SparkApp.scala:15)
>   at scala.Function1$class.apply$mcVI$sp(Function1.scala:39)
>   at 
> scala.runtime.AbstractFunction1.apply$mcVI$sp(AbstractFunction1.scala:12)
>   at 
> scala.collection.parallel.immutable.ParRange$ParRangeIterator.foreach(ParRange.scala:91)
>   at 
> scala.collection.parallel.ParIterableLike$Foreach.leaf(ParIterableLike.scala:975)
>   at 
> scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
>   at 
> scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
>   at 
> scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
>   at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
>   at 
> scala.collection.parallel.ParIterableLike$Foreach.tryLeaf(ParIterableLike.scala:972)
>   at 
> scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:172)
>   at 
> scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:514)
>   at 
> scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:162)
>   at 
> scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
>   at 
> scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}
> Finally, I found the problem. The code generated by the Scala compiler to find 
> the implicit TypeTag is not thread safe because of an issue in Scala 2.10: 
> https://issues.scala-lang.org/browse/SI-6240
> This issue was fixed in Scala 2.11 but not backported to 2.10.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27753) Support SQL expressions for interval parameter in Structured Streaming

2019-05-16 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-27753:


 Summary: Support SQL expressions for interval parameter in 
Structured Streaming
 Key: SPARK-27753
 URL: https://issues.apache.org/jira/browse/SPARK-27753
 Project: Spark
  Issue Type: New Feature
  Components: Structured Streaming
Affects Versions: 2.4.3
Reporter: Shixiong Zhu


Structured Streaming has several methods that accept an interval string. It 
would be great if we could use the parser to parse it so that we can also 
support SQL expressions.
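
For concreteness, the kind of call site meant here (Trigger.ProcessingTime with a 
plain interval string is an existing API; the SQL-expression form is the proposal 
and is shown only as a comment):
{code}
import org.apache.spark.sql.streaming.Trigger

val t = Trigger.ProcessingTime("10 seconds")
// Proposed: also accept parser-backed SQL interval expressions, e.g.
// Trigger.ProcessingTime("interval 10 seconds")
{code}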



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27735) Interval string in upper case is not supported in Trigger

2019-05-15 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-27735:
-
Description: Some APIs in Structured Streaming require the user to specify 
an interval. Right now these APIs don't accept upper-case strings.

> Interval string in upper case is not supported in Trigger
> -
>
> Key: SPARK-27735
> URL: https://issues.apache.org/jira/browse/SPARK-27735
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> Some APIs in Structured Streaming require the user to specify an interval. 
> Right now these APIs don't accept upper-case strings.
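
As an illustration (Trigger.ProcessingTime is one of the interval-string APIs meant):
{code}
import org.apache.spark.sql.streaming.Trigger

Trigger.ProcessingTime("10 seconds")   // accepted
Trigger.ProcessingTime("10 SECONDS")   // rejected before the fix tracked here
{code}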



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27735) Interval string in upper case is not supported in Trigger

2019-05-15 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-27735:


 Summary: Interval string in upper case is not supported in Trigger
 Key: SPARK-27735
 URL: https://issues.apache.org/jira/browse/SPARK-27735
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.3
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27494) Null keys/values don't work in Kafka source v2

2019-04-26 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-27494:
-
Description: 
Right now Kafka source v2 doesn't support null keys or values.
 * When processing a null key, all of the following keys in the same partition 
will be null. This is a correctness bug.
 * When processing a null value, it will throw NPE.

The workaround is setting sql conf 
"spark.sql.streaming.disabledV2MicroBatchReaders" to 
"org.apache.spark.sql.kafka010.KafkaSourceProvider" to use the v1 source.

  was:Right now Kafka source v2 doesn't support null values. The issue is in 
org.apache.spark.sql.kafka010.KafkaRecordToUnsafeRowConverter.toUnsafeRow which 
doesn't handle null values.


> Null keys/values don't work in Kafka source v2
> --
>
> Key: SPARK-27494
> URL: https://issues.apache.org/jira/browse/SPARK-27494
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.1
>Reporter: Shixiong Zhu
>Assignee: Genmao Yu
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.0, 2.4.3
>
>
> Right now Kafka source v2 doesn't support null keys or values.
>  * When processing a null key, all of the following keys in the same 
> partition will be null. This is a correctness bug.
>  * When processing a null value, it will throw NPE.
> The workaround is setting sql conf 
> "spark.sql.streaming.disabledV2MicroBatchReaders" to 
> "org.apache.spark.sql.kafka010.KafkaSourceProvider" to use the v1 source.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27494) Null keys/values don't work in Kafka source v2

2019-04-26 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-27494:
-
Summary: Null keys/values don't work in Kafka source v2  (was: Null values 
don't work in Kafka source v2)

> Null keys/values don't work in Kafka source v2
> --
>
> Key: SPARK-27494
> URL: https://issues.apache.org/jira/browse/SPARK-27494
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.1
>Reporter: Shixiong Zhu
>Assignee: Genmao Yu
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.0, 2.4.3
>
>
> Right now Kafka source v2 doesn't support null values. The issue is in 
> org.apache.spark.sql.kafka010.KafkaRecordToUnsafeRowConverter.toUnsafeRow 
> which doesn't handle null values.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27494) Null values don't work in Kafka source v2

2019-04-26 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-27494:
-
Labels: correctness  (was: )

> Null values don't work in Kafka source v2
> -
>
> Key: SPARK-27494
> URL: https://issues.apache.org/jira/browse/SPARK-27494
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.1
>Reporter: Shixiong Zhu
>Assignee: Genmao Yu
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.0, 2.4.3
>
>
> Right now Kafka source v2 doesn't support null values. The issue is in 
> org.apache.spark.sql.kafka010.KafkaRecordToUnsafeRowConverter.toUnsafeRow 
> which doesn't handle null values.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27496) RPC should send back the fatal errors

2019-04-17 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-27496:


 Summary: RPC should send back the fatal errors
 Key: SPARK-27496
 URL: https://issues.apache.org/jira/browse/SPARK-27496
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.1
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


Right now, when a fatal error is thrown from "receiveAndReply", the sender will 
not be notified. We should try our best to send it back.
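
A hypothetical sketch of the intent (this is not Spark's actual RpcEndpoint API):
{code}
// Even a fatal error is reported back to the sender on a best-effort basis before
// it is rethrown locally.
def handleAndReply(handle: Any => Any, reply: Either[Throwable, Any] => Unit)(msg: Any): Unit = {
  try {
    reply(Right(handle(msg)))
  } catch {
    case t: Throwable =>
      reply(Left(t))   // best effort: notify the sender, fatal or not
      throw t          // then propagate as before
  }
}
{code}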



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27468) "Storage Level" in "RDD Storage Page" is not correct

2019-04-17 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16820460#comment-16820460
 ] 

Shixiong Zhu commented on SPARK-27468:
--

[~shahid] You need to use "--master local-cluster[2,1,1024]". The local mode 
has only one BlockManager.

> "Storage Level" in "RDD Storage Page" is not correct
> 
>
> Key: SPARK-27468
> URL: https://issues.apache.org/jira/browse/SPARK-27468
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.1
>Reporter: Shixiong Zhu
>Priority: Major
> Attachments: Screenshot from 2019-04-17 10-42-55.png
>
>
> I ran the following unit test and checked the UI.
> {code}
> val conf = new SparkConf()
>   .setAppName("test")
>   .setMaster("local-cluster[2,1,1024]")
>   .set("spark.ui.enabled", "true")
> sc = new SparkContext(conf)
> val rdd = sc.makeRDD(1 to 10, 1).persist(StorageLevel.MEMORY_ONLY_2)
> rdd.count()
> Thread.sleep(360)
> {code}
> The storage level is "Memory Deserialized 1x Replicated" in the RDD storage 
> page.
> I tried to debug and found this is because Spark emitted the following two 
> events:
> {code}
> event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, 
> 10.8.132.160, 65473, None),rdd_0_0,StorageLevel(memory, deserialized, 2 
> replicas),56,0))
> event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, 
> 10.8.132.160, 65474, None),rdd_0_0,StorageLevel(memory, deserialized, 1 
> replicas),56,0))
> {code}
> The storage level in the second event will overwrite the first one. "1 
> replicas" comes from this line: 
> https://github.com/apache/spark/blob/3ab96d7acf870e53c9016b0b63d0b328eec23bed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1457
> Maybe AppStatusListener should calculate the replicas from events?
> Another fact we may need to think about is when replicas is 2, will two Spark 
> events arrive in the same order? Currently, two RPCs from different executors 
> can arrive in any order.
> Credit goes to [~srfnmnk] who reported this issue originally.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27494) Null values don't work in Kafka source v2

2019-04-17 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-27494:


 Summary: Null values don't work in Kafka source v2
 Key: SPARK-27494
 URL: https://issues.apache.org/jira/browse/SPARK-27494
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.1
Reporter: Shixiong Zhu


Right now Kafka source v2 doesn't support null values. The issue is in 
org.apache.spark.sql.kafka010.KafkaRecordToUnsafeRowConverter.toUnsafeRow which 
doesn't handle null values.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27468) "Storage Level" in "RDD Storage Page" is not correct

2019-04-16 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-27468:
-
Description: 
I ran the following unit test and checked the UI.
{code}
val conf = new SparkConf()
  .setAppName("test")
  .setMaster("local-cluster[2,1,1024]")
  .set("spark.ui.enabled", "true")
sc = new SparkContext(conf)
val rdd = sc.makeRDD(1 to 10, 1).persist(StorageLevel.MEMORY_ONLY_2)
rdd.count()
Thread.sleep(360)
{code}

The storage level is "Memory Deserialized 1x Replicated" in the RDD storage 
page.

I tried to debug and found this is because Spark emitted the following two 
events:
{code}
event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, 
10.8.132.160, 65473, None),rdd_0_0,StorageLevel(memory, deserialized, 2 
replicas),56,0))
event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, 
10.8.132.160, 65474, None),rdd_0_0,StorageLevel(memory, deserialized, 1 
replicas),56,0))
{code}

The storage level in the second event will overwrite the first one. "1 
replicas" comes from this line: 
https://github.com/apache/spark/blob/3ab96d7acf870e53c9016b0b63d0b328eec23bed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1457

Maybe AppStatusListener should calculate the replicas from events?

Another fact we may need to think about is when replicas is 2, will two Spark 
events arrive in the same order? Currently, two RPCs from different executors 
can arrive in any order.

Credit goes to [~srfnmnk] who reported this issue originally.

  was:
I ran the following unit test and checked the UI.
{code}
val conf = new SparkConf()
  .setAppName("test")
  .setMaster("local-cluster[2,1,1024]")
  .set("spark.ui.enabled", "true")
sc = new SparkContext(conf)
val rdd = sc.makeRDD(1 to 10, 1).persist(StorageLevel.MEMORY_ONLY_2)
rdd.count()
Thread.sleep(360)
{code}

The storage level is "Memory Deserialized 1x Replicated" in the RDD storage 
page.

I tried to debug and found this is because Spark emitted the following two 
events:
{code}
event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, 
10.8.132.160, 65473, None),rdd_0_0,StorageLevel(memory, deserialized, 2 
replicas),56,0))
event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, 
10.8.132.160, 65474, None),rdd_0_0,StorageLevel(memory, deserialized, 1 
replicas),56,0))
{code}

The storage level in the second event will overwrite the first one. "1 
replicas" comes from this line: 
https://github.com/apache/spark/blob/3ab96d7acf870e53c9016b0b63d0b328eec23bed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1457

Maybe AppStatusListener should calculate the replicas from events?

Another fact we may need to think about is when replicas is 2, will two Spark 
events arrive in the same order? Currently, two RPCs from different executors 
can arrive in any order.

Credit goes to @dani


> "Storage Level" in "RDD Storage Page" is not correct
> 
>
> Key: SPARK-27468
> URL: https://issues.apache.org/jira/browse/SPARK-27468
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.1
>Reporter: Shixiong Zhu
>Priority: Major
>
> I ran the following unit test and checked the UI.
> {code}
> val conf = new SparkConf()
>   .setAppName("test")
>   .setMaster("local-cluster[2,1,1024]")
>   .set("spark.ui.enabled", "true")
> sc = new SparkContext(conf)
> val rdd = sc.makeRDD(1 to 10, 1).persist(StorageLevel.MEMORY_ONLY_2)
> rdd.count()
> Thread.sleep(360)
> {code}
> The storage level is "Memory Deserialized 1x Replicated" in the RDD storage 
> page.
> I tried to debug and found this is because Spark emitted the following two 
> events:
> {code}
> event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, 
> 10.8.132.160, 65473, None),rdd_0_0,StorageLevel(memory, deserialized, 2 
> replicas),56,0))
> event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, 
> 10.8.132.160, 65474, None),rdd_0_0,StorageLevel(memory, deserialized, 1 
> replicas),56,0))
> {code}
> The storage level in the second event will overwrite the first one. "1 
> replicas" comes from this line: 
> https://github.com/apache/spark/blob/3ab96d7acf870e53c9016b0b63d0b328eec23bed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1457
> Maybe AppStatusListener should calculate the replicas from events?
> Another fact we may need to think about is when replicas is 2, will two Spark 
> events arrive in the same order? Currently, two RPCs from different executors 
> can arrive in any order.
> Credit goes to [~srfnmnk] who reported this issue originally.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---

[jira] [Updated] (SPARK-27468) "Storage Level" in "RDD Storage Page" is not correct

2019-04-16 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-27468:
-
Description: 
I ran the following unit test and checked the UI.
{code}
val conf = new SparkConf()
  .setAppName("test")
  .setMaster("local-cluster[2,1,1024]")
  .set("spark.ui.enabled", "true")
sc = new SparkContext(conf)
val rdd = sc.makeRDD(1 to 10, 1).persist(StorageLevel.MEMORY_ONLY_2)
rdd.count()
Thread.sleep(360)
{code}

The storage level is "Memory Deserialized 1x Replicated" in the RDD storage 
page.

I tried to debug and found this is because Spark emitted the following two 
events:
{code}
event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, 
10.8.132.160, 65473, None),rdd_0_0,StorageLevel(memory, deserialized, 2 
replicas),56,0))
event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, 
10.8.132.160, 65474, None),rdd_0_0,StorageLevel(memory, deserialized, 1 
replicas),56,0))
{code}

The storage level in the second event will overwrite the first one. "1 
replicas" comes from this line: 
https://github.com/apache/spark/blob/3ab96d7acf870e53c9016b0b63d0b328eec23bed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1457

Maybe AppStatusListener should calculate the replicas from events?

Another fact we may need to think about is when replicas is 2, will two Spark 
events arrive in the same order? Currently, two RPCs from different executors 
can arrive in any order.

Credit goes to @dani

  was:
I ran the following unit test and checked the UI.
{code}
val conf = new SparkConf()
  .setAppName("test")
  .setMaster("local-cluster[2,1,1024]")
  .set("spark.ui.enabled", "true")
sc = new SparkContext(conf)
val rdd = sc.makeRDD(1 to 10, 1).persist(StorageLevel.MEMORY_ONLY_2)
rdd.count()
Thread.sleep(360)
{code}

The storage level is "Memory Deserialized 1x Replicated" in the RDD storage 
page.

I tried to debug and found this is because Spark emitted the following two 
events:
{code}
event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, 
10.8.132.160, 65473, None),rdd_0_0,StorageLevel(memory, deserialized, 2 
replicas),56,0))
event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, 
10.8.132.160, 65474, None),rdd_0_0,StorageLevel(memory, deserialized, 1 
replicas),56,0))
{code}

The storage level in the second event will overwrite the first one. "1 
replicas" comes from this line: 
https://github.com/apache/spark/blob/3ab96d7acf870e53c9016b0b63d0b328eec23bed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1457

Maybe AppStatusListener should calculate the replicas from events?

Another fact we may need to think about is when replicas is 2, will two Spark 
events arrive in the same order? Currently, two RPCs from different executors 
can arrive in any order.


> "Storage Level" in "RDD Storage Page" is not correct
> 
>
> Key: SPARK-27468
> URL: https://issues.apache.org/jira/browse/SPARK-27468
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.1
>Reporter: Shixiong Zhu
>Priority: Major
>
> I ran the following unit test and checked the UI.
> {code}
> val conf = new SparkConf()
>   .setAppName("test")
>   .setMaster("local-cluster[2,1,1024]")
>   .set("spark.ui.enabled", "true")
> sc = new SparkContext(conf)
> val rdd = sc.makeRDD(1 to 10, 1).persist(StorageLevel.MEMORY_ONLY_2)
> rdd.count()
> Thread.sleep(360)
> {code}
> The storage level is "Memory Deserialized 1x Replicated" in the RDD storage 
> page.
> I tried to debug and found this is because Spark emitted the following two 
> events:
> {code}
> event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, 
> 10.8.132.160, 65473, None),rdd_0_0,StorageLevel(memory, deserialized, 2 
> replicas),56,0))
> event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, 
> 10.8.132.160, 65474, None),rdd_0_0,StorageLevel(memory, deserialized, 1 
> replicas),56,0))
> {code}
> The storage level in the second event will overwrite the first one. "1 
> replicas" comes from this line: 
> https://github.com/apache/spark/blob/3ab96d7acf870e53c9016b0b63d0b328eec23bed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1457
> Maybe AppStatusListener should calculate the replicas from events?
> Another fact we may need to think about is when replicas is 2, will two Spark 
> events arrive in the same order? Currently, two RPCs from different executors 
> can arrive in any order.
> Credit goes to @dani



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-27468) "Storage Level" in "RDD Storage Page" is not correct

2019-04-15 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-27468:


 Summary: "Storage Level" in "RDD Storage Page" is not correct
 Key: SPARK-27468
 URL: https://issues.apache.org/jira/browse/SPARK-27468
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.1
Reporter: Shixiong Zhu


I ran the following unit test and checked the UI.
{code}
val conf = new SparkConf()
  .setAppName("test")
  .setMaster("local-cluster[2,1,1024]")
  .set("spark.ui.enabled", "true")
sc = new SparkContext(conf)
val rdd = sc.makeRDD(1 to 10, 1).persist(StorageLevel.MEMORY_ONLY_2)
rdd.count()
Thread.sleep(360)
{code}

The storage level is "Memory Deserialized 1x Replicated" in the RDD storage 
page.

I tried to debug and found this is because Spark emitted the following two 
events:
{code}
event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, 
10.8.132.160, 65473, None),rdd_0_0,StorageLevel(memory, deserialized, 2 
replicas),56,0))
event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, 
10.8.132.160, 65474, None),rdd_0_0,StorageLevel(memory, deserialized, 1 
replicas),56,0))
{code}

The storage level in the second event will overwrite the first one. "1 
replicas" comes from this line: 
https://github.com/apache/spark/blob/3ab96d7acf870e53c9016b0b63d0b328eec23bed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1457

Maybe AppStatusListener should calculate the replicas from events?

Another fact we may need to think about is when replicas is 2, will two Spark 
events arrive in the same order? Currently, two RPCs from different executors 
can arrive in any order.
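
A sketch of the "calculate the replicas from events" idea (illustrative only, not 
AppStatusListener's actual data structures):
{code}
import scala.collection.mutable

// The replica count is derived from how many distinct block managers reported the
// block, instead of trusting the StorageLevel carried by the last event.
val blockLocations = mutable.Map.empty[String, mutable.Set[String]]

def onBlockUpdated(blockId: String, blockManagerId: String): Int = {
  val managers = blockLocations.getOrElseUpdate(blockId, mutable.Set.empty[String])
  managers += blockManagerId
  managers.size   // 2 for rdd_0_0 after the two events above
}
{code}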



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27394) The staleness of UI may last minutes or hours when no tasks start or finish

2019-04-10 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-27394:
-
Fix Version/s: 2.4.2

> The staleness of UI may last minutes or hours when no tasks start or finish
> ---
>
> Key: SPARK-27394
> URL: https://issues.apache.org/jira/browse/SPARK-27394
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0, 2.4.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.4.2, 3.0.0
>
>
> Run the following codes on a cluster that has at least 2 cores.
> {code}
> sc.makeRDD(1 to 1000, 1000).foreach { i =>
>   Thread.sleep(30)
> }
> {code}
> The jobs page will just show one running task.
> This is because when the second task event calls 
> "AppStatusListener.maybeUpdate" for a job, it will just ignore since the gap 
> between two events is smaller than `spark.ui.liveUpdate.period`.
> After the second task event, in the above case, because there won't be any 
> other task events, the Spark UI will be always stale until the next task 
> event gets fired (after 300 seconds).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27419) When setting spark.executor.heartbeatInterval to a value less than 1 seconds, it will always fail

2019-04-10 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-27419.
--
   Resolution: Fixed
Fix Version/s: 2.4.2

> When setting spark.executor.heartbeatInterval to a value less than 1 seconds, 
> it will always fail
> -
>
> Key: SPARK-27419
> URL: https://issues.apache.org/jira/browse/SPARK-27419
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.4.2
>
>
> When setting spark.executor.heartbeatInterval to a value less than 1 second 
> in branch-2.4, it will always fail because the value will be converted to 0 
> and the heartbeat will always time out and finally kill the executor.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27419) When setting spark.executor.heartbeatInterval to a value less than 1 seconds, it will always fail

2019-04-09 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-27419:


 Summary: When setting spark.executor.heartbeatInterval to a value 
less than 1 seconds, it will always fail
 Key: SPARK-27419
 URL: https://issues.apache.org/jira/browse/SPARK-27419
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.1, 2.4.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


When setting spark.executor.heartbeatInterval to a value less than 1 second in 
branch-2.4, it will always fail because the value will be converted to 0 and 
the heartbeat will always time out and finally kill the executor.
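
The failing configuration described above (any sub-second value triggers it):
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.heartbeatInterval", "500ms")  // becomes 0 in branch-2.4, so heartbeats always time out
{code}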



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27348) HeartbeatReceiver doesn't remove lost executors from CoarseGrainedSchedulerBackend

2019-04-08 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812657#comment-16812657
 ] 

Shixiong Zhu commented on SPARK-27348:
--

[~sandeep.katta2007] I cannot reproduce this locally. Ideally, when we decide 
to remove an executor, we should remove it from all places rather than counting 
on a TCP disconnect event, which sometimes may not happen. 

> HeartbeatReceiver doesn't remove lost executors from 
> CoarseGrainedSchedulerBackend
> --
>
> Key: SPARK-27348
> URL: https://issues.apache.org/jira/browse/SPARK-27348
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Priority: Major
>
> When a heartbeat timeout happens in HeartbeatReceiver, it doesn't remove lost 
> executors from CoarseGrainedSchedulerBackend. When a connection of an 
> executor is not gracefully shut down, CoarseGrainedSchedulerBackend may not 
> receive a disconnect event. In this case, CoarseGrainedSchedulerBackend still 
> thinks a lost executor is still alive. CoarseGrainedSchedulerBackend may ask 
> TaskScheduler to run tasks on this lost executor. This task will never finish 
> and the job will hang forever.
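
A hypothetical sketch of the approach suggested in the comment above (all names are 
illustrative):
{code}
// On a heartbeat timeout, eagerly drop the executor from every tracking structure
// instead of waiting for a TCP disconnect event that may never arrive.
trait ExecutorTracker { def removeExecutor(id: String, reason: String): Unit }

def onHeartbeatTimeout(executorId: String, trackers: Seq[ExecutorTracker]): Unit =
  trackers.foreach(_.removeExecutor(executorId, "heartbeat timed out"))
{code}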



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27394) The staleness of UI may last minutes or hours when no tasks start or finish

2019-04-04 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-27394:


 Summary: The staleness of UI may last minutes or hours when no 
tasks start or finish
 Key: SPARK-27394
 URL: https://issues.apache.org/jira/browse/SPARK-27394
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.4.1, 2.4.0
Reporter: Shixiong Zhu


Run the following codes on a cluster that has at least 2 cores.
{code}
sc.makeRDD(1 to 1000, 1000).foreach { i =>
  Thread.sleep(30)
}
{code}

The jobs page will just show one running task.

This is because when the second task event calls 
"AppStatusListener.maybeUpdate" for a job, it will just ignore since the gap 
between two events is smaller than `spark.ui.liveUpdate.period`.

After the second task event, in the above case, because there won't be any 
other task events, the Spark UI will be always stale until the next task event 
gets fired (after 300 seconds).
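
A sketch of the gating described above (names are illustrative, not Spark's actual 
AppStatusListener code):
{code}
trait LiveEntity {
  var lastWriteTime: Long = 0L
  def write(now: Long): Unit
}

// Updates arriving within the live-update period of the previous write are dropped,
// and nothing flushes them later unless another event arrives.
def maybeUpdate(entity: LiveEntity, now: Long, liveUpdatePeriodNs: Long): Unit = {
  if (now - entity.lastWriteTime >= liveUpdatePeriodNs) {
    entity.write(now)
    entity.lastWriteTime = now
  }
  // else: silently skipped; this is exactly the staleness this issue describes
}
{code}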



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27394) The staleness of UI may last minutes or hours when no tasks start or finish

2019-04-04 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-27394:


Assignee: Shixiong Zhu

> The staleness of UI may last minutes or hours when no tasks start or finish
> ---
>
> Key: SPARK-27394
> URL: https://issues.apache.org/jira/browse/SPARK-27394
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0, 2.4.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> Run the following codes on a cluster that has at least 2 cores.
> {code}
> sc.makeRDD(1 to 1000, 1000).foreach { i =>
>   Thread.sleep(30)
> }
> {code}
> The jobs page will just show one running task.
> This is because when the second task event calls 
> "AppStatusListener.maybeUpdate" for a job, it will just ignore since the gap 
> between two events is smaller than `spark.ui.liveUpdate.period`.
> After the second task event, in the above case, because there won't be any 
> other task events, the Spark UI will be always stale until the next task 
> event gets fired (after 300 seconds).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27348) HeartbeatReceiver doesn't remove lost executors from CoarseGrainedSchedulerBackend

2019-04-02 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-27348:
-
Description: When a heartbeat timeout happens in HeartbeatReceiver, it 
doesn't remove lost executors from CoarseGrainedSchedulerBackend. When a 
connection is not gracefully shut down, CoarseGrainedSchedulerBackend may not 
receive a disconnect event. In this case, CoarseGrainedSchedulerBackend still 
thinks a lost executor is still alive. CoarseGrainedSchedulerBackend may ask 
TaskScheduler to run tasks on this lost executor. This task will never finish 
and the job will hang forever.  (was: When a heartbeat timeout happens in 
HeartbeatReceiver, it doesn't remove lost executors from 
CoarseGrainedSchedulerBackend. When a connection is gracefully shut down, 
CoarseGrainedSchedulerBackend will not receive a disconnect event. In this 
case, CoarseGrainedSchedulerBackend still thinks a lost executor is still 
alive. CoarseGrainedSchedulerBackend may ask TaskScheduler to run tasks on this 
lost executor. This task will never finish and the job will hang forever.)

> HeartbeatReceiver doesn't remove lost executors from 
> CoarseGrainedSchedulerBackend
> --
>
> Key: SPARK-27348
> URL: https://issues.apache.org/jira/browse/SPARK-27348
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Priority: Major
>
> When a heartbeat timeout happens in HeartbeatReceiver, it doesn't remove lost 
> executors from CoarseGrainedSchedulerBackend. When a connection is not 
> gracefully shut down, CoarseGrainedSchedulerBackend may not receive a 
> disconnect event. In this case, CoarseGrainedSchedulerBackend still thinks a 
> lost executor is still alive. CoarseGrainedSchedulerBackend may ask 
> TaskScheduler to run tasks on this lost executor. This task will never finish 
> and the job will hang forever.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27348) HeartbeatReceiver doesn't remove lost executors from CoarseGrainedSchedulerBackend

2019-04-02 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-27348:
-
Description: When a heartbeat timeout happens in HeartbeatReceiver, it 
doesn't remove lost executors from CoarseGrainedSchedulerBackend. When the 
connection of an executor is not gracefully shut down, 
CoarseGrainedSchedulerBackend may not receive a disconnect event. In this case, 
CoarseGrainedSchedulerBackend still thinks the lost executor is alive and may 
ask TaskScheduler to run tasks on it. Those tasks will never finish and the job 
will hang forever.  (was: When a heartbeat timeout happens in 
HeartbeatReceiver, it doesn't remove lost executors from 
CoarseGrainedSchedulerBackend. When a connection is not gracefully shut down, 
CoarseGrainedSchedulerBackend may not receive a disconnect event. In this case, 
CoarseGrainedSchedulerBackend still thinks the lost executor is alive and may 
ask TaskScheduler to run tasks on it. Those tasks will never finish and the job 
will hang forever.)

> HeartbeatReceiver doesn't remove lost executors from 
> CoarseGrainedSchedulerBackend
> --
>
> Key: SPARK-27348
> URL: https://issues.apache.org/jira/browse/SPARK-27348
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Priority: Major
>
> When a heartbeat timeout happens in HeartbeatReceiver, it doesn't remove lost 
> executors from CoarseGrainedSchedulerBackend. When the connection of an 
> executor is not gracefully shut down, CoarseGrainedSchedulerBackend may not 
> receive a disconnect event. In this case, CoarseGrainedSchedulerBackend still 
> thinks the lost executor is alive and may ask TaskScheduler to run tasks on 
> it. Those tasks will never finish and the job will hang forever.
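
A minimal, self-contained sketch of the intended behavior, assuming heavily 
simplified stand-ins for Spark's components (SimpleBackend, 
SimpleHeartbeatReceiver and their method names are illustrative, not Spark's 
actual API): on a heartbeat timeout, the executor must also be removed from the 
scheduler backend so no further tasks are offered to it.

{code}
import scala.collection.mutable

// Illustrative stand-in for CoarseGrainedSchedulerBackend: tracks which
// executors are considered alive and therefore eligible for new tasks.
class SimpleBackend {
  private val aliveExecutors = mutable.Set[String]()

  def registerExecutor(id: String): Unit = aliveExecutors += id

  // The key step: the heartbeat-timeout path must call this too, otherwise
  // the backend keeps offering tasks to an executor that is already gone.
  def removeExecutor(id: String, reason: String): Unit = {
    aliveExecutors -= id
    println(s"Removed executor $id: $reason")
  }

  def canScheduleOn(id: String): Boolean = aliveExecutors.contains(id)
}

// Illustrative stand-in for HeartbeatReceiver.
class SimpleHeartbeatReceiver(backend: SimpleBackend, timeoutMs: Long) {
  private val lastSeen = mutable.Map[String, Long]()

  def heartbeat(execId: String, now: Long): Unit = lastSeen(execId) = now

  def expireDeadExecutors(now: Long): Unit = {
    for ((execId, ts) <- lastSeen.toSeq if now - ts > timeoutMs) {
      // Before the fix, only the TaskScheduler learned about the loss and
      // the backend still believed the executor was alive.
      backend.removeExecutor(execId, s"no heartbeat for ${now - ts} ms")
      lastSeen -= execId
    }
  }
}

object HeartbeatTimeoutDemo extends App {
  val backend = new SimpleBackend
  val receiver = new SimpleHeartbeatReceiver(backend, timeoutMs = 120000L)
  backend.registerExecutor("exec-1")
  receiver.heartbeat("exec-1", now = 0L)
  receiver.expireDeadExecutors(now = 600000L)  // timeout elapsed, no disconnect event
  assert(!backend.canScheduleOn("exec-1"))     // no tasks offered to the lost executor
}
{code}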



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27348) HeartbeatReceiver doesn't remove lost executors from CoarseGrainedSchedulerBackend

2019-04-02 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-27348:


 Summary: HeartbeatReceiver doesn't remove lost executors from 
CoarseGrainedSchedulerBackend
 Key: SPARK-27348
 URL: https://issues.apache.org/jira/browse/SPARK-27348
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Shixiong Zhu


When a heartbeat timeout happens in HeartbeatReceiver, it doesn't remove lost 
executors from CoarseGrainedSchedulerBackend. When a connection is gracefully 
shut down, CoarseGrainedSchedulerBackend will not receive a disconnect event. 
In this case, CoarseGrainedSchedulerBackend still thinks the lost executor is 
alive and may ask TaskScheduler to run tasks on it. Those tasks will never 
finish and the job will hang forever.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27275) Potential corruption in EncryptedMessage.transferTo

2019-03-25 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-27275:
-
Labels: correctness  (was: )

> Potential corruption in EncryptedMessage.transferTo
> ---
>
> Key: SPARK-27275
> URL: https://issues.apache.org/jira/browse/SPARK-27275
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>  Labels: correctness
>
> `EncryptedMessage.transferTo` has a potential corruption issue. When the 
> underlying buffer has more than `1024 * 32` bytes (this should be rare, but it 
> could happen in error messages that are sent over the wire), it may send only a 
> partial message because `EncryptedMessage.count` becomes less than 
> `transferred`. This causes the client to hang forever (or time out) while it 
> waits for the expected number of bytes, or to see weird errors (such as 
> corruption or silent correctness issues) if the channel is reused by other 
> messages.
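
A minimal sketch of the invariant involved, assuming a plain WritableByteChannel 
and a byte-array payload (transferFully is an illustrative helper, not Spark's 
actual EncryptedMessage code): the reported message length must never fall 
below the number of bytes already transferred, and the sender keeps writing 
until the two match.

{code}
import java.io.ByteArrayOutputStream
import java.nio.ByteBuffer
import java.nio.channels.{Channels, WritableByteChannel}

object TransferFullyDemo extends App {
  // Writes `payload` to `channel` in 32 KB chunks and returns the number of
  // bytes written. The assertion at the end is the invariant the bug violates:
  // the reported length must match what was actually sent, otherwise the peer
  // stops reading early and sees a truncated message (or garbage, if the
  // channel is later reused for other messages).
  def transferFully(channel: WritableByteChannel, payload: Array[Byte]): Long = {
    val chunkSize = 1024 * 32
    var transferred = 0L
    var offset = 0
    while (offset < payload.length) {
      val len = math.min(chunkSize, payload.length - offset)
      val buf = ByteBuffer.wrap(payload, offset, len)
      while (buf.hasRemaining) {
        transferred += channel.write(buf)
      }
      offset += len
    }
    assert(transferred == payload.length, "partial message: count < transferred")
    transferred
  }

  // Exercise the helper with a payload larger than one 32 KB chunk.
  val out = new ByteArrayOutputStream()
  val sent = transferFully(Channels.newChannel(out), Array.fill[Byte](100000)(1))
  assert(sent == out.size())
}
{code}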



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27275) Potential corruption in EncryptedMessage.transferTo

2019-03-25 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-27275:


 Summary: Potential corruption in EncryptedMessage.transferTo
 Key: SPARK-27275
 URL: https://issues.apache.org/jira/browse/SPARK-27275
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


`EncryptedMessage.transferTo` has a potential corruption issue. When the 
underlying buffer has more than `1024 * 32` bytes (this should be rare, but it 
could happen in error messages that are sent over the wire), it may send only a 
partial message because `EncryptedMessage.count` becomes less than 
`transferred`. This causes the client to hang forever (or time out) while it 
waits for the expected number of bytes, or to see weird errors (such as 
corruption or silent correctness issues) if the channel is reused by other 
messages.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27210) Cleanup incomplete output files in ManifestFileCommitProtocol if task is aborted

2019-03-22 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-27210.
--
   Resolution: Fixed
 Assignee: Jungtaek Lim
Fix Version/s: 3.0.0

> Cleanup incomplete output files in ManifestFileCommitProtocol if task is 
> aborted
> 
>
> Key: SPARK-27210
> URL: https://issues.apache.org/jira/browse/SPARK-27210
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Minor
> Fix For: 3.0.0
>
>
> Unlike HadoopMapReduceCommitProtocol, ManifestFileCommitProtocol doesn't 
> clean up incomplete output files in either case: when a task is aborted or 
> when the job is aborted.
> HadoopMapReduceCommitProtocol leverages a staging directory for intermediate 
> files, so once the job is aborted it can simply delete that directory to 
> clean up everything. It even puts extra effort into cleaning up intermediate 
> files on the task side when a task is aborted.
> ManifestFileCommitProtocol does no cleanup at all; it only maintains the 
> metadata listing which complete output files were written. It would be better 
> if ManifestFileCommitProtocol made a best effort to clean up: job-level 
> cleanup may not be possible since it doesn't use a staging directory, but it 
> can clearly still make a best effort at task-level cleanup.
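
A minimal sketch of task-level best-effort cleanup, assuming a simplified 
tracker (TaskFileTracker and its method names are illustrative, not the actual 
ManifestFileCommitProtocol API): remember every file the task creates so that 
abortTask can delete whatever was left incomplete.

{code}
import java.nio.file.{Files, Path}
import scala.collection.mutable

// Illustrative task-level commit tracker. newTaskFile records every path the
// task is about to write; commitTask reports the completed files; abortTask
// deletes whatever the task created so no incomplete output is left behind.
class TaskFileTracker {
  private val addedFiles = mutable.ArrayBuffer[Path]()

  def newTaskFile(dir: Path, name: String): Path = {
    val path = dir.resolve(name)
    addedFiles += path
    path
  }

  def commitTask(): Seq[Path] = addedFiles.toSeq

  def abortTask(): Unit = {
    // Best-effort: ignore files that were never actually materialized.
    addedFiles.foreach(p => Files.deleteIfExists(p))
    addedFiles.clear()
  }
}

object AbortTaskDemo extends App {
  val dir = Files.createTempDirectory("manifest-demo")
  val tracker = new TaskFileTracker
  val part = tracker.newTaskFile(dir, "part-00000.snappy.parquet")
  Files.write(part, Array[Byte](1, 2, 3))   // pretend the task wrote some data
  tracker.abortTask()                       // task failed: remove the partial file
  assert(!Files.exists(part))
}
{code}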



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27221) Improve the assert error message in TreeNode.parseToJson

2019-03-20 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-27221:
-
Summary: Improve the assert error message in TreeNode.parseToJson  (was: 
Improve the assert error message in TreeNode)

> Improve the assert error message in TreeNode.parseToJson
> 
>
> Key: SPARK-27221
> URL: https://issues.apache.org/jira/browse/SPARK-27221
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>
> TreeNode.parseToJson may throw an assertion error without any error message 
> when a TreeNode is not implemented properly, which makes it hard to find the 
> bad TreeNode implementation.
> It would be better to improve the error message to indicate the type of the 
> TreeNode.
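
A minimal sketch of the kind of improvement described, assuming a generic 
helper (requireSameArity and its parameters are illustrative, not the actual 
parseToJson code): include the node's class name in the assertion so a failure 
points directly at the offending TreeNode implementation.

{code}
import scala.util.Try

object TreeNodeAssertDemo extends App {
  // An assertion that names the offending class instead of failing with a
  // bare, message-less assertion error.
  def requireSameArity(node: AnyRef, fieldNames: Seq[String], fieldValues: Seq[Any]): Unit = {
    assert(
      fieldNames.length == fieldValues.length,
      s"${node.getClass.getSimpleName}: ${fieldNames.length} constructor fields " +
        s"(${fieldNames.mkString(", ")}) but ${fieldValues.length} values")
  }

  case class Demo(a: Int, b: String)
  val node = Demo(1, "x")

  requireSameArity(node, Seq("a", "b"), node.productIterator.toSeq)  // passes
  // A mismatched implementation now produces a descriptive failure message.
  val failure = Try(requireSameArity(node, Seq("a"), node.productIterator.toSeq))
  println(failure.failed.get.getMessage)
  // prints: assertion failed: Demo: 1 constructor fields (a) but 2 values
}
{code}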



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


