[jira] [Assigned] (SPARK-27416) UnsafeMapData & UnsafeArrayData Kryo serialization breaks when two machines have different Oops size
[ https://issues.apache.org/jira/browse/SPARK-27416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-27416:
-----------------------------------

    Assignee: peng bo

> UnsafeMapData & UnsafeArrayData Kryo serialization breaks when two machines
> have different Oops size
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-27416
>                 URL: https://issues.apache.org/jira/browse/SPARK-27416
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.1
>            Reporter: peng bo
>            Assignee: peng bo
>            Priority: Major
>
> This is a follow-up to
> https://issues.apache.org/jira/browse/SPARK-27406 and
> https://issues.apache.org/jira/browse/SPARK-10914.
> This issue is to fix the UnsafeMapData & UnsafeArrayData Kryo serialization
> issue when two machines have different Oops size.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27416) UnsafeMapData & UnsafeArrayData Kryo serialization breaks when two machines have different Oops size
[ https://issues.apache.org/jira/browse/SPARK-27416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-27416.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 3.0.0

Issue resolved by pull request 24357
[https://github.com/apache/spark/pull/24357]

> UnsafeMapData & UnsafeArrayData Kryo serialization breaks when two machines
> have different Oops size
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-27416
>                 URL: https://issues.apache.org/jira/browse/SPARK-27416
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.1
>            Reporter: peng bo
>            Assignee: peng bo
>            Priority: Major
>             Fix For: 3.0.0
>
> This is a follow-up to
> https://issues.apache.org/jira/browse/SPARK-27406 and
> https://issues.apache.org/jira/browse/SPARK-10914.
> This issue is to fix the UnsafeMapData & UnsafeArrayData Kryo serialization
> issue when two machines have different Oops size.
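[Editor's note] The portability hazard behind this ticket can be illustrated without Spark: a serializer that derives sizes or offsets from the running JVM's object-pointer (Oops) layout produces bytes that a JVM with a different Oops size cannot read, whereas writing an explicit length plus the raw bytes is layout-independent. A minimal sketch of that idea; the RawBuffer class is a hypothetical stand-in, not Spark's actual fix:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

// Hypothetical stand-in for a buffer-backed value like UnsafeArrayData.
// Portable serialization writes the logical byte length explicitly and then
// the raw bytes; nothing on the wire depends on the JVM's Oops size.
final case class RawBuffer(bytes: Array[Byte])

object RawBuffer {
  def write(out: DataOutputStream, b: RawBuffer): Unit = {
    out.writeInt(b.bytes.length) // explicit size: identical on every JVM
    out.write(b.bytes)
  }

  def read(in: DataInputStream): RawBuffer = {
    val len = in.readInt()
    val buf = new Array[Byte](len)
    in.readFully(buf)
    RawBuffer(buf)
  }
}

// Round trip: the serialized form can be read back on any JVM.
val outBytes = new ByteArrayOutputStream()
RawBuffer.write(new DataOutputStream(outBytes), RawBuffer(Array[Byte](1, 2, 3)))
val back = RawBuffer.read(new DataInputStream(new ByteArrayInputStream(outBytes.toByteArray)))
assert(back.bytes.sameElements(Array[Byte](1, 2, 3)))
```

A deserializer that instead computed offsets from the local Platform array base offset would break whenever writer and reader JVMs disagree on compressed Oops.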
[jira] [Created] (SPARK-27485) Certain query plans fail to run when autoBroadcastJoinThreshold is set to -1
Muthu Jayakumar created SPARK-27485:
-----------------------------------

             Summary: Certain query plans fail to run when autoBroadcastJoinThreshold is set to -1
                 Key: SPARK-27485
                 URL: https://issues.apache.org/jira/browse/SPARK-27485
             Project: Spark
          Issue Type: Bug
          Components: Optimizer, SQL
    Affects Versions: 2.4.0
            Reporter: Muthu Jayakumar

Certain queries fail with

{noformat}
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:349)
at scala.None$.get(Option.scala:347)
at org.apache.spark.sql.execution.exchange.EnsureRequirements.$anonfun$reorder$1(EnsureRequirements.scala:238)
at org.apache.spark.sql.execution.exchange.EnsureRequirements.$anonfun$reorder$1$adapted(EnsureRequirements.scala:233)
at scala.collection.immutable.List.foreach(List.scala:388)
at org.apache.spark.sql.execution.exchange.EnsureRequirements.reorder(EnsureRequirements.scala:233)
at org.apache.spark.sql.execution.exchange.EnsureRequirements.reorderJoinKeys(EnsureRequirements.scala:262)
at org.apache.spark.sql.execution.exchange.EnsureRequirements.org$apache$spark$sql$execution$exchange$EnsureRequirements$$reorderJoinPredicates(EnsureRequirements.scala:289)
at org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:304)
at org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:296)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$4(TreeNode.scala:282)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:282)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275)
at org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:296)
at org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:38)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$prepareForExecution$1(QueryExecution.scala:87)
at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:122)
at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:118)
at scala.collection.immutable.List.foldLeft(List.scala:85)
{noformat}

I don't have an exact query to reproduce this, but I can try to construct one if this problem hasn't been reported before.
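[Editor's note] As the summary states, the failure was observed only with {{spark.sql.autoBroadcastJoinThreshold}} set to -1, which disables broadcast joins and forces EnsureRequirements down the sort-merge-join key-reordering path where the {{None.get}} is thrown. A hedged sketch of how one might toggle the setting while hunting for a reproducer; the surrounding query is hypothetical and {{spark}} is assumed to be an active SparkSession:

```scala
// Sketch, not a confirmed reproducer. Disabling broadcast joins is what the
// reporter says triggers the EnsureRequirements.reorder failure path.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
// ... run the suspect multi-way equi-join here ...

// Restoring the default (10 MB) re-enables broadcast joins, which should
// avoid the sort-merge-join key-reordering code entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760")
```

This only narrows the trigger; it does not fix the underlying assumption in EnsureRequirements.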
[jira] [Assigned] (SPARK-27483) move the data source v2 fallback to v1 logic to an analyzer rule
[ https://issues.apache.org/jira/browse/SPARK-27483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-27483:
-----------------------------------

    Assignee: (was: Wenchen Fan)

> move the data source v2 fallback to v1 logic to an analyzer rule
> ----------------------------------------------------------------
>
>                 Key: SPARK-27483
>                 URL: https://issues.apache.org/jira/browse/SPARK-27483
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Wenchen Fan
>            Priority: Major
[jira] [Created] (SPARK-27484) create the streaming writing logical plan node before query is analyzed
Wenchen Fan created SPARK-27484:
-----------------------------------

             Summary: create the streaming writing logical plan node before query is analyzed
                 Key: SPARK-27484
                 URL: https://issues.apache.org/jira/browse/SPARK-27484
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Wenchen Fan
[jira] [Created] (SPARK-27483) move the data source v2 fallback to v1 logic to an analyzer rule
Wenchen Fan created SPARK-27483:
-----------------------------------

             Summary: move the data source v2 fallback to v1 logic to an analyzer rule
                 Key: SPARK-27483
                 URL: https://issues.apache.org/jira/browse/SPARK-27483
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Wenchen Fan
            Assignee: Wenchen Fan
[jira] [Updated] (SPARK-27482) Show BroadcastHashJoinExec numOutputRows statistics info on SparkSQL UI page
[ https://issues.apache.org/jira/browse/SPARK-27482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

peng bo updated SPARK-27482:
----------------------------
    Description: 
Currently, the {{SparkSQL}} UI page shows only actual {{metric}} info in each {{SparkPlan}} node. However, {{statistics}} info may help us understand how the plan was designed and why it runs slowly.

This issue is to show {{BroadcastHashJoinExec}} {{numOutputRows statistic}} info on the {{SparkSQL}} UI page first, when it's available.

  was:
Currently, the {{SparkSQL}} UI page shows only actual {{metric}} info in each {{SparkPlan}} node. However, {{statistics}} info may help us understand how the plan was designed and why it runs slowly.

This issue is to show {{BroadcastHashJoinExec}} {{numOutputRows statistic}} info on the {{SparkSQL}} UI page first.

> Show BroadcastHashJoinExec numOutputRows statistics info on SparkSQL UI page
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-27482
>                 URL: https://issues.apache.org/jira/browse/SPARK-27482
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL, Web UI
>    Affects Versions: 3.0.0
>            Reporter: peng bo
>            Priority: Major
>         Attachments: SPARK-27482-1.png
>
> Currently, the {{SparkSQL}} UI page shows only actual {{metric}} info in each
> {{SparkPlan}} node. However, {{statistics}} info may help us understand
> how the plan was designed and why it runs slowly.
> This issue is to show {{BroadcastHashJoinExec}} {{numOutputRows statistic}}
> info on the {{SparkSQL}} UI page first, when it's available.
[jira] [Updated] (SPARK-27482) Show BroadcastHashJoinExec numOutputRows statistics info on SparkSQL UI page
[ https://issues.apache.org/jira/browse/SPARK-27482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

peng bo updated SPARK-27482:
----------------------------
    Summary: Show BroadcastHashJoinExec numOutputRows statistics info on SparkSQL UI page  (was: Show BroadcastHashJoinExec numOutputRows statistic info on SparkSQL UI page)

> Show BroadcastHashJoinExec numOutputRows statistics info on SparkSQL UI page
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-27482
>                 URL: https://issues.apache.org/jira/browse/SPARK-27482
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL, Web UI
>    Affects Versions: 3.0.0
>            Reporter: peng bo
>            Priority: Major
>         Attachments: SPARK-27482-1.png
>
> Currently, the {{SparkSQL}} UI page shows only actual {{metric}} info in each
> {{SparkPlan}} node. However, {{statistics}} info may help us understand
> how the plan was designed and why it runs slowly.
> This issue is to show {{BroadcastHashJoinExec}} {{numOutputRows statistic}}
> info on the {{SparkSQL}} UI page first.
[jira] [Updated] (SPARK-27482) Show BroadcastHashJoinExec numOutputRows statistic info on SparkSQL UI page
[ https://issues.apache.org/jira/browse/SPARK-27482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

peng bo updated SPARK-27482:
----------------------------
    Attachment: SPARK-27482-1.png

> Show BroadcastHashJoinExec numOutputRows statistic info on SparkSQL UI page
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-27482
>                 URL: https://issues.apache.org/jira/browse/SPARK-27482
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL, Web UI
>    Affects Versions: 3.0.0
>            Reporter: peng bo
>            Priority: Major
>         Attachments: SPARK-27482-1.png
>
> Currently, the {{SparkSQL}} UI page shows only actual {{metric}} info in each
> {{SparkPlan}} node. However, {{statistics}} info may help us understand
> how the plan was designed and why it runs slowly.
> This issue is to show {{BroadcastHashJoinExec}} {{numOutputRows statistic}}
> info on the {{SparkSQL}} UI page first.
[jira] [Created] (SPARK-27482) Show BroadcastHashJoinExec numOutputRows statistic info on SparkSQL UI page
peng bo created SPARK-27482:
-------------------------------

             Summary: Show BroadcastHashJoinExec numOutputRows statistic info on SparkSQL UI page
                 Key: SPARK-27482
                 URL: https://issues.apache.org/jira/browse/SPARK-27482
             Project: Spark
          Issue Type: Improvement
          Components: SQL, Web UI
    Affects Versions: 3.0.0
            Reporter: peng bo

Currently, the {{SparkSQL}} UI page shows only actual {{metric}} info in each {{SparkPlan}} node. However, {{statistics}} info may help us understand how the plan was designed and why it runs slowly.

This issue is to show {{BroadcastHashJoinExec}} {{numOutputRows statistic}} info on the {{SparkSQL}} UI page first.
[jira] [Resolved] (SPARK-27479) Hide API docs for "org.apache.spark.util.kvstore"
[ https://issues.apache.org/jira/browse/SPARK-27479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-27479.
-----------------------------
       Resolution: Fixed
    Fix Version/s: 2.4.2

> Hide API docs for "org.apache.spark.util.kvstore"
> -------------------------------------------------
>
>                 Key: SPARK-27479
>                 URL: https://issues.apache.org/jira/browse/SPARK-27479
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build
>    Affects Versions: 2.3.3, 2.4.1, 3.0.0
>            Reporter: Xiao Li
>            Assignee: Xiao Li
>            Priority: Major
>             Fix For: 2.4.2
>
> The API docs should not include the "org.apache.spark.util.kvstore" package,
> because it contains internal, private APIs.
[jira] [Commented] (SPARK-27468) "Storage Level" in "RDD Storage Page" is not correct
[ https://issues.apache.org/jira/browse/SPARK-27468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819679#comment-16819679 ]

shahid commented on SPARK-27468:
--------------------------------

Hi [~srfnmnk], I tried to reproduce this on the master branch. The steps followed are shown below:

1) bin/spark-shell --master local[2]

{code:java}
scala> import org.apache.spark.storage.StorageLevel
scala> val rdd = sc.parallelize(1 to 10, 1).persist(StorageLevel.MEMORY_ONLY_2)
scala> rdd.count
{code}

The Storage tab in the UI is shown below:

!Screenshot from 2019-04-17 10-42-55.png!

So it seems I am not able to reproduce the issue. Could you please tell me whether the test steps are correct, or whether I need to enable any configuration? Thank you.

> "Storage Level" in "RDD Storage Page" is not correct
> ----------------------------------------------------
>
>                 Key: SPARK-27468
>                 URL: https://issues.apache.org/jira/browse/SPARK-27468
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.1
>            Reporter: Shixiong Zhu
>            Priority: Major
>         Attachments: Screenshot from 2019-04-17 10-42-55.png
>
> I ran the following unit test and checked the UI.
> {code}
> val conf = new SparkConf()
>   .setAppName("test")
>   .setMaster("local-cluster[2,1,1024]")
>   .set("spark.ui.enabled", "true")
> sc = new SparkContext(conf)
> val rdd = sc.makeRDD(1 to 10, 1).persist(StorageLevel.MEMORY_ONLY_2)
> rdd.count()
> Thread.sleep(360)
> {code}
> The storage level is "Memory Deserialized 1x Replicated" in the RDD storage
> page.
> I tried to debug and found this is because Spark emitted the following two
> events:
> {code}
> event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1,
> 10.8.132.160, 65473, None),rdd_0_0,StorageLevel(memory, deserialized, 2
> replicas),56,0))
> event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0,
> 10.8.132.160, 65474, None),rdd_0_0,StorageLevel(memory, deserialized, 1
> replicas),56,0))
> {code}
> The storage level in the second event will overwrite the first one. "1
> replicas" comes from this line:
> https://github.com/apache/spark/blob/3ab96d7acf870e53c9016b0b63d0b328eec23bed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1457
> Maybe AppStatusListener should calculate the replicas from events?
> Another fact we may need to think about: when replicas is 2, will the two
> Spark events arrive in the same order? Currently, two RPCs from different
> executors can arrive in any order.
> Credit goes to [~srfnmnk] who reported this issue originally.
[jira] [Updated] (SPARK-27468) "Storage Level" in "RDD Storage Page" is not correct
[ https://issues.apache.org/jira/browse/SPARK-27468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

shahid updated SPARK-27468:
---------------------------
    Attachment: Screenshot from 2019-04-17 10-42-55.png

> "Storage Level" in "RDD Storage Page" is not correct
> ----------------------------------------------------
>
>                 Key: SPARK-27468
>                 URL: https://issues.apache.org/jira/browse/SPARK-27468
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.1
>            Reporter: Shixiong Zhu
>            Priority: Major
>         Attachments: Screenshot from 2019-04-17 10-42-55.png
>
> I ran the following unit test and checked the UI.
> {code}
> val conf = new SparkConf()
>   .setAppName("test")
>   .setMaster("local-cluster[2,1,1024]")
>   .set("spark.ui.enabled", "true")
> sc = new SparkContext(conf)
> val rdd = sc.makeRDD(1 to 10, 1).persist(StorageLevel.MEMORY_ONLY_2)
> rdd.count()
> Thread.sleep(360)
> {code}
> The storage level is "Memory Deserialized 1x Replicated" in the RDD storage
> page.
> I tried to debug and found this is because Spark emitted the following two
> events:
> {code}
> event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1,
> 10.8.132.160, 65473, None),rdd_0_0,StorageLevel(memory, deserialized, 2
> replicas),56,0))
> event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0,
> 10.8.132.160, 65474, None),rdd_0_0,StorageLevel(memory, deserialized, 1
> replicas),56,0))
> {code}
> The storage level in the second event will overwrite the first one. "1
> replicas" comes from this line:
> https://github.com/apache/spark/blob/3ab96d7acf870e53c9016b0b63d0b328eec23bed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1457
> Maybe AppStatusListener should calculate the replicas from events?
> Another fact we may need to think about: when replicas is 2, will the two
> Spark events arrive in the same order? Currently, two RPCs from different
> executors can arrive in any order.
> Credit goes to [~srfnmnk] who reported this issue originally.
[jira] [Issue Comment Deleted] (SPARK-24630) SPIP: Support SQLStreaming in Spark
[ https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Zhang updated SPARK-24630:
--------------------------------
    Comment: was deleted

(was: thanks [~Jackey Lee] So I'm wondering what's blocking the PR for this issue from being merged; is it related to DataSourceV2?)

> SPIP: Support SQLStreaming in Spark
> -----------------------------------
>
>                 Key: SPARK-24630
>                 URL: https://issues.apache.org/jira/browse/SPARK-24630
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.2.0, 2.2.1
>            Reporter: Jackey Lee
>            Priority: Minor
>              Labels: SQLStreaming
>         Attachments: SQLStreaming SPIP V2.pdf
>
> At present, KafkaSQL, Flink SQL (which is actually based on Calcite),
> SQLStream, and StormSQL all provide a stream-type SQL interface, with which
> users with little knowledge about streaming can easily develop a stream
> processing model. In Spark, we can also support a SQL API based on
> StructStreaming.
> To support SQL Streaming, there are two key points:
> 1. Analysis should be able to parse streaming-type SQL.
> 2. The analyzer should be able to map metadata information to the
> corresponding Relation.
[jira] [Created] (SPARK-27481) Upgrade commons-logging to 1.1.3
Yuming Wang created SPARK-27481:
-----------------------------------

             Summary: Upgrade commons-logging to 1.1.3
                 Key: SPARK-27481
                 URL: https://issues.apache.org/jira/browse/SPARK-27481
             Project: Spark
          Issue Type: Sub-task
          Components: Build
    Affects Versions: 3.0.0
            Reporter: Yuming Wang

We may hit the {{LogConfigurationException}} when upgrading the built-in Hive to 2.3.4:

{noformat}
bin/spark-sql --conf spark.sql.hive.metastore.version=1.2.2 --conf spark.sql.hive.metastore.jars=file:///apache/hive-1.2.2-bin/lib/*
...
19/04/16 19:04:06 ERROR main ShimLoader: Error loading shims
java.lang.ExceptionInInitializerError
at org.apache.hadoop.hive.shims.ShimLoader.getMajorVersion(ShimLoader.java:143)
at org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:122)
at org.apache.hadoop.hive.shims.ShimLoader.getHadoopShims(ShimLoader.java:88)
at org.apache.hadoop.hive.conf.HiveConf$ConfVars.(HiveConf.java:371)
at org.apache.hadoop.hive.conf.HiveConf.(HiveConf.java:108)
at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:154)
at org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:119)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:296)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:410)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:305)
at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:68)
at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:67)
at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$databaseExists$1(HiveExternalCatalog.scala:217)
at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:217)
at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:139)
at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:129)
at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:52)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:311)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:164)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:860)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:178)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:201)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:88)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:939)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:948)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.commons.logging.LogConfigurationException: org.apache.commons.logging.LogConfigurationException: org.apache.commons.logging.LogConfigurationException: Invalid class loader hierarchy. You have more than one version of 'org.apache.commons.logging.Log' visible, which is not allowed. (Caused by org.apache.commons.logging.LogConfigurationException: Invalid class loader hierarchy. You have more than one version of 'org.apache.commons.logging.Log' visible, which is not allowed.) (Caused by org.apache.commons.logging.LogConfigurationException: org.apache.commons.logging.LogConfigurationException: Invalid class loader hierarchy. You have more than one version of 'org.apache.commons.logging.Log' visible, which is not allowed. (Caused by org.apache.commons.logging.LogConfigurationException: Invalid class loader hierarchy. You have more than one version of 'org.apache.commons.logging.Log' visible, which is not allowed.))
at org.apache.commons.logging.impl.LogFactoryImpl.newInstance(LogFactoryImpl.java:543)
at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:235)
at
{noformat}
[jira] [Comment Edited] (SPARK-16859) History Server storage information is missing
[ https://issues.apache.org/jira/browse/SPARK-16859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819662#comment-16819662 ]

shahid edited comment on SPARK-16859 at 4/17/19 2:07 AM:
---------------------------------------------------------

[~Hauer] [~toopt4] Can you try enabling "spark.eventLog.logBlockUpdates.enabled=true" and see if the History Server storage tab is still empty?


was (Author: shahid):
[~Hauer][~toopt4] Can you try enabling "spark.eventLog.logBlockUpdates.enabled=true" and see if the History Server storage tab is still empty?

> History Server storage information is missing
> ---------------------------------------------
>
>                 Key: SPARK-16859
>                 URL: https://issues.apache.org/jira/browse/SPARK-16859
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.2, 2.0.0
>            Reporter: Andrei Ivanov
>            Priority: Major
>              Labels: historyserver, newbie
>
> It looks like the job history storage tab in the history server has been
> broken for completed jobs since *1.6.2*.
> More specifically, it's been broken since
> [SPARK-13845|https://issues.apache.org/jira/browse/SPARK-13845].
> I've fixed it for my installation by effectively reverting the above patch
> ([see|https://github.com/EinsamHauer/spark/commit/3af62ea09af8bb350c8c8a9117149c09b8feba08]).
> IMHO, the most straightforward fix would be to implement
> _SparkListenerBlockUpdated_ serialization to JSON in _JsonProtocol_, making
> sure it works from _ReplayListenerBus_.
> The downside is that it will still work incorrectly with pre-patch job
> histories. But then, it hasn't worked since *1.6.2* anyhow.
> PS: I'd really love to have this fixed eventually, but I'm pretty new to
> Apache Spark and missing hands-on Scala experience. So I'd prefer that it be
> fixed by someone experienced with roadmap vision. If nobody volunteers, I'll
> try to patch it myself.
[jira] [Commented] (SPARK-16859) History Server storage information is missing
[ https://issues.apache.org/jira/browse/SPARK-16859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819662#comment-16819662 ]

shahid commented on SPARK-16859:
--------------------------------

[~Hauer][~toopt4] Can you try enabling "spark.eventLog.logBlockUpdates.enabled=true" and see if the History Server storage tab is still empty?

> History Server storage information is missing
> ---------------------------------------------
>
>                 Key: SPARK-16859
>                 URL: https://issues.apache.org/jira/browse/SPARK-16859
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.2, 2.0.0
>            Reporter: Andrei Ivanov
>            Priority: Major
>              Labels: historyserver, newbie
>
> It looks like the job history storage tab in the history server has been
> broken for completed jobs since *1.6.2*.
> More specifically, it's been broken since
> [SPARK-13845|https://issues.apache.org/jira/browse/SPARK-13845].
> I've fixed it for my installation by effectively reverting the above patch
> ([see|https://github.com/EinsamHauer/spark/commit/3af62ea09af8bb350c8c8a9117149c09b8feba08]).
> IMHO, the most straightforward fix would be to implement
> _SparkListenerBlockUpdated_ serialization to JSON in _JsonProtocol_, making
> sure it works from _ReplayListenerBus_.
> The downside is that it will still work incorrectly with pre-patch job
> histories. But then, it hasn't worked since *1.6.2* anyhow.
> PS: I'd really love to have this fixed eventually, but I'm pretty new to
> Apache Spark and missing hands-on Scala experience. So I'd prefer that it be
> fixed by someone experienced with roadmap vision. If nobody volunteers, I'll
> try to patch it myself.
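[Editor's note] For readers following the suggestion above: block-update events are not written to the event log by default (they can be voluminous), so the history server can only show storage information for an application if the setting was enabled while that application ran. In spark-defaults.conf this would look like:

```
spark.eventLog.enabled                   true
spark.eventLog.logBlockUpdates.enabled   true
```

The same pair can be passed per-application via --conf on spark-submit.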
[jira] [Created] (SPARK-27480) Improve explain output of describe query command to show the actual input query as opposed to a truncated logical plan.
Dilip Biswal created SPARK-27480:
------------------------------------

             Summary: Improve explain output of describe query command to show the actual input query as opposed to a truncated logical plan.
                 Key: SPARK-27480
                 URL: https://issues.apache.org/jira/browse/SPARK-27480
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.1
            Reporter: Dilip Biswal

Currently, running explain on a describe query gives somewhat confusing output: instead of showing the actual query input by the user, it shows the truncated logical plan as the input. We should improve it to show the query text as entered by the user.

Here are sample outputs of the explain command.

{code:java}
EXPLAIN DESCRIBE WITH s AS (SELECT 'hello' as col1) SELECT * FROM s;
== Physical Plan ==
Execute DescribeQueryCommand
+- DescribeQueryCommand CTE [s]
{code}

{code:java}
EXPLAIN EXTENDED DESCRIBE SELECT * from s1 where c1 > 0;
== Physical Plan ==
Execute DescribeQueryCommand
+- DescribeQueryCommand 'Project [*]
{code}
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819592#comment-16819592 ] Robert Joseph Evans commented on SPARK-27396: - This SPIP is to put a framework in place to be able to support columnar processing, but not to actually implement that processing. Implementations would be provided by extensions. Those extensions would have a few options on how to allocate memory for results and/or intermediate results. The contract really is just around the ColumnarBatch and ColumnVector classes, so an extension could use built in implementations, similar to the on heap, off heap, and arrow column vector implementations in the current vesion of Spark. We would provide a config for what the default should be and an API to be able to allocate one of these vectors based off of that config, but in some cases the extension may want to supply their own implementation, similar to how the ORC FileFormat currently does. > SPIP: Public APIs for extended Columnar Processing Support > -- > > Key: SPARK-27396 > URL: https://issues.apache.org/jira/browse/SPARK-27396 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Robert Joseph Evans >Priority: Major > > *Q1.* What are you trying to do? Articulate your objectives using absolutely > no jargon. > > The Dataset/DataFrame API in Spark currently only exposes to users one row at > a time when processing data. The goals of this are to > > # Expose to end users a new option of processing the data in a columnar > format, multiple rows at a time, with the data organized into contiguous > arrays in memory. > # Make any transitions between the columnar memory layout and a row based > layout transparent to the end user. > # Allow for simple data exchange with other systems, DL/ML libraries, > pandas, etc. by having clean APIs to transform the columnar data into an > Apache Arrow compatible layout. 
> # Provide a plugin mechanism for columnar processing support so an advanced > user could avoid data transition between columnar and row based processing > even through shuffles. This means we should at least support pluggable APIs > so an advanced end user can implement the columnar partitioning themselves, > and provide the glue necessary to shuffle the data still in a columnar format. > # Expose new APIs that allow advanced users or frameworks to implement > columnar processing either as UDFs, or by adjusting the physical plan to do > columnar processing. If the latter is too controversial we can move it to > another SPIP, but we plan to implement some accelerated computing in parallel > with this feature to be sure the APIs work, and without this feature it makes > that impossible. > > Not Requirements, but things that would be nice to have. > # Provide default implementations for partitioning columnar data, so users > don’t have to. > # Transition the existing in memory columnar layouts to be compatible with > Apache Arrow. This would make the transformations to Apache Arrow format a > no-op. The existing formats are already very close to those layouts in many > cases. This would not be using the Apache Arrow java library, but instead > being compatible with the memory > [layout|https://arrow.apache.org/docs/format/Layout.html] and possibly only a > subset of that layout. > # Provide a clean transition from the existing code to the new one. The > existing APIs which are public but evolving are not that far off from what is > being proposed. We should be able to create a new parallel API that can wrap > the existing one. This means any file format that is trying to support > columnar can still do so until we make a conscious decision to deprecate and > then turn off the old APIs. > > *Q2.* What problem is this proposal NOT designed to solve? 
> This is not trying to implement any of the processing itself in a columnar > way, with the exception of examples for documentation, and possibly default > implementations for partitioning of columnar shuffle. > > *Q3.* How is it done today, and what are the limits of current practice? > The current columnar support is limited to 3 areas. > # Input formats can optionally return a ColumnarBatch instead of rows. The > code generation phase knows how to take that columnar data and iterate > through it as rows for stages that want rows, which currently is almost > everything. The limitations here are mostly implementation specific. The > current standard is to abuse Scala’s type erasure to return ColumnarBatches > as the elements of an RDD[InternalRow]. The code generation can handle this > because it is generating Java code, so it bypasses Scala’s type checking
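The columnar-to-row transition described in Q1 and Q3 above can be sketched in plain Python (a hypothetical illustration of the idea, not Spark's actual ColumnarBatch/ColumnVector API): each column lives as a contiguous array, and a row iterator is derived on top of the columnar storage, which is roughly what the code generation phase does for stages that want rows.

```python
# Hypothetical sketch of a columnar batch that can also be read row by row,
# mirroring how Spark's code generation iterates columnar data as rows.
class ColumnarBatch:
    def __init__(self, columns):
        # columns: dict mapping column name -> contiguous list of values
        self.columns = columns

    def num_rows(self):
        # All columns hold the same number of values.
        return len(next(iter(self.columns.values())))

    def row_iterator(self):
        # Derive a row view without copying the columnar storage.
        names = list(self.columns)
        for i in range(self.num_rows()):
            yield {n: self.columns[n][i] for n in names}

batch = ColumnarBatch({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})
rows = list(batch.row_iterator())
```

The point of the sketch is that the row view is cheap to construct, so transitions between the two layouts can stay transparent to the end user.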
[jira] [Resolved] (SPARK-25348) Data source for binary files
[ https://issues.apache.org/jira/browse/SPARK-25348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-25348. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24354 [https://github.com/apache/spark/pull/24354] > Data source for binary files > > > Key: SPARK-25348 > URL: https://issues.apache.org/jira/browse/SPARK-25348 > Project: Spark > Issue Type: Story > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Weichen Xu >Priority: Major > Fix For: 3.0.0 > > > It would be useful to have a data source implementation for binary files, > which can be used to build features to load images, audio, and videos. > Microsoft has an implementation at > [https://github.com/Azure/mmlspark/tree/master/src/io/binary.] It would be > great if we can merge it into Spark main repo. > cc: [~mhamilton] and [~imatiach] > Proposed API: > Format name: "binaryFile" > Schema: > * content: BinaryType > * status (following Hadoop FIleStatus): > ** path: StringType > ** modificationTime: Timestamp > ** length: LongType (size limit 2GB) > Options: > * pathGlobFilter: only include files with path matching the glob pattern > Input partition size can be controlled by common SQL confs: maxPartitionBytes > and openCostInBytes -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27453) DataFrameWriter.partitionBy is Silently Dropped by DSV1
[ https://issues.apache.org/jira/browse/SPARK-27453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-27453. --- Resolution: Fixed Fix Version/s: 2.4.2 3.0.0 Issue resolved by pull request 24365 [https://github.com/apache/spark/pull/24365] > DataFrameWriter.partitionBy is Silently Dropped by DSV1 > --- > > Key: SPARK-27453 > URL: https://issues.apache.org/jira/browse/SPARK-27453 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.4.1 >Reporter: Michael Armbrust >Assignee: Liwen Sun >Priority: Critical > Fix For: 3.0.0, 2.4.2 > > > This is a long standing quirk of the interaction between {{DataFrameWriter}} > and {{CreatableRelationProvider}} (and the other forms of the DSV1 API). > Users can specify columns in {{partitionBy}} and our internal data sources > will use this information. Unfortunately, for external systems, this data is > silently dropped with no feedback given to the user. > In the long run, I think that DataSourceV2 is a better answer. However, I > don't think we should wait for that API to stabilize before offering some > kind of solution to developers of external data sources. I also do not think > we should break binary compatibility of this API, but I do think that small > surgical fix could alleviate the issue. > I would propose that we could propagate partitioning information (when > present) along with the other configuration options passed to the data source > in the {{String, String}} map. > I think its very unlikely that there are both data sources that validate > extra options and users who are using (no-op) partitioning with them, but out > of an abundance of caution we should protect the behavior change behind a > {{legacy}} flag that can be turned off. 
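The surgical fix proposed above — propagating partitionBy columns through the existing {{String, String}} options map — could look like the following sketch. The key name and JSON encoding here are assumptions for illustration; the actual patch may use different names:

```python
import json

# Hypothetical option key used to carry partitioning information through
# the String -> String options map handed to a DSV1 data source.
PARTITIONING_COLUMNS_KEY = "__partition_columns"

def with_partitioning(options, partition_columns):
    """Return a new options map that also carries the partitionBy columns."""
    out = dict(options)
    if partition_columns:
        out[PARTITIONING_COLUMNS_KEY] = json.dumps(partition_columns)
    return out

def decode_partitioning(options):
    """A DSV1 source can recover the columns instead of silently dropping them."""
    raw = options.get(PARTITIONING_COLUMNS_KEY)
    return json.loads(raw) if raw is not None else []

opts = with_partitioning({"path": "/tmp/out"}, ["year", "month"])
```

Sources that validate unknown options would need the legacy flag mentioned above to opt out of receiving the extra key.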
[jira] [Resolved] (SPARK-27452) Update zstd-jni to 1.3.8-9
[ https://issues.apache.org/jira/browse/SPARK-27452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27452. --- Resolution: Fixed Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/24364 > Update zstd-jni to 1.3.8-9 > -- > > Key: SPARK-27452 > URL: https://issues.apache.org/jira/browse/SPARK-27452 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.0.0 > > > This issue updates `zstd-jni` from 1.3.2-2 to 1.3.8-7 to be aligned with the > latest Zstd 1.3.8 seamlessly. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27467) Upgrade Maven to 3.6.1
[ https://issues.apache.org/jira/browse/SPARK-27467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27467. --- Resolution: Fixed Assignee: Dongjoon Hyun Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/24377 > Upgrade Maven to 3.6.1 > -- > > Key: SPARK-27467 > URL: https://issues.apache.org/jira/browse/SPARK-27467 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > This issue aim to upgrade Maven to 3.6.1 to bring JDK9+ patches like > MNG-6506. For the full release note, please see the following. > https://maven.apache.org/docs/3.6.1/release-notes.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27476) Refactoring SchemaPruning rule to remove duplicate code
[ https://issues.apache.org/jira/browse/SPARK-27476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27476. --- Resolution: Fixed Assignee: Liang-Chi Hsieh Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/24383 > Refactoring SchemaPruning rule to remove duplicate code > --- > > Key: SPARK-27476 > URL: https://issues.apache.org/jira/browse/SPARK-27476 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Minor > Fix For: 3.0.0 > > > In {{SchemaPruning}} rule, there is duplicate code for data source v1 and v2. > Their logic is the same and we can refactor the rule to remove duplicate code. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27479) Hide API docs for "org.apache.spark.util.kvstore"
Xiao Li created SPARK-27479: --- Summary: Hide API docs for "org.apache.spark.util.kvstore" Key: SPARK-27479 URL: https://issues.apache.org/jira/browse/SPARK-27479 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 2.4.1, 2.3.3, 3.0.0 Reporter: Xiao Li Assignee: Xiao Li The API docs should not include the "org.apache.spark.util.kvstore" package because it contains internal, private APIs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27409) Micro-batch support for Kafka Source in Spark 2.3
[ https://issues.apache.org/jira/browse/SPARK-27409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819473#comment-16819473 ] Prabhjot Singh Bharaj commented on SPARK-27409: --- [~gsomogyi] I haven't tried it on master. I'm facing the problem with Spark 2.3.2 Here is a complete log - {code:java} ➜ ~/spark ((HEAD detached at v2.3.2)) ✗ ./bin/pyspark --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.2 Python 2.7.10 (default, Feb 22 2019, 21:17:52) [GCC 4.2.1 Compatible Apple LLVM 10.0.1 (clang-1001.0.37.14)] on darwin Type "help", "copyright", "credits" or "license" for more information. Ivy Default Cache set to: //.ivy2/cache The jars for the packages stored in: //.ivy2/jars :: loading settings :: url = jar:file://spark/assembly/target/scala-2.11/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml org.apache.spark#spark-sql-kafka-0-10_2.11 added as a dependency :: resolving dependencies :: org.apache.spark#spark-submit-parent-b75b99d4-ae39-49b0-b366-8b718542b4f8;1.0 confs: [default] found org.apache.spark#spark-sql-kafka-0-10_2.11;2.3.2 in central found org.apache.kafka#kafka-clients;0.10.0.1 in local-m2-cache found net.jpountz.lz4#lz4;1.3.0 in local-m2-cache found org.xerial.snappy#snappy-java;1.1.2.6 in local-m2-cache found org.slf4j#slf4j-api;1.7.16 in spark-list found org.spark-project.spark#unused;1.0.0 in local-m2-cache :: resolution report :: resolve 1580ms :: artifacts dl 4ms :: modules in use: net.jpountz.lz4#lz4;1.3.0 from local-m2-cache in [default] org.apache.kafka#kafka-clients;0.10.0.1 from local-m2-cache in [default] org.apache.spark#spark-sql-kafka-0-10_2.11;2.3.2 from central in [default] org.slf4j#slf4j-api;1.7.16 from spark-list in [default] org.spark-project.spark#unused;1.0.0 from local-m2-cache in [default] org.xerial.snappy#snappy-java;1.1.2.6 from local-m2-cache in [default] - | | modules || artifacts | | conf | number| search|dwnlded|evicted|| number|dwnlded| - | default | 6 | 2 | 2 | 0 || 6 | 0 | 
- :: problems summary :: ERRORS unknown resolver null unknown resolver null :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS :: retrieving :: org.apache.spark#spark-submit-parent-b75b99d4-ae39-49b0-b366-8b718542b4f8 confs: [default] 0 artifacts copied, 6 already retrieved (0kB/6ms) 19/04/16 16:31:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 2.3.2 /_/ Using Python version 2.7.10 (default, Feb 22 2019 21:17:52) SparkSession available as 'spark'. >>> df = spark.readStream.format('kafka').option('kafka.bootstrap.servers', >>> 'localhost:9093').option("kafka.security.protocol", >>> "SSL",).option("kafka.ssl.keystore.password", >>> "").option("kafka.ssl.keystore.type", >>> "PKCS12").option("kafka.ssl.keystore.location", >>> 'non-existent').option('subscribe', 'no existing topic').load() Traceback (most recent call last): File "", line 1, in File "//spark/python/pyspark/sql/streaming.py", line 403, in load return self._df(self._jreader.load()) File "//spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__ File "//spark/python/pyspark/sql/utils.py", line 63, in deco return f(*a, **kw) File "//spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o37.load. 
: org.apache.kafka.common.KafkaException: Failed to construct kafka consumer at org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:702) at org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:557) at org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:540) at org.apache.spark.sql.kafka010.SubscribeStrategy.createConsumer(ConsumerStrategy.scala:62) at org.apache.spark.sql.kafka010.KafkaOffsetReader.createConsumer(KafkaOffsetReader.scala:314) at org.apache.spark.sql.kafka010.KafkaOffsetReader.(KafkaOffsetReader.scala:78) at org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:130) at org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:43) at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:185) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
[jira] [Created] (SPARK-27478) Make HasParallelism public?
Nicholas Resnick created SPARK-27478: Summary: Make HasParallelism public? Key: SPARK-27478 URL: https://issues.apache.org/jira/browse/SPARK-27478 Project: Spark Issue Type: Wish Components: ML Affects Versions: 2.4.1 Reporter: Nicholas Resnick I want to use HasParallelism in some custom Estimators I've written. Is it possible to make this trait (and the getExecutionContext method) public? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27468) "Storage Level" in "RDD Storage Page" is not correct
[ https://issues.apache.org/jira/browse/SPARK-27468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-27468: - Description: I ran the following unit test and checked the UI. {code} val conf = new SparkConf() .setAppName("test") .setMaster("local-cluster[2,1,1024]") .set("spark.ui.enabled", "true") sc = new SparkContext(conf) val rdd = sc.makeRDD(1 to 10, 1).persist(StorageLevel.MEMORY_ONLY_2) rdd.count() Thread.sleep(360) {code} The storage level is "Memory Deserialized 1x Replicated" in the RDD storage page. I tried to debug and found this is because Spark emitted the following two events: {code} event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, 10.8.132.160, 65473, None),rdd_0_0,StorageLevel(memory, deserialized, 2 replicas),56,0)) event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, 10.8.132.160, 65474, None),rdd_0_0,StorageLevel(memory, deserialized, 1 replicas),56,0)) {code} The storage level in the second event will overwrite the first one. "1 replicas" comes from this line: https://github.com/apache/spark/blob/3ab96d7acf870e53c9016b0b63d0b328eec23bed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1457 Maybe AppStatusListener should calculate the replicas from events? Another fact we may need to think about is when replicas is 2, will two Spark events arrive in the same order? Currently, two RPCs from different executors can arrive in any order. Credit goes to [~srfnmnk] who reported this issue originally. was: I ran the following unit test and checked the UI. {code} val conf = new SparkConf() .setAppName("test") .setMaster("local-cluster[2,1,1024]") .set("spark.ui.enabled", "true") sc = new SparkContext(conf) val rdd = sc.makeRDD(1 to 10, 1).persist(StorageLevel.MEMORY_ONLY_2) rdd.count() Thread.sleep(360) {code} The storage level is "Memory Deserialized 1x Replicated" in the RDD storage page. 
I tried to debug and found this is because Spark emitted the following two events: {code} event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, 10.8.132.160, 65473, None),rdd_0_0,StorageLevel(memory, deserialized, 2 replicas),56,0)) event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, 10.8.132.160, 65474, None),rdd_0_0,StorageLevel(memory, deserialized, 1 replicas),56,0)) {code} The storage level in the second event will overwrite the first one. "1 replicas" comes from this line: https://github.com/apache/spark/blob/3ab96d7acf870e53c9016b0b63d0b328eec23bed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1457 Maybe AppStatusListener should calculate the replicas from events? Another fact we may need to think about is when replicas is 2, will two Spark events arrive in the same order? Currently, two RPCs from different executors can arrive in any order. Credit goes to @dani > "Storage Level" in "RDD Storage Page" is not correct > > > Key: SPARK-27468 > URL: https://issues.apache.org/jira/browse/SPARK-27468 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.1 >Reporter: Shixiong Zhu >Priority: Major > > I ran the following unit test and checked the UI. > {code} > val conf = new SparkConf() > .setAppName("test") > .setMaster("local-cluster[2,1,1024]") > .set("spark.ui.enabled", "true") > sc = new SparkContext(conf) > val rdd = sc.makeRDD(1 to 10, 1).persist(StorageLevel.MEMORY_ONLY_2) > rdd.count() > Thread.sleep(360) > {code} > The storage level is "Memory Deserialized 1x Replicated" in the RDD storage > page. 
> I tried to debug and found this is because Spark emitted the following two > events: > {code} > event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, > 10.8.132.160, 65473, None),rdd_0_0,StorageLevel(memory, deserialized, 2 > replicas),56,0)) > event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, > 10.8.132.160, 65474, None),rdd_0_0,StorageLevel(memory, deserialized, 1 > replicas),56,0)) > {code} > The storage level in the second event will overwrite the first one. "1 > replicas" comes from this line: > https://github.com/apache/spark/blob/3ab96d7acf870e53c9016b0b63d0b328eec23bed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1457 > Maybe AppStatusListener should calculate the replicas from events? > Another fact we may need to think about is when replicas is 2, will two Spark > events arrive in the same order? Currently, two RPCs from different executors > can arrive in any order. > Credit goes to [~srfnmnk] who reported this issue originally. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SPARK-27468) "Storage Level" in "RDD Storage Page" is not correct
[ https://issues.apache.org/jira/browse/SPARK-27468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-27468: - Description: I ran the following unit test and checked the UI. {code} val conf = new SparkConf() .setAppName("test") .setMaster("local-cluster[2,1,1024]") .set("spark.ui.enabled", "true") sc = new SparkContext(conf) val rdd = sc.makeRDD(1 to 10, 1).persist(StorageLevel.MEMORY_ONLY_2) rdd.count() Thread.sleep(360) {code} The storage level is "Memory Deserialized 1x Replicated" in the RDD storage page. I tried to debug and found this is because Spark emitted the following two events: {code} event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, 10.8.132.160, 65473, None),rdd_0_0,StorageLevel(memory, deserialized, 2 replicas),56,0)) event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, 10.8.132.160, 65474, None),rdd_0_0,StorageLevel(memory, deserialized, 1 replicas),56,0)) {code} The storage level in the second event will overwrite the first one. "1 replicas" comes from this line: https://github.com/apache/spark/blob/3ab96d7acf870e53c9016b0b63d0b328eec23bed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1457 Maybe AppStatusListener should calculate the replicas from events? Another fact we may need to think about is when replicas is 2, will two Spark events arrive in the same order? Currently, two RPCs from different executors can arrive in any order. Credit goes to @dani was: I ran the following unit test and checked the UI. {code} val conf = new SparkConf() .setAppName("test") .setMaster("local-cluster[2,1,1024]") .set("spark.ui.enabled", "true") sc = new SparkContext(conf) val rdd = sc.makeRDD(1 to 10, 1).persist(StorageLevel.MEMORY_ONLY_2) rdd.count() Thread.sleep(360) {code} The storage level is "Memory Deserialized 1x Replicated" in the RDD storage page. 
I tried to debug and found this is because Spark emitted the following two events: {code} event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, 10.8.132.160, 65473, None),rdd_0_0,StorageLevel(memory, deserialized, 2 replicas),56,0)) event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, 10.8.132.160, 65474, None),rdd_0_0,StorageLevel(memory, deserialized, 1 replicas),56,0)) {code} The storage level in the second event will overwrite the first one. "1 replicas" comes from this line: https://github.com/apache/spark/blob/3ab96d7acf870e53c9016b0b63d0b328eec23bed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1457 Maybe AppStatusListener should calculate the replicas from events? Another fact we may need to think about is when replicas is 2, will two Spark events arrive in the same order? Currently, two RPCs from different executors can arrive in any order. > "Storage Level" in "RDD Storage Page" is not correct > > > Key: SPARK-27468 > URL: https://issues.apache.org/jira/browse/SPARK-27468 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.1 >Reporter: Shixiong Zhu >Priority: Major > > I ran the following unit test and checked the UI. > {code} > val conf = new SparkConf() > .setAppName("test") > .setMaster("local-cluster[2,1,1024]") > .set("spark.ui.enabled", "true") > sc = new SparkContext(conf) > val rdd = sc.makeRDD(1 to 10, 1).persist(StorageLevel.MEMORY_ONLY_2) > rdd.count() > Thread.sleep(360) > {code} > The storage level is "Memory Deserialized 1x Replicated" in the RDD storage > page. 
> I tried to debug and found this is because Spark emitted the following two > events: > {code} > event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, > 10.8.132.160, 65473, None),rdd_0_0,StorageLevel(memory, deserialized, 2 > replicas),56,0)) > event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, > 10.8.132.160, 65474, None),rdd_0_0,StorageLevel(memory, deserialized, 1 > replicas),56,0)) > {code} > The storage level in the second event will overwrite the first one. "1 > replicas" comes from this line: > https://github.com/apache/spark/blob/3ab96d7acf870e53c9016b0b63d0b328eec23bed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1457 > Maybe AppStatusListener should calculate the replicas from events? > Another fact we may need to think about is when replicas is 2, will two Spark > events arrive in the same order? Currently, two RPCs from different executors > can arrive in any order. > Credit goes to @dani -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
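The suggestion above — that AppStatusListener derive the replica count from events rather than trust the storage level of whichever SparkListenerBlockUpdated event arrives last — can be sketched as follows (event shapes simplified for illustration; not the actual listener code):

```python
# Sketch: derive a block's replication factor by counting the distinct
# block managers that currently hold it, so the result does not depend
# on the (unordered) arrival of RPCs from different executors.
class BlockReplicaTracker:
    def __init__(self):
        self.holders = {}  # block_id -> set of block manager ids

    def on_block_updated(self, block_manager_id, block_id, stored):
        s = self.holders.setdefault(block_id, set())
        if stored:
            s.add(block_manager_id)
        else:
            s.discard(block_manager_id)

    def replicas(self, block_id):
        return len(self.holders.get(block_id, ()))

t = BlockReplicaTracker()
# The two events from the description, from two different block managers:
t.on_block_updated("BlockManagerId(1)", "rdd_0_0", True)
t.on_block_updated("BlockManagerId(0)", "rdd_0_0", True)
```

Because the count is a set size, processing the two events in the opposite order yields the same "2x Replicated" answer, which is exactly the order-sensitivity the description worries about.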
[jira] [Comment Edited] (SPARK-11033) Launcher: add support for monitoring standalone/cluster apps
[ https://issues.apache.org/jira/browse/SPARK-11033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819270#comment-16819270 ] Mitesh edited comment on SPARK-11033 at 4/16/19 4:59 PM: - Thanks [~vanzin]! One quick question...is getting this working just a matter of passing the real host of the launcher process to the driver process, instead of using the hardcoded localhost? https://github.com/apache/spark/blob/branch-2.3/core/src/main/scala/org/apache/spark/launcher/LauncherBackend.scala#L48 Because this works in client mode, so I was very confused why it fails for cluster mode, until I saw the hardcoded localhost. Everything else should work fine? was (Author: masterddt): Thanks [~vanzin]! One quick question...is getting this working just a matter of passing the real host of the launcher process to the driver process, instead of using the hardcoded localhost? https://github.com/apache/spark/blob/2.3/core/src/main/scala/org/apache/spark/launcher/LauncherBackend.scala#L48 Because this works in client mode, so I was very confused why it fails for cluster mode, until I saw the hardcoded localhost. Everything else should work fine? > Launcher: add support for monitoring standalone/cluster apps > > > Key: SPARK-11033 > URL: https://issues.apache.org/jira/browse/SPARK-11033 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Priority: Major > > The backend for app monitoring in the launcher library was added in > SPARK-8673, but the code currently does not support standalone cluster mode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11033) Launcher: add support for monitoring standalone/cluster apps
[ https://issues.apache.org/jira/browse/SPARK-11033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819274#comment-16819274 ] Marcelo Vanzin commented on SPARK-11033: It's definitely not that simple. > Launcher: add support for monitoring standalone/cluster apps > > > Key: SPARK-11033 > URL: https://issues.apache.org/jira/browse/SPARK-11033 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Priority: Major > > The backend for app monitoring in the launcher library was added in > SPARK-8673, but the code currently does not support standalone cluster mode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11033) Launcher: add support for monitoring standalone/cluster apps
[ https://issues.apache.org/jira/browse/SPARK-11033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819270#comment-16819270 ] Mitesh commented on SPARK-11033: Thanks [~vanzin]! One quick question...is getting this working just a matter of passing the real host of the launcher process to the driver process, instead of using the hardcoded localhost? https://github.com/apache/spark/blob/2.3/core/src/main/scala/org/apache/spark/launcher/LauncherBackend.scala#L48 Because this works in client mode, so I was very confused why it fails for cluster mode, until I saw the hardcoded localhost. Everything else should work fine? > Launcher: add support for monitoring standalone/cluster apps > > > Key: SPARK-11033 > URL: https://issues.apache.org/jira/browse/SPARK-11033 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Priority: Major > > The backend for app monitoring in the launcher library was added in > SPARK-8673, but the code currently does not support standalone cluster mode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27468) "Storage Level" in "RDD Storage Page" is not correct
[ https://issues.apache.org/jira/browse/SPARK-27468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819228#comment-16819228 ] Gengliang Wang commented on SPARK-27468: [~shahid] Thanks > "Storage Level" in "RDD Storage Page" is not correct > > > Key: SPARK-27468 > URL: https://issues.apache.org/jira/browse/SPARK-27468 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.1 >Reporter: Shixiong Zhu >Priority: Major > > I ran the following unit test and checked the UI. > {code} > val conf = new SparkConf() > .setAppName("test") > .setMaster("local-cluster[2,1,1024]") > .set("spark.ui.enabled", "true") > sc = new SparkContext(conf) > val rdd = sc.makeRDD(1 to 10, 1).persist(StorageLevel.MEMORY_ONLY_2) > rdd.count() > Thread.sleep(360) > {code} > The storage level is "Memory Deserialized 1x Replicated" in the RDD storage > page. > I tried to debug and found this is because Spark emitted the following two > events: > {code} > event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, > 10.8.132.160, 65473, None),rdd_0_0,StorageLevel(memory, deserialized, 2 > replicas),56,0)) > event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, > 10.8.132.160, 65474, None),rdd_0_0,StorageLevel(memory, deserialized, 1 > replicas),56,0)) > {code} > The storage level in the second event will overwrite the first one. "1 > replicas" comes from this line: > https://github.com/apache/spark/blob/3ab96d7acf870e53c9016b0b63d0b328eec23bed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1457 > Maybe AppStatusListener should calculate the replicas from events? > Another fact we may need to think about is when replicas is 2, will two Spark > events arrive in the same order? Currently, two RPCs from different executors > can arrive in any order. 
[jira] [Created] (SPARK-27477) Kafka token provider should have provided dependency on Spark
koert kuipers created SPARK-27477: - Summary: Kafka token provider should have provided dependency on Spark Key: SPARK-27477 URL: https://issues.apache.org/jira/browse/SPARK-27477 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Environment: spark 3.0.0-SNAPSHOT commit 38fc8e2484aa4971d1f2c115da61fc96f36e7868 Author: Sean Owen Date: Sat Apr 13 22:27:25 2019 +0900 [MINOR][DOCS] Fix some broken links in docs Reporter: koert kuipers Fix For: 3.0.0 Currently the external module spark-token-provider-kafka-0-10 has a compile dependency on spark-core. This means spark-sql-kafka-0-10 also has a transitive compile dependency on spark-core. Since spark-sql-kafka-0-10 is not bundled with Spark but instead has to be added to an application that runs on Spark, this dependency should be provided, not compile. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
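The fix described above amounts to marking the Spark dependency with Maven's provided scope in the token provider's POM, so it is available at compile time but not dragged transitively into applications. A hedged sketch of the dependency entry (exact artifact coordinates and property names are illustrative, following Spark's usual build conventions):

```xml
<!-- Illustrative sketch of the scope change in the
     spark-token-provider-kafka-0-10 pom.xml; coordinates are assumed. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_${scala.binary.version}</artifactId>
  <version>${project.version}</version>
  <scope>provided</scope>
</dependency>
```

With provided scope, the Spark runtime supplies spark-core on the cluster, and spark-sql-kafka-0-10 users no longer pull a second copy onto their application classpath.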
[jira] [Commented] (SPARK-11033) Launcher: add support for monitoring standalone/cluster apps
[ https://issues.apache.org/jira/browse/SPARK-11033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819211#comment-16819211 ] Marcelo Vanzin commented on SPARK-11033: I just filed the bug. I have no intention of working on this. > Launcher: add support for monitoring standalone/cluster apps > > > Key: SPARK-11033 > URL: https://issues.apache.org/jira/browse/SPARK-11033 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Priority: Major > > The backend for app monitoring in the launcher library was added in > SPARK-8673, but the code currently does not support standalone cluster mode.
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819181#comment-16819181 ] Kazuaki Ishizaki commented on SPARK-27396: -- I have one question regarding low-level API. In my understanding, this SPIP proposes code generation API for each operation at low-level for exploiting columnar storage. How does this SPIP support to store the result of the generated code into columnar storage? In particular, for {{genColumnarCode()}}. > SPIP: Public APIs for extended Columnar Processing Support > -- > > Key: SPARK-27396 > URL: https://issues.apache.org/jira/browse/SPARK-27396 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Robert Joseph Evans >Priority: Major > > *Q1.* What are you trying to do? Articulate your objectives using absolutely > no jargon. > > The Dataset/DataFrame API in Spark currently only exposes to users one row at > a time when processing data. The goals of this are to > > # Expose to end users a new option of processing the data in a columnar > format, multiple rows at a time, with the data organized into contiguous > arrays in memory. > # Make any transitions between the columnar memory layout and a row based > layout transparent to the end user. > # Allow for simple data exchange with other systems, DL/ML libraries, > pandas, etc. by having clean APIs to transform the columnar data into an > Apache Arrow compatible layout. > # Provide a plugin mechanism for columnar processing support so an advanced > user could avoid data transition between columnar and row based processing > even through shuffles. This means we should at least support pluggable APIs > so an advanced end user can implement the columnar partitioning themselves, > and provide the glue necessary to shuffle the data still in a columnar format. 
> # Expose new APIs that allow advanced users or frameworks to implement > columnar processing either as UDFs, or by adjusting the physical plan to do > columnar processing. If the latter is too controversial we can move it to > another SPIP, but we plan to implement some accelerated computing in parallel > with this feature to be sure the APIs work, and without this feature it makes > that impossible. > > Not Requirements, but things that would be nice to have. > # Provide default implementations for partitioning columnar data, so users > don’t have to. > # Transition the existing in memory columnar layouts to be compatible with > Apache Arrow. This would make the transformations to Apache Arrow format a > no-op. The existing formats are already very close to those layouts in many > cases. This would not be using the Apache Arrow java library, but instead > being compatible with the memory > [layout|https://arrow.apache.org/docs/format/Layout.html] and possibly only a > subset of that layout. > # Provide a clean transition from the existing code to the new one. The > existing APIs which are public but evolving are not that far off from what is > being proposed. We should be able to create a new parallel API that can wrap > the existing one. This means any file format that is trying to support > columnar can still do so until we make a conscious decision to deprecate and > then turn off the old APIs. > > *Q2.* What problem is this proposal NOT designed to solve? > This is not trying to implement any of the processing itself in a columnar > way, with the exception of examples for documentation, and possibly default > implementations for partitioning of columnar shuffle. > > *Q3.* How is it done today, and what are the limits of current practice? > The current columnar support is limited to 3 areas. > # Input formats, optionally can return a ColumnarBatch instead of rows. 
The > code generation phase knows how to take that columnar data and iterate > through it as rows for stages that want rows, which currently is almost > everything. The limitations here are mostly implementation specific. The > current standard is to abuse Scala’s type erasure to return ColumnarBatches > as the elements of an RDD[InternalRow]. The code generation can handle this > because it is generating Java code, so it bypasses Scala’s type checking and > just casts the InternalRow to the desired ColumnarBatch. This makes it > difficult for others to implement the same functionality for different > processing because they can only do it through code generation. There really > is no clean separate path in the code generation for columnar vs row based. > Additionally, because it is only supported through code generation, if for any > reason code generation would fail there is no backup. This is typically fine > for input formats but can
[jira] [Comment Edited] (SPARK-11033) Launcher: add support for monitoring standalone/cluster apps
[ https://issues.apache.org/jira/browse/SPARK-11033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819178#comment-16819178 ] Mitesh edited comment on SPARK-11033 at 4/16/19 4:00 PM: - [~vanzin] Is there any plan to get this working? I'm on 2.3.2 using the standalone scheduler, cluster mode, and InProcessLauncher. We need a way to monitor the launched app, and it sounds like SparkAppHandle.Listener is not supported yet. Do you have a suggestion on how I can monitor the app (for success, failure, and "it disappeared" cases)? was (Author: masterddt): [~vanzin] Is there any plan to get this working? I'm on 2.3.2 using the standalone scheduler, cluster mode, and InProcessLauncher. We need a way to monitor the launched app, and it sounds like SparkAppHandle.Listener is not supported yet. Do you have a suggestion on how I can monitor the app (both for success, failure, and "it disappeared" cases)? > Launcher: add support for monitoring standalone/cluster apps > > > Key: SPARK-11033 > URL: https://issues.apache.org/jira/browse/SPARK-11033 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Priority: Major > > The backend for app monitoring in the launcher library was added in > SPARK-8673, but the code currently does not support standalone cluster mode.
[jira] [Commented] (SPARK-11033) Launcher: add support for monitoring standalone/cluster apps
[ https://issues.apache.org/jira/browse/SPARK-11033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819178#comment-16819178 ] Mitesh commented on SPARK-11033: [~vanzin] Is there any plan to get this working? I'm on 2.3.2 using the standalone scheduler, cluster mode, and InProcessLauncher. We need a way to monitor the launched app, and it sounds like `SparkAppHandle.Listener` is not supported yet. Do you have a suggestion on how I can monitor the app (both for success, failure, and "it disappeared" cases)? > Launcher: add support for monitoring standalone/cluster apps > > > Key: SPARK-11033 > URL: https://issues.apache.org/jira/browse/SPARK-11033 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Priority: Major > > The backend for app monitoring in the launcher library was added in > SPARK-8673, but the code currently does not support standalone cluster mode.
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819153#comment-16819153 ] Thomas Graves commented on SPARK-27396: --- Since I don't hear any strong objections against the idea, I'm going to put the SPIP up for vote on the mailing list. We can continue discussions here or on the list. > SPIP: Public APIs for extended Columnar Processing Support > -- > > Key: SPARK-27396 > URL: https://issues.apache.org/jira/browse/SPARK-27396 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Robert Joseph Evans >Priority: Major > > *Q1.* What are you trying to do? Articulate your objectives using absolutely > no jargon. > > The Dataset/DataFrame API in Spark currently only exposes to users one row at > a time when processing data. The goals of this are to > > # Expose to end users a new option of processing the data in a columnar > format, multiple rows at a time, with the data organized into contiguous > arrays in memory. > # Make any transitions between the columnar memory layout and a row based > layout transparent to the end user. > # Allow for simple data exchange with other systems, DL/ML libraries, > pandas, etc. by having clean APIs to transform the columnar data into an > Apache Arrow compatible layout. > # Provide a plugin mechanism for columnar processing support so an advanced > user could avoid data transition between columnar and row based processing > even through shuffles. This means we should at least support pluggable APIs > so an advanced end user can implement the columnar partitioning themselves, > and provide the glue necessary to shuffle the data still in a columnar format. > # Expose new APIs that allow advanced users or frameworks to implement > columnar processing either as UDFs, or by adjusting the physical plan to do > columnar processing. 
If the latter is too controversial we can move it to > another SPIP, but we plan to implement some accelerated computing in parallel > with this feature to be sure the APIs work, and without this feature it makes > that impossible. > > Not Requirements, but things that would be nice to have. > # Provide default implementations for partitioning columnar data, so users > don’t have to. > # Transition the existing in memory columnar layouts to be compatible with > Apache Arrow. This would make the transformations to Apache Arrow format a > no-op. The existing formats are already very close to those layouts in many > cases. This would not be using the Apache Arrow java library, but instead > being compatible with the memory > [layout|https://arrow.apache.org/docs/format/Layout.html] and possibly only a > subset of that layout. > # Provide a clean transition from the existing code to the new one. The > existing APIs which are public but evolving are not that far off from what is > being proposed. We should be able to create a new parallel API that can wrap > the existing one. This means any file format that is trying to support > columnar can still do so until we make a conscious decision to deprecate and > then turn off the old APIs. > > *Q2.* What problem is this proposal NOT designed to solve? > This is not trying to implement any of the processing itself in a columnar > way, with the exception of examples for documentation, and possibly default > implementations for partitioning of columnar shuffle. > > *Q3.* How is it done today, and what are the limits of current practice? > The current columnar support is limited to 3 areas. > # Input formats, optionally can return a ColumnarBatch instead of rows. The > code generation phase knows how to take that columnar data and iterate > through it as rows for stages that wants rows, which currently is almost > everything. The limitations here are mostly implementation specific. 
The > current standard is to abuse Scala’s type erasure to return ColumnarBatches > as the elements of an RDD[InternalRow]. The code generation can handle this > because it is generating Java code, so it bypasses Scala’s type checking and > just casts the InternalRow to the desired ColumnarBatch. This makes it > difficult for others to implement the same functionality for different > processing because they can only do it through code generation. There really > is no clean separate path in the code generation for columnar vs row based. > Additionally, because it is only supported through code generation, if for any > reason code generation would fail there is no backup. This is typically fine > for input formats but can be problematic when we get into more extensive > processing. > # When caching data it can optionally be cached in a columnar format if the >
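The row-at-a-time versus columnar distinction discussed in Q3 can be illustrated without Spark internals: a batch stores each column as a contiguous array, and a "row" is just an index into those arrays. The type below is a toy model of what a ColumnarBatch provides, not Spark's implementation.

```scala
// Toy model of a columnar batch: one contiguous array per column.
final case class ToyBatch(ids: Array[Int], values: Array[Double]) {
  def numRows: Int = ids.length

  // Row-based view: iterate indices and materialize one row at a time,
  // which is what generated code does for operators that want rows.
  def rowIterator: Iterator[(Int, Double)] =
    Iterator.range(0, numRows).map(i => (ids(i), values(i)))

  // Columnar processing: operate on the whole column array directly,
  // never materializing individual rows.
  def sumValues: Double = values.sum
}

val batch = ToyBatch(Array(1, 2, 3), Array(10.0, 20.0, 30.0))
// Both views compute the same aggregate; only the access pattern differs.
assert(batch.rowIterator.map(_._2).sum == batch.sumValues)
```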
[jira] [Commented] (SPARK-27475) dev/deps/spark-deps-hadoop-3.2 is incorrect
[ https://issues.apache.org/jira/browse/SPARK-27475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819130#comment-16819130 ] Sean Owen commented on SPARK-27475: --- That seems OK if it's the only way; this script is not run frequently. > dev/deps/spark-deps-hadoop-3.2 is incorrect > --- > > Key: SPARK-27475 > URL: https://issues.apache.org/jira/browse/SPARK-27475 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > parquet-hadoop-bundle-1.6.0.jar should be parquet-hadoop-bundle-1.8.1.jar.
[jira] [Created] (SPARK-27476) Refactoring SchemaPruning rule to remove duplicate code
Liang-Chi Hsieh created SPARK-27476: --- Summary: Refactoring SchemaPruning rule to remove duplicate code Key: SPARK-27476 URL: https://issues.apache.org/jira/browse/SPARK-27476 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Liang-Chi Hsieh In {{SchemaPruning}} rule, there is duplicate code for data source v1 and v2. Their logic is the same and we can refactor the rule to remove duplicate code.
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819125#comment-16819125 ] Robert Joseph Evans commented on SPARK-27396: - [~bryanc], I see your point that if this is for data exchange the API should be {{RDD[ArrowRecordBatch]}} and {{arrow_udf}}. {{ArrowRecordBatch}} is an IPC class and is made for exchanging data, so it should work out of the box and be simpler for end users to deal with. If it is for doing in-place data processing, not sending it to another system, then I think we want something based off of {{ColumnarBatch}}. Since in-place columnar data processing is hard, initially limiting it to just the extensions API feels preferable. If others are okay with that I will drop {{columnar_udf}} and {{RDD[ColumnarBatch]}} from this proposal, and just make sure that we have a good way for translating between {{ColumnarBatch}} and {{ArrowRecordBatch}} so we can play nicely with SPARK-24579. In the future if we find that advanced users do want columnar processing UDFs we can discuss ways to properly expose it at that point. > SPIP: Public APIs for extended Columnar Processing Support > -- > > Key: SPARK-27396 > URL: https://issues.apache.org/jira/browse/SPARK-27396 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Robert Joseph Evans >Priority: Major > > *Q1.* What are you trying to do? Articulate your objectives using absolutely > no jargon. > > The Dataset/DataFrame API in Spark currently only exposes to users one row at > a time when processing data. The goals of this are to > > # Expose to end users a new option of processing the data in a columnar > format, multiple rows at a time, with the data organized into contiguous > arrays in memory. > # Make any transitions between the columnar memory layout and a row based > layout transparent to the end user. > # Allow for simple data exchange with other systems, DL/ML libraries, > pandas, etc. 
by having clean APIs to transform the columnar data into an > Apache Arrow compatible layout. > # Provide a plugin mechanism for columnar processing support so an advanced > user could avoid data transition between columnar and row based processing > even through shuffles. This means we should at least support pluggable APIs > so an advanced end user can implement the columnar partitioning themselves, > and provide the glue necessary to shuffle the data still in a columnar format. > # Expose new APIs that allow advanced users or frameworks to implement > columnar processing either as UDFs, or by adjusting the physical plan to do > columnar processing. If the latter is too controversial we can move it to > another SPIP, but we plan to implement some accelerated computing in parallel > with this feature to be sure the APIs work, and without this feature it makes > that impossible. > > Not Requirements, but things that would be nice to have. > # Provide default implementations for partitioning columnar data, so users > don’t have to. > # Transition the existing in memory columnar layouts to be compatible with > Apache Arrow. This would make the transformations to Apache Arrow format a > no-op. The existing formats are already very close to those layouts in many > cases. This would not be using the Apache Arrow java library, but instead > being compatible with the memory > [layout|https://arrow.apache.org/docs/format/Layout.html] and possibly only a > subset of that layout. > # Provide a clean transition from the existing code to the new one. The > existing APIs which are public but evolving are not that far off from what is > being proposed. We should be able to create a new parallel API that can wrap > the existing one. This means any file format that is trying to support > columnar can still do so until we make a conscious decision to deprecate and > then turn off the old APIs. > > *Q2.* What problem is this proposal NOT designed to solve? 
> This is not trying to implement any of the processing itself in a columnar > way, with the exception of examples for documentation, and possibly default > implementations for partitioning of columnar shuffle. > > *Q3.* How is it done today, and what are the limits of current practice? > The current columnar support is limited to 3 areas. > # Input formats, optionally can return a ColumnarBatch instead of rows. The > code generation phase knows how to take that columnar data and iterate > through it as rows for stages that want rows, which currently is almost > everything. The limitations here are mostly implementation specific. The > current standard is to abuse Scala’s type erasure to return ColumnarBatches > as the elements of
[jira] [Commented] (SPARK-27475) dev/deps/spark-deps-hadoop-3.2 is incorrect
[ https://issues.apache.org/jira/browse/SPARK-27475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819124#comment-16819124 ] Yuming Wang commented on SPARK-27475: - [~srowen] One way to fix it is to change [this line|https://github.com/apache/spark/blob/9c0af746e5dda9f05e64f0a16a3dbe11a23024de/dev/test-dependencies.sh#L71] to {{$MVN $HADOOP2_MODULE_PROFILES -P$HADOOP_PROFILE clean install -DskipTests -q}}. But it's very expensive. I'm not sure if we have another way. > dev/deps/spark-deps-hadoop-3.2 is incorrect > --- > > Key: SPARK-27475 > URL: https://issues.apache.org/jira/browse/SPARK-27475 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > parquet-hadoop-bundle-1.6.0.jar should be parquet-hadoop-bundle-1.8.1.jar.
[jira] [Created] (SPARK-27475) dev/deps/spark-deps-hadoop-3.2 is incorrect
Yuming Wang created SPARK-27475: --- Summary: dev/deps/spark-deps-hadoop-3.2 is incorrect Key: SPARK-27475 URL: https://issues.apache.org/jira/browse/SPARK-27475 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.0.0 Reporter: Yuming Wang parquet-hadoop-bundle-1.6.0.jar should be parquet-hadoop-bundle-1.8.1.jar
[jira] [Updated] (SPARK-27475) dev/deps/spark-deps-hadoop-3.2 is incorrect
[ https://issues.apache.org/jira/browse/SPARK-27475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27475: Description: parquet-hadoop-bundle-1.6.0.jar should be parquet-hadoop-bundle-1.8.1.jar. (was: parquet-hadoop-bundle-1.6.0.jar should be parquet-hadoop-bundle-1.8.1.jar) > dev/deps/spark-deps-hadoop-3.2 is incorrect > --- > > Key: SPARK-27475 > URL: https://issues.apache.org/jira/browse/SPARK-27475 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > parquet-hadoop-bundle-1.6.0.jar should be parquet-hadoop-bundle-1.8.1.jar.
[jira] [Assigned] (SPARK-27464) Add Constant instead of referring string literal used from many places
[ https://issues.apache.org/jira/browse/SPARK-27464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-27464: - Assignee: Shivu Sondur > Add Constant instead of referring string literal used from many places > --- > > Key: SPARK-27464 > URL: https://issues.apache.org/jira/browse/SPARK-27464 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.1 >Reporter: Shivu Sondur >Assignee: Shivu Sondur >Priority: Trivial > > Add Constant instead of referring string literal used from many places for > "spark.buffer.pageSize"
[jira] [Resolved] (SPARK-27464) Add Constant instead of referring string literal used from many places
[ https://issues.apache.org/jira/browse/SPARK-27464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27464. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24368 [https://github.com/apache/spark/pull/24368] > Add Constant instead of referring string literal used from many places > --- > > Key: SPARK-27464 > URL: https://issues.apache.org/jira/browse/SPARK-27464 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.1 >Reporter: Shivu Sondur >Assignee: Shivu Sondur >Priority: Trivial > Fix For: 3.0.0 > > > Add Constant instead of referring string literal used from many places for > "spark.buffer.pageSize"
[jira] [Resolved] (SPARK-27469) Update Commons BeanUtils to 1.9.3
[ https://issues.apache.org/jira/browse/SPARK-27469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27469. --- Resolution: Fixed Fix Version/s: 3.0.0 Resolved by https://github.com/apache/spark/pull/24378 > Update Commons BeanUtils to 1.9.3 > - > > Key: SPARK-27469 > URL: https://issues.apache.org/jira/browse/SPARK-27469 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.1, 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > Fix For: 3.0.0 > > > Right now, Spark inherits two inconsistent versions of Commons BeanUtils via > Hadoop: commons-beanutils 1.7.0 and commons-beanutils-core 1.8.0. Version > 1.9.3 is the latest, and resolves bugs and a deserialization vulnerability > that was otherwise resolved here in CVE-2017-12612. It'd be nice to both fix > the inconsistency and get the latest to further ensure that there isn't any > latent vulnerability here.
[jira] [Updated] (SPARK-27330) ForeachWriter is not being closed once a batch is aborted
[ https://issues.apache.org/jira/browse/SPARK-27330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Zituny updated SPARK-27330: Description: in cases where a micro batch is being killed (interrupted), not during actual processing done by the {{ForeachDataWriter}} (when iterating the iterator), {{DataWritingSparkTask}} will handle the interruption and call {{dataWriter.abort()}} the problem is that {{ForeachDataWriter}} has an empty implementation for the abort method. due to that, I have tasks which uses the foreach writer and according to the [documentation|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreach] they are opening connections in the "open" method and closing the connections on the "close" method but since the "close" is never called, the connections are never closed this wasn't the behavior pre spark 2.4 my suggestion is to call {{ForeachWriter.abort()}} when {{DataWriter.abort()}} is called, in order to notify the foreach writer that this task has failed {code:java} stack trace from the exception i have encountered: org.apache.spark.TaskKilledException: null at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:149) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:117) at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:116) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394) at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:146) at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:67) at 
org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:66) {code} was: in cases where a micro batch is being killed (interrupted), not during actual processing done by the {{ForeachDataWriter}} (when iterating the iterator), {{DataWritingSparkTask}} will handle the interruption and call {{dataWriter.abort()}} the problem is that {{ForeachDataWriter}} has an empty implementation for the abort method. due to that, I have tasks which uses the foreach writer and according to the [documentation|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreach] they are opening connections in the "open" method and closing the connections on the "close" method but since the "close" is never called, the connections are never closed this wasn't the behavior pre spark 2.4 my suggestion is to call {{ForeachWriter.close()}} when {{DataWriter.abort()}} is called, and exception should also be provided in order to notify the foreach writer that this task has failed {code} stack trace from the exception i have encountered: org.apache.spark.TaskKilledException: null at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:149) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:117) at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:116) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394) at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:146) at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:67) at 
org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:66) {code} > ForeachWriter is not being closed once a batch is aborted > - > > Key: SPARK-27330 > URL: https://issues.apache.org/jira/browse/SPARK-27330 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Eyal Zituny >Priority: Major > > in cases where a micro batch is being killed (interrupted), not during actual > processing done by the {{ForeachDataWriter}} (when iterating the iterator), > {{DataWritingSparkTask}} will handle the interruption and call > {{dataWriter.abort()}} > the problem is that {{ForeachDataWriter}} has an empty implementation for the > abort method. > due
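The suggested fix — having the writer's abort path notify the user's foreach writer instead of doing nothing — can be sketched with a simplified, self-contained model. The class and method names below are hypothetical stand-ins for Spark's Scala `ForeachWriter`/`ForeachDataWriter`; this is only an illustration of the lifecycle, not Spark's implementation.

```python
class ForeachWriter:
    """Toy stand-in for Spark's ForeachWriter open/process/close contract."""
    def __init__(self):
        self.opened = False
        self.closed_with = None   # will hold the error passed to close()

    def open(self):
        self.opened = True        # e.g. open a DB connection here

    def close(self, error):
        self.closed_with = error  # e.g. release the connection here


class ForeachDataWriter:
    """Toy stand-in for the DataWriter wrapper around a ForeachWriter."""
    def __init__(self, writer):
        self.writer = writer
        self.writer.open()

    def abort(self, error):
        # Previously (per the report) this was a no-op, leaking connections.
        # The proposed behavior: propagate the failure to the user's writer
        # so its close() hook can release resources.
        self.writer.close(error)


w = ForeachWriter()
dw = ForeachDataWriter(w)
dw.abort(RuntimeError("task killed"))   # batch interrupted -> writer closed
```

With this shape, an interrupted micro-batch still gives the user-supplied writer a chance to clean up, which is the behavior the reporter describes from before Spark 2.4.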
[jira] [Resolved] (SPARK-27397) Take care of OpenJ9 in JVM dependant parts
[ https://issues.apache.org/jira/browse/SPARK-27397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27397. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24308 [https://github.com/apache/spark/pull/24308] > Take care of OpenJ9 in JVM dependant parts > -- > > Key: SPARK-27397 > URL: https://issues.apache.org/jira/browse/SPARK-27397 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Major > Fix For: 3.0.0 > > > Spark includes JVM-dependent code in multiple places, such as > {{SizeEstimator}}. Spark currently takes care of the IBM JDK and OpenJDK. > OpenJ9 has recently been released, but it is not yet considered. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
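JVM-vendor checks of the kind described above typically hinge on system properties such as `java.vm.name`. A minimal sketch of the branching (in Python purely for illustration — Spark's actual checks, e.g. in {{SizeEstimator}}, are in Scala, and the example property values are assumptions, not an exhaustive list):

```python
def classify_jvm(vm_name: str) -> str:
    """Classify a JVM from its java.vm.name system property (illustrative)."""
    name = vm_name.lower()
    if "openj9" in name:          # e.g. "Eclipse OpenJ9 VM"
        return "OpenJ9"
    if "ibm" in name:             # e.g. "IBM J9 VM"
        return "IBM JDK"
    return "HotSpot/OpenJDK"      # e.g. "OpenJDK 64-Bit Server VM"
```

The point of the fix is that code which only distinguished "IBM" from "everything else" misclassifies OpenJ9, so an explicit OpenJ9 branch (checked first, as above) is needed.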
[jira] [Commented] (SPARK-27330) ForeachWriter is not being closed once a batch is aborted
[ https://issues.apache.org/jira/browse/SPARK-27330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819054#comment-16819054 ] Eyal Zituny commented on SPARK-27330: - [~gsomogyi] i've submitted a PR > ForeachWriter is not being closed once a batch is aborted > - > > Key: SPARK-27330 > URL: https://issues.apache.org/jira/browse/SPARK-27330 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Eyal Zituny >Priority: Major > > in cases where a micro batch is being killed (interrupted), not during actual > processing done by the {{ForeachDataWriter}} (when iterating the iterator), > {{DataWritingSparkTask}} will handle the interruption and call > {{dataWriter.abort()}} > the problem is that {{ForeachDataWriter}} has an empty implementation for the > abort method. > due to that, I have tasks which uses the foreach writer and according to the > [documentation|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreach] > they are opening connections in the "open" method and closing the > connections on the "close" method but since the "close" is never called, the > connections are never closed > this wasn't the behavior pre spark 2.4 > my suggestion is to call {{ForeachWriter.close()}} when > {{DataWriter.abort()}} is called, and exception should also be provided in > order to notify the foreach writer that this task has failed > > {code} > stack trace from the exception i have encountered: > org.apache.spark.TaskKilledException: null > at > org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:149) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:117) > at > 
org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:116) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394) > at > org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:146) > at > org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:67) > at > org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:66) > {code} >
[jira] [Assigned] (SPARK-27397) Take care of OpenJ9 in JVM dependant parts
[ https://issues.apache.org/jira/browse/SPARK-27397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-27397: - Assignee: Kazuaki Ishizaki > Take care of OpenJ9 in JVM dependant parts > -- > > Key: SPARK-27397 > URL: https://issues.apache.org/jira/browse/SPARK-27397 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Major > > Spark includes JVM-dependent code in multiple places, such as > {{SizeEstimator}}. Spark currently takes care of the IBM JDK and OpenJDK. > OpenJ9 has recently been released, but it is not yet considered.
[jira] [Commented] (SPARK-27409) Micro-batch support for Kafka Source in Spark 2.3
[ https://issues.apache.org/jira/browse/SPARK-27409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16818963#comment-16818963 ] Gabor Somogyi commented on SPARK-27409: --- After modifying the code the same way as in the Scala case and checking it with PySpark 2.3.2, here are my thoughts: `trigger=continuous...` mainly controls the execution (MicroBatchExecution vs ContinuousExecution). See the code [here|https://github.com/apache/spark/blob/02b510728c31b70e6035ad541bfcdc2b59dcd79a/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala#L243]. This code has been significantly rewritten on the master branch. Does this cause any issue? > Micro-batch support for Kafka Source in Spark 2.3 > - > > Key: SPARK-27409 > URL: https://issues.apache.org/jira/browse/SPARK-27409 > Project: Spark > Issue Type: Question > Components: Structured Streaming >Affects Versions: 2.3.2 >Reporter: Prabhjot Singh Bharaj >Priority: Major > > It seems with this change - > [https://github.com/apache/spark/commit/0a441d2edb0a3f6c6c7c370db8917e1c07f211e7#diff-eeac5bdf3a1ecd7b9f8aaf10fff37f05R50] > in Spark 2.3 for Kafka Source Provider, a Kafka source can not be run in > micro-batch mode but only in continuous mode. Is that understanding correct ? > {code:java} > E Py4JJavaError: An error occurred while calling o217.load. 
> E : org.apache.kafka.common.KafkaException: Failed to construct kafka consumer > E at > org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:717) > E at > org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:566) > E at > org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:549) > E at > org.apache.spark.sql.kafka010.SubscribeStrategy.createConsumer(ConsumerStrategy.scala:62) > E at > org.apache.spark.sql.kafka010.KafkaOffsetReader.createConsumer(KafkaOffsetReader.scala:314) > E at > org.apache.spark.sql.kafka010.KafkaOffsetReader.(KafkaOffsetReader.scala:78) > E at > org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:130) > E at > org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:43) > E at > org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:185) > E at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > E at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > E at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > E at java.lang.reflect.Method.invoke(Method.java:498) > E at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > E at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > E at py4j.Gateway.invoke(Gateway.java:282) > E at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > E at py4j.commands.CallCommand.execute(CallCommand.java:79) > E at py4j.GatewayConnection.run(GatewayConnection.java:238) > E at java.lang.Thread.run(Thread.java:748) > E Caused by: org.apache.kafka.common.KafkaException: > org.apache.kafka.common.KafkaException: java.io.FileNotFoundException: > non-existent (No such file or directory) > E at > org.apache.kafka.common.network.SslChannelBuilder.configure(SslChannelBuilder.java:44) > E at > 
org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:93) > E at > org.apache.kafka.common.network.ChannelBuilders.clientChannelBuilder(ChannelBuilders.java:51) > E at > org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:84) > E at > org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:657) > E ... 19 more > E Caused by: org.apache.kafka.common.KafkaException: > java.io.FileNotFoundException: non-existent (No such file or directory) > E at > org.apache.kafka.common.security.ssl.SslFactory.configure(SslFactory.java:121) > E at > org.apache.kafka.common.network.SslChannelBuilder.configure(SslChannelBuilder.java:41) > E ... 23 more > E Caused by: java.io.FileNotFoundException: non-existent (No such file or > directory) > E at java.io.FileInputStream.open0(Native Method) > E at java.io.FileInputStream.open(FileInputStream.java:195) > E at java.io.FileInputStream.(FileInputStream.java:138) > E at java.io.FileInputStream.(FileInputStream.java:93) > E at > org.apache.kafka.common.security.ssl.SslFactory$SecurityStore.load(SslFactory.java:216) > E at > org.apache.kafka.common.security.ssl.SslFactory$SecurityStore.access$000(SslFactory.java:201) > E at > org.apache.kafka.common.security.ssl.SslFactory.createSSLContext(SslFactory.java:137) > E at > org.apache.kafka.common.security.ssl.SslFactory.configure(SslFactory.java:119) > E ... 24 more{code} > When running a simple data stream loader for kafka
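Gabor's observation above — that the trigger, not the source, selects between MicroBatchExecution and ContinuousExecution — can be sketched as a toy dispatch. This is a hypothetical simplification in Python; the real logic lives in Spark's Scala `StreamingQueryManager`, and the names here are illustrative only:

```python
# Toy model: the *trigger* chosen at query start decides which execution
# engine runs the stream; the Kafka source supports both.

class Trigger:
    def __init__(self, kind, interval=None):
        self.kind = kind          # "processing-time" or "continuous"
        self.interval = interval  # e.g. "1 second"

def choose_execution(trigger):
    """Pick the execution engine the way the stream start path roughly does."""
    if trigger.kind == "continuous":
        return "ContinuousExecution"   # the continuous-processing path
    return "MicroBatchExecution"       # the default micro-batch path
```

Under this model, a Kafka source started without `trigger(continuous=...)` still runs in micro-batch mode — the error in the report comes from the reader being constructed eagerly at `load()` time, not from micro-batch support being absent.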
[jira] [Updated] (SPARK-27474) avoid retrying a task failed with CommitDeniedException many times
[ https://issues.apache.org/jira/browse/SPARK-27474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-27474: Summary: avoid retrying a task failed with CommitDeniedException many times (was: try best to not submit tasks when the partitions are already completed) > avoid retrying a task failed with CommitDeniedException many times > -- > > Key: SPARK-27474 > URL: https://issues.apache.org/jira/browse/SPARK-27474 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major >
[jira] [Resolved] (SPARK-27470) Upgrade pyrolite to 4.23
[ https://issues.apache.org/jira/browse/SPARK-27470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27470. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24381 [https://github.com/apache/spark/pull/24381] > Upgrade pyrolite to 4.23 > > > Key: SPARK-27470 > URL: https://issues.apache.org/jira/browse/SPARK-27470 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 2.4.1, 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > Fix For: 3.0.0 > > > We can/should upgrade the pyrolite dependency to the latest, 4.23, to pick up > bug and security fixes.
[jira] [Commented] (SPARK-27468) "Storage Level" in "RDD Storage Page" is not correct
[ https://issues.apache.org/jira/browse/SPARK-27468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16818889#comment-16818889 ] Daniel Tomes commented on SPARK-27468: -- Excellent [~shahid], if you need any assistance replicating, let me know; I can recreate the issue but you should be able to as well. Thanks > "Storage Level" in "RDD Storage Page" is not correct > > > Key: SPARK-27468 > URL: https://issues.apache.org/jira/browse/SPARK-27468 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.1 >Reporter: Shixiong Zhu >Priority: Major > > I ran the following unit test and checked the UI. > {code} > val conf = new SparkConf() > .setAppName("test") > .setMaster("local-cluster[2,1,1024]") > .set("spark.ui.enabled", "true") > sc = new SparkContext(conf) > val rdd = sc.makeRDD(1 to 10, 1).persist(StorageLevel.MEMORY_ONLY_2) > rdd.count() > Thread.sleep(360) > {code} > The storage level is "Memory Deserialized 1x Replicated" in the RDD storage > page. > I tried to debug and found this is because Spark emitted the following two > events: > {code} > event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, > 10.8.132.160, 65473, None),rdd_0_0,StorageLevel(memory, deserialized, 2 > replicas),56,0)) > event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, > 10.8.132.160, 65474, None),rdd_0_0,StorageLevel(memory, deserialized, 1 > replicas),56,0)) > {code} > The storage level in the second event will overwrite the first one. "1 > replicas" comes from this line: > https://github.com/apache/spark/blob/3ab96d7acf870e53c9016b0b63d0b328eec23bed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1457 > Maybe AppStatusListener should calculate the replicas from events? > Another fact we may need to think about is when replicas is 2, will two Spark > events arrive in the same order? Currently, two RPCs from different executors > can arrive in any order. 
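The overwrite hazard described above — where the later `SparkListenerBlockUpdated` event's "1 replicas" clobbers the earlier "2 replicas" — can be sketched with a toy fold over the events. Keeping the maximum replica count seen for a block is order-independent, unlike "last write wins". This is only an illustration of the suggestion that the listener calculate replicas from events, not Spark's actual `AppStatusListener` logic:

```python
# Toy model of merging BlockUpdated events for one RDD block.
# Each tuple is (block_manager_id, replicas_reported_by_that_manager).

def merge_replication(events):
    """Return the block's replication, independent of event arrival order."""
    return max(replicas for _, replicas in events)

# The two events from the report, in both possible arrival orders:
in_order  = [("BlockManagerId(1)", 2), ("BlockManagerId(0)", 1)]
reordered = list(reversed(in_order))
```

Since RPCs from different executors can arrive in any order, any per-event "overwrite" scheme is racy; an order-insensitive aggregate like the max (or the count of distinct block managers holding the block) sidesteps the race.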
[jira] [Commented] (SPARK-27409) Micro-batch support for Kafka Source in Spark 2.3
[ https://issues.apache.org/jira/browse/SPARK-27409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16818882#comment-16818882 ] Gabor Somogyi commented on SPARK-27409: --- [~pbharaj] {quote}Yes, I'm following the kafka integration guide linked.{quote} {code:java} $ ./bin/pyspark Python 2.7.10 (default, Aug 17 2018, 17:41:52) [GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.0.42)] on darwin Type "help", "copyright", "credits" or "license" for more information. 19/04/16 12:24:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Welcome to [Spark ASCII-art banner] version 2.3.2 Using Python version 2.7.10 (default, Aug 17 2018 17:41:52) SparkSession available as 'spark'. 
>>> df = sc.sql.readStream.format('kafka').option('kafka.bootstrap.servers', >>> 'localhost:9093').option("kafka.security.protocol", >>> "SSL",).option("kafka.ssl.keystore.password", >>> "").option("kafka.ssl.keystore.type", >>> "PKCS12").option("kafka.ssl.keystore.location", >>> 'non-existent').option('subscribe', 'no existing topic').load() Traceback (most recent call last): File "", line 1, in AttributeError: 'SparkContext' object has no attribute 'sql' >>> {code} > Micro-batch support for Kafka Source in Spark 2.3 > - > > Key: SPARK-27409 > URL: https://issues.apache.org/jira/browse/SPARK-27409 > Project: Spark > Issue Type: Question > Components: Structured Streaming >Affects Versions: 2.3.2 >Reporter: Prabhjot Singh Bharaj >Priority: Major > > It seems with this change - > [https://github.com/apache/spark/commit/0a441d2edb0a3f6c6c7c370db8917e1c07f211e7#diff-eeac5bdf3a1ecd7b9f8aaf10fff37f05R50] > in Spark 2.3 for Kafka Source Provider, a Kafka source can not be run in > micro-batch mode but only in continuous mode. Is that understanding correct ? > {code:java} > E Py4JJavaError: An error occurred while calling o217.load. 
> E : org.apache.kafka.common.KafkaException: Failed to construct kafka consumer > E at > org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:717) > E at > org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:566) > E at > org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:549) > E at > org.apache.spark.sql.kafka010.SubscribeStrategy.createConsumer(ConsumerStrategy.scala:62) > E at > org.apache.spark.sql.kafka010.KafkaOffsetReader.createConsumer(KafkaOffsetReader.scala:314) > E at > org.apache.spark.sql.kafka010.KafkaOffsetReader.(KafkaOffsetReader.scala:78) > E at > org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:130) > E at > org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:43) > E at > org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:185) > E at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > E at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > E at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > E at java.lang.reflect.Method.invoke(Method.java:498) > E at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > E at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > E at py4j.Gateway.invoke(Gateway.java:282) > E at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > E at py4j.commands.CallCommand.execute(CallCommand.java:79) > E at py4j.GatewayConnection.run(GatewayConnection.java:238) > E at java.lang.Thread.run(Thread.java:748) > E Caused by: org.apache.kafka.common.KafkaException: > org.apache.kafka.common.KafkaException: java.io.FileNotFoundException: > non-existent (No such file or directory) > E at > org.apache.kafka.common.network.SslChannelBuilder.configure(SslChannelBuilder.java:44) > E at > 
org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:93) > E at > org.apache.kafka.common.network.ChannelBuilders.clientChannelBuilder(ChannelBuilders.java:51) > E at > org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:84) > E at > org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:657) > E ... 19 more > E Caused by: org.apache.kafka.common.KafkaException: > java.io.FileNotFoundException: non-existent (No such file or directory) > E at > org.apache.kafka.common.security.ssl.SslFactory.configure(SslFactory.java:121) > E at >
[jira] [Created] (SPARK-27474) try best to not submit tasks when the partitions are already completed
Wenchen Fan created SPARK-27474: --- Summary: try best to not submit tasks when the partitions are already completed Key: SPARK-27474 URL: https://issues.apache.org/jira/browse/SPARK-27474 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan
[jira] [Commented] (SPARK-25250) Race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple ti
[ https://issues.apache.org/jira/browse/SPARK-25250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16818707#comment-16818707 ] Wenchen Fan commented on SPARK-25250: - Good points! I'm creating another ticket for the issue and will leave this as resolved. > Race condition with tasks running when new attempt for same stage is created > leads to other task in the next attempt running on the same partition id > retry multiple times > -- > > Key: SPARK-25250 > URL: https://issues.apache.org/jira/browse/SPARK-25250 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.3.1 >Reporter: Parth Gandhi >Assignee: Parth Gandhi >Priority: Major > Fix For: 2.3.4, 2.4.1, 3.0.0 > > > We recently had a scenario where a race condition occurred when a task from > previous stage attempt just finished before new attempt for the same stage > was created due to fetch failure, so the new task created in the second > attempt on the same partition id was retrying multiple times due to > TaskCommitDenied Exception without realizing that the task in earlier attempt > was already successful. > For example, consider a task with partition id 9000 and index 9000 running in > stage 4.0. We see a fetch failure so thus, we spawn a new stage attempt 4.1. > Just within this timespan, the above task completes successfully, thus, > marking the partition id 9000 as complete for 4.0. However, as stage 4.1 has > not yet been created, the taskset info for that stage is not available to the > TaskScheduler so, naturally, the partition id 9000 has not been marked > completed for 4.1. Stage 4.1 now spawns task with index 2000 on the same > partition id 9000. This task fails due to CommitDeniedException and since, it > does not see the corresponding partition id as been marked successful, it > keeps retrying multiple times until the job finally succeeds. 
It doesn't > cause any job failures because the DAG scheduler is tracking the partitions > separate from the task set managers. > > Steps to Reproduce: > # Run any large job involving shuffle operation. > # When the ShuffleMap stage finishes and the ResultStage begins running, > cause this stage to throw a fetch failure exception(Try deleting certain > shuffle files on any host). > # Observe the task attempt numbers for the next stage attempt. Please note > that this issue is an intermittent one, so it might not happen all the time.
[jira] [Resolved] (SPARK-25250) Race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple tim
[ https://issues.apache.org/jira/browse/SPARK-25250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-25250. - Resolution: Fixed > Race condition with tasks running when new attempt for same stage is created > leads to other task in the next attempt running on the same partition id > retry multiple times > -- > > Key: SPARK-25250 > URL: https://issues.apache.org/jira/browse/SPARK-25250 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.3.1 >Reporter: Parth Gandhi >Assignee: Parth Gandhi >Priority: Major > Fix For: 2.3.4, 3.0.0, 2.4.1 > > > We recently had a scenario where a race condition occurred when a task from > previous stage attempt just finished before new attempt for the same stage > was created due to fetch failure, so the new task created in the second > attempt on the same partition id was retrying multiple times due to > TaskCommitDenied Exception without realizing that the task in earlier attempt > was already successful. > For example, consider a task with partition id 9000 and index 9000 running in > stage 4.0. We see a fetch failure so thus, we spawn a new stage attempt 4.1. > Just within this timespan, the above task completes successfully, thus, > marking the partition id 9000 as complete for 4.0. However, as stage 4.1 has > not yet been created, the taskset info for that stage is not available to the > TaskScheduler so, naturally, the partition id 9000 has not been marked > completed for 4.1. Stage 4.1 now spawns task with index 2000 on the same > partition id 9000. This task fails due to CommitDeniedException and since, it > does not see the corresponding partition id as been marked successful, it > keeps retrying multiple times until the job finally succeeds. It doesn't > cause any job failures because the DAG scheduler is tracking the partitions > separate from the task set managers. > > Steps to Reproduce: > # Run any large job involving shuffle operation. 
> # When the ShuffleMap stage finishes and the ResultStage begins running, > cause this stage to throw a fetch failure exception(Try deleting certain > shuffle files on any host). > # Observe the task attempt numbers for the next stage attempt. Please note > that this issue is an intermittent one, so it might not happen all the time.
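The race described above amounts to a new stage attempt not seeing partitions already completed by an older attempt. The follow-up direction (tracked as SPARK-27474) is to share that completion state across attempts. A highly simplified, self-contained sketch — hypothetical names, not Spark's scheduler code:

```python
class StageAttempt:
    """Toy task set for one attempt of a stage.

    `completed` is per-stage state shared by every attempt, so a partition
    finished by attempt 4.0 is never resubmitted (and endlessly commit-denied)
    by attempt 4.1.
    """
    def __init__(self, stage_id, partitions, completed):
        self.stage_id = stage_id
        self.completed = completed
        self.pending = [p for p in partitions if p not in completed]

    def task_succeeded(self, partition):
        self.completed.add(partition)


completed = set()                          # shared across attempts of stage 4
attempt0 = StageAttempt(4, range(3), completed)
attempt0.task_succeeded(0)                 # partition 0 finishes in attempt 4.0
# Fetch failure -> attempt 4.1 is created; partition 0 must not be retried.
attempt1 = StageAttempt(4, range(3), completed)
```

Because the new attempt filters its pending list against the shared set at construction time, a task that slipped in "just within this timespan" still counts, closing the window the reporter describes.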
[jira] [Resolved] (SPARK-27448) File source V2 table provider should be compatible with V1 provider
[ https://issues.apache.org/jira/browse/SPARK-27448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27448. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24356 [https://github.com/apache/spark/pull/24356] > File source V2 table provider should be compatible with V1 provider > --- > > Key: SPARK-27448 > URL: https://issues.apache.org/jira/browse/SPARK-27448 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.0.0 > > > In the rule `PreprocessTableCreation`, if an existing table is appended with > a different provider, the action will fail. > Currently, there are two implementations for file sources and creating a > table with file source V2 will always fall back to V1 FileFormat. We should > consider the following cases as valid: > 1. Appending a table with file source V2 provider using the v1 file format > 2. Appending a table with v1 file format provider using file source V2 format >
[jira] [Assigned] (SPARK-27448) File source V2 table provider should be compatible with V1 provider
[ https://issues.apache.org/jira/browse/SPARK-27448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-27448: --- Assignee: Gengliang Wang > File source V2 table provider should be compatible with V1 provider > --- > > Key: SPARK-27448 > URL: https://issues.apache.org/jira/browse/SPARK-27448 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > In the rule `PreprocessTableCreation`, if an existing table is appended with > a different provider, the action will fail. > Currently, there are two implementations for file sources and creating a > table with file source V2 will always fall back to V1 FileFormat. We should > consider the following cases as valid: > 1. Appending a table with file source V2 provider using the v1 file format > 2. Appending a table with v1 file format provider using file source V2 format >
[jira] [Updated] (SPARK-27465) Kafka Client 0.11.0.0 is not Supporting the kafkatestutils package
[ https://issues.apache.org/jira/browse/SPARK-27465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Praveen updated SPARK-27465: Description: Hi Team, We are getting the below exceptions with Kafka Client Version 0.11.0.0 for the KafkaTestUtils package, but it's working fine when we use Kafka Client Version 0.10.0.1. Please suggest the way forward. We are using the package " org.apache.spark.streaming.kafka010.KafkaTestUtils;" and the Spark Streaming version is 2.2.3 and above. ERROR: java.lang.NoSuchMethodError: kafka.server.KafkaServer$.$lessinit$greater$default$2()Lkafka/utils/Time; at org.apache.spark.streaming.kafka010.KafkaTestUtils$$anonfun$setupEmbeddedKafkaServer$2.apply(KafkaTestUtils.scala:110) at org.apache.spark.streaming.kafka010.KafkaTestUtils$$anonfun$setupEmbeddedKafkaServer$2.apply(KafkaTestUtils.scala:107) at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:2234) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160) at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:2226) at org.apache.spark.streaming.kafka010.KafkaTestUtils.setupEmbeddedKafkaServer(KafkaTestUtils.scala:107) at org.apache.spark.streaming.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:122) at com.netcracker.rms.smart.esp.ESPTestEnv.prepareKafkaTestUtils(ESPTestEnv.java:203) at com.netcracker.rms.smart.esp.ESPTestEnv.setUp(ESPTestEnv.java:157) at com.netcracker.rms.smart.esp.TestEventStreamProcessor.setUp(TestEventStreamProcessor.java:58) was: Hi Team, We are getting the below exceptions with Kafka Client Version 0.11.0.0 for KafkaTestUtils Package. But its working fine when we use the Kafka Client Version 0.10.0.1. Please suggest the way forwards. 
We are using the package " import org.apache.spark.streaming.kafka010.KafkaTestUtils;" ERROR: java.lang.NoSuchMethodError: kafka.server.KafkaServer$.$lessinit$greater$default$2()Lkafka/utils/Time; at org.apache.spark.streaming.kafka010.KafkaTestUtils$$anonfun$setupEmbeddedKafkaServer$2.apply(KafkaTestUtils.scala:110) at org.apache.spark.streaming.kafka010.KafkaTestUtils$$anonfun$setupEmbeddedKafkaServer$2.apply(KafkaTestUtils.scala:107) at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:2234) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160) at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:2226) at org.apache.spark.streaming.kafka010.KafkaTestUtils.setupEmbeddedKafkaServer(KafkaTestUtils.scala:107) at org.apache.spark.streaming.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:122) at com.netcracker.rms.smart.esp.ESPTestEnv.prepareKafkaTestUtils(ESPTestEnv.java:203) at com.netcracker.rms.smart.esp.ESPTestEnv.setUp(ESPTestEnv.java:157) at com.netcracker.rms.smart.esp.TestEventStreamProcessor.setUp(TestEventStreamProcessor.java:58) > Kafka Client 0.11.0.0 is not Supporting the kafkatestutils package > -- > > Key: SPARK-27465 > URL: https://issues.apache.org/jira/browse/SPARK-27465 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0, 2.4.1 >Reporter: Praveen >Priority: Critical > > Hi Team, > We are getting the below exceptions with Kafka Client Version 0.11.0.0 for > KafkaTestUtils Package. But its working fine when we use the Kafka Client > Version 0.10.0.1. Please suggest the way forwards. We are using the package " > org.apache.spark.streaming.kafka010.KafkaTestUtils;" > And the Spark Streaming Version is 2.2.3 and above. 
> > ERROR: > java.lang.NoSuchMethodError: > kafka.server.KafkaServer$.$lessinit$greater$default$2()Lkafka/utils/Time; > at > org.apache.spark.streaming.kafka010.KafkaTestUtils$$anonfun$setupEmbeddedKafkaServer$2.apply(KafkaTestUtils.scala:110) > at > org.apache.spark.streaming.kafka010.KafkaTestUtils$$anonfun$setupEmbeddedKafkaServer$2.apply(KafkaTestUtils.scala:107) > at > org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:2234) > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160) > at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:2226) > at > org.apache.spark.streaming.kafka010.KafkaTestUtils.setupEmbeddedKafkaServer(KafkaTestUtils.scala:107) > at > org.apache.spark.streaming.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:122) > at > com.netcracker.rms.smart.esp.ESPTestEnv.prepareKafkaTestUtils(ESPTestEnv.java:203) > at com.netcracker.rms.smart.esp.ESPTestEnv.setUp(ESPTestEnv.java:157) > at > com.netcracker.rms.smart.esp.TestEventStreamProcessor.setUp(TestEventStreamProcessor.java:58)