[jira] [Assigned] (SPARK-27416) UnsafeMapData & UnsafeArrayData Kryo serialization breaks when two machines have different Oops size
[ https://issues.apache.org/jira/browse/SPARK-27416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-27416:
-----------------------------------

    Assignee: peng bo

> UnsafeMapData & UnsafeArrayData Kryo serialization breaks when two machines
> have different Oops size
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-27416
>                 URL: https://issues.apache.org/jira/browse/SPARK-27416
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.1
>            Reporter: peng bo
>            Assignee: peng bo
>            Priority: Major
>
> This is a follow-up to
> https://issues.apache.org/jira/browse/SPARK-27406 and
> https://issues.apache.org/jira/browse/SPARK-10914.
> This issue is to fix the UnsafeMapData & UnsafeArrayData Kryo serialization
> issue when two machines have different Oops size.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27416) UnsafeMapData & UnsafeArrayData Kryo serialization breaks when two machines have different Oops size
[ https://issues.apache.org/jira/browse/SPARK-27416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-27416.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 3.0.0

Issue resolved by pull request 24357
[https://github.com/apache/spark/pull/24357]

> UnsafeMapData & UnsafeArrayData Kryo serialization breaks when two machines
> have different Oops size
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-27416
>                 URL: https://issues.apache.org/jira/browse/SPARK-27416
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.1
>            Reporter: peng bo
>            Assignee: peng bo
>            Priority: Major
>             Fix For: 3.0.0
>
> This is a follow-up to
> https://issues.apache.org/jira/browse/SPARK-27406 and
> https://issues.apache.org/jira/browse/SPARK-10914.
> This issue is to fix the UnsafeMapData & UnsafeArrayData Kryo serialization
> issue when two machines have different Oops size.
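[Editor's note] The portability hazard behind this ticket can be illustrated without Spark: a serializer that derives sizes or offsets from the running JVM's object-pointer (Oops) layout produces bytes that a JVM with a different Oops size cannot read, whereas writing an explicit length plus the raw bytes is layout-independent. A minimal sketch of that idea; the RawBuffer class is a hypothetical stand-in, not Spark's actual fix:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

// Hypothetical stand-in for a buffer-backed value like UnsafeArrayData.
// Portable serialization writes the logical byte length explicitly and then
// the raw bytes; nothing on the wire depends on the JVM's Oops size.
final case class RawBuffer(bytes: Array[Byte])

object RawBuffer {
  def write(out: DataOutputStream, b: RawBuffer): Unit = {
    out.writeInt(b.bytes.length) // explicit size: identical on every JVM
    out.write(b.bytes)
  }

  def read(in: DataInputStream): RawBuffer = {
    val len = in.readInt()
    val buf = new Array[Byte](len)
    in.readFully(buf)
    RawBuffer(buf)
  }
}

// Round trip: the serialized form can be read back on any JVM.
val outBytes = new ByteArrayOutputStream()
RawBuffer.write(new DataOutputStream(outBytes), RawBuffer(Array[Byte](1, 2, 3)))
val back = RawBuffer.read(new DataInputStream(new ByteArrayInputStream(outBytes.toByteArray)))
assert(back.bytes.sameElements(Array[Byte](1, 2, 3)))
```

A deserializer that instead computed offsets from the local Platform array base offset would break whenever writer and reader JVMs disagree on compressed Oops.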
[jira] [Created] (SPARK-27485) Certain query plans fail to run when autoBroadcastJoinThreshold is set to -1
Muthu Jayakumar created SPARK-27485:
-----------------------------------

             Summary: Certain query plans fail to run when autoBroadcastJoinThreshold is set to -1
                 Key: SPARK-27485
                 URL: https://issues.apache.org/jira/browse/SPARK-27485
             Project: Spark
          Issue Type: Bug
          Components: Optimizer, SQL
    Affects Versions: 2.4.0
            Reporter: Muthu Jayakumar

Certain queries fail with

{noformat}
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:349)
at scala.None$.get(Option.scala:347)
at org.apache.spark.sql.execution.exchange.EnsureRequirements.$anonfun$reorder$1(EnsureRequirements.scala:238)
at org.apache.spark.sql.execution.exchange.EnsureRequirements.$anonfun$reorder$1$adapted(EnsureRequirements.scala:233)
at scala.collection.immutable.List.foreach(List.scala:388)
at org.apache.spark.sql.execution.exchange.EnsureRequirements.reorder(EnsureRequirements.scala:233)
at org.apache.spark.sql.execution.exchange.EnsureRequirements.reorderJoinKeys(EnsureRequirements.scala:262)
at org.apache.spark.sql.execution.exchange.EnsureRequirements.org$apache$spark$sql$execution$exchange$EnsureRequirements$$reorderJoinPredicates(EnsureRequirements.scala:289)
at org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:304)
at org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:296)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$4(TreeNode.scala:282)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:282)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275)
at org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:296)
at org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:38)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$prepareForExecution$1(QueryExecution.scala:87)
at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:122)
at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:118)
at scala.collection.immutable.List.foldLeft(List.scala:85)
{noformat}

I don't have an exact query to reproduce this, but I can try to construct one if this problem hasn't been reported before.
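[Editor's note] As the summary states, the failure was observed only with {{spark.sql.autoBroadcastJoinThreshold}} set to -1, which disables broadcast joins and forces EnsureRequirements down the sort-merge-join key-reordering path where the {{None.get}} is thrown. A hedged sketch of how one might toggle the setting while hunting for a reproducer; the surrounding query is hypothetical and {{spark}} is assumed to be an active SparkSession:

```scala
// Sketch, not a confirmed reproducer. Disabling broadcast joins is what the
// reporter says triggers the EnsureRequirements.reorder failure path.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
// ... run the suspect multi-way equi-join here ...

// Restoring the default (10 MB) re-enables broadcast joins, which should
// avoid the sort-merge-join key-reordering code entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760")
```

This only narrows the trigger; it does not fix the underlying assumption in EnsureRequirements.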
[jira] [Assigned] (SPARK-27483) move the data source v2 fallback to v1 logic to an analyzer rule
[ https://issues.apache.org/jira/browse/SPARK-27483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-27483:
-----------------------------------

    Assignee: (was: Wenchen Fan)

> move the data source v2 fallback to v1 logic to an analyzer rule
> ----------------------------------------------------------------
>
>                 Key: SPARK-27483
>                 URL: https://issues.apache.org/jira/browse/SPARK-27483
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Wenchen Fan
>            Priority: Major
[jira] [Created] (SPARK-27484) create the streaming writing logical plan node before query is analyzed
Wenchen Fan created SPARK-27484:
-----------------------------------

             Summary: create the streaming writing logical plan node before query is analyzed
                 Key: SPARK-27484
                 URL: https://issues.apache.org/jira/browse/SPARK-27484
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Wenchen Fan
[jira] [Created] (SPARK-27483) move the data source v2 fallback to v1 logic to an analyzer rule
Wenchen Fan created SPARK-27483:
-----------------------------------

             Summary: move the data source v2 fallback to v1 logic to an analyzer rule
                 Key: SPARK-27483
                 URL: https://issues.apache.org/jira/browse/SPARK-27483
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Wenchen Fan
            Assignee: Wenchen Fan
[jira] [Updated] (SPARK-27482) Show BroadcastHashJoinExec numOutputRows statistics info on SparkSQL UI page
[ https://issues.apache.org/jira/browse/SPARK-27482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

peng bo updated SPARK-27482:
----------------------------
    Description: 
Currently, the {{SparkSQL}} UI page shows only actual {{metric}} info in each {{SparkPlan}} node. However, {{statistics}} info may help us understand how the plan was designed and why it runs slowly.

This issue is to show {{BroadcastHashJoinExec}} {{numOutputRows statistic}} info on the {{SparkSQL}} UI page first, when it's available.

  was:
Currently, the {{SparkSQL}} UI page shows only actual {{metric}} info in each {{SparkPlan}} node. However, {{statistics}} info may help us understand how the plan was designed and why it runs slowly.

This issue is to show {{BroadcastHashJoinExec}} {{numOutputRows statistic}} info on the {{SparkSQL}} UI page first.

> Show BroadcastHashJoinExec numOutputRows statistics info on SparkSQL UI page
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-27482
>                 URL: https://issues.apache.org/jira/browse/SPARK-27482
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL, Web UI
>    Affects Versions: 3.0.0
>            Reporter: peng bo
>            Priority: Major
>         Attachments: SPARK-27482-1.png
>
> Currently, the {{SparkSQL}} UI page shows only actual {{metric}} info in each
> {{SparkPlan}} node. However, {{statistics}} info may help us understand
> how the plan was designed and why it runs slowly.
> This issue is to show {{BroadcastHashJoinExec}} {{numOutputRows statistic}}
> info on the {{SparkSQL}} UI page first, when it's available.
[jira] [Updated] (SPARK-27482) Show BroadcastHashJoinExec numOutputRows statistics info on SparkSQL UI page
[ https://issues.apache.org/jira/browse/SPARK-27482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

peng bo updated SPARK-27482:
----------------------------
    Summary: Show BroadcastHashJoinExec numOutputRows statistics info on SparkSQL UI page  (was: Show BroadcastHashJoinExec numOutputRows statistic info on SparkSQL UI page)

> Show BroadcastHashJoinExec numOutputRows statistics info on SparkSQL UI page
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-27482
>                 URL: https://issues.apache.org/jira/browse/SPARK-27482
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL, Web UI
>    Affects Versions: 3.0.0
>            Reporter: peng bo
>            Priority: Major
>         Attachments: SPARK-27482-1.png
>
> Currently, the {{SparkSQL}} UI page shows only actual {{metric}} info in each
> {{SparkPlan}} node. However, {{statistics}} info may help us understand
> how the plan was designed and why it runs slowly.
> This issue is to show {{BroadcastHashJoinExec}} {{numOutputRows statistic}}
> info on the {{SparkSQL}} UI page first.
[jira] [Updated] (SPARK-27482) Show BroadcastHashJoinExec numOutputRows statistic info on SparkSQL UI page
[ https://issues.apache.org/jira/browse/SPARK-27482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

peng bo updated SPARK-27482:
----------------------------
    Attachment: SPARK-27482-1.png

> Show BroadcastHashJoinExec numOutputRows statistic info on SparkSQL UI page
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-27482
>                 URL: https://issues.apache.org/jira/browse/SPARK-27482
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL, Web UI
>    Affects Versions: 3.0.0
>            Reporter: peng bo
>            Priority: Major
>         Attachments: SPARK-27482-1.png
>
> Currently, the {{SparkSQL}} UI page shows only actual {{metric}} info in each
> {{SparkPlan}} node. However, {{statistics}} info may help us understand
> how the plan was designed and why it runs slowly.
> This issue is to show {{BroadcastHashJoinExec}} {{numOutputRows statistic}}
> info on the {{SparkSQL}} UI page first.
[jira] [Created] (SPARK-27482) Show BroadcastHashJoinExec numOutputRows statistic info on SparkSQL UI page
peng bo created SPARK-27482:
-------------------------------

             Summary: Show BroadcastHashJoinExec numOutputRows statistic info on SparkSQL UI page
                 Key: SPARK-27482
                 URL: https://issues.apache.org/jira/browse/SPARK-27482
             Project: Spark
          Issue Type: Improvement
          Components: SQL, Web UI
    Affects Versions: 3.0.0
            Reporter: peng bo

Currently, the {{SparkSQL}} UI page shows only actual {{metric}} info in each {{SparkPlan}} node. However, {{statistics}} info may help us understand how the plan was designed and why it runs slowly.

This issue is to show {{BroadcastHashJoinExec}} {{numOutputRows statistic}} info on the {{SparkSQL}} UI page first.
[jira] [Resolved] (SPARK-27479) Hide API docs for "org.apache.spark.util.kvstore"
[ https://issues.apache.org/jira/browse/SPARK-27479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-27479.
-----------------------------
       Resolution: Fixed
    Fix Version/s: 2.4.2

> Hide API docs for "org.apache.spark.util.kvstore"
> -------------------------------------------------
>
>                 Key: SPARK-27479
>                 URL: https://issues.apache.org/jira/browse/SPARK-27479
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build
>    Affects Versions: 2.3.3, 2.4.1, 3.0.0
>            Reporter: Xiao Li
>            Assignee: Xiao Li
>            Priority: Major
>             Fix For: 2.4.2
>
> The API docs should not include the "org.apache.spark.util.kvstore" package,
> because it contains internal, private APIs.
[jira] [Commented] (SPARK-27468) "Storage Level" in "RDD Storage Page" is not correct
[ https://issues.apache.org/jira/browse/SPARK-27468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819679#comment-16819679 ]

shahid commented on SPARK-27468:
--------------------------------

Hi [~srfnmnk], I tried to reproduce this on the master branch. The steps followed are shown below:

1) bin/spark-shell --master local[2]

{code:java}
scala> import org.apache.spark.storage.StorageLevel
scala> val rdd = sc.parallelize(1 to 10, 1).persist(StorageLevel.MEMORY_ONLY_2)
scala> rdd.count
{code}

The Storage tab in the UI is shown below:

!Screenshot from 2019-04-17 10-42-55.png!

So it seems I am not able to reproduce the issue. Could you please tell me whether the test steps are correct, or whether I need to enable any configuration? Thank you.

> "Storage Level" in "RDD Storage Page" is not correct
> ----------------------------------------------------
>
>                 Key: SPARK-27468
>                 URL: https://issues.apache.org/jira/browse/SPARK-27468
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.1
>            Reporter: Shixiong Zhu
>            Priority: Major
>         Attachments: Screenshot from 2019-04-17 10-42-55.png
>
> I ran the following unit test and checked the UI.
> {code}
> val conf = new SparkConf()
>   .setAppName("test")
>   .setMaster("local-cluster[2,1,1024]")
>   .set("spark.ui.enabled", "true")
> sc = new SparkContext(conf)
> val rdd = sc.makeRDD(1 to 10, 1).persist(StorageLevel.MEMORY_ONLY_2)
> rdd.count()
> Thread.sleep(360)
> {code}
> The storage level is "Memory Deserialized 1x Replicated" in the RDD storage
> page.
> I tried to debug and found this is because Spark emitted the following two
> events:
> {code}
> event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1,
> 10.8.132.160, 65473, None),rdd_0_0,StorageLevel(memory, deserialized, 2
> replicas),56,0))
> event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0,
> 10.8.132.160, 65474, None),rdd_0_0,StorageLevel(memory, deserialized, 1
> replicas),56,0))
> {code}
> The storage level in the second event will overwrite the first one. "1
> replicas" comes from this line:
> https://github.com/apache/spark/blob/3ab96d7acf870e53c9016b0b63d0b328eec23bed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1457
> Maybe AppStatusListener should calculate the replicas from events?
> Another fact we may need to think about: when replicas is 2, will the two
> Spark events arrive in the same order? Currently, two RPCs from different
> executors can arrive in any order.
> Credit goes to [~srfnmnk] who reported this issue originally.
[jira] [Updated] (SPARK-27468) "Storage Level" in "RDD Storage Page" is not correct
[ https://issues.apache.org/jira/browse/SPARK-27468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

shahid updated SPARK-27468:
---------------------------
    Attachment: Screenshot from 2019-04-17 10-42-55.png

> "Storage Level" in "RDD Storage Page" is not correct
> ----------------------------------------------------
>
>                 Key: SPARK-27468
>                 URL: https://issues.apache.org/jira/browse/SPARK-27468
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.1
>            Reporter: Shixiong Zhu
>            Priority: Major
>         Attachments: Screenshot from 2019-04-17 10-42-55.png
>
> I ran the following unit test and checked the UI.
> {code}
> val conf = new SparkConf()
>   .setAppName("test")
>   .setMaster("local-cluster[2,1,1024]")
>   .set("spark.ui.enabled", "true")
> sc = new SparkContext(conf)
> val rdd = sc.makeRDD(1 to 10, 1).persist(StorageLevel.MEMORY_ONLY_2)
> rdd.count()
> Thread.sleep(360)
> {code}
> The storage level is "Memory Deserialized 1x Replicated" in the RDD storage
> page.
> I tried to debug and found this is because Spark emitted the following two
> events:
> {code}
> event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1,
> 10.8.132.160, 65473, None),rdd_0_0,StorageLevel(memory, deserialized, 2
> replicas),56,0))
> event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0,
> 10.8.132.160, 65474, None),rdd_0_0,StorageLevel(memory, deserialized, 1
> replicas),56,0))
> {code}
> The storage level in the second event will overwrite the first one. "1
> replicas" comes from this line:
> https://github.com/apache/spark/blob/3ab96d7acf870e53c9016b0b63d0b328eec23bed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1457
> Maybe AppStatusListener should calculate the replicas from events?
> Another fact we may need to think about: when replicas is 2, will the two
> Spark events arrive in the same order? Currently, two RPCs from different
> executors can arrive in any order.
> Credit goes to [~srfnmnk] who reported this issue originally.
[jira] [Issue Comment Deleted] (SPARK-24630) SPIP: Support SQLStreaming in Spark
[ https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Zhang updated SPARK-24630:
--------------------------------
    Comment: was deleted

(was: thanks [~Jackey Lee] So I'm wondering what's blocking the PR for this issue from being merged; is it related to DataSourceV2?)

> SPIP: Support SQLStreaming in Spark
> -----------------------------------
>
>                 Key: SPARK-24630
>                 URL: https://issues.apache.org/jira/browse/SPARK-24630
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.2.0, 2.2.1
>            Reporter: Jackey Lee
>            Priority: Minor
>              Labels: SQLStreaming
>         Attachments: SQLStreaming SPIP V2.pdf
>
> At present, KafkaSQL, Flink SQL (which is actually based on Calcite),
> SQLStream, and StormSQL all provide a stream-type SQL interface, with which
> users with little knowledge about streaming can easily develop a stream
> processing model. In Spark, we can also support a SQL API based on
> StructStreaming.
> To support SQL Streaming, there are two key points:
> 1. Analysis should be able to parse streaming-type SQL.
> 2. The analyzer should be able to map metadata information to the
> corresponding Relation.
[jira] [Created] (SPARK-27481) Upgrade commons-logging to 1.1.3
Yuming Wang created SPARK-27481:
-----------------------------------

             Summary: Upgrade commons-logging to 1.1.3
                 Key: SPARK-27481
                 URL: https://issues.apache.org/jira/browse/SPARK-27481
             Project: Spark
          Issue Type: Sub-task
          Components: Build
    Affects Versions: 3.0.0
            Reporter: Yuming Wang

We may hit the {{LogConfigurationException}} when upgrading the built-in Hive to 2.3.4:

{noformat}
bin/spark-sql --conf spark.sql.hive.metastore.version=1.2.2 --conf spark.sql.hive.metastore.jars=file:///apache/hive-1.2.2-bin/lib/*
...
19/04/16 19:04:06 ERROR main ShimLoader: Error loading shims
java.lang.ExceptionInInitializerError
at org.apache.hadoop.hive.shims.ShimLoader.getMajorVersion(ShimLoader.java:143)
at org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:122)
at org.apache.hadoop.hive.shims.ShimLoader.getHadoopShims(ShimLoader.java:88)
at org.apache.hadoop.hive.conf.HiveConf$ConfVars.(HiveConf.java:371)
at org.apache.hadoop.hive.conf.HiveConf.(HiveConf.java:108)
at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:154)
at org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:119)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:296)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:410)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:305)
at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:68)
at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:67)
at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$databaseExists$1(HiveExternalCatalog.scala:217)
at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:217)
at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:139)
at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:129)
at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:52)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:311)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:164)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:860)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:178)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:201)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:88)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:939)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:948)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.commons.logging.LogConfigurationException: org.apache.commons.logging.LogConfigurationException: org.apache.commons.logging.LogConfigurationException: Invalid class loader hierarchy. You have more than one version of 'org.apache.commons.logging.Log' visible, which is not allowed. (Caused by org.apache.commons.logging.LogConfigurationException: Invalid class loader hierarchy. You have more than one version of 'org.apache.commons.logging.Log' visible, which is not allowed.) (Caused by org.apache.commons.logging.LogConfigurationException: org.apache.commons.logging.LogConfigurationException: Invalid class loader hierarchy. You have more than one version of 'org.apache.commons.logging.Log' visible, which is not allowed. (Caused by org.apache.commons.logging.LogConfigurationException: Invalid class loader hierarchy. You have more than one version of 'org.apache.commons.logging.Log' visible, which is not allowed.))
at org.apache.commons.logging.impl.LogFactoryImpl.newInstance(LogFactoryImpl.java:543)
at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:235)
at
{noformat}
[jira] [Comment Edited] (SPARK-16859) History Server storage information is missing
[ https://issues.apache.org/jira/browse/SPARK-16859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819662#comment-16819662 ]

shahid edited comment on SPARK-16859 at 4/17/19 2:07 AM:
---------------------------------------------------------

[~Hauer] [~toopt4] Can you try enabling "spark.eventLog.logBlockUpdates.enabled=true" and see if the History Server storage tab is still empty?


was (Author: shahid):
[~Hauer][~toopt4] Can you try enabling "spark.eventLog.logBlockUpdates.enabled=true" and see if the History Server storage tab is still empty?

> History Server storage information is missing
> ---------------------------------------------
>
>                 Key: SPARK-16859
>                 URL: https://issues.apache.org/jira/browse/SPARK-16859
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.2, 2.0.0
>            Reporter: Andrei Ivanov
>            Priority: Major
>              Labels: historyserver, newbie
>
> It looks like the job history storage tab in the history server has been
> broken for completed jobs since *1.6.2*.
> More specifically, it's been broken since
> [SPARK-13845|https://issues.apache.org/jira/browse/SPARK-13845].
> I've fixed it for my installation by effectively reverting the above patch
> ([see|https://github.com/EinsamHauer/spark/commit/3af62ea09af8bb350c8c8a9117149c09b8feba08]).
> IMHO, the most straightforward fix would be to implement
> _SparkListenerBlockUpdated_ serialization to JSON in _JsonProtocol_, making
> sure it works from _ReplayListenerBus_.
> The downside is that it will still work incorrectly with pre-patch job
> histories. But then, it hasn't worked since *1.6.2* anyhow.
> PS: I'd really love to have this fixed eventually, but I'm pretty new to
> Apache Spark and missing hands-on Scala experience. So I'd prefer that it be
> fixed by someone experienced with roadmap vision. If nobody volunteers, I'll
> try to patch it myself.
[jira] [Commented] (SPARK-16859) History Server storage information is missing
[ https://issues.apache.org/jira/browse/SPARK-16859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819662#comment-16819662 ]

shahid commented on SPARK-16859:
--------------------------------

[~Hauer][~toopt4] Can you try enabling "spark.eventLog.logBlockUpdates.enabled=true" and see if the History Server storage tab is still empty?

> History Server storage information is missing
> ---------------------------------------------
>
>                 Key: SPARK-16859
>                 URL: https://issues.apache.org/jira/browse/SPARK-16859
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.2, 2.0.0
>            Reporter: Andrei Ivanov
>            Priority: Major
>              Labels: historyserver, newbie
>
> It looks like the job history storage tab in the history server has been
> broken for completed jobs since *1.6.2*.
> More specifically, it's been broken since
> [SPARK-13845|https://issues.apache.org/jira/browse/SPARK-13845].
> I've fixed it for my installation by effectively reverting the above patch
> ([see|https://github.com/EinsamHauer/spark/commit/3af62ea09af8bb350c8c8a9117149c09b8feba08]).
> IMHO, the most straightforward fix would be to implement
> _SparkListenerBlockUpdated_ serialization to JSON in _JsonProtocol_, making
> sure it works from _ReplayListenerBus_.
> The downside is that it will still work incorrectly with pre-patch job
> histories. But then, it hasn't worked since *1.6.2* anyhow.
> PS: I'd really love to have this fixed eventually, but I'm pretty new to
> Apache Spark and missing hands-on Scala experience. So I'd prefer that it be
> fixed by someone experienced with roadmap vision. If nobody volunteers, I'll
> try to patch it myself.
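[Editor's note] For readers following the suggestion above: block-update events are not written to the event log by default (they can be voluminous), so the history server can only show storage information for an application if the setting was enabled while that application ran. In spark-defaults.conf this would look like:

```
spark.eventLog.enabled                   true
spark.eventLog.logBlockUpdates.enabled   true
```

The same pair can be passed per-application via --conf on spark-submit.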
[jira] [Created] (SPARK-27480) Improve explain output of describe query command to show the actual input query as opposed to a truncated logical plan.
Dilip Biswal created SPARK-27480:
------------------------------------

             Summary: Improve explain output of describe query command to show the actual input query as opposed to a truncated logical plan.
                 Key: SPARK-27480
                 URL: https://issues.apache.org/jira/browse/SPARK-27480
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.1
            Reporter: Dilip Biswal

Currently, running explain on a describe query gives somewhat confusing output: instead of showing the actual query input by the user, it shows the truncated logical plan as the input. We should improve it to show the query text as entered by the user.

Here are sample outputs of the explain command.

{code:java}
EXPLAIN DESCRIBE WITH s AS (SELECT 'hello' as col1) SELECT * FROM s;
== Physical Plan ==
Execute DescribeQueryCommand
+- DescribeQueryCommand CTE [s]
{code}

{code:java}
EXPLAIN EXTENDED DESCRIBE SELECT * from s1 where c1 > 0;
== Physical Plan ==
Execute DescribeQueryCommand
+- DescribeQueryCommand 'Project [*]
{code}
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819592#comment-16819592 ] Robert Joseph Evans commented on SPARK-27396: - This SPIP is to put a framework in place to be able to support columnar processing, but not to actually implement that processing. Implementations would be provided by extensions. Those extensions would have a few options on how to allocate memory for results and/or intermediate results. The contract really is just around the ColumnarBatch and ColumnVector classes, so an extension could use built in implementations, similar to the on heap, off heap, and arrow column vector implementations in the current vesion of Spark. We would provide a config for what the default should be and an API to be able to allocate one of these vectors based off of that config, but in some cases the extension may want to supply their own implementation, similar to how the ORC FileFormat currently does. > SPIP: Public APIs for extended Columnar Processing Support > -- > > Key: SPARK-27396 > URL: https://issues.apache.org/jira/browse/SPARK-27396 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Robert Joseph Evans >Priority: Major > > *Q1.* What are you trying to do? Articulate your objectives using absolutely > no jargon. > > The Dataset/DataFrame API in Spark currently only exposes to users one row at > a time when processing data. The goals of this are to > > # Expose to end users a new option of processing the data in a columnar > format, multiple rows at a time, with the data organized into contiguous > arrays in memory. > # Make any transitions between the columnar memory layout and a row based > layout transparent to the end user. > # Allow for simple data exchange with other systems, DL/ML libraries, > pandas, etc. by having clean APIs to transform the columnar data into an > Apache Arrow compatible layout. 
> # Provide a plugin mechanism for columnar processing support so an advanced > user could avoid data transition between columnar and row based processing > even through shuffles. This means we should at least support pluggable APIs > so an advanced end user can implement the columnar partitioning themselves, > and provide the glue necessary to shuffle the data still in a columnar format. > # Expose new APIs that allow advanced users or frameworks to implement > columnar processing either as UDFs, or by adjusting the physical plan to do > columnar processing. If the latter is too controversial we can move it to > another SPIP, but we plan to implement some accelerated computing in parallel > with this feature to be sure the APIs work, and without this feature it makes > that impossible. > > Not Requirements, but things that would be nice to have. > # Provide default implementations for partitioning columnar data, so users > don’t have to. > # Transition the existing in memory columnar layouts to be compatible with > Apache Arrow. This would make the transformations to Apache Arrow format a > no-op. The existing formats are already very close to those layouts in many > cases. This would not be using the Apache Arrow java library, but instead > being compatible with the memory > [layout|https://arrow.apache.org/docs/format/Layout.html] and possibly only a > subset of that layout. > # Provide a clean transition from the existing code to the new one. The > existing APIs which are public but evolving are not that far off from what is > being proposed. We should be able to create a new parallel API that can wrap > the existing one. This means any file format that is trying to support > columnar can still do so until we make a conscious decision to deprecate and > then turn off the old APIs. > > *Q2.* What problem is this proposal NOT designed to solve? 
> This is not trying to implement any of the processing itself in a columnar > way, with the exception of examples for documentation, and possibly default > implementations for partitioning of columnar shuffle. > > *Q3.* How is it done today, and what are the limits of current practice? > The current columnar support is limited to 3 areas. > # Input formats can optionally return a ColumnarBatch instead of rows. The > code generation phase knows how to take that columnar data and iterate > through it as rows for stages that want rows, which currently is almost > everything. The limitations here are mostly implementation specific. The > current standard is to abuse Scala’s type erasure to return ColumnarBatches > as the elements of an RDD[InternalRow]. The code generation can handle this > because it is generating Java code, so it bypasses Scala’s type checking
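The columnar-to-row transition described in Q1 and Q3 above can be sketched in plain Python (a hypothetical illustration of the idea, not Spark's actual ColumnarBatch/ColumnVector API): each column lives as a contiguous array, and a row iterator is derived on top of the columnar storage, which is roughly what the code generation phase does for stages that want rows.

```python
# Hypothetical sketch of a columnar batch that can also be read row by row,
# mirroring how Spark's code generation iterates columnar data as rows.
class ColumnarBatch:
    def __init__(self, columns):
        # columns: dict mapping column name -> contiguous list of values
        self.columns = columns

    def num_rows(self):
        # All columns hold the same number of values.
        return len(next(iter(self.columns.values())))

    def row_iterator(self):
        # Derive a row view without copying the columnar storage.
        names = list(self.columns)
        for i in range(self.num_rows()):
            yield {n: self.columns[n][i] for n in names}

batch = ColumnarBatch({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})
rows = list(batch.row_iterator())
```

The point of the sketch is that the row view is cheap to construct, so transitions between the two layouts can stay transparent to the end user.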
[jira] [Resolved] (SPARK-25348) Data source for binary files
[ https://issues.apache.org/jira/browse/SPARK-25348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-25348. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24354 [https://github.com/apache/spark/pull/24354] > Data source for binary files > > > Key: SPARK-25348 > URL: https://issues.apache.org/jira/browse/SPARK-25348 > Project: Spark > Issue Type: Story > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Weichen Xu >Priority: Major > Fix For: 3.0.0 > > > It would be useful to have a data source implementation for binary files, > which can be used to build features to load images, audio, and videos. > Microsoft has an implementation at > [https://github.com/Azure/mmlspark/tree/master/src/io/binary.] It would be > great if we can merge it into Spark main repo. > cc: [~mhamilton] and [~imatiach] > Proposed API: > Format name: "binaryFile" > Schema: > * content: BinaryType > * status (following Hadoop FIleStatus): > ** path: StringType > ** modificationTime: Timestamp > ** length: LongType (size limit 2GB) > Options: > * pathGlobFilter: only include files with path matching the glob pattern > Input partition size can be controlled by common SQL confs: maxPartitionBytes > and openCostInBytes -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27453) DataFrameWriter.partitionBy is Silently Dropped by DSV1
[ https://issues.apache.org/jira/browse/SPARK-27453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-27453. --- Resolution: Fixed Fix Version/s: 2.4.2 3.0.0 Issue resolved by pull request 24365 [https://github.com/apache/spark/pull/24365] > DataFrameWriter.partitionBy is Silently Dropped by DSV1 > --- > > Key: SPARK-27453 > URL: https://issues.apache.org/jira/browse/SPARK-27453 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.4.1 >Reporter: Michael Armbrust >Assignee: Liwen Sun >Priority: Critical > Fix For: 3.0.0, 2.4.2 > > > This is a long standing quirk of the interaction between {{DataFrameWriter}} > and {{CreatableRelationProvider}} (and the other forms of the DSV1 API). > Users can specify columns in {{partitionBy}} and our internal data sources > will use this information. Unfortunately, for external systems, this data is > silently dropped with no feedback given to the user. > In the long run, I think that DataSourceV2 is a better answer. However, I > don't think we should wait for that API to stabilize before offering some > kind of solution to developers of external data sources. I also do not think > we should break binary compatibility of this API, but I do think that small > surgical fix could alleviate the issue. > I would propose that we could propagate partitioning information (when > present) along with the other configuration options passed to the data source > in the {{String, String}} map. > I think its very unlikely that there are both data sources that validate > extra options and users who are using (no-op) partitioning with them, but out > of an abundance of caution we should protect the behavior change behind a > {{legacy}} flag that can be turned off. 
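The surgical fix proposed above — propagating partitionBy columns through the existing {{String, String}} options map — could look like the following sketch. The key name and JSON encoding here are assumptions for illustration; the actual patch may use different names:

```python
import json

# Hypothetical option key used to carry partitioning information through
# the String -> String options map handed to a DSV1 data source.
PARTITIONING_COLUMNS_KEY = "__partition_columns"

def with_partitioning(options, partition_columns):
    """Return a new options map that also carries the partitionBy columns."""
    out = dict(options)
    if partition_columns:
        out[PARTITIONING_COLUMNS_KEY] = json.dumps(partition_columns)
    return out

def decode_partitioning(options):
    """A DSV1 source can recover the columns instead of silently dropping them."""
    raw = options.get(PARTITIONING_COLUMNS_KEY)
    return json.loads(raw) if raw is not None else []

opts = with_partitioning({"path": "/tmp/out"}, ["year", "month"])
```

Sources that validate unknown options would need the legacy flag mentioned above to opt out of receiving the extra key.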
[jira] [Resolved] (SPARK-27452) Update zstd-jni to 1.3.8-9
[ https://issues.apache.org/jira/browse/SPARK-27452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27452. --- Resolution: Fixed Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/24364 > Update zstd-jni to 1.3.8-9 > -- > > Key: SPARK-27452 > URL: https://issues.apache.org/jira/browse/SPARK-27452 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.0.0 > > > This issue updates `zstd-jni` from 1.3.2-2 to 1.3.8-7 to be aligned with the > latest Zstd 1.3.8 seamlessly. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27467) Upgrade Maven to 3.6.1
[ https://issues.apache.org/jira/browse/SPARK-27467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27467. --- Resolution: Fixed Assignee: Dongjoon Hyun Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/24377 > Upgrade Maven to 3.6.1 > -- > > Key: SPARK-27467 > URL: https://issues.apache.org/jira/browse/SPARK-27467 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > This issue aim to upgrade Maven to 3.6.1 to bring JDK9+ patches like > MNG-6506. For the full release note, please see the following. > https://maven.apache.org/docs/3.6.1/release-notes.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27476) Refactoring SchemaPruning rule to remove duplicate code
[ https://issues.apache.org/jira/browse/SPARK-27476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27476. --- Resolution: Fixed Assignee: Liang-Chi Hsieh Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/24383 > Refactoring SchemaPruning rule to remove duplicate code > --- > > Key: SPARK-27476 > URL: https://issues.apache.org/jira/browse/SPARK-27476 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Minor > Fix For: 3.0.0 > > > In {{SchemaPruning}} rule, there is duplicate code for data source v1 and v2. > Their logic is the same and we can refactor the rule to remove duplicate code. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27479) Hide API docs for "org.apache.spark.util.kvstore"
Xiao Li created SPARK-27479: --- Summary: Hide API docs for "org.apache.spark.util.kvstore" Key: SPARK-27479 URL: https://issues.apache.org/jira/browse/SPARK-27479 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 2.4.1, 2.3.3, 3.0.0 Reporter: Xiao Li Assignee: Xiao Li The API docs should not include the "org.apache.spark.util.kvstore" package because it contains internal, private APIs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27409) Micro-batch support for Kafka Source in Spark 2.3
[ https://issues.apache.org/jira/browse/SPARK-27409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819473#comment-16819473 ] Prabhjot Singh Bharaj commented on SPARK-27409: --- [~gsomogyi] I haven't tried it on master. I'm facing the problem with Spark 2.3.2 Here is a complete log - {code:java} ➜ ~/spark ((HEAD detached at v2.3.2)) ✗ ./bin/pyspark --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.2 Python 2.7.10 (default, Feb 22 2019, 21:17:52) [GCC 4.2.1 Compatible Apple LLVM 10.0.1 (clang-1001.0.37.14)] on darwin Type "help", "copyright", "credits" or "license" for more information. Ivy Default Cache set to: //.ivy2/cache The jars for the packages stored in: //.ivy2/jars :: loading settings :: url = jar:file://spark/assembly/target/scala-2.11/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml org.apache.spark#spark-sql-kafka-0-10_2.11 added as a dependency :: resolving dependencies :: org.apache.spark#spark-submit-parent-b75b99d4-ae39-49b0-b366-8b718542b4f8;1.0 confs: [default] found org.apache.spark#spark-sql-kafka-0-10_2.11;2.3.2 in central found org.apache.kafka#kafka-clients;0.10.0.1 in local-m2-cache found net.jpountz.lz4#lz4;1.3.0 in local-m2-cache found org.xerial.snappy#snappy-java;1.1.2.6 in local-m2-cache found org.slf4j#slf4j-api;1.7.16 in spark-list found org.spark-project.spark#unused;1.0.0 in local-m2-cache :: resolution report :: resolve 1580ms :: artifacts dl 4ms :: modules in use: net.jpountz.lz4#lz4;1.3.0 from local-m2-cache in [default] org.apache.kafka#kafka-clients;0.10.0.1 from local-m2-cache in [default] org.apache.spark#spark-sql-kafka-0-10_2.11;2.3.2 from central in [default] org.slf4j#slf4j-api;1.7.16 from spark-list in [default] org.spark-project.spark#unused;1.0.0 from local-m2-cache in [default] org.xerial.snappy#snappy-java;1.1.2.6 from local-m2-cache in [default] - | | modules || artifacts | | conf | number| search|dwnlded|evicted|| number|dwnlded| - | default | 6 | 2 | 2 | 0 || 6 | 0 | 
- :: problems summary :: ERRORS unknown resolver null unknown resolver null :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS :: retrieving :: org.apache.spark#spark-submit-parent-b75b99d4-ae39-49b0-b366-8b718542b4f8 confs: [default] 0 artifacts copied, 6 already retrieved (0kB/6ms) 19/04/16 16:31:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 2.3.2 /_/ Using Python version 2.7.10 (default, Feb 22 2019 21:17:52) SparkSession available as 'spark'. >>> df = spark.readStream.format('kafka').option('kafka.bootstrap.servers', >>> 'localhost:9093').option("kafka.security.protocol", >>> "SSL",).option("kafka.ssl.keystore.password", >>> "").option("kafka.ssl.keystore.type", >>> "PKCS12").option("kafka.ssl.keystore.location", >>> 'non-existent').option('subscribe', 'no existing topic').load() Traceback (most recent call last): File "", line 1, in File "//spark/python/pyspark/sql/streaming.py", line 403, in load return self._df(self._jreader.load()) File "//spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__ File "//spark/python/pyspark/sql/utils.py", line 63, in deco return f(*a, **kw) File "//spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o37.load. 
: org.apache.kafka.common.KafkaException: Failed to construct kafka consumer at org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:702) at org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:557) at org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:540) at org.apache.spark.sql.kafka010.SubscribeStrategy.createConsumer(ConsumerStrategy.scala:62) at org.apache.spark.sql.kafka010.KafkaOffsetReader.createConsumer(KafkaOffsetReader.scala:314) at org.apache.spark.sql.kafka010.KafkaOffsetReader.(KafkaOffsetReader.scala:78) at org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:130) at org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:43) at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:185) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
[jira] [Created] (SPARK-27478) Make HasParallelism public?
Nicholas Resnick created SPARK-27478: Summary: Make HasParallelism public? Key: SPARK-27478 URL: https://issues.apache.org/jira/browse/SPARK-27478 Project: Spark Issue Type: Wish Components: ML Affects Versions: 2.4.1 Reporter: Nicholas Resnick I want to use HasParallelism in some custom Estimators I've written. Is it possible to make this trait (and the getExecutionContext method) public? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27468) "Storage Level" in "RDD Storage Page" is not correct
[ https://issues.apache.org/jira/browse/SPARK-27468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-27468: - Description: I ran the following unit test and checked the UI. {code} val conf = new SparkConf() .setAppName("test") .setMaster("local-cluster[2,1,1024]") .set("spark.ui.enabled", "true") sc = new SparkContext(conf) val rdd = sc.makeRDD(1 to 10, 1).persist(StorageLevel.MEMORY_ONLY_2) rdd.count() Thread.sleep(360) {code} The storage level is "Memory Deserialized 1x Replicated" in the RDD storage page. I tried to debug and found this is because Spark emitted the following two events: {code} event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, 10.8.132.160, 65473, None),rdd_0_0,StorageLevel(memory, deserialized, 2 replicas),56,0)) event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, 10.8.132.160, 65474, None),rdd_0_0,StorageLevel(memory, deserialized, 1 replicas),56,0)) {code} The storage level in the second event will overwrite the first one. "1 replicas" comes from this line: https://github.com/apache/spark/blob/3ab96d7acf870e53c9016b0b63d0b328eec23bed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1457 Maybe AppStatusListener should calculate the replicas from events? Another fact we may need to think about is when replicas is 2, will two Spark events arrive in the same order? Currently, two RPCs from different executors can arrive in any order. Credit goes to [~srfnmnk] who reported this issue originally. was: I ran the following unit test and checked the UI. {code} val conf = new SparkConf() .setAppName("test") .setMaster("local-cluster[2,1,1024]") .set("spark.ui.enabled", "true") sc = new SparkContext(conf) val rdd = sc.makeRDD(1 to 10, 1).persist(StorageLevel.MEMORY_ONLY_2) rdd.count() Thread.sleep(360) {code} The storage level is "Memory Deserialized 1x Replicated" in the RDD storage page. 
I tried to debug and found this is because Spark emitted the following two events: {code} event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, 10.8.132.160, 65473, None),rdd_0_0,StorageLevel(memory, deserialized, 2 replicas),56,0)) event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, 10.8.132.160, 65474, None),rdd_0_0,StorageLevel(memory, deserialized, 1 replicas),56,0)) {code} The storage level in the second event will overwrite the first one. "1 replicas" comes from this line: https://github.com/apache/spark/blob/3ab96d7acf870e53c9016b0b63d0b328eec23bed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1457 Maybe AppStatusListener should calculate the replicas from events? Another fact we may need to think about is when replicas is 2, will two Spark events arrive in the same order? Currently, two RPCs from different executors can arrive in any order. Credit goes to @dani > "Storage Level" in "RDD Storage Page" is not correct > > > Key: SPARK-27468 > URL: https://issues.apache.org/jira/browse/SPARK-27468 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.1 >Reporter: Shixiong Zhu >Priority: Major > > I ran the following unit test and checked the UI. > {code} > val conf = new SparkConf() > .setAppName("test") > .setMaster("local-cluster[2,1,1024]") > .set("spark.ui.enabled", "true") > sc = new SparkContext(conf) > val rdd = sc.makeRDD(1 to 10, 1).persist(StorageLevel.MEMORY_ONLY_2) > rdd.count() > Thread.sleep(360) > {code} > The storage level is "Memory Deserialized 1x Replicated" in the RDD storage > page. 
> I tried to debug and found this is because Spark emitted the following two > events: > {code} > event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, > 10.8.132.160, 65473, None),rdd_0_0,StorageLevel(memory, deserialized, 2 > replicas),56,0)) > event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, > 10.8.132.160, 65474, None),rdd_0_0,StorageLevel(memory, deserialized, 1 > replicas),56,0)) > {code} > The storage level in the second event will overwrite the first one. "1 > replicas" comes from this line: > https://github.com/apache/spark/blob/3ab96d7acf870e53c9016b0b63d0b328eec23bed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1457 > Maybe AppStatusListener should calculate the replicas from events? > Another fact we may need to think about is when replicas is 2, will two Spark > events arrive in the same order? Currently, two RPCs from different executors > can arrive in any order. > Credit goes to [~srfnmnk] who reported this issue originally. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SPARK-27468) "Storage Level" in "RDD Storage Page" is not correct
[ https://issues.apache.org/jira/browse/SPARK-27468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-27468: - Description: I ran the following unit test and checked the UI. {code} val conf = new SparkConf() .setAppName("test") .setMaster("local-cluster[2,1,1024]") .set("spark.ui.enabled", "true") sc = new SparkContext(conf) val rdd = sc.makeRDD(1 to 10, 1).persist(StorageLevel.MEMORY_ONLY_2) rdd.count() Thread.sleep(360) {code} The storage level is "Memory Deserialized 1x Replicated" in the RDD storage page. I tried to debug and found this is because Spark emitted the following two events: {code} event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, 10.8.132.160, 65473, None),rdd_0_0,StorageLevel(memory, deserialized, 2 replicas),56,0)) event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, 10.8.132.160, 65474, None),rdd_0_0,StorageLevel(memory, deserialized, 1 replicas),56,0)) {code} The storage level in the second event will overwrite the first one. "1 replicas" comes from this line: https://github.com/apache/spark/blob/3ab96d7acf870e53c9016b0b63d0b328eec23bed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1457 Maybe AppStatusListener should calculate the replicas from events? Another fact we may need to think about is when replicas is 2, will two Spark events arrive in the same order? Currently, two RPCs from different executors can arrive in any order. Credit goes to @dani was: I ran the following unit test and checked the UI. {code} val conf = new SparkConf() .setAppName("test") .setMaster("local-cluster[2,1,1024]") .set("spark.ui.enabled", "true") sc = new SparkContext(conf) val rdd = sc.makeRDD(1 to 10, 1).persist(StorageLevel.MEMORY_ONLY_2) rdd.count() Thread.sleep(360) {code} The storage level is "Memory Deserialized 1x Replicated" in the RDD storage page. 
I tried to debug and found this is because Spark emitted the following two events: {code} event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, 10.8.132.160, 65473, None),rdd_0_0,StorageLevel(memory, deserialized, 2 replicas),56,0)) event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, 10.8.132.160, 65474, None),rdd_0_0,StorageLevel(memory, deserialized, 1 replicas),56,0)) {code} The storage level in the second event will overwrite the first one. "1 replicas" comes from this line: https://github.com/apache/spark/blob/3ab96d7acf870e53c9016b0b63d0b328eec23bed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1457 Maybe AppStatusListener should calculate the replicas from events? Another fact we may need to think about is when replicas is 2, will two Spark events arrive in the same order? Currently, two RPCs from different executors can arrive in any order. > "Storage Level" in "RDD Storage Page" is not correct > > > Key: SPARK-27468 > URL: https://issues.apache.org/jira/browse/SPARK-27468 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.1 >Reporter: Shixiong Zhu >Priority: Major > > I ran the following unit test and checked the UI. > {code} > val conf = new SparkConf() > .setAppName("test") > .setMaster("local-cluster[2,1,1024]") > .set("spark.ui.enabled", "true") > sc = new SparkContext(conf) > val rdd = sc.makeRDD(1 to 10, 1).persist(StorageLevel.MEMORY_ONLY_2) > rdd.count() > Thread.sleep(360) > {code} > The storage level is "Memory Deserialized 1x Replicated" in the RDD storage > page. 
> I tried to debug and found this is because Spark emitted the following two > events: > {code} > event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, > 10.8.132.160, 65473, None),rdd_0_0,StorageLevel(memory, deserialized, 2 > replicas),56,0)) > event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, > 10.8.132.160, 65474, None),rdd_0_0,StorageLevel(memory, deserialized, 1 > replicas),56,0)) > {code} > The storage level in the second event will overwrite the first one. "1 > replicas" comes from this line: > https://github.com/apache/spark/blob/3ab96d7acf870e53c9016b0b63d0b328eec23bed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1457 > Maybe AppStatusListener should calculate the replicas from events? > Another fact we may need to think about is when replicas is 2, will two Spark > events arrive in the same order? Currently, two RPCs from different executors > can arrive in any order. > Credit goes to @dani -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
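The suggestion above — that AppStatusListener derive the replica count from events rather than trust the storage level of whichever SparkListenerBlockUpdated event arrives last — can be sketched as follows (event shapes simplified for illustration; not the actual listener code):

```python
# Sketch: derive a block's replication factor by counting the distinct
# block managers that currently hold it, so the result does not depend
# on the (unordered) arrival of RPCs from different executors.
class BlockReplicaTracker:
    def __init__(self):
        self.holders = {}  # block_id -> set of block manager ids

    def on_block_updated(self, block_manager_id, block_id, stored):
        s = self.holders.setdefault(block_id, set())
        if stored:
            s.add(block_manager_id)
        else:
            s.discard(block_manager_id)

    def replicas(self, block_id):
        return len(self.holders.get(block_id, ()))

t = BlockReplicaTracker()
# The two events from the description, from two different block managers:
t.on_block_updated("BlockManagerId(1)", "rdd_0_0", True)
t.on_block_updated("BlockManagerId(0)", "rdd_0_0", True)
```

Because the count is a set size, processing the two events in the opposite order yields the same "2x Replicated" answer, which is exactly the order-sensitivity the description worries about.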
[jira] [Comment Edited] (SPARK-11033) Launcher: add support for monitoring standalone/cluster apps
[ https://issues.apache.org/jira/browse/SPARK-11033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819270#comment-16819270 ] Mitesh edited comment on SPARK-11033 at 4/16/19 4:59 PM: - Thanks [~vanzin]! One quick question...is getting this working just a matter of passing the real host of the launcher process to the driver process, instead of using the hardcoded localhost? https://github.com/apache/spark/blob/branch-2.3/core/src/main/scala/org/apache/spark/launcher/LauncherBackend.scala#L48 Because this works in client mode, so I was very confused why it fails for cluster mode, until I saw the hardcoded localhost. Everything else should work fine? was (Author: masterddt): Thanks [~vanzin]! One quick question...is getting this working just a matter of passing the real host of the launcher process to the driver process, instead of using the hardcoded localhost? https://github.com/apache/spark/blob/2.3/core/src/main/scala/org/apache/spark/launcher/LauncherBackend.scala#L48 Because this works in client mode, so I was very confused why it fails for cluster mode, until I saw the hardcoded localhost. Everything else should work fine? > Launcher: add support for monitoring standalone/cluster apps > > > Key: SPARK-11033 > URL: https://issues.apache.org/jira/browse/SPARK-11033 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Priority: Major > > The backend for app monitoring in the launcher library was added in > SPARK-8673, but the code currently does not support standalone cluster mode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11033) Launcher: add support for monitoring standalone/cluster apps
[ https://issues.apache.org/jira/browse/SPARK-11033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819274#comment-16819274 ] Marcelo Vanzin commented on SPARK-11033: It's definitely not that simple. > Launcher: add support for monitoring standalone/cluster apps > > > Key: SPARK-11033 > URL: https://issues.apache.org/jira/browse/SPARK-11033 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Priority: Major > > The backend for app monitoring in the launcher library was added in > SPARK-8673, but the code currently does not support standalone cluster mode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11033) Launcher: add support for monitoring standalone/cluster apps
[ https://issues.apache.org/jira/browse/SPARK-11033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819270#comment-16819270 ] Mitesh commented on SPARK-11033: Thanks [~vanzin]! One quick question...is getting this working just a matter of passing the real host of the launcher process to the driver process, instead of using the hardcoded localhost? https://github.com/apache/spark/blob/2.3/core/src/main/scala/org/apache/spark/launcher/LauncherBackend.scala#L48 Because this works in client mode, so I was very confused why it fails for cluster mode, until I saw the hardcoded localhost. Everything else should work fine? > Launcher: add support for monitoring standalone/cluster apps > > > Key: SPARK-11033 > URL: https://issues.apache.org/jira/browse/SPARK-11033 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Priority: Major > > The backend for app monitoring in the launcher library was added in > SPARK-8673, but the code currently does not support standalone cluster mode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27468) "Storage Level" in "RDD Storage Page" is not correct
[ https://issues.apache.org/jira/browse/SPARK-27468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819228#comment-16819228 ] Gengliang Wang commented on SPARK-27468: [~shahid] Thanks > "Storage Level" in "RDD Storage Page" is not correct > > > Key: SPARK-27468 > URL: https://issues.apache.org/jira/browse/SPARK-27468 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.1 >Reporter: Shixiong Zhu >Priority: Major > > I ran the following unit test and checked the UI. > {code} > val conf = new SparkConf() > .setAppName("test") > .setMaster("local-cluster[2,1,1024]") > .set("spark.ui.enabled", "true") > sc = new SparkContext(conf) > val rdd = sc.makeRDD(1 to 10, 1).persist(StorageLevel.MEMORY_ONLY_2) > rdd.count() > Thread.sleep(360) > {code} > The storage level is "Memory Deserialized 1x Replicated" in the RDD storage > page. > I tried to debug and found this is because Spark emitted the following two > events: > {code} > event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, > 10.8.132.160, 65473, None),rdd_0_0,StorageLevel(memory, deserialized, 2 > replicas),56,0)) > event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, > 10.8.132.160, 65474, None),rdd_0_0,StorageLevel(memory, deserialized, 1 > replicas),56,0)) > {code} > The storage level in the second event will overwrite the first one. "1 > replicas" comes from this line: > https://github.com/apache/spark/blob/3ab96d7acf870e53c9016b0b63d0b328eec23bed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1457 > Maybe AppStatusListener should calculate the replicas from events? > Another fact we may need to think about is when replicas is 2, will two Spark > events arrive in the same order? Currently, two RPCs from different executors > can arrive in any order. 
[jira] [Created] (SPARK-27477) Kafka token provider should have provided dependency on Spark
koert kuipers created SPARK-27477: - Summary: Kafka token provider should have provided dependency on Spark Key: SPARK-27477 URL: https://issues.apache.org/jira/browse/SPARK-27477 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Environment: spark 3.0.0-SNAPSHOT commit 38fc8e2484aa4971d1f2c115da61fc96f36e7868 Author: Sean Owen Date: Sat Apr 13 22:27:25 2019 +0900 [MINOR][DOCS] Fix some broken links in docs Reporter: koert kuipers Fix For: 3.0.0 Currently the external module spark-token-provider-kafka-0-10 has a compile dependency on spark-core. This means spark-sql-kafka-0-10 also has a transitive compile dependency on spark-core. Since spark-sql-kafka-0-10 is not bundled with Spark but instead has to be added to an application that runs on Spark, this dependency should be provided, not compile. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
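The fix described above amounts to marking the Spark dependency with Maven's provided scope in the token provider's POM, so it is available at compile time but not dragged transitively into applications. A hedged sketch of the dependency entry (exact artifact coordinates and property names are illustrative, following Spark's usual build conventions):

```xml
<!-- Illustrative sketch of the scope change in the
     spark-token-provider-kafka-0-10 pom.xml; coordinates are assumed. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_${scala.binary.version}</artifactId>
  <version>${project.version}</version>
  <scope>provided</scope>
</dependency>
```

With provided scope, the Spark runtime supplies spark-core on the cluster, and spark-sql-kafka-0-10 users no longer pull a second copy onto their application classpath.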
[jira] [Commented] (SPARK-11033) Launcher: add support for monitoring standalone/cluster apps
[ https://issues.apache.org/jira/browse/SPARK-11033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819211#comment-16819211 ] Marcelo Vanzin commented on SPARK-11033: I just filed the bug. I have no intention of working on this. > Launcher: add support for monitoring standalone/cluster apps > > > Key: SPARK-11033 > URL: https://issues.apache.org/jira/browse/SPARK-11033 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Priority: Major > > The backend for app monitoring in the launcher library was added in > SPARK-8673, but the code currently does not support standalone cluster mode.
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819181#comment-16819181 ] Kazuaki Ishizaki commented on SPARK-27396: -- I have one question regarding low-level API. In my understanding, this SPIP proposes code generation API for each operation at low-level for exploiting columnar storage. How does this SPIP support to store the result of the generated code into columnar storage? In particular, for {{genColumnarCode()}}. > SPIP: Public APIs for extended Columnar Processing Support > -- > > Key: SPARK-27396 > URL: https://issues.apache.org/jira/browse/SPARK-27396 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Robert Joseph Evans >Priority: Major > > *Q1.* What are you trying to do? Articulate your objectives using absolutely > no jargon. > > The Dataset/DataFrame API in Spark currently only exposes to users one row at > a time when processing data. The goals of this are to > > # Expose to end users a new option of processing the data in a columnar > format, multiple rows at a time, with the data organized into contiguous > arrays in memory. > # Make any transitions between the columnar memory layout and a row based > layout transparent to the end user. > # Allow for simple data exchange with other systems, DL/ML libraries, > pandas, etc. by having clean APIs to transform the columnar data into an > Apache Arrow compatible layout. > # Provide a plugin mechanism for columnar processing support so an advanced > user could avoid data transition between columnar and row based processing > even through shuffles. This means we should at least support pluggable APIs > so an advanced end user can implement the columnar partitioning themselves, > and provide the glue necessary to shuffle the data still in a columnar format. 
> # Expose new APIs that allow advanced users or frameworks to implement > columnar processing either as UDFs, or by adjusting the physical plan to do > columnar processing. If the latter is too controversial we can move it to > another SPIP, but we plan to implement some accelerated computing in parallel > with this feature to be sure the APIs work, and without this feature it makes > that impossible. > > Not Requirements, but things that would be nice to have. > # Provide default implementations for partitioning columnar data, so users > don’t have to. > # Transition the existing in memory columnar layouts to be compatible with > Apache Arrow. This would make the transformations to Apache Arrow format a > no-op. The existing formats are already very close to those layouts in many > cases. This would not be using the Apache Arrow java library, but instead > being compatible with the memory > [layout|https://arrow.apache.org/docs/format/Layout.html] and possibly only a > subset of that layout. > # Provide a clean transition from the existing code to the new one. The > existing APIs which are public but evolving are not that far off from what is > being proposed. We should be able to create a new parallel API that can wrap > the existing one. This means any file format that is trying to support > columnar can still do so until we make a conscious decision to deprecate and > then turn off the old APIs. > > *Q2.* What problem is this proposal NOT designed to solve? > This is not trying to implement any of the processing itself in a columnar > way, with the exception of examples for documentation, and possibly default > implementations for partitioning of columnar shuffle. > > *Q3.* How is it done today, and what are the limits of current practice? > The current columnar support is limited to 3 areas. > # Input formats, optionally can return a ColumnarBatch instead of rows. 
The > code generation phase knows how to take that columnar data and iterate > through it as rows for stages that want rows, which currently is almost > everything. The limitations here are mostly implementation specific. The > current standard is to abuse Scala’s type erasure to return ColumnarBatches > as the elements of an RDD[InternalRow]. The code generation can handle this > because it is generating Java code, so it bypasses Scala’s type checking and > just casts the InternalRow to the desired ColumnarBatch. This makes it > difficult for others to implement the same functionality for different > processing because they can only do it through code generation. There really > is no clean separate path in the code generation for columnar vs row based. > Additionally, because it is only supported through code generation, if for any > reason code generation would fail there is no backup. This is typically fine > for input formats but can
[jira] [Comment Edited] (SPARK-11033) Launcher: add support for monitoring standalone/cluster apps
[ https://issues.apache.org/jira/browse/SPARK-11033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819178#comment-16819178 ] Mitesh edited comment on SPARK-11033 at 4/16/19 4:00 PM: - [~vanzin] Is there any plan to get this working? I'm on 2.3.2 using the standalone scheduler, cluster mode, and InProcessLauncher. We need a way to monitor the launched app, and it sounds like SparkAppHandle.Listener is not supported yet. Do you have a suggestion on how I can monitor the app (for success, failure, and "it disappeared" cases)? was (Author: masterddt): [~vanzin] Is there any plan to get this working? I'm on 2.3.2 using the standalone scheduler, cluster mode, and InProcessLauncher. We need a way to monitor the launched app, and it sounds like SparkAppHandle.Listener is not supported yet. Do you have a suggestion on how I can monitor the app (both for success, failure, and "it disappeared" cases)? > Launcher: add support for monitoring standalone/cluster apps > > > Key: SPARK-11033 > URL: https://issues.apache.org/jira/browse/SPARK-11033 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Priority: Major > > The backend for app monitoring in the launcher library was added in > SPARK-8673, but the code currently does not support standalone cluster mode.
[jira] [Commented] (SPARK-11033) Launcher: add support for monitoring standalone/cluster apps
[ https://issues.apache.org/jira/browse/SPARK-11033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819178#comment-16819178 ] Mitesh commented on SPARK-11033: [~vanzin] Is there any plan to get this working? I'm on 2.3.2 using the standalone scheduler, cluster mode, and InProcessLauncher. We need a way to monitor the launched app, and it sounds like `SparkAppHandle.Listener` is not supported yet. Do you have a suggestion on how I can monitor the app (both for success, failure, and "it disappeared" cases)? > Launcher: add support for monitoring standalone/cluster apps > > > Key: SPARK-11033 > URL: https://issues.apache.org/jira/browse/SPARK-11033 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Priority: Major > > The backend for app monitoring in the launcher library was added in > SPARK-8673, but the code currently does not support standalone cluster mode.
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819153#comment-16819153 ] Thomas Graves commented on SPARK-27396: --- Since I don't hear any strong objections against the idea, I'm going to put the SPIP up for vote on the mailing list. We can continue discussions here or on the list. > SPIP: Public APIs for extended Columnar Processing Support > -- > > Key: SPARK-27396 > URL: https://issues.apache.org/jira/browse/SPARK-27396 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Robert Joseph Evans >Priority: Major > > *Q1.* What are you trying to do? Articulate your objectives using absolutely > no jargon. > > The Dataset/DataFrame API in Spark currently only exposes to users one row at > a time when processing data. The goals of this are to > > # Expose to end users a new option of processing the data in a columnar > format, multiple rows at a time, with the data organized into contiguous > arrays in memory. > # Make any transitions between the columnar memory layout and a row based > layout transparent to the end user. > # Allow for simple data exchange with other systems, DL/ML libraries, > pandas, etc. by having clean APIs to transform the columnar data into an > Apache Arrow compatible layout. > # Provide a plugin mechanism for columnar processing support so an advanced > user could avoid data transition between columnar and row based processing > even through shuffles. This means we should at least support pluggable APIs > so an advanced end user can implement the columnar partitioning themselves, > and provide the glue necessary to shuffle the data still in a columnar format. > # Expose new APIs that allow advanced users or frameworks to implement > columnar processing either as UDFs, or by adjusting the physical plan to do > columnar processing. 
If the latter is too controversial we can move it to > another SPIP, but we plan to implement some accelerated computing in parallel > with this feature to be sure the APIs work, and without this feature it makes > that impossible. > > Not Requirements, but things that would be nice to have. > # Provide default implementations for partitioning columnar data, so users > don’t have to. > # Transition the existing in memory columnar layouts to be compatible with > Apache Arrow. This would make the transformations to Apache Arrow format a > no-op. The existing formats are already very close to those layouts in many > cases. This would not be using the Apache Arrow java library, but instead > being compatible with the memory > [layout|https://arrow.apache.org/docs/format/Layout.html] and possibly only a > subset of that layout. > # Provide a clean transition from the existing code to the new one. The > existing APIs which are public but evolving are not that far off from what is > being proposed. We should be able to create a new parallel API that can wrap > the existing one. This means any file format that is trying to support > columnar can still do so until we make a conscious decision to deprecate and > then turn off the old APIs. > > *Q2.* What problem is this proposal NOT designed to solve? > This is not trying to implement any of the processing itself in a columnar > way, with the exception of examples for documentation, and possibly default > implementations for partitioning of columnar shuffle. > > *Q3.* How is it done today, and what are the limits of current practice? > The current columnar support is limited to 3 areas. > # Input formats, optionally can return a ColumnarBatch instead of rows. The > code generation phase knows how to take that columnar data and iterate > through it as rows for stages that wants rows, which currently is almost > everything. The limitations here are mostly implementation specific. 
The > current standard is to abuse Scala’s type erasure to return ColumnarBatches > as the elements of an RDD[InternalRow]. The code generation can handle this > because it is generating Java code, so it bypasses Scala’s type checking and > just casts the InternalRow to the desired ColumnarBatch. This makes it > difficult for others to implement the same functionality for different > processing because they can only do it through code generation. There really > is no clean separate path in the code generation for columnar vs row based. > Additionally, because it is only supported through code generation, if for any > reason code generation would fail there is no backup. This is typically fine > for input formats but can be problematic when we get into more extensive > processing. > # When caching data it can optionally be cached in a columnar format if the >
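The row-at-a-time versus columnar distinction discussed in Q3 can be illustrated without Spark internals: a batch stores each column as a contiguous array, and a "row" is just an index into those arrays. The type below is a toy model of what a ColumnarBatch provides, not Spark's implementation.

```scala
// Toy model of a columnar batch: one contiguous array per column.
final case class ToyBatch(ids: Array[Int], values: Array[Double]) {
  def numRows: Int = ids.length

  // Row-based view: iterate indices and materialize one row at a time,
  // which is what generated code does for operators that want rows.
  def rowIterator: Iterator[(Int, Double)] =
    Iterator.range(0, numRows).map(i => (ids(i), values(i)))

  // Columnar processing: operate on the whole column array directly,
  // never materializing individual rows.
  def sumValues: Double = values.sum
}

val batch = ToyBatch(Array(1, 2, 3), Array(10.0, 20.0, 30.0))
// Both views compute the same aggregate; only the access pattern differs.
assert(batch.rowIterator.map(_._2).sum == batch.sumValues)
```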
[jira] [Commented] (SPARK-27475) dev/deps/spark-deps-hadoop-3.2 is incorrect
[ https://issues.apache.org/jira/browse/SPARK-27475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819130#comment-16819130 ] Sean Owen commented on SPARK-27475: --- That seems OK if it's the only way; this script is not run frequently. > dev/deps/spark-deps-hadoop-3.2 is incorrect > --- > > Key: SPARK-27475 > URL: https://issues.apache.org/jira/browse/SPARK-27475 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > parquet-hadoop-bundle-1.6.0.jar should be parquet-hadoop-bundle-1.8.1.jar.
[jira] [Created] (SPARK-27476) Refactoring SchemaPruning rule to remove duplicate code
Liang-Chi Hsieh created SPARK-27476: --- Summary: Refactoring SchemaPruning rule to remove duplicate code Key: SPARK-27476 URL: https://issues.apache.org/jira/browse/SPARK-27476 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Liang-Chi Hsieh In {{SchemaPruning}} rule, there is duplicate code for data source v1 and v2. Their logic is the same and we can refactor the rule to remove duplicate code.
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819125#comment-16819125 ] Robert Joseph Evans commented on SPARK-27396: - [~bryanc], I see your point that if this is for data exchange the API should be {{RDD[ArrowRecordBatch]}} and {{arrow_udf}}. {{ArrowRecordBatch}} is an IPC class and is made for exchanging data, so it should work out of the box and be simpler for end users to deal with. If it is for doing in-place data processing, not sending it to another system, then I think we want something based off of {{ColumnarBatch}}. Since in-place columnar data processing is hard, initially limiting it to just the extensions API feels preferable. If others are okay with that I will drop {{columnar_udf}} and {{RDD[ColumnarBatch]}} from this proposal, and just make sure that we have a good way for translating between {{ColumnarBatch}} and {{ArrowRecordBatch}} so we can play nicely with SPARK-24579. In the future if we find that advanced users do want columnar processing UDFs we can discuss ways to properly expose it at that point. > SPIP: Public APIs for extended Columnar Processing Support > -- > > Key: SPARK-27396 > URL: https://issues.apache.org/jira/browse/SPARK-27396 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Robert Joseph Evans >Priority: Major > > *Q1.* What are you trying to do? Articulate your objectives using absolutely > no jargon. > > The Dataset/DataFrame API in Spark currently only exposes to users one row at > a time when processing data. The goals of this are to > > # Expose to end users a new option of processing the data in a columnar > format, multiple rows at a time, with the data organized into contiguous > arrays in memory. > # Make any transitions between the columnar memory layout and a row based > layout transparent to the end user. > # Allow for simple data exchange with other systems, DL/ML libraries, > pandas, etc. 
by having clean APIs to transform the columnar data into an > Apache Arrow compatible layout. > # Provide a plugin mechanism for columnar processing support so an advanced > user could avoid data transition between columnar and row based processing > even through shuffles. This means we should at least support pluggable APIs > so an advanced end user can implement the columnar partitioning themselves, > and provide the glue necessary to shuffle the data still in a columnar format. > # Expose new APIs that allow advanced users or frameworks to implement > columnar processing either as UDFs, or by adjusting the physical plan to do > columnar processing. If the latter is too controversial we can move it to > another SPIP, but we plan to implement some accelerated computing in parallel > with this feature to be sure the APIs work, and without this feature it makes > that impossible. > > Not Requirements, but things that would be nice to have. > # Provide default implementations for partitioning columnar data, so users > don’t have to. > # Transition the existing in memory columnar layouts to be compatible with > Apache Arrow. This would make the transformations to Apache Arrow format a > no-op. The existing formats are already very close to those layouts in many > cases. This would not be using the Apache Arrow java library, but instead > being compatible with the memory > [layout|https://arrow.apache.org/docs/format/Layout.html] and possibly only a > subset of that layout. > # Provide a clean transition from the existing code to the new one. The > existing APIs which are public but evolving are not that far off from what is > being proposed. We should be able to create a new parallel API that can wrap > the existing one. This means any file format that is trying to support > columnar can still do so until we make a conscious decision to deprecate and > then turn off the old APIs. > > *Q2.* What problem is this proposal NOT designed to solve? 
> This is not trying to implement any of the processing itself in a columnar > way, with the exception of examples for documentation, and possibly default > implementations for partitioning of columnar shuffle. > > *Q3.* How is it done today, and what are the limits of current practice? > The current columnar support is limited to 3 areas. > # Input formats, optionally can return a ColumnarBatch instead of rows. The > code generation phase knows how to take that columnar data and iterate > through it as rows for stages that want rows, which currently is almost > everything. The limitations here are mostly implementation specific. The > current standard is to abuse Scala’s type erasure to return ColumnarBatches > as the elements of
[jira] [Commented] (SPARK-27475) dev/deps/spark-deps-hadoop-3.2 is incorrect
[ https://issues.apache.org/jira/browse/SPARK-27475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819124#comment-16819124 ] Yuming Wang commented on SPARK-27475: - [~srowen] One way to fix it is to change [this line|https://github.com/apache/spark/blob/9c0af746e5dda9f05e64f0a16a3dbe11a23024de/dev/test-dependencies.sh#L71] to {{$MVN $HADOOP2_MODULE_PROFILES -P$HADOOP_PROFILE clean install -DskipTests -q}}. But it's very expensive. I'm not sure if we have another way. > dev/deps/spark-deps-hadoop-3.2 is incorrect > --- > > Key: SPARK-27475 > URL: https://issues.apache.org/jira/browse/SPARK-27475 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > parquet-hadoop-bundle-1.6.0.jar should be parquet-hadoop-bundle-1.8.1.jar.
[jira] [Created] (SPARK-27475) dev/deps/spark-deps-hadoop-3.2 is incorrect
Yuming Wang created SPARK-27475: --- Summary: dev/deps/spark-deps-hadoop-3.2 is incorrect Key: SPARK-27475 URL: https://issues.apache.org/jira/browse/SPARK-27475 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.0.0 Reporter: Yuming Wang parquet-hadoop-bundle-1.6.0.jar should be parquet-hadoop-bundle-1.8.1.jar
[jira] [Updated] (SPARK-27475) dev/deps/spark-deps-hadoop-3.2 is incorrect
[ https://issues.apache.org/jira/browse/SPARK-27475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27475: Description: parquet-hadoop-bundle-1.6.0.jar should be parquet-hadoop-bundle-1.8.1.jar. (was: parquet-hadoop-bundle-1.6.0.jar should be parquet-hadoop-bundle-1.8.1.jar) > dev/deps/spark-deps-hadoop-3.2 is incorrect > --- > > Key: SPARK-27475 > URL: https://issues.apache.org/jira/browse/SPARK-27475 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > parquet-hadoop-bundle-1.6.0.jar should be parquet-hadoop-bundle-1.8.1.jar.
[jira] [Assigned] (SPARK-27464) Add Constant instead of referring string literal used from many places
[ https://issues.apache.org/jira/browse/SPARK-27464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-27464: - Assignee: Shivu Sondur > Add Constant instead of referring string literal used from many places > --- > > Key: SPARK-27464 > URL: https://issues.apache.org/jira/browse/SPARK-27464 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.1 >Reporter: Shivu Sondur >Assignee: Shivu Sondur >Priority: Trivial > > Add Constant instead of referring string literal used from many places for > "spark.buffer.pageSize"
[jira] [Resolved] (SPARK-27464) Add Constant instead of referring string literal used from many places
[ https://issues.apache.org/jira/browse/SPARK-27464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27464. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24368 [https://github.com/apache/spark/pull/24368] > Add Constant instead of referring string literal used from many places > --- > > Key: SPARK-27464 > URL: https://issues.apache.org/jira/browse/SPARK-27464 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.1 >Reporter: Shivu Sondur >Assignee: Shivu Sondur >Priority: Trivial > Fix For: 3.0.0 > > > Add Constant instead of referring string literal used from many places for > "spark.buffer.pageSize"
[jira] [Resolved] (SPARK-27469) Update Commons BeanUtils to 1.9.3
[ https://issues.apache.org/jira/browse/SPARK-27469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27469. --- Resolution: Fixed Fix Version/s: 3.0.0 Resolved by https://github.com/apache/spark/pull/24378 > Update Commons BeanUtils to 1.9.3 > - > > Key: SPARK-27469 > URL: https://issues.apache.org/jira/browse/SPARK-27469 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.1, 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > Fix For: 3.0.0 > > > Right now, Spark inherits two inconsistent versions of Commons BeanUtils via > Hadoop: commons-beanutils 1.7.0 and commons-beanutils-core 1.8.0. Version > 1.9.3 is the latest, and resolves bugs and a deserialization vulnerability > that was otherwise resolved here in CVE-2017-12612. It'd be nice to both fix > the inconsistency and get the latest to further ensure that there isn't any > latent vulnerability here.
[jira] [Updated] (SPARK-27330) ForeachWriter is not being closed once a batch is aborted
[ https://issues.apache.org/jira/browse/SPARK-27330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Zituny updated SPARK-27330: Description: in cases where a micro batch is being killed (interrupted), not during actual processing done by the {{ForeachDataWriter}} (when iterating the iterator), {{DataWritingSparkTask}} will handle the interruption and call {{dataWriter.abort()}} the problem is that {{ForeachDataWriter}} has an empty implementation for the abort method. due to that, I have tasks which uses the foreach writer and according to the [documentation|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreach] they are opening connections in the "open" method and closing the connections on the "close" method but since the "close" is never called, the connections are never closed this wasn't the behavior pre spark 2.4 my suggestion is to call {{ForeachWriter.abort()}} when {{DataWriter.abort()}} is called, in order to notify the foreach writer that this task has failed {code:java} stack trace from the exception i have encountered: org.apache.spark.TaskKilledException: null at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:149) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:117) at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:116) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394) at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:146) at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:67) at 
org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:66) {code} was: in cases where a micro batch is being killed (interrupted), not during actual processing done by the {{ForeachDataWriter}} (when iterating the iterator), {{DataWritingSparkTask}} will handle the interruption and call {{dataWriter.abort()}} the problem is that {{ForeachDataWriter}} has an empty implementation for the abort method. due to that, I have tasks which uses the foreach writer and according to the [documentation|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreach] they are opening connections in the "open" method and closing the connections on the "close" method but since the "close" is never called, the connections are never closed this wasn't the behavior pre spark 2.4 my suggestion is to call {{ForeachWriter.close()}} when {{DataWriter.abort()}} is called, and exception should also be provided in order to notify the foreach writer that this task has failed {code} stack trace from the exception i have encountered: org.apache.spark.TaskKilledException: null at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:149) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:117) at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:116) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394) at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:146) at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:67) at 
org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:66) {code} > ForeachWriter is not being closed once a batch is aborted > - > > Key: SPARK-27330 > URL: https://issues.apache.org/jira/browse/SPARK-27330 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Eyal Zituny >Priority: Major > > in cases where a micro batch is being killed (interrupted), not during actual > processing done by the {{ForeachDataWriter}} (when iterating the iterator), > {{DataWritingSparkTask}} will handle the interruption and call > {{dataWriter.abort()}} > the problem is that {{ForeachDataWriter}} has an empty implementation for the > abort method. > due
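The suggested fix — having the writer's abort path notify the user's foreach writer instead of doing nothing — can be sketched with a simplified, self-contained model. The class and method names below are hypothetical stand-ins for Spark's Scala `ForeachWriter`/`ForeachDataWriter`; this is only an illustration of the lifecycle, not Spark's implementation.

```python
class ForeachWriter:
    """Toy stand-in for Spark's ForeachWriter open/process/close contract."""
    def __init__(self):
        self.opened = False
        self.closed_with = None   # will hold the error passed to close()

    def open(self):
        self.opened = True        # e.g. open a DB connection here

    def close(self, error):
        self.closed_with = error  # e.g. release the connection here


class ForeachDataWriter:
    """Toy stand-in for the DataWriter wrapper around a ForeachWriter."""
    def __init__(self, writer):
        self.writer = writer
        self.writer.open()

    def abort(self, error):
        # Previously (per the report) this was a no-op, leaking connections.
        # The proposed behavior: propagate the failure to the user's writer
        # so its close() hook can release resources.
        self.writer.close(error)


w = ForeachWriter()
dw = ForeachDataWriter(w)
dw.abort(RuntimeError("task killed"))   # batch interrupted -> writer closed
```

With this shape, an interrupted micro-batch still gives the user-supplied writer a chance to clean up, which is the behavior the reporter describes from before Spark 2.4.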
[jira] [Resolved] (SPARK-27397) Take care of OpenJ9 in JVM dependant parts
[ https://issues.apache.org/jira/browse/SPARK-27397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27397. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24308 [https://github.com/apache/spark/pull/24308] > Take care of OpenJ9 in JVM dependant parts > -- > > Key: SPARK-27397 > URL: https://issues.apache.org/jira/browse/SPARK-27397 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Major > Fix For: 3.0.0 > > > Spark includes JVM-dependent code in multiple places, such as > {{SizeEstimator}}. Spark currently takes care of the IBM JDK and OpenJDK. > OpenJ9 has recently been released, but it is not yet considered. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
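JVM-vendor checks of the kind described above typically hinge on system properties such as `java.vm.name`. A minimal sketch of the branching (in Python purely for illustration — Spark's actual checks, e.g. in {{SizeEstimator}}, are in Scala, and the example property values are assumptions, not an exhaustive list):

```python
def classify_jvm(vm_name: str) -> str:
    """Classify a JVM from its java.vm.name system property (illustrative)."""
    name = vm_name.lower()
    if "openj9" in name:          # e.g. "Eclipse OpenJ9 VM"
        return "OpenJ9"
    if "ibm" in name:             # e.g. "IBM J9 VM"
        return "IBM JDK"
    return "HotSpot/OpenJDK"      # e.g. "OpenJDK 64-Bit Server VM"
```

The point of the fix is that code which only distinguished "IBM" from "everything else" misclassifies OpenJ9, so an explicit OpenJ9 branch (checked first, as above) is needed.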
[jira] [Commented] (SPARK-27330) ForeachWriter is not being closed once a batch is aborted
[ https://issues.apache.org/jira/browse/SPARK-27330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819054#comment-16819054 ] Eyal Zituny commented on SPARK-27330: - [~gsomogyi] i've submitted a PR > ForeachWriter is not being closed once a batch is aborted > - > > Key: SPARK-27330 > URL: https://issues.apache.org/jira/browse/SPARK-27330 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Eyal Zituny >Priority: Major > > in cases where a micro batch is being killed (interrupted), not during actual > processing done by the {{ForeachDataWriter}} (when iterating the iterator), > {{DataWritingSparkTask}} will handle the interruption and call > {{dataWriter.abort()}} > the problem is that {{ForeachDataWriter}} has an empty implementation for the > abort method. > due to that, I have tasks which uses the foreach writer and according to the > [documentation|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreach] > they are opening connections in the "open" method and closing the > connections on the "close" method but since the "close" is never called, the > connections are never closed > this wasn't the behavior pre spark 2.4 > my suggestion is to call {{ForeachWriter.close()}} when > {{DataWriter.abort()}} is called, and exception should also be provided in > order to notify the foreach writer that this task has failed > > {code} > stack trace from the exception i have encountered: > org.apache.spark.TaskKilledException: null > at > org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:149) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:117) > at > 
org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:116) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394) > at > org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:146) > at > org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:67) > at > org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:66) > {code} >
[jira] [Assigned] (SPARK-27397) Take care of OpenJ9 in JVM dependant parts
[ https://issues.apache.org/jira/browse/SPARK-27397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-27397: - Assignee: Kazuaki Ishizaki > Take care of OpenJ9 in JVM dependant parts > -- > > Key: SPARK-27397 > URL: https://issues.apache.org/jira/browse/SPARK-27397 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Major > > Spark includes JVM-dependent code in multiple places, such as > {{SizeEstimator}}. Spark currently takes care of the IBM JDK and OpenJDK. > OpenJ9 has recently been released, but it is not yet considered.
[jira] [Commented] (SPARK-27409) Micro-batch support for Kafka Source in Spark 2.3
[ https://issues.apache.org/jira/browse/SPARK-27409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16818963#comment-16818963 ] Gabor Somogyi commented on SPARK-27409: --- After modifying the code the same way as in the Scala case and checking it with PySpark 2.3.2, here are my thoughts: `trigger=continuous...` mainly controls the execution (MicroBatchExecution vs ContinuousExecution). See the code [here|https://github.com/apache/spark/blob/02b510728c31b70e6035ad541bfcdc2b59dcd79a/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala#L243]. This code has been significantly rewritten on the master branch. Does this cause any issue? > Micro-batch support for Kafka Source in Spark 2.3 > - > > Key: SPARK-27409 > URL: https://issues.apache.org/jira/browse/SPARK-27409 > Project: Spark > Issue Type: Question > Components: Structured Streaming >Affects Versions: 2.3.2 >Reporter: Prabhjot Singh Bharaj >Priority: Major > > It seems with this change - > [https://github.com/apache/spark/commit/0a441d2edb0a3f6c6c7c370db8917e1c07f211e7#diff-eeac5bdf3a1ecd7b9f8aaf10fff37f05R50] > in Spark 2.3 for Kafka Source Provider, a Kafka source can not be run in > micro-batch mode but only in continuous mode. Is that understanding correct ? > {code:java} > E Py4JJavaError: An error occurred while calling o217.load. 
> E : org.apache.kafka.common.KafkaException: Failed to construct kafka consumer > E at > org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:717) > E at > org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:566) > E at > org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:549) > E at > org.apache.spark.sql.kafka010.SubscribeStrategy.createConsumer(ConsumerStrategy.scala:62) > E at > org.apache.spark.sql.kafka010.KafkaOffsetReader.createConsumer(KafkaOffsetReader.scala:314) > E at > org.apache.spark.sql.kafka010.KafkaOffsetReader.(KafkaOffsetReader.scala:78) > E at > org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:130) > E at > org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:43) > E at > org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:185) > E at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > E at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > E at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > E at java.lang.reflect.Method.invoke(Method.java:498) > E at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > E at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > E at py4j.Gateway.invoke(Gateway.java:282) > E at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > E at py4j.commands.CallCommand.execute(CallCommand.java:79) > E at py4j.GatewayConnection.run(GatewayConnection.java:238) > E at java.lang.Thread.run(Thread.java:748) > E Caused by: org.apache.kafka.common.KafkaException: > org.apache.kafka.common.KafkaException: java.io.FileNotFoundException: > non-existent (No such file or directory) > E at > org.apache.kafka.common.network.SslChannelBuilder.configure(SslChannelBuilder.java:44) > E at > 
org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:93) > E at > org.apache.kafka.common.network.ChannelBuilders.clientChannelBuilder(ChannelBuilders.java:51) > E at > org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:84) > E at > org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:657) > E ... 19 more > E Caused by: org.apache.kafka.common.KafkaException: > java.io.FileNotFoundException: non-existent (No such file or directory) > E at > org.apache.kafka.common.security.ssl.SslFactory.configure(SslFactory.java:121) > E at > org.apache.kafka.common.network.SslChannelBuilder.configure(SslChannelBuilder.java:41) > E ... 23 more > E Caused by: java.io.FileNotFoundException: non-existent (No such file or > directory) > E at java.io.FileInputStream.open0(Native Method) > E at java.io.FileInputStream.open(FileInputStream.java:195) > E at java.io.FileInputStream.(FileInputStream.java:138) > E at java.io.FileInputStream.(FileInputStream.java:93) > E at > org.apache.kafka.common.security.ssl.SslFactory$SecurityStore.load(SslFactory.java:216) > E at > org.apache.kafka.common.security.ssl.SslFactory$SecurityStore.access$000(SslFactory.java:201) > E at > org.apache.kafka.common.security.ssl.SslFactory.createSSLContext(SslFactory.java:137) > E at > org.apache.kafka.common.security.ssl.SslFactory.configure(SslFactory.java:119) > E ... 24 more{code} > When running a simple data stream loader for kafka
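Gabor's observation above — that the trigger, not the source, selects between MicroBatchExecution and ContinuousExecution — can be sketched as a toy dispatch. This is a hypothetical simplification in Python; the real logic lives in Spark's Scala `StreamingQueryManager`, and the names here are illustrative only:

```python
# Toy model: the *trigger* chosen at query start decides which execution
# engine runs the stream; the Kafka source supports both.

class Trigger:
    def __init__(self, kind, interval=None):
        self.kind = kind          # "processing-time" or "continuous"
        self.interval = interval  # e.g. "1 second"

def choose_execution(trigger):
    """Pick the execution engine the way the stream start path roughly does."""
    if trigger.kind == "continuous":
        return "ContinuousExecution"   # the continuous-processing path
    return "MicroBatchExecution"       # the default micro-batch path
```

Under this model, a Kafka source started without `trigger(continuous=...)` still runs in micro-batch mode — the error in the report comes from the reader being constructed eagerly at `load()` time, not from micro-batch support being absent.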
[jira] [Updated] (SPARK-27474) avoid retrying a task failed with CommitDeniedException many times
[ https://issues.apache.org/jira/browse/SPARK-27474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-27474: Summary: avoid retrying a task failed with CommitDeniedException many times (was: try best to not submit tasks when the partitions are already completed) > avoid retrying a task failed with CommitDeniedException many times > -- > > Key: SPARK-27474 > URL: https://issues.apache.org/jira/browse/SPARK-27474 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major >
[jira] [Resolved] (SPARK-27470) Upgrade pyrolite to 4.23
[ https://issues.apache.org/jira/browse/SPARK-27470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27470. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24381 [https://github.com/apache/spark/pull/24381] > Upgrade pyrolite to 4.23 > > > Key: SPARK-27470 > URL: https://issues.apache.org/jira/browse/SPARK-27470 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 2.4.1, 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > Fix For: 3.0.0 > > > We can/should upgrade the pyrolite dependency to the latest, 4.23, to pick up > bug and security fixes.
[jira] [Commented] (SPARK-27468) "Storage Level" in "RDD Storage Page" is not correct
[ https://issues.apache.org/jira/browse/SPARK-27468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16818889#comment-16818889 ] Daniel Tomes commented on SPARK-27468: -- Excellent [~shahid], if you need any assistance replicating, let me know; I can recreate the issue but you should be able to as well. Thanks > "Storage Level" in "RDD Storage Page" is not correct > > > Key: SPARK-27468 > URL: https://issues.apache.org/jira/browse/SPARK-27468 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.1 >Reporter: Shixiong Zhu >Priority: Major > > I ran the following unit test and checked the UI. > {code} > val conf = new SparkConf() > .setAppName("test") > .setMaster("local-cluster[2,1,1024]") > .set("spark.ui.enabled", "true") > sc = new SparkContext(conf) > val rdd = sc.makeRDD(1 to 10, 1).persist(StorageLevel.MEMORY_ONLY_2) > rdd.count() > Thread.sleep(360) > {code} > The storage level is "Memory Deserialized 1x Replicated" in the RDD storage > page. > I tried to debug and found this is because Spark emitted the following two > events: > {code} > event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, > 10.8.132.160, 65473, None),rdd_0_0,StorageLevel(memory, deserialized, 2 > replicas),56,0)) > event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, > 10.8.132.160, 65474, None),rdd_0_0,StorageLevel(memory, deserialized, 1 > replicas),56,0)) > {code} > The storage level in the second event will overwrite the first one. "1 > replicas" comes from this line: > https://github.com/apache/spark/blob/3ab96d7acf870e53c9016b0b63d0b328eec23bed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1457 > Maybe AppStatusListener should calculate the replicas from events? > Another fact we may need to think about is when replicas is 2, will two Spark > events arrive in the same order? Currently, two RPCs from different executors > can arrive in any order. 
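The overwrite hazard described above — where the later `SparkListenerBlockUpdated` event's "1 replicas" clobbers the earlier "2 replicas" — can be sketched with a toy fold over the events. Keeping the maximum replica count seen for a block is order-independent, unlike "last write wins". This is only an illustration of the suggestion that the listener calculate replicas from events, not Spark's actual `AppStatusListener` logic:

```python
# Toy model of merging BlockUpdated events for one RDD block.
# Each tuple is (block_manager_id, replicas_reported_by_that_manager).

def merge_replication(events):
    """Return the block's replication, independent of event arrival order."""
    return max(replicas for _, replicas in events)

# The two events from the report, in both possible arrival orders:
in_order  = [("BlockManagerId(1)", 2), ("BlockManagerId(0)", 1)]
reordered = list(reversed(in_order))
```

Since RPCs from different executors can arrive in any order, any per-event "overwrite" scheme is racy; an order-insensitive aggregate like the max (or the count of distinct block managers holding the block) sidesteps the race.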
[jira] [Commented] (SPARK-27409) Micro-batch support for Kafka Source in Spark 2.3
[ https://issues.apache.org/jira/browse/SPARK-27409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16818882#comment-16818882 ] Gabor Somogyi commented on SPARK-27409: --- [~pbharaj] {quote}Yes, I'm following the kafka integration guide linked.{quote} {code:java} $ ./bin/pyspark Python 2.7.10 (default, Aug 17 2018, 17:41:52) [GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.0.42)] on darwin Type "help", "copyright", "credits" or "license" for more information. 19/04/16 12:24:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Welcome to [Spark ASCII-art banner] version 2.3.2 Using Python version 2.7.10 (default, Aug 17 2018 17:41:52) SparkSession available as 'spark'. 
>>> df = sc.sql.readStream.format('kafka').option('kafka.bootstrap.servers', >>> 'localhost:9093').option("kafka.security.protocol", >>> "SSL",).option("kafka.ssl.keystore.password", >>> "").option("kafka.ssl.keystore.type", >>> "PKCS12").option("kafka.ssl.keystore.location", >>> 'non-existent').option('subscribe', 'no existing topic').load() Traceback (most recent call last): File "", line 1, in AttributeError: 'SparkContext' object has no attribute 'sql' >>> {code} > Micro-batch support for Kafka Source in Spark 2.3 > - > > Key: SPARK-27409 > URL: https://issues.apache.org/jira/browse/SPARK-27409 > Project: Spark > Issue Type: Question > Components: Structured Streaming >Affects Versions: 2.3.2 >Reporter: Prabhjot Singh Bharaj >Priority: Major > > It seems with this change - > [https://github.com/apache/spark/commit/0a441d2edb0a3f6c6c7c370db8917e1c07f211e7#diff-eeac5bdf3a1ecd7b9f8aaf10fff37f05R50] > in Spark 2.3 for Kafka Source Provider, a Kafka source can not be run in > micro-batch mode but only in continuous mode. Is that understanding correct ? > {code:java} > E Py4JJavaError: An error occurred while calling o217.load. 
> E : org.apache.kafka.common.KafkaException: Failed to construct kafka consumer > E at > org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:717) > E at > org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:566) > E at > org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:549) > E at > org.apache.spark.sql.kafka010.SubscribeStrategy.createConsumer(ConsumerStrategy.scala:62) > E at > org.apache.spark.sql.kafka010.KafkaOffsetReader.createConsumer(KafkaOffsetReader.scala:314) > E at > org.apache.spark.sql.kafka010.KafkaOffsetReader.(KafkaOffsetReader.scala:78) > E at > org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:130) > E at > org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:43) > E at > org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:185) > E at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > E at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > E at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > E at java.lang.reflect.Method.invoke(Method.java:498) > E at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > E at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > E at py4j.Gateway.invoke(Gateway.java:282) > E at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > E at py4j.commands.CallCommand.execute(CallCommand.java:79) > E at py4j.GatewayConnection.run(GatewayConnection.java:238) > E at java.lang.Thread.run(Thread.java:748) > E Caused by: org.apache.kafka.common.KafkaException: > org.apache.kafka.common.KafkaException: java.io.FileNotFoundException: > non-existent (No such file or directory) > E at > org.apache.kafka.common.network.SslChannelBuilder.configure(SslChannelBuilder.java:44) > E at > 
org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:93) > E at > org.apache.kafka.common.network.ChannelBuilders.clientChannelBuilder(ChannelBuilders.java:51) > E at > org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:84) > E at > org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:657) > E ... 19 more > E Caused by: org.apache.kafka.common.KafkaException: > java.io.FileNotFoundException: non-existent (No such file or directory) > E at > org.apache.kafka.common.security.ssl.SslFactory.configure(SslFactory.java:121) > E at >
[jira] [Created] (SPARK-27474) try best to not submit tasks when the partitions are already completed
Wenchen Fan created SPARK-27474: --- Summary: try best to not submit tasks when the partitions are already completed Key: SPARK-27474 URL: https://issues.apache.org/jira/browse/SPARK-27474 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan
[jira] [Commented] (SPARK-25250) Race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple ti
[ https://issues.apache.org/jira/browse/SPARK-25250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16818707#comment-16818707 ] Wenchen Fan commented on SPARK-25250: - Good points! I'm creating another ticket for the issue and will leave this as resolved. > Race condition with tasks running when new attempt for same stage is created > leads to other task in the next attempt running on the same partition id > retry multiple times > -- > > Key: SPARK-25250 > URL: https://issues.apache.org/jira/browse/SPARK-25250 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.3.1 >Reporter: Parth Gandhi >Assignee: Parth Gandhi >Priority: Major > Fix For: 2.3.4, 2.4.1, 3.0.0 > > > We recently had a scenario where a race condition occurred when a task from > previous stage attempt just finished before new attempt for the same stage > was created due to fetch failure, so the new task created in the second > attempt on the same partition id was retrying multiple times due to > TaskCommitDenied Exception without realizing that the task in earlier attempt > was already successful. > For example, consider a task with partition id 9000 and index 9000 running in > stage 4.0. We see a fetch failure so thus, we spawn a new stage attempt 4.1. > Just within this timespan, the above task completes successfully, thus, > marking the partition id 9000 as complete for 4.0. However, as stage 4.1 has > not yet been created, the taskset info for that stage is not available to the > TaskScheduler so, naturally, the partition id 9000 has not been marked > completed for 4.1. Stage 4.1 now spawns task with index 2000 on the same > partition id 9000. This task fails due to CommitDeniedException and since, it > does not see the corresponding partition id as been marked successful, it > keeps retrying multiple times until the job finally succeeds. 
It doesn't > cause any job failures because the DAG scheduler is tracking the partitions > separate from the task set managers. > > Steps to Reproduce: > # Run any large job involving shuffle operation. > # When the ShuffleMap stage finishes and the ResultStage begins running, > cause this stage to throw a fetch failure exception(Try deleting certain > shuffle files on any host). > # Observe the task attempt numbers for the next stage attempt. Please note > that this issue is an intermittent one, so it might not happen all the time.
[jira] [Resolved] (SPARK-25250) Race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple tim
[ https://issues.apache.org/jira/browse/SPARK-25250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-25250. - Resolution: Fixed > Race condition with tasks running when new attempt for same stage is created > leads to other task in the next attempt running on the same partition id > retry multiple times > -- > > Key: SPARK-25250 > URL: https://issues.apache.org/jira/browse/SPARK-25250 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.3.1 >Reporter: Parth Gandhi >Assignee: Parth Gandhi >Priority: Major > Fix For: 2.3.4, 3.0.0, 2.4.1 > > > We recently had a scenario where a race condition occurred when a task from > previous stage attempt just finished before new attempt for the same stage > was created due to fetch failure, so the new task created in the second > attempt on the same partition id was retrying multiple times due to > TaskCommitDenied Exception without realizing that the task in earlier attempt > was already successful. > For example, consider a task with partition id 9000 and index 9000 running in > stage 4.0. We see a fetch failure so thus, we spawn a new stage attempt 4.1. > Just within this timespan, the above task completes successfully, thus, > marking the partition id 9000 as complete for 4.0. However, as stage 4.1 has > not yet been created, the taskset info for that stage is not available to the > TaskScheduler so, naturally, the partition id 9000 has not been marked > completed for 4.1. Stage 4.1 now spawns task with index 2000 on the same > partition id 9000. This task fails due to CommitDeniedException and since, it > does not see the corresponding partition id as been marked successful, it > keeps retrying multiple times until the job finally succeeds. It doesn't > cause any job failures because the DAG scheduler is tracking the partitions > separate from the task set managers. > > Steps to Reproduce: > # Run any large job involving shuffle operation. 
> # When the ShuffleMap stage finishes and the ResultStage begins running, > cause this stage to throw a fetch failure exception(Try deleting certain > shuffle files on any host). > # Observe the task attempt numbers for the next stage attempt. Please note > that this issue is an intermittent one, so it might not happen all the time.
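The race described above amounts to a new stage attempt not seeing partitions already completed by an older attempt. The follow-up direction (tracked as SPARK-27474) is to share that completion state across attempts. A highly simplified, self-contained sketch — hypothetical names, not Spark's scheduler code:

```python
class StageAttempt:
    """Toy task set for one attempt of a stage.

    `completed` is per-stage state shared by every attempt, so a partition
    finished by attempt 4.0 is never resubmitted (and endlessly commit-denied)
    by attempt 4.1.
    """
    def __init__(self, stage_id, partitions, completed):
        self.stage_id = stage_id
        self.completed = completed
        self.pending = [p for p in partitions if p not in completed]

    def task_succeeded(self, partition):
        self.completed.add(partition)


completed = set()                          # shared across attempts of stage 4
attempt0 = StageAttempt(4, range(3), completed)
attempt0.task_succeeded(0)                 # partition 0 finishes in attempt 4.0
# Fetch failure -> attempt 4.1 is created; partition 0 must not be retried.
attempt1 = StageAttempt(4, range(3), completed)
```

Because the new attempt filters its pending list against the shared set at construction time, a task that slipped in "just within this timespan" still counts, closing the window the reporter describes.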
[jira] [Resolved] (SPARK-27448) File source V2 table provider should be compatible with V1 provider
[ https://issues.apache.org/jira/browse/SPARK-27448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27448. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24356 [https://github.com/apache/spark/pull/24356] > File source V2 table provider should be compatible with V1 provider > --- > > Key: SPARK-27448 > URL: https://issues.apache.org/jira/browse/SPARK-27448 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.0.0 > > > In the rule `PreprocessTableCreation`, if an existing table is appended with > a different provider, the action will fail. > Currently, there are two implementations for file sources and creating a > table with file source V2 will always fall back to V1 FileFormat. We should > consider the following cases as valid: > 1. Appending a table with file source V2 provider using the v1 file format > 2. Appending a table with v1 file format provider using file source V2 format >
[jira] [Assigned] (SPARK-27448) File source V2 table provider should be compatible with V1 provider
[ https://issues.apache.org/jira/browse/SPARK-27448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-27448: --- Assignee: Gengliang Wang > File source V2 table provider should be compatible with V1 provider > --- > > Key: SPARK-27448 > URL: https://issues.apache.org/jira/browse/SPARK-27448 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > In the rule `PreprocessTableCreation`, if an existing table is appended with > a different provider, the action will fail. > Currently, there are two implementations for file sources and creating a > table with file source V2 will always fall back to V1 FileFormat. We should > consider the following cases as valid: > 1. Appending a table with file source V2 provider using the v1 file format > 2. Appending a table with v1 file format provider using file source V2 format >
[jira] [Updated] (SPARK-27465) Kafka Client 0.11.0.0 is not Supporting the kafkatestutils package
[ https://issues.apache.org/jira/browse/SPARK-27465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Praveen updated SPARK-27465: Description: Hi Team, We are getting the below exceptions with Kafka Client Version 0.11.0.0 for the KafkaTestUtils package, but it's working fine when we use Kafka Client Version 0.10.0.1. Please suggest the way forward. We are using the package " org.apache.spark.streaming.kafka010.KafkaTestUtils;" and the Spark Streaming version is 2.2.3 and above. ERROR: java.lang.NoSuchMethodError: kafka.server.KafkaServer$.$lessinit$greater$default$2()Lkafka/utils/Time; at org.apache.spark.streaming.kafka010.KafkaTestUtils$$anonfun$setupEmbeddedKafkaServer$2.apply(KafkaTestUtils.scala:110) at org.apache.spark.streaming.kafka010.KafkaTestUtils$$anonfun$setupEmbeddedKafkaServer$2.apply(KafkaTestUtils.scala:107) at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:2234) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160) at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:2226) at org.apache.spark.streaming.kafka010.KafkaTestUtils.setupEmbeddedKafkaServer(KafkaTestUtils.scala:107) at org.apache.spark.streaming.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:122) at com.netcracker.rms.smart.esp.ESPTestEnv.prepareKafkaTestUtils(ESPTestEnv.java:203) at com.netcracker.rms.smart.esp.ESPTestEnv.setUp(ESPTestEnv.java:157) at com.netcracker.rms.smart.esp.TestEventStreamProcessor.setUp(TestEventStreamProcessor.java:58) was: Hi Team, We are getting the below exceptions with Kafka Client Version 0.11.0.0 for KafkaTestUtils Package. But its working fine when we use the Kafka Client Version 0.10.0.1. Please suggest the way forwards. 
We are using the package " import org.apache.spark.streaming.kafka010.KafkaTestUtils;" ERROR: java.lang.NoSuchMethodError: kafka.server.KafkaServer$.$lessinit$greater$default$2()Lkafka/utils/Time; at org.apache.spark.streaming.kafka010.KafkaTestUtils$$anonfun$setupEmbeddedKafkaServer$2.apply(KafkaTestUtils.scala:110) at org.apache.spark.streaming.kafka010.KafkaTestUtils$$anonfun$setupEmbeddedKafkaServer$2.apply(KafkaTestUtils.scala:107) at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:2234) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160) at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:2226) at org.apache.spark.streaming.kafka010.KafkaTestUtils.setupEmbeddedKafkaServer(KafkaTestUtils.scala:107) at org.apache.spark.streaming.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:122) at com.netcracker.rms.smart.esp.ESPTestEnv.prepareKafkaTestUtils(ESPTestEnv.java:203) at com.netcracker.rms.smart.esp.ESPTestEnv.setUp(ESPTestEnv.java:157) at com.netcracker.rms.smart.esp.TestEventStreamProcessor.setUp(TestEventStreamProcessor.java:58) > Kafka Client 0.11.0.0 is not Supporting the kafkatestutils package > -- > > Key: SPARK-27465 > URL: https://issues.apache.org/jira/browse/SPARK-27465 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0, 2.4.1 >Reporter: Praveen >Priority: Critical > > Hi Team, > We are getting the below exceptions with Kafka Client Version 0.11.0.0 for > KafkaTestUtils Package. But its working fine when we use the Kafka Client > Version 0.10.0.1. Please suggest the way forwards. We are using the package " > org.apache.spark.streaming.kafka010.KafkaTestUtils;" > And the Spark Streaming Version is 2.2.3 and above. 
> > ERROR: > java.lang.NoSuchMethodError: > kafka.server.KafkaServer$.$lessinit$greater$default$2()Lkafka/utils/Time; > at > org.apache.spark.streaming.kafka010.KafkaTestUtils$$anonfun$setupEmbeddedKafkaServer$2.apply(KafkaTestUtils.scala:110) > at > org.apache.spark.streaming.kafka010.KafkaTestUtils$$anonfun$setupEmbeddedKafkaServer$2.apply(KafkaTestUtils.scala:107) > at > org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:2234) > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160) > at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:2226) > at > org.apache.spark.streaming.kafka010.KafkaTestUtils.setupEmbeddedKafkaServer(KafkaTestUtils.scala:107) > at > org.apache.spark.streaming.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:122) > at > com.netcracker.rms.smart.esp.ESPTestEnv.prepareKafkaTestUtils(ESPTestEnv.java:203) > at com.netcracker.rms.smart.esp.ESPTestEnv.setUp(ESPTestEnv.java:157) > at > com.netcracker.rms.smart.esp.TestEventStreamProcessor.setUp(TestEventStreamProcessor.java:58)