[jira] [Updated] (SPARK-27894) PySpark streaming transform RDD join does not work when checkpoint enabled
[ https://issues.apache.org/jira/browse/SPARK-27894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeffrey(Xilang) Yan updated SPARK-27894:

Description:

In PySpark Streaming, if checkpointing is enabled and the job contains a transform-join operation, an error is thrown.

{code:java}
sc = SparkContext(appName='')
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, 10)
ssc.checkpoint("hdfs:///test")
kafka_bootstrap_servers = ""
topics = ['', '']
doc_info = sc.parallelize(((1, 2), (4, 5), (7, 8), (10, 11)))
kvds = KafkaUtils.createDirectStream(ssc, topics, kafkaParams={"metadata.broker.list": kafka_bootstrap_servers})
line = kvds.map(lambda x: (1, 2))
line.transform(lambda rdd: rdd.join(doc_info)).pprint(10)
ssc.start()
ssc.awaitTermination()
{code}

Error details:

{code:java}
PicklingError: Could not serialize object: Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
{code}

Equivalent code works in Scala. The error also goes away if either

{code:java}
ssc.checkpoint("hdfs:///test")
{code}

or

{code:java}
line.transform(lambda rdd: rdd.join(doc_info))
{code}

is removed. It appears that when checkpointing is enabled, PySpark serializes the transform lambda and, with it, the RDD the lambda captures; since an RDD cannot be serialized, the error is raised.

was:

In PySpark Streaming, if checkpointing is enabled and the job contains a transform-join operation, an error is thrown.

{code:java}
sc = SparkContext(appName='')
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, 10)
ssc.checkpoint("hdfs:///test")
kafka_bootstrap_servers = ""
topics = ['', '']
doc_info = sc.parallelize(((1, 2), (4, 5), (7, 8), (10, 11)))
kvds = KafkaUtils.createDirectStream(ssc, topics, kafkaParams={"metadata.broker.list": kafka_bootstrap_servers})
line = kvds.map(lambda x: (1, 2))
line.transform(lambda rdd: rdd.join(doc_info)).pprint(10)
ssc.start()
ssc.awaitTermination()
{code}

Error details:

{{PicklingError: Could not serialize object: Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.}}

Equivalent code works in Scala. The error also goes away if either

{code:java}
ssc.checkpoint("hdfs:///test")
{code}

or

{code:java}
line.transform(lambda rdd: rdd.join(doc_info))
{code}

is removed. It appears that when checkpointing is enabled, PySpark serializes the transform lambda and, with it, the RDD the lambda captures; since an RDD cannot be serialized, the error is raised.

> PySpark streaming transform RDD join does not work when checkpoint enabled
> --
>
> Key: SPARK-27894
> URL: https://issues.apache.org/jira/browse/SPARK-27894
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.4.0
> Reporter: Jeffrey(Xilang) Yan
> Priority: Major
>
> In PySpark Streaming, if checkpointing is enabled and the job contains a transform-join operation, an error is thrown.
> {code:java}
> sc=SparkContext(appName='')
> sc.setLogLevel("WARN")
> ssc=StreamingContext(sc,10)
> ssc.checkpoint("hdfs:///test")
> kafka_bootstrap_servers=""
> topics = ['', '']
> doc_info = sc.parallelize(((1, 2), (4, 5), (7, 8), (10, 11)))
> kvds=KafkaUtils.createDirectStream(ssc, topics, kafkaParams={"metadata.broker.list": kafka_bootstrap_servers})
> line=kvds.map(lambda x:(1,2))
> line.transform(lambda rdd:rdd.join(doc_info)).pprint(10)
> ssc.start()
> ssc.awaitTermination()
> {code}
>
> Error details:
>
> {code:java}
> PicklingError: Could not serialize object: Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
> {code}
> The similar code works great in Scala. And if we remove any of
> {code:java}
> ssc.checkpoint("hdfs:///test")
> {code}
> o
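A common workaround for the failure reported above (a hedged sketch, not taken from the ticket): collect the small `doc_info` RDD on the driver and broadcast it as a plain dict, so the transform lambda no longer captures an RDD and stays picklable. The per-partition join logic is ordinary Python:

```python
def map_side_join(pairs, lookup):
    """Join (key, value) pairs against a plain dict.

    This is the picklable replacement for rdd.join(doc_info)
    inside transform(): only a dict is captured, never an RDD.
    """
    return [(k, (v, lookup[k])) for k, v in pairs if k in lookup]

# Hypothetical PySpark usage (names follow the report; requires a
# running StreamingContext, so it is shown as comments only):
# info = sc.broadcast(dict(doc_info.collect()))
# line.transform(lambda rdd: rdd.mapPartitions(
#     lambda part: map_side_join(part, info.value))).pprint(10)
```

This only applies when the joined-against dataset is small enough to broadcast; a genuine RDD-to-RDD join inside `transform` with checkpointing enabled is exactly what the ticket reports as broken.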
[jira] [Created] (SPARK-27894) PySpark streaming transform RDD join does not work when checkpoint enabled
Jeffrey(Xilang) Yan created SPARK-27894:
---

Summary: PySpark streaming transform RDD join does not work when checkpoint enabled
Key: SPARK-27894
URL: https://issues.apache.org/jira/browse/SPARK-27894
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 2.4.0
Reporter: Jeffrey(Xilang) Yan

In PySpark Streaming, if checkpointing is enabled and the job contains a transform-join operation, an error is thrown.

{code:java}
sc = SparkContext(appName='')
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, 10)
ssc.checkpoint("hdfs:///test")
kafka_bootstrap_servers = ""
topics = ['', '']
doc_info = sc.parallelize(((1, 2), (4, 5), (7, 8), (10, 11)))
kvds = KafkaUtils.createDirectStream(ssc, topics, kafkaParams={"metadata.broker.list": kafka_bootstrap_servers})
line = kvds.map(lambda x: (1, 2))
line.transform(lambda rdd: rdd.join(doc_info)).pprint(10)
ssc.start()
ssc.awaitTermination()
{code}

Error details:

{{PicklingError: Could not serialize object: Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.}}

Equivalent code works in Scala. The error also goes away if either

{code:java}
ssc.checkpoint("hdfs:///test")
{code}

or

{code:java}
line.transform(lambda rdd: rdd.join(doc_info))
{code}

is removed. It appears that when checkpointing is enabled, PySpark serializes the transform lambda and, with it, the RDD the lambda captures; since an RDD cannot be serialized, the error is raised.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27893) Create an integrated test base for Python, Scalar Pandas, Scala UDF by sql files
[ https://issues.apache.org/jira/browse/SPARK-27893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27893:

Assignee: Apache Spark

> Create an integrated test base for Python, Scalar Pandas, Scala UDF by sql files
> 
>
> Key: SPARK-27893
> URL: https://issues.apache.org/jira/browse/SPARK-27893
> Project: Spark
> Issue Type: Test
> Components: PySpark, SQL
> Affects Versions: 3.0.0
> Reporter: Hyukjin Kwon
> Assignee: Apache Spark
> Priority: Critical
>
> See https://github.com/apache/spark/pull/24675#issuecomment-495591310
> Due to the lack of Python tests, and the difficulty of writing PySpark tests for plans, we have faced multiple issues so far.
> This JIRA aims to add a test base that works with SQL files.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27893) Create an integrated test base for Python, Scalar Pandas, Scala UDF by sql files
[ https://issues.apache.org/jira/browse/SPARK-27893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27893:

Assignee: (was: Apache Spark)

> Create an integrated test base for Python, Scalar Pandas, Scala UDF by sql files
> 
>
> Key: SPARK-27893
> URL: https://issues.apache.org/jira/browse/SPARK-27893
> Project: Spark
> Issue Type: Test
> Components: PySpark, SQL
> Affects Versions: 3.0.0
> Reporter: Hyukjin Kwon
> Priority: Critical
>
> See https://github.com/apache/spark/pull/24675#issuecomment-495591310
> Due to the lack of Python tests, and the difficulty of writing PySpark tests for plans, we have faced multiple issues so far.
> This JIRA aims to add a test base that works with SQL files.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27893) Create an integrated test base for Python, Scalar Pandas, Scala UDF by sql files
Hyukjin Kwon created SPARK-27893:
---

Summary: Create an integrated test base for Python, Scalar Pandas, Scala UDF by sql files
Key: SPARK-27893
URL: https://issues.apache.org/jira/browse/SPARK-27893
Project: Spark
Issue Type: Test
Components: PySpark, SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon

See https://github.com/apache/spark/pull/24675#issuecomment-495591310

Due to the lack of Python tests, and the difficulty of writing PySpark tests for plans, we have faced multiple issues so far.

This JIRA aims to add a test base that works with SQL files.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24815) Structured Streaming should support dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852541#comment-16852541 ] Karthik Palaniappan edited comment on SPARK-24815 at 5/31/19 2:26 AM: -- I was starting to update the JIRA description with a problem statement, then realized I am unfamiliar with some of the challenges you guys mentioned in the comments, in particular how state is managed in structured streaming. I was imagining that processing rate was the correct heuristic, assuming the goal is to just keep up with the input, even at the expense of processing time. Continuous processing seems to solve the separate case where you need ultra low latency processing. [~skonto] [~kabhwan] [~gsomogyi] if you guys help with a design, I'd be happy to help with the implementation, but for now I will drop this JIRA. was (Author: karthik palaniappan): I was starting to update the JIRA description with a problem statement, then realized I am unfamiliar with some of the challenges you guys mentioned in the comments, in particular how state is managed in structured streaming. I was imagining that processing rate was the correct heuristic, assuming the goal is to keep up with the input. Continuous processing seems to solve the separate case where you need ultra low latency. [~skonto] [~kabhwan] [~gsomogyi] if you guys help with a design, I'd be happy to help with the implementation, but for now I will drop this JIRA. > Structured Streaming should support dynamic allocation > -- > > Key: SPARK-24815 > URL: https://issues.apache.org/jira/browse/SPARK-24815 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Structured Streaming >Affects Versions: 2.3.1 >Reporter: Karthik Palaniappan >Priority: Minor > > For batch jobs, dynamic allocation is very useful for adding and removing > containers to match the actual workload. On multi-tenant clusters, it ensures > that a Spark job is taking no more resources than necessary. 
In cloud > environments, it enables autoscaling. > However, if you set spark.dynamicAllocation.enabled=true and run a structured > streaming job, the batch dynamic allocation algorithm kicks in. It requests > more executors if the task backlog is a certain size, and removes executors > if they idle for a certain period of time. > Quick thoughts: > 1) Dynamic allocation should be pluggable, rather than hardcoded to a > particular implementation in SparkContext.scala (this should be a separate > JIRA). > 2) We should make a structured streaming algorithm that's separate from the > batch algorithm. Eventually, continuous processing might need its own > algorithm. > 3) Spark should print a warning if you run a structured streaming job when > Core's dynamic allocation is enabled -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
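For reference, the batch dynamic-allocation behavior the discussion above describes (request on backlog, release on idle) is driven by settings like the following. This is a spark-defaults.conf sketch with illustrative values, not configuration taken from the ticket:

```properties
# Batch dynamic allocation: the algorithm that "kicks in" for any job,
# including structured streaming, when enabled.
spark.dynamicAllocation.enabled                  true
spark.shuffle.service.enabled                    true
spark.dynamicAllocation.minExecutors             1
spark.dynamicAllocation.maxExecutors             20
# Release executors that sit idle for this long.
spark.dynamicAllocation.executorIdleTimeout      60s
# Request executors once tasks have been backlogged this long.
spark.dynamicAllocation.schedulerBacklogTimeout  1s
```

A streaming-aware algorithm, as proposed in the comments, would presumably replace the backlog/idle heuristics with a processing-rate heuristic rather than reuse these knobs.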
[jira] [Commented] (SPARK-27891) Long running Spark jobs fail because the HDFS delegation token expires
[ https://issues.apache.org/jira/browse/SPARK-27891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852573#comment-16852573 ] hemshankar sahu commented on SPARK-27891:

Sure, I'll provide it for 2.3.1 in some time, as it needs to run for a long time.

> Long running Spark jobs fail because the HDFS delegation token expires
> -
>
> Key: SPARK-27891
> URL: https://issues.apache.org/jira/browse/SPARK-27891
> Project: Spark
> Issue Type: Bug
> Components: Security
> Affects Versions: 2.0.1, 2.1.0, 2.3.1, 2.4.1
> Reporter: hemshankar sahu
> Priority: Major
> Attachments: application_1559242207407_0001.log
>
> When a Spark job runs on a secured cluster for longer than the time set in the dfs.namenode.delegation.token.renew-interval property of hdfs-site.xml, the job fails.
>
> The following command was used to submit the Spark job:
>
> bin/spark-submit --principal acekrbuser --keytab ~/keytabs/acekrbuser.keytab --master yarn --deploy-mode cluster examples/src/main/python/wordcount.py /tmp/ff1.txt
>
> Application logs attached.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
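The property referenced in the report above lives in hdfs-site.xml. A sketch of the relevant settings with their stock HDFS defaults (values in milliseconds; the exact values on the reporter's cluster are unknown):

```xml
<property>
  <name>dfs.namenode.delegation.token.renew-interval</name>
  <!-- How long a delegation token lasts before it must be renewed;
       HDFS default is 86400000 ms (24 hours). -->
  <value>86400000</value>
</property>
<property>
  <name>dfs.namenode.delegation.token.max-lifetime</name>
  <!-- Hard upper bound on a token's life even with renewal;
       HDFS default is 604800000 ms (7 days). -->
  <value>604800000</value>
</property>
```

Jobs submitted with --principal and --keytab are expected to have Spark re-obtain tokens before the renew interval elapses, which is why a failure at exactly that boundary points at the renewal path.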
[jira] [Updated] (SPARK-27876) Split large shuffle partitions into multiple segments to enable transferring oversized shuffle partition blocks
[ https://issues.apache.org/jira/browse/SPARK-27876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-27876:

Affects Version/s: (was: 3.1.0)
                   (was: 2.4.3)
                   2.3.2

> Split large shuffle partitions into multiple segments to enable transferring oversized shuffle partition blocks
> 
>
> Key: SPARK-27876
> URL: https://issues.apache.org/jira/browse/SPARK-27876
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle, Spark Core
> Affects Versions: 2.3.2
> Reporter: feiwang
> Priority: Major
>
> There is a limit on shuffle reads: if a shuffle partition block is larger than Integer.MaxValue (2GB) and the block is fetched from a remote host, an exception is thrown.
> {code:java}
> 2019-05-24 06:46:30,333 [9935] - WARN [shuffle-client-6-2:TransportChannelHandler@78] - Exception in connection from hadoop3747.jd.163.org/10.196.76.172:7337
> java.lang.IllegalArgumentException: Too large frame: 2991947178
> at org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119)
> at org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:133)
> at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:81)
> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
> {code}
> The task then throws a FetchFailedException. It is retried, and it succeeds only if it is rescheduled onto an executor on the same host as the oversized shuffle partition block.
> If there is more than one oversized (>2GB) shuffle partition block, the task can never succeed, which may cause the whole application to fail.
> In this PR, I propose a new way to fetch shuffle blocks: fetch in multiple passes when the shuffle partition block is oversized. A brief outline:
> 1. Set a shuffle fetch threshold (SHUFFLE_FETCH_THRESHOLD) to Int.MaxValue (2GB).
> 2. Add a parameter, spark.shuffle.fetch.split, to control whether fetching a large partition in multiple passes is enabled.
> 3. When creating the MapStatus, calculate the number of segments of the shuffle block (Math.ceil(size / SHUFFLE_FETCH_THRESHOLD)), and record only segment counts larger than 1.
> 4. Define a new BlockId type, ShuffleBlockSegmentId, used to identify the fetch method.
> 5. When spark.shuffle.fetch.split is enabled, send a ShuffleBlockSegmentId message to the shuffle service instead of a ShuffleBlockId message.
> 6. For a ShuffleBlockId, represent its block as a sequence of ManagedBuffers instead of a single ManagedBuffer.
> 7. In ShuffleBlockFetcherIterator, create a priority queue per ShuffleBlockId to store the fetched SegmentManagedBuffers; when all segments of a ShuffleBlockId have been fetched, take the corresponding sequence of ManagedBuffers (ordered by segment id) as the success result for the ShuffleBlockId.
> 8. On the shuffle service side, if the blockId in openBlocks is a ShuffleBlockSegmentId, respond with a segment ManagedBuffer of the block; if it is a ShuffleBlockId, respond with the whole ManagedBuffer as before.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
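The segment bookkeeping in the proposal's steps 1 and 3 can be sketched as follows. This is a hedged illustration of the arithmetic only, not code from the actual patch; the function names are made up:

```python
import math

SHUFFLE_FETCH_THRESHOLD = 2_147_483_647  # Int.MaxValue bytes, ~2 GB

def segment_count(block_size):
    """Step 3: number of segments a shuffle block is split into."""
    return max(1, math.ceil(block_size / SHUFFLE_FETCH_THRESHOLD))

def segment_ranges(block_size):
    """Byte ranges [(offset, length), ...] that each fetch pass
    would request; only blocks over the threshold need more than one."""
    return [(off, min(SHUFFLE_FETCH_THRESHOLD, block_size - off))
            for off in range(0, block_size, SHUFFLE_FETCH_THRESHOLD)]
```

For the 2,991,947,178-byte frame from the log above, this yields two segments, which is exactly the case the single-pass fetch cannot handle today.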
[jira] [Commented] (SPARK-24815) Structured Streaming should support dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852541#comment-16852541 ] Karthik Palaniappan commented on SPARK-24815: - I was starting to update the JIRA description with a problem statement, then realized I am unfamiliar with some of the challenges you guys mentioned in the comments, in particular how state is managed in structured streaming. I was imagining that processing rate was the correct heuristic, assuming the goal is to keep up with the input. Continuous processing seems to solve the separate case where you need ultra low latency. [~skonto] [~kabhwan] [~gsomogyi] if you guys help with a design, I'd be happy to help with the implementation, but for now I will drop this JIRA. > Structured Streaming should support dynamic allocation > -- > > Key: SPARK-24815 > URL: https://issues.apache.org/jira/browse/SPARK-24815 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Structured Streaming >Affects Versions: 2.3.1 >Reporter: Karthik Palaniappan >Priority: Minor > > For batch jobs, dynamic allocation is very useful for adding and removing > containers to match the actual workload. On multi-tenant clusters, it ensures > that a Spark job is taking no more resources than necessary. In cloud > environments, it enables autoscaling. > However, if you set spark.dynamicAllocation.enabled=true and run a structured > streaming job, the batch dynamic allocation algorithm kicks in. It requests > more executors if the task backlog is a certain size, and removes executors > if they idle for a certain period of time. > Quick thoughts: > 1) Dynamic allocation should be pluggable, rather than hardcoded to a > particular implementation in SparkContext.scala (this should be a separate > JIRA). > 2) We should make a structured streaming algorithm that's separate from the > batch algorithm. Eventually, continuous processing might need its own > algorithm. 
> 3) Spark should print a warning if you run a structured streaming job when > Core's dynamic allocation is enabled -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24815) Structured Streaming should support dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Palaniappan updated SPARK-24815: Description: For batch jobs, dynamic allocation is very useful for adding and removing containers to match the actual workload. On multi-tenant clusters, it ensures that a Spark job is taking no more resources than necessary. In cloud environments, it enables autoscaling. However, if you set spark.dynamicAllocation.enabled=true and run a structured streaming job, the batch dynamic allocation algorithm kicks in. It requests more executors if the task backlog is a certain size, and removes executors if they idle for a certain period of time. Quick thoughts: 1) Dynamic allocation should be pluggable, rather than hardcoded to a particular implementation in SparkContext.scala (this should be a separate JIRA). 2) We should make a structured streaming algorithm that's separate from the batch algorithm. Eventually, continuous processing might need its own algorithm. 3) Spark should print a warning if you run a structured streaming job when Core's dynamic allocation is enabled was: Dynamic allocation is very useful for adding and removing containers to match the actual workload. On multi-tenant clusters, it ensures that a Spark job is taking no more resources than necessary. In cloud environments, it enables autoscaling. However, if you set spark.dynamicAllocation.enabled=true and run a structured streaming job, Core's dynamic allocation algorithm kicks in. It requests executors if the task backlog is a certain size, and remove executors if they idle for a certain period of time. This does not make sense for streaming jobs, as outlined in https://issues.apache.org/jira/browse/SPARK-12133, which introduced dynamic allocation for the old streaming API. 
First, Spark should print a warning if you run a structured streaming job when Core's dynamic allocation is enabled Second, structured streaming should have support for dynamic allocation. It would be convenient if it were the same set of properties as Core's dynamic allocation, but I don't have a strong opinion on that. If somebody can give me pointers on how to add dynamic allocation support, I'd be happy to take a stab. > Structured Streaming should support dynamic allocation > -- > > Key: SPARK-24815 > URL: https://issues.apache.org/jira/browse/SPARK-24815 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Structured Streaming >Affects Versions: 2.3.1 >Reporter: Karthik Palaniappan >Priority: Minor > > For batch jobs, dynamic allocation is very useful for adding and removing > containers to match the actual workload. On multi-tenant clusters, it ensures > that a Spark job is taking no more resources than necessary. In cloud > environments, it enables autoscaling. > However, if you set spark.dynamicAllocation.enabled=true and run a structured > streaming job, the batch dynamic allocation algorithm kicks in. It requests > more executors if the task backlog is a certain size, and removes executors > if they idle for a certain period of time. > Quick thoughts: > 1) Dynamic allocation should be pluggable, rather than hardcoded to a > particular implementation in SparkContext.scala (this should be a separate > JIRA). > 2) We should make a structured streaming algorithm that's separate from the > batch algorithm. Eventually, continuous processing might need its own > algorithm. > 3) Spark should print a warning if you run a structured streaming job when > Core's dynamic allocation is enabled -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27884) Deprecate Python 2 support in Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-27884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852508#comment-16852508 ] Hyukjin Kwon commented on SPARK-27884: -- +1 > Deprecate Python 2 support in Spark 3.0 > --- > > Key: SPARK-27884 > URL: https://issues.apache.org/jira/browse/SPARK-27884 > Project: Spark > Issue Type: Story > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Officially deprecate Python 2 support in Spark 3.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27862) Upgrade json4s-jackson to 3.6.6
[ https://issues.apache.org/jira/browse/SPARK-27862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-27862:
-

Assignee: Izek Greenfield

> Upgrade json4s-jackson to 3.6.6
> ---
>
> Key: SPARK-27862
> URL: https://issues.apache.org/jira/browse/SPARK-27862
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Affects Versions: 3.0.0, 2.4.3
> Reporter: Izek Greenfield
> Assignee: Izek Greenfield
> Priority: Minor
>
> It would be good to upgrade to a newer version.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27862) Upgrade json4s-jackson to 3.6.6
[ https://issues.apache.org/jira/browse/SPARK-27862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27862.
---

Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24736
[https://github.com/apache/spark/pull/24736]

> Upgrade json4s-jackson to 3.6.6
> ---
>
> Key: SPARK-27862
> URL: https://issues.apache.org/jira/browse/SPARK-27862
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Affects Versions: 3.0.0, 2.4.3
> Reporter: Izek Greenfield
> Assignee: Izek Greenfield
> Priority: Minor
> Fix For: 3.0.0
>
> It would be good to upgrade to a newer version.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27892) Saving/loading stages in PipelineModel should be parallel
[ https://issues.apache.org/jira/browse/SPARK-27892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Wang updated SPARK-27892: --- Description: When a PipelineModel is saved/loaded, all the stages are saved/loaded sequentially. When dealing with a PipelineModel with many stages, although each stage's save/load takes sub-second, the total time taken for the PipelineModel could be several minutes. It should be trivial to parallelize the save/load of stages in the SharedReadWrite object. To reproduce: {code:java} import org.apache.spark.ml._ import org.apache.spark.ml.feature.VectorAssembler val outputPath = "..." val stages = (1 to 100) map { i => new VectorAssembler().setInputCols(Array("input")).setOutputCol("o" + i)} val p = new Pipeline().setStages(stages.toArray) val data = Seq(1, 1, 1) toDF "input" val pm = p.fit(data) pm.save(outputPath){code} was:When a PipelineModel is saved/loaded, all the stages are saved/loaded sequentially. When dealing with a PipelineModel with many stages, although each stage's save/load takes sub-second, the total time taken for the PipelineModel could be several minutes. It should be trivial to parallelize the save/load of stages in the SharedReadWrite object. > Saving/loading stages in PipelineModel should be parallel > - > > Key: SPARK-27892 > URL: https://issues.apache.org/jira/browse/SPARK-27892 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.3 >Reporter: Jason Wang >Priority: Minor > Labels: easyfix, performance > > When a PipelineModel is saved/loaded, all the stages are saved/loaded > sequentially. When dealing with a PipelineModel with many stages, although > each stage's save/load takes sub-second, the total time taken for the > PipelineModel could be several minutes. It should be trivial to parallelize > the save/load of stages in the SharedReadWrite object. 
> > To reproduce: > {code:java} > import org.apache.spark.ml._ > import org.apache.spark.ml.feature.VectorAssembler > val outputPath = "..." > val stages = (1 to 100) map { i => new > VectorAssembler().setInputCols(Array("input")).setOutputCol("o" + i)} > val p = new Pipeline().setStages(stages.toArray) > val data = Seq(1, 1, 1) toDF "input" > val pm = p.fit(data) > pm.save(outputPath){code}
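The sequential-save bottleneck described in this issue can be sketched outside Spark. Below is a minimal, hypothetical Python model (not Spark's MLWriter API): `save_stage`, the stage ids, and the directory layout are invented for illustration, with `ThreadPoolExecutor` standing in for whatever parallelism `SharedReadWrite` would use:

```python
# Hypothetical sketch: saving many independent pipeline stages concurrently
# instead of sequentially. Each stage writes to its own sub-directory, so
# completion order does not matter.
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def save_stage(stage_id, base_path):
    # Stand-in for a single stage's sub-second save.
    path = os.path.join(base_path, "stages", stage_id)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "metadata"), "w") as f:
        f.write(stage_id)
    return stage_id

def save_pipeline_parallel(stage_ids, base_path, max_workers=8):
    # pool.map preserves the input order of stage_ids in its results.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda s: save_stage(s, base_path), stage_ids))

out = tempfile.mkdtemp()
saved = save_pipeline_parallel([f"o{i}" for i in range(1, 101)], out)
```

With 100 stages each taking a fraction of a second, the per-stage latencies overlap instead of adding up, which is the effect the issue asks for.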
[jira] [Created] (SPARK-27892) Saving/loading stages in PipelineModel should be parallel
Jason Wang created SPARK-27892: -- Summary: Saving/loading stages in PipelineModel should be parallel Key: SPARK-27892 URL: https://issues.apache.org/jira/browse/SPARK-27892 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.4.3 Reporter: Jason Wang When a PipelineModel is saved/loaded, all the stages are saved/loaded sequentially. When dealing with a PipelineModel with many stages, although each stage's save/load takes sub-second, the total time taken for the PipelineModel could be several minutes. It should be trivial to parallelize the save/load of stages in the SharedReadWrite object.
[jira] [Assigned] (SPARK-27684) Reduce ScalaUDF conversion overheads for primitives
[ https://issues.apache.org/jira/browse/SPARK-27684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-27684: -- Assignee: Marco Gaido > Reduce ScalaUDF conversion overheads for primitives > --- > > Key: SPARK-27684 > URL: https://issues.apache.org/jira/browse/SPARK-27684 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Josh Rosen >Assignee: Marco Gaido >Priority: Major > > I believe that we can reduce ScalaUDF overheads when operating over primitive > types. > In [ScalaUDF's > doGenCode|https://github.com/apache/spark/blob/5a8aad01c2aaf0ceef8e9a3cfabbd2e88c8d9f0d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala#L991] > we have logic to convert UDF function input types from Catalyst internal > types to Scala types (for example, this is used to convert UTF8Strings to > Java Strings). Similarly, we convert UDF return types. > However, UDF input argument conversion is effectively a no-op for primitive > types because {{CatalystTypeConverters.createToScalaConverter()}} returns > {{identity}} in those cases. UDF result conversion is a little trickier > because {{createToCatalystConverter()}} returns [a > function|https://github.com/apache/spark/blob/5a8aad01c2aaf0ceef8e9a3cfabbd2e88c8d9f0d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L413] > that handles {{Option[Primitive]}}, but it might be the case that the > Option-boxing is unusable via ScalaUDF (in which case the conversion truly is > an {{identity}} no-op). > These unnecessary no-op conversions could be quite expensive because each > call involves an index into the {{references}} array to get the converters, a > second index into the converters array to get the correct converter for the > nth input argument, and, finally, the converter invocation itself: > {code:java} > Object project_arg_0 = false ? 
null : ((scala.Function1[]) references[1] /* > converters */)[0].apply(project_value_3);{code} > In these cases, I believe that we can reduce lookup / invocation overheads by > modifying the ScalaUDF code generation to eliminate the conversion calls for > primitives and directly assign the unconverted result, e.g. > {code:java} > Object project_arg_0 = false ? null : project_value_3;{code} > To cleanly handle the case where we have a multi-argument UDF accepting a > mixture of primitive and non-primitive types, we might be able to keep the > {{converters}} array the same size (so indexes stay the same) but omit the > invocation of the converters for the primitive arguments (e.g. {{converters}} > is sparse / contains unused entries in case of primitives). > I spotted this optimization while trying to construct some quick benchmarks > to measure UDF invocation overheads. For example: > {code:java} > spark.udf.register("identity", (x: Int) => x) > sql("select id, id * 2, id * 3 from range(1000 * 1000 * 1000)").rdd.count() > // ~ 52 seconds > sql("select identity(id), identity(id * 2), identity(id * 3) from range(1000 > * 1000 * 1000)").rdd.count() // ~84 seconds{code} > I'm curious to see whether the optimization suggested here can close this > performance gap. It'd also be a good idea to construct more principled > microbenchmarks covering multi-argument UDFs, projections involving multiple > UDFs over different input and output types, etc. >
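The sparse-converters idea proposed above can be modeled in plain Python. This is an illustrative sketch, not Spark's generated Java: `to_scala_converter` and the `None`-means-primitive convention are invented stand-ins for `CatalystTypeConverters` behavior, but the shape mirrors the proposal (indexes stay stable, primitive slots skip the lookup-and-apply entirely, like `project_arg_0 = project_value_3`):

```python
# Illustrative model of the proposed codegen change: keep the converters
# array the same size so indexes stay the same, but leave primitive slots
# empty (None) and assign those arguments directly with no converter call.
def to_scala_converter(is_primitive):
    # Stand-in: primitives would get an identity converter in Spark;
    # here we store None so the call can be skipped outright.
    return None if is_primitive else (lambda v: str(v))

# Argument 0 and 2 are "primitive", argument 1 is not.
converters = [to_scala_converter(p) for p in (True, False, True)]

def convert_args(values):
    # Converter invoked only for non-primitive slots; primitive slots
    # are assigned unconverted, eliminating the no-op apply() call.
    return [v if conv is None else conv(v)
            for conv, v in zip(converters, values)]

args = convert_args([1, 2, 3])
```

The second argument goes through its (toy) converter while the first and third bypass the array lookup and function call, which is exactly the overhead the issue wants removed.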
[jira] [Resolved] (SPARK-27684) Reduce ScalaUDF conversion overheads for primitives
[ https://issues.apache.org/jira/browse/SPARK-27684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-27684. Resolution: Fixed Fix Version/s: 3.0.0 Fixed for 3.0 in https://github.com/apache/spark/pull/24636 > Reduce ScalaUDF conversion overheads for primitives > --- > > Key: SPARK-27684 > URL: https://issues.apache.org/jira/browse/SPARK-27684 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Josh Rosen >Assignee: Marco Gaido >Priority: Major > Fix For: 3.0.0 > > > I believe that we can reduce ScalaUDF overheads when operating over primitive > types. > In [ScalaUDF's > doGenCode|https://github.com/apache/spark/blob/5a8aad01c2aaf0ceef8e9a3cfabbd2e88c8d9f0d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala#L991] > we have logic to convert UDF function input types from Catalyst internal > types to Scala types (for example, this is used to convert UTF8Strings to > Java Strings). Similarly, we convert UDF return types. > However, UDF input argument conversion is effectively a no-op for primitive > types because {{CatalystTypeConverters.createToScalaConverter()}} returns > {{identity}} in those cases. UDF result conversion is a little trickier > because {{createToCatalystConverter()}} returns [a > function|https://github.com/apache/spark/blob/5a8aad01c2aaf0ceef8e9a3cfabbd2e88c8d9f0d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L413] > that handles {{Option[Primitive]}}, but it might be the case that the > Option-boxing is unusable via ScalaUDF (in which case the conversion truly is > an {{identity}} no-op). 
> These unnecessary no-op conversions could be quite expensive because each > call involves an index into the {{references}} array to get the converters, a > second index into the converters array to get the correct converter for the > nth input argument, and, finally, the converter invocation itself: > {code:java} > Object project_arg_0 = false ? null : ((scala.Function1[]) references[1] /* > converters */)[0].apply(project_value_3);{code} > In these cases, I believe that we can reduce lookup / invocation overheads by > modifying the ScalaUDF code generation to eliminate the conversion calls for > primitives and directly assign the unconverted result, e.g. > {code:java} > Object project_arg_0 = false ? null : project_value_3;{code} > To cleanly handle the case where we have a multi-argument UDF accepting a > mixture of primitive and non-primitive types, we might be able to keep the > {{converters}} array the same size (so indexes stay the same) but omit the > invocation of the converters for the primitive arguments (e.g. {{converters}} > is sparse / contains unused entries in case of primitives). > I spotted this optimization while trying to construct some quick benchmarks > to measure UDF invocation overheads. For example: > {code:java} > spark.udf.register("identity", (x: Int) => x) > sql("select id, id * 2, id * 3 from range(1000 * 1000 * 1000)").rdd.count() > // ~ 52 seconds > sql("select identity(id), identity(id * 2), identity(id * 3) from range(1000 > * 1000 * 1000)").rdd.count() // ~84 seconds{code} > I'm curious to see whether the optimization suggested here can close this > performance gap. It'd also be a good idea to construct more principled > microbenchmarks covering multi-argument UDFs, projections involving multiple > UDFs over different input and output types, etc. >
[jira] [Assigned] (SPARK-27890) Improve SQL parser error message when missing backquotes for identifiers with hyphens
[ https://issues.apache.org/jira/browse/SPARK-27890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27890: Assignee: Apache Spark > Improve SQL parser error message when missing backquotes for identifiers with > hyphens > - > > Key: SPARK-27890 > URL: https://issues.apache.org/jira/browse/SPARK-27890 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3 >Reporter: Yesheng Ma >Assignee: Apache Spark >Priority: Major > > The current SQL parser's error message for hyphen-connected identifiers without > surrounding backquotes (e.g. {{hyphen-table}}) is confusing for end users. A > possible approach to tackle this is to explicitly capture these wrong usages > in the SQL parser. In this way, the end users can fix these errors more > quickly.
[jira] [Assigned] (SPARK-27890) Improve SQL parser error message when missing backquotes for identifiers with hyphens
[ https://issues.apache.org/jira/browse/SPARK-27890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27890: Assignee: (was: Apache Spark) > Improve SQL parser error message when missing backquotes for identifiers with > hyphens > - > > Key: SPARK-27890 > URL: https://issues.apache.org/jira/browse/SPARK-27890 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3 >Reporter: Yesheng Ma >Priority: Major > > The current SQL parser's error message for hyphen-connected identifiers without > surrounding backquotes (e.g. {{hyphen-table}}) is confusing for end users. A > possible approach to tackle this is to explicitly capture these wrong usages > in the SQL parser. In this way, the end users can fix these errors more > quickly.
[jira] [Updated] (SPARK-27772) SQLTestUtils Refactoring
[ https://issues.apache.org/jira/browse/SPARK-27772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27772: -- Component/s: Tests > SQLTestUtils Refactoring > > > Key: SPARK-27772 > URL: https://issues.apache.org/jira/browse/SPARK-27772 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: William Wong >Priority: Minor > > The current `SQLTestUtils` created many `withXXX` utility functions to clean > up tables/views/caches created for testing purposes. Some of those `withXXX` > functions ignore certain exceptions, like `NoSuchTableException` in the clean > up block (i.e., the finally block). > > {code:java} > /** > * Drops temporary view `viewNames` after calling `f`. > */ > protected def withTempView(viewNames: String*)(f: => Unit): Unit = { > try f finally { > // If the test failed part way, we don't want to mask the failure by > failing to remove > // temp views that never got created. > try viewNames.foreach(spark.catalog.dropTempView) catch { > case _: NoSuchTableException => > } > } > } > {code} > I believe this is not the best approach, because it is hard to anticipate what > exceptions should or should not be ignored. > > Java's `try-with-resources` statement does not mask exception throwing in the > try block with any exception caught in the 'close()' statement. Exception > caught in the 'close()' statement would be added as a suppressed exception > instead. That sounds like a better approach. > > Therefore, I propose to standardise those 'withXXX' functions with the following > `withFinallyBlock` function, which does something similar to Java's > try-with-resources statement. > {code:java} > /** > * Drops temporary view `viewNames` after calling `f`. 
> */ > protected def withTempView(viewNames: String*)(f: => Unit): Unit = { > tryWithFinally(f)(viewNames.foreach(spark.catalog.dropTempView)) > } > /** > * Executes the given tryBlock and then the given finallyBlock no matter > whether tryBlock throws > * an exception. If both tryBlock and finallyBlock throw exceptions, the > exception thrown > * from the finallyBlock will be added to the exception thrown from tryBlock > as a > * suppressed exception. It helps to avoid masking the exception from tryBlock > with the exception > from finallyBlock > */ > private def tryWithFinally(tryBlock: => Unit)(finallyBlock: => Unit): Unit = { > var fromTryBlock: Throwable = null > try tryBlock catch { > case cause: Throwable => > fromTryBlock = cause > throw cause > } finally { > if (fromTryBlock != null) { > try finallyBlock catch { > case fromFinallyBlock: Throwable => > fromTryBlock.addSuppressed(fromFinallyBlock) > throw fromTryBlock > } > } else { > finallyBlock > } > } > } > {code} > If a feature is well written, we should not hit any exceptions in those cleanup > methods in test cases. The purpose of this proposal is to help developers > identify what actually breaks their tests.
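The `tryWithFinally` semantics proposed in this issue can be mirrored outside Scala. Below is a minimal Python sketch (names are invented; the ad-hoc `suppressed` attribute stands in for Java's `Throwable.addSuppressed`, which Python has no direct equivalent of):

```python
# Sketch of the proposed semantics: the try-block's exception always wins,
# and a failure in the cleanup block is attached to it instead of masking it.
def try_with_finally(try_block, finally_block):
    primary = None
    try:
        try_block()
    except Exception as e:
        primary = e
        raise
    finally:
        if primary is not None:
            try:
                finally_block()
            except Exception as suppressed:
                # Record the cleanup failure on the primary exception,
                # like Throwable.addSuppressed in Java, and let the
                # primary exception keep propagating.
                primary.suppressed = suppressed
        else:
            # No primary failure: a cleanup failure propagates normally.
            finally_block()

def failing_test():
    raise ValueError("test failed")

def failing_cleanup():
    raise RuntimeError("cleanup failed")

try:
    try_with_finally(failing_test, failing_cleanup)
except ValueError as primary_error:
    caught = primary_error
```

The caller sees the real test failure (`ValueError`) with the cleanup failure attached, rather than the cleanup failure masking it, which is exactly the masking problem the issue describes with the current `withXXX` helpers.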
[jira] [Commented] (SPARK-27891) Long running spark jobs fail because of HDFS delegation token expires
[ https://issues.apache.org/jira/browse/SPARK-27891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852420#comment-16852420 ] Marcelo Vanzin commented on SPARK-27891: Ok, the updated logs show the issue. But they're from Spark 2.2.0, which is EOL. If you can provide logs from the latest 2.3 or 2.4 releases, that would be more helpful (since there have been a few changes in this area). > Long running spark jobs fail because of HDFS delegation token expires > - > > Key: SPARK-27891 > URL: https://issues.apache.org/jira/browse/SPARK-27891 > Project: Spark > Issue Type: Bug > Components: Security >Affects Versions: 2.0.1, 2.1.0, 2.3.1, 2.4.1 >Reporter: hemshankar sahu >Priority: Major > Attachments: application_1559242207407_0001.log > > > When the spark job runs on a secured cluster for longer than the time that is > mentioned in the dfs.namenode.delegation.token.renew-interval property of > hdfs-site.xml the spark job fails. > Following command was used to submit the spark job > bin/spark-submit --principal acekrbuser --keytab ~/keytabs/acekrbuser.keytab > --master yarn --deploy-mode cluster examples/src/main/python/wordcount.py > /tmp/ff1.txt > > Application Logs attached >
[jira] [Updated] (SPARK-27891) Long running spark jobs fail because of HDFS delegation token expires
[ https://issues.apache.org/jira/browse/SPARK-27891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemshankar sahu updated SPARK-27891: Description: When the spark job runs on a secured cluster for longer than the time that is mentioned in the dfs.namenode.delegation.token.renew-interval property of hdfs-site.xml the spark job fails. Following command was used to submit the spark job bin/spark-submit --principal acekrbuser --keytab ~/keytabs/acekrbuser.keytab --master yarn --deploy-mode cluster examples/src/main/python/wordcount.py /tmp/ff1.txt Application Logs attached was: When the spark job runs on a secured cluster for longer than the time that is mentioned in the dfs.namenode.delegation.token.renew-interval property of hdfs-site.xml the spark job fails. Following command was used to submit the spark job bin/spark-submit --principal acekrbuser --keytab ~/keytabs/acekrbuser.keytab --master yarn --deploy-mode cluster examples/src/main/python/wordcount.py /tmp/ff1.txt Application Logs pasted in Docs Text > Long running spark jobs fail because of HDFS delegation token expires > - > > Key: SPARK-27891 > URL: https://issues.apache.org/jira/browse/SPARK-27891 > Project: Spark > Issue Type: Bug > Components: Security >Affects Versions: 2.0.1, 2.1.0, 2.3.1, 2.4.1 >Reporter: hemshankar sahu >Priority: Major > Attachments: application_1559242207407_0001.log > > > When the spark job runs on a secured cluster for longer than the time that is > mentioned in the dfs.namenode.delegation.token.renew-interval property of > hdfs-site.xml the spark job fails. 
> Following command was used to submit the spark job > bin/spark-submit --principal acekrbuser --keytab ~/keytabs/acekrbuser.keytab > --master yarn --deploy-mode cluster examples/src/main/python/wordcount.py > /tmp/ff1.txt > > Application Logs attached >
[jira] [Updated] (SPARK-27891) Long running spark jobs fail because of HDFS delegation token expires
[ https://issues.apache.org/jira/browse/SPARK-27891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemshankar sahu updated SPARK-27891: Attachment: application_1559242207407_0001.log > Long running spark jobs fail because of HDFS delegation token expires > - > > Key: SPARK-27891 > URL: https://issues.apache.org/jira/browse/SPARK-27891 > Project: Spark > Issue Type: Bug > Components: Security >Affects Versions: 2.0.1, 2.1.0, 2.3.1, 2.4.1 >Reporter: hemshankar sahu >Priority: Major > Attachments: application_1559242207407_0001.log > > > When the spark job runs on a secured cluster for longer than the time that is > mentioned in the dfs.namenode.delegation.token.renew-interval property of > hdfs-site.xml the spark job fails. > Following command was used to submit the spark job > bin/spark-submit --principal acekrbuser --keytab ~/keytabs/acekrbuser.keytab > --master yarn --deploy-mode cluster examples/src/main/python/wordcount.py > /tmp/ff1.txt > > Application Logs pasted in Docs Text >
[jira] [Commented] (SPARK-27891) Long running spark jobs fail because of HDFS delegation token expires
[ https://issues.apache.org/jira/browse/SPARK-27891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852408#comment-16852408 ] Marcelo Vanzin commented on SPARK-27891: {{container_e48_1559242207407_0001_02_01}} tells me that's the second attempt. That makes your problem the same as SPARK-23361. > Long running spark jobs fail because of HDFS delegation token expires > - > > Key: SPARK-27891 > URL: https://issues.apache.org/jira/browse/SPARK-27891 > Project: Spark > Issue Type: Bug > Components: Security >Affects Versions: 2.0.1, 2.1.0, 2.3.1, 2.4.1 >Reporter: hemshankar sahu >Priority: Major > > When the spark job runs on a secured cluster for longer than the time that is > mentioned in the dfs.namenode.delegation.token.renew-interval property of > hdfs-site.xml the spark job fails. > Following command was used to submit the spark job > bin/spark-submit --principal acekrbuser --keytab ~/keytabs/acekrbuser.keytab > --master yarn --deploy-mode cluster examples/src/main/python/wordcount.py > /tmp/ff1.txt > > Application Logs pasted in Docs Text >
[jira] [Updated] (SPARK-27891) Long running spark jobs fail because of HDFS delegation token expires
[ https://issues.apache.org/jira/browse/SPARK-27891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemshankar sahu updated SPARK-27891: Description: When the spark job runs on a secured cluster for longer than the time that is mentioned in the dfs.namenode.delegation.token.renew-interval property of hdfs-site.xml the spark job fails. Following command was used to submit the spark job bin/spark-submit --principal acekrbuser --keytab ~/keytabs/acekrbuser.keytab --master yarn --deploy-mode cluster examples/src/main/python/wordcount.py /tmp/ff1.txt Application Logs pasted in Docs Text was: When the spark job runs on a secured cluster for longer than the time that is mentioned in the dfs.namenode.delegation.token.renew-interval property of hdfs-site.xml the spark job fails. Application Logs pasted in Docs Text > Long running spark jobs fail because of HDFS delegation token expires > - > > Key: SPARK-27891 > URL: https://issues.apache.org/jira/browse/SPARK-27891 > Project: Spark > Issue Type: Bug > Components: Security >Affects Versions: 2.0.1, 2.1.0, 2.3.1, 2.4.1 >Reporter: hemshankar sahu >Priority: Major > > When the spark job runs on a secured cluster for longer than the time that is > mentioned in the dfs.namenode.delegation.token.renew-interval property of > hdfs-site.xml the spark job fails. > Following command was used to submit the spark job > bin/spark-submit --principal acekrbuser --keytab ~/keytabs/acekrbuser.keytab > --master yarn --deploy-mode cluster examples/src/main/python/wordcount.py > /tmp/ff1.txt > > Application Logs pasted in Docs Text >
[jira] [Reopened] (SPARK-27891) Long running spark jobs fail because of HDFS delegation token expires
[ https://issues.apache.org/jira/browse/SPARK-27891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemshankar sahu reopened SPARK-27891: - We used the following command to submit the job bin/spark-submit --principal acekrbuser --keytab ~/keytabs/acekrbuser.keytab --master yarn --deploy-mode cluster examples/src/main/python/wordcount.py /tmp/ff1.txt > Long running spark jobs fail because of HDFS delegation token expires > - > > Key: SPARK-27891 > URL: https://issues.apache.org/jira/browse/SPARK-27891 > Project: Spark > Issue Type: Bug > Components: Security >Affects Versions: 2.0.1, 2.1.0, 2.3.1, 2.4.1 >Reporter: hemshankar sahu >Priority: Major > > When the spark job runs on a secured cluster for longer than the time that is > mentioned in the dfs.namenode.delegation.token.renew-interval property of > hdfs-site.xml the spark job fails. > > > Application Logs pasted in Docs Text >
[jira] [Comment Edited] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.
[ https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852402#comment-16852402 ] Dongjoon Hyun edited comment on SPARK-27812 at 5/30/19 10:06 PM: - Thank you for reporting, [~Andrew HUALI] and thank you for the investigation, [~igor.calabria]. Can we move forward to resolve the issue by upgrading the libraries? was (Author: dongjoon): Thank you for reporting, [~Andrew HUALI] and thank you for the investigation, [~igor.calabria]. > kubernetes client import non-daemon thread which block jvm exit. > > > Key: SPARK-27812 > URL: https://issues.apache.org/jira/browse/SPARK-27812 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.3 >Reporter: Henry Yu >Priority: Major > > I try spark-submit to k8s with cluster mode. Driver pod failed to exit with > An Okhttp Websocket Non-Daemon Thread. >
[jira] [Commented] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.
[ https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852402#comment-16852402 ] Dongjoon Hyun commented on SPARK-27812: --- Thank you for reporting, [~Andrew HUALI] and thank you for the investigation, [~igor.calabria]. > kubernetes client import non-daemon thread which block jvm exit. > > > Key: SPARK-27812 > URL: https://issues.apache.org/jira/browse/SPARK-27812 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.3 >Reporter: Henry Yu >Priority: Major > > I try spark-submit to k8s with cluster mode. Driver pod failed to exit with > An Okhttp Websocket Non-Daemon Thread. >
[jira] [Issue Comment Deleted] (SPARK-22151) PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly
[ https://issues.apache.org/jira/browse/SPARK-22151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruslan Dautkhanov updated SPARK-22151: -- Comment: was deleted (was: Is there is a workaround for this in Apache Livy? We're still on Spark 2.3 ..) > PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly > -- > > Key: SPARK-22151 > URL: https://issues.apache.org/jira/browse/SPARK-22151 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.1 >Reporter: Thomas Graves >Assignee: Parth Gandhi >Priority: Major > Fix For: 2.4.0 > > > Running in yarn cluster mode and trying to set pythonpath via > spark.yarn.appMasterEnv.PYTHONPATH doesn't work. > the yarn Client code looks at the env variables: > val pythonPathStr = (sys.env.get("PYTHONPATH") ++ pythonPath) > But when you set spark.yarn.appMasterEnv it puts it into the local env. > So the python path set in spark.yarn.appMasterEnv isn't properly set. > You can work around if you are running in cluster mode by setting it on the > client like: > PYTHONPATH=./addon/python/ spark-submit
[jira] [Resolved] (SPARK-27891) Long running spark jobs fail because of HDFS delegation token expires
[ https://issues.apache.org/jira/browse/SPARK-27891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-27891. Resolution: Not A Problem You have to provide a keytab for Spark for this to work. That's explained in the docs: http://spark.apache.org/docs/latest/security.html#kerberos > Long running spark jobs fail because of HDFS delegation token expires > - > > Key: SPARK-27891 > URL: https://issues.apache.org/jira/browse/SPARK-27891 > Project: Spark > Issue Type: Bug > Components: Security >Affects Versions: 2.0.1, 2.1.0, 2.3.1, 2.4.1 >Reporter: hemshankar sahu >Priority: Major > > When the spark job runs on a secured cluster for longer than the time that is > mentioned in the dfs.namenode.delegation.token.renew-interval property of > hdfs-site.xml the spark job fails. > > > Application Logs pasted in Docs Text >
[jira] [Created] (SPARK-27891) Long running spark jobs fail because of HDFS delegation token expires
hemshankar sahu created SPARK-27891: --- Summary: Long running spark jobs fail because of HDFS delegation token expires Key: SPARK-27891 URL: https://issues.apache.org/jira/browse/SPARK-27891 Project: Spark Issue Type: Bug Components: Security Affects Versions: 2.4.1, 2.3.1, 2.1.0, 2.0.1 Reporter: hemshankar sahu When the spark job runs on a secured cluster for longer than the time that is mentioned in the dfs.namenode.delegation.token.renew-interval property of hdfs-site.xml the spark job fails. Application Logs pasted in Docs Text
[jira] [Resolved] (SPARK-27361) YARN support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-27361. --- Resolution: Fixed Fix Version/s: 3.0.0 All the subtasks are finished, and the parts of Spark on Hadoop 3.x that we need are also merged, so resolving this. > YARN support for GPU-aware scheduling > - > > Key: SPARK-27361 > URL: https://issues.apache.org/jira/browse/SPARK-27361 > Project: Spark > Issue Type: Story > Components: YARN >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Thomas Graves >Priority: Major > Fix For: 3.0.0 > > > Design and implement YARN support for GPU-aware scheduling: > * User can request GPU resources at Spark application level. > * How the Spark executor discovers GPUs when run on YARN > * Integrate with YARN 3.2 GPU support.
[jira] [Commented] (SPARK-22151) PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly
[ https://issues.apache.org/jira/browse/SPARK-22151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852381#comment-16852381 ] Ruslan Dautkhanov commented on SPARK-22151: --- Is there a workaround for this in Apache Livy? We're still on Spark 2.3 .. > PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly > -- > > Key: SPARK-22151 > URL: https://issues.apache.org/jira/browse/SPARK-22151 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.1 >Reporter: Thomas Graves >Assignee: Parth Gandhi >Priority: Major > Fix For: 2.4.0 > > > Running in yarn cluster mode and trying to set pythonpath via > spark.yarn.appMasterEnv.PYTHONPATH doesn't work. > the yarn Client code looks at the env variables: > val pythonPathStr = (sys.env.get("PYTHONPATH") ++ pythonPath) > But when you set spark.yarn.appMasterEnv it puts it into the local env. > So the python path set in spark.yarn.appMasterEnv isn't properly set. > You can work around if you are running in cluster mode by setting it on the > client like: > PYTHONPATH=./addon/python/ spark-submit
[jira] [Created] (SPARK-27890) Improve SQL parser error message when missing backquotes for identifiers with hyphens
Yesheng Ma created SPARK-27890: -- Summary: Improve SQL parser error message when missing backquotes for identifiers with hyphens Key: SPARK-27890 URL: https://issues.apache.org/jira/browse/SPARK-27890 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.3 Reporter: Yesheng Ma The current SQL parser's error message for hyphen-connected identifiers without surrounding backquotes (e.g. {{hyphen-table}}) is confusing for end users. A possible approach to tackle this is to explicitly capture these wrong usages in the SQL parser. In this way, the end users can fix these errors more quickly.
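The "explicitly capture these wrong usages" idea can be sketched outside the parser. Below is a hypothetical Python check (Spark's actual fix would live in the ANTLR grammar and its error handling; the regex and message here are invented for illustration):

```python
import re

# Detect an unquoted hyphen-connected identifier like `hyphen-table` and
# suggest surrounding backquotes, instead of surfacing a generic parse error.
# Backquoted identifiers are deliberately left alone.
HYPHEN_IDENT = re.compile(r"(?<!`)\b([A-Za-z_]\w*(?:-[A-Za-z_]\w*)+)\b(?!`)")

def diagnose(sql):
    m = HYPHEN_IDENT.search(sql)
    if m:
        ident = m.group(1)
        return (f"Possibly unquoted identifier '{ident}' detected; "
                f"write it as `{ident}` if it names a table or column")
    return None

msg = diagnose("SELECT * FROM hyphen-table")
```

For the unquoted query the check produces a targeted suggestion, while a properly backquoted `` `hyphen-table` `` yields no diagnostic, which is the user-experience improvement the issue asks for.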
[jira] [Assigned] (SPARK-27872) Driver and executors use a different service account breaking pull secrets
[ https://issues.apache.org/jira/browse/SPARK-27872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27872: Assignee: Apache Spark > Driver and executors use a different service account breaking pull secrets > -- > > Key: SPARK-27872 > URL: https://issues.apache.org/jira/browse/SPARK-27872 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.0, 2.4.3 >Reporter: Stavros Kontopoulos >Assignee: Apache Spark >Priority: Major > > Driver and executors use different service accounts when the driver has > one set up that is different from the default: > [https://gist.github.com/skonto/9beb5afa2ec4659ba563cbb0a8b9c4dd] > This makes the executor pods fail when the user links the driver service > account with a pull secret: > [https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#add-imagepullsecrets-to-a-service-account]. > Executors will not use the driver's service account and will not be able to > get the secret in order to pull the related image. > I am not sure what the assumption is here for using the default account for > executors; probably the fact that this account is limited (note that > executors don't create resources)? This is an inconsistency that could be > worked around with the pod template feature in Spark 3.0.0, but it breaks pull > secrets, and in general I think it's a bug to have it. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27872) Driver and executors use a different service account breaking pull secrets
[ https://issues.apache.org/jira/browse/SPARK-27872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27872: Assignee: (was: Apache Spark) > Driver and executors use a different service account breaking pull secrets > -- > > Key: SPARK-27872 > URL: https://issues.apache.org/jira/browse/SPARK-27872 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.0, 2.4.3 >Reporter: Stavros Kontopoulos >Priority: Major > > Driver and executors use different service accounts when the driver has > one set up that is different from the default: > [https://gist.github.com/skonto/9beb5afa2ec4659ba563cbb0a8b9c4dd] > This makes the executor pods fail when the user links the driver service > account with a pull secret: > [https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#add-imagepullsecrets-to-a-service-account]. > Executors will not use the driver's service account and will not be able to > get the secret in order to pull the related image. > I am not sure what the assumption is here for using the default account for > executors; probably the fact that this account is limited (note that > executors don't create resources)? This is an inconsistency that could be > worked around with the pod template feature in Spark 3.0.0, but it breaks pull > secrets, and in general I think it's a bug to have it. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
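A sketch of the pod-template workaround the report mentions, assuming Spark 3.0.0's `spark.kubernetes.executor.podTemplateFile` support: point the executor pods at the same service account as the driver so both can use the linked imagePullSecrets. The service-account name and template contents are illustrative, not taken from the ticket:

```python
# Hypothetical executor pod template: reuse the driver's service account
# (here assumed to be named "spark-driver-sa") so executors inherit the
# imagePullSecrets linked to it.
template = """\
apiVersion: v1
kind: Pod
spec:
  serviceAccountName: spark-driver-sa
"""
# This template would be saved as executor-template.yaml and referenced
# from spark-submit:
submit_args = [
    "spark-submit",
    "--conf", "spark.kubernetes.authenticate.driver.serviceAccountName=spark-driver-sa",
    "--conf", "spark.kubernetes.executor.podTemplateFile=executor-template.yaml",
]
print(" ".join(submit_args))
```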
[jira] [Commented] (SPARK-24149) Automatic namespaces discovery in HDFS federation
[ https://issues.apache.org/jira/browse/SPARK-24149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852365#comment-16852365 ] Dhruve Ashar commented on SPARK-24149: -- IMHO, having a consistent way to reason about the behavior is preferable to specifying an additional config. We encountered an issue while trying to get the filesystem using a logical nameservice (one more reason why this is broken); see [https://github.com/apache/spark/pull/21216/files#diff-f8659513cf91c15097428c3d8dfbcc35R213] > Automatic namespaces discovery in HDFS federation > - > > Key: SPARK-24149 > URL: https://issues.apache.org/jira/browse/SPARK-24149 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Minor > Fix For: 2.4.0 > > > Hadoop 3 introduced HDFS federation. > Spark fails to write on different namespaces when Hadoop federation is turned > on and the cluster is secure. This happens because Spark looks for the > delegation token only for the configured defaultFS and not for all the > available namespaces. A workaround is to use the property > {{spark.yarn.access.hadoopFileSystems}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
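The automatic discovery the ticket asks for amounts to enumerating every configured nameservice instead of only the one behind fs.defaultFS. A Python sketch over a Hadoop-style config dict (`dfs.nameservices` and `fs.defaultFS` are real Hadoop keys; the helper itself is illustrative):

```python
def federation_namespaces(hadoop_conf):
    # dfs.nameservices lists the logical nameservices of an HDFS
    # federation; each needs its own delegation token, not just the
    # namespace behind fs.defaultFS.
    services = hadoop_conf.get("dfs.nameservices", "")
    return ["hdfs://" + ns for ns in services.split(",") if ns]

conf = {"fs.defaultFS": "hdfs://ns1", "dfs.nameservices": "ns1,ns2"}
print(federation_namespaces(conf))  # ['hdfs://ns1', 'hdfs://ns2']
```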
[jira] [Resolved] (SPARK-27773) Add shuffle service metric for number of exceptions caught in ExternalShuffleBlockHandler
[ https://issues.apache.org/jira/browse/SPARK-27773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-27773. Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24645 [https://github.com/apache/spark/pull/24645] > Add shuffle service metric for number of exceptions caught in > ExternalShuffleBlockHandler > - > > Key: SPARK-27773 > URL: https://issues.apache.org/jira/browse/SPARK-27773 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.4.3 >Reporter: Steven Rand >Assignee: Steven Rand >Priority: Minor > Fix For: 3.0.0 > > > The health of the external shuffle service is currently difficult to monitor. > At least for the YARN shuffle service, the only current indication of health > is whether or not the shuffle service threads are running in the NodeManager. > However, we've seen that clients can sometimes experience elevated failure > rates on requests to the shuffle service even when those threads are running. > It would be helpful to have some indication of how often requests to the > shuffle service are failing, as this could be monitored, alerted on, etc. > One suggestion (implemented in the PR I'll attach to this ticket) is to add a > metric to {{ExternalShuffleBlockHandler.ShuffleMetrics}} which keeps track of > how many times we caught an exception in the shuffle service's RPC handler. I > think that this gives us the insight into request failure rates that we're > currently missing, but obviously I'm open to alternatives as well if people > have other ideas. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
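The proposed metric is essentially an exception counter wrapped around the RPC handler. A Python analogue of the idea (names and structure are hypothetical, not the actual ShuffleMetrics code):

```python
import functools

class ShuffleMetrics:
    """Analogue of the proposed counter in ShuffleMetrics."""
    def __init__(self):
        self.caught_exceptions = 0

def count_exceptions(metrics):
    # Decorator: count every exception the handler raises, then re-raise
    # so callers still see the failure.
    def decorator(handler):
        @functools.wraps(handler)
        def wrapped(*args, **kwargs):
            try:
                return handler(*args, **kwargs)
            except Exception:
                metrics.caught_exceptions += 1
                raise
        return wrapped
    return decorator

metrics = ShuffleMetrics()

@count_exceptions(metrics)
def handle_request(block_id):
    if block_id < 0:
        raise ValueError("bad block id")
    return f"block-{block_id}"

handle_request(1)
try:
    handle_request(-1)
except ValueError:
    pass
print(metrics.caught_exceptions)  # 1
```

A counter like this can then be polled and alerted on, which is exactly the visibility the ticket asks for.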
[jira] [Assigned] (SPARK-27773) Add shuffle service metric for number of exceptions caught in ExternalShuffleBlockHandler
[ https://issues.apache.org/jira/browse/SPARK-27773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-27773: -- Assignee: Steven Rand > Add shuffle service metric for number of exceptions caught in > ExternalShuffleBlockHandler > - > > Key: SPARK-27773 > URL: https://issues.apache.org/jira/browse/SPARK-27773 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.4.3 >Reporter: Steven Rand >Assignee: Steven Rand >Priority: Minor > > The health of the external shuffle service is currently difficult to monitor. > At least for the YARN shuffle service, the only current indication of health > is whether or not the shuffle service threads are running in the NodeManager. > However, we've seen that clients can sometimes experience elevated failure > rates on requests to the shuffle service even when those threads are running. > It would be helpful to have some indication of how often requests to the > shuffle service are failing, as this could be monitored, alerted on, etc. > One suggestion (implemented in the PR I'll attach to this ticket) is to add a > metric to {{ExternalShuffleBlockHandler.ShuffleMetrics}} which keeps track of > how many times we caught an exception in the shuffle service's RPC handler. I > think that this gives us the insight into request failure rates that we're > currently missing, but obviously I'm open to alternatives as well if people > have other ideas. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.
[ https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852342#comment-16852342 ] Igor Calabria commented on SPARK-27812: --- I believe this was introduced when `kubernetes-client` was updated in https://issues.apache.org/jira/browse/SPARK-26742 . I can confirm that reverting [https://github.com/apache/spark/commit/2d4e9cf84b85a5f8278276e8d8ff59f6f4b11c4c] fixes this issue. This is funny because it was already discussed here (2017): [https://github.com/apache-spark-on-k8s/spark/pull/216#discussion_r112641765] The previous solution was setting the `.withWebsocketPingInterval(0)` option when instantiating the client. This prevented the user thread from being created. Maybe something changed with okhttp or kubernetes-client and this is no longer the case. > kubernetes client import non-daemon thread which block jvm exit. > > > Key: SPARK-27812 > URL: https://issues.apache.org/jira/browse/SPARK-27812 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.3 >Reporter: Henry Yu >Priority: Major > > I tried spark-submit to k8s with cluster mode. The driver pod failed to exit > because of an okhttp websocket non-daemon thread. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
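The underlying mechanics can be illustrated in Python: a non-daemon background thread (the analogue of okhttp's websocket ping thread) keeps the process alive at shutdown, while a daemon thread does not. An illustrative sketch, not the okhttp or kubernetes-client code:

```python
import threading
import time

def ping_loop(stop):
    # Stand-in for the websocket ping thread the client spawns.
    while not stop.is_set():
        time.sleep(0.01)

stop = threading.Event()

# If started, a non-daemon thread like this would block interpreter
# shutdown until it finishes -- the analogue of the driver JVM hanging.
blocking = threading.Thread(target=ping_loop, args=(stop,))

# A daemon thread lets the process exit without waiting for it.
non_blocking = threading.Thread(target=ping_loop, args=(stop,), daemon=True)

print(blocking.daemon, non_blocking.daemon)  # False True
```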
[jira] [Resolved] (SPARK-27378) spark-submit requests GPUs in YARN mode
[ https://issues.apache.org/jira/browse/SPARK-27378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-27378. Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24634 [https://github.com/apache/spark/pull/24634] > spark-submit requests GPUs in YARN mode > --- > > Key: SPARK-27378 > URL: https://issues.apache.org/jira/browse/SPARK-27378 > Project: Spark > Issue Type: Sub-task > Components: Spark Submit, YARN >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Thomas Graves >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27378) spark-submit requests GPUs in YARN mode
[ https://issues.apache.org/jira/browse/SPARK-27378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-27378: -- Assignee: Thomas Graves > spark-submit requests GPUs in YARN mode > --- > > Key: SPARK-27378 > URL: https://issues.apache.org/jira/browse/SPARK-27378 > Project: Spark > Issue Type: Sub-task > Components: Spark Submit, YARN >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Thomas Graves >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27886) Add Apache Spark project to https://python3statement.org/
[ https://issues.apache.org/jira/browse/SPARK-27886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852276#comment-16852276 ] Xiangrui Meng commented on SPARK-27886: --- By "at the end of year", you mean year 2020, right? The proposed timeline sets the switch to 2020/04, a rough estimate of Spark 3.1 release. > Add Apache Spark project to https://python3statement.org/ > - > > Key: SPARK-27886 > URL: https://issues.apache.org/jira/browse/SPARK-27886 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Add Spark to https://python3statement.org/ and indicate our timeline. I > reviewed the statement at https://python3statement.org/. Most projects listed > there will *drop* Python 2 before 2020/01/01 instead of deprecating it with > only [one > exception|https://github.com/python3statement/python3statement.github.io/blob/6ccacf8beb3cc49b1b3d572ab4f841e250853ca9/site.js#L206]. > We certainly cannot drop Python 2 support in 2019 given we haven't > deprecated it yet. > Maybe we can add the following time line: > * 2019/10 - 2020/04: Python 2 & 3 > * 2020/04 - : Python 3 only > The switching is the next release after Spark 3.0. If we want to hold another > release, it would be Sept or Oct. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27798) from_avro can modify variables in other rows in local mode
[ https://issues.apache.org/jira/browse/SPARK-27798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852268#comment-16852268 ] Dongjoon Hyun commented on SPARK-27798: --- Thank you so much for the investigation and update, [~Gengliang.Wang]. > from_avro can modify variables in other rows in local mode > -- > > Key: SPARK-27798 > URL: https://issues.apache.org/jira/browse/SPARK-27798 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Yosuke Mori >Priority: Blocker > Labels: correctness > Attachments: Screen Shot 2019-05-21 at 2.39.27 PM.png > > > Steps to reproduce: > Create a local Dataset (at least two distinct rows) with a binary Avro field. > Use the {{from_avro}} function to deserialize the binary into another column. > Verify that all of the rows incorrectly have the same value. > Here's a concrete example (using Spark 2.4.3). All it does is converts a list > of TestPayload objects into binary using the defined avro schema, then tries > to deserialize using {{from_avro}} with that same schema: > {code:java} > import org.apache.avro.Schema > import org.apache.avro.generic.{GenericDatumWriter, GenericRecord, > GenericRecordBuilder} > import org.apache.avro.io.EncoderFactory > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.avro.from_avro > import org.apache.spark.sql.functions.col > import java.io.ByteArrayOutputStream > object TestApp extends App { > // Payload container > case class TestEvent(payload: Array[Byte]) > // Deserialized Payload > case class TestPayload(message: String) > // Schema for Payload > val simpleSchema = > """ > |{ > |"type": "record", > |"name" : "Payload", > |"fields" : [ {"name" : "message", "type" : [ "string", "null" ] } ] > |} > """.stripMargin > // Convert TestPayload into avro binary > def generateSimpleSchemaBinary(record: TestPayload, avsc: String): > Array[Byte] = { > val schema = new Schema.Parser().parse(avsc) > val out = new 
ByteArrayOutputStream() > val writer = new GenericDatumWriter[GenericRecord](schema) > val encoder = EncoderFactory.get().binaryEncoder(out, null) > val rootRecord = new GenericRecordBuilder(schema).set("message", > record.message).build() > writer.write(rootRecord, encoder) > encoder.flush() > out.toByteArray > } > val spark: SparkSession = > SparkSession.builder().master("local[*]").getOrCreate() > import spark.implicits._ > List( > TestPayload("one"), > TestPayload("two"), > TestPayload("three"), > TestPayload("four") > ).map(payload => TestEvent(generateSimpleSchemaBinary(payload, > simpleSchema))) > .toDS() > .withColumn("deserializedPayload", from_avro(col("payload"), > simpleSchema)) > .show(truncate = false) > } > {code} > And here is what this program outputs: > {noformat} > +--+---+ > |payload |deserializedPayload| > +--+---+ > |[00 06 6F 6E 65] |[four] | > |[00 06 74 77 6F] |[four] | > |[00 0A 74 68 72 65 65]|[four] | > |[00 08 66 6F 75 72] |[four] | > +--+---+{noformat} > Here, we can see that the avro binary is correctly generated, but the > deserialized version is a copy of the last row. I have not yet verified that > this is an issue in cluster mode as well. > > I dug into a bit more of the code and it seems like the resuse of {{result}} > in {{AvroDataToCatalyst}} is overwriting the decoded values of previous rows. > I set a breakpoint in {{LocalRelation}} and the {{data}} sequence seem to all > point to the same address in memory - and therefore a mutation in one > variable will cause all of it to mutate. > !Screen Shot 2019-05-21 at 2.39.27 PM.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
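The reuse behavior described above can be reproduced in plain Python, independent of Avro: one mutable result object shared across rows makes every row alias the last decoded value, and taking a per-row copy fixes it. This is an illustrative sketch of the failure mode, not the AvroDataToCatalyst code path itself:

```python
import copy

class Result:
    """Stand-in for the reused `result` object described in the report."""
    def __init__(self):
        self.message = None

def decode_all_buggy(payloads):
    result = Result()          # one shared buffer for every row
    rows = []
    for p in payloads:
        result.message = p     # mutates the object every row aliases
        rows.append(result)
    return [r.message for r in rows]

def decode_all_fixed(payloads):
    result = Result()
    rows = []
    for p in payloads:
        result.message = p
        rows.append(copy.copy(result))  # snapshot the value per row
    return [r.message for r in rows]

data = ["one", "two", "three", "four"]
print(decode_all_buggy(data))  # ['four', 'four', 'four', 'four']
print(decode_all_fixed(data))  # ['one', 'two', 'three', 'four']
```

The buggy output matches the table in the report: every deserialized row shows the last row's value.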
[jira] [Commented] (SPARK-27884) Deprecate Python 2 support in Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-27884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852238#comment-16852238 ] Dongjoon Hyun commented on SPARK-27884: --- Thanks! +1 for this efforts > Deprecate Python 2 support in Spark 3.0 > --- > > Key: SPARK-27884 > URL: https://issues.apache.org/jira/browse/SPARK-27884 > Project: Spark > Issue Type: Story > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Officially deprecate Python 2 support in Spark 3.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27886) Add Apache Spark project to https://python3statement.org/
[ https://issues.apache.org/jira/browse/SPARK-27886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852166#comment-16852166 ] Sean Owen commented on SPARK-27886: --- Maybe I misunderstand, but if it's likely Python 2 goes away from Spark in 3.1, and that is before the end of the year probably, don't we want to set the expectation that Python 2 support will go away at the end of the year? (if it happens to be later, that's better than going away earlier than advertised) > Add Apache Spark project to https://python3statement.org/ > - > > Key: SPARK-27886 > URL: https://issues.apache.org/jira/browse/SPARK-27886 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Add Spark to https://python3statement.org/ and indicate our timeline. I > reviewed the statement at https://python3statement.org/. Most projects listed > there will *drop* Python 2 before 2020/01/01 instead of deprecating it with > only [one > exception|https://github.com/python3statement/python3statement.github.io/blob/6ccacf8beb3cc49b1b3d572ab4f841e250853ca9/site.js#L206]. > We certainly cannot drop Python 2 support in 2019 given we haven't > deprecated it yet. > Maybe we can add the following time line: > * 2019/10 - 2020/04: Python 2 & 3 > * 2020/04 - : Python 3 only > The switching is the next release after Spark 3.0. If we want to hold another > release, it would be Sept or Oct. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27884) Deprecate Python 2 support in Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-27884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852149#comment-16852149 ] Sean Owen commented on SPARK-27884: --- Yes that looks fine to me. > Deprecate Python 2 support in Spark 3.0 > --- > > Key: SPARK-27884 > URL: https://issues.apache.org/jira/browse/SPARK-27884 > Project: Spark > Issue Type: Story > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Officially deprecate Python 2 support in Spark 3.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27772) SQLTestUtils Refactoring
[ https://issues.apache.org/jira/browse/SPARK-27772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852148#comment-16852148 ] William Wong commented on SPARK-27772: -- Hi [~hyukjin.kwon], I submitted the PR and added a test case to demonstrate how the change behaves. For example, if someone creates a test that passes a null table or cache name to one of those withXXX methods, a NullPointerException is thrown in the related closing block (the finally block), and that NullPointerException would mask the true exception. > SQLTestUtils Refactoring > > > Key: SPARK-27772 > URL: https://issues.apache.org/jira/browse/SPARK-27772 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: William Wong >Priority: Minor > > The current `SQLTestUtils` created many `withXXX` utility functions to clean > up tables/views/caches created for testing purposes. Some of those `withXXX` > functions ignore certain exceptions, like `NoSuchTableException`, in the clean-up > block (i.e., the finally block). > > {code:java} > /** > * Drops temporary view `viewNames` after calling `f`. > */ > protected def withTempView(viewNames: String*)(f: => Unit): Unit = { > try f finally { > // If the test failed part way, we don't want to mask the failure by > failing to remove > // temp views that never got created. > try viewNames.foreach(spark.catalog.dropTempView) catch { > case _: NoSuchTableException => > } > } > } > {code} > I believe it is not the best approach, because it is hard to anticipate which > exceptions should or should not be ignored. > > Java's `try-with-resources` statement does not mask an exception thrown in the > try block with any exception caught in the 'close()' statement. An exception > caught in the 'close()' statement is added as a suppressed exception > instead. That sounds like a better approach. 
> > Therefore, I proposed to standardise those 'withXXX' functions with the following > `tryWithFinally` function, which does something similar to Java's > try-with-resources statement. > {code:java} > /** > * Drops temporary view `viewNames` after calling `f`. > */ > protected def withTempView(viewNames: String*)(f: => Unit): Unit = { > tryWithFinally(f)(viewNames.foreach(spark.catalog.dropTempView)) > } > /** > * Executes the given tryBlock and then the given finallyBlock no matter > whether tryBlock throws > * an exception. If both tryBlock and finallyBlock throw exceptions, the > exception thrown > * from the finallyBlock will be added to the exception thrown from tryBlock > as a > * suppressed exception. It helps to avoid masking the exception from tryBlock > with the exception > * from finallyBlock. > */ > private def tryWithFinally(tryBlock: => Unit)(finallyBlock: => Unit): Unit = { > var fromTryBlock: Throwable = null > try tryBlock catch { > case cause: Throwable => > fromTryBlock = cause > throw cause > } finally { > if (fromTryBlock != null) { > try finallyBlock catch { > case fromFinallyBlock: Throwable => > fromTryBlock.addSuppressed(fromFinallyBlock) > throw fromTryBlock > } > } else { > finallyBlock > } > } > } > {code} > If a feature is well written, we should not hit any exception in those closing > methods in test cases. The purpose of this proposal is to help developers > identify what actually breaks their tests. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
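The proposed tryWithFinally semantics translate directly to Python. Java's addSuppressed has no built-in Python equivalent, so this sketch attaches suppressed exceptions to a hypothetical `suppressed` attribute; everything else mirrors the quoted Scala:

```python
def try_with_finally(try_block, finally_block):
    # The exception from the try block wins; an exception from the
    # finally block is attached as a "suppressed" exception instead of
    # masking it -- the behavior the proposal describes.
    try:
        try_block()
    except BaseException as from_try:
        try:
            finally_block()
        except BaseException as from_finally:
            from_try.suppressed = getattr(from_try, "suppressed", []) + [from_finally]
        raise from_try
    else:
        finally_block()

def failing_test():
    raise AssertionError("real test failure")

def failing_cleanup():
    raise TypeError("null view name")  # e.g. dropTempView(None)

try:
    try_with_finally(failing_test, failing_cleanup)
except AssertionError as exc:
    caught = exc

# The real failure is reported; the cleanup failure rides along.
print(caught, [type(s).__name__ for s in caught.suppressed])
```

With the plain try/finally version, the TypeError from the cleanup would have replaced the AssertionError entirely.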
[jira] [Created] (SPARK-27889) Make development scripts under dev/ support Python 3
Xiangrui Meng created SPARK-27889: - Summary: Make development scripts under dev/ support Python 3 Key: SPARK-27889 URL: https://issues.apache.org/jira/browse/SPARK-27889 Project: Spark Issue Type: Sub-task Components: Build, Deploy Affects Versions: 3.0.0 Reporter: Xiangrui Meng Some of our internal Python scripts under dev/ only support Python 2. With the deprecation of Python 2, we should make those scripts support Python 3 so that developers have a way to avoid seeing the deprecation warning. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27648) In Spark2.4 Structured Streaming:The executor storage memory increasing over time
[ https://issues.apache.org/jira/browse/SPARK-27648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852136#comment-16852136 ] Gabor Somogyi commented on SPARK-27648: --- I've made my graphs based on the files you've provided and both states look stable in terms of memory consumption. I would suggest taking 2 memory dumps (one at the beginning and one at the end) and comparing them (maybe with Memory Analyzer Tool). The problematic part can be nearly anything. > In Spark2.4 Structured Streaming:The executor storage memory increasing over > time > - > > Key: SPARK-27648 > URL: https://issues.apache.org/jira/browse/SPARK-27648 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: tommy duan >Priority: Major > Attachments: houragg(1).out, houragg_filter.csv, > houragg_with_state1_state2.csv, houragg_with_state1_state2.xlsx, > image-2019-05-09-17-51-14-036.png, image-2019-05-10-17-49-42-034.png, > image-2019-05-24-10-20-25-723.png, image-2019-05-27-10-10-30-460.png > > > *Spark Program Code Business:* > Read the topic on kafka, aggregate the stream data sources, and then output > it to another kafka topic. > *Problem Description:* > *1) Using spark structured streaming in CDH environment (spark 2.2)*, memory > overflow problems often occur (because of too many versions of state stored > in memory; this bug has been fixed in spark 2.4). > {code:java} > /spark-submit \ > --conf "spark.yarn.executor.memoryOverhead=4096M" \ > --num-executors 15 \ > --executor-memory 3G \ > --executor-cores 2 \ > --driver-memory 6G {code} > Executor memory exceptions occur when running with this submit resource under > SPARK 2.2, and the normal running time does not exceed one day. > The solution is to set the executor memory larger than before. > My spark-submit script is as follows: > {code:java} > /spark-submit \ > --conf "spark.yarn.executor.memoryOverhead=4096M" \ > --num-executors 15 \ > --executor-memory 46G \ > --executor-cores 3 \ > --driver-memory 6G \ > ...{code} > In this case, the spark program can be guaranteed to run stably for a long > time, and the executor storage memory is less than 10M (it has been running > stably for more than 20 days). > *2) From the upgrade information of Spark 2.4, we can see that the problem of > large memory consumption of state storage has been solved in Spark 2.4.* > So we upgraded spark to SPARK 2.4 under CDH, tried to run the spark program, > and found that the use of memory was reduced. > But a problem arises: as the running time increases, the storage memory of > the executor keeps growing (see Executors -> Storage Memory in the Spark on Yarn > Resource Manager UI). > This program has been running for 14 days (under SPARK 2.2, running with > this submit resource, the normal running time is not more than one day before > executor memory abnormalities occur). > The script submitted by the program under spark2.4 is as follows: > {code:java} > /spark-submit \ > --conf "spark.yarn.executor.memoryOverhead=4096M" \ > --num-executors 15 \ > --executor-memory 3G \ > --executor-cores 2 \ > --driver-memory 6G > {code} > Under Spark 2.4, I counted the size of executor memory as time went by during > the running of the spark program: > |Run-time(hour)|Storage Memory size(MB)|Memory growth rate(MB/hour)| > |23.5H|41.6MB/1.5GB|1.770212766| > |108.4H|460.2MB/1.5GB|4.245387454| > |131.7H|559.1MB/1.5GB|4.245254366| > |135.4H|575MB/1.5GB|4.246676514| > |153.6H|641.2MB/1.5GB|4.174479167| > |219H|888.1MB/1.5GB|4.055251142| > |263H|1126.4MB/1.5GB|4.282889734| > |309H|1228.8MB/1.5GB|3.976699029| -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
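The growth rates in the reporter's table are simply storage memory divided by run time; a quick check reproduces them and shows a roughly constant ~4 MB/hour increase after warm-up:

```python
# (hours running, storage memory in MB) -- a few rows from the table above.
measurements = [
    (23.5, 41.6),
    (108.4, 460.2),
    (309.0, 1228.8),
]

for hours, mb in measurements:
    rate = mb / hours  # MB/hour, matching the table's third column
    print(f"{hours:>6.1f}h  {mb:>7.1f}MB  {rate:.2f} MB/hour")
```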
[jira] [Updated] (SPARK-27772) SQLTestUtils Refactoring
[ https://issues.apache.org/jira/browse/SPARK-27772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] William Wong updated SPARK-27772: - Description: The current `SQLTestUtils` created many `withXXX` utility functions to clean up tables/views/caches created for testing purpose. Some of those `withXXX` functions ignore certain exceptions, like `NoSuchTableException` in the clean up block (ie, the finally block). {code:java} /** * Drops temporary view `viewNames` after calling `f`. */ protected def withTempView(viewNames: String*)(f: => Unit): Unit = { try f finally { // If the test failed part way, we don't want to mask the failure by failing to remove // temp views that never got created. try viewNames.foreach(spark.catalog.dropTempView) catch { case _: NoSuchTableException => } } } {code} I believe it is not the best approach. Because it is hard to anticipate what exception should or should not be ignored. Java's `try-with-resources` statement does not mask exception throwing in the try block with any exception caught in the 'close()' statement. Exception caught in the 'close()' statement would add as a suppressed exception instead. It sounds a better approach. Therefore, I proposed to standardise those 'withXXX' function with following `withFinallyBlock` function, which does something similar to Java's try-with-resources statement. {code:java} /** * Drops temporary view `viewNames` after calling `f`. */ protected def withTempView(viewNames: String*)(f: => Unit): Unit = { tryWithFinally(f)(viewNames.foreach(spark.catalog.dropTempView)) } /** * Executes the given tryBlock and then the given finallyBlock no matter whether tryBlock throws * an exception. If both tryBlock and finallyBlock throw exceptions, the exception thrown * from the finallyBlock with be added to the exception thrown from tryBlock as a * suppress exception. 
It helps to avoid masking the exception from tryBlock with exception * from finallyBlock */ private def tryWithFinally(tryBlock: => Unit)(finallyBlock: => Unit): Unit = { var fromTryBlock: Throwable = null try tryBlock catch { case cause: Throwable => fromTryBlock = cause throw cause } finally { if (fromTryBlock != null) { try finallyBlock catch { case fromFinallyBlock: Throwable => fromTryBlock.addSuppressed(fromFinallyBlock) throw fromTryBlock } } else { finallyBlock } } } {code} If a feature is well written, we show not hit any exception in those closing method in testcase. The purpose of this proposal is to help developers to identify what actually break their tests. was: The current `SQLTestUtils` created many `withXXX` utility functions to clean up tables/views/caches created for testing purpose. Some of those `withXXX` functions ignore certain exceptions, like `NoSuchTableException` in the clean up block (ie, the finally block). {code:java} /** * Drops temporary view `viewNames` after calling `f`. */ protected def withTempView(viewNames: String*)(f: => Unit): Unit = { try f finally { // If the test failed part way, we don't want to mask the failure by failing to remove // temp views that never got created. try viewNames.foreach(spark.catalog.dropTempView) catch { case _: NoSuchTableException => } } } {code} Maybe it is not the best approach. Because it is hard to anticipate what exception should or should not be ignored. Java's `try-with-resources` statement does not mask exception throwing in the try block with any exception caught in the 'close()' statement. Exception caught in the 'close()' statement would add as a suppressed exception instead. IMHO, it is a better approach. Therefore, I proposed to standardise those 'withXXX' function with following `withFinallyBlock` function, which does something similar to Java's try-with-resources statement. {code:java} /** * Drops temporary view `viewNames` after calling `f`. 
*/ protected def withTempView(viewNames: String*)(f: => Unit): Unit = { withFinallyBlock(f)(viewNames.foreach(spark.catalog.dropTempView)) } /** * Executes the given tryBlock and then the given finallyBlock no matter whether tryBlock throws * an exception. If both tryBlock and finallyBlock throw exceptions, the exception thrown * from the finallyBlock with be added to the exception thrown from tryBlock as a * suppress exception. It helps to avoid masking the exception from tryBlock with exception * from finallyBlock */ private def withFinallyBlock(tryBlock: => Unit)(finallyBlock: => Unit): Unit = { var fromTryBlock : Throwable = null try tryBlock catch { case cause : Throwable => fromTryBlock = cause throw cause } finally { if (fromTryBlock != null) { try finallyBlock catch { case fromFinallyBlock : Throwable => fromTryBlock.addSuppressed(fromFinallyBlock) throw fromTryBlock } } else { finallyBlock } } } {code}
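The suppression pattern proposed above — run the cleanup, and if both the test body and the cleanup fail, attach the cleanup failure to the primary failure instead of masking it — is not specific to Scala. A minimal Python sketch of the same control flow (names chosen here for illustration, using `raise ... from ...` chaining as the closest analogue of Java's `addSuppressed`):

```python
def try_with_finally(try_block, finally_block):
    """Run try_block, then finally_block.

    If both raise, re-raise the try_block exception with the cleanup
    exception chained onto it, so a failing cleanup never masks the
    original test failure.
    """
    primary = None
    try:
        try_block()
    except BaseException as exc:
        primary = exc
    try:
        finally_block()
    except BaseException as cleanup_error:
        if primary is None:
            raise  # body succeeded; surface the cleanup failure itself
        raise primary from cleanup_error  # original failure stays primary
    if primary is not None:
        raise primary
```

As in the Scala proposal, a caller sees the body's exception first and can inspect the attached cleanup failure (here via `__cause__`) when debugging.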
[jira] [Commented] (SPARK-27772) SQLTestUtils Refactoring
[ https://issues.apache.org/jira/browse/SPARK-27772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852125#comment-16852125 ] Apache Spark commented on SPARK-27772: -- User 'William1104' has created a pull request for this issue: https://github.com/apache/spark/pull/24746 > SQLTestUtils Refactoring > > > Key: SPARK-27772 > URL: https://issues.apache.org/jira/browse/SPARK-27772 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: William Wong >Priority: Minor > > The current `SQLTestUtils` created many `withXXX` utility functions to clean > up tables/views/caches created for testing purpose. Some of those `withXXX` > functions ignore certain exceptions, like `NoSuchTableException` in the clean > up block (ie, the finally block). > > {code:java} > /** > * Drops temporary view `viewNames` after calling `f`. > */ > protected def withTempView(viewNames: String*)(f: => Unit): Unit = { > try f finally { > // If the test failed part way, we don't want to mask the failure by > failing to remove > // temp views that never got created. > try viewNames.foreach(spark.catalog.dropTempView) catch { > case _: NoSuchTableException => > } > } > } > {code} > Maybe it is not the best approach. Because it is hard to anticipate what > exception should or should not be ignored. Java's `try-with-resources` > statement does not mask exception throwing in the try block with any > exception caught in the 'close()' statement. Exception caught in the > 'close()' statement would add as a suppressed exception instead. IMHO, it is > a better approach. > > Therefore, I proposed to standardise those 'withXXX' function with following > `withFinallyBlock` function, which does something similar to Java's > try-with-resources statement. > {code:java} > /** > * Drops temporary view `viewNames` after calling `f`. 
> */ > protected def withTempView(viewNames: String*)(f: => Unit): Unit = { > withFinallyBlock(f)(viewNames.foreach(spark.catalog.dropTempView)) > } > /** > * Executes the given tryBlock and then the given finallyBlock no matter > whether tryBlock throws > * an exception. If both tryBlock and finallyBlock throw exceptions, the > exception thrown > * from the finallyBlock with be added to the exception thrown from tryBlock > as a > * suppress exception. It helps to avoid masking the exception from tryBlock > with exception > * from finallyBlock > */ > private def withFinallyBlock(tryBlock: => Unit)(finallyBlock: => Unit): Unit > = { > var fromTryBlock : Throwable = null > try tryBlock catch { > case cause : Throwable => > fromTryBlock = cause > throw cause > } finally { > if (fromTryBlock != null) { > try finallyBlock catch { > case fromFinallyBlock : Throwable => > fromTryBlock.addSuppressed(fromFinallyBlock) > throw fromTryBlock > } > } else { > finallyBlock > } > } > } > {code} > > If a feature is well written, we show not hit any exception in those closing > method in testcase. The purpose of this proposal is to help developer to > identify what may break in the test case. I believe masking the original > exception with any other exception is not the best approach. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27772) SQLTestUtils Refactoring
[ https://issues.apache.org/jira/browse/SPARK-27772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27772: Assignee: Apache Spark > SQLTestUtils Refactoring > > > Key: SPARK-27772 > URL: https://issues.apache.org/jira/browse/SPARK-27772 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: William Wong >Assignee: Apache Spark >Priority: Minor > > The current `SQLTestUtils` created many `withXXX` utility functions to clean > up tables/views/caches created for testing purpose. Some of those `withXXX` > functions ignore certain exceptions, like `NoSuchTableException` in the clean > up block (ie, the finally block). > > {code:java} > /** > * Drops temporary view `viewNames` after calling `f`. > */ > protected def withTempView(viewNames: String*)(f: => Unit): Unit = { > try f finally { > // If the test failed part way, we don't want to mask the failure by > failing to remove > // temp views that never got created. > try viewNames.foreach(spark.catalog.dropTempView) catch { > case _: NoSuchTableException => > } > } > } > {code} > Maybe it is not the best approach. Because it is hard to anticipate what > exception should or should not be ignored. Java's `try-with-resources` > statement does not mask exception throwing in the try block with any > exception caught in the 'close()' statement. Exception caught in the > 'close()' statement would add as a suppressed exception instead. IMHO, it is > a better approach. > > Therefore, I proposed to standardise those 'withXXX' function with following > `withFinallyBlock` function, which does something similar to Java's > try-with-resources statement. > {code:java} > /** > * Drops temporary view `viewNames` after calling `f`. 
> */ > protected def withTempView(viewNames: String*)(f: => Unit): Unit = { > withFinallyBlock(f)(viewNames.foreach(spark.catalog.dropTempView)) > } > /** > * Executes the given tryBlock and then the given finallyBlock no matter > whether tryBlock throws > * an exception. If both tryBlock and finallyBlock throw exceptions, the > exception thrown > * from the finallyBlock with be added to the exception thrown from tryBlock > as a > * suppress exception. It helps to avoid masking the exception from tryBlock > with exception > * from finallyBlock > */ > private def withFinallyBlock(tryBlock: => Unit)(finallyBlock: => Unit): Unit > = { > var fromTryBlock : Throwable = null > try tryBlock catch { > case cause : Throwable => > fromTryBlock = cause > throw cause > } finally { > if (fromTryBlock != null) { > try finallyBlock catch { > case fromFinallyBlock : Throwable => > fromTryBlock.addSuppressed(fromFinallyBlock) > throw fromTryBlock > } > } else { > finallyBlock > } > } > } > {code} > > If a feature is well written, we show not hit any exception in those closing > method in testcase. The purpose of this proposal is to help developer to > identify what may break in the test case. I believe masking the original > exception with any other exception is not the best approach. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27772) SQLTestUtils Refactoring
[ https://issues.apache.org/jira/browse/SPARK-27772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27772: Assignee: (was: Apache Spark) > SQLTestUtils Refactoring > > > Key: SPARK-27772 > URL: https://issues.apache.org/jira/browse/SPARK-27772 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: William Wong >Priority: Minor > > The current `SQLTestUtils` created many `withXXX` utility functions to clean > up tables/views/caches created for testing purpose. Some of those `withXXX` > functions ignore certain exceptions, like `NoSuchTableException` in the clean > up block (ie, the finally block). > > {code:java} > /** > * Drops temporary view `viewNames` after calling `f`. > */ > protected def withTempView(viewNames: String*)(f: => Unit): Unit = { > try f finally { > // If the test failed part way, we don't want to mask the failure by > failing to remove > // temp views that never got created. > try viewNames.foreach(spark.catalog.dropTempView) catch { > case _: NoSuchTableException => > } > } > } > {code} > Maybe it is not the best approach. Because it is hard to anticipate what > exception should or should not be ignored. Java's `try-with-resources` > statement does not mask exception throwing in the try block with any > exception caught in the 'close()' statement. Exception caught in the > 'close()' statement would add as a suppressed exception instead. IMHO, it is > a better approach. > > Therefore, I proposed to standardise those 'withXXX' function with following > `withFinallyBlock` function, which does something similar to Java's > try-with-resources statement. > {code:java} > /** > * Drops temporary view `viewNames` after calling `f`. 
> */ > protected def withTempView(viewNames: String*)(f: => Unit): Unit = { > withFinallyBlock(f)(viewNames.foreach(spark.catalog.dropTempView)) > } > /** > * Executes the given tryBlock and then the given finallyBlock no matter > whether tryBlock throws > * an exception. If both tryBlock and finallyBlock throw exceptions, the > exception thrown > * from the finallyBlock with be added to the exception thrown from tryBlock > as a > * suppress exception. It helps to avoid masking the exception from tryBlock > with exception > * from finallyBlock > */ > private def withFinallyBlock(tryBlock: => Unit)(finallyBlock: => Unit): Unit > = { > var fromTryBlock : Throwable = null > try tryBlock catch { > case cause : Throwable => > fromTryBlock = cause > throw cause > } finally { > if (fromTryBlock != null) { > try finallyBlock catch { > case fromFinallyBlock : Throwable => > fromTryBlock.addSuppressed(fromFinallyBlock) > throw fromTryBlock > } > } else { > finallyBlock > } > } > } > {code} > > If a feature is well written, we show not hit any exception in those closing > method in testcase. The purpose of this proposal is to help developer to > identify what may break in the test case. I believe masking the original > exception with any other exception is not the best approach. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27772) SQLTestUtils Refactoring
[ https://issues.apache.org/jira/browse/SPARK-27772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852124#comment-16852124 ] Apache Spark commented on SPARK-27772: -- User 'William1104' has created a pull request for this issue: https://github.com/apache/spark/pull/24746 > SQLTestUtils Refactoring > > > Key: SPARK-27772 > URL: https://issues.apache.org/jira/browse/SPARK-27772 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: William Wong >Priority: Minor > > The current `SQLTestUtils` created many `withXXX` utility functions to clean > up tables/views/caches created for testing purpose. Some of those `withXXX` > functions ignore certain exceptions, like `NoSuchTableException` in the clean > up block (ie, the finally block). > > {code:java} > /** > * Drops temporary view `viewNames` after calling `f`. > */ > protected def withTempView(viewNames: String*)(f: => Unit): Unit = { > try f finally { > // If the test failed part way, we don't want to mask the failure by > failing to remove > // temp views that never got created. > try viewNames.foreach(spark.catalog.dropTempView) catch { > case _: NoSuchTableException => > } > } > } > {code} > Maybe it is not the best approach. Because it is hard to anticipate what > exception should or should not be ignored. Java's `try-with-resources` > statement does not mask exception throwing in the try block with any > exception caught in the 'close()' statement. Exception caught in the > 'close()' statement would add as a suppressed exception instead. IMHO, it is > a better approach. > > Therefore, I proposed to standardise those 'withXXX' function with following > `withFinallyBlock` function, which does something similar to Java's > try-with-resources statement. > {code:java} > /** > * Drops temporary view `viewNames` after calling `f`. 
> */ > protected def withTempView(viewNames: String*)(f: => Unit): Unit = { > withFinallyBlock(f)(viewNames.foreach(spark.catalog.dropTempView)) > } > /** > * Executes the given tryBlock and then the given finallyBlock no matter > whether tryBlock throws > * an exception. If both tryBlock and finallyBlock throw exceptions, the > exception thrown > * from the finallyBlock with be added to the exception thrown from tryBlock > as a > * suppress exception. It helps to avoid masking the exception from tryBlock > with exception > * from finallyBlock > */ > private def withFinallyBlock(tryBlock: => Unit)(finallyBlock: => Unit): Unit > = { > var fromTryBlock : Throwable = null > try tryBlock catch { > case cause : Throwable => > fromTryBlock = cause > throw cause > } finally { > if (fromTryBlock != null) { > try finallyBlock catch { > case fromFinallyBlock : Throwable => > fromTryBlock.addSuppressed(fromFinallyBlock) > throw fromTryBlock > } > } else { > finallyBlock > } > } > } > {code} > > If a feature is well written, we show not hit any exception in those closing > method in testcase. The purpose of this proposal is to help developer to > identify what may break in the test case. I believe masking the original > exception with any other exception is not the best approach. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27772) SQLTestUtils Refactoring
[ https://issues.apache.org/jira/browse/SPARK-27772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] William Wong updated SPARK-27772: - Description: The current `SQLTestUtils` created many `withXXX` utility functions to clean up tables/views/caches created for testing purpose. Some of those `withXXX` functions ignore certain exceptions, like `NoSuchTableException` in the clean up block (ie, the finally block). {code:java} /** * Drops temporary view `viewNames` after calling `f`. */ protected def withTempView(viewNames: String*)(f: => Unit): Unit = { try f finally { // If the test failed part way, we don't want to mask the failure by failing to remove // temp views that never got created. try viewNames.foreach(spark.catalog.dropTempView) catch { case _: NoSuchTableException => } } } {code} Maybe it is not the best approach. Because it is hard to anticipate what exception should or should not be ignored. Java's `try-with-resources` statement does not mask exception throwing in the try block with any exception caught in the 'close()' statement. Exception caught in the 'close()' statement would add as a suppressed exception instead. IMHO, it is a better approach. Therefore, I proposed to standardise those 'withXXX' function with following `withFinallyBlock` function, which does something similar to Java's try-with-resources statement. {code:java} /** * Drops temporary view `viewNames` after calling `f`. */ protected def withTempView(viewNames: String*)(f: => Unit): Unit = { withFinallyBlock(f)(viewNames.foreach(spark.catalog.dropTempView)) } /** * Executes the given tryBlock and then the given finallyBlock no matter whether tryBlock throws * an exception. If both tryBlock and finallyBlock throw exceptions, the exception thrown * from the finallyBlock with be added to the exception thrown from tryBlock as a * suppress exception. 
It helps to avoid masking the exception from tryBlock with exception * from finallyBlock */ private def withFinallyBlock(tryBlock: => Unit)(finallyBlock: => Unit): Unit = { var fromTryBlock : Throwable = null try tryBlock catch { case cause : Throwable => fromTryBlock = cause throw cause } finally { if (fromTryBlock != null) { try finallyBlock catch { case fromFinallyBlock : Throwable => fromTryBlock.addSuppressed(fromFinallyBlock) throw fromTryBlock } } else { finallyBlock } } } {code} If a feature is well written, we should not hit any exceptions in those cleanup methods in test cases. The purpose of this proposal is to help developers identify what may break in the test case. I believe masking the original exception with any other exception is not the best approach. was: The current `SQLTestUtils` created many `withXXX` utility functions to clean up tables/views/caches created for testing purpose. Some of those `withXXX` functions ignore certain exceptions, like `NoSuchTableException` in the clean up block (ie, the finally block). {code:java} /** * Drops temporary view `viewNames` after calling `f`. */ protected def withTempView(viewNames: String*)(f: => Unit): Unit = { try f finally { // If the test failed part way, we don't want to mask the failure by failing to remove // temp views that never got created. try viewNames.foreach(spark.catalog.dropTempView) catch { case _: NoSuchTableException => } } } {code} Maybe it is not the best approach. Because it is hard to anticipate what exception should or should not be ignored. Java's `try-with-resources` statement does not mask exception throwing in the try block with any exception caught in the 'close()' statement. Exception caught in the 'close()' statement would add as a suppressed exception instead. IMHO, it is a better approach. Therefore, I proposed to standardise those 'withXXX' function with following `withFinallyBlock` function, which does something similar to Java's try-with-resources statement. 
{code:java} /** * Drops temporary view `viewNames` after calling `f`. */ protected def withTempView(viewNames: String*)(f: => Unit): Unit = { withFinallyBlock(f)(viewNames.foreach(spark.catalog.dropTempView)) } /** * Executes the given tryBlock and then the given finallyBlock no matter whether tryBlock throws * an exception. If both tryBlock and finallyBlock throw exceptions, the exception thrown * from the finallyBlock with be added to the exception thrown from tryBlock as a * suppress exception. It helps to avoid masking the exception from tryBlock with exception * from finallyBlock */ private def withFinallyBlock(tryBlock: => Unit)(finallyBlock: => Unit): Unit = { var fromTryBlock : Throwable = null try tryBlock catch { case cause : Throwable => fromTryBlock = cause throw cause } finally { if (fromTryBlock != null) { try finallyBlock catch { case fromFinallyBlock : Throwable => fromTryBlock.addSuppressed(fromFinallyBlock) throw fromTryBlock } } else { finallyBlock } } } {code}
[jira] [Commented] (SPARK-27785) Introduce .joinWith() overloads for typed inner joins of 3 or more tables
[ https://issues.apache.org/jira/browse/SPARK-27785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852089#comment-16852089 ] Josh Rosen commented on SPARK-27785: I think this might require a little bit more design work before beginning implementation, especially along the following dimensions: # What arity do we max out at? At work I have some use-cases involving joins of 10+ tables. # Do we put this into {{Dataset}}? Or factor it out into a separate {{Multijoin}} helper object? Is there precedent for this in other frameworks (Scalding, Flink, etc.) that we could mirror? # Instead of writing out each case by hand, can we write some Scala code to generate the signatures / code for us (similar to how the {{ScalaUDF}} overloads were defined)? # In addition to saving some typing / projection, does this new API let us solve SPARK-19468 for a limited subset of cases? > Introduce .joinWith() overloads for typed inner joins of 3 or more tables > - > > Key: SPARK-27785 > URL: https://issues.apache.org/jira/browse/SPARK-27785 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Josh Rosen >Priority: Major > > Today it's rather painful to do a typed dataset join of more than two tables: > {{Dataset[A].joinWith(Dataset[B])}} returns {{Dataset[(A, B)]}} so chaining > on a third inner join requires users to specify a complicated join condition > (referencing variables like {{_1}} or {{_2}} in the join condition, AFAIK), > resulting in a doubly-nested schema like {{Dataset[((A, B), C)]}}. Things become > even more painful if you want to layer on a fourth join. Using {{.map()}} to > flatten the data into {{Dataset[(A, B, C)]}} has a performance penalty, too. > To simplify this use case, I propose to introduce a new set of overloads of > {{.joinWith}}, supporting joins of {{N > 2}} tables for {{N}} up to some > reasonable number (say, 6). 
For example: > {code:java} > Dataset[T].joinWith[T1, T2]( > ds1: Dataset[T1], > ds2: Dataset[T2], > condition: Column > ): Dataset[(T, T1, T2)] > Dataset[T].joinWith[T1, T2, T3]( > ds1: Dataset[T1], > ds2: Dataset[T2], > ds3: Dataset[T3], > condition: Column > ): Dataset[(T, T1, T2, T3)]{code} > I propose to do this only for inner joins (consistent with the default join > type for {{joinWith}} in case joins are not specified). > I haven't thought about this too much yet and am not committed to the API > proposed above (it's just my initial idea), so I'm open to suggestions for > alternative typed APIs for this. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
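The shape difference this proposal targets — a flat {{(A, B, C)}} tuple instead of the doubly-nested {{((A, B), C)}} that chaining two-way joins produces — can be illustrated outside Spark with plain dicts keyed by the join column. This is a toy sketch of the idea only, not Spark API:

```python
def join2(left, right):
    """Inner-join two key->value mappings, pairing values per shared key."""
    return {k: (left[k], right[k]) for k in left.keys() & right.keys()}

def join3(d1, d2, d3):
    """A three-way inner join producing flat (v1, v2, v3) triples,
    mirroring the flat Dataset[(T, T1, T2)] shape of the proposed overload."""
    keys = d1.keys() & d2.keys() & d3.keys()
    return {k: (d1[k], d2[k], d3[k]) for k in keys}

a = {1: "a1", 2: "a2"}
b = {1: "b1", 2: "b2"}
c = {2: "c2", 3: "c3"}

# Chaining two-way joins yields nested pairs...
nested = join2(join2(a, b), c)  # {2: (("a2", "b2"), "c2")}
# ...while a dedicated three-way join yields flat triples.
flat = join3(a, b, c)           # {2: ("a2", "b2", "c2")}
```

In Spark the extra {{.map()}} needed to flatten the nested form also costs a projection at runtime, which is part of the motivation for baking the flat shape into the API.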
[jira] [Updated] (SPARK-27885) Announce deprecation of Python 2 support
[ https://issues.apache.org/jira/browse/SPARK-27885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27885: -- Description: * Draft the message. * Update Spark website and announce deprecation of Python 2 support in the next major release in 2019 and remove the support in a release after 2020/01/01. It should show up in the "Latest News" section. * Announce it on users@ and dev@ was: * Draft the message. * Update Spark website and announce deprecation of Python 2 support in the next major release in 2019 and remove the support in a release after 2020/01/01. * Announce it on users@ and dev@ > Announce deprecation of Python 2 support > > > Key: SPARK-27885 > URL: https://issues.apache.org/jira/browse/SPARK-27885 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > * Draft the message. > * Update Spark website and announce deprecation of Python 2 support in the > next major release in 2019 and remove the support in a release after > 2020/01/01. It should show up in the "Latest News" section. > * Announce it on users@ and dev@ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27886) Add Apache Spark project to https://python3statement.org/
[ https://issues.apache.org/jira/browse/SPARK-27886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852088#comment-16852088 ] Xiangrui Meng commented on SPARK-27886: --- cc: [~srowen] [~smilegator] > Add Apache Spark project to https://python3statement.org/ > - > > Key: SPARK-27886 > URL: https://issues.apache.org/jira/browse/SPARK-27886 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Add Spark to https://python3statement.org/ and indicate our timeline. I > reviewed the statement at https://python3statement.org/. Most projects listed > there will *drop* Python 2 before 2020/01/01 instead of deprecating it with > only [one > exception|https://github.com/python3statement/python3statement.github.io/blob/6ccacf8beb3cc49b1b3d572ab4f841e250853ca9/site.js#L206]. > We certainly cannot drop Python 2 support in 2019 given we haven't > deprecated it yet. > Maybe we can add the following time line: > * 2019/10 - 2020/04: Python 2 & 3 > * 2020/04 - : Python 3 only > The switching is the next release after Spark 3.0. If we want to hold another > release, it would be Sept or Oct. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27831) Move Hive test jars to maven dependency
[ https://issues.apache.org/jira/browse/SPARK-27831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27831: Assignee: (was: Apache Spark) > Move Hive test jars to maven dependency > --- > > Key: SPARK-27831 > URL: https://issues.apache.org/jira/browse/SPARK-27831 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27886) Add Apache Spark project to https://python3statement.org/
[ https://issues.apache.org/jira/browse/SPARK-27886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27886: -- Description: Add Spark to https://python3statement.org/ and indicate our timeline. I reviewed the statement at https://python3statement.org/. Most projects listed there will *drop* Python 2 before 2020/01/01 instead of deprecating it with only [one exception|https://github.com/python3statement/python3statement.github.io/blob/6ccacf8beb3cc49b1b3d572ab4f841e250853ca9/site.js#L206]. We certainly cannot drop Python 2 support in 2019 given we haven't deprecated it yet. Maybe we can add the following time line: * 2019/10 - 2020/04: Python 2 & 3 * 2020/04 - : Python 3 only The switching is the next release after Spark 3.0. If we want to hold another release, it would be Sept or Oct. was: Add Spark to https://python3statement.org/ and indicate our timeline. I reviewed the statement at https://python3statement.org/. Most projects listed there will *drop* Python 2 before 2020/01/01 instead of deprecating it with only [one exception|https://github.com/python3statement/python3statement.github.io/blob/6ccacf8beb3cc49b1b3d572ab4f841e250853ca9/site.js#L206]. We certainly cannot drop Python 2 support in 2019 given we haven't deprecated it yet. Maybe we can add the following time line: * 2019/10 - 2020/04: Python 2 & 3 * 2020/04 - : Python 3 only > Add Apache Spark project to https://python3statement.org/ > - > > Key: SPARK-27886 > URL: https://issues.apache.org/jira/browse/SPARK-27886 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Add Spark to https://python3statement.org/ and indicate our timeline. I > reviewed the statement at https://python3statement.org/. 
Most projects listed > there will *drop* Python 2 before 2020/01/01 instead of deprecating it with > only [one > exception|https://github.com/python3statement/python3statement.github.io/blob/6ccacf8beb3cc49b1b3d572ab4f841e250853ca9/site.js#L206]. > We certainly cannot drop Python 2 support in 2019 given we haven't > deprecated it yet. > Maybe we can add the following time line: > * 2019/10 - 2020/04: Python 2 & 3 > * 2020/04 - : Python 3 only > The switching is the next release after Spark 3.0. If we want to hold another > release, it would be Sept or Oct. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27831) Move Hive test jars to maven dependency
[ https://issues.apache.org/jira/browse/SPARK-27831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27831: Assignee: Apache Spark > Move Hive test jars to maven dependency > --- > > Key: SPARK-27831 > URL: https://issues.apache.org/jira/browse/SPARK-27831 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27886) Add Apache Spark project to https://python3statement.org/
[ https://issues.apache.org/jira/browse/SPARK-27886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27886: -- Description: Add Spark to https://python3statement.org/ and indicate our timeline. I reviewed the statement at https://python3statement.org/. Most projects listed there will *drop* Python 2 before 2020/01/01 instead of deprecating it with only [one exception|https://github.com/python3statement/python3statement.github.io/blob/6ccacf8beb3cc49b1b3d572ab4f841e250853ca9/site.js#L206]. We certainly cannot drop Python 2 support in 2019 given we haven't deprecated it yet. Maybe we can add the following time line: * 2019/10 - 2020/04: Python 2 & 3 * 2020/04 - : Python 3 only was:Add Spark to https://python3statement.org/ and indicate our timeline. > Add Apache Spark project to https://python3statement.org/ > - > > Key: SPARK-27886 > URL: https://issues.apache.org/jira/browse/SPARK-27886 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Add Spark to https://python3statement.org/ and indicate our timeline. I > reviewed the statement at https://python3statement.org/. Most projects listed > there will *drop* Python 2 before 2020/01/01 instead of deprecating it with > only [one > exception|https://github.com/python3statement/python3statement.github.io/blob/6ccacf8beb3cc49b1b3d572ab4f841e250853ca9/site.js#L206]. > We certainly cannot drop Python 2 support in 2019 given we haven't > deprecated it yet. > Maybe we can add the following time line: > * 2019/10 - 2020/04: Python 2 & 3 > * 2020/04 - : Python 3 only -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-27831) Move Hive test jars to maven dependency
[ https://issues.apache.org/jira/browse/SPARK-27831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-27831: --- Assignee: (was: Yuming Wang) > Move Hive test jars to maven dependency > --- > > Key: SPARK-27831 > URL: https://issues.apache.org/jira/browse/SPARK-27831 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27884) Deprecate Python 2 support in Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-27884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852077#comment-16852077 ] Xiangrui Meng commented on SPARK-27884: --- [~srowen] Could you help review the draft? Feel free to edit. > Deprecate Python 2 support in Spark 3.0 > --- > > Key: SPARK-27884 > URL: https://issues.apache.org/jira/browse/SPARK-27884 > Project: Spark > Issue Type: Story > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Officially deprecate Python 2 support in Spark 3.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27813) DataSourceV2: Add DropTable logical operation
[ https://issues.apache.org/jira/browse/SPARK-27813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27813. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24686 [https://github.com/apache/spark/pull/24686] > DataSourceV2: Add DropTable logical operation > - > > Key: SPARK-27813 > URL: https://issues.apache.org/jira/browse/SPARK-27813 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.3 >Reporter: John Zhuge >Assignee: John Zhuge >Priority: Minor > Fix For: 3.0.0 > > > Support DROP TABLE from V2 catalog, e.g., "DROP TABLE testcat.ns1.ns2.tbl" -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27884) Deprecate Python 2 support in Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-27884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852071#comment-16852071 ] Xiangrui Meng commented on SPARK-27884: --- Draft message: *Apache Spark's plan for dropping Python 2 support* As many of you already know, the Python core development team and many widely used Python packages like Pandas and NumPy will drop Python 2 support on or before 2020/01/01. Apache Spark has supported both Python 2 and 3 since the Spark 1.4 release in 2015. However, maintaining Python 2/3 compatibility is an increasing burden, and it essentially limits the use of Python 3 features in Spark. Given that the end of life (EOL) of Python 2 is coming, we plan to eventually drop Python 2 support as well. The current plan is as follows: * In the next major release in 2019, we will deprecate Python 2 support. PySpark users will see a deprecation warning if Python 2 is used. We will publish a migration guide for PySpark users to migrate to Python 3. * We will drop Python 2 support in a future release in 2020, after the Python 2 EOL on 2020/01/01. PySpark users will see an error if Python 2 is used. * For releases that support Python 2, e.g., Spark 2.4, their patch releases will continue supporting Python 2. However, after the Python 2 EOL, we might not take patches that are specific to Python 2. > Deprecate Python 2 support in Spark 3.0 > --- > > Key: SPARK-27884 > URL: https://issues.apache.org/jira/browse/SPARK-27884 > Project: Spark > Issue Type: Story > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Officially deprecate Python 2 support in Spark 3.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27813) DataSourceV2: Add DropTable logical operation
[ https://issues.apache.org/jira/browse/SPARK-27813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-27813: --- Assignee: John Zhuge > DataSourceV2: Add DropTable logical operation > - > > Key: SPARK-27813 > URL: https://issues.apache.org/jira/browse/SPARK-27813 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.3 >Reporter: John Zhuge >Assignee: John Zhuge >Priority: Minor > > Support DROP TABLE from V2 catalog, e.g., "DROP TABLE testcat.ns1.ns2.tbl" -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27798) from_avro can modify variables in other rows in local mode
[ https://issues.apache.org/jira/browse/SPARK-27798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852055#comment-16852055 ] Gengliang Wang commented on SPARK-27798: Also, the issue can be reproduced on latest master if we use spark-shell. > from_avro can modify variables in other rows in local mode > -- > > Key: SPARK-27798 > URL: https://issues.apache.org/jira/browse/SPARK-27798 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Yosuke Mori >Priority: Blocker > Labels: correctness > Attachments: Screen Shot 2019-05-21 at 2.39.27 PM.png > > > Steps to reproduce: > Create a local Dataset (at least two distinct rows) with a binary Avro field. > Use the {{from_avro}} function to deserialize the binary into another column. > Verify that all of the rows incorrectly have the same value. > Here's a concrete example (using Spark 2.4.3). All it does is converts a list > of TestPayload objects into binary using the defined avro schema, then tries > to deserialize using {{from_avro}} with that same schema: > {code:java} > import org.apache.avro.Schema > import org.apache.avro.generic.{GenericDatumWriter, GenericRecord, > GenericRecordBuilder} > import org.apache.avro.io.EncoderFactory > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.avro.from_avro > import org.apache.spark.sql.functions.col > import java.io.ByteArrayOutputStream > object TestApp extends App { > // Payload container > case class TestEvent(payload: Array[Byte]) > // Deserialized Payload > case class TestPayload(message: String) > // Schema for Payload > val simpleSchema = > """ > |{ > |"type": "record", > |"name" : "Payload", > |"fields" : [ {"name" : "message", "type" : [ "string", "null" ] } ] > |} > """.stripMargin > // Convert TestPayload into avro binary > def generateSimpleSchemaBinary(record: TestPayload, avsc: String): > Array[Byte] = { > val schema = new Schema.Parser().parse(avsc) > val out = new 
ByteArrayOutputStream() > val writer = new GenericDatumWriter[GenericRecord](schema) > val encoder = EncoderFactory.get().binaryEncoder(out, null) > val rootRecord = new GenericRecordBuilder(schema).set("message", > record.message).build() > writer.write(rootRecord, encoder) > encoder.flush() > out.toByteArray > } > val spark: SparkSession = > SparkSession.builder().master("local[*]").getOrCreate() > import spark.implicits._ > List( > TestPayload("one"), > TestPayload("two"), > TestPayload("three"), > TestPayload("four") > ).map(payload => TestEvent(generateSimpleSchemaBinary(payload, > simpleSchema))) > .toDS() > .withColumn("deserializedPayload", from_avro(col("payload"), > simpleSchema)) > .show(truncate = false) > } > {code} > And here is what this program outputs: > {noformat} > +--+---+ > |payload |deserializedPayload| > +--+---+ > |[00 06 6F 6E 65] |[four] | > |[00 06 74 77 6F] |[four] | > |[00 0A 74 68 72 65 65]|[four] | > |[00 08 66 6F 75 72] |[four] | > +--+---+{noformat} > Here, we can see that the avro binary is correctly generated, but the > deserialized version is a copy of the last row. I have not yet verified that > this is an issue in cluster mode as well. > > I dug into a bit more of the code and it seems like the resuse of {{result}} > in {{AvroDataToCatalyst}} is overwriting the decoded values of previous rows. > I set a breakpoint in {{LocalRelation}} and the {{data}} sequence seem to all > point to the same address in memory - and therefore a mutation in one > variable will cause all of it to mutate. > !Screen Shot 2019-05-21 at 2.39.27 PM.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25557) ORC predicate pushdown for nested fields
[ https://issues.apache.org/jira/browse/SPARK-25557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852053#comment-16852053 ] Dongjoon Hyun commented on SPARK-25557: --- Nope. The nested schema pruning is independent of predicate pushdown. This JIRA is waiting for Parquet predicate pushdown. > ORC predicate pushdown for nested fields > > > Key: SPARK-25557 > URL: https://issues.apache.org/jira/browse/SPARK-25557 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: DB Tsai >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27862) Upgrade json4s-jackson to 3.6.6
[ https://issues.apache.org/jira/browse/SPARK-27862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Izek Greenfield updated SPARK-27862: Summary: Upgrade json4s-jackson to 3.6.6 (was: Upgrade json4s-jackson to 3.6.5) > Upgrade json4s-jackson to 3.6.6 > --- > > Key: SPARK-27862 > URL: https://issues.apache.org/jira/browse/SPARK-27862 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0, 2.4.3 >Reporter: Izek Greenfield >Priority: Minor > > It would be very good to upgrade to a newer version. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27884) Deprecate Python 2 support in Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-27884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27884: -- Description: Officially deprecate Python 2 support in Spark 3.0. (was: Officially deprecate Python 2 support in Spark 3.0 and discuss the timeline to drop Python 2 support in a future release.) > Deprecate Python 2 support in Spark 3.0 > --- > > Key: SPARK-27884 > URL: https://issues.apache.org/jira/browse/SPARK-27884 > Project: Spark > Issue Type: Story > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Officially deprecate Python 2 support in Spark 3.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27885) Announce deprecation of Python 2 support
[ https://issues.apache.org/jira/browse/SPARK-27885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27885: -- Description: * Draft the message. * Update Spark website and announce deprecation of Python 2 support in the next major release in 2019 and remove the support in a release after 2020/01/01. * Announce it on users@ and dev@ was:Update Spark website and announce deprecation of Python 2 support in the next major release in 2019 and remove the support in a release after 2020/01/01. > Announce deprecation of Python 2 support > > > Key: SPARK-27885 > URL: https://issues.apache.org/jira/browse/SPARK-27885 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > * Draft the message. > * Update Spark website and announce deprecation of Python 2 support in the > next major release in 2019 and remove the support in a release after > 2020/01/01. > * Announce it on users@ and dev@ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27888) Python 2->3 migration guide for PySpark users
Xiangrui Meng created SPARK-27888: - Summary: Python 2->3 migration guide for PySpark users Key: SPARK-27888 URL: https://issues.apache.org/jira/browse/SPARK-27888 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.0.0 Reporter: Xiangrui Meng We might need a short Python 2->3 migration guide for PySpark users. It doesn't need to be comprehensive, given the many Python 2->3 migration guides already available. We just need some pointers and list items that are specific to PySpark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27885) Announce deprecation of Python 2 support
[ https://issues.apache.org/jira/browse/SPARK-27885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27885: -- Summary: Announce deprecation of Python 2 support (was: Update Spark website and put deprecation warning) > Announce deprecation of Python 2 support > > > Key: SPARK-27885 > URL: https://issues.apache.org/jira/browse/SPARK-27885 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Update Spark website and announce deprecation of Python 2 support in the next > major release in 2019 and remove the support in a release after 2020/01/01. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27887) Check python version and print deprecation warning if version < 3
[ https://issues.apache.org/jira/browse/SPARK-27887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27887: -- Description: In Spark 3.0, users should see a deprecation warning if they use PySpark with Python < 3. > Check python version and print deprecation warning if version < 3 > - > > Key: SPARK-27887 > URL: https://issues.apache.org/jira/browse/SPARK-27887 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > In Spark 3.0, users should see a deprecation warning if they use PySpark with > Python < 3. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
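A minimal sketch of the version check described in SPARK-27887 (illustrative only: the function name and warning text are assumptions, not PySpark's actual implementation, which may place this check inside context initialization):

```python
import sys
import warnings


def warn_if_python2_deprecated():
    """Warn at startup when running under Python < 3.

    Returns True if a deprecation warning was emitted, False otherwise.
    """
    if sys.version_info[0] < 3:
        warnings.warn(
            "Support for Python 2 is deprecated as of Spark 3.0 "
            "and will be removed in a future release.",
            DeprecationWarning,
        )
        return True
    return False
```

Calling this once during PySpark startup would surface the deprecation to Python 2 users without changing behavior on Python 3.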
[jira] [Created] (SPARK-27887) Check python version and print deprecation warning if version < 3
Xiangrui Meng created SPARK-27887: - Summary: Check python version and print deprecation warning if version < 3 Key: SPARK-27887 URL: https://issues.apache.org/jira/browse/SPARK-27887 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.0.0 Reporter: Xiangrui Meng -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27885) Update Spark website and put deprecation warning
[ https://issues.apache.org/jira/browse/SPARK-27885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27885: -- Description: Update Spark website and announce deprecation of Python 2 support in the next major release in 2019 and remove the support in a release after 2020/01/01. > Update Spark website and put deprecation warning > > > Key: SPARK-27885 > URL: https://issues.apache.org/jira/browse/SPARK-27885 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Update Spark website and announce deprecation of Python 2 support in the next > major release in 2019 and remove the support in a release after 2020/01/01. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27886) Add Apache Spark project to https://python3statement.org/
Xiangrui Meng created SPARK-27886: - Summary: Add Apache Spark project to https://python3statement.org/ Key: SPARK-27886 URL: https://issues.apache.org/jira/browse/SPARK-27886 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.0.0 Reporter: Xiangrui Meng Add Spark to https://python3statement.org/ and indicate our timeline. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27885) Update Spark website and put deprecation warning
Xiangrui Meng created SPARK-27885: - Summary: Update Spark website and put deprecation warning Key: SPARK-27885 URL: https://issues.apache.org/jira/browse/SPARK-27885 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 3.0.0 Reporter: Xiangrui Meng -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27884) Deprecate Python 2 support in Spark 3.0
Xiangrui Meng created SPARK-27884: - Summary: Deprecate Python 2 support in Spark 3.0 Key: SPARK-27884 URL: https://issues.apache.org/jira/browse/SPARK-27884 Project: Spark Issue Type: Story Components: PySpark Affects Versions: 3.0.0 Reporter: Xiangrui Meng Officially deprecate Python 2 support in Spark 3.0 and discuss the timeline to drop Python 2 support in a future release. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27742) Security Support in Sources and Sinks for SS and Batch
[ https://issues.apache.org/jira/browse/SPARK-27742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16851979#comment-16851979 ] Gabor Somogyi commented on SPARK-27742: --- {quote}For the user this means better flexibility per job to setup an upper limit.{quote} * A new token will be obtained at "expiryTimestamp * 0.75" * The old token will be invalidated at expiryTimestamp because it is not renewed Not sure whether the app developer has to control the MaxLifeTime. > Security Support in Sources and Sinks for SS and Batch > -- > > Key: SPARK-27742 > URL: https://issues.apache.org/jira/browse/SPARK-27742 > Project: Spark > Issue Type: Brainstorming > Components: SQL, Structured Streaming >Affects Versions: 3.0.0 >Reporter: Stavros Kontopoulos >Priority: Major > > As discussed with [~erikerlandson] on the [Big Data on K8s > UG|https://docs.google.com/document/d/1pnF38NF6N5eM8DlK088XUW85Vms4V2uTsGZvSp8MNIA] > it would be good to capture current status and identify work that needs to > be done for securing Spark when accessing sources and sinks. For example what > is the status of SSL, Kerberos support in different scenarios. The big > concern nowadays is how to secure data pipelines end-to-end. > Note: Not sure if this overlaps with some other ticket. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
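The delegation-token renewal schedule described in the comment above can be sketched as follows (a minimal illustration; interpreting "expiryTimestamp * 0.75" as renewing once 75% of the token's lifetime has elapsed, with function name and millisecond units as assumptions rather than Spark's actual API):

```python
def next_renewal_time_ms(issue_ts_ms: int, expiry_ts_ms: int, ratio: float = 0.75) -> int:
    """Compute when a fresh delegation token should be obtained.

    Renewal is scheduled once `ratio` of the token's lifetime has
    elapsed; the old token is simply left to expire at expiry_ts_ms
    because it is never renewed.
    """
    lifetime_ms = expiry_ts_ms - issue_ts_ms
    return issue_ts_ms + int(lifetime_ms * ratio)
```

For example, a token issued at t=0 ms that expires at t=86,400,000 ms (24 hours) would be scheduled for renewal at t=64,800,000 ms (18 hours).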
[jira] [Comment Edited] (SPARK-27798) from_avro can modify variables in other rows in local mode
[ https://issues.apache.org/jira/browse/SPARK-27798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16851966#comment-16851966 ] Gengliang Wang edited comment on SPARK-27798 at 5/30/19 3:07 PM: - Turning off the rule "ConvertToLocalRelation" should work around the problem: spark.conf.set(“spark.sql.optimizer.excludedRules”, “org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation”) was (Author: gengliang.wang): Turning off the rule "ConvertToLocalRelation" should fix the problem: spark.conf.set(“spark.sql.optimizer.excludedRules”, “org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation”) > from_avro can modify variables in other rows in local mode > -- > > Key: SPARK-27798 > URL: https://issues.apache.org/jira/browse/SPARK-27798 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Yosuke Mori >Priority: Blocker > Labels: correctness > Attachments: Screen Shot 2019-05-21 at 2.39.27 PM.png > > > Steps to reproduce: > Create a local Dataset (at least two distinct rows) with a binary Avro field. > Use the {{from_avro}} function to deserialize the binary into another column. > Verify that all of the rows incorrectly have the same value. > Here's a concrete example (using Spark 2.4.3). 
All it does is converts a list > of TestPayload objects into binary using the defined avro schema, then tries > to deserialize using {{from_avro}} with that same schema: > {code:java} > import org.apache.avro.Schema > import org.apache.avro.generic.{GenericDatumWriter, GenericRecord, > GenericRecordBuilder} > import org.apache.avro.io.EncoderFactory > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.avro.from_avro > import org.apache.spark.sql.functions.col > import java.io.ByteArrayOutputStream > object TestApp extends App { > // Payload container > case class TestEvent(payload: Array[Byte]) > // Deserialized Payload > case class TestPayload(message: String) > // Schema for Payload > val simpleSchema = > """ > |{ > |"type": "record", > |"name" : "Payload", > |"fields" : [ {"name" : "message", "type" : [ "string", "null" ] } ] > |} > """.stripMargin > // Convert TestPayload into avro binary > def generateSimpleSchemaBinary(record: TestPayload, avsc: String): > Array[Byte] = { > val schema = new Schema.Parser().parse(avsc) > val out = new ByteArrayOutputStream() > val writer = new GenericDatumWriter[GenericRecord](schema) > val encoder = EncoderFactory.get().binaryEncoder(out, null) > val rootRecord = new GenericRecordBuilder(schema).set("message", > record.message).build() > writer.write(rootRecord, encoder) > encoder.flush() > out.toByteArray > } > val spark: SparkSession = > SparkSession.builder().master("local[*]").getOrCreate() > import spark.implicits._ > List( > TestPayload("one"), > TestPayload("two"), > TestPayload("three"), > TestPayload("four") > ).map(payload => TestEvent(generateSimpleSchemaBinary(payload, > simpleSchema))) > .toDS() > .withColumn("deserializedPayload", from_avro(col("payload"), > simpleSchema)) > .show(truncate = false) > } > {code} > And here is what this program outputs: > {noformat} > +--+---+ > |payload |deserializedPayload| > +--+---+ > |[00 06 6F 6E 65] |[four] | > |[00 06 74 77 6F] |[four] | > |[00 0A 74 68 
72 65 65]|[four] | > |[00 08 66 6F 75 72] |[four] | > +--+---+{noformat} > Here, we can see that the avro binary is correctly generated, but the > deserialized version is a copy of the last row. I have not yet verified that > this is an issue in cluster mode as well. > > I dug a bit more into the code and it seems like the reuse of {{result}} > in {{AvroDataToCatalyst}} is overwriting the decoded values of previous rows. > I set a breakpoint in {{LocalRelation}} and the {{data}} sequence seems to all > point to the same address in memory - and therefore a mutation in one > variable will cause all of it to mutate. > !Screen Shot 2019-05-21 at 2.39.27 PM.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27798) from_avro can modify variables in other rows in local mode
[ https://issues.apache.org/jira/browse/SPARK-27798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16851966#comment-16851966 ] Gengliang Wang commented on SPARK-27798: Turning off the rule "ConvertToLocalRelation" should fix the problem: spark.conf.set(“spark.sql.optimizer.excludedRules”, “org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation”) > from_avro can modify variables in other rows in local mode > -- > > Key: SPARK-27798 > URL: https://issues.apache.org/jira/browse/SPARK-27798 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Yosuke Mori >Priority: Blocker > Labels: correctness > Attachments: Screen Shot 2019-05-21 at 2.39.27 PM.png > > > Steps to reproduce: > Create a local Dataset (at least two distinct rows) with a binary Avro field. > Use the {{from_avro}} function to deserialize the binary into another column. > Verify that all of the rows incorrectly have the same value. > Here's a concrete example (using Spark 2.4.3). 
All it does is converts a list > of TestPayload objects into binary using the defined avro schema, then tries > to deserialize using {{from_avro}} with that same schema: > {code:java} > import org.apache.avro.Schema > import org.apache.avro.generic.{GenericDatumWriter, GenericRecord, > GenericRecordBuilder} > import org.apache.avro.io.EncoderFactory > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.avro.from_avro > import org.apache.spark.sql.functions.col > import java.io.ByteArrayOutputStream > object TestApp extends App { > // Payload container > case class TestEvent(payload: Array[Byte]) > // Deserialized Payload > case class TestPayload(message: String) > // Schema for Payload > val simpleSchema = > """ > |{ > |"type": "record", > |"name" : "Payload", > |"fields" : [ {"name" : "message", "type" : [ "string", "null" ] } ] > |} > """.stripMargin > // Convert TestPayload into avro binary > def generateSimpleSchemaBinary(record: TestPayload, avsc: String): > Array[Byte] = { > val schema = new Schema.Parser().parse(avsc) > val out = new ByteArrayOutputStream() > val writer = new GenericDatumWriter[GenericRecord](schema) > val encoder = EncoderFactory.get().binaryEncoder(out, null) > val rootRecord = new GenericRecordBuilder(schema).set("message", > record.message).build() > writer.write(rootRecord, encoder) > encoder.flush() > out.toByteArray > } > val spark: SparkSession = > SparkSession.builder().master("local[*]").getOrCreate() > import spark.implicits._ > List( > TestPayload("one"), > TestPayload("two"), > TestPayload("three"), > TestPayload("four") > ).map(payload => TestEvent(generateSimpleSchemaBinary(payload, > simpleSchema))) > .toDS() > .withColumn("deserializedPayload", from_avro(col("payload"), > simpleSchema)) > .show(truncate = false) > } > {code} > And here is what this program outputs: > {noformat} > +--+---+ > |payload |deserializedPayload| > +--+---+ > |[00 06 6F 6E 65] |[four] | > |[00 06 74 77 6F] |[four] | > |[00 0A 74 68 
72 65 65]|[four] | > |[00 08 66 6F 75 72] |[four] | > +--+---+{noformat} > Here, we can see that the avro binary is correctly generated, but the > deserialized version is a copy of the last row. I have not yet verified that > this is an issue in cluster mode as well. > > I dug a bit more into the code and it seems like the reuse of {{result}} > in {{AvroDataToCatalyst}} is overwriting the decoded values of previous rows. > I set a breakpoint in {{LocalRelation}} and the {{data}} sequence seems to all > point to the same address in memory - and therefore a mutation in one > variable will cause all of it to mutate. > !Screen Shot 2019-05-21 at 2.39.27 PM.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
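The workaround quoted in the comment above (excluding the ConvertToLocalRelation rule) can equivalently be set when building the session rather than via spark.conf.set afterwards. A configuration sketch in PySpark, untested here; spark.sql.optimizer.excludedRules is a documented Spark 2.4 SQL conf, the rest is illustrative:

```python
from pyspark.sql import SparkSession

# Exclude the ConvertToLocalRelation optimizer rule so local relations
# are not collapsed eagerly, working around the shared result-buffer
# mutation in AvroDataToCatalyst described above.
spark = (
    SparkSession.builder
    .master("local[*]")
    .config(
        "spark.sql.optimizer.excludedRules",
        "org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation",
    )
    .getOrCreate()
)
```

The same conf can be passed on the command line with `spark-submit --conf spark.sql.optimizer.excludedRules=...`.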
[jira] [Closed] (SPARK-27706) Add SQL metrics of numOutputRows for BroadcastExchangeExec
[ https://issues.apache.org/jira/browse/SPARK-27706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-27706. - > Add SQL metrics of numOutputRows for BroadcastExchangeExec > -- > > Key: SPARK-27706 > URL: https://issues.apache.org/jira/browse/SPARK-27706 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: dzcxzl >Priority: Trivial > > Add SQL metrics of numOutputRows for BroadcastExchangeExec -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27742) Security Support in Sources and Sinks for SS and Batch
[ https://issues.apache.org/jira/browse/SPARK-27742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16851924#comment-16851924 ] Stavros Kontopoulos edited comment on SPARK-27742 at 5/30/19 2:58 PM: -- {quote}From client side the max lifetime can be only decreased for security reasons + see my previous point. {quote} In general since Kafka allows me to set this value, I should be able to do so. "Max lifetime for the token in milliseconds. If the value is -1, then MaxLifeTime will default to a server side config value (delegation.token.max.lifetime.ms)." For the user this means better flexibility per job to setup an upper limit. Again I see the point repeatedly get new tokens, no max life time. From a security perspective this allows the user to have a ticket for ever, if not mistaken, which is less restrictive than setting a hard limit. Imagine you have multiple Spark batch apps and you want to setup limits for administration reasons eg. no user is allowed to have access more than 5 days (for streaming jobs you need no limits). Anyway my 2 cents. {quote}There is no possibility to obtain token for anybody else (pls see the comment in the code). {quote} When proxy user will be supported I guess there will be. was (Author: skonto): {quote}From client side the max lifetime can be only decreased for security reasons + see my previous point. {quote} In general since Kafka allows me to set this value, I should be able to do so. "Max lifetime for the token in milliseconds. If the value is -1, then MaxLifeTime will default to a server side config value (delegation.token.max.lifetime.ms)." For the user this means better flexibility per job to setup an upper limit. Again I see the point repeatedly get new tokens, no max life time. From a security perspective this allows the user to have a ticket for ever, if not mistaken, which is less restrictive than setting a hard limit. 
Imagine you have multiple Spark apps and you want to setup limits for administration reasons eg. no user is allowed to have access more than 5 days (for streaming jobs you need no limits). Anyway my 2 cents. {quote}There is no possibility to obtain token for anybody else (pls see the comment in the code). {quote} When proxy user will be supported I guess there will be. > Security Support in Sources and Sinks for SS and Batch > -- > > Key: SPARK-27742 > URL: https://issues.apache.org/jira/browse/SPARK-27742 > Project: Spark > Issue Type: Brainstorming > Components: SQL, Structured Streaming >Affects Versions: 3.0.0 >Reporter: Stavros Kontopoulos >Priority: Major > > As discussed with [~erikerlandson] on the [Big Data on K8s > UG|https://docs.google.com/document/d/1pnF38NF6N5eM8DlK088XUW85Vms4V2uTsGZvSp8MNIA] > it would be good to capture current status and identify work that needs to > be done for securing Spark when accessing sources and sinks. For example what > is the status of SSL, Kerberos support in different scenarios. The big > concern nowadays is how to secure data pipelines end-to-end. > Note: Not sure if this overlaps with some other ticket. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27873) Csv reader, adding a corrupt record column causes error if enforceSchema=false
[ https://issues.apache.org/jira/browse/SPARK-27873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16851952#comment-16851952 ] Liang-Chi Hsieh commented on SPARK-27873:
-
I can prepare a PR if Marcin or Hyukjin Kwon doesn't plan to do so.

> Csv reader, adding a corrupt record column causes error if enforceSchema=false > -- > > Key: SPARK-27873 > URL: https://issues.apache.org/jira/browse/SPARK-27873 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 2.4.3 > Reporter: Marcin Mejran > Priority: Major > > In the Spark CSV reader, if you're using permissive mode with a column for storing corrupt records, then you need to add a new schema column corresponding to columnNameOfCorruptRecord. > However, if you have a header row and enforceSchema=false, the schema-vs-header validation fails because there is an extra column corresponding to columnNameOfCorruptRecord. > Since FAILFAST mode doesn't print informative error messages about which rows failed to parse, there is no other way to track down broken rows than setting a corrupt record column.
[jira] [Commented] (SPARK-27873) Csv reader, adding a corrupt record column causes error if enforceSchema=false
[ https://issues.apache.org/jira/browse/SPARK-27873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16851948#comment-16851948 ] Liang-Chi Hsieh commented on SPARK-27873:
-
I guess what Marcin meant is:
{code}
val schema = StructType.fromDDL("a int, b date")
val columnNameOfCorruptRecord = "_unparsed"
val schemaWithCorrField1 = schema.add(columnNameOfCorruptRecord, StringType)
val df = spark
  .read
  .option("mode", "Permissive")
  .option("header", "true")
  .option("enforceSchema", false)
  .option("columnNameOfCorruptRecord", columnNameOfCorruptRecord)
  .schema(schemaWithCorrField1)
  .csv(testFile(valueMalformedWithHeaderFile))
{code}
If we want to keep corrupt records, we add a new column to the schema. But this new column isn't in the CSV header, so if enforceSchema is disabled at the same time, CSVHeaderChecker throws an exception like:
{code}
[info] Cause: java.lang.IllegalArgumentException: Number of column in CSV header is not equal to number of fields in the schema:
[info] Header length: 2, schema size: 3
{code}
This is because CSVHeaderChecker doesn't take columnNameOfCorruptRecord into account for now.
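The Scala snippet above already reproduces the failure; the length mismatch itself can also be sketched in a few lines of plain Python (names here are illustrative — the real check lives in Spark's CSVHeaderChecker, written in Scala): the naive header-vs-schema comparison fails once the corrupt-record column is appended to the schema, while a check that skips that synthetic column passes.

```python
# Illustrative sketch, NOT Spark's actual implementation: the corrupt-record
# column is appended to the schema but never appears in the CSV file itself,
# so a header check should exclude it before comparing lengths.

def header_matches_schema(header_cols, schema_fields, corrupt_record_col):
    """Compare CSV header length against the schema, ignoring the
    synthetic corrupt-record column, which never appears in the file."""
    data_fields = [f for f in schema_fields if f != corrupt_record_col]
    return len(header_cols) == len(data_fields)

header = ["a", "b"]               # columns actually present in the CSV file
schema = ["a", "b", "_unparsed"]  # schema with columnNameOfCorruptRecord added

# The naive comparison mismatches: header length 2 vs schema size 3,
# mirroring the IllegalArgumentException above.
assert len(header) != len(schema)

# Excluding the corrupt-record column makes the check pass.
assert header_matches_schema(header, schema, "_unparsed")
```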
[jira] [Updated] (SPARK-27757) Bump Jackson to 2.9.9
[ https://issues.apache.org/jira/browse/SPARK-27757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-27757: -- Priority: Minor (was: Major)

> Bump Jackson to 2.9.9 > - > > Key: SPARK-27757 > URL: https://issues.apache.org/jira/browse/SPARK-27757 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 3.0.0 > Reporter: Fokko Driesprong > Assignee: Fokko Driesprong > Priority: Minor > Fix For: 3.0.0 > > > This fixes CVE-2019-12086 on Databind: > https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.9.9
[jira] [Assigned] (SPARK-27757) Bump Jackson to 2.9.9
[ https://issues.apache.org/jira/browse/SPARK-27757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-27757: - Assignee: Fokko Driesprong
[jira] [Resolved] (SPARK-27757) Bump Jackson to 2.9.9
[ https://issues.apache.org/jira/browse/SPARK-27757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27757. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24646 [https://github.com/apache/spark/pull/24646]
[jira] [Assigned] (SPARK-27883) Add aggregates.sql - Part2
[ https://issues.apache.org/jira/browse/SPARK-27883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27883: Assignee: (was: Apache Spark)

> Add aggregates.sql - Part2 > -- > > Key: SPARK-27883 > URL: https://issues.apache.org/jira/browse/SPARK-27883 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 3.0.0 > Reporter: Yuming Wang > Priority: Major > > In this ticket, we plan to add the regression test cases of > https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/aggregates.sql#L145-L350
[jira] [Assigned] (SPARK-27883) Add aggregates.sql - Part2
[ https://issues.apache.org/jira/browse/SPARK-27883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27883: Assignee: Apache Spark