[jira] [Comment Edited] (SPARK-7721) Generate test coverage report from Python
[ https://issues.apache.org/jira/browse/SPARK-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305108#comment-16305108 ] Hyukjin Kwon edited comment on SPARK-7721 at 12/28/17 7:11 AM:
---
I roughly checked the coverage results and they seem fine. There is one trivial nit, though: the scope at https://github.com/apache/spark/blob/04e44b37cc04f62fbf9e08c7076349e0a4d12ea8/python/pyspark/daemon.py#L148-L169 is not in the coverage results, because I produce the coverage results in {{worker.py}} separately and then merge them. I believe it's not a big deal.

So, if this all looks fine to you, how about I proceed with two PRs:

1. Adding the script only (after cleaning it up, of course). The script alone should also be useful when reviewers check PRs; they can at least run it manually.

2. Integrating with Jenkins. I have two thoughts for this:
- Simplest one: only run it on a specific master build in Jenkins and always keep a single up-to-date coverage site. It's simple; we can just push it. I think this is quite straightforward and pretty feasible.
- Another one: I make a simple site in the Git pages to list the coverage of all other builds (including PR builds). We push the coverage HTML from Jenkins and then leave a link in each PR's Jenkins build success message. I think this is also feasible, but I need to take a further look.

BTW, I will be able to start working on this next week or two weeks after.

was (Author: hyukjin.kwon): I roughly checked the coverage results and seems fine. There is one trivial nit tho - https://github.com/apache/spark/blob/04e44b37cc04f62fbf9e08c7076349e0a4d12ea8/python/pyspark/daemon.py#L148-L169 this scope is not in the coverage results as basically I am producing the coverage results in {{worker.py}} separately and then merging it. I believe it's not a big deal. So, if you are fine for all now, how about if i proceed this by two PRs 1. Adding the script only (of course after cleaning up) Adding script alone should also be useful when reviewers check PRs, they can at least manually run it. 2. Integrating with Jenkins I have two thoughts for this: - Simplest one: Only run it in a specific mater in Jenkins and we always only keep a single up-to-date coverage site. It's simple. We can just simply push it. I think this is quite straightforward and pretty feasible. - Another one: I make a simple site to list up all other coverages of all other builds (including PR builds) in git pages, and then leave a link in each PR's Jenkins build success message. I think this's also feasible but I think I need to take a look further. BTW, I will be able to start to work on this from next week or two weeks after ..

> Generate test coverage report from Python
> -
>
> Key: SPARK-7721
> URL: https://issues.apache.org/jira/browse/SPARK-7721
> Project: Spark
> Issue Type: Test
> Components: PySpark, Tests
> Reporter: Reynold Xin
>
> Would be great to have a test coverage report for Python. Compared with Scala,
> it is trickier to understand the coverage without coverage reports in Python
> because we employ both docstring tests and unit tests in test files.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
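The comment above describes producing coverage results for {{worker.py}} separately and then merging them into one report. As an illustration only (plain Python, not coverage.py's actual API or Spark's script; file names and line numbers are made up), the merge step amounts to a per-file union of the covered-line sets from each measurement run:

```python
# Illustration of the "measure separately, then merge" step described above.
# Each run maps file names to the set of line numbers it executed; merging
# is a per-file union. File names and line numbers are hypothetical.
from collections import defaultdict


def merge_coverage(*runs):
    """Union per-file covered-line sets from several measurement runs."""
    merged = defaultdict(set)
    for run in runs:
        for filename, lines in run.items():
            merged[filename] |= lines
    return dict(merged)


# One run from the forked worker processes, one from the main test process.
worker_run = {"pyspark/worker.py": {10, 11, 42}}
tests_run = {"pyspark/worker.py": {10, 99}, "pyspark/rdd.py": {5}}

merged = merge_coverage(worker_run, tests_run)
```

With real coverage.py, the equivalent of `merge_coverage` is running `coverage combine` over the per-process data files.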
[jira] [Commented] (SPARK-7721) Generate test coverage report from Python
[ https://issues.apache.org/jira/browse/SPARK-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305108#comment-16305108 ] Hyukjin Kwon commented on SPARK-7721:
-
I roughly checked the coverage results and they seem fine. There is one trivial nit, though: the scope at https://github.com/apache/spark/blob/04e44b37cc04f62fbf9e08c7076349e0a4d12ea8/python/pyspark/daemon.py#L148-L169 is not in the coverage results, because I produce the coverage results in {{worker.py}} separately and then merge them. I believe it's not a big deal.

So, if this all looks fine to you, how about I proceed with two PRs:

1. Adding the script only (after cleaning it up, of course). The script alone should also be useful when reviewers check PRs; they can at least run it manually.

2. Integrating with Jenkins. I have two thoughts for this:
- Simplest one: only run it on a specific master build in Jenkins and always keep a single up-to-date coverage site. It's simple; we can just push it. I think this is quite straightforward and pretty feasible.
- Another one: I make a simple site in the Git pages to list the coverage of all other builds (including PR builds), and then leave a link in each PR's Jenkins build success message. I think this is also feasible, but I need to take a further look.

BTW, I will be able to start working on this next week or two weeks after.

> Generate test coverage report from Python
> -
>
> Key: SPARK-7721
> URL: https://issues.apache.org/jira/browse/SPARK-7721
> Project: Spark
> Issue Type: Test
> Components: PySpark, Tests
> Reporter: Reynold Xin
>
> Would be great to have a test coverage report for Python. Compared with Scala,
> it is trickier to understand the coverage without coverage reports in Python
> because we employ both docstring tests and unit tests in test files.
[jira] [Commented] (SPARK-7721) Generate test coverage report from Python
[ https://issues.apache.org/jira/browse/SPARK-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305103#comment-16305103 ] Hyukjin Kwon commented on SPARK-7721:
-
Hey [~rxin], I think I made it work now with a few modifications to the script and by forcing {{worker.py}} to produce the coverage results. I ran it with Python 3 and Coverage 4.4, all tests passed, and I just updated the site: https://spark-test.github.io/pyspark-coverage-site

FYI, here is the diff I used in the main codes to force it to produce the results (roughly 15 lines added):

{code}
diff --git a/python/pyspark/worker.py b/python/pyspark/worker.py
index e6737ae1c12..088debcf796 100644
--- a/python/pyspark/worker.py
+++ b/python/pyspark/worker.py
@@ -159,7 +159,7 @@ def read_udfs(pickleSer, infile, eval_type):
     return func, None, ser, ser


-def main(infile, outfile):
+def _main(infile, outfile):
     try:
         boot_time = time.time()
         split_index = read_int(infile)
@@ -259,6 +259,22 @@ def main(infile, outfile):
         exit(-1)


+if "COVERAGE_PROCESS_START" in os.environ:
+    def _cov_wrapped(*args, **kwargs):
+        import coverage
+        cov = coverage.coverage(
+            config_file=os.environ["COVERAGE_PROCESS_START"])
+        cov.start()
+        try:
+            _main(*args, **kwargs)
+        finally:
+            cov.stop()
+            cov.save()
+    main = _cov_wrapped
+else:
+    main = _main
+
+
 if __name__ == '__main__':
     # Read a local port to connect to from stdin
     java_port = int(sys.stdin.readline())
{code}

> Generate test coverage report from Python
> -
>
> Key: SPARK-7721
> URL: https://issues.apache.org/jira/browse/SPARK-7721
> Project: Spark
> Issue Type: Test
> Components: PySpark, Tests
> Reporter: Reynold Xin
>
> Would be great to have a test coverage report for Python. Compared with Scala,
> it is trickier to understand the coverage without coverage reports in Python
> because we employ both docstring tests and unit tests in test files.
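The diff in the comment above swaps {{main}} for a coverage-wrapped version only when {{COVERAGE_PROCESS_START}} is set, which is coverage.py's convention for measuring subprocesses. A self-contained sketch of the same wrap-when-env-var-set pattern, with a plain list recorder standing in for the coverage object so the example runs anywhere (the factory function and recorder are illustrative, not part of Spark or coverage.py):

```python
import os


def make_main(base, recorder, env=os.environ):
    """Return `base` wrapped with start/save bookkeeping when
    COVERAGE_PROCESS_START is present in `env`, mirroring how the
    diff above conditionally rebinds `main`."""
    if "COVERAGE_PROCESS_START" in env:
        def wrapped(*args, **kwargs):
            recorder.append("start")        # cov.start() in the real diff
            try:
                return base(*args, **kwargs)
            finally:
                recorder.append("save")     # cov.stop(); cov.save()
        return wrapped
    return base


# Usage: with the variable set, every call is bracketed by start/save,
# so coverage data is flushed even if the worker body raises.
log = []
main = make_main(lambda x: x * 2, log,
                 env={"COVERAGE_PROCESS_START": ".coveragerc"})
result = main(3)
```

The `try/finally` is the important part: the per-process coverage data must be saved even when the worker exits abnormally, or the later merge step sees nothing from that process.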
[jira] [Resolved] (SPARK-22757) Init-container in the driver/executor pods for downloading remote dependencies
[ https://issues.apache.org/jira/browse/SPARK-22757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-22757. --- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19954 [https://github.com/apache/spark/pull/19954] > Init-container in the driver/executor pods for downloading remote dependencies > -- > > Key: SPARK-22757 > URL: https://issues.apache.org/jira/browse/SPARK-22757 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Yinan Li >Assignee: Yinan Li > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22757) Init-container in the driver/executor pods for downloading remote dependencies
[ https://issues.apache.org/jira/browse/SPARK-22757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin reassigned SPARK-22757: - Assignee: Yinan Li > Init-container in the driver/executor pods for downloading remote dependencies > -- > > Key: SPARK-22757 > URL: https://issues.apache.org/jira/browse/SPARK-22757 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Yinan Li >Assignee: Yinan Li > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22909) Move Structured Streaming v2 APIs to streaming package
[ https://issues.apache.org/jira/browse/SPARK-22909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22909. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20093 [https://github.com/apache/spark/pull/20093] > Move Structured Streaming v2 APIs to streaming package > -- > > Key: SPARK-22909 > URL: https://issues.apache.org/jira/browse/SPARK-22909 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Blocker > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22883) ML test for StructuredStreaming: spark.ml.feature, A-M
[ https://issues.apache.org/jira/browse/SPARK-22883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305032#comment-16305032 ] Joseph K. Bradley commented on SPARK-22883: --- I'll work on this. > ML test for StructuredStreaming: spark.ml.feature, A-M > -- > > Key: SPARK-22883 > URL: https://issues.apache.org/jira/browse/SPARK-22883 > Project: Spark > Issue Type: Test > Components: ML, Tests >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley > > *For featurizers with names from A - M* > Task for adding Structured Streaming tests for all Models/Transformers in a > sub-module in spark.ml > For an example, see LinearRegressionSuite.scala in > https://github.com/apache/spark/pull/19843 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22913) Hive Partition Pruning, Fractional and Timestamp types
[ https://issues.apache.org/jira/browse/SPARK-22913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305018#comment-16305018 ] Ameen Tayyebi edited comment on SPARK-22913 at 12/28/17 3:31 AM:
-
Note: I'll be away from December 30th until January 18th, so I'll be checking up on this issue and the pull request at that time.

was (Author: ameen.tayy...@gmail.com): Note: I'll be away since December 30th until January 18th so I'll be checking up on this issue and the pull request at that time.

> Hive Partition Pruning, Fractional and Timestamp types
> -
>
> Key: SPARK-22913
> URL: https://issues.apache.org/jira/browse/SPARK-22913
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Ameen Tayyebi
> Fix For: 2.3.0
>
> Spark currently pushes the predicates it has in the SQL query down to the Hive
> Metastore. This only applies to predicates placed on partitioning columns. As
> more and more Hive Metastore implementations come around, this is an important
> optimization that allows data to be prefiltered to only the relevant
> partitions. Consider the following example:
> Table:
> create external table data (key string, quantity long)
> partitioned by (processing-date timestamp)
> Query:
> select * from data where processing-date = '2017-10-23 00:00:00'
> Currently, no filters will be pushed to the Hive Metastore for the above
> query. The reason is that the code that computes the predicates to be sent to
> the Hive Metastore only deals with integral and string column types. It
> doesn't know how to handle fractional and timestamp columns.
> I have tables in my metastore (AWS Glue) with millions of partitions of type
> timestamp and double. In my specific case, it takes Spark's master node about
> 6.5 minutes to download all partitions for the table and then filter the
> partitions client-side. The actual processing time of my query is only 6
> seconds. In other words, without partition pruning I'm looking at 6.5 minutes
> of processing, and with partition pruning I'm looking at only 6 seconds.
> I have a fix for this developed locally that I'll provide shortly as a pull
> request.
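The issue above is about converting partition-column predicates into a Metastore filter expression, which previously only worked for integral and string values. A hypothetical sketch of that kind of conversion extended to fractional and timestamp values (this is illustrative plain Python, not Spark's actual implementation; the function name and rendering rules are made up):

```python
# Hypothetical sketch of rendering one partition-column predicate as a
# Hive Metastore filter string, extended beyond integral/string types to
# cover fractional and timestamp values as the issue requests.
from datetime import datetime


def to_metastore_filter(column, op, value):
    """Render `column <op> value` as a metastore filter expression."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; reject it explicitly.
        raise ValueError("unsupported partition predicate type: bool")
    if isinstance(value, (int, float)):
        # Integral and fractional literals are rendered bare.
        rendered = repr(value)
    elif isinstance(value, datetime):
        # Timestamps are rendered as quoted date-time strings.
        rendered = '"%s"' % value.strftime("%Y-%m-%d %H:%M:%S")
    elif isinstance(value, str):
        rendered = '"%s"' % value
    else:
        raise ValueError("unsupported partition predicate type")
    return "%s %s %s" % (column, op, rendered)
```

Pushing such a string to the metastore lets it return only matching partitions, instead of shipping millions of partitions to the client for local filtering as described above.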
[jira] [Commented] (SPARK-22913) Hive Partition Pruning, Fractional and Timestamp types
[ https://issues.apache.org/jira/browse/SPARK-22913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305018#comment-16305018 ] Ameen Tayyebi commented on SPARK-22913:
-
Note: I'll be away from December 30th until January 18th, so I'll be checking up on this issue and the pull request at that time.

> Hive Partition Pruning, Fractional and Timestamp types
> -
>
> Key: SPARK-22913
> URL: https://issues.apache.org/jira/browse/SPARK-22913
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Ameen Tayyebi
> Fix For: 2.3.0
>
> Spark currently pushes the predicates it has in the SQL query down to the Hive
> Metastore. This only applies to predicates placed on partitioning columns. As
> more and more Hive Metastore implementations come around, this is an important
> optimization that allows data to be prefiltered to only the relevant
> partitions. Consider the following example:
> Table:
> create external table data (key string, quantity long)
> partitioned by (processing-date timestamp)
> Query:
> select * from data where processing-date = '2017-10-23 00:00:00'
> Currently, no filters will be pushed to the Hive Metastore for the above
> query. The reason is that the code that computes the predicates to be sent to
> the Hive Metastore only deals with integral and string column types. It
> doesn't know how to handle fractional and timestamp columns.
> I have tables in my metastore (AWS Glue) with millions of partitions of type
> timestamp and double. In my specific case, it takes Spark's master node about
> 6.5 minutes to download all partitions for the table and then filter the
> partitions client-side. The actual processing time of my query is only 6
> seconds. In other words, without partition pruning I'm looking at 6.5 minutes
> of processing, and with partition pruning I'm looking at only 6 seconds.
> I have a fix for this developed locally that I'll provide shortly as a pull
> request.
[jira] [Commented] (SPARK-22913) Hive Partition Pruning, Fractional and Timestamp types
[ https://issues.apache.org/jira/browse/SPARK-22913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305017#comment-16305017 ] Apache Spark commented on SPARK-22913: -- User 'ameent' has created a pull request for this issue: https://github.com/apache/spark/pull/20100 > Hive Partition Pruning, Fractional and Timestamp types > -- > > Key: SPARK-22913 > URL: https://issues.apache.org/jira/browse/SPARK-22913 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ameen Tayyebi > Fix For: 2.3.0 > > > Spark currently pushes the predicates it has in the SQL query to Hive > Metastore. This only applies to predicates that are placed on top of > partitioning columns. As more and more hive metastore implementations come > around, this is an important optimization to allow data to be prefiltered to > only relevant partitions. Consider the following example: > Table: > create external table data (key string, quantity long) > partitioned by (processing-date timestamp) > Query: > select * from data where processing-date = '2017-10-23 00:00:00' > Currently, no filters will be pushed to the hive metastore for the above > query. The reason is that the code that tries to compute predicates to be > sent to hive metastore, only deals with integral and string column types. It > doesn't know how to handle fractional and timestamp columns. > I have tables in my metastore (AWS Glue) with millions of partitions of type > timestamp and double. In my specific case, it takes Spark's master node about > 6.5 minutes to download all partitions for the table, and then filter the > partitions client-side. The actual processing time of my query is only 6 > seconds. In other words, without partition pruning, I'm looking at 6.5 > minutes of processing and with partition pruning, I'm looking at 6 seconds > only. > I have a fix for this developed locally that I'll provide shortly as a pull > request. 
[jira] [Assigned] (SPARK-22913) Hive Partition Pruning, Fractional and Timestamp types
[ https://issues.apache.org/jira/browse/SPARK-22913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22913: Assignee: Apache Spark > Hive Partition Pruning, Fractional and Timestamp types > -- > > Key: SPARK-22913 > URL: https://issues.apache.org/jira/browse/SPARK-22913 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ameen Tayyebi >Assignee: Apache Spark > Fix For: 2.3.0 > > > Spark currently pushes the predicates it has in the SQL query to Hive > Metastore. This only applies to predicates that are placed on top of > partitioning columns. As more and more hive metastore implementations come > around, this is an important optimization to allow data to be prefiltered to > only relevant partitions. Consider the following example: > Table: > create external table data (key string, quantity long) > partitioned by (processing-date timestamp) > Query: > select * from data where processing-date = '2017-10-23 00:00:00' > Currently, no filters will be pushed to the hive metastore for the above > query. The reason is that the code that tries to compute predicates to be > sent to hive metastore, only deals with integral and string column types. It > doesn't know how to handle fractional and timestamp columns. > I have tables in my metastore (AWS Glue) with millions of partitions of type > timestamp and double. In my specific case, it takes Spark's master node about > 6.5 minutes to download all partitions for the table, and then filter the > partitions client-side. The actual processing time of my query is only 6 > seconds. In other words, without partition pruning, I'm looking at 6.5 > minutes of processing and with partition pruning, I'm looking at 6 seconds > only. > I have a fix for this developed locally that I'll provide shortly as a pull > request. 
[jira] [Assigned] (SPARK-22913) Hive Partition Pruning, Fractional and Timestamp types
[ https://issues.apache.org/jira/browse/SPARK-22913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22913: Assignee: (was: Apache Spark) > Hive Partition Pruning, Fractional and Timestamp types > -- > > Key: SPARK-22913 > URL: https://issues.apache.org/jira/browse/SPARK-22913 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ameen Tayyebi > Fix For: 2.3.0 > > > Spark currently pushes the predicates it has in the SQL query to Hive > Metastore. This only applies to predicates that are placed on top of > partitioning columns. As more and more hive metastore implementations come > around, this is an important optimization to allow data to be prefiltered to > only relevant partitions. Consider the following example: > Table: > create external table data (key string, quantity long) > partitioned by (processing-date timestamp) > Query: > select * from data where processing-date = '2017-10-23 00:00:00' > Currently, no filters will be pushed to the hive metastore for the above > query. The reason is that the code that tries to compute predicates to be > sent to hive metastore, only deals with integral and string column types. It > doesn't know how to handle fractional and timestamp columns. > I have tables in my metastore (AWS Glue) with millions of partitions of type > timestamp and double. In my specific case, it takes Spark's master node about > 6.5 minutes to download all partitions for the table, and then filter the > partitions client-side. The actual processing time of my query is only 6 > seconds. In other words, without partition pruning, I'm looking at 6.5 > minutes of processing and with partition pruning, I'm looking at 6 seconds > only. > I have a fix for this developed locally that I'll provide shortly as a pull > request. 
[jira] [Resolved] (SPARK-22899) OneVsRestModel transform on streaming data failed.
[ https://issues.apache.org/jira/browse/SPARK-22899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-22899.
---
Resolution: Fixed
Fix Version/s: 2.3.0
Resolved by https://github.com/apache/spark/pull/20077

> OneVsRestModel transform on streaming data failed.
> -
>
> Key: SPARK-22899
> URL: https://issues.apache.org/jira/browse/SPARK-22899
> Project: Spark
> Issue Type: Bug
> Components: ML, Structured Streaming
> Affects Versions: 2.2.1
> Reporter: Weichen Xu
> Assignee: Weichen Xu
> Fix For: 2.3.0
>
> OneVsRestModel transform on streaming data failed,
> because it persists the input dataset, which streaming does not support.
[jira] [Assigned] (SPARK-22899) OneVsRestModel transform on streaming data failed.
[ https://issues.apache.org/jira/browse/SPARK-22899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-22899:
-
Assignee: Weichen Xu

> OneVsRestModel transform on streaming data failed.
> -
>
> Key: SPARK-22899
> URL: https://issues.apache.org/jira/browse/SPARK-22899
> Project: Spark
> Issue Type: Bug
> Components: ML, Structured Streaming
> Affects Versions: 2.2.1
> Reporter: Weichen Xu
> Assignee: Weichen Xu
>
> OneVsRestModel transform on streaming data failed,
> because it persists the input dataset, which streaming does not support.
[jira] [Updated] (SPARK-22888) OneVsRestModel does not work with Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-22888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-22888: -- Target Version/s: (was: 2.3.0) > OneVsRestModel does not work with Structured Streaming > -- > > Key: SPARK-22888 > URL: https://issues.apache.org/jira/browse/SPARK-22888 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.1 >Reporter: Joseph K. Bradley >Priority: Critical > > OneVsRestModel uses Dataset.persist, which does not work with streaming. > This should be avoided when the input is a streaming Dataset. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22888) OneVsRestModel does not work with Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-22888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-22888. --- Resolution: Duplicate > OneVsRestModel does not work with Structured Streaming > -- > > Key: SPARK-22888 > URL: https://issues.apache.org/jira/browse/SPARK-22888 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.1 >Reporter: Joseph K. Bradley >Priority: Critical > > OneVsRestModel uses Dataset.persist, which does not work with streaming. > This should be avoided when the input is a streaming Dataset. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22905) Fix ChiSqSelectorModel save implementation
[ https://issues.apache.org/jira/browse/SPARK-22905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-22905:
--
Shepherd: Joseph K. Bradley

> Fix ChiSqSelectorModel save implementation
> -
>
> Key: SPARK-22905
> URL: https://issues.apache.org/jira/browse/SPARK-22905
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 2.2.1
> Reporter: Weichen Xu
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Currently, `ChiSqSelectorModel` save does:
> {code}
> spark.createDataFrame(dataArray).repartition(1).write...
> {code}
> The default partition number used by createDataFrame is "defaultParallelism",
> and the current RoundRobinPartitioning won't guarantee that the "repartition"
> produces the same order as the local array. We need to fix it.
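The issue above says RoundRobinPartitioning does not guarantee that repartitioned data keeps the local array's order. A plain-Python illustration of the effect (not Spark code; the partitioning helper is made up for the example):

```python
# Illustration of why round-robin distribution followed by reading
# partitions back in partition order does not preserve the original
# row order.
def round_robin(rows, num_partitions):
    """Deal rows into partitions one at a time, round-robin style."""
    parts = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        parts[i % num_partitions].append(row)
    return parts


rows = [1, 2, 3, 4, 5, 6]
parts = round_robin(rows, 2)
# Concatenating the partitions back interleaves the original order.
collected = [row for part in parts for row in part]
```

This is why a save path that relies on `repartition(1)` to reproduce a local array's order can write rows in the wrong sequence.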
[jira] [Updated] (SPARK-22905) Fix ChiSqSelectorModel save implementation
[ https://issues.apache.org/jira/browse/SPARK-22905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-22905:
--
Component/s: (was: ML)

> Fix ChiSqSelectorModel save implementation
> -
>
> Key: SPARK-22905
> URL: https://issues.apache.org/jira/browse/SPARK-22905
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 2.2.1
> Reporter: Weichen Xu
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Currently, `ChiSqSelectorModel` save does:
> {code}
> spark.createDataFrame(dataArray).repartition(1).write...
> {code}
> The default partition number used by createDataFrame is "defaultParallelism",
> and the current RoundRobinPartitioning won't guarantee that the "repartition"
> produces the same order as the local array. We need to fix it.
[jira] [Commented] (SPARK-22916) shouldn't bias towards build right if user does not specify
[ https://issues.apache.org/jira/browse/SPARK-22916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16304963#comment-16304963 ] Apache Spark commented on SPARK-22916:
--
User 'liufengdb' has created a pull request for this issue: https://github.com/apache/spark/pull/20099

> shouldn't bias towards build right if user does not specify
> ---
>
> Key: SPARK-22916
> URL: https://issues.apache.org/jira/browse/SPARK-22916
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Feng Liu
>
> This is an issue very similar to SPARK-22489. When there are no broadcast
> hints, the current Spark strategies will prefer to build right, without
> considering the sizes of the two sides. To reproduce:
> {code:java}
> import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec
> spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value").createTempView("table1")
> spark.createDataFrame(Seq((1, "1"), (2, "2"), (3, "3"))).toDF("key", "value").createTempView("table2")
> val bl = sql(s"SELECT * FROM table1 t1 JOIN table2 t2 ON t1.key = t2.key").queryExecution.executedPlan
> {code}
> The plan is going to broadcast the right side (`t2`), even though it is larger.
[jira] [Assigned] (SPARK-22916) shouldn't bias towards build right if user does not specify
[ https://issues.apache.org/jira/browse/SPARK-22916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22916:
-
Assignee: Apache Spark

> shouldn't bias towards build right if user does not specify
> ---
>
> Key: SPARK-22916
> URL: https://issues.apache.org/jira/browse/SPARK-22916
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Feng Liu
> Assignee: Apache Spark
>
> This is an issue very similar to SPARK-22489. When there are no broadcast
> hints, the current Spark strategies will prefer to build right, without
> considering the sizes of the two sides. To reproduce:
> {code:java}
> import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec
> spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value").createTempView("table1")
> spark.createDataFrame(Seq((1, "1"), (2, "2"), (3, "3"))).toDF("key", "value").createTempView("table2")
> val bl = sql(s"SELECT * FROM table1 t1 JOIN table2 t2 ON t1.key = t2.key").queryExecution.executedPlan
> {code}
> The plan is going to broadcast the right side (`t2`), even though it is larger.
[jira] [Assigned] (SPARK-22916) shouldn't bias towards build right if user does not specify
[ https://issues.apache.org/jira/browse/SPARK-22916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22916:
-
Assignee: (was: Apache Spark)

> shouldn't bias towards build right if user does not specify
> ---
>
> Key: SPARK-22916
> URL: https://issues.apache.org/jira/browse/SPARK-22916
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Feng Liu
>
> This is an issue very similar to SPARK-22489. When there are no broadcast
> hints, the current Spark strategies will prefer to build right, without
> considering the sizes of the two sides. To reproduce:
> {code:java}
> import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec
> spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value").createTempView("table1")
> spark.createDataFrame(Seq((1, "1"), (2, "2"), (3, "3"))).toDF("key", "value").createTempView("table2")
> val bl = sql(s"SELECT * FROM table1 t1 JOIN table2 t2 ON t1.key = t2.key").queryExecution.executedPlan
> {code}
> The plan is going to broadcast the right side (`t2`), even though it is larger.
[jira] [Created] (SPARK-22916) shouldn't bias towards build right if user does not specify
Feng Liu created SPARK-22916: Summary: shouldn't bias towards build right if user does not specify Key: SPARK-22916 URL: https://issues.apache.org/jira/browse/SPARK-22916 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Feng Liu This is an issue very similar to SPARK-22489. When there are no broadcast hints, the current spark strategies will prefer to build right, without considering the sizes of the two sides. To reproduce: {code:java} import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value").createTempView("table1") spark.createDataFrame(Seq((1, "1"), (2, "2"), (3, "3"))).toDF("key", "value").createTempView("table2") val bl = sql(s"SELECT * FROM table1 t1 JOIN table2 t2 ON t1.key = t2.key").queryExecution.executedPlan {code} The plan is going to broadcast right side (`t2`), even though it is larger.
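The fix direction implied by the report can be illustrated with a small sketch (plain Python, not Spark's actual planner code; the function name and size inputs are hypothetical): when no broadcast hint is given, a size-aware strategy should build the smaller side rather than unconditionally building right.

```python
def choose_build_side(left_bytes, right_bytes, hint=None):
    """Pick the join side to build (broadcast).

    A user hint forces a side; otherwise prefer the smaller side,
    instead of the biased behavior of always building right.
    """
    if hint in ("left", "right"):
        return hint
    return "left" if left_bytes < right_bytes else "right"

# In the reproduction above, table1 (2 rows) is smaller than table2
# (3 rows), so an unbiased planner should build/broadcast the left side.
print(choose_build_side(left_bytes=2_000, right_bytes=3_000))
```

With the biased behavior described in the ticket, the equivalent logic would return "right" regardless of the two size estimates.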
[jira] [Updated] (SPARK-22915) ML test for StructuredStreaming: spark.ml.feature, N-Z
[ https://issues.apache.org/jira/browse/SPARK-22915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-22915: -- Description: *For featurizers with names from N - Z* Task for adding Structured Streaming tests for all Models/Transformers in a sub-module in spark.ml For an example, see LinearRegressionSuite.scala in https://github.com/apache/spark/pull/19843 was: *For featurizers with names from A - M* Task for adding Structured Streaming tests for all Models/Transformers in a sub-module in spark.ml For an example, see LinearRegressionSuite.scala in https://github.com/apache/spark/pull/19843 > ML test for StructuredStreaming: spark.ml.feature, N-Z > -- > > Key: SPARK-22915 > URL: https://issues.apache.org/jira/browse/SPARK-22915 > Project: Spark > Issue Type: Test > Components: ML, Tests >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley > > *For featurizers with names from N - Z* > Task for adding Structured Streaming tests for all Models/Transformers in a > sub-module in spark.ml > For an example, see LinearRegressionSuite.scala in > https://github.com/apache/spark/pull/19843
[jira] [Updated] (SPARK-22883) ML test for StructuredStreaming: spark.ml.feature, A-M
[ https://issues.apache.org/jira/browse/SPARK-22883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-22883: -- Summary: ML test for StructuredStreaming: spark.ml.feature, A-M (was: ML test for StructuredStreaming: spark.ml.feature) > ML test for StructuredStreaming: spark.ml.feature, A-M > -- > > Key: SPARK-22883 > URL: https://issues.apache.org/jira/browse/SPARK-22883 > Project: Spark > Issue Type: Test > Components: ML, Tests >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley > > Task for adding Structured Streaming tests for all Models/Transformers in a > sub-module in spark.ml > For an example, see LinearRegressionSuite.scala in > https://github.com/apache/spark/pull/19843
[jira] [Updated] (SPARK-22883) ML test for StructuredStreaming: spark.ml.feature, A-M
[ https://issues.apache.org/jira/browse/SPARK-22883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-22883: -- Description: *For featurizers with names from A - M* Task for adding Structured Streaming tests for all Models/Transformers in a sub-module in spark.ml For an example, see LinearRegressionSuite.scala in https://github.com/apache/spark/pull/19843 was: Task for adding Structured Streaming tests for all Models/Transformers in a sub-module in spark.ml For an example, see LinearRegressionSuite.scala in https://github.com/apache/spark/pull/19843 > ML test for StructuredStreaming: spark.ml.feature, A-M > -- > > Key: SPARK-22883 > URL: https://issues.apache.org/jira/browse/SPARK-22883 > Project: Spark > Issue Type: Test > Components: ML, Tests >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley > > *For featurizers with names from A - M* > Task for adding Structured Streaming tests for all Models/Transformers in a > sub-module in spark.ml > For an example, see LinearRegressionSuite.scala in > https://github.com/apache/spark/pull/19843
[jira] [Created] (SPARK-22915) ML test for StructuredStreaming: spark.ml.feature, N-Z
Joseph K. Bradley created SPARK-22915: - Summary: ML test for StructuredStreaming: spark.ml.feature, N-Z Key: SPARK-22915 URL: https://issues.apache.org/jira/browse/SPARK-22915 Project: Spark Issue Type: Test Components: ML, Tests Affects Versions: 2.3.0 Reporter: Joseph K. Bradley *For featurizers with names from A - M* Task for adding Structured Streaming tests for all Models/Transformers in a sub-module in spark.ml For an example, see LinearRegressionSuite.scala in https://github.com/apache/spark/pull/19843
[jira] [Assigned] (SPARK-22914) Subbing for spark.history.ui.port does not resolve by default
[ https://issues.apache.org/jira/browse/SPARK-22914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22914: Assignee: Apache Spark > Subbing for spark.history.ui.port does not resolve by default > - > > Key: SPARK-22914 > URL: https://issues.apache.org/jira/browse/SPARK-22914 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.2.1 >Reporter: Gera Shegalov >Assignee: Apache Spark > > In order not to hardcode SHS web ui port and not duplicate information that > is already configured we might be inclined to define > {{spark.yarn.historyServer.address}} as > {code}http://${hadoopconf-yarn.resourcemanager.hostname}:${spark.history.ui.port}{code} > However, since spark.history.ui.port is not registered its resolution fails > when it's not explicitly set in the deployed spark conf.
[jira] [Assigned] (SPARK-22914) Subbing for spark.history.ui.port does not resolve by default
[ https://issues.apache.org/jira/browse/SPARK-22914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22914: Assignee: (was: Apache Spark) > Subbing for spark.history.ui.port does not resolve by default > - > > Key: SPARK-22914 > URL: https://issues.apache.org/jira/browse/SPARK-22914 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.2.1 >Reporter: Gera Shegalov > > In order not to hardcode SHS web ui port and not duplicate information that > is already configured we might be inclined to define > {{spark.yarn.historyServer.address}} as > {code}http://${hadoopconf-yarn.resourcemanager.hostname}:${spark.history.ui.port}{code} > However, since spark.history.ui.port is not registered its resolution fails > when it's not explicitly set in the deployed spark conf.
[jira] [Commented] (SPARK-22914) Subbing for spark.history.ui.port does not resolve by default
[ https://issues.apache.org/jira/browse/SPARK-22914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16304863#comment-16304863 ] Apache Spark commented on SPARK-22914: -- User 'gerashegalov' has created a pull request for this issue: https://github.com/apache/spark/pull/20098 > Subbing for spark.history.ui.port does not resolve by default > - > > Key: SPARK-22914 > URL: https://issues.apache.org/jira/browse/SPARK-22914 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.2.1 >Reporter: Gera Shegalov > > In order not to hardcode SHS web ui port and not duplicate information that > is already configured we might be inclined to define > {{spark.yarn.historyServer.address}} as > {code}http://${hadoopconf-yarn.resourcemanager.hostname}:${spark.history.ui.port}{code} > However, since spark.history.ui.port is not registered its resolution fails > when it's not explicitly set in the deployed spark conf.
[jira] [Created] (SPARK-22914) Subbing for spark.history.ui.port does not resolve by default
Gera Shegalov created SPARK-22914: - Summary: Subbing for spark.history.ui.port does not resolve by default Key: SPARK-22914 URL: https://issues.apache.org/jira/browse/SPARK-22914 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 2.2.1 Reporter: Gera Shegalov In order not to hardcode SHS web ui port and not duplicate information that is already configured we might be inclined to define {{spark.yarn.historyServer.address}} as {code}http://${hadoopconf-yarn.resourcemanager.hostname}:${spark.history.ui.port}{code} However, since spark.history.ui.port is not registered its resolution fails when it's not explicitly set in the deployed spark conf.
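The failure mode above can be sketched in a few lines (plain Python, not Spark's actual config code; the registry dict and function names are hypothetical): `${...}` substitution consults only explicitly set entries, so a key like spark.history.ui.port that is merely documented with a default, but never registered, fails to resolve. Registering the default fixes the lookup.

```python
import re

# Hypothetical registry of config keys with built-in defaults.
REGISTERED_DEFAULTS = {"spark.history.ui.port": "18080"}

def substitute(value, conf):
    """Resolve ${key} references: explicit conf first, then registered
    defaults. An unregistered, unset key raises, which models the bug."""
    def repl(match):
        key = match.group(1)
        if key in conf:
            return conf[key]
        if key in REGISTERED_DEFAULTS:
            return REGISTERED_DEFAULTS[key]
        raise KeyError(f"unresolved reference: {key}")
    return re.sub(r"\$\{([^}]+)\}", repl, value)

# Resolves via the registered default even with an empty deployed conf.
print(substitute("http://rm-host:${spark.history.ui.port}", conf={}))
```

Without the registry entry, the same call would fail exactly as described in the ticket whenever the port is not explicitly set in the deployed conf.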
[jira] [Created] (SPARK-22913) Hive Partition Pruning, Fractional and Timestamp types
Ameen Tayyebi created SPARK-22913: - Summary: Hive Partition Pruning, Fractional and Timestamp types Key: SPARK-22913 URL: https://issues.apache.org/jira/browse/SPARK-22913 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Ameen Tayyebi Fix For: 2.3.0 Spark currently pushes the predicates it has in the SQL query to Hive Metastore. This only applies to predicates that are placed on top of partitioning columns. As more and more hive metastore implementations come around, this is an important optimization to allow data to be prefiltered to only relevant partitions. Consider the following example: Table: create external table data (key string, quantity long) partitioned by (processing-date timestamp) Query: select * from data where processing-date = '2017-10-23 00:00:00' Currently, no filters will be pushed to the hive metastore for the above query. The reason is that the code that tries to compute predicates to be sent to hive metastore, only deals with integral and string column types. It doesn't know how to handle fractional and timestamp columns. I have tables in my metastore (AWS Glue) with millions of partitions of type timestamp and double. In my specific case, it takes Spark's master node about 6.5 minutes to download all partitions for the table, and then filter the partitions client-side. The actual processing time of my query is only 6 seconds. In other words, without partition pruning, I'm looking at 6.5 minutes of processing and with partition pruning, I'm looking at 6 seconds only. I have a fix for this developed locally that I'll provide shortly as a pull request.
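The gap described above can be sketched as follows (plain Python, not the actual Spark/Hive code; the function name and quoting convention are illustrative assumptions): the pushdown logic must render a partition predicate as a metastore filter string, and while integral values can be rendered bare, fractional and timestamp values need their own handling; when a type is unhandled, the predicate is dropped and every partition is fetched and filtered client-side.

```python
def metastore_filter(col, op, value):
    """Render one partition predicate as a metastore filter string.

    Numeric (integral or fractional) values are rendered bare;
    string and timestamp literals are rendered quoted. The reported
    bug: only integral and string types were handled, so predicates
    on timestamp/double partition columns produced no filter at all.
    """
    if isinstance(value, (int, float)):
        return f"{col} {op} {value}"
    # Strings and timestamp literals rendered as quoted values.
    return f'{col} {op} "{value}"'

print(metastore_filter("processing_date", "=", "2017-10-23 00:00:00"))
print(metastore_filter("score", ">", 2.5))
```

The payoff in the ticket is purely about where filtering happens: with a filter string, the metastore returns only matching partitions; without one, millions of partition objects are downloaded first.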
[jira] [Commented] (SPARK-22912) Support v2 streaming sources and sinks in MicroBatchExecution
[ https://issues.apache.org/jira/browse/SPARK-22912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16304817#comment-16304817 ] Apache Spark commented on SPARK-22912: -- User 'jose-torres' has created a pull request for this issue: https://github.com/apache/spark/pull/20097 > Support v2 streaming sources and sinks in MicroBatchExecution > - > > Key: SPARK-22912 > URL: https://issues.apache.org/jira/browse/SPARK-22912 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Jose Torres >
[jira] [Assigned] (SPARK-22912) Support v2 streaming sources and sinks in MicroBatchExecution
[ https://issues.apache.org/jira/browse/SPARK-22912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22912: Assignee: (was: Apache Spark) > Support v2 streaming sources and sinks in MicroBatchExecution > - > > Key: SPARK-22912 > URL: https://issues.apache.org/jira/browse/SPARK-22912 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Jose Torres >
[jira] [Assigned] (SPARK-22912) Support v2 streaming sources and sinks in MicroBatchExecution
[ https://issues.apache.org/jira/browse/SPARK-22912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22912: Assignee: Apache Spark > Support v2 streaming sources and sinks in MicroBatchExecution > - > > Key: SPARK-22912 > URL: https://issues.apache.org/jira/browse/SPARK-22912 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Jose Torres >Assignee: Apache Spark >
[jira] [Created] (SPARK-22912) Support v2 streaming sources and sinks in MicroBatchExecution
Jose Torres created SPARK-22912: --- Summary: Support v2 streaming sources and sinks in MicroBatchExecution Key: SPARK-22912 URL: https://issues.apache.org/jira/browse/SPARK-22912 Project: Spark Issue Type: Sub-task Components: Structured Streaming Affects Versions: 2.3.0 Reporter: Jose Torres
[jira] [Created] (SPARK-22911) Migrate structured streaming sources to new DataSourceV2 APIs
Jose Torres created SPARK-22911: --- Summary: Migrate structured streaming sources to new DataSourceV2 APIs Key: SPARK-22911 URL: https://issues.apache.org/jira/browse/SPARK-22911 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.3.0 Reporter: Jose Torres
[jira] [Assigned] (SPARK-22908) add basic continuous kafka source
[ https://issues.apache.org/jira/browse/SPARK-22908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22908: Assignee: (was: Apache Spark) > add basic continuous kafka source > - > > Key: SPARK-22908 > URL: https://issues.apache.org/jira/browse/SPARK-22908 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Jose Torres >
[jira] [Assigned] (SPARK-22908) add basic continuous kafka source
[ https://issues.apache.org/jira/browse/SPARK-22908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22908: Assignee: Apache Spark > add basic continuous kafka source > - > > Key: SPARK-22908 > URL: https://issues.apache.org/jira/browse/SPARK-22908 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Jose Torres >Assignee: Apache Spark >
[jira] [Commented] (SPARK-22908) add basic continuous kafka source
[ https://issues.apache.org/jira/browse/SPARK-22908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16304811#comment-16304811 ] Apache Spark commented on SPARK-22908: -- User 'jose-torres' has created a pull request for this issue: https://github.com/apache/spark/pull/20096 > add basic continuous kafka source > - > > Key: SPARK-22908 > URL: https://issues.apache.org/jira/browse/SPARK-22908 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Jose Torres >
[jira] [Commented] (SPARK-22126) Fix model-specific optimization support for ML tuning
[ https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16304765#comment-16304765 ] Apache Spark commented on SPARK-22126: -- User 'MrBago' has created a pull request for this issue: https://github.com/apache/spark/pull/20095 > Fix model-specific optimization support for ML tuning > - > > Key: SPARK-22126 > URL: https://issues.apache.org/jira/browse/SPARK-22126 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Weichen Xu > > Fix model-specific optimization support for ML tuning. This is discussed in > SPARK-19357 > more discussion is here > https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0 > Anyone who's following might want to scan the design doc (in the links > above), the latest api proposal is: > {code} > def fitMultiple( > dataset: Dataset[_], > paramMaps: Array[ParamMap] > ): java.util.Iterator[scala.Tuple2[java.lang.Integer, Model]] > {code} > Old discussion: > I copy discussion from gist to here: > I propose to design API as: > {code} > def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): > Array[Callable[Map[Int, M]]] > {code} > Let me use an example to explain the API: > {quote} > It could be possible to still use the current parallelism and still allow > for model-specific optimizations. For example, if we doing cross validation > and have a param map with regParam = (0.1, 0.3) and maxIter = (5, 10). Lets > say that the cross validator could know that maxIter is optimized for the > model being evaluated (e.g. a new method in Estimator that return such > params). It would then be straightforward for the cross validator to remove > maxIter from the param map that will be parallelized over and use it to > create 2 arrays of paramMaps: ((regParam=0.1, maxIter=5), (regParam=0.1, > maxIter=10)) and ((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10)). 
> {quote} > In this example, we can see that, models computed from ((regParam=0.1, > maxIter=5), (regParam=0.1, maxIter=10)) can only be computed in one thread > code, models computed from ((regParam=0.3, maxIter=5), (regParam=0.3, > maxIter=10)) in another thread. In this example, there're 4 paramMaps, but > we can at most generate two threads to compute the models for them. > The API above allow "callable.call()" to return multiple models, and return > type is {code}Map[Int, M]{code}, key is integer, used to mark the paramMap > index for corresponding model. Use the example above, there're 4 paramMaps, > but only return 2 callable objects, one callable object for ((regParam=0.1, > maxIter=5), (regParam=0.1, maxIter=10)), another one for ((regParam=0.3, > maxIter=5), (regParam=0.3, maxIter=10)). > and the default "fitCallables/fit with paramMaps" can be implemented as > following: > {code} > def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): > Array[Callable[Map[Int, M]]] = { > paramMaps.zipWithIndex.map { case (paramMap: ParamMap, index: Int) => > new Callable[Map[Int, M]] { > override def call(): Map[Int, M] = Map(index -> fit(dataset, paramMap)) > } > } > } > def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = { >fitCallables(dataset, paramMaps).map { _.call().toSeq } > .flatMap(_).sortBy(_._1).map(_._2) > } > {code} > If use the API I proposed above, the code in > [CrossValidation|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala#L149-L159] > can be changed to: > {code} > val trainingDataset = sparkSession.createDataFrame(training, > schema).cache() > val validationDataset = sparkSession.createDataFrame(validation, > schema).cache() > // Fit models in a Future for training in parallel > val modelMapFutures = fitCallables(trainingDataset, paramMaps).map { > callable => > Future[Map[Int, Model[_]]] { > val modelMap = callable.call() > if (collectSubModelsParam) { >... 
> } > modelMap > } (executionContext) > } > // Unpersist training data only when all models have trained > Future.sequence[Model[_], Iterable](modelMapFutures)(implicitly, > executionContext) > .onComplete { _ => trainingDataset.unpersist() } (executionContext) > // Evaluate models in a Future that will calculate a metric and allow > model to be cleaned up > val foldMetricMapFutures = modelMapFutures.map { modelMapFuture => > modelMapFuture.map { modelMap => > modelMap.map { case (index: Int, model: Model[_]) => > val metric = eval.evaluate(model.transform(validationDataset, > paramMaps(index))) > (index, metric)
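The grouping step in the cross-validator example above can be sketched independently of Spark (plain Python; the function name and the choice of maxIter as the optimizable parameter are illustrative): param maps that differ only in the model-optimized parameter are grouped so each group can be fit incrementally in one thread, while the original param-map indices are preserved so metrics can be matched back.

```python
from itertools import groupby

def group_param_maps(param_maps, optimized_key):
    """Group param maps that differ only in `optimized_key`.

    Each group can be trained by a single callable/thread that varies
    only the optimized parameter (e.g. maxIter for an iterative model).
    Returns a list of groups; each group is a list of (index, param_map)
    so the caller can map models back to the original paramMaps order.
    """
    def shared(pm):
        # The non-optimized params identify the group.
        return tuple(sorted((k, v) for k, v in pm.items() if k != optimized_key))
    indexed = sorted(enumerate(param_maps), key=lambda iv: shared(iv[1]))
    return [
        list(grp)
        for _, grp in groupby(indexed, key=lambda iv: shared(iv[1]))
    ]

maps = [
    {"regParam": 0.1, "maxIter": 5}, {"regParam": 0.1, "maxIter": 10},
    {"regParam": 0.3, "maxIter": 5}, {"regParam": 0.3, "maxIter": 10},
]
# As in the JIRA discussion: 4 param maps collapse into 2 training groups.
groups = group_param_maps(maps, "maxIter")
print(len(groups))
```

This mirrors the proposed fitCallables/fitMultiple contract: 4 paramMaps, but only 2 callables, each returning models keyed by paramMap index.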
[jira] [Created] (SPARK-22910) Wrong results in Spark Job because failed to move to Trash
Ohad Raviv created SPARK-22910: -- Summary: Wrong results in Spark Job because failed to move to Trash Key: SPARK-22910 URL: https://issues.apache.org/jira/browse/SPARK-22910 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0, 2.1.0 Reporter: Ohad Raviv Our Spark job completed with status successful even though the saved data was corrupted. We run a monthly job in which each run overwrites the output of the previous run. Between runs we happened to change sql.shuffle.partitions from 2000 to 1000. The new run logged a warning that moving the old data to the user's .Trash had failed because the trash was full. Because it was only a warning, the process continued and wrote the new 1000 files, leaving most of the remaining old 1000 files in place. As a result, the final output folder contained a mix of old and new data, which corrupted the downstream process. The post-mortem is relatively easy to reconstruct. {code} hadoop fs -ls /the/folder -rwxr-xr-x 3 spark_user spark_user 209012005 2017-12-10 14:20 /the/folder/part-0.gz . . 
-rwxr-xr-x 3 spark_user spark_user 34899 2017-11-17 06:39 /the/folder/part-01990.gz {code} and in the driver's log: {code} 17/12/10 15:10:00 WARN Hive: Directory hdfs:///the/folder cannot be removed: java.io.IOException: Failed to move to trash: hdfs:///the/folder/part-0.gz java.io.IOException: Failed to move to trash: hdfs:///the/folder/part-0.gz at org.apache.hadoop.fs.TrashPolicyDefault.moveToTrash(TrashPolicyDefault.java:160) at org.apache.hadoop.fs.Trash.moveToTrash(Trash.java:109) at org.apache.hadoop.fs.Trash.moveToAppropriateTrash(Trash.java:90) at org.apache.hadoop.hive.shims.Hadoop23Shims.moveToAppropriateTrash(Hadoop23Shims.java:272) at org.apache.hadoop.hive.common.FileUtils.moveToTrash(FileUtils.java:603) at org.apache.hadoop.hive.common.FileUtils.trashFilesUnderDir(FileUtils.java:586) at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:2851) at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1640) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:716) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply$mcV$sp(HiveClientImpl.scala:672) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:672) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:672) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:230) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:229) at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:272) at org.apache.spark.sql.hive.client.HiveClientImpl.loadTable(HiveClientImpl.scala:671) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply$mcV$sp(HiveExternalCatalog.scala:741) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:739) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:739) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:95) at org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:739) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:323) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:170) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:347) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) at org.apache.spark.sql.execution.Sp
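The post-mortem signature in the listing above, new part files alongside stale ones from the previous run, can be detected mechanically. A crude sketch (plain Python; the function name and the index-based heuristic are illustrative assumptions, not Spark behavior): after an overwrite that wrote N partitions, any part file with an index at or above N cannot belong to the new run.

```python
import re

def stale_parts(file_names, new_num_partitions):
    """Flag part files whose index cannot belong to a run that wrote
    `new_num_partitions` files -- a crude mixed-output detector for the
    failed-overwrite scenario described in the ticket."""
    pattern = re.compile(r"part-(\d+)")
    stale = []
    for name in file_names:
        m = pattern.search(name)
        if m and int(m.group(1)) >= new_num_partitions:
            stale.append(name)
    return stale

listing = ["part-00000.gz", "part-01990.gz"]
# With the new run writing 1000 partitions, part-01990 must be leftover.
print(stale_parts(listing, new_num_partitions=1000))
```

A timestamp comparison against the job start time would catch stale files with low indices as well; the index check alone only catches the case reported here, where the partition count shrank.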
[jira] [Updated] (SPARK-22909) Move Structured Streaming v2 APIs to streaming package
[ https://issues.apache.org/jira/browse/SPARK-22909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-22909: - Target Version/s: 2.3.0 > Move Structured Streaming v2 APIs to streaming package > -- > > Key: SPARK-22909 > URL: https://issues.apache.org/jira/browse/SPARK-22909 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Blocker >
[jira] [Updated] (SPARK-22909) Move Structured Streaming v2 APIs to streaming package
[ https://issues.apache.org/jira/browse/SPARK-22909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-22909: - Priority: Blocker (was: Major) > Move Structured Streaming v2 APIs to streaming package > -- > > Key: SPARK-22909 > URL: https://issues.apache.org/jira/browse/SPARK-22909 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Blocker >
[jira] [Commented] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows
[ https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16304734#comment-16304734 ] Apache Spark commented on SPARK-20392: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/20094 > Slow performance when calling fit on ML pipeline for dataset with many > columns but few rows > --- > > Key: SPARK-20392 > URL: https://issues.apache.org/jira/browse/SPARK-20392 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Barry Becker >Assignee: Liang-Chi Hsieh > Fix For: 2.3.0 > > Attachments: blockbuster.csv, blockbuster_fewCols.csv, > giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip > > > This started as a [question on stack > overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro], > but it seems like a bug. > I am testing spark pipelines using a simple dataset (attached) with 312 > (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 > minutes to apply my ML pipeline to it on a 24 core server with 60G of memory. > This seems much too long for such a tiny dataset. Similar pipelines run > quickly on datasets that have fewer columns and more rows. It's something > about the number of columns that is causing the slow performance. 
> Here are a list of the stages in my pipeline: > {code} > 000_strIdx_5708525b2b6c > 001_strIdx_ec2296082913 > 002_bucketizer_3cbc8811877b > 003_bucketizer_5a01d5d78436 > 004_bucketizer_bf290d11364d > 005_bucketizer_c3296dfe94b2 > 006_bucketizer_7071ca50eb85 > 007_bucketizer_27738213c2a1 > 008_bucketizer_bd728fd89ba1 > 009_bucketizer_e1e716f51796 > 010_bucketizer_38be665993ba > 011_bucketizer_5a0e41e5e94f > 012_bucketizer_b5a3d5743aaa > 013_bucketizer_4420f98ff7ff > 014_bucketizer_777cc4fe6d12 > 015_bucketizer_f0f3a3e5530e > 016_bucketizer_218ecca3b5c1 > 017_bucketizer_0b083439a192 > 018_bucketizer_4520203aec27 > 019_bucketizer_462c2c346079 > 020_bucketizer_47435822e04c > 021_bucketizer_eb9dccb5e6e8 > 022_bucketizer_b5f63dd7451d > 023_bucketizer_e0fd5041c841 > 024_bucketizer_ffb3b9737100 > 025_bucketizer_e06c0d29273c > 026_bucketizer_36ee535a425f > 027_bucketizer_ee3a330269f1 > 028_bucketizer_094b58ea01c0 > 029_bucketizer_e93ea86c08e2 > 030_bucketizer_4728a718bc4b > 031_bucketizer_08f6189c7fcc > 032_bucketizer_11feb74901e6 > 033_bucketizer_ab4add4966c7 > 034_bucketizer_4474f7f1b8ce > 035_bucketizer_90cfa5918d71 > 036_bucketizer_1a9ff5e4eccb > 037_bucketizer_38085415a4f4 > 038_bucketizer_9b5e5a8d12eb > 039_bucketizer_082bb650ecc3 > 040_bucketizer_57e1e363c483 > 041_bucketizer_337583fbfd65 > 042_bucketizer_73e8f6673262 > 043_bucketizer_0f9394ed30b8 > 044_bucketizer_8530f3570019 > 045_bucketizer_c53614f1e507 > 046_bucketizer_8fd99e6ec27b > 047_bucketizer_6a8610496d8a > 048_bucketizer_888b0055c1ad > 049_bucketizer_974e0a1433a6 > 050_bucketizer_e848c0937cb9 > 051_bucketizer_95611095a4ac > 052_bucketizer_660a6031acd9 > 053_bucketizer_aaffe5a3140d > 054_bucketizer_8dc569be285f > 055_bucketizer_83d1bffa07bc > 056_bucketizer_0c6180ba75e6 > 057_bucketizer_452f265a000d > 058_bucketizer_38e02ddfb447 > 059_bucketizer_6fa4ad5d3ebd > 060_bucketizer_91044ee766ce > 061_bucketizer_9a9ef04a173d > 062_bucketizer_3d98eb15f206 > 063_bucketizer_c4915bb4d4ed > 064_bucketizer_8ca2b6550c38 
> 065_bucketizer_417ee9b760bc > 066_bucketizer_67f3556bebe8 > 067_bucketizer_0556deb652c6 > 068_bucketizer_067b4b3d234c > 069_bucketizer_30ba55321538 > 070_bucketizer_ad826cc5d746 > 071_bucketizer_77676a898055 > 072_bucketizer_05c37a38ce30 > 073_bucketizer_6d9ae54163ed > 074_bucketizer_8cd668b2855d > 075_bucketizer_d50ea1732021 > 076_bucketizer_c68f467c9559 > 077_bucketizer_ee1dfc840db1 > 078_bucketizer_83ec06a32519 > 079_bucketizer_741d08c1b69e > 080_bucketizer_b7402e4829c7 > 081_bucketizer_8adc590dc447 > 082_bucketizer_673be99bdace > 083_bucketizer_77693b45f94c > 084_bucketizer_53529c6b1ac4 > 085_bucketizer_6a3ca776a81e > 086_bucketizer_6679d9588ac1 > 087_bucketizer_6c73af456f65 > 088_bucketizer_2291b2c5ab51 > 089_bucketizer_cb3d0fe669d8 > 090_bucketizer_e71f913c1512 > 091_bucketizer_156528f65ce7 > 092_bucketizer_f3ec5dae079b > 093_bucketizer_809fab77eee1 > 094_bucketizer_6925831511e6 > 095_bucketizer_c5d853b95707 > 096_bucketizer_e677659ca253 > 097_bucketizer_396e35548c72 > 098_bucketizer_78a6410d7a84 > 099_bucketizer_e3ae6e54bca1 > 100_bucketizer_9fed5923fe8a > 101_bucketizer_8925ba4c3ee2 > 102_bucketizer_95750b6942b8 > 103_bucketizer_6e8b50a1918b > 104_bucketizer_36cfcc13d4ba > 105_bucketizer_2716d0455512 > 106_bucketizer_9bcf2891652f > 107_bucketizer_8c3d352915f7 > 108_bucketizer_0786c
[jira] [Commented] (SPARK-22909) Move Structured Streaming v2 APIs to streaming package
[ https://issues.apache.org/jira/browse/SPARK-22909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16304729#comment-16304729 ] Apache Spark commented on SPARK-22909: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/20093 > Move Structured Streaming v2 APIs to streaming package > -- > > Key: SPARK-22909 > URL: https://issues.apache.org/jira/browse/SPARK-22909 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22909) Move Structured Streaming v2 APIs to streaming package
[ https://issues.apache.org/jira/browse/SPARK-22909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22909: Assignee: Apache Spark (was: Shixiong Zhu)
[jira] [Assigned] (SPARK-22909) Move Structured Streaming v2 APIs to streaming package
[ https://issues.apache.org/jira/browse/SPARK-22909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22909: Assignee: Shixiong Zhu (was: Apache Spark)
[jira] [Created] (SPARK-22909) Move Structured Streaming v2 APIs to streaming package
Shixiong Zhu created SPARK-22909: Summary: Move Structured Streaming v2 APIs to streaming package Key: SPARK-22909 URL: https://issues.apache.org/jira/browse/SPARK-22909 Project: Spark Issue Type: Sub-task Components: Structured Streaming Affects Versions: 2.3.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu
[jira] [Created] (SPARK-22908) add basic continuous kafka source
Jose Torres created SPARK-22908: --- Summary: add basic continuous kafka source Key: SPARK-22908 URL: https://issues.apache.org/jira/browse/SPARK-22908 Project: Spark Issue Type: Sub-task Components: Structured Streaming Affects Versions: 2.3.0 Reporter: Jose Torres
[jira] [Commented] (SPARK-22465) Cogroup of two disproportionate RDDs could lead into 2G limit BUG
[ https://issues.apache.org/jira/browse/SPARK-22465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16304569#comment-16304569 ] Apache Spark commented on SPARK-22465: -- User 'jiangxb1987' has created a pull request for this issue: https://github.com/apache/spark/pull/20091 > Cogroup of two disproportionate RDDs could lead into 2G limit BUG > - > > Key: SPARK-22465 > URL: https://issues.apache.org/jira/browse/SPARK-22465 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.2.2, > 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.6.3, > 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0 >Reporter: Amit Kumar >Priority: Critical > Fix For: 2.3.0 > > > While running my spark pipeline, it failed with the following exception > {noformat} > 2017-11-03 04:49:09,776 [Executor task launch worker for task 58670] ERROR > org.apache.spark.executor.Executor - Exception in task 630.0 in stage 28.0 > (TID 58670) > java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE > at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869) > at > org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:103) > at > org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:91) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303) > at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:105) > at > org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:469) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:705) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:285) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:324) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} > After debugging I found that the issue lies with how spark handles cogroup of > two RDDs. > Here is the relevant code from apache spark > {noformat} > /** >* For each key k in `this` or `other`, return a resulting RDD that > contains a tuple with the >* list of values for that key in `this` as well as `other`. >*/ > def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))] = > self.withScope { > cogroup(other, defaultPartitioner(self, other)) > } > /** >* Choose a partitioner to use for a cogroup-like operation between a > number of RDDs. >* >* If any of the RDDs already has a partitioner, choose that one. >* >* Otherwise, we use a default HashPartitioner. For the number of > partitions, if >* spark.default.parallelism is set, then we'll use the value from > SparkContext >* defaultParallelism, otherwise we'll use the max number of upstream > partitions. >* >* Unless spark.default.parallelism is set, the number of partitions will > be the >* same as the number of partitions in the largest upstream RDD, as this > should >* be least likely to cause out-of-memory errors. >* >* We use two method parameters (rdd, others) to enforce callers passing at > least 1 RDD. >*/ > def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = { > val rdds = (Seq(rdd) ++ others) > val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > > 0)) > if (hasPartitioner.nonEmpty) { > hasPartitioner.maxBy(_.partitions.length).partitioner.get > } else { > if (rdd.context.conf.contains("spark.default.parallelism")) { > new HashPartitioner(rdd.context.defaultParallelism) > } else { > new HashPartitioner(rdds.map(_.partitions.length).max) > } > } > } > {noformat} > Given this suppose we have two pair RDDs. 
> RDD1: A small RDD with fewer data and partitions > RDD2: A huge RDD with loads of data and partitions > Now, if the code were to perform a cogroup > {noformat} > val RDD3 = RDD1.cogroup(RDD2) > {noformat} > this could trigger the SPARK-6235 bug if RDD1 has a partitioner when it is > passed into the cogroup. This is because the cogroup's partitions are then decided > by that partitioner, which can lead to the huge RDD2 being shuffled into a small > number of partitions. > One way is probably to add a safety check here that woul
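The partitioner-selection logic quoted above can be sketched in a few lines. The following is a hypothetical Python rendering for illustration only (the real implementation is the Scala `defaultPartitioner` quoted above; the function name and signature here are invented); it shows how a small RDD that already carries a partitioner drags the whole cogroup down to its own partition count.

```python
# Hypothetical Python sketch of the partition-count selection performed by
# Spark's defaultPartitioner, as described in the quoted Scala code above.
# Names and signature are illustrative, not a real Spark API.

def default_num_partitions(upstream_partition_counts,
                           partitioned_rdd_counts,
                           default_parallelism=None):
    # 1. If any RDD already has a partitioner, reuse the one from the RDD
    #    with the most partitions among those that have partitioners.
    if partitioned_rdd_counts:
        return max(partitioned_rdd_counts)
    # 2. Otherwise, honor spark.default.parallelism when it is set.
    if default_parallelism is not None:
        return default_parallelism
    # 3. Otherwise, fall back to the largest upstream partition count.
    return max(upstream_partition_counts)

# Failure mode from the report: RDD1 (4 partitions, has a partitioner)
# cogrouped with RDD2 (10000 partitions, no partitioner) shuffles RDD2
# into only 4 partitions.
```

Under these assumptions, `default_num_partitions([4, 10000], [4])` yields 4: rule 1 wins even though the other RDD is vastly larger, which is exactly the hazard described above.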
[jira] [Commented] (SPARK-22907) MetadataFetchFailedException broadcast is already present
[ https://issues.apache.org/jira/browse/SPARK-22907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16304396#comment-16304396 ] Apache Spark commented on SPARK-22907: -- User 'liupc' has created a pull request for this issue: https://github.com/apache/spark/pull/20090 > MetadataFetchFailedException broadcast is already present > - > > Key: SPARK-22907 > URL: https://issues.apache.org/jira/browse/SPARK-22907 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.3.0 > Environment: Spark2.1.0 + yarn >Reporter: liupengcheng > > Currently, an IOException (possibly caused by the physical environment) may cause a > MetadataFetchFailedException, and the stage will be retried. However, during > the retry of the stage, if the task is scheduled on the same executor > where the first MetadataFetchFailedException happened, a > 'MetadataFetchFailedException: broadcast is already present' will be thrown. > {noformat} > org.apache.spark.shuffle.MetadataFetchFailedException: java.io.IOException: > java.lang.IllegalArgumentException: requirement failed: Block broadcast_22 is > already present in the MemoryStore > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1224) > at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:167) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64) > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88) > at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) > at > org.apache.spark.MapOutputTracker$$anonfun$deserializeMapStatuses$1.apply(MapOutputTracker.scala:675) > at > org.apache.spark.MapOutputTracker$$anonfun$deserializeMapStatuses$1.apply(MapOutputTracker.scala:675) > at org.apache.spark.Logging$class.logInfo(Logging.scala:58) > at > 
org.apache.spark.MapOutputTracker$.logInfo(MapOutputTracker.scala:612) > at > org.apache.spark.MapOutputTracker$.deserializeMapStatuses(MapOutputTracker.scala:674) > at > org.apache.spark.MapOutputTracker.getStatuses(MapOutputTracker.scala:202) > at > org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:141) > at > org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:55) > at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:140) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:136) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at scala.collection.immutable.List.foreach(List.scala:318) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:136) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:96) > at > 
org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:95) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1112) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1112) > at > org.apa
[jira] [Assigned] (SPARK-22907) MetadataFetchFailedException broadcast is already present
[ https://issues.apache.org/jira/browse/SPARK-22907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22907: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-22907) MetadataFetchFailedException broadcast is already present
[ https://issues.apache.org/jira/browse/SPARK-22907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22907: Assignee: Apache Spark
[jira] [Updated] (SPARK-22903) AlreadyBeingCreatedException in stage retry caused by wrong attemptNumber
[ https://issues.apache.org/jira/browse/SPARK-22903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liupengcheng updated SPARK-22903: - Flags: Important > AlreadyBeingCreatedException in stage retry caused by wrong attemptNumber > - > > Key: SPARK-22903 > URL: https://issues.apache.org/jira/browse/SPARK-22903 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.3.0 > Environment: Spark2.1.0 + yarn >Reporter: liupengcheng > Labels: core > > We submitted a Spark 2.1.0 job; however, when a MetadataFetchFailed > exception occurred, the stage was retried, but an AlreadyBeingCreatedException > was thrown and ultimately caused the job to fail. > {noformat} > 2017-12-21,21:30:58,406 WARN org.apache.spark.scheduler.TaskSetManager: Lost > task 13.0 in stage 7.1 (TID 18990, , executor 326): > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): > Failed to create file > [//_temporary/0/_temporary/attempt_201712211720_0026_r_14_0/part-r-00014.snappy.parquet] > for [DFSClient_NONMAPREDUCE_-1477691024_103] for client [10.136.42.10], > because this file is already being created by > [DFSClient_NONMAPREDUCE_940892524_103] on [10.118.21.26] > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2672) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2388) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2317) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2270) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:604) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:374) > at > 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:396) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1806) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) > at org.apache.hadoop.ipc.Client.call(Client.java:1477) > at org.apache.hadoop.ipc.Client.call(Client.java:1408) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) > at com.sun.proxy.$Proxy21.create(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:301) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy22.create(Unknown Source) > at > org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1779) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1773) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1698) > at > org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:433) > at > 
org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:429) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:444) > at > org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:373) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:928) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:909) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:806) > at > org.apache.parque
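The role of the attemptNumber in this failure can be illustrated with a small sketch. The path builder below is hypothetical (it only mimics the `attempt_..._r_..._N` temp-file pattern visible in the log above, not Spark's actual output committer): when a retried task reuses a stale attemptNumber, it builds exactly the same temporary path as the first attempt, whose unreleased HDFS lease then surfaces as AlreadyBeingCreatedException.

```python
# Hypothetical sketch (not Spark's actual committer code) of why a wrong
# attemptNumber causes the HDFS lease collision above: the temporary path
# must differ between attempts of the same task.

def temp_output_path(job_ts, stage, partition, attempt_number):
    # The attempt number is the only component distinguishing retries of
    # the same task partition.
    return (f"_temporary/0/_temporary/"
            f"attempt_{job_ts}_{stage}_r_{partition}_{attempt_number}/"
            f"part-r-{partition:05d}.snappy.parquet")

first       = temp_output_path("201712211720", "0026", 14, 0)
retry_stale = temp_output_path("201712211720", "0026", 14, 0)  # same path: lease clash
retry_fixed = temp_output_path("201712211720", "0026", 14, 1)  # distinct path: no clash
```

With a stale attemptNumber the retry path equals the first attempt's path, reproducing the "this file is already being created by ..." error; bumping the attempt number yields a distinct file.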
[jira] [Commented] (SPARK-22907) MetadataFetchFailedException broadcast is already present
[ https://issues.apache.org/jira/browse/SPARK-22907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16304389#comment-16304389 ] liupengcheng commented on SPARK-22907: -- I think we should clean up the garbage broadcast when the IOException occurs while reading the map statuses broadcast; that way, tasks of the retried stage can avoid throwing the 'MetadataFetchFailedException: broadcast is already present' exception.
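The cleanup proposed in the comment above can be sketched as follows. This is a hypothetical Python model, not Spark's real TorrentBroadcast/MemoryStore API (`MemoryStore`, `read_broadcast`, and `fetch_chunks` are illustrative stand-ins): if the broadcast read fails with an IOException after the block was cached, drop the garbage block before rethrowing, so a retried task on the same executor can re-insert it instead of hitting "Block broadcast_22 is already present in the MemoryStore".

```python
# Hypothetical model of the proposed fix; all names are stand-ins for
# Spark internals, not real Spark APIs.

class MemoryStore:
    """Toy stand-in for Spark's MemoryStore: refuses double-puts."""
    def __init__(self):
        self.blocks = {}

    def put(self, block_id, value):
        if block_id in self.blocks:
            # This is the failure the retried task hits.
            raise ValueError(f"requirement failed: Block {block_id} is "
                             f"already present in the MemoryStore")
        self.blocks[block_id] = value

    def remove(self, block_id):
        self.blocks.pop(block_id, None)

def read_broadcast(store, block_id, fetch_chunks):
    store.put(block_id, None)                     # block registered first
    try:
        store.blocks[block_id] = fetch_chunks()   # may raise IOError mid-read
        return store.blocks[block_id]
    except IOError:
        store.remove(block_id)                    # proposed cleanup on failure
        raise
```

With the cleanup in place, a failed read leaves no stale block behind, so a subsequent attempt on the same executor succeeds instead of raising "already present".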
[jira] [Created] (SPARK-22907) MetadataFetchFailedException broadcast is already present
liupengcheng created SPARK-22907:
------------------------------------

             Summary: MetadataFetchFailedException broadcast is already present
                 Key: SPARK-22907
                 URL: https://issues.apache.org/jira/browse/SPARK-22907
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.1.0, 2.3.0
         Environment: Spark 2.1.0 + YARN
            Reporter: liupengcheng

Currently, an IOException (possibly caused by the physical environment) can cause a MetadataFetchFailedException, after which the stage is retried. However, during the stage retry, if a task is scheduled on the same executor where the first MetadataFetchFailedException happened, a 'MetadataFetchFailedException: broadcast is already present' is thrown.

{noformat}
org.apache.spark.shuffle.MetadataFetchFailedException: java.io.IOException: java.lang.IllegalArgumentException: requirement failed: Block broadcast_22 is already present in the MemoryStore
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1224)
    at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:167)
    at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
    at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
    at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
    at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
    at org.apache.spark.MapOutputTracker$$anonfun$deserializeMapStatuses$1.apply(MapOutputTracker.scala:675)
    at org.apache.spark.MapOutputTracker$$anonfun$deserializeMapStatuses$1.apply(MapOutputTracker.scala:675)
    at org.apache.spark.Logging$class.logInfo(Logging.scala:58)
    at org.apache.spark.MapOutputTracker$.logInfo(MapOutputTracker.scala:612)
    at org.apache.spark.MapOutputTracker$.deserializeMapStatuses(MapOutputTracker.scala:674)
    at org.apache.spark.MapOutputTracker.getStatuses(MapOutputTracker.scala:202)
    at org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:141)
    at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:55)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:140)
    at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:136)
    at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
    at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:136)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:96)
    at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:95)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1112)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1112)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1112)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1252)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1120)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1091)
{noformat}
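The failure mode in the trace above can be sketched with a simplified model (this is NOT Spark's actual MemoryStore code; the class and method names here are hypothetical). In Scala, a failing `require(...)` throws an `IllegalArgumentException` with a "requirement failed: ..." message, which is what surfaces as "Block broadcast_22 is already present in the MemoryStore" when a retried task re-registers a block that the failed first attempt already stored:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical model of a duplicate-block check. A second put
// of the same block id throws the same kind of IllegalArgumentException
// that Scala's require(...) produces in the reported stack trace.
class MemoryStoreModel {
    private final Map<String, byte[]> entries = new HashMap<>();

    void putBlock(String blockId, byte[] data) {
        if (entries.containsKey(blockId)) {
            // Mirrors the message format of a failed Scala require(...)
            throw new IllegalArgumentException(
                "requirement failed: Block " + blockId
                + " is already present in the MemoryStore");
        }
        entries.put(blockId, data);
    }
}
```

In the scenario described in the ticket, the first task attempt stores `broadcast_22`, then fails with an IOException; when the stage retry lands on the same executor, the second attempt's `putBlock("broadcast_22", ...)` trips the duplicate check instead of reusing the cached block.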
[jira] [Assigned] (SPARK-22906) External shuffle IP different from Host ip
[ https://issues.apache.org/jira/browse/SPARK-22906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-22906:
------------------------------------

    Assignee:     (was: Apache Spark)

> External shuffle IP different from Host ip
> ------------------------------------------
>
>                 Key: SPARK-22906
>                 URL: https://issues.apache.org/jira/browse/SPARK-22906
>             Project: Spark
>          Issue Type: Improvement
>          Components: Block Manager
>    Affects Versions: 2.2.1
>            Reporter: Unai Sarasola
>            Priority: Minor
>
> It is now possible to configure spark.shuffle.service.port, but there is no equivalent for the host. Imagine that you are running the external shuffle service in Docker.
> If the container is not using host networking, you may want to use the container's internal IP address to connect to the external shuffle service.
> You could also be using Calico, or the shuffle service could be attached to a different IP than the one used by the Spark executor (for example, hosts with multiple network interfaces).
> This is why I implemented the spark.shuffle.service.host configuration.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22906) External shuffle IP different from Host ip
[ https://issues.apache.org/jira/browse/SPARK-22906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-22906:
------------------------------------

    Assignee: Apache Spark

> External shuffle IP different from Host ip
> ------------------------------------------
>
>                 Key: SPARK-22906
>                 URL: https://issues.apache.org/jira/browse/SPARK-22906
>             Project: Spark
>          Issue Type: Improvement
>          Components: Block Manager
>    Affects Versions: 2.2.1
>            Reporter: Unai Sarasola
>            Assignee: Apache Spark
>            Priority: Minor
>
> It is now possible to configure spark.shuffle.service.port, but there is no equivalent for the host. Imagine that you are running the external shuffle service in Docker.
> If the container is not using host networking, you may want to use the container's internal IP address to connect to the external shuffle service.
> You could also be using Calico, or the shuffle service could be attached to a different IP than the one used by the Spark executor (for example, hosts with multiple network interfaces).
> This is why I implemented the spark.shuffle.service.host configuration.
[jira] [Commented] (SPARK-22906) External shuffle IP different from Host ip
[ https://issues.apache.org/jira/browse/SPARK-22906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16304382#comment-16304382 ]

Apache Spark commented on SPARK-22906:
--------------------------------------

User 'pianista215' has created a pull request for this issue:
https://github.com/apache/spark/pull/20083

> External shuffle IP different from Host ip
> ------------------------------------------
>
>                 Key: SPARK-22906
>                 URL: https://issues.apache.org/jira/browse/SPARK-22906
>             Project: Spark
>          Issue Type: Improvement
>          Components: Block Manager
>    Affects Versions: 2.2.1
>            Reporter: Unai Sarasola
>            Priority: Minor
>
> It is now possible to configure spark.shuffle.service.port, but there is no equivalent for the host. Imagine that you are running the external shuffle service in Docker.
> If the container is not using host networking, you may want to use the container's internal IP address to connect to the external shuffle service.
> You could also be using Calico, or the shuffle service could be attached to a different IP than the one used by the Spark executor (for example, hosts with multiple network interfaces).
> This is why I implemented the spark.shuffle.service.host configuration.
[jira] [Created] (SPARK-22906) External shuffle IP different from Host ip
Unai Sarasola created SPARK-22906:
-------------------------------------

             Summary: External shuffle IP different from Host ip
                 Key: SPARK-22906
                 URL: https://issues.apache.org/jira/browse/SPARK-22906
             Project: Spark
          Issue Type: Improvement
          Components: Block Manager
    Affects Versions: 2.2.1
            Reporter: Unai Sarasola

It is now possible to configure spark.shuffle.service.port, but there is no equivalent for the host. Imagine that you are running the external shuffle service in Docker.

If the container is not using host networking, you may want to use the container's internal IP address to connect to the external shuffle service. You could also be using Calico, or the shuffle service could be attached to a different IP than the one used by the Spark executor (for example, hosts with multiple network interfaces).

This is why I implemented the spark.shuffle.service.host configuration.
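A hypothetical sketch of how the proposed setting might sit next to the existing shuffle-service options in spark-defaults.conf. The key spark.shuffle.service.host is the one proposed in this ticket (not yet a released Spark option at the time of this message), and the IP address is a placeholder:

```
# Existing external shuffle service settings
spark.shuffle.service.enabled  true
spark.shuffle.service.port     7337

# Proposed in SPARK-22906: bind/advertise a specific address, e.g. a
# container-internal IP when the service runs in Docker without host
# networking (172.17.0.5 is a placeholder)
spark.shuffle.service.host     172.17.0.5
```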
[jira] [Updated] (SPARK-22906) External shuffle IP different from Host ip
[ https://issues.apache.org/jira/browse/SPARK-22906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Unai Sarasola updated SPARK-22906:
----------------------------------

    Priority: Minor  (was: Major)

> External shuffle IP different from Host ip
> ------------------------------------------
>
>                 Key: SPARK-22906
>                 URL: https://issues.apache.org/jira/browse/SPARK-22906
>             Project: Spark
>          Issue Type: Improvement
>          Components: Block Manager
>    Affects Versions: 2.2.1
>            Reporter: Unai Sarasola
>            Priority: Minor
>
> It is now possible to configure spark.shuffle.service.port, but there is no equivalent for the host. Imagine that you are running the external shuffle service in Docker.
> If the container is not using host networking, you may want to use the container's internal IP address to connect to the external shuffle service.
> You could also be using Calico, or the shuffle service could be attached to a different IP than the one used by the Spark executor (for example, hosts with multiple network interfaces).
> This is why I implemented the spark.shuffle.service.host configuration.
[jira] [Commented] (SPARK-21136) Misleading error message for typo in SQL
[ https://issues.apache.org/jira/browse/SPARK-21136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16304378#comment-16304378 ]

Denys Zadorozhnyi commented on SPARK-21136:
-------------------------------------------

I dug into this issue in Spark 2.2.1 (ANTLR 4.7) and here are my findings:
# The offending token is taken from the {{InputMismatchException}}, via {{recognizer.getCurrentToken()}} (which is "from" in the examples above).
# In ANTLR 4.7.1, {{InputMismatchException.offendingState}} and {{InputMismatchException.ctx}} are additionally set in these cases, which should give some clues (see [https://github.com/antlr/antlr4/pull/1969] and the issue [https://github.com/antlr/antlr4/issues/1922] for details). However, the error message is generated in ANTLR's {{DefaultErrorStrategy.reportError()}}.

I considered passing an error handler (a {{DefaultErrorStrategy}} subclass) to the parser and overriding {{reportError()}} to construct a new {{InputMismatchException}} with the "correct" {{offendingToken}} and pass it up the chain, but it did not feel right.

> Misleading error message for typo in SQL
> ----------------------------------------
>
>                 Key: SPARK-21136
>                 URL: https://issues.apache.org/jira/browse/SPARK-21136
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Daniel Darabos
>            Priority: Minor
>
> {code}
> scala> spark.sql("select * from a left joinn b on a.id = b.id").show
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input 'from' expecting {<EOF>, 'WHERE', 'GROUP', 'ORDER', 'HAVING', 'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 9)
>
> == SQL ==
> select * from a left joinn b on a.id = b.id
> ---------^^^
> {code}
> The issue is that {{^^^}} points at {{from}}, not at {{joinn}}. The text of the error makes no sense either. If {{*}}, {{a}}, and {{b}} are complex in themselves, a misleading error like this can hinder debugging substantially.
> I tried to see if maybe I could fix this. Am I correct to deduce that the error message originates in ANTLR4, which parses the query based on the syntax defined in {{SqlBase.g4}}? If so, I guess I would have to figure out how that syntax definition works, and why it misattributes the error.
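The misleading caret placement discussed above can be illustrated with a small model of the marker rendering (a simplified sketch, not Spark's actual ParseException formatting code; the class and method names are hypothetical). The underline is drawn at the start offset of whatever token ANTLR reports as offending, so when the reported token is 'from' rather than 'joinn', the carets land under 'from':

```java
// Hypothetical sketch of rendering an underline marker like "---------^^^":
// dashes up to the reported offending token's start offset, then carets.
class CaretMarker {
    static String underline(int offendingTokenStart) {
        return "-".repeat(offendingTokenStart) + "^^^";
    }
}
```

For the query "select * from a left joinn b on a.id = b.id", the real typo 'joinn' starts at offset 21, but ANTLR's current token is 'from' at offset 9, so the marker comes out as in the quoted error, pointing at the wrong place.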