[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569104#comment-16569104 ] Hyukjin Kwon commented on SPARK-24924: -- For a fully qualified path, we can already specify something like {{com.databricks.spark.avro.AvroFormat}}, and I guess that will use the third-party one if I am not mistaken. Probably we should not do this, but this is what we do with CSV, which kind of makes a point as well. Wouldn't we be better off just following what we do? If we should make an error for this case, I guess it should target 3.0.0 for CSV and revert this PR. > Add mapping for built-in Avro data source > - > > Key: SPARK-24924 > URL: https://issues.apache.org/jira/browse/SPARK-24924 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.4.0 > > > This issue aims to the followings. > # Like `com.databricks.spark.csv` mapping, we had better map > `com.databricks.spark.avro` to built-in Avro data source. > # Remove incorrect error message, `Please find an Avro package at ...`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569088#comment-16569088 ] Rifaqat Shah commented on SPARK-14220: -- great! congratulations and thanks.. > Build and test Spark against Scala 2.12 > --- > > Key: SPARK-14220 > URL: https://issues.apache.org/jira/browse/SPARK-14220 > Project: Spark > Issue Type: Umbrella > Components: Build, Project Infra >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Priority: Blocker > Labels: release-notes > Fix For: 2.4.0 > > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.12 milestone. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569085#comment-16569085 ] Wenchen Fan commented on SPARK-24924: - when the short name conflicts, I feel it's better to pick the built-in data source than failing the job and say it conflicts. When the full class name of the data source is specified like com.databricks.spark.avro, we should respect it. > Add mapping for built-in Avro data source > - > > Key: SPARK-24924 > URL: https://issues.apache.org/jira/browse/SPARK-24924 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.4.0 > > > This issue aims to the followings. > # Like `com.databricks.spark.csv` mapping, we had better map > `com.databricks.spark.avro` to built-in Avro data source. > # Remove incorrect error message, `Please find an Avro package at ...`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
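For illustration, a minimal sketch of the two spellings the comments above distinguish, a short name versus a fully qualified provider name. The paths are hypothetical, and which implementation wins for each spelling is exactly the question under discussion in this thread.
{code:scala}
// Short name: resolved through Spark's built-in source mapping (assuming the
// built-in Avro module is on the classpath in 2.4).
val byShortName = spark.read.format("avro").load("/tmp/events.avro")

// Fully qualified provider name, spelled out explicitly by the user; the
// debate above is whether this should keep resolving to a third-party class
// on the classpath or be redirected to the built-in implementation.
val byFullName = spark.read
  .format("com.databricks.spark.avro")
  .load("/tmp/events.avro")
{code}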
[jira] [Resolved] (SPARK-24940) Coalesce Hint for SQL Queries
[ https://issues.apache.org/jira/browse/SPARK-24940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24940. - Resolution: Fixed Assignee: John Zhuge Fix Version/s: 2.4.0 > Coalesce Hint for SQL Queries > - > > Key: SPARK-24940 > URL: https://issues.apache.org/jira/browse/SPARK-24940 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: John Zhuge >Assignee: John Zhuge >Priority: Major > Fix For: 2.4.0 > > > Many Spark SQL users in my company have asked for a way to control the number > of output files in Spark SQL. The users prefer not to use function > repartition\(n\) or coalesce(n, shuffle) that require them to write and > deploy Scala/Java/Python code. > > There are use cases to either reduce or increase the number. > > The DataFrame API has repartition/coalesce for a long time. However, we do > not have an equivalent functionality in SQL queries. We propose adding the > following Hive-style Coalesce hint to Spark SQL. > {noformat} > /*+ COALESCE(n, shuffle) */ > /*+ REPARTITION(n) */ > {noformat} > REPARTITION\(n\) is equal to COALESCE(n, shuffle=true). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
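For illustration, a hedged sketch of how the hints described above look in practice, shown here through the Scala spark.sql entry point (the same statements work in plain SQL). The table name t1 and the partition counts are hypothetical.
{code:scala}
// Reduce the number of output partitions (and therefore files) without a shuffle.
val coalesced = spark.sql("SELECT /*+ COALESCE(5) */ * FROM t1")

// Force a shuffle into exactly 10 partitions; per the description this is the
// COALESCE(n, shuffle=true) form.
val repartitioned = spark.sql("SELECT /*+ REPARTITION(10) */ * FROM t1")

coalesced.write.parquet("/tmp/t1_coalesced")  // illustrative output path
{code}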
[jira] [Resolved] (SPARK-24722) Column-based API for pivoting
[ https://issues.apache.org/jira/browse/SPARK-24722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24722. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21699 [https://github.com/apache/spark/pull/21699] > Column-based API for pivoting > - > > Key: SPARK-24722 > URL: https://issues.apache.org/jira/browse/SPARK-24722 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 2.4.0 > > > Currently, the pivot() function accepts the pivot column as a string. It is > not consistent with the groupBy API and causes an additional problem when using nested > columns as the pivot column. > `Column` support is needed for (a) API consistency, (b) user productivity and > (c) performance. In general, we should follow the POLA - > https://en.wikipedia.org/wiki/Principle_of_least_astonishment in designing > the API. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24722) Column-based API for pivoting
[ https://issues.apache.org/jira/browse/SPARK-24722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-24722: Assignee: Maxim Gekk > Column-based API for pivoting > - > > Key: SPARK-24722 > URL: https://issues.apache.org/jira/browse/SPARK-24722 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 2.4.0 > > > Currently, the pivot() function accepts the pivot column as a string. It is > not consistent with the groupBy API and causes an additional problem when using nested > columns as the pivot column. > `Column` support is needed for (a) API consistency, (b) user productivity and > (c) performance. In general, we should follow the POLA - > https://en.wikipedia.org/wiki/Principle_of_least_astonishment in designing > the API. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
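For illustration, a minimal sketch of the difference this ticket introduces, using a hypothetical sales DataFrame; the string overload is the pre-existing API and the Column overload is the one added here.
{code:scala}
// assumes a spark-shell session with spark.implicits._ in scope
import org.apache.spark.sql.functions._

val sales = Seq((2018, "dotNET", 5000.0), (2018, "Java", 20000.0), (2019, "dotNET", 10000.0))
  .toDF("year", "course", "earnings")

// Existing API: the pivot column is passed as a string.
val byString = sales.groupBy($"year").pivot("course").sum("earnings")

// New API: the pivot column is passed as a Column, which keeps the call
// consistent with groupBy and allows expressions such as nested fields.
val byColumn = sales.groupBy($"year").pivot($"course").sum("earnings")
{code}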
[jira] [Comment Edited] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569071#comment-16569071 ] Hyukjin Kwon edited comment on SPARK-24924 at 8/4/18 5:44 AM: -- If it already throws an error for CSV case too, I would prefer to have the improved error message of course. {quote} I don't buy this agrument, the code has been restructured a lot and you could have introduced bugs, behavior changes, etc. {quote} I have followed the changes in Avro and I don't think there are big differences. We should keep the behaviours in particular within 2.4.0. If I missed some and this introduced a bug or behaviour changes, I personally think we should fix them within 2.4.0. That was one of key things I took into account when I merged some changes. {quote} Users could have also made their own modified version of the databricks spark-avro package (which we actually have to support primitive types) and thus the implementation is not the same and yet you are assuming it is. {quote} In this case, users should provide their own short name of the package. I would say it's discouraged to use the same name with Spark's builtin datasources, or other packages name reserved - I wonder if users would actually try to have the same name in practice. {quote} I'm worried about other users who didn't happen to see this jira. {quote} We will make this in the release note - I think I listed up the possible stories about this in https://issues.apache.org/jira/browse/SPARK-24924?focusedCommentId=16567708&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16567708 {quote} I also realize these are 3rd party packages but I think we are making the assumption here based on this being a databricks package, which in my opinion we shouldn't. What if this was companyX package which we didn't know about, what would/should be the expected behavior? {quote} I think the main reason for this is that the code is actually ported from Avro {{com.databricks.\*}}. The problem here is a worry that {{com.databricks.*}} indicates the builtin Avro, right? {quote} How many users complained about the csv thing? {quote} For instance, I saw these comments/issues below: https://github.com/databricks/spark-csv/issues/367 https://github.com/databricks/spark-csv/issues/373 https://github.com/apache/spark/pull/17916#issuecomment-301898567 For clarification, it's not personally related to me in any way at all but I thought we better keep it consistent with CSV's. To sum up, I get your position but I think the current approach makes a coherent point too. In that case, I think we better follow what we have done with CSV. was (Author: hyukjin.kwon): If it already throws an error for CSV case too, I would prefer to have the improved error message of course. {quote} I don't buy this agrument, the code has been restructured a lot and you could have introduced bugs, behavior changes, etc. {quote} I have followed the changes in Avro and I don't think there are big differences. We should keep the behaviours in particular within 2.4.0. If I missed some and this introduced a bug or behaviour changes, I personally think we should fix them within 2.4.0. That was one of key things I took into account when I merged some changes. {quote} Users could have also made their own modified version of the databricks spark-avro package (which we actually have to support primitive types) and thus the implementation is not the same and yet you are assuming it is. 
{quote} In this case, users should provide their own short name of the package. I would say it's discouraged to use the same name with Spark's builtin datasources, or other packages name reserved - I wonder if users would actually try to have the same name in practice. {quote} I'm worried about other users who didn't happen to see this jira. {quote} We will make this in the release note - I think I listed up the possible stories about this in https://issues.apache.org/jira/browse/SPARK-24924?focusedCommentId=16567708&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16567708 {quote} I also realize these are 3rd party packages but I think we are making the assumption here based on this being a databricks package, which in my opinion we shouldn't. What if this was companyX package which we didn't know about, what would/should be the expected behavior? {quote} I think the main reason for this is that the code is actually ported from Avro {{com.databricks.\*}}. The problem here is a worry that {{com.databricks.*}} indicates the builtin Avro, right? {quote} How many users complained about the csv thing? {quote} For instance, I saw these comments/issues below: https://github.com/databricks/spark-csv/issues/367 https://github.com/databricks/spark-csv/issues/373 https://github.co
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569072#comment-16569072 ] Hyukjin Kwon commented on SPARK-24924: -- Also, for clarification, we already issue warnings: {code} 17/05/10 09:47:44 WARN DataSource: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), defaulting to the internal datasource (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat). {code} So, I guess it's virtually error vs warning. > Add mapping for built-in Avro data source > - > > Key: SPARK-24924 > URL: https://issues.apache.org/jira/browse/SPARK-24924 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.4.0 > > > This issue aims to the followings. > # Like `com.databricks.spark.csv` mapping, we had better map > `com.databricks.spark.avro` to built-in Avro data source. > # Remove incorrect error message, `Please find an Avro package at ...`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569071#comment-16569071 ] Hyukjin Kwon edited comment on SPARK-24924 at 8/4/18 5:42 AM: -- If it already throws an error for CSV case too, I would prefer to have the improved error message of course. {quote} I don't buy this agrument, the code has been restructured a lot and you could have introduced bugs, behavior changes, etc. {quote} I have followed the changes in Avro and I don't think there are big differences. We should keep the behaviours in particular within 2.4.0. If I missed some and this introduced a bug or behaviour changes, I personally think we should fix them within 2.4.0. That was one of key things I took into account when I merged some changes. {quote} Users could have also made their own modified version of the databricks spark-avro package (which we actually have to support primitive types) and thus the implementation is not the same and yet you are assuming it is. {quote} In this case, users should provide their own short name of the package. I would say it's discouraged to use the same name with Spark's builtin datasources, or other packages name reserved - I wonder if users would actually try to have the same name in practice. {quote} I'm worried about other users who didn't happen to see this jira. {quote} We will make this in the release note - I think I listed up the possible stories about this in https://issues.apache.org/jira/browse/SPARK-24924?focusedCommentId=16567708&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16567708 {quote} I also realize these are 3rd party packages but I think we are making the assumption here based on this being a databricks package, which in my opinion we shouldn't. What if this was companyX package which we didn't know about, what would/should be the expected behavior? {quote} I think the main reason for this is that the code is actually ported from Avro {{com.databricks.\*}}. The problem here is a worry that {{com.databricks.*}} indicates the builtin Avro, right? {quote} How many users complained about the csv thing? {quote} For instance, I saw these comments/issues below: https://github.com/databricks/spark-csv/issues/367 https://github.com/databricks/spark-csv/issues/373 https://github.com/apache/spark/pull/17916#issuecomment-301898567 For clarification, it's related to me in any way but I thought we better keep it consistent with CSV's. To sum up, I get your position but I think the current approach makes a coherent point too. In that case, I think we better follow what we have done with CSV. was (Author: hyukjin.kwon): If it already throws an error for CSV case too, I would prefer to have the improved error message of course. {quote} I don't buy this agrument, the code has been restructured a lot and you could have introduced bugs, behavior changes, etc. {quote} I have followed the changes in Avro and I don't think there are big differences. We should keep the behaviours in particular within 2.4.0. If I missed some and this introduced a bug or behaviour changes, I personally think we should fix them within 2.4.0. That was one of key things I took into account when I merged some changes. {quote} Users could have also made their own modified version of the databricks spark-avro package (which we actually have to support primitive types) and thus the implementation is not the same and yet you are assuming it is. 
{quote} In this case, users should provide their own short name of the package. I would say it's discouraged to use the same name with Spark's builtin datasources, or other packages name reserved - I wonder if users would actually try to have the same name in practice. {quote} I'm worried about other users who didn't happen to see this jira. {quote} We will make this in the release note - I think I listed up the possible stories about this in https://issues.apache.org/jira/browse/SPARK-24924?focusedCommentId=16567708&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16567708 {quote} I also realize these are 3rd party packages but I think we are making the assumption here based on this being a databricks package, which in my opinion we shouldn't. What if this was companyX package which we didn't know about, what would/should be the expected behavior? {quote} I think the main reason for this is that the code is actually ported from Avro {{com.databricks.\*}}. The problem here is a worry that {{com.databricks.*}} indicates the builtin Avro, right? {quote} How many users complained about the csv thing? {quote} So far, I see some issues as below: https://github.com/databricks/spark-csv/issues/367 https://github.com/databricks/spark-csv/issues/373 https://github.com/apache/spark/pull/17916#issuecomm
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569071#comment-16569071 ] Hyukjin Kwon commented on SPARK-24924: -- If it already throws an error for CSV case too, I would prefer to have the improved error message of course. {quote} I don't buy this agrument, the code has been restructured a lot and you could have introduced bugs, behavior changes, etc. {quote} I have followed the changes in Avro and I don't think there are big differences. We should keep the behaviours in particular within 2.4.0. If I missed some and this introduced a bug or behaviour changes, I personally think we should fix them within 2.4.0. That was one of key things I took into account when I merged some changes. {quote} Users could have also made their own modified version of the databricks spark-avro package (which we actually have to support primitive types) and thus the implementation is not the same and yet you are assuming it is. {quote} In this case, users should provide their own short name of the package. I would say it's discouraged to use the same name with Spark's builtin datasources, or other packages name reserved - I wonder if users would actually try to have the same name in practice. {quote} I'm worried about other users who didn't happen to see this jira. {quote} We will make this in the release note - I think I listed up the possible stories about this in https://issues.apache.org/jira/browse/SPARK-24924?focusedCommentId=16567708&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16567708 {quote} I also realize these are 3rd party packages but I think we are making the assumption here based on this being a databricks package, which in my opinion we shouldn't. What if this was companyX package which we didn't know about, what would/should be the expected behavior? {quote} I think the main reason for this is that the code is actually ported from Avro {{com.databricks.\*}}. The problem here is a worry that {{com.databricks.*}} indicates the builtin Avro, right? {quote} How many users complained about the csv thing? {quote} So far, I see some issues as below: https://github.com/databricks/spark-csv/issues/367 https://github.com/databricks/spark-csv/issues/373 https://github.com/apache/spark/pull/17916#issuecomment-301898567 For clarification, it's related to me in any way but I thought we better keep it consistent with CSV's. To sum up, I get your position but I think the current approach makes a coherent point too. In that case, I think we better follow what we have done with CSV. > Add mapping for built-in Avro data source > - > > Key: SPARK-24924 > URL: https://issues.apache.org/jira/browse/SPARK-24924 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.4.0 > > > This issue aims to the followings. > # Like `com.databricks.spark.csv` mapping, we had better map > `com.databricks.spark.avro` to built-in Avro data source. > # Remove incorrect error message, `Please find an Avro package at ...`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24888) spark-submit --master spark://host:port --status driver-id does not work
[ https://issues.apache.org/jira/browse/SPARK-24888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24888: Assignee: (was: Apache Spark) > spark-submit --master spark://host:port --status driver-id does not work > - > > Key: SPARK-24888 > URL: https://issues.apache.org/jira/browse/SPARK-24888 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.3.1 >Reporter: srinivasan >Priority: Major > > spark-submit --master spark://host:port --status driver-id > does not return anything. The command terminates without any error or output. > Behaviour is the same from linux or windows -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24888) spark-submit --master spark://host:port --status driver-id does not work
[ https://issues.apache.org/jira/browse/SPARK-24888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568969#comment-16568969 ] Apache Spark commented on SPARK-24888: -- User 'devaraj-kavali' has created a pull request for this issue: https://github.com/apache/spark/pull/21996 > spark-submit --master spark://host:port --status driver-id does not work > - > > Key: SPARK-24888 > URL: https://issues.apache.org/jira/browse/SPARK-24888 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.3.1 >Reporter: srinivasan >Priority: Major > > spark-submit --master spark://host:port --status driver-id > does not return anything. The command terminates without any error or output. > Behaviour is the same from linux or windows -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24888) spark-submit --master spark://host:port --status driver-id does not work
[ https://issues.apache.org/jira/browse/SPARK-24888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24888: Assignee: Apache Spark > spark-submit --master spark://host:port --status driver-id does not work > - > > Key: SPARK-24888 > URL: https://issues.apache.org/jira/browse/SPARK-24888 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.3.1 >Reporter: srinivasan >Assignee: Apache Spark >Priority: Major > > spark-submit --master spark://host:port --status driver-id > does not return anything. The command terminates without any error or output. > Behaviour is the same from linux or windows -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24928) spark sql cross join running time too long
[ https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568893#comment-16568893 ] Matthew Normyle commented on SPARK-24928: - In CartesianRDD.compute, changing:
{code}
for (x <- rdd1.iterator(currSplit.s1, context);
     y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)
{code}
to:
{code}
val it1 = rdd1.iterator(currSplit.s1, context)
val it2 = rdd2.iterator(currSplit.s2, context)
for (x <- it1; y <- it2) yield (x, y)
{code}
Seems to resolve this issue. I am brand new to Scala and Spark. Does anyone have any insight as to why this seemingly superficial change could make such a large difference? > spark sql cross join running time too long > -- > > Key: SPARK-24928 > URL: https://issues.apache.org/jira/browse/SPARK-24928 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 1.6.2 >Reporter: LIFULONG >Priority: Minor > > spark sql running time is too long while input left table and right table is > small hdfs text format data, > the sql is: select * from t1 cross join t2 > the line of t1 is 49, three column > the line of t2 is 1, one column only > running more than 30mins and then failed > > > spark CartesianRDD also has the same problem, example test code is: > val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b") //1 line > 1 column > val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b") //49 > line 3 column > val cartesian = new CartesianRDD(sc, twos, ones) > cartesian.count() > running more than 5 mins,while use CartesianRDD(sc, ones, twos) , it only use > less than 10 seconds -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568863#comment-16568863 ] Apache Spark commented on SPARK-18057: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/21995 > Update structured streaming kafka from 0.10.0.1 to 2.0.0 > > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger >Assignee: Ted Yu >Priority: Major > Fix For: 2.4.0 > > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25024) Update mesos documentation to be clear about security supported
[ https://issues.apache.org/jira/browse/SPARK-25024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568810#comment-16568810 ] Imran Rashid commented on SPARK-25024: -- [~rvesse] [~arand] maybe you could take a stab at this based on your work on SPARK-16501? Also pinging some other folks knowledgeable about mesos: [~skonto] [~tnachen] > Update mesos documentation to be clear about security supported > --- > > Key: SPARK-25024 > URL: https://issues.apache.org/jira/browse/SPARK-25024 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.2.2 >Reporter: Thomas Graves >Priority: Major > > I was reading through our mesos deployment docs and security docs and its not > clear at all what type of security and how to set it up for mesos. I think > we should clarify this and have something about exactly what is supported and > what is not. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24529) Add spotbugs into maven build process
[ https://issues.apache.org/jira/browse/SPARK-24529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24529: Assignee: Kazuaki Ishizaki (was: Apache Spark) > Add spotbugs into maven build process > - > > Key: SPARK-24529 > URL: https://issues.apache.org/jira/browse/SPARK-24529 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Minor > > We will enable a Java bytecode check tool > [spotbugs|https://spotbugs.github.io/] to avoid possible integer overflow at > multiplication. Due to the tool limitation, some other checks will be enabled. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24529) Add spotbugs into maven build process
[ https://issues.apache.org/jira/browse/SPARK-24529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568806#comment-16568806 ] Apache Spark commented on SPARK-24529: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/21994 > Add spotbugs into maven build process > - > > Key: SPARK-24529 > URL: https://issues.apache.org/jira/browse/SPARK-24529 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Minor > > We will enable a Java bytecode check tool > [spotbugs|https://spotbugs.github.io/] to avoid possible integer overflow at > multiplication. Due to the tool limitation, some other checks will be enabled. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24529) Add spotbugs into maven build process
[ https://issues.apache.org/jira/browse/SPARK-24529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24529: Assignee: Apache Spark (was: Kazuaki Ishizaki) > Add spotbugs into maven build process > - > > Key: SPARK-24529 > URL: https://issues.apache.org/jira/browse/SPARK-24529 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Kazuaki Ishizaki >Assignee: Apache Spark >Priority: Minor > > We will enable a Java bytecode check tool > [spotbugs|https://spotbugs.github.io/] to avoid possible integer overflow at > multiplication. Due to the tool limitation, some other checks will be enabled. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25023) Clarify Spark security documentation
[ https://issues.apache.org/jira/browse/SPARK-25023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568805#comment-16568805 ] Thomas Graves commented on SPARK-25023: --- note some of this was already updated with [https://github.com/apache/spark/pull/20742] but I think there are still a few clarifications to make > Clarify Spark security documentation > > > Key: SPARK-25023 > URL: https://issues.apache.org/jira/browse/SPARK-25023 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.2.2 >Reporter: Thomas Graves >Priority: Major > > I was reading through our deployment docs and security docs and it's not clear > at all what deployment modes support exactly what for security. I think we > should clarify that security is off by default on all > deployments. We may also want to clarify the types of communication used > that would need to be secured. We may also want to clarify multi-tenant safe > vs other things, like standalone mode for instance in my opinion is just not > secure, we do talk about using spark.authenticate for a secret but all > applications would use the same secret. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-25024) Update mesos documentation to be clear about security supported
[ https://issues.apache.org/jira/browse/SPARK-25024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-25024: -- Comment: was deleted (was: I'm going to work on this.) > Update mesos documentation to be clear about security supported > --- > > Key: SPARK-25024 > URL: https://issues.apache.org/jira/browse/SPARK-25024 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.2.2 >Reporter: Thomas Graves >Priority: Major > > I was reading through our mesos deployment docs and security docs and its not > clear at all what type of security and how to set it up for mesos. I think > we should clarify this and have something about exactly what is supported and > what is not. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25023) Clarify Spark security documentation
[ https://issues.apache.org/jira/browse/SPARK-25023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568767#comment-16568767 ] Thomas Graves commented on SPARK-25023: --- I'm going to work on this. > Clarify Spark security documentation > > > Key: SPARK-25023 > URL: https://issues.apache.org/jira/browse/SPARK-25023 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.2.2 >Reporter: Thomas Graves >Priority: Major > > I was reading through our deployment docs and security docs and it's not clear > at all what deployment modes support exactly what for security. I think we > should clarify that security is off by default on all > deployments. We may also want to clarify the types of communication used > that would need to be secured. We may also want to clarify multi-tenant safe > vs other things, like standalone mode for instance in my opinion is just not > secure, we do talk about using spark.authenticate for a secret but all > applications would use the same secret. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25024) Update mesos documentation to be clear about security supported
[ https://issues.apache.org/jira/browse/SPARK-25024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568766#comment-16568766 ] Thomas Graves commented on SPARK-25024: --- I'm going to work on this. > Update mesos documentation to be clear about security supported > --- > > Key: SPARK-25024 > URL: https://issues.apache.org/jira/browse/SPARK-25024 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.2.2 >Reporter: Thomas Graves >Priority: Major > > I was reading through our mesos deployment docs and security docs and its not > clear at all what type of security and how to set it up for mesos. I think > we should clarify this and have something about exactly what is supported and > what is not. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25024) Update mesos documentation to be clear about security supported
Thomas Graves created SPARK-25024: - Summary: Update mesos documentation to be clear about security supported Key: SPARK-25024 URL: https://issues.apache.org/jira/browse/SPARK-25024 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 2.2.2 Reporter: Thomas Graves I was reading through our mesos deployment docs and security docs and its not clear at all what type of security and how to set it up for mesos. I think we should clarify this and have something about exactly what is supported and what is not. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25023) Clarify Spark security documentation
Thomas Graves created SPARK-25023: - Summary: Clarify Spark security documentation Key: SPARK-25023 URL: https://issues.apache.org/jira/browse/SPARK-25023 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 2.2.2 Reporter: Thomas Graves I was reading through our deployment docs and security docs and it's not clear at all what deployment modes support exactly what for security. I think we should clarify that security is off by default on all deployments. We may also want to clarify the types of communication used that would need to be secured. We may also want to clarify multi-tenant safe vs other things, like standalone mode for instance in my opinion is just not secure, we do talk about using spark.authenticate for a secret but all applications would use the same secret. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24983) Collapsing multiple project statements with dependent When-Otherwise statements on the same column can OOM the driver
[ https://issues.apache.org/jira/browse/SPARK-24983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568730#comment-16568730 ] Apache Spark commented on SPARK-24983: -- User 'dvogelbacher' has created a pull request for this issue: https://github.com/apache/spark/pull/21993 > Collapsing multiple project statements with dependent When-Otherwise > statements on the same column can OOM the driver > - > > Key: SPARK-24983 > URL: https://issues.apache.org/jira/browse/SPARK-24983 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.3.1 >Reporter: David Vogelbacher >Priority: Major > > I noticed that writing a spark job that includes many sequential > {{when-otherwise}} statements on the same column can easily OOM the driver > while generating the optimized plan because the project node will grow > exponentially in size. > Example: > {noformat} > scala> import org.apache.spark.sql.functions._ > import org.apache.spark.sql.functions._ > scala> val df = Seq("a", "b", "c", "1").toDF("text") > df: org.apache.spark.sql.DataFrame = [text: string] > scala> var dfCaseWhen = df.filter($"text" =!= lit("0")) > dfCaseWhen: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [text: > string] > scala> for( a <- 1 to 5) { > | dfCaseWhen = dfCaseWhen.withColumn("text", when($"text" === > lit(a.toString), lit("r" + a.toString)).otherwise($"text")) > | } > scala> dfCaseWhen.queryExecution.analyzed > res6: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = > Project [CASE WHEN (text#12 = 5) THEN r5 ELSE text#12 END AS text#14] > +- Project [CASE WHEN (text#10 = 4) THEN r4 ELSE text#10 END AS text#12] >+- Project [CASE WHEN (text#8 = 3) THEN r3 ELSE text#8 END AS text#10] > +- Project [CASE WHEN (text#6 = 2) THEN r2 ELSE text#6 END AS text#8] > +- Project [CASE WHEN (text#3 = 1) THEN r1 ELSE text#3 END AS text#6] > +- Filter NOT (text#3 = 0) >+- Project [value#1 AS text#3] > +- LocalRelation [value#1] > scala> dfCaseWhen.queryExecution.optimizedPlan > res5: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = > Project [CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (value#1 = 1) > THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 > ELSE value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN (value#1 = 1) > THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 > ELSE value#1 END END END = 4) THEN r4 ELSE CASE WHEN (CASE WHEN (CASE WHEN > (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = > 1) THEN r1 ELSE value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN > (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = > 1) THEN r1 ELSE value#1 END END END END = 5) THEN r5 ELSE CASE WHEN (CASE > WHEN (CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 ELSE va... > {noformat} > As one can see the optimized plan grows exponentially in the number of > {{when-otherwise}} statements here. > I can see that this comes from the {{CollapseProject}} optimizer rule. > Maybe we should put a limit on the resulting size of the project node after > collapsing and only collapse if we stay under the limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24983) Collapsing multiple project statements with dependent When-Otherwise statements on the same column can OOM the driver
[ https://issues.apache.org/jira/browse/SPARK-24983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24983: Assignee: Apache Spark > Collapsing multiple project statements with dependent When-Otherwise > statements on the same column can OOM the driver > - > > Key: SPARK-24983 > URL: https://issues.apache.org/jira/browse/SPARK-24983 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.3.1 >Reporter: David Vogelbacher >Assignee: Apache Spark >Priority: Major > > I noticed that writing a spark job that includes many sequential > {{when-otherwise}} statements on the same column can easily OOM the driver > while generating the optimized plan because the project node will grow > exponentially in size. > Example: > {noformat} > scala> import org.apache.spark.sql.functions._ > import org.apache.spark.sql.functions._ > scala> val df = Seq("a", "b", "c", "1").toDF("text") > df: org.apache.spark.sql.DataFrame = [text: string] > scala> var dfCaseWhen = df.filter($"text" =!= lit("0")) > dfCaseWhen: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [text: > string] > scala> for( a <- 1 to 5) { > | dfCaseWhen = dfCaseWhen.withColumn("text", when($"text" === > lit(a.toString), lit("r" + a.toString)).otherwise($"text")) > | } > scala> dfCaseWhen.queryExecution.analyzed > res6: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = > Project [CASE WHEN (text#12 = 5) THEN r5 ELSE text#12 END AS text#14] > +- Project [CASE WHEN (text#10 = 4) THEN r4 ELSE text#10 END AS text#12] >+- Project [CASE WHEN (text#8 = 3) THEN r3 ELSE text#8 END AS text#10] > +- Project [CASE WHEN (text#6 = 2) THEN r2 ELSE text#6 END AS text#8] > +- Project [CASE WHEN (text#3 = 1) THEN r1 ELSE text#3 END AS text#6] > +- Filter NOT (text#3 = 0) >+- Project [value#1 AS text#3] > +- LocalRelation [value#1] > scala> dfCaseWhen.queryExecution.optimizedPlan > res5: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = > Project [CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (value#1 = 1) > THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 > ELSE value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN (value#1 = 1) > THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 > ELSE value#1 END END END = 4) THEN r4 ELSE CASE WHEN (CASE WHEN (CASE WHEN > (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = > 1) THEN r1 ELSE value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN > (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = > 1) THEN r1 ELSE value#1 END END END END = 5) THEN r5 ELSE CASE WHEN (CASE > WHEN (CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 ELSE va... > {noformat} > As one can see the optimized plan grows exponentially in the number of > {{when-otherwise}} statements here. > I can see that this comes from the {{CollapseProject}} optimizer rule. > Maybe we should put a limit on the resulting size of the project node after > collapsing and only collapse if we stay under the limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24983) Collapsing multiple project statements with dependent When-Otherwise statements on the same column can OOM the driver
[ https://issues.apache.org/jira/browse/SPARK-24983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24983: Assignee: (was: Apache Spark) > Collapsing multiple project statements with dependent When-Otherwise > statements on the same column can OOM the driver > - > > Key: SPARK-24983 > URL: https://issues.apache.org/jira/browse/SPARK-24983 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.3.1 >Reporter: David Vogelbacher >Priority: Major > > I noticed that writing a spark job that includes many sequential > {{when-otherwise}} statements on the same column can easily OOM the driver > while generating the optimized plan because the project node will grow > exponentially in size. > Example: > {noformat} > scala> import org.apache.spark.sql.functions._ > import org.apache.spark.sql.functions._ > scala> val df = Seq("a", "b", "c", "1").toDF("text") > df: org.apache.spark.sql.DataFrame = [text: string] > scala> var dfCaseWhen = df.filter($"text" =!= lit("0")) > dfCaseWhen: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [text: > string] > scala> for( a <- 1 to 5) { > | dfCaseWhen = dfCaseWhen.withColumn("text", when($"text" === > lit(a.toString), lit("r" + a.toString)).otherwise($"text")) > | } > scala> dfCaseWhen.queryExecution.analyzed > res6: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = > Project [CASE WHEN (text#12 = 5) THEN r5 ELSE text#12 END AS text#14] > +- Project [CASE WHEN (text#10 = 4) THEN r4 ELSE text#10 END AS text#12] >+- Project [CASE WHEN (text#8 = 3) THEN r3 ELSE text#8 END AS text#10] > +- Project [CASE WHEN (text#6 = 2) THEN r2 ELSE text#6 END AS text#8] > +- Project [CASE WHEN (text#3 = 1) THEN r1 ELSE text#3 END AS text#6] > +- Filter NOT (text#3 = 0) >+- Project [value#1 AS text#3] > +- LocalRelation [value#1] > scala> dfCaseWhen.queryExecution.optimizedPlan > res5: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = > Project [CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (value#1 = 1) > THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 > ELSE value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN (value#1 = 1) > THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 > ELSE value#1 END END END = 4) THEN r4 ELSE CASE WHEN (CASE WHEN (CASE WHEN > (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = > 1) THEN r1 ELSE value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN > (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = > 1) THEN r1 ELSE value#1 END END END END = 5) THEN r5 ELSE CASE WHEN (CASE > WHEN (CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 ELSE va... > {noformat} > As one can see the optimized plan grows exponentially in the number of > {{when-otherwise}} statements here. > I can see that this comes from the {{CollapseProject}} optimizer rule. > Maybe we should put a limit on the resulting size of the project node after > collapsing and only collapse if we stay under the limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
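For illustration, a hedged sketch of a user-side way to sidestep the exponential growth described above: build one chained CASE WHEN expression and apply it with a single withColumn call instead of N dependent withColumn calls. This assumes, as in the ticket's example, that the rewritten values ("r1", "r2", ...) can never match a later condition, so the chained form produces the same result.
{code:scala}
// assumes a spark-shell session with spark.implicits._ in scope
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

val df = Seq("a", "b", "c", "1").toDF("text")

// Fold the five conditions into one nested expression of linear size.
var rewritten: Column = $"text"
for (a <- 1 to 5) {
  rewritten = when($"text" === lit(a.toString), lit("r" + a.toString)).otherwise(rewritten)
}

val dfCaseWhen = df.filter($"text" =!= lit("0")).withColumn("text", rewritten)
// The optimized plan now contains a single Project with one CASE WHEN chain
// instead of the nested blow-up shown in the description.
{code}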
[jira] [Commented] (SPARK-21986) QuantileDiscretizer picks wrong split point for data with lots of 0's
[ https://issues.apache.org/jira/browse/SPARK-21986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568718#comment-16568718 ] Barry Becker commented on SPARK-21986: -- Here are a couple more test cases that show the problem: {code:java} test("Quantile discretizer on data with that is only -1, and 1 (and mostly -1)") { verify(Seq(-1, -1, 1, -1, -1, -1, 1, -1, -1, 1, -1), Seq(Double.NegativeInfinity, -1, Double.PositiveInfinity)) } test("Quantile discretizer on data with that is only -1, 0, and 1 (and mostly -1)") { verify(Seq(-1, -1, 1, -1, -1, -1, 1, 0, -1, -1, -1, 1, -1), Seq(Double.NegativeInfinity, -1, Double.PositiveInfinity)) } test("Quantile discretizer on data with that is only -1, 0, and 1 ") { // this is ok verify(Seq(-1, -1, 1, -1, -1, -1, 1, 0, -1, 1, -1), Seq(Double.NegativeInfinity, -1, 0, Double.PositiveInfinity)) }{code} If the bin were defined as (low, high] instead of [low, high), then I believe all the cases would be correct. Maybe if all the cuts has a very small epsilon added, or simply selected the next distinct value, then they would also all be correct. > QuantileDiscretizer picks wrong split point for data with lots of 0's > - > > Key: SPARK-21986 > URL: https://issues.apache.org/jira/browse/SPARK-21986 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.1 >Reporter: Barry Becker >Priority: Minor > > I have some simple test cases to help illustrate (see below). > I discovered this with data that had 96,000 rows, but can reproduce with much > smaller data that has roughly the same distribution of values. > If I have data like > Seq(0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0) > and ask for 3 buckets, then it does the right thing and yields splits of > Seq(Double.NegativeInfinity, 0.0, 40.0, Double.PositiveInfinity) > However, if I add just one more zero, such that I have data like > Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0) > then it will do the wrong thing and give splits of > Seq(Double.NegativeInfinity, 0.0, Double.PositiveInfinity)) > I'm not bothered that it gave fewer buckets than asked for (that is to be > expected), but I am bothered that it picked 0.0 instead of 40 as the one > split point. > The way it did it, now I have 1 bucket with all the data, and a second with > none of the data. > Am I interpreting something wrong? > Here are my 2 test cases in scala: > {code} > class QuantileDiscretizerSuite extends FunSuite { > test("Quantile discretizer on data with lots of 0") { > verify(Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0), > Seq(Double.NegativeInfinity, 0.0, Double.PositiveInfinity)) > } > test("Quantile discretizer on data with one less 0") { > verify(Seq(0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0), > Seq(Double.NegativeInfinity, 0.0, 40.0, Double.PositiveInfinity)) > } > > def verify(data: Seq[Int], expectedSplits: Seq[Double]): Unit = { > val theData: Seq[(Int, Double)] = data.map { > case x: Int => (x, 0.0) > case _ => (0, 0.0) > } > val df = SPARK_SESSION.sqlContext.createDataFrame(theData).toDF("rawCol", > "unused") > val qb = new QuantileDiscretizer() > .setInputCol("rawCol") > .setOutputCol("binnedColumn") > .setRelativeError(0.0) > .setNumBuckets(3) > .fit(df) > assertResult(expectedSplits) {qb.getSplits} > } > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
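For illustration, a hedged sketch of where the 0.0 split appears to come from: as far as I can tell QuantileDiscretizer computes its splits with approxQuantile, and with 9 of the 12 values equal to 0 both the 1/3 and 2/3 quantiles land on 0, which then collapses into the single cut reported above.
{code:scala}
// assumes a spark-shell session with spark.implicits._ in scope
val data = Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0).map(_.toDouble)
val df = data.toDF("rawCol")

// 3 buckets -> interior quantiles at 1/3 and 2/3, with relativeError = 0.0
val cuts = df.stat.approxQuantile("rawCol", Array(1.0 / 3, 2.0 / 3), 0.0)
println(cuts.mkString(", "))  // expected: 0.0, 0.0 -> distinct splits (-Inf, 0.0, +Inf)
{code}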
[jira] [Commented] (SPARK-24375) Design sketch: support barrier scheduling in Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-24375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568664#comment-16568664 ] Mridul Muralidharan commented on SPARK-24375: - {quote} It's not desired behavior to catch exception thrown by TaskContext.barrier() silently. However, in case this really happens, we can detect that because we have `epoch` both in driver side and executor side, more details will go to the design doc of BarrierTaskContext.barrier() SPARK-24581 {quote} The current 'barrier' function does not identify 'which' barrier it is from a user point of view. Here, due to exceptions raised (not necessarily from barrier(), but could be from user code as well), different tasks are waiting on different barriers. {code} try { ... snippet A ... // Barrier 1 context.barrier() ... snippet B ... } catch { ... } ... snippet C ... // Barrier 2 context.barrier() {code} T1 waits on barrier 1, T2 could have raised exception in snippet A and ends up waiting on Barrier 2 (having never seen Barrier 1). In this scenario, how is spark making progress ? (And ofcourse, when T1 reaches barrier 2, when T2 has moved past it). I did not see this clarified in the design or in the implementation. > Design sketch: support barrier scheduling in Apache Spark > - > > Key: SPARK-24375 > URL: https://issues.apache.org/jira/browse/SPARK-24375 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Jiang Xingbo >Priority: Major > > This task is to outline a design sketch for the barrier scheduling SPIP > discussion. It doesn't need to be a complete design before the vote. But it > should at least cover both Scala/Java and PySpark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25020) Unable to Perform Graceful Shutdown in Spark Streaming with Hadoop 2.8
[ https://issues.apache.org/jira/browse/SPARK-25020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ricky Saltzer updated SPARK-25020: -- Description: Opening this up to give you guys some insight in an issue that will occur when using Spark Streaming with Hadoop 2.8. Hadoop 2.8 added HADOOP-12950 which adds a upper limit of a 10 second timeout for its shutdown hook. From our tests, if the Spark job takes longer than 10 seconds to shutdown gracefully, the Hadoop shutdown thread seems to trample over the graceful shutdown and throw an exception like {code:java} 18/08/03 17:21:04 ERROR Utils: Uncaught exception in thread pool-1-thread-1 java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1039) at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) at org.apache.spark.streaming.scheduler.ReceiverTracker.stop(ReceiverTracker.scala:177) at org.apache.spark.streaming.scheduler.JobScheduler.stop(JobScheduler.scala:114) at org.apache.spark.streaming.StreamingContext$$anonfun$stop$1.apply$mcV$sp(StreamingContext.scala:682) at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1317) at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:681) at org.apache.spark.streaming.StreamingContext.org$apache$spark$streaming$StreamingContext$$stopOnShutdown(StreamingContext.scala:715) at org.apache.spark.streaming.StreamingContext$$anonfun$start$1.apply$mcV$sp(StreamingContext.scala:599) at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1948) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) at scala.util.Try$.apply(Try.scala:192) at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748){code} The reason I hit this issue is because we recently upgraded to EMR 5.15, which has both Spark 2.3 & Hadoop 2.8. The following workaround has proven successful to us (in limited testing) Instead of just running {code:java} ... ssc.start() ssc.awaitTermination(){code} We needed to do the following {code:java} ... 
ssc.start() sys.ShutdownHookThread { ssc.stop(true, true) } ssc.awaitTermination(){code} As far as I can tell, there is no way to override the default {{10 second}} timeout in HADOOP-12950, which is why we had to go with the workaround. Note: I also verified this bug exists even with EMR 5.12.1 which runs Spark 2.2.x & Hadoop 2.8. Ricky Epic Games was: Opening this up to give you guys some insight in an issue that will occur when using Spark Streaming with Hadoop 2.8. Hadoop 2.8 added HADOOP-12950 which adds a upper limit of a 10 second timeout for its shutdown hook. From our tests, if the Spark job takes longer than 10 seconds to shutdown gracefully, the Hadoop shutdown thread seems to trample over the graceful shutdown and throw an exception like {code:java} 18/08/03 17:21:04 ERROR Utils: Uncaught exception in thread pool-1-thread-1 java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1039) at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) at org.apache.spark.streaming.scheduler.ReceiverTracker.stop(ReceiverTracker.scala:177) at org.apache.spark.streaming.scheduler.JobScheduler.stop(JobScheduler.scala:114) at org.apache.spark.streaming.StreamingContext$$anonfun$stop$1.apply$mcV$sp(StreamingContext.scala:682) at org.apache.spark.util.Utils$.tryLogN
[jira] [Created] (SPARK-25022) Add spark.executor.pyspark.memory support to Mesos
Ryan Blue created SPARK-25022: - Summary: Add spark.executor.pyspark.memory support to Mesos Key: SPARK-25022 URL: https://issues.apache.org/jira/browse/SPARK-25022 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 2.3.0 Reporter: Ryan Blue SPARK-25004 adds {{spark.executor.pyspark.memory}} to control the memory allocation for PySpark and updates YARN to add this memory to its container requests. Mesos should do something similar to account for the python memory allocation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25021) Add spark.executor.pyspark.memory support to Kubernetes
Ryan Blue created SPARK-25021: - Summary: Add spark.executor.pyspark.memory support to Kubernetes Key: SPARK-25021 URL: https://issues.apache.org/jira/browse/SPARK-25021 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 2.3.0 Reporter: Ryan Blue SPARK-25004 adds {{spark.executor.pyspark.memory}} to control the memory allocation for PySpark and updates YARN to add this memory to its container requests. Kubernetes should do something similar to account for the python memory allocation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25020) Unable to Perform Graceful Shutdown in Spark Streaming with Hadoop 2.8
Ricky Saltzer created SPARK-25020: - Summary: Unable to Perform Graceful Shutdown in Spark Streaming with Hadoop 2.8 Key: SPARK-25020 URL: https://issues.apache.org/jira/browse/SPARK-25020 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.1, 2.3.0, 2.2.2, 2.2.1, 2.2.0 Environment: Spark Streaming -- Tested on 2.2 & 2.3 (more than likely affects all versions with graceful shutdown) Hadoop 2.8 Reporter: Ricky Saltzer Opening this up to give you guys some insight in an issue that will occur when using Spark Streaming with Hadoop 2.8. Hadoop 2.8 added HADOOP-12950 which adds a upper limit of a 10 second timeout for its shutdown hook. From our tests, if the Spark job takes longer than 10 seconds to shutdown gracefully, the Hadoop shutdown thread seems to trample over the graceful shutdown and throw an exception like {code:java} 18/08/03 17:21:04 ERROR Utils: Uncaught exception in thread pool-1-thread-1 java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1039) at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) at org.apache.spark.streaming.scheduler.ReceiverTracker.stop(ReceiverTracker.scala:177) at org.apache.spark.streaming.scheduler.JobScheduler.stop(JobScheduler.scala:114) at org.apache.spark.streaming.StreamingContext$$anonfun$stop$1.apply$mcV$sp(StreamingContext.scala:682) at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1317) at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:681) at org.apache.spark.streaming.StreamingContext.org$apache$spark$streaming$StreamingContext$$stopOnShutdown(StreamingContext.scala:715) at org.apache.spark.streaming.StreamingContext$$anonfun$start$1.apply$mcV$sp(StreamingContext.scala:599) at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1948) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) at scala.util.Try$.apply(Try.scala:192) at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748){code} The reason I hit this issue is because we recently upgraded to EMR 5.15, which has both Spark 2.3 & Hadoop 2.8. 
The following workaround has proven successful for us (in limited testing). Instead of just running
{code:java}
...
ssc.start()
ssc.awaitTermination()
{code}
we needed to do the following
{code:java}
...
ssc.start()
sys.ShutdownHookThread {
  ssc.stop(true, true)
}
ssc.awaitTermination()
{code}
As far as I can tell, there is no way to override the default {{10 second}} timeout in HADOOP-12950, which is why we had to go with the workaround. Note: I also verified this bug exists even with EMR 5.12.1, which runs Spark 2.2.x & Hadoop 2.8. Ricky Epic Games -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
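A fuller, self-contained version of that workaround might look like the sketch below. The object name, batch interval, and the dummy queue stream are placeholders; the essential part is registering our own {{sys.ShutdownHookThread}} that stops the StreamingContext gracefully before Hadoop's 10-second shutdown-hook timeout can interrupt Spark's own hook:
{code:scala}
import scala.collection.mutable

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object GracefulShutdownExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("graceful-shutdown-example").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // A trivial input/output pair so the context has something to run;
    // replace this with the real Kafka/receiver pipeline.
    val queue = mutable.Queue[RDD[Int]]()
    ssc.queueStream(queue).print()

    ssc.start()
    // Our own shutdown hook: stop the StreamingContext
    // (stopSparkContext = true, stopGracefully = true) ourselves rather than
    // relying only on the hook Spark registers, which Hadoop 2.8 can cut short.
    sys.ShutdownHookThread {
      ssc.stop(stopSparkContext = true, stopGracefully = true)
    }
    ssc.awaitTermination()
  }
}
{code}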
[jira] [Commented] (SPARK-25019) The published spark sql pom does not exclude the normal version of orc-core
[ https://issues.apache.org/jira/browse/SPARK-25019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568554#comment-16568554 ] Yin Huai commented on SPARK-25019: -- [~dongjoon] can you help us fix this issue? Or there is a reason that the parent pom and sql/core/pom are not consistent? > The published spark sql pom does not exclude the normal version of orc-core > > > Key: SPARK-25019 > URL: https://issues.apache.org/jira/browse/SPARK-25019 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 2.4.0 >Reporter: Yin Huai >Priority: Critical > > I noticed that > [https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-sql_2.11/2.4.0-SNAPSHOT/spark-sql_2.11-2.4.0-20180803.100335-189.pom] > does not exclude the normal version of orc-core. Comparing with > [https://github.com/apache/spark/blob/92b48842b944a3e430472294cdc3c481bad6b804/sql/core/pom.xml#L108] > and > [https://github.com/apache/spark/blob/92b48842b944a3e430472294cdc3c481bad6b804/pom.xml#L1767,] > we only exclude the normal version of orc-core in the parent pom. So, the > problem is that if a developer depends on spark-sql-core directly, orc-core > and orc-core-nohive will be in the dependency list. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25019) The published spark sql pom does not exclude the normal version of orc-core
Yin Huai created SPARK-25019: Summary: The published spark sql pom does not exclude the normal version of orc-core Key: SPARK-25019 URL: https://issues.apache.org/jira/browse/SPARK-25019 Project: Spark Issue Type: Bug Components: Build, SQL Affects Versions: 2.4.0 Reporter: Yin Huai I noticed that [https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-sql_2.11/2.4.0-SNAPSHOT/spark-sql_2.11-2.4.0-20180803.100335-189.pom] does not exclude the normal version of orc-core. Comparing with [https://github.com/apache/spark/blob/92b48842b944a3e430472294cdc3c481bad6b804/sql/core/pom.xml#L108] and [https://github.com/apache/spark/blob/92b48842b944a3e430472294cdc3c481bad6b804/pom.xml#L1767,] we only exclude the normal version of orc-core in the parent pom. So, the problem is that if a developer depends on spark-sql-core directly, orc-core and orc-core-nohive will be in the dependency list. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25018) Use `Co-Authored-By` git trailer in `merge_spark_pr.py`
[ https://issues.apache.org/jira/browse/SPARK-25018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25018: Assignee: Apache Spark > Use `Co-Authored-By` git trailer in `merge_spark_pr.py` > --- > > Key: SPARK-25018 > URL: https://issues.apache.org/jira/browse/SPARK-25018 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 2.4.0 >Reporter: DB Tsai >Assignee: Apache Spark >Priority: Major > > Many projects such as openstack are using `Co-Authored-By: name > ` in commit messages to indicate people who worked on a > particular patch. > It's a convention for recognizing multiple authors, and can encourage people > to collaborate. > Co-authored commits are visible on GitHub and can be included in the profile > contributions graph and the repository's statistics. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25018) Use `Co-Authored-By` git trailer in `merge_spark_pr.py`
[ https://issues.apache.org/jira/browse/SPARK-25018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25018: Assignee: (was: Apache Spark) > Use `Co-Authored-By` git trailer in `merge_spark_pr.py` > --- > > Key: SPARK-25018 > URL: https://issues.apache.org/jira/browse/SPARK-25018 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 2.4.0 >Reporter: DB Tsai >Priority: Major > > Many projects such as openstack are using `Co-Authored-By: name > ` in commit messages to indicate people who worked on a > particular patch. > It's a convention for recognizing multiple authors, and can encourage people > to collaborate. > Co-authored commits are visible on GitHub and can be included in the profile > contributions graph and the repository's statistics. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25018) Use `Co-Authored-By` git trailer in `merge_spark_pr.py`
[ https://issues.apache.org/jira/browse/SPARK-25018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568545#comment-16568545 ] Apache Spark commented on SPARK-25018: -- User 'dbtsai' has created a pull request for this issue: https://github.com/apache/spark/pull/21991 > Use `Co-Authored-By` git trailer in `merge_spark_pr.py` > --- > > Key: SPARK-25018 > URL: https://issues.apache.org/jira/browse/SPARK-25018 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 2.4.0 >Reporter: DB Tsai >Priority: Major > > Many projects such as openstack are using `Co-Authored-By: name > ` in commit messages to indicate people who worked on a > particular patch. > It's a convention for recognizing multiple authors, and can encourage people > to collaborate. > Co-authored commits are visible on GitHub and can be included in the profile > contributions graph and the repository's statistics. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25018) Use `Co-Authored-By` git trailer in `merge_spark_pr.py`
DB Tsai created SPARK-25018: --- Summary: Use `Co-Authored-By` git trailer in `merge_spark_pr.py` Key: SPARK-25018 URL: https://issues.apache.org/jira/browse/SPARK-25018 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 2.4.0 Reporter: DB Tsai Many projects such as openstack are using `Co-Authored-By: name ` in commit messages to indicate people who worked on a particular patch. It's a convention for recognizing multiple authors, and can encourage people to collaborate. Co-authored commits are visible on GitHub and can be included in the profile contributions graph and the repository's statistics. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
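For reference, a commit message using that trailer could look like the following (the JIRA number, names, and e-mail addresses are placeholders):
{code}
[SPARK-XXXXX][CORE] Short description of the change

Longer explanation of what the patch does and why.

Co-Authored-By: Jane Doe <jane.doe@example.com>
Co-Authored-By: John Roe <john.roe@example.com>
{code}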
[jira] [Created] (SPARK-25017) Add test suite for ContextBarrierState
Jiang Xingbo created SPARK-25017: Summary: Add test suite for ContextBarrierState Key: SPARK-25017 URL: https://issues.apache.org/jira/browse/SPARK-25017 Project: Spark Issue Type: Test Components: Spark Core Affects Versions: 2.4.0 Reporter: Jiang Xingbo We should be able to add unit tests for ContextBarrierState with a mocked RpcCallContext. Currently it's only covered by the end-to-end tests in `BarrierTaskContextSuite`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25016) remove Support for hadoop 2.6
Thomas Graves created SPARK-25016: - Summary: remove Support for hadoop 2.6 Key: SPARK-25016 URL: https://issues.apache.org/jira/browse/SPARK-25016 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 2.4.0 Reporter: Thomas Graves Hadoop 2.6 is now old; no releases have been done in 2 years and it doesn't look like they are patching it for security or bug fixes. We should stop supporting it in Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25016) remove Support for hadoop 2.6
[ https://issues.apache.org/jira/browse/SPARK-25016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-25016: -- Target Version/s: 3.0.0 > remove Support for hadoop 2.6 > - > > Key: SPARK-25016 > URL: https://issues.apache.org/jira/browse/SPARK-25016 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Thomas Graves >Priority: Major > > Hadoop 2.6 is now old, no releases have been done in 2 year and it doesn't > look like they are patching it for security or bug fixes. We should stop > supporting it in spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24918) Executor Plugin API
[ https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-24918: - Labels: SPIP memory-analysis (was: memory-analysis) > Executor Plugin API > --- > > Key: SPARK-24918 > URL: https://issues.apache.org/jira/browse/SPARK-24918 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: SPIP, memory-analysis > > It would be nice if we could specify an arbitrary class to run within each > executor for debugging and instrumentation. Its hard to do this currently > because: > a) you have no idea when executors will come and go with DynamicAllocation, > so don't have a chance to run custom code before the first task > b) even with static allocation, you'd have to change the code of your spark > app itself to run a special task to "install" the plugin, which is often > tough in production cases when those maintaining regularly running > applications might not even know how to make changes to the application. > For example, https://github.com/squito/spark-memory could be used in a > debugging context to understand memory use, just by re-running an application > with extra command line arguments (as opposed to rebuilding spark). > I think one tricky part here is just deciding the api, and how its versioned. > Does it just get created when the executor starts, and thats it? Or does it > get more specific events, like task start, task end, etc? Would we ever add > more events? It should definitely be a {{DeveloperApi}}, so breaking > compatibility would be allowed ... but still should be avoided. We could > create a base class that has no-op implementations, or explicitly version > everything. > Note that this is not needed in the driver as we already have SparkListeners > (even if you don't care about the SparkListenerEvents and just want to > inspect objects in the JVM, its still good enough). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24918) Executor Plugin API
[ https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568471#comment-16568471 ] Imran Rashid commented on SPARK-24918: -- attached an [spip proposal|https://docs.google.com/document/d/1a20gHGMyRbCM8aicvq4LhWfQmoA5cbHBQtyqIA2hgtc/edit?usp=sharing] > Executor Plugin API > --- > > Key: SPARK-24918 > URL: https://issues.apache.org/jira/browse/SPARK-24918 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: memory-analysis > > It would be nice if we could specify an arbitrary class to run within each > executor for debugging and instrumentation. Its hard to do this currently > because: > a) you have no idea when executors will come and go with DynamicAllocation, > so don't have a chance to run custom code before the first task > b) even with static allocation, you'd have to change the code of your spark > app itself to run a special task to "install" the plugin, which is often > tough in production cases when those maintaining regularly running > applications might not even know how to make changes to the application. > For example, https://github.com/squito/spark-memory could be used in a > debugging context to understand memory use, just by re-running an application > with extra command line arguments (as opposed to rebuilding spark). > I think one tricky part here is just deciding the api, and how its versioned. > Does it just get created when the executor starts, and thats it? Or does it > get more specific events, like task start, task end, etc? Would we ever add > more events? It should definitely be a {{DeveloperApi}}, so breaking > compatibility would be allowed ... but still should be avoided. We could > create a base class that has no-op implementations, or explicitly version > everything. > Note that this is not needed in the driver as we already have SparkListeners > (even if you don't care about the SparkListenerEvents and just want to > inspect objects in the JVM, its still good enough). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
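As a concrete strawman for the questions above (when does the plugin get created, what events does it see), the simplest shape would be a trait instantiated reflectively on each executor at startup. Everything below (the trait name, the methods, the config key) is hypothetical and only illustrates that shape; the attached SPIP is what defines the actual proposal:
{code:scala}
// Hypothetical developer API: a no-op base so future events could be added
// without breaking existing plugins.
trait ExecutorPlugin {
  /** Called once when the executor starts, before any task has run. */
  def init(): Unit = {}

  /** Called when the executor is shutting down. */
  def shutdown(): Unit = {}
}

// An instrumentation plugin in the spirit of the memory-monitoring use case above.
// It would be enabled with something like a (hypothetical)
//   --conf spark.executor.plugins=com.example.HeapReportPlugin
// without touching the application's own code.
class HeapReportPlugin extends ExecutorPlugin {
  override def init(): Unit = {
    val rt = Runtime.getRuntime
    println(s"executor heap: current=${rt.totalMemory() / (1024 * 1024)} MB, " +
      s"max=${rt.maxMemory() / (1024 * 1024)} MB")
  }
}
{code}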
[jira] [Assigned] (SPARK-24954) Fail fast on job submit if run a barrier stage with dynamic resource allocation enabled
[ https://issues.apache.org/jira/browse/SPARK-24954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-24954: - Assignee: Jiang Xingbo > Fail fast on job submit if run a barrier stage with dynamic resource > allocation enabled > --- > > Key: SPARK-24954 > URL: https://issues.apache.org/jira/browse/SPARK-24954 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Jiang Xingbo >Assignee: Jiang Xingbo >Priority: Blocker > Fix For: 2.4.0 > > > Since we explicitly listed "Support running barrier stage with dynamic > resource allocation" a Non-Goal in the design doc, we shall fail fast on job > submit if running a barrier stage with dynamic resource allocation enabled, > to avoid some confusing behaviors (can refer to SPARK-24942 for some > examples). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24954) Fail fast on job submit if run a barrier stage with dynamic resource allocation enabled
[ https://issues.apache.org/jira/browse/SPARK-24954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-24954. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21915 [https://github.com/apache/spark/pull/21915] > Fail fast on job submit if run a barrier stage with dynamic resource > allocation enabled > --- > > Key: SPARK-24954 > URL: https://issues.apache.org/jira/browse/SPARK-24954 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Jiang Xingbo >Assignee: Jiang Xingbo >Priority: Blocker > Fix For: 2.4.0 > > > Since we explicitly listed "Support running barrier stage with dynamic > resource allocation" a Non-Goal in the design doc, we shall fail fast on job > submit if running a barrier stage with dynamic resource allocation enabled, > to avoid some confusing behaviors (can refer to SPARK-24942 for some > examples). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
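From the user's point of view, the fail-fast would surface roughly as in the sketch below. This assumes dynamic allocation has been turned on via {{spark.dynamicAllocation.enabled=true}}; the exact exception type and message are up to the implementation, not this example:
{code:scala}
// sc: an existing SparkContext running with spark.dynamicAllocation.enabled=true.
// A job containing a barrier stage should now be rejected at job-submission time
// instead of misbehaving later at runtime.
val rdd = sc.parallelize(1 to 10, 2).barrier().mapPartitions(iter => iter)
rdd.collect() // expected to fail fast with an error pointing at dynamic allocation
{code}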
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568454#comment-16568454 ] Reynold Xin commented on SPARK-24924: - I like the improved error message (I didn't read the earlier comments in this thread). > Add mapping for built-in Avro data source > - > > Key: SPARK-24924 > URL: https://issues.apache.org/jira/browse/SPARK-24924 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.4.0 > > > This issue aims to the followings. > # Like `com.databricks.spark.csv` mapping, we had better map > `com.databricks.spark.avro` to built-in Avro data source. > # Remove incorrect error message, `Please find an Avro package at ...`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20696) tf-idf document clustering with K-means in Apache Spark putting points into one cluster
[ https://issues.apache.org/jira/browse/SPARK-20696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568439#comment-16568439 ] Aditya Kamath commented on SPARK-20696: --- [~rajanimaski] Please let me know which implementation of Kmeans you created. I'd really like this information. Thank you. > tf-idf document clustering with K-means in Apache Spark putting points into > one cluster > --- > > Key: SPARK-20696 > URL: https://issues.apache.org/jira/browse/SPARK-20696 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Nassir >Priority: Major > > I am trying to do the classic job of clustering text documents by > pre-processing, generating tf-idf matrix, and then applying K-means. However, > testing this workflow on the classic 20NewsGroup dataset results in most > documents being clustered into one cluster. (I have initially tried to > cluster all documents from 6 of the 20 groups - so expecting to cluster into > 6 clusters). > I am implementing this in Apache Spark as my purpose is to utilise this > technique on millions of documents. Here is the code written in Pyspark on > Databricks: > #declare path to folder containing 6 of 20 news group categories > path = "/mnt/%s/20news-bydate.tar/20new-bydate-train-lessFolders/*/*" % > MOUNT_NAME > #read all the text files from the 6 folders. Each entity is an entire > document. > text_files = sc.wholeTextFiles(path).cache() > #convert rdd to dataframe > df = text_files.toDF(["filePath", "document"]).cache() > from pyspark.ml.feature import HashingTF, IDF, Tokenizer, CountVectorizer > #tokenize the document text > tokenizer = Tokenizer(inputCol="document", outputCol="tokens") > tokenized = tokenizer.transform(df).cache() > from pyspark.ml.feature import StopWordsRemover > remover = StopWordsRemover(inputCol="tokens", > outputCol="stopWordsRemovedTokens") > stopWordsRemoved_df = remover.transform(tokenized).cache() > hashingTF = HashingTF (inputCol="stopWordsRemovedTokens", > outputCol="rawFeatures", numFeatures=20) > tfVectors = hashingTF.transform(stopWordsRemoved_df).cache() > idf = IDF(inputCol="rawFeatures", outputCol="features", minDocFreq=5) > idfModel = idf.fit(tfVectors) > tfIdfVectors = idfModel.transform(tfVectors).cache() > #note that I have also tried to use normalized data, but get the same result > from pyspark.ml.feature import Normalizer > from pyspark.ml.linalg import Vectors > normalizer = Normalizer(inputCol="features", outputCol="normFeatures") > l2NormData = normalizer.transform(tfIdfVectors) > from pyspark.ml.clustering import KMeans > # Trains a KMeans model. > kmeans = KMeans().setK(6).setMaxIter(20) > km_model = kmeans.fit(l2NormData) > clustersTable = km_model.transform(l2NormData) > [output showing most documents get clustered into cluster 0][1] > ID number_of_documents_in_cluster > 0 3024 > 3 5 > 1 3 > 5 2 > 2 2 > 4 1 > As you can see most of my data points get clustered into cluster 0, and I > cannot figure out what I am doing wrong as all the tutorials and code I have > come across online point to using this method. > In addition I have also tried normalizing the tf-idf matrix before K-means > but that also produces the same result. I know cosine distance is a better > measure to use, but I expected using standard K-means in Apache Spark would > provide meaningful results. > Can anyone help with regards to whether I have a bug in my code, or if > something is missing in my data clustering pipeline? 
> (Question also asked in Stackoverflow before: > http://stackoverflow.com/questions/43863373/tf-idf-document-clustering-with-k-means-in-apache-spark-putting-points-into-one) > Thank you in advance! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25011) Add PrefixSpan to __all__ in fpm.py
[ https://issues.apache.org/jira/browse/SPARK-25011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-25011: --- Summary: Add PrefixSpan to __all__ in fpm.py (was: Add PrefixSpan to __all__) > Add PrefixSpan to __all__ in fpm.py > --- > > Key: SPARK-25011 > URL: https://issues.apache.org/jira/browse/SPARK-25011 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.4.0 >Reporter: yuhao yang >Assignee: yuhao yang >Priority: Minor > Fix For: 2.4.0 > > > Add PrefixSpan to __all__ in fpm.py -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25003) Pyspark Does not use Spark Sql Extensions
[ https://issues.apache.org/jira/browse/SPARK-25003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568415#comment-16568415 ] Russell Spitzer commented on SPARK-25003: - [~holden.karau] , Wrote up a PR for each branch target since I'm not sure what version you would think best for the update. Please let me know if you have any feedback or advice on how to get an automatic test in :) > Pyspark Does not use Spark Sql Extensions > - > > Key: SPARK-25003 > URL: https://issues.apache.org/jira/browse/SPARK-25003 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.2, 2.3.1 >Reporter: Russell Spitzer >Priority: Major > > When creating a SparkSession here > [https://github.com/apache/spark/blob/v2.2.2/python/pyspark/sql/session.py#L216] > {code:python} > if jsparkSession is None: > jsparkSession = self._jvm.SparkSession(self._jsc.sc()) > self._jsparkSession = jsparkSession > {code} > I believe it ends up calling the constructor here > https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L85-L87 > {code:scala} > private[sql] def this(sc: SparkContext) { > this(sc, None, None, new SparkSessionExtensions) > } > {code} > Which creates a new SparkSessionsExtensions object and does not pick up new > extensions that could have been set in the config like the companion > getOrCreate does. > https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L928-L944 > {code:scala} > //in getOrCreate > // Initialize extensions if the user has defined a configurator class. > val extensionConfOption = > sparkContext.conf.get(StaticSQLConf.SPARK_SESSION_EXTENSIONS) > if (extensionConfOption.isDefined) { > val extensionConfClassName = extensionConfOption.get > try { > val extensionConfClass = > Utils.classForName(extensionConfClassName) > val extensionConf = extensionConfClass.newInstance() > .asInstanceOf[SparkSessionExtensions => Unit] > extensionConf(extensions) > } catch { > // Ignore the error if we cannot find the class or when the class > has the wrong type. > case e @ (_: ClassCastException | > _: ClassNotFoundException | > _: NoClassDefFoundError) => > logWarning(s"Cannot use $extensionConfClassName to configure > session extensions.", e) > } > } > {code} > I think a quick fix would be to use the getOrCreate method from the companion > object instead of calling the constructor from the SparkContext. Or we could > fix this by ensuring that all constructors attempt to pick up custom > extensions if they are set. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
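For context, the extension mechanism the description refers to is driven entirely by the {{spark.sql.extensions}} conf when the session is built through {{getOrCreate()}}; a minimal Scala sketch of that path is below (the configurator class, object name, and the injected-rule comment are made-up illustrations):
{code:scala}
import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}

// A configurator class of the kind spark.sql.extensions points at.
class MyExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(ext: SparkSessionExtensions): Unit = {
    // e.g. ext.injectOptimizerRule(session => MyCustomRule(session))  // MyCustomRule is made up
  }
}

object ExtensionsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .config("spark.sql.extensions", classOf[MyExtensions].getName)
      .getOrCreate() // getOrCreate() reads the conf and applies the configurator;
                     // constructing SparkSession directly, as the PySpark path does, skips it
    spark.stop()
  }
}
{code}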
[jira] [Commented] (SPARK-25003) Pyspark Does not use Spark Sql Extensions
[ https://issues.apache.org/jira/browse/SPARK-25003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568414#comment-16568414 ] Apache Spark commented on SPARK-25003: -- User 'RussellSpitzer' has created a pull request for this issue: https://github.com/apache/spark/pull/21990 > Pyspark Does not use Spark Sql Extensions > - > > Key: SPARK-25003 > URL: https://issues.apache.org/jira/browse/SPARK-25003 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.2, 2.3.1 >Reporter: Russell Spitzer >Priority: Major > > When creating a SparkSession here > [https://github.com/apache/spark/blob/v2.2.2/python/pyspark/sql/session.py#L216] > {code:python} > if jsparkSession is None: > jsparkSession = self._jvm.SparkSession(self._jsc.sc()) > self._jsparkSession = jsparkSession > {code} > I believe it ends up calling the constructor here > https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L85-L87 > {code:scala} > private[sql] def this(sc: SparkContext) { > this(sc, None, None, new SparkSessionExtensions) > } > {code} > Which creates a new SparkSessionsExtensions object and does not pick up new > extensions that could have been set in the config like the companion > getOrCreate does. > https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L928-L944 > {code:scala} > //in getOrCreate > // Initialize extensions if the user has defined a configurator class. > val extensionConfOption = > sparkContext.conf.get(StaticSQLConf.SPARK_SESSION_EXTENSIONS) > if (extensionConfOption.isDefined) { > val extensionConfClassName = extensionConfOption.get > try { > val extensionConfClass = > Utils.classForName(extensionConfClassName) > val extensionConf = extensionConfClass.newInstance() > .asInstanceOf[SparkSessionExtensions => Unit] > extensionConf(extensions) > } catch { > // Ignore the error if we cannot find the class or when the class > has the wrong type. > case e @ (_: ClassCastException | > _: ClassNotFoundException | > _: NoClassDefFoundError) => > logWarning(s"Cannot use $extensionConfClassName to configure > session extensions.", e) > } > } > {code} > I think a quick fix would be to use the getOrCreate method from the companion > object instead of calling the constructor from the SparkContext. Or we could > fix this by ensuring that all constructors attempt to pick up custom > extensions if they are set. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25003) Pyspark Does not use Spark Sql Extensions
[ https://issues.apache.org/jira/browse/SPARK-25003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568405#comment-16568405 ] Apache Spark commented on SPARK-25003: -- User 'RussellSpitzer' has created a pull request for this issue: https://github.com/apache/spark/pull/21989 > Pyspark Does not use Spark Sql Extensions > - > > Key: SPARK-25003 > URL: https://issues.apache.org/jira/browse/SPARK-25003 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.2, 2.3.1 >Reporter: Russell Spitzer >Priority: Major > > When creating a SparkSession here > [https://github.com/apache/spark/blob/v2.2.2/python/pyspark/sql/session.py#L216] > {code:python} > if jsparkSession is None: > jsparkSession = self._jvm.SparkSession(self._jsc.sc()) > self._jsparkSession = jsparkSession > {code} > I believe it ends up calling the constructor here > https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L85-L87 > {code:scala} > private[sql] def this(sc: SparkContext) { > this(sc, None, None, new SparkSessionExtensions) > } > {code} > Which creates a new SparkSessionsExtensions object and does not pick up new > extensions that could have been set in the config like the companion > getOrCreate does. > https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L928-L944 > {code:scala} > //in getOrCreate > // Initialize extensions if the user has defined a configurator class. > val extensionConfOption = > sparkContext.conf.get(StaticSQLConf.SPARK_SESSION_EXTENSIONS) > if (extensionConfOption.isDefined) { > val extensionConfClassName = extensionConfOption.get > try { > val extensionConfClass = > Utils.classForName(extensionConfClassName) > val extensionConf = extensionConfClass.newInstance() > .asInstanceOf[SparkSessionExtensions => Unit] > extensionConf(extensions) > } catch { > // Ignore the error if we cannot find the class or when the class > has the wrong type. > case e @ (_: ClassCastException | > _: ClassNotFoundException | > _: NoClassDefFoundError) => > logWarning(s"Cannot use $extensionConfClassName to configure > session extensions.", e) > } > } > {code} > I think a quick fix would be to use the getOrCreate method from the companion > object instead of calling the constructor from the SparkContext. Or we could > fix this by ensuring that all constructors attempt to pick up custom > extensions if they are set. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25003) Pyspark Does not use Spark Sql Extensions
[ https://issues.apache.org/jira/browse/SPARK-25003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25003: Assignee: (was: Apache Spark) > Pyspark Does not use Spark Sql Extensions > - > > Key: SPARK-25003 > URL: https://issues.apache.org/jira/browse/SPARK-25003 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.2, 2.3.1 >Reporter: Russell Spitzer >Priority: Major > > When creating a SparkSession here > [https://github.com/apache/spark/blob/v2.2.2/python/pyspark/sql/session.py#L216] > {code:python} > if jsparkSession is None: > jsparkSession = self._jvm.SparkSession(self._jsc.sc()) > self._jsparkSession = jsparkSession > {code} > I believe it ends up calling the constructor here > https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L85-L87 > {code:scala} > private[sql] def this(sc: SparkContext) { > this(sc, None, None, new SparkSessionExtensions) > } > {code} > Which creates a new SparkSessionsExtensions object and does not pick up new > extensions that could have been set in the config like the companion > getOrCreate does. > https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L928-L944 > {code:scala} > //in getOrCreate > // Initialize extensions if the user has defined a configurator class. > val extensionConfOption = > sparkContext.conf.get(StaticSQLConf.SPARK_SESSION_EXTENSIONS) > if (extensionConfOption.isDefined) { > val extensionConfClassName = extensionConfOption.get > try { > val extensionConfClass = > Utils.classForName(extensionConfClassName) > val extensionConf = extensionConfClass.newInstance() > .asInstanceOf[SparkSessionExtensions => Unit] > extensionConf(extensions) > } catch { > // Ignore the error if we cannot find the class or when the class > has the wrong type. > case e @ (_: ClassCastException | > _: ClassNotFoundException | > _: NoClassDefFoundError) => > logWarning(s"Cannot use $extensionConfClassName to configure > session extensions.", e) > } > } > {code} > I think a quick fix would be to use the getOrCreate method from the companion > object instead of calling the constructor from the SparkContext. Or we could > fix this by ensuring that all constructors attempt to pick up custom > extensions if they are set. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25003) Pyspark Does not use Spark Sql Extensions
[ https://issues.apache.org/jira/browse/SPARK-25003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25003: Assignee: Apache Spark > Pyspark Does not use Spark Sql Extensions > - > > Key: SPARK-25003 > URL: https://issues.apache.org/jira/browse/SPARK-25003 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.2, 2.3.1 >Reporter: Russell Spitzer >Assignee: Apache Spark >Priority: Major > > When creating a SparkSession here > [https://github.com/apache/spark/blob/v2.2.2/python/pyspark/sql/session.py#L216] > {code:python} > if jsparkSession is None: > jsparkSession = self._jvm.SparkSession(self._jsc.sc()) > self._jsparkSession = jsparkSession > {code} > I believe it ends up calling the constructor here > https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L85-L87 > {code:scala} > private[sql] def this(sc: SparkContext) { > this(sc, None, None, new SparkSessionExtensions) > } > {code} > Which creates a new SparkSessionsExtensions object and does not pick up new > extensions that could have been set in the config like the companion > getOrCreate does. > https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L928-L944 > {code:scala} > //in getOrCreate > // Initialize extensions if the user has defined a configurator class. > val extensionConfOption = > sparkContext.conf.get(StaticSQLConf.SPARK_SESSION_EXTENSIONS) > if (extensionConfOption.isDefined) { > val extensionConfClassName = extensionConfOption.get > try { > val extensionConfClass = > Utils.classForName(extensionConfClassName) > val extensionConf = extensionConfClass.newInstance() > .asInstanceOf[SparkSessionExtensions => Unit] > extensionConf(extensions) > } catch { > // Ignore the error if we cannot find the class or when the class > has the wrong type. > case e @ (_: ClassCastException | > _: ClassNotFoundException | > _: NoClassDefFoundError) => > logWarning(s"Cannot use $extensionConfClassName to configure > session extensions.", e) > } > } > {code} > I think a quick fix would be to use the getOrCreate method from the companion > object instead of calling the constructor from the SparkContext. Or we could > fix this by ensuring that all constructors attempt to pick up custom > extensions if they are set. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25003) Pyspark Does not use Spark Sql Extensions
[ https://issues.apache.org/jira/browse/SPARK-25003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568396#comment-16568396 ] Apache Spark commented on SPARK-25003: -- User 'RussellSpitzer' has created a pull request for this issue: https://github.com/apache/spark/pull/21988 > Pyspark Does not use Spark Sql Extensions > - > > Key: SPARK-25003 > URL: https://issues.apache.org/jira/browse/SPARK-25003 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.2, 2.3.1 >Reporter: Russell Spitzer >Priority: Major > > When creating a SparkSession here > [https://github.com/apache/spark/blob/v2.2.2/python/pyspark/sql/session.py#L216] > {code:python} > if jsparkSession is None: > jsparkSession = self._jvm.SparkSession(self._jsc.sc()) > self._jsparkSession = jsparkSession > {code} > I believe it ends up calling the constructor here > https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L85-L87 > {code:scala} > private[sql] def this(sc: SparkContext) { > this(sc, None, None, new SparkSessionExtensions) > } > {code} > Which creates a new SparkSessionsExtensions object and does not pick up new > extensions that could have been set in the config like the companion > getOrCreate does. > https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L928-L944 > {code:scala} > //in getOrCreate > // Initialize extensions if the user has defined a configurator class. > val extensionConfOption = > sparkContext.conf.get(StaticSQLConf.SPARK_SESSION_EXTENSIONS) > if (extensionConfOption.isDefined) { > val extensionConfClassName = extensionConfOption.get > try { > val extensionConfClass = > Utils.classForName(extensionConfClassName) > val extensionConf = extensionConfClass.newInstance() > .asInstanceOf[SparkSessionExtensions => Unit] > extensionConf(extensions) > } catch { > // Ignore the error if we cannot find the class or when the class > has the wrong type. > case e @ (_: ClassCastException | > _: ClassNotFoundException | > _: NoClassDefFoundError) => > logWarning(s"Cannot use $extensionConfClassName to configure > session extensions.", e) > } > } > {code} > I think a quick fix would be to use the getOrCreate method from the companion > object instead of calling the constructor from the SparkContext. Or we could > fix this by ensuring that all constructors attempt to pick up custom > extensions if they are set. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25015) Update Hadoop 2.7 to 2.7.7
[ https://issues.apache.org/jira/browse/SPARK-25015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25015: Assignee: Apache Spark (was: Sean Owen) > Update Hadoop 2.7 to 2.7.7 > -- > > Key: SPARK-25015 > URL: https://issues.apache.org/jira/browse/SPARK-25015 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 2.1.3, 2.2.2, 2.3.1 >Reporter: Sean Owen >Assignee: Apache Spark >Priority: Minor > > We should update the Hadoop 2.7 dependency to 2.7.7, to pick up bug and > security fixes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25015) Update Hadoop 2.7 to 2.7.7
[ https://issues.apache.org/jira/browse/SPARK-25015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568392#comment-16568392 ] Apache Spark commented on SPARK-25015: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/21987 > Update Hadoop 2.7 to 2.7.7 > -- > > Key: SPARK-25015 > URL: https://issues.apache.org/jira/browse/SPARK-25015 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 2.1.3, 2.2.2, 2.3.1 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > > We should update the Hadoop 2.7 dependency to 2.7.7, to pick up bug and > security fixes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568393#comment-16568393 ] Thomas Graves commented on SPARK-24924: --- {quote} It wouldn't be very different for 2.4.0. It could be different but I guess it should be incremental improvement without behaviour changes. {quote} I don't buy this argument; the code has been restructured a lot and you could have introduced bugs, behavior changes, etc. If a user has been using the Databricks spark-avro version for other releases and it was working fine, and now we magically map it to a different implementation and they break, they are going to complain and say: I didn't change anything, why did this break? Users could also have made their own modified version of the Databricks spark-avro package (which we actually have, to support primitive types), and thus the implementation is not the same, and yet you are assuming it is. Just a note: the fact that we use a different version isn't my issue, and I'm happy to make that work; I'm worried about other users who didn't happen to see this JIRA. I also realize these are 3rd party packages, but I think we are making the assumption here based on this being a Databricks package, which in my opinion we shouldn't. What if this was a companyX package which we didn't know about, what would/should be the expected behavior? How many users complained about the CSV thing? Could we just improve the error message to state more simply: "Multiple sources found; perhaps you are including an external package that also supports Avro. Spark supports Avro internally as of release X.Y, please remove the external package or rewrite to use a different function." > Add mapping for built-in Avro data source > - > > Key: SPARK-24924 > URL: https://issues.apache.org/jira/browse/SPARK-24924 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.4.0 > > > This issue aims to the followings. > # Like `com.databricks.spark.csv` mapping, we had better map > `com.databricks.spark.avro` to built-in Avro data source. > # Remove incorrect error message, `Please find an Avro package at ...`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
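For what it's worth, the behaviour being debated boils down to which implementation these two calls resolve to once the built-in Avro module is on the classpath (the paths below are placeholders, and "expected" reflects the mapping described in this issue rather than a guarantee):
{code:scala}
// spark: an existing SparkSession with the 2.4 built-in Avro module on the classpath.
// Both reads are expected to hit the built-in Avro source once the mapping is in place:
// "avro" is the new short name, the Databricks class name is the legacy spelling being remapped.
val df1 = spark.read.format("avro").load("/path/to/data.avro")
val df2 = spark.read.format("com.databricks.spark.avro").load("/path/to/data.avro")
{code}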
[jira] [Assigned] (SPARK-25015) Update Hadoop 2.7 to 2.7.7
[ https://issues.apache.org/jira/browse/SPARK-25015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25015: Assignee: Sean Owen (was: Apache Spark) > Update Hadoop 2.7 to 2.7.7 > -- > > Key: SPARK-25015 > URL: https://issues.apache.org/jira/browse/SPARK-25015 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 2.1.3, 2.2.2, 2.3.1 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > > We should update the Hadoop 2.7 dependency to 2.7.7, to pick up bug and > security fixes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25015) Update Hadoop 2.7 to 2.7.7
Sean Owen created SPARK-25015: - Summary: Update Hadoop 2.7 to 2.7.7 Key: SPARK-25015 URL: https://issues.apache.org/jira/browse/SPARK-25015 Project: Spark Issue Type: Task Components: Build Affects Versions: 2.3.1, 2.2.2, 2.1.3 Reporter: Sean Owen Assignee: Sean Owen We should update the Hadoop 2.7 dependency to 2.7.7, to pick up bug and security fixes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18057: -- Priority: Major (was: Blocker) > Update structured streaming kafka from 0.10.0.1 to 2.0.0 > > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger >Assignee: Ted Yu >Priority: Major > Fix For: 2.4.0 > > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18057. --- Resolution: Fixed Fix Version/s: 2.4.0 > Update structured streaming kafka from 0.10.0.1 to 2.0.0 > > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger >Assignee: Ted Yu >Priority: Major > Fix For: 2.4.0 > > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25014) When we tried to read kafka topic through spark streaming spark submit is getting failed with Python worker exited unexpectedly (crashed) error
[ https://issues.apache.org/jira/browse/SPARK-25014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-25014: -- Priority: Major (was: Blocker) Fix Version/s: (was: 2.3.2) Please read [https://spark.apache.org/contributing.html] For example, don't set Blocker. Nothing about this rules out an env problem or code problem. Jira is for reporting issues narrowed down to Spark, rather than asking for support. > When we tried to read kafka topic through spark streaming spark submit is > getting failed with Python worker exited unexpectedly (crashed) error > > > Key: SPARK-25014 > URL: https://issues.apache.org/jira/browse/SPARK-25014 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 >Reporter: KARTHIKEYAN RASIPALAYAM DURAIRAJ >Priority: Major > Original Estimate: 2h > Remaining Estimate: 2h > > Hi Team , > > TOPIC = 'NBC_APPS.TBL_MS_ADVERTISER' > PARTITION = 0 > topicAndPartition = TopicAndPartition(TOPIC, PARTITION) > fromOffsets1 = \{topicAndPartition:int(PARTITION)} > > def handler(message): > records = message.collect() > for record in records: > value_all=record[1] > value_key=record[0] > # print(value_all) > > schema_registry_client = > CachedSchemaRegistryClient(url='http://localhost:8081') > serializer = MessageSerializer(schema_registry_client) > sc = SparkContext(appName="PythonStreamingAvro") > ssc = StreamingContext(sc, 10) > kvs = KafkaUtils.createDirectStream(ssc, ['NBC_APPS.TBL_MS_ADVERTISER'], > \{"metadata.broker.list": > 'localhost:9092'},valueDecoder=serializer.decode_message) > lines = kvs.map(lambda x: x[1]) > lines.pprint() > kvs.foreachRDD(handler) > > ssc.start() > ssc.awaitTermination() > > This is code we trying to pull the data from kafka topic . when we execute > through spark submit we are getting below error > > > 2018-08-03 11:10:40 INFO VerifiableProperties:68 - Property > zookeeper.connect is overridden to > 2018-08-03 11:10:40 ERROR PythonRunner:91 - Python worker exited unexpectedly > (crashed) > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File > "/Users/KarthikeyanDurairaj/Desktop/Sparkshell/python/lib/pyspark.zip/pyspark/worker.py", > line 215, in main > eval_type = read_int(infile) > File > "/Users/KarthikeyanDurairaj/Desktop/Sparkshell/python/lib/pyspark.zip/pyspark/serializers.py", > line 685, in read_int > raise EOFError > EOFError > > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421) > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25014) When we tried to read kafka topic through spark streaming spark submit is getting failed with Python worker exited unexpectedly (crashed) error
[ https://issues.apache.org/jira/browse/SPARK-25014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-25014. --- Resolution: Invalid > When we tried to read kafka topic through spark streaming spark submit is > getting failed with Python worker exited unexpectedly (crashed) error > > > Key: SPARK-25014 > URL: https://issues.apache.org/jira/browse/SPARK-25014 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 >Reporter: KARTHIKEYAN RASIPALAYAM DURAIRAJ >Priority: Major > Original Estimate: 2h > Remaining Estimate: 2h > > Hi Team , > > TOPIC = 'NBC_APPS.TBL_MS_ADVERTISER' > PARTITION = 0 > topicAndPartition = TopicAndPartition(TOPIC, PARTITION) > fromOffsets1 = \{topicAndPartition:int(PARTITION)} > > def handler(message): > records = message.collect() > for record in records: > value_all=record[1] > value_key=record[0] > # print(value_all) > > schema_registry_client = > CachedSchemaRegistryClient(url='http://localhost:8081') > serializer = MessageSerializer(schema_registry_client) > sc = SparkContext(appName="PythonStreamingAvro") > ssc = StreamingContext(sc, 10) > kvs = KafkaUtils.createDirectStream(ssc, ['NBC_APPS.TBL_MS_ADVERTISER'], > \{"metadata.broker.list": > 'localhost:9092'},valueDecoder=serializer.decode_message) > lines = kvs.map(lambda x: x[1]) > lines.pprint() > kvs.foreachRDD(handler) > > ssc.start() > ssc.awaitTermination() > > This is code we trying to pull the data from kafka topic . when we execute > through spark submit we are getting below error > > > 2018-08-03 11:10:40 INFO VerifiableProperties:68 - Property > zookeeper.connect is overridden to > 2018-08-03 11:10:40 ERROR PythonRunner:91 - Python worker exited unexpectedly > (crashed) > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File > "/Users/KarthikeyanDurairaj/Desktop/Sparkshell/python/lib/pyspark.zip/pyspark/worker.py", > line 215, in main > eval_type = read_int(infile) > File > "/Users/KarthikeyanDurairaj/Desktop/Sparkshell/python/lib/pyspark.zip/pyspark/serializers.py", > line 685, in read_int > raise EOFError > EOFError > > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421) > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568352#comment-16568352 ] Hyukjin Kwon commented on SPARK-24924: -- cc [~cloud_fan] since we talked about this for CSV, and [~rxin] who agreed upon not adding .avro for now, FYI. > Add mapping for built-in Avro data source > - > > Key: SPARK-24924 > URL: https://issues.apache.org/jira/browse/SPARK-24924 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.4.0 > > > This issue aims to the followings. > # Like `com.databricks.spark.csv` mapping, we had better map > `com.databricks.spark.avro` to built-in Avro data source. > # Remove incorrect error message, `Please find an Avro package at ...`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568348#comment-16568348 ] Hyukjin Kwon edited comment on SPARK-24924 at 8/3/18 3:29 PM: -- {quote} but at the same time we aren't adding the spark.read.avro syntax so it break in that case or they get a different implementation by default? {quote} If users call this, that's still going to use the built-in implementation (https://github.com/databricks/spark-avro/blob/branch-4.0/src/main/scala/com/databricks/spark/avro/package.scala#L26) as it's a short name for {{format("com.databricks.spark.avro")}}. {quote} our internal implementation which could very well be different. {quote} It wouldn't be very different for 2.4.0. It could be different, but I guess it should be an incremental improvement without behaviour changes. {quote} I would rather just plain error out saying these conflict, either update or change your external package to use a different name. {quote} IIRC, in the past, we did this for the CSV datasource and many users complained about it. {code} java.lang.RuntimeException: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), please specify the fully qualified class name. {code} In practice, I am actually a bit more sure on the current approach since users actually complained about this a lot and now I am not seeing (so far) the complaints about the current approach. {quote} There is also the case one might be able to argue its breaking api compatilibity since .avro option went away, buts it a third party library so you can probably get away with that. {quote} It went away, so I guess if the jar is provided with the implicit import to support this, it should work as usual and use the internal implementation in theory. If the jar is not given, the .avro API is not supported and the internal implementation will be used. was (Author: hyukjin.kwon): {quote} but at the same time we aren't adding the spark.read.avro syntax so it break in that case or they get a different implementation by default? {quote} If users call this, that's still going to use the builtin implemtnation (https://github.com/databricks/spark-avro/blob/branch-4.0/src/main/scala/com/databricks/spark/avro/package.scala#L26) as it's a short name for {{format("com.databricks.spark.avro")}}. {quote} our internal implementation which could very well be different. {quote} It wouldn't be very different for 2.4.0. It could be different but I guess it should be incremental improvement without behaviour changes. {quote} I would rather just plain error out saying these conflict, either update or change your external package to use a different name. {quote} IIRC, in the past, we did for CSV datasource and many users complained about this. {code} java.lang.RuntimeException: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), please specify the fully qualified class name. {code} In practice, I am actually a bit more sure on the current approach since users actually complained about his a lot and now I am not seeing (so far) the complains about the current approach. {code} There is also the case one might be able to argue its breaking api compatilibity since .avro option went away, buts it a third party library so you can probably get away with that. 
{code} It's went away so I guess if the jar is provided with implicit import to support this, this should work as usual and use the internal implementation in theory. If the jar is not given, .avro API is not supported and the internal implmentation will be used. > Add mapping for built-in Avro data source > - > > Key: SPARK-24924 > URL: https://issues.apache.org/jira/browse/SPARK-24924 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.4.0 > > > This issue aims to the followings. > # Like `com.databricks.spark.csv` mapping, we had better map > `com.databricks.spark.avro` to built-in Avro data source. > # Remove incorrect error message, `Please find an Avro package at ...`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568348#comment-16568348 ] Hyukjin Kwon commented on SPARK-24924: -- {quote} but at the same time we aren't adding the spark.read.avro syntax so it break in that case or they get a different implementation by default? {quote} If users call this, that's still going to use the built-in implementation (https://github.com/databricks/spark-avro/blob/branch-4.0/src/main/scala/com/databricks/spark/avro/package.scala#L26) as it's a short name for {{format("com.databricks.spark.avro")}}. {quote} our internal implementation which could very well be different. {quote} It wouldn't be very different for 2.4.0. It could be different, but I guess it should be an incremental improvement without behaviour changes. {quote} I would rather just plain error out saying these conflict, either update or change your external package to use a different name. {quote} IIRC, in the past, we did this for the CSV datasource and many users complained about it. {code} java.lang.RuntimeException: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), please specify the fully qualified class name. {code} In practice, I am actually a bit more sure on the current approach since users actually complained about this a lot and now I am not seeing (so far) the complaints about the current approach. {code} There is also the case one might be able to argue its breaking api compatilibity since .avro option went away, buts it a third party library so you can probably get away with that. {code} It went away, so I guess if the jar is provided with the implicit import to support this, it should work as usual and use the internal implementation in theory. If the jar is not given, the .avro API is not supported and the internal implementation will be used. > Add mapping for built-in Avro data source > - > > Key: SPARK-24924 > URL: https://issues.apache.org/jira/browse/SPARK-24924 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.4.0 > > > This issue aims to the followings. > # Like `com.databricks.spark.csv` mapping, we had better map > `com.databricks.spark.avro` to built-in Avro data source. > # Remove incorrect error message, `Please find an Avro package at ...`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
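For illustration only (not part of the ticket): under the mapping discussed above, both of the following calls would be expected to resolve to the built-in Avro implementation in 2.4.0, assuming the spark-avro module is on the classpath; the path is a placeholder.
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("avro-mapping-sketch").getOrCreate()

// Short name for the built-in data source.
val df1 = spark.read.format("avro").load("/tmp/events.avro")

// Legacy third-party name, now mapped to the same built-in implementation.
val df2 = spark.read.format("com.databricks.spark.avro").load("/tmp/events.avro")
{code}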
[jira] [Commented] (SPARK-23937) High-order function: map_filter(map, function) → MAP
[ https://issues.apache.org/jira/browse/SPARK-23937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568337#comment-16568337 ] Apache Spark commented on SPARK-23937: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/21986 > High-order function: map_filter(map, function) → MAP > -- > > Key: SPARK-23937 > URL: https://issues.apache.org/jira/browse/SPARK-23937 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Constructs a map from those entries of map for which function returns true: > {noformat} > SELECT map_filter(MAP(ARRAY[], ARRAY[]), (k, v) -> true); -- {} > SELECT map_filter(MAP(ARRAY[10, 20, 30], ARRAY['a', NULL, 'c']), (k, v) -> v > IS NOT NULL); -- {10 -> a, 30 -> c} > SELECT map_filter(MAP(ARRAY['k1', 'k2', 'k3'], ARRAY[20, 3, 15]), (k, v) -> v > > 10); -- {k1 -> 20, k3 -> 15} > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23937) High-order function: map_filter(map, function) → MAP
[ https://issues.apache.org/jira/browse/SPARK-23937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23937: Assignee: (was: Apache Spark) > High-order function: map_filter(map, function) → MAP > -- > > Key: SPARK-23937 > URL: https://issues.apache.org/jira/browse/SPARK-23937 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Constructs a map from those entries of map for which function returns true: > {noformat} > SELECT map_filter(MAP(ARRAY[], ARRAY[]), (k, v) -> true); -- {} > SELECT map_filter(MAP(ARRAY[10, 20, 30], ARRAY['a', NULL, 'c']), (k, v) -> v > IS NOT NULL); -- {10 -> a, 30 -> c} > SELECT map_filter(MAP(ARRAY['k1', 'k2', 'k3'], ARRAY[20, 3, 15]), (k, v) -> v > > 10); -- {k1 -> 20, k3 -> 15} > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23937) High-order function: map_filter(map, function) → MAP
[ https://issues.apache.org/jira/browse/SPARK-23937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23937: Assignee: Apache Spark > High-order function: map_filter(map, function) → MAP > -- > > Key: SPARK-23937 > URL: https://issues.apache.org/jira/browse/SPARK-23937 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Major > > Constructs a map from those entries of map for which function returns true: > {noformat} > SELECT map_filter(MAP(ARRAY[], ARRAY[]), (k, v) -> true); -- {} > SELECT map_filter(MAP(ARRAY[10, 20, 30], ARRAY['a', NULL, 'c']), (k, v) -> v > IS NOT NULL); -- {10 -> a, 30 -> c} > SELECT map_filter(MAP(ARRAY['k1', 'k2', 'k3'], ARRAY[20, 3, 15]), (k, v) -> v > > 10); -- {k1 -> 20, k3 -> 15} > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
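As an illustrative sketch (assuming the Spark implementation follows the same {{(k, v) -> ...}} lambda syntax as the higher-order function framework referenced above), the Presto examples in the description would translate to Spark SQL roughly as follows:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("map-filter-sketch").master("local[*]").getOrCreate()

// Keep only the entries whose value is greater than 10, analogous to the last Presto example.
spark.sql(
  "SELECT map_filter(map('k1', 20, 'k2', 3, 'k3', 15), (k, v) -> v > 10) AS filtered"
).show(false)
// Expected: {k1 -> 20, k3 -> 15}
{code}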
[jira] [Created] (SPARK-25014) When we tried to read kafka topic through spark streaming spark submit is getting failed with Python worker exited unexpectedly (crashed) error
KARTHIKEYAN RASIPALAYAM DURAIRAJ created SPARK-25014: Summary: When we tried to read kafka topic through spark streaming spark submit is getting failed with Python worker exited unexpectedly (crashed) error Key: SPARK-25014 URL: https://issues.apache.org/jira/browse/SPARK-25014 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.3.1 Reporter: KARTHIKEYAN RASIPALAYAM DURAIRAJ Fix For: 2.3.2 Hi Team , TOPIC = 'NBC_APPS.TBL_MS_ADVERTISER' PARTITION = 0 topicAndPartition = TopicAndPartition(TOPIC, PARTITION) fromOffsets1 = \{topicAndPartition:int(PARTITION)} def handler(message): records = message.collect() for record in records: value_all=record[1] value_key=record[0] # print(value_all) schema_registry_client = CachedSchemaRegistryClient(url='http://localhost:8081') serializer = MessageSerializer(schema_registry_client) sc = SparkContext(appName="PythonStreamingAvro") ssc = StreamingContext(sc, 10) kvs = KafkaUtils.createDirectStream(ssc, ['NBC_APPS.TBL_MS_ADVERTISER'], \{"metadata.broker.list": 'localhost:9092'},valueDecoder=serializer.decode_message) lines = kvs.map(lambda x: x[1]) lines.pprint() kvs.foreachRDD(handler) ssc.start() ssc.awaitTermination() This is code we trying to pull the data from kafka topic . when we execute through spark submit we are getting below error 2018-08-03 11:10:40 INFO VerifiableProperties:68 - Property zookeeper.connect is overridden to 2018-08-03 11:10:40 ERROR PythonRunner:91 - Python worker exited unexpectedly (crashed) org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/Users/KarthikeyanDurairaj/Desktop/Sparkshell/python/lib/pyspark.zip/pyspark/worker.py", line 215, in main eval_type = read_int(infile) File "/Users/KarthikeyanDurairaj/Desktop/Sparkshell/python/lib/pyspark.zip/pyspark/serializers.py", line 685, in read_int raise EOFError EOFError at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298) at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438) at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421) at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568238#comment-16568238 ] Nick Poorman commented on SPARK-14220: -- Awesome job! (y) > Build and test Spark against Scala 2.12 > --- > > Key: SPARK-14220 > URL: https://issues.apache.org/jira/browse/SPARK-14220 > Project: Spark > Issue Type: Umbrella > Components: Build, Project Infra >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Priority: Blocker > Labels: release-notes > Fix For: 2.4.0 > > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.12 milestone. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24884) Implement regexp_extract_all
[ https://issues.apache.org/jira/browse/SPARK-24884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24884: Assignee: (was: Apache Spark) > Implement regexp_extract_all > > > Key: SPARK-24884 > URL: https://issues.apache.org/jira/browse/SPARK-24884 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Nick Nicolini >Priority: Major > > I've recently hit many cases of regexp parsing where we need to match on > something that is always arbitrary in length; for example, a text block that > looks something like: > {code:java} > AAA:WORDS| > BBB:TEXT| > MSG:ASDF| > MSG:QWER| > ... > MSG:ZXCV|{code} > Where I need to pull out all values between "MSG:" and "|", which can occur > in each instance between 1 and n times. I cannot reliably use the existing > {{regexp_extract}} method since the number of occurrences is always > arbitrary, and while I can write a UDF to handle this it'd be great if this > was supported natively in Spark. > Perhaps we can implement something like {{regexp_extract_all}} as > [Presto|https://prestodb.io/docs/current/functions/regexp.html] and > [Pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html] > have? > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24884) Implement regexp_extract_all
[ https://issues.apache.org/jira/browse/SPARK-24884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568215#comment-16568215 ] Apache Spark commented on SPARK-24884: -- User 'xueyumusic' has created a pull request for this issue: https://github.com/apache/spark/pull/21985 > Implement regexp_extract_all > > > Key: SPARK-24884 > URL: https://issues.apache.org/jira/browse/SPARK-24884 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Nick Nicolini >Priority: Major > > I've recently hit many cases of regexp parsing where we need to match on > something that is always arbitrary in length; for example, a text block that > looks something like: > {code:java} > AAA:WORDS| > BBB:TEXT| > MSG:ASDF| > MSG:QWER| > ... > MSG:ZXCV|{code} > Where I need to pull out all values between "MSG:" and "|", which can occur > in each instance between 1 and n times. I cannot reliably use the existing > {{regexp_extract}} method since the number of occurrences is always > arbitrary, and while I can write a UDF to handle this it'd be great if this > was supported natively in Spark. > Perhaps we can implement something like {{regexp_extract_all}} as > [Presto|https://prestodb.io/docs/current/functions/regexp.html] and > [Pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html] > have? > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24884) Implement regexp_extract_all
[ https://issues.apache.org/jira/browse/SPARK-24884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24884: Assignee: Apache Spark > Implement regexp_extract_all > > > Key: SPARK-24884 > URL: https://issues.apache.org/jira/browse/SPARK-24884 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Nick Nicolini >Assignee: Apache Spark >Priority: Major > > I've recently hit many cases of regexp parsing where we need to match on > something that is always arbitrary in length; for example, a text block that > looks something like: > {code:java} > AAA:WORDS| > BBB:TEXT| > MSG:ASDF| > MSG:QWER| > ... > MSG:ZXCV|{code} > Where I need to pull out all values between "MSG:" and "|", which can occur > in each instance between 1 and n times. I cannot reliably use the existing > {{regexp_extract}} method since the number of occurrences is always > arbitrary, and while I can write a UDF to handle this it'd be great if this > was supported natively in Spark. > Perhaps we can implement something like {{regexp_extract_all}} as > [Presto|https://prestodb.io/docs/current/functions/regexp.html] and > [Pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html] > have? > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
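Until a native {{regexp_extract_all}} exists, the UDF workaround mentioned in the description could look roughly like the sketch below ({{regexpExtractAll}} is a hypothetical name, not a Spark API):
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, udf}

val spark = SparkSession.builder().appName("regexp-extract-all-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Returns every occurrence of capture group 1 (or the whole match if the pattern has no group).
val regexpExtractAll = udf { (text: String, pattern: String) =>
  if (text == null) Seq.empty[String]
  else pattern.r.findAllMatchIn(text)
    .map(m => if (m.groupCount > 0) m.group(1) else m.matched)
    .toSeq
}

val df = Seq("AAA:WORDS|BBB:TEXT|MSG:ASDF|MSG:QWER|MSG:ZXCV|").toDF("block")
df.select(regexpExtractAll($"block", lit("MSG:([^|]+)\\|")).as("msgs")).show(false)
// Expected: [ASDF, QWER, ZXCV]
{code}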
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568204#comment-16568204 ] Thomas Graves commented on SPARK-24924: --- [~felixcheung] did your discussion on the same thing with csv get resolved? > Add mapping for built-in Avro data source > - > > Key: SPARK-24924 > URL: https://issues.apache.org/jira/browse/SPARK-24924 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.4.0 > > > This issue aims to the followings. > # Like `com.databricks.spark.csv` mapping, we had better map > `com.databricks.spark.avro` to built-in Avro data source. > # Remove incorrect error message, `Please find an Avro package at ...`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568199#comment-16568199 ] Thomas Graves commented on SPARK-24924: --- Hmm, so we are adding this for ease of upgrading I guess (so the user doesn't have to change their code), but at the same time we aren't adding the spark.read.avro syntax, so it breaks in that case, or they get a different implementation by default? This doesn't make sense to me. Personally I don't like having some other add-on package names in our code at all, and here we are mapping what the user thought they would get to our internal implementation, which could very well be different. I would rather just plain error out saying these conflict: either update or change your external package to use a different name. There is also the case one might be able to argue it's breaking API compatibility since the .avro option went away, but it's a third-party library so you can probably get away with that. > Add mapping for built-in Avro data source > - > > Key: SPARK-24924 > URL: https://issues.apache.org/jira/browse/SPARK-24924 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.4.0 > > > This issue aims to the followings. > # Like `com.databricks.spark.csv` mapping, we had better map > `com.databricks.spark.avro` to built-in Avro data source. > # Remove incorrect error message, `Please find an Avro package at ...`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23937) High-order function: map_filter(map, function) → MAP
[ https://issues.apache.org/jira/browse/SPARK-23937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568152#comment-16568152 ] Marco Gaido commented on SPARK-23937: - I am working on this, thanks. > High-order function: map_filter(map, function) → MAP > -- > > Key: SPARK-23937 > URL: https://issues.apache.org/jira/browse/SPARK-23937 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Constructs a map from those entries of map for which function returns true: > {noformat} > SELECT map_filter(MAP(ARRAY[], ARRAY[]), (k, v) -> true); -- {} > SELECT map_filter(MAP(ARRAY[10, 20, 30], ARRAY['a', NULL, 'c']), (k, v) -> v > IS NOT NULL); -- {10 -> a, 30 -> c} > SELECT map_filter(MAP(ARRAY['k1', 'k2', 'k3'], ARRAY[20, 3, 15]), (k, v) -> v > > 10); -- {k1 -> 20, k3 -> 15} > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24772) support reading AVRO logical types - Decimal
[ https://issues.apache.org/jira/browse/SPARK-24772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568139#comment-16568139 ] Apache Spark commented on SPARK-24772: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/21984 > support reading AVRO logical types - Decimal > > > Key: SPARK-24772 > URL: https://issues.apache.org/jira/browse/SPARK-24772 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24772) support reading AVRO logical types - Decimal
[ https://issues.apache.org/jira/browse/SPARK-24772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24772: Assignee: (was: Apache Spark) > support reading AVRO logical types - Decimal > > > Key: SPARK-24772 > URL: https://issues.apache.org/jira/browse/SPARK-24772 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24772) support reading AVRO logical types - Decimal
[ https://issues.apache.org/jira/browse/SPARK-24772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24772: Assignee: Apache Spark > support reading AVRO logical types - Decimal > > > Key: SPARK-24772 > URL: https://issues.apache.org/jira/browse/SPARK-24772 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24998) spark-sql will scan the same table repeatedly when doing multi-insert
[ https://issues.apache.org/jira/browse/SPARK-24998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ice bai updated SPARK-24998: Summary: spark-sql will scan the same table repeatedly when doing multi-insert (was: spark-sql will scan the same table repeatedly when doing multi-insert") > spark-sql will scan the same table repeatedly when doing multi-insert > - > > Key: SPARK-24998 > URL: https://issues.apache.org/jira/browse/SPARK-24998 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: ice bai >Priority: Major > Attachments: scan_table_many_times.png > > > Such as the query likes "From xx SELECT yy INSERT INTO a INSERT INTO b INSERT > INTO c ..." . > Following screenshot shows the stages: > !scan_table_many_times.png! > But, Hive only scan once. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24998) spark-sql will scan the same table repeatedly when doing multi-insert"
[ https://issues.apache.org/jira/browse/SPARK-24998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ice bai updated SPARK-24998: Description: Such as the query likes "From xx SELECT yy INSERT INTO a INSERT INTO b INSERT INTO c ..." . Following screenshot shows the stages: !scan_table_many_times.png! But, Hive only scan once. was: Such as the following screenshot: !scan_table_many_times.png! But, Hive only scan once. > spark-sql will scan the same table repeatedly when doing multi-insert" > -- > > Key: SPARK-24998 > URL: https://issues.apache.org/jira/browse/SPARK-24998 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: ice bai >Priority: Major > Attachments: scan_table_many_times.png > > > Such as the query likes "From xx SELECT yy INSERT INTO a INSERT INTO b INSERT > INTO c ..." . > Following screenshot shows the stages: > !scan_table_many_times.png! > But, Hive only scan once. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24998) spark-sql will scan the same table repeatedly when doing multi-insert"
[ https://issues.apache.org/jira/browse/SPARK-24998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ice bai updated SPARK-24998: Summary: spark-sql will scan the same table repeatedly when doing multi-insert" (was: spark-sql will scan the same table repeatedly when the query likes "From xx SELECT yy INSERT a INSERT b INSERT c ...") > spark-sql will scan the same table repeatedly when doing multi-insert" > -- > > Key: SPARK-24998 > URL: https://issues.apache.org/jira/browse/SPARK-24998 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: ice bai >Priority: Major > Attachments: scan_table_many_times.png > > > Such as the following screenshot: > !scan_table_many_times.png! > But, Hive only scan once. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
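To make the reported shape concrete, here is a minimal sketch of the multi-insert pattern the summary refers to (the table names {{src}}, {{a}}, {{b}}, {{c}} are hypothetical):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("multi-insert-sketch").enableHiveSupport().getOrCreate()

// One source table feeding several targets; the report is that src is scanned once per INSERT.
spark.sql("""
  FROM src
  INSERT INTO TABLE a SELECT key, value
  INSERT INTO TABLE b SELECT key, value
  INSERT INTO TABLE c SELECT key, value
""")
{code}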
[jira] [Created] (SPARK-25013) JDBC urls with jdbc:mariadb don't work as expected
Dieter Vekeman created SPARK-25013: -- Summary: JDBC urls with jdbc:mariadb don't work as expected Key: SPARK-25013 URL: https://issues.apache.org/jira/browse/SPARK-25013 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1 Reporter: Dieter Vekeman When using the MariaDB JDBC driver, the JDBC connection url should be {code:java} jdbc:mariadb://localhost:3306/DB?user=someuser&password=somepassword {code} https://mariadb.com/kb/en/library/about-mariadb-connector-j/ However, this does not work well in Spark (see below). *Workaround* The MariaDB driver also supports using mysql, which does work. The problem seems to have been described and identified in: https://jira.mariadb.org/browse/CONJ-421 Everything works well with Spark when using a connection string with {{"jdbc:mysql:..."}}, but not when using {{"jdbc:mariadb:..."}}, because the MySQL dialect is then not used. When it is not used, the default quote is {{"}}, not {{`}}. So an internal query generated by Spark like {{SELECT `i`,`ip` FROM tmp}} will then be executed as {{SELECT "i","ip" FROM tmp}} with the previously retrieved dataType, causing the exception. The author of the comment says {quote}I'll make a pull request to spark so "jdbc:mariadb:" connection string can be handle{quote} Did the pull request get lost or should a new one be made? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
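A possible user-side workaround (a sketch, not an official fix) is to register a custom JdbcDialect that treats {{jdbc:mariadb}} URLs like MySQL, so that generated queries quote identifiers with backticks:
{code:scala}
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Quote identifiers with backticks for jdbc:mariadb URLs, mirroring the MySQL dialect behaviour.
object MariaDbDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.toLowerCase.startsWith("jdbc:mariadb")
  override def quoteIdentifier(colName: String): String = s"`$colName`"
}

// Register once, before reading or writing through a jdbc:mariadb URL.
JdbcDialects.registerDialect(MariaDbDialect)
{code}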
[jira] [Commented] (SPARK-23932) High-order function: zip_with(array, array, function) → array
[ https://issues.apache.org/jira/browse/SPARK-23932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568059#comment-16568059 ] Takuya Ueshin commented on SPARK-23932: --- Hi [~crafty-coder], Are you still working on this? Recently I added a base framework for higher-order functions. You can use it to implement this. If you don't have enough time, I can take this over, so please let me know. Thanks! > High-order function: zip_with(array, array, function) → > array > --- > > Key: SPARK-23932 > URL: https://issues.apache.org/jira/browse/SPARK-23932 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Merges the two given arrays, element-wise, into a single array using > function. Both arrays must be the same length. > {noformat} > SELECT zip_with(ARRAY[1, 3, 5], ARRAY['a', 'b', 'c'], (x, y) -> (y, x)); -- > [ROW('a', 1), ROW('b', 3), ROW('c', 5)] > SELECT zip_with(ARRAY[1, 2], ARRAY[3, 4], (x, y) -> x + y); -- [4, 6] > SELECT zip_with(ARRAY['a', 'b', 'c'], ARRAY['d', 'e', 'f'], (x, y) -> > concat(x, y)); -- ['ad', 'be', 'cf'] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24987) Kafka Cached Consumer Leaking File Descriptors
[ https://issues.apache.org/jira/browse/SPARK-24987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24987: Assignee: (was: Apache Spark) > Kafka Cached Consumer Leaking File Descriptors > -- > > Key: SPARK-24987 > URL: https://issues.apache.org/jira/browse/SPARK-24987 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0, 2.3.1 > Environment: Spark 2.3.1 > Java(TM) SE Runtime Environment (build 1.8.0_112-b15) > Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed mode) > >Reporter: Yuval Itzchakov >Priority: Critical > > Setup: > * Spark 2.3.1 > * Java 1.8.0 (112) > * Standalone Cluster Manager > * 3 Nodes, 1 Executor per node. > Spark 2.3.0 introduced a new mechanism for caching Kafka consumers > (https://issues.apache.org/jira/browse/SPARK-23623?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel) > via KafkaDataConsumer.acquire. > It seems that there are situations (I've been trying to debug it, haven't > been able to find the root cause as of yet) where cached consumers remain "in > use" throughout the life time of the task and are never released. This can be > identified by the following line of the stack trace: > at > org.apache.spark.sql.kafka010.KafkaDataConsumer$.acquire(KafkaDataConsumer.scala:460) > Which points to: > {code:java} > } else if (existingInternalConsumer.inUse) { > // If consumer is already cached but is currently in use, then return a new > consumer > NonCachedKafkaDataConsumer(newInternalConsumer) > {code} > Meaning the existing consumer created for that `TopicPartition` is still in > use for some reason. The weird thing is that you can see this for very old > tasks which have already finished successfully. > I've traced down this leak using file leak detector, attaching it to the > running Executor JVM process. I've emitted the list of open file descriptors > which [you can find > here|https://gist.github.com/YuvalItzchakov/cdbdd7f67604557fccfbcce673c49e5d], > and you can see that the majority of them are epoll FD used by Kafka > Consumers, indicating that they aren't closing. > Spark graph: > {code:java} > kafkaStream > .load() > .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") > .as[(String, String)] > .flatMap {...} > .groupByKey(...) > .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(...) > .foreach(...) > .outputMode(OutputMode.Update) > .option("checkpointLocation", > sparkConfiguration.properties.checkpointDirectory) > .start() > .awaitTermination(){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24987) Kafka Cached Consumer Leaking File Descriptors
[ https://issues.apache.org/jira/browse/SPARK-24987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24987: Assignee: Apache Spark > Kafka Cached Consumer Leaking File Descriptors > -- > > Key: SPARK-24987 > URL: https://issues.apache.org/jira/browse/SPARK-24987 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0, 2.3.1 > Environment: Spark 2.3.1 > Java(TM) SE Runtime Environment (build 1.8.0_112-b15) > Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed mode) > >Reporter: Yuval Itzchakov >Assignee: Apache Spark >Priority: Critical > > Setup: > * Spark 2.3.1 > * Java 1.8.0 (112) > * Standalone Cluster Manager > * 3 Nodes, 1 Executor per node. > Spark 2.3.0 introduced a new mechanism for caching Kafka consumers > (https://issues.apache.org/jira/browse/SPARK-23623?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel) > via KafkaDataConsumer.acquire. > It seems that there are situations (I've been trying to debug it, haven't > been able to find the root cause as of yet) where cached consumers remain "in > use" throughout the life time of the task and are never released. This can be > identified by the following line of the stack trace: > at > org.apache.spark.sql.kafka010.KafkaDataConsumer$.acquire(KafkaDataConsumer.scala:460) > Which points to: > {code:java} > } else if (existingInternalConsumer.inUse) { > // If consumer is already cached but is currently in use, then return a new > consumer > NonCachedKafkaDataConsumer(newInternalConsumer) > {code} > Meaning the existing consumer created for that `TopicPartition` is still in > use for some reason. The weird thing is that you can see this for very old > tasks which have already finished successfully. > I've traced down this leak using file leak detector, attaching it to the > running Executor JVM process. I've emitted the list of open file descriptors > which [you can find > here|https://gist.github.com/YuvalItzchakov/cdbdd7f67604557fccfbcce673c49e5d], > and you can see that the majority of them are epoll FD used by Kafka > Consumers, indicating that they aren't closing. > Spark graph: > {code:java} > kafkaStream > .load() > .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") > .as[(String, String)] > .flatMap {...} > .groupByKey(...) > .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(...) > .foreach(...) > .outputMode(OutputMode.Update) > .option("checkpointLocation", > sparkConfiguration.properties.checkpointDirectory) > .start() > .awaitTermination(){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24987) Kafka Cached Consumer Leaking File Descriptors
[ https://issues.apache.org/jira/browse/SPARK-24987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568022#comment-16568022 ] Apache Spark commented on SPARK-24987: -- User 'YuvalItzchakov' has created a pull request for this issue: https://github.com/apache/spark/pull/21983 > Kafka Cached Consumer Leaking File Descriptors > -- > > Key: SPARK-24987 > URL: https://issues.apache.org/jira/browse/SPARK-24987 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0, 2.3.1 > Environment: Spark 2.3.1 > Java(TM) SE Runtime Environment (build 1.8.0_112-b15) > Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed mode) > >Reporter: Yuval Itzchakov >Priority: Critical > > Setup: > * Spark 2.3.1 > * Java 1.8.0 (112) > * Standalone Cluster Manager > * 3 Nodes, 1 Executor per node. > Spark 2.3.0 introduced a new mechanism for caching Kafka consumers > (https://issues.apache.org/jira/browse/SPARK-23623?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel) > via KafkaDataConsumer.acquire. > It seems that there are situations (I've been trying to debug it, haven't > been able to find the root cause as of yet) where cached consumers remain "in > use" throughout the life time of the task and are never released. This can be > identified by the following line of the stack trace: > at > org.apache.spark.sql.kafka010.KafkaDataConsumer$.acquire(KafkaDataConsumer.scala:460) > Which points to: > {code:java} > } else if (existingInternalConsumer.inUse) { > // If consumer is already cached but is currently in use, then return a new > consumer > NonCachedKafkaDataConsumer(newInternalConsumer) > {code} > Meaning the existing consumer created for that `TopicPartition` is still in > use for some reason. The weird thing is that you can see this for very old > tasks which have already finished successfully. > I've traced down this leak using file leak detector, attaching it to the > running Executor JVM process. I've emitted the list of open file descriptors > which [you can find > here|https://gist.github.com/YuvalItzchakov/cdbdd7f67604557fccfbcce673c49e5d], > and you can see that the majority of them are epoll FD used by Kafka > Consumers, indicating that they aren't closing. > Spark graph: > {code:java} > kafkaStream > .load() > .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") > .as[(String, String)] > .flatMap {...} > .groupByKey(...) > .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(...) > .foreach(...) > .outputMode(OutputMode.Update) > .option("checkpointLocation", > sparkConfiguration.properties.checkpointDirectory) > .start() > .awaitTermination(){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24598) SPARK SQL:Datatype overflow conditions gives incorrect result
[ https://issues.apache.org/jira/browse/SPARK-24598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568005#comment-16568005 ] Marco Gaido commented on SPARK-24598: - [~smilegator] as we just enhanced the doc, but we have not really addressed the overflow condition, which I think we are targeting for a fix for 3.0, shall we leave this open for now and resolve it once the actual fix is in place? What do you think? Thanks. > SPARK SQL:Datatype overflow conditions gives incorrect result > - > > Key: SPARK-24598 > URL: https://issues.apache.org/jira/browse/SPARK-24598 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: navya >Assignee: Marco Gaido >Priority: Major > Fix For: 2.4.0 > > > Execute an sql query, so that it results in overflow conditions. > EX - SELECT 9223372036854775807 + 1 result = -9223372036854776000 > > Expected result - Error should be throw like mysql. > mysql> SELECT 9223372036854775807 + 1; > ERROR 1690 (22003): BIGINT value is out of range in '(9223372036854775807 + > 1)' -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25012) dataframe creation results in matcherror
uwe created SPARK-25012: --- Summary: dataframe creation results in matcherror Key: SPARK-25012 URL: https://issues.apache.org/jira/browse/SPARK-25012 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 2.3.1 Environment: spark 2.3.1 mac scala 2.11.12 Reporter: uwe hi, running the attached code results in a {code:java} scala.MatchError: 2017-02-09 00:09:27.0 (of class java.sql.Timestamp) {code} # i do think this is wrong (at least i do not see the issue in my code) # the error is the ein 90% of the cases (it sometimes passes). that makes me think something weird is going on {code:java} package misc import java.sql.Timestamp import java.time.LocalDateTime import java.time.format.DateTimeFormatter import org.apache.spark.rdd.RDD import org.apache.spark.sql.sources._ import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType} import org.apache.spark.sql.{Row, SQLContext, SparkSession} case class LogRecord(application:String, dateTime: Timestamp, component: String, level: String, message: String) class LogRelation(val sqlContext: SQLContext, val path: String) extends BaseRelation with PrunedFilteredScan { override def schema: StructType = StructType(Seq( StructField("application", StringType, false), StructField("dateTime", TimestampType, false), StructField("component", StringType, false), StructField("level", StringType, false), StructField("message", StringType, false))) override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = { val str = "2017-02-09T00:09:27" val ts =Timestamp.valueOf(LocalDateTime.parse(str, DateTimeFormatter.ISO_LOCAL_DATE_TIME)) val data=List(Row("app",ts,"comp","level","mess"),Row("app",ts,"comp","level","mess")) sqlContext.sparkContext.parallelize(data) } } class LogDataSource extends DataSourceRegister with RelationProvider { override def shortName(): String = "log" override def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation = new LogRelation(sqlContext, parameters("path")) } object f0 extends App { lazy val spark: SparkSession = SparkSession.builder().master("local").appName("spark session").getOrCreate() val df = spark.read.format("log").load("hdfs:///logs") df.show() } {code} results in the following stacktrace {noformat} 11:20:06 [task-result-getter-0] ERROR o.a.spark.scheduler.TaskSetManager - Task 0 in stage 0.0 failed 1 times; aborting job Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): scala.MatchError: 2017-02-09 00:09:27.0 (of class java.sql.Timestamp) at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:276) at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:275) at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103) at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:379) at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:60) at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:57) at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor
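One detail worth checking (an assumption on my part, not confirmed in the report): {{PrunedFilteredScan.buildScan}} is expected to return rows containing only {{requiredColumns}}, in that order, while the snippet above always returns all five columns. When the requested column order differs from the full schema, a Timestamp can land in a position Spark converts as a String, which would match this MatchError and its intermittent nature. A sketch of a buildScan that projects to the requested columns (a drop-in for the method in the snippet above, relying on the imports already in that file):
{code:scala}
override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
  val str = "2017-02-09T00:09:27"
  val ts = Timestamp.valueOf(LocalDateTime.parse(str, DateTimeFormatter.ISO_LOCAL_DATE_TIME))
  val fullRows = List(Row("app", ts, "comp", "level", "mess"), Row("app", ts, "comp", "level", "mess"))

  // Project each full row down to the requested columns, in the requested order.
  val indices = requiredColumns.map(schema.fieldNames.indexOf(_))
  val data = fullRows.map(row => Row.fromSeq(indices.map(row.get).toSeq))
  sqlContext.sparkContext.parallelize(data)
}
{code}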
[jira] [Commented] (SPARK-24928) spark sql cross join running time too long
[ https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16567986#comment-16567986 ] LIFULONG commented on SPARK-24928: -- The code {{for (x <- rdd1.iterator(currSplit.s1, context); y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)}} is from the CartesianRDD.compute() method; it looks like it will reload the right RDD from the text file for each record in the left RDD. > spark sql cross join running time too long > -- > > Key: SPARK-24928 > URL: https://issues.apache.org/jira/browse/SPARK-24928 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 1.6.2 >Reporter: LIFULONG >Priority: Minor > > spark sql running time is too long while input left table and right table is > small hdfs text format data, > the sql is: select * from t1 cross join t2 > the line of t1 is 49, three column > the line of t2 is 1, one column only > running more than 30mins and then failed > > > spark CartesianRDD also has the same problem, example test code is: > val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b") //1 line > 1 column > val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b") //49 > line 3 column > val cartesian = new CartesianRDD(sc, twos, ones) > cartesian.count() > running more than 5 mins,while use CartesianRDD(sc, ones, twos) , it only use > less than 10 seconds -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
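If that observation holds, a user-side mitigation (a sketch under that assumption, not a verified fix) is to persist the inner RDD so it is served from the block manager instead of being re-read from HDFS for every record of the outer RDD; the paths are the ones from the issue description:
{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("cartesian-cache-sketch").setMaster("local[*]"))

// Cache and materialize the inner (right-hand) RDD before the cartesian product.
val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b").cache()
ones.count()
val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b")

// Equivalent to new CartesianRDD(sc, twos, ones), but the inner RDD now comes from cache.
val cartesian = twos.cartesian(ones)
println(cartesian.count())
{code}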
[jira] [Commented] (SPARK-23911) High-order function: reduce(array, initialState S, inputFunction, outputFunction) → R
[ https://issues.apache.org/jira/browse/SPARK-23911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16567963#comment-16567963 ] Apache Spark commented on SPARK-23911: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/21982 > High-order function: reduce(array, initialState S, inputFunction, > outputFunction) → R > --- > > Key: SPARK-23911 > URL: https://issues.apache.org/jira/browse/SPARK-23911 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Herman van Hovell >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Returns a single value reduced from array. inputFunction will be invoked for > each element in array in order. In addition to taking the element, > inputFunction takes the current state, initially initialState, and returns > the new state. outputFunction will be invoked to turn the final state into > the result value. It may be the identity function (i -> i). > {noformat} > SELECT reduce(ARRAY [], 0, (s, x) -> s + x, s -> s); -- 0 > SELECT reduce(ARRAY [5, 20, 50], 0, (s, x) -> s + x, s -> s); -- 75 > SELECT reduce(ARRAY [5, 20, NULL, 50], 0, (s, x) -> s + x, s -> s); -- NULL > SELECT reduce(ARRAY [5, 20, NULL, 50], 0, (s, x) -> s + COALESCE(x, 0), s -> > s); -- 75 > SELECT reduce(ARRAY [5, 20, NULL, 50], 0, (s, x) -> IF(x IS NULL, s, s + x), > s -> s); -- 75 > SELECT reduce(ARRAY [2147483647, 1], CAST (0 AS BIGINT), (s, x) -> s + x, s > -> s); -- 2147483648 > SELECT reduce(ARRAY [5, 6, 10, 20], -- calculates arithmetic average: 10.25 > CAST(ROW(0.0, 0) AS ROW(sum DOUBLE, count INTEGER)), > (s, x) -> CAST(ROW(x + s.sum, s.count + 1) AS ROW(sum DOUBLE, > count INTEGER)), > s -> IF(s.count = 0, NULL, s.sum / s.count)); > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23909) High-order function: filter(array, function) → array
[ https://issues.apache.org/jira/browse/SPARK-23909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16567941#comment-16567941 ] Apache Spark commented on SPARK-23909: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/21982 > High-order function: filter(array, function) → array > -- > > Key: SPARK-23909 > URL: https://issues.apache.org/jira/browse/SPARK-23909 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Herman van Hovell >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Constructs an array from those elements of array for which function returns > true: > {noformat} > SELECT filter(ARRAY [], x -> true); -- [] > SELECT filter(ARRAY [5, -6, NULL, 7], x -> x > 0); -- [5, 7] > SELECT filter(ARRAY [5, NULL, 7, NULL], x -> x IS NOT NULL); -- [5, 7] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24993) Make Avro fast again
[ https://issues.apache.org/jira/browse/SPARK-24993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-24993. - Resolution: Fixed Fix Version/s: 2.4.0 > Make Avro fast again > > > Key: SPARK-24993 > URL: https://issues.apache.org/jira/browse/SPARK-24993 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: DB Tsai >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24993) Make Avro fast again
[ https://issues.apache.org/jira/browse/SPARK-24993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-24993: Affects Version/s: (was: 2.3.0) 2.4.0 > Make Avro fast again > > > Key: SPARK-24993 > URL: https://issues.apache.org/jira/browse/SPARK-24993 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: DB Tsai >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24989) BlockFetcher should retry while getting OutOfDirectMemoryError
[ https://issues.apache.org/jira/browse/SPARK-24989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Yuanjian resolved SPARK-24989. - Resolution: Not A Problem The param `spark.reducer.maxBlocksInFlightPerAddress` added in SPARK-21243 can solve this problem, close this jira. Thanks [~Dhruve Ashar] ! > BlockFetcher should retry while getting OutOfDirectMemoryError > -- > > Key: SPARK-24989 > URL: https://issues.apache.org/jira/browse/SPARK-24989 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.2.0 >Reporter: Li Yuanjian >Priority: Major > Attachments: FailedStage.png > > > h3. Description > This problem can be reproduced stably by a large parallelism job migrate from > map reduce to Spark in our practice, some metrics list below: > ||Item||Value|| > |spark.executor.instances|1000| > |spark.executor.cores|5| > |task number of shuffle writer stage|18038| > |task number of shuffle reader stage|8| > While the shuffle writer stage successful ended, the shuffle reader stage > starting and keep failing by FetchFail. Each fetch request need the netty > sever allocate a buffer in 16MB(detailed stack attached below), the huge > amount of fetch request will use up default maxDirectMemory rapidly, even > though we bump up io.netty.maxDirectMemory to 50GB! > {code:java} > org.apache.spark.shuffle.FetchFailedException: failed to allocate 16777216 > byte(s) of direct memory (used: 21474836480, max: 21474836480) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:514) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:445) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:61) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at > org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:199) > at > org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:119) > at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:105) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate > 16777216 byte(s) of direct memory (use
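For reference, a minimal sketch of applying the parameter mentioned in the resolution; the value 64 is only an illustration and should be tuned for the job in question:
{code:scala}
import org.apache.spark.sql.SparkSession

// Bound how many shuffle blocks a reducer may fetch from a single address at once,
// so the number of outstanding fetch buffers stays limited.
val spark = SparkSession.builder()
  .appName("fetch-throttle-sketch")
  .config("spark.reducer.maxBlocksInFlightPerAddress", "64")
  .getOrCreate()
{code}
The same setting can also be passed on spark-submit via {{--conf spark.reducer.maxBlocksInFlightPerAddress=64}}.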
[jira] [Resolved] (SPARK-25009) Standalone Cluster mode application submit is not working
[ https://issues.apache.org/jira/browse/SPARK-25009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-25009. - Resolution: Fixed Assignee: Devaraj K Fix Version/s: 2.4.0 > Standalone Cluster mode application submit is not working > - > > Key: SPARK-25009 > URL: https://issues.apache.org/jira/browse/SPARK-25009 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Devaraj K >Assignee: Devaraj K >Priority: Critical > Fix For: 2.4.0 > > > It is not showing any error while submitting but the app is not running and > as well as not showing in the web UI. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25011) Add PrefixSpan to __all__
[ https://issues.apache.org/jira/browse/SPARK-25011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25011. -- Resolution: Fixed Assignee: yuhao yang Fix Version/s: 2.4.0 Fixed in https://github.com/apache/spark/pull/21981 > Add PrefixSpan to __all__ > - > > Key: SPARK-25011 > URL: https://issues.apache.org/jira/browse/SPARK-25011 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.4.0 >Reporter: yuhao yang >Assignee: yuhao yang >Priority: Minor > Fix For: 2.4.0 > > > Add PrefixSpan to __all__ in fpm.py -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org