[jira] [Resolved] (SPARK-31195) Reuse days rebase functions of DateTimeUtils in DaysWritable
[ https://issues.apache.org/jira/browse/SPARK-31195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-31195.
----------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 27962
[https://github.com/apache/spark/pull/27962]

> Reuse days rebase functions of DateTimeUtils in DaysWritable
> ------------------------------------------------------------
>
>                 Key: SPARK-31195
>                 URL: https://issues.apache.org/jira/browse/SPARK-31195
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Maxim Gekk
>            Assignee: Maxim Gekk
>            Priority: Major
>             Fix For: 3.0.0
>
> The functions rebaseJulianToGregorianDays() and rebaseGregorianToJulianDays()
> were added by the PR https://github.com/apache/spark/pull/27915. This ticket
> aims to replace the similar code in org.apache.spark.sql.hive.DaysWritable
> with those functions in order to:
> # Deduplicate code
> # Reuse functions that are better tested and cross-checked by reading parquet
> files saved by Spark 2.4

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
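For readers unfamiliar with the day-rebase operation these functions perform, the idea can be sketched in plain Java. This is an illustrative approximation, not Spark's actual implementation (DateTimeUtils rebases via precomputed switch arrays for speed); it relies on java.util.GregorianCalendar implementing the legacy hybrid Julian/Gregorian calendar and java.time being proleptic Gregorian:

```java
import java.time.LocalDate;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;
import java.util.concurrent.TimeUnit;

public class RebaseDays {
    // Interpret `days` (a day count since 1970-01-01) in the legacy hybrid
    // Julian/Gregorian calendar, then re-encode the same local date in the
    // proleptic Gregorian calendar used by java.time (and by Spark 3.0).
    public static int rebaseJulianToGregorianDays(int days) {
        GregorianCalendar hybrid = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        hybrid.clear();
        hybrid.setTimeInMillis(TimeUnit.DAYS.toMillis(days));
        LocalDate gregorian = LocalDate.of(
                hybrid.get(Calendar.YEAR),
                hybrid.get(Calendar.MONTH) + 1,   // Calendar months are zero-based
                hybrid.get(Calendar.DAY_OF_MONTH));
        return (int) gregorian.toEpochDay();
    }

    public static void main(String[] args) {
        // Dates after the 1582-10-15 cutover are identical in both calendars.
        System.out.println(rebaseJulianToGregorianDays(0));       // 0
        // Pre-cutover dates shift by the Julian/Gregorian gap (7 days in the 1100s).
        System.out.println(rebaseJulianToGregorianDays(-300000)); // -300007
    }
}
```

The inverse direction (rebaseGregorianToJulianDays) runs the same mapping the other way: take the proleptic Gregorian local date and ask the hybrid calendar for its day count.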
[jira] [Assigned] (SPARK-31195) Reuse days rebase functions of DateTimeUtils in DaysWritable
[ https://issues.apache.org/jira/browse/SPARK-31195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-31195:
------------------------------------
    Assignee: Maxim Gekk
[jira] [Commented] (SPARK-31139) Fileformat datasources (ORC, Json) case sensitivity regressions
[ https://issues.apache.org/jira/browse/SPARK-31139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063127#comment-17063127 ]

Xiao Li commented on SPARK-31139:
---------------------------------
ping [~viirya] [~dongjoon]

> Fileformat datasources (ORC, Json) case sensitivity regressions
> ---------------------------------------------------------------
>
>                 Key: SPARK-31139
>                 URL: https://issues.apache.org/jira/browse/SPARK-31139
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Tae-kyeom, Kim
>            Priority: Blocker
>         Attachments: FileBasedDataSourceSuite.scala.diff
>
> In addition to https://issues.apache.org/jira/browse/SPARK-31116: not only
> parquet, but also json and orc have case sensitivity issues. The following
> test failures are based on SPARK-31116's test cases (the diff of
> FileBasedDataSourceSuite is in the attachment).
>
> {code:java}
> [info] - SPARK-31116: Select simple columns correctly in case insensitive manner *** FAILED *** (4 seconds, 277 milliseconds)
> [info] Results do not match for query:
> [info] Timezone: sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]]
> [info] Timezone Env:
> [info]
> [info] == Parsed Logical Plan ==
> [info] Relation[camelcase#56] json
> [info]
> [info] == Analyzed Logical Plan ==
> [info] camelcase: string
> [info] Relation[camelcase#56] json
> [info]
> [info] == Optimized Logical Plan ==
> [info] Relation[camelcase#56] json
> [info]
> [info] == Physical Plan ==
> [info] FileScan json [camelcase#56] Batched: false, DataFilters: [], Format: JSON, Location: InMemoryFileIndex[file:/Users/kimtkyeom/Dev/spark_devel/target/tmp/spark-95f1357a-85c9-444f-bdcc-..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct
> [info]
> [info] == Results ==
> [info]
> [info] == Results ==
> [info] !== Correct Answer - 1 ==   == Spark Answer - 1 ==
> [info] !struct<>                   struct
> [info] ![A]                        [null] (QueryTest.scala:248)
> [info] - SPARK-31116: Select nested columns correctly in case insensitive manner *** FAILED *** (2 seconds, 117 milliseconds)
> [info] Results do not match for query:
> [info] Timezone: sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]]
> [info] Timezone Env:
> [info]
> [info] == Parsed Logical Plan ==
> [info] Relation[StructColumn#147] json
> [info]
> [info] == Analyzed Logical Plan ==
> [info] StructColumn: struct
> [info] Relation[StructColumn#147] json
> [info]
> [info] == Optimized Logical Plan ==
> [info] Relation[StructColumn#147] json
> [info]
> [info] == Physical Plan ==
> [info] FileScan json [StructColumn#147] Batched: false, DataFilters: [], Format: JSON, Location: InMemoryFileIndex[file:/Users/kimtkyeom/Dev/spark_devel/target/tmp/spark-f9ecd1a4-e5aa-4dd7-bdfd-..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct>
> [info]
> [info] == Results ==
> [info]
> [info] == Results ==
> [info] !== Correct Answer - 1 ==   == Spark Answer - 1 ==
> [info] !struct<>                   struct>
> [info] ![[0,1]]                    [[null,null]] (QueryTest.scala:248)
> [info] - SPARK-31116: Select nested columns correctly in case sensitive manner *** FAILED *** (871 milliseconds)
> [info] Results do not match for query:
> [info] Timezone: sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]]
> [info] Timezone Env:
> [info]
> [info] == Parsed Logical Plan ==
> [info] Relation[StructColumn#329] json
> [info]
> [info] == Analyzed Logical Plan ==
> [info] StructColumn: struct
> [info] Relation[StructColumn#329] json
> [info]
> [info] == Optimized Logical Plan ==
> [info] Relation[StructColumn#329] json
> [info]
> [info] == Physical Plan ==
> [info] FileScan json [StructColumn#329] Batched: false, DataFilters: [], Format: JSON, Location: InMemoryFileIndex[file:/Users/kimtkyeom/Dev/spark_devel/target/tmp/spark-612baf76-a9d0-41e5-89f4-..., PartitionFilters: [],
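The regression above boils down to how a requested column name is matched against the physical file schema. A minimal sketch of the expected resolution rule, as a hypothetical helper in plain Java (not Spark's actual code): in case-insensitive mode the query's "camelCase" must resolve to the file's "camelcase" instead of silently resolving to nothing, which is what surfaces as the null values in the failing tests.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FieldResolver {
    // Resolve a requested column against the physical file schema.
    // Case-sensitive mode requires an exact match; case-insensitive mode
    // matches ignoring case and must raise on ambiguity rather than pick
    // an arbitrary field.
    public static String resolve(List<String> physicalFields, String requested, boolean caseSensitive) {
        if (caseSensitive) {
            return physicalFields.contains(requested) ? requested : null;
        }
        List<String> matches = physicalFields.stream()
                .filter(f -> f.equalsIgnoreCase(requested))
                .collect(Collectors.toList());
        if (matches.size() > 1) {
            throw new IllegalArgumentException("Ambiguous column: " + requested);
        }
        return matches.isEmpty() ? null : matches.get(0);
    }

    public static void main(String[] args) {
        // The JSON file physically stores "camelcase"; the query asks for "camelCase".
        List<String> fileSchema = Arrays.asList("camelcase");
        System.out.println(resolve(fileSchema, "camelCase", false)); // camelcase
        System.out.println(resolve(fileSchema, "camelCase", true));  // null
    }
}
```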
[jira] [Closed] (SPARK-31193) set spark.master and spark.app.name conf default value
[ https://issues.apache.org/jira/browse/SPARK-31193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun closed SPARK-31193.
---------------------------------

> set spark.master and spark.app.name conf default value
> ------------------------------------------------------
>
>                 Key: SPARK-31193
>                 URL: https://issues.apache.org/jira/browse/SPARK-31193
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.1.0
>            Reporter: daile
>            Priority: Major
>
> I see the default value of the master setting in the spark-submit client:
> {code:java}
> // Global defaults. These should be keep to minimum to avoid confusing behavior.
> master = Option(master).getOrElse("local[*]")
> {code}
> but during development and debugging we run into this kind of problem:
> Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
> This conflicts with the default setting:
> {code:java}
> // If we do
> val sparkConf = new SparkConf().setAppName("app")
> // then when using the client to submit tasks to the cluster, the master will
> // be overwritten by the local one
> sparkConf.set("spark.master", "local[*]")
> {code}
> so we have to do this instead:
> {code:java}
> val sparkConf = new SparkConf().setAppName("app")
> // Because a master set at runtime takes priority, we first have to check
> // whether the master is already set, to avoid breaking cluster submission.
> sparkConf.set("spark.master", sparkConf.get("spark.master", "local[*]"))
> {code}
> The same applies to spark.app.name.
> Would it be better to handle this for users the way the submit client does?
[jira] [Updated] (SPARK-31193) set spark.master and spark.app.name conf default value
[ https://issues.apache.org/jira/browse/SPARK-31193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-31193:
----------------------------------
    Affects Version/s: (was: 2.4.5)
                       (was: 2.4.4)
                       (was: 2.4.3)
                       (was: 2.4.2)
                       (was: 2.3.3)
                       (was: 2.4.0)
                       (was: 2.3.0)
[jira] [Resolved] (SPARK-31193) set spark.master and spark.app.name conf default value
[ https://issues.apache.org/jira/browse/SPARK-31193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-31193.
-----------------------------------
    Resolution: Not A Bug
[jira] [Updated] (SPARK-31193) set spark.master and spark.app.name conf default value
[ https://issues.apache.org/jira/browse/SPARK-31193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-31193:
----------------------------------
    Target Version/s:   (was: 3.1.0)
[jira] [Updated] (SPARK-31193) set spark.master and spark.app.name conf default value
[ https://issues.apache.org/jira/browse/SPARK-31193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-31193:
----------------------------------
    Fix Version/s:   (was: 3.1.0)
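For what it's worth, the hand-rolled check in the SPARK-31193 description can be written more directly: SparkConf already exposes setIfMissing(key, value), which applies a value only when the key is unset. The precedence idea, sketched in plain Java so it runs without a Spark dependency:

```java
import java.util.HashMap;
import java.util.Map;

public class ConfDefaults {
    // Mimics SparkConf.setIfMissing: the fallback applies only when the key
    // was not already set (e.g. by spark-submit or by explicit user code).
    public static void setIfMissing(Map<String, String> conf, String key, String value) {
        conf.putIfAbsent(key, value);
    }

    public static void main(String[] args) {
        // Local debugging: nothing set the master, so the default kicks in.
        Map<String, String> local = new HashMap<>();
        setIfMissing(local, "spark.master", "local[*]");
        System.out.println(local.get("spark.master"));     // local[*]

        // Cluster submission: spark-submit already provided a master,
        // and the default must not override it.
        Map<String, String> submitted = new HashMap<>();
        submitted.put("spark.master", "yarn");
        setIfMissing(submitted, "spark.master", "local[*]");
        System.out.println(submitted.get("spark.master")); // yarn
    }
}
```

In Scala application code this is just `sparkConf.setIfMissing("spark.master", "local[*]")`, which stays safe under spark-submit because the submit client's settings are loaded from system properties when the SparkConf is constructed, before setIfMissing runs.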
[jira] [Updated] (SPARK-31171) size(null) should return null under ansi mode
[ https://issues.apache.org/jira/browse/SPARK-31171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-31171:
----------------------------------
        Parent: SPARK-31085
    Issue Type: Sub-task  (was: Improvement)

> size(null) should return null under ansi mode
> ---------------------------------------------
>
>                 Key: SPARK-31171
>                 URL: https://issues.apache.org/jira/browse/SPARK-31171
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>            Priority: Major
>             Fix For: 3.0.0
[jira] [Commented] (SPARK-31136) Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax
[ https://issues.apache.org/jira/browse/SPARK-31136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063092#comment-17063092 ]

Dongjoon Hyun commented on SPARK-31136:
---------------------------------------
BTW, [~kabhwan]'s thread should be considered as another topic because it's about "Resolve ambiguous parser rule".

> Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-31136
>                 URL: https://issues.apache.org/jira/browse/SPARK-31136
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Dongjoon Hyun
>            Priority: Blocker
>
> We need to consider the behavior change of SPARK-30098.
> This is a placeholder to keep the discussion and the final decision.
> `CREATE TABLE` syntax changes its behavior silently.
> The following is one example of breaking existing user data pipelines.
>
> *Apache Spark 2.4.5*
> {code}
> spark-sql> CREATE TABLE t(a STRING);
> spark-sql> LOAD DATA INPATH '/usr/local/spark/README.md' INTO TABLE t;
> spark-sql> SELECT * FROM t LIMIT 1;
> # Apache Spark
> Time taken: 2.05 seconds, Fetched 1 row(s)
> {code}
> {code}
> spark-sql> CREATE TABLE t(a CHAR(3));
> spark-sql> INSERT INTO TABLE t SELECT 'a ';
> spark-sql> SELECT a, length(a) FROM t;
> a 3
> {code}
>
> *Apache Spark 3.0.0-preview2*
> {code}
> spark-sql> CREATE TABLE t(a STRING);
> spark-sql> LOAD DATA INPATH '/usr/local/spark/README.md' INTO TABLE t;
> Error in query: LOAD DATA is not supported for datasource tables: `default`.`t`;
> {code}
> {code}
> spark-sql> CREATE TABLE t(a CHAR(3));
> spark-sql> INSERT INTO TABLE t SELECT 'a ';
> spark-sql> SELECT a, length(a) FROM t;
> a 2
> {code}
[jira] [Resolved] (SPARK-31136) Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax
[ https://issues.apache.org/jira/browse/SPARK-31136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-31136.
-----------------------------------
    Resolution: Won't Do

Hi, All. This issue was specifically for `Reverting SPARK-30098`. I'm closing it as "Won't Do" because we discussed it here and did not reach agreement on reverting. In other words, we are going to move forward instead of simply reverting. SPARK-31147 will follow up with a proper action on the CHAR type. SPARK-31133 will follow up on the documentation (including the syntax changes and their meaning, and maybe the `LOAD` behavior). We may open more follow-ups, but not under this issue. For any further request to revert SPARK-30098, we will reuse this issue for discussion.
[jira] [Updated] (SPARK-31136) Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax
[ https://issues.apache.org/jira/browse/SPARK-31136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-31136:
----------------------------------
    Labels:   (was: correctness)
[jira] [Updated] (SPARK-31181) Remove the default value assumption on CREATE TABLE test cases
[ https://issues.apache.org/jira/browse/SPARK-31181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-31181:
----------------------------------
    Affects Version/s:   (was: 3.1.0)
                       3.0.0

> Remove the default value assumption on CREATE TABLE test cases
> --------------------------------------------------------------
>
>                 Key: SPARK-31181
>                 URL: https://issues.apache.org/jira/browse/SPARK-31181
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL, Tests
>    Affects Versions: 3.0.0
>            Reporter: Dongjoon Hyun
>            Assignee: Dongjoon Hyun
>            Priority: Minor
>             Fix For: 3.0.0
[jira] [Updated] (SPARK-31181) Remove the default value assumption on CREATE TABLE test cases
[ https://issues.apache.org/jira/browse/SPARK-31181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-31181:
----------------------------------
    Fix Version/s:   (was: 3.1.0)
                   3.0.0
[jira] [Resolved] (SPARK-31181) Remove the default value assumption on CREATE TABLE test cases
[ https://issues.apache.org/jira/browse/SPARK-31181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-31181.
-----------------------------------
    Fix Version/s: 3.1.0
       Resolution: Fixed
[jira] [Assigned] (SPARK-31181) Remove the default value assumption on CREATE TABLE test cases
[ https://issues.apache.org/jira/browse/SPARK-31181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-31181:
-------------------------------------
    Assignee: Dongjoon Hyun
[jira] [Issue Comment Deleted] (SPARK-31199) Separate connection timeout and idle timeout for shuffle
[ https://issues.apache.org/jira/browse/SPARK-31199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

runnings updated SPARK-31199:
-----------------------------
    Comment: was deleted

(was: cc [~rxin] [~aaron.davidson_impala_647b] who worked on [https://github.com/apache/spark/pull/5584] before)

> Separate connection timeout and idle timeout for shuffle
> --------------------------------------------------------
>
>                 Key: SPARK-31199
>                 URL: https://issues.apache.org/jira/browse/SPARK-31199
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>    Affects Versions: 3.1.0
>            Reporter: runnings
>            Priority: Major
>
> spark.shuffle.io.connectionTimeout is only used as the timeout for connection
> setup, while spark.shuffle.io.idleTimeout controls how long before an
> apparently idle connection is killed
> ([https://github.com/apache/spark/pull/5584]).
>
> These two timeouts can be quite different, and a shorter connection timeout
> can help shuffle tasks fail fast in some cases.
[jira] [Comment Edited] (SPARK-31199) Separate connection timeout and idle timeout for shuffle
[ https://issues.apache.org/jira/browse/SPARK-31199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063067#comment-17063067 ]

runnings edited comment on SPARK-31199 at 3/20/20, 4:15 AM:
------------------------------------------------------------
cc [~rxin] [~aaron.davidson_impala_647b] who worked on [https://github.com/apache/spark/pull/5584] before

was (Author: runnings):
cc [~rxin] who worked on [https://github.com/apache/spark/pull/5584] before
[jira] [Updated] (SPARK-31199) Separate connection timeout and idle timeout for shuffle
[ https://issues.apache.org/jira/browse/SPARK-31199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

runnings updated SPARK-31199:
-----------------------------
    Description:
spark.shuffle.io.connectionTimeout only used for connection timeout for connection setup while spark.shuffle.io.idleTimeout is used to control how long to kill the connection if it seems to be idle([https://github.com/apache/spark/pull/5584])

These 2 timeouts could be quite different and shorten connectiontimeout could help fast fail the shuffle task in some cases

  was:
spark.shuffle.io.connectionTimeout only used for connection timeout for connection setup while spark.shuffle.io.idleTimeout is used to control how long to kill the connection if it seems to be idle([#27963|https://github.com/apache/spark/pull/27963])

These 2 timeouts could be quite different and shorten connectiontimeout could help fast fail the shuffle task in some cases
[jira] [Commented] (SPARK-31199) Separate connection timeout and idle timeout for shuffle
[ https://issues.apache.org/jira/browse/SPARK-31199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063067#comment-17063067 ]

runnings commented on SPARK-31199:
----------------------------------
cc [~rxin] who worked on [https://github.com/apache/spark/pull/5584] before
[jira] [Created] (SPARK-31199) Separate connection timeout and idle timeout for shuffle
runnings created SPARK-31199: Summary: Separate connection timeout and idle timeout for shuffle Key: SPARK-31199 URL: https://issues.apache.org/jira/browse/SPARK-31199 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 3.1.0 Reporter: runnings spark.shuffle.io.connectionTimeout is only used as the timeout for connection setup, while spark.shuffle.io.idleTimeout controls how long an idle connection is kept before it is closed ([#27963|https://github.com/apache/spark/pull/27963]) These two timeouts can be quite different, and a shorter connection timeout could help shuffle tasks fail fast in some cases -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
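The distinction the ticket draws can be made concrete with a configuration sketch. The property names below are taken from the ticket text itself; the values, and the idea of tuning the two independently, are illustrative assumptions rather than recommended defaults — check your Spark version's configuration docs before relying on them:

```properties
# spark-defaults.conf -- illustrative sketch, not recommended values.

# How long connection *setup* may take before the shuffle fetch fails fast.
spark.shuffle.io.connectionTimeout  30s

# How long an *established* connection may sit idle before it is closed.
spark.shuffle.io.idleTimeout        120s
```

With a single shared timeout, failing fast on setup would also kill healthy-but-quiet connections; separating the two is exactly what the ticket asks for.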
[jira] [Commented] (SPARK-30951) Potential data loss for legacy applications after switch to proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-30951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063050#comment-17063050 ] Dongjoon Hyun commented on SPARK-30951: --- Thanks. Per [~cloud_fan]'s comment, the `correctness` label has been removed. However, it seems we need more documentation along the lines of that comment. > Potential data loss for legacy applications after switch to proleptic > Gregorian calendar > > > Key: SPARK-30951 > URL: https://issues.apache.org/jira/browse/SPARK-30951 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Bruce Robbins >Priority: Blocker > > tl;dr: We recently discovered some Spark 2.x sites that have lots of data > containing dates before October 15, 1582. This could be an issue when such > sites try to upgrade to Spark 3.0. > From SPARK-26651: > {quote}"The changes might impact on the results for dates and timestamps > before October 15, 1582 (Gregorian) > {quote} > We recently discovered that some large scale Spark 2.x applications rely on > dates before October 15, 1582. > Two cases came up recently: > * An application that uses a commercial third-party library to encode > sensitive dates. On insert, the library encodes the actual date as some other > date. On select, the library decodes the date back to the original date. The > encoded value could be any date, including one before October 15, 1582 (e.g., > "0602-04-04"). > * An application that uses a specific unlikely date (e.g., "1200-01-01") as > a marker to indicate "unknown date" (in lieu of null) > Both sites ran into problems after another component in their system was > upgraded to use the proleptic Gregorian calendar. Spark applications that > read files created by the upgraded component were interpreting encoded or > marker dates incorrectly, and vice versa. 
Also, their data now had a mix of > calendars (hybrid and proleptic Gregorian) with no metadata to indicate which > file used which calendar. > Both sites had enormous amounts of existing data, so re-encoding the dates > using some other scheme was not a feasible solution. > This is relevant to Spark 3: > Any Spark 2 application that uses such date-encoding schemes may run into > trouble when run on Spark 3. The application may not properly interpret the > dates previously written by Spark 2. Also, once the Spark 3 version of the > application writes data, the tables will have a mix of calendars (hybrid and > proleptic gregorian) with no metadata to indicate which file uses which > calendar. > Similarly, sites might run with mixed Spark versions, resulting in data > written by one version that cannot be interpreted by the other. And as above, > the tables will now have a mix of calendars with no way to detect which file > uses which calendar. > As with the two real-life example cases, these applications may have enormous > amounts of legacy data, so re-encoding the dates using some other scheme may > not be feasible. > We might want to consider a configuration setting to allow the user to > specify the calendar for storing and retrieving date and timestamp values > (not sure how such a flag would affect other date and timestamp-related > functions). I realize the change is far bigger than just adding a > configuration setting. > Here's a quick example of where trouble may happen, using the real-life case > of the marker date. > In Spark 2.4: > {noformat} > scala> spark.read.orc(s"$home/data/datefile").filter("dt == > '1200-01-01'").count > res0: Long = 1 > scala> > {noformat} > In Spark 3.0 (reading from the same legacy file): > {noformat} > scala> spark.read.orc(s"$home/data/datefile").filter("dt == > '1200-01-01'").count > res0: Long = 0 > scala> > {noformat} > By the way, Hive had a similar problem. 
Hive switched from hybrid calendar to > proleptic Gregorian calendar between 2.x and 3.x. After some upgrade > headaches related to dates before 1582, the Hive community made the following > changes: > * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive > checks a configuration setting to determine which calendar to use. > * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive > stores the calendar type in the metadata. > * When reading date or timestamp data from ORC, Parquet, and Avro files, > Hive checks the metadata for the calendar type. > * When reading date or timestamp data from ORC, Parquet, and Avro files that > lack calendar metadata, Hive's behavior is determined by a configuration > setting. This allows Hive to read legacy data (note: if the data already > consists of a mix of calendar types with no metadata, there is no good > solution).
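The discrepancy behind the marker-date example above can be reproduced outside Spark. The sketch below is not Spark's actual rebase implementation; it is a standalone illustration using Fliegel–Van Flandern style day-number formulas, showing that the same label "1200-01-01" denotes physical days seven days apart depending on which calendar wrote it — which is why an un-rebased filter can silently match nothing:

```python
# Illustration of the hybrid (Julian) vs. proleptic Gregorian calendar gap.
# Standalone sketch, not Spark's rebase code.

def julian_day_gregorian(year, month, day):
    # Day number of a proleptic Gregorian calendar date.
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return (day + (153 * m + 2) // 5 + 365 * y
            + y // 4 - y // 100 + y // 400 - 32045)

def julian_day_julian(year, month, day):
    # Day number of a Julian calendar date (in effect before the 1582 cutover).
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083

# The marker date from the report: the same "1200-01-01" label is 7 physical
# days apart depending on which calendar produced it.
gap = julian_day_julian(1200, 1, 1) - julian_day_gregorian(1200, 1, 1)
print(gap)  # 7
```

A Spark 2.4 writer (hybrid calendar) and a Spark 3.0 reader (proleptic Gregorian) thus disagree on which stored day number means "1200-01-01", matching the `count` difference shown in the description.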
[jira] [Updated] (SPARK-30951) Potential data loss for legacy applications after switch to proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-30951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30951: -- Labels: (was: correctness) > Potential data loss for legacy applications after switch to proleptic > Gregorian calendar > > > Key: SPARK-30951 > URL: https://issues.apache.org/jira/browse/SPARK-30951 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Bruce Robbins >Priority: Blocker > > tl;dr: We recently discovered some Spark 2.x sites that have lots of data > containing dates before October 15, 1582. This could be an issue when such > sites try to upgrade to Spark 3.0. > From SPARK-26651: > {quote}"The changes might impact on the results for dates and timestamps > before October 15, 1582 (Gregorian) > {quote} > We recently discovered that some large scale Spark 2.x applications rely on > dates before October 15, 1582. > Two cases came up recently: > * An application that uses a commercial third-party library to encode > sensitive dates. On insert, the library encodes the actual date as some other > date. On select, the library decodes the date back to the original date. The > encoded value could be any date, including one before October 15, 1582 (e.g., > "0602-04-04"). > * An application that uses a specific unlikely date (e.g., "1200-01-01") as > a marker to indicate "unknown date" (in lieu of null) > Both sites ran into problems after another component in their system was > upgraded to use the proleptic Gregorian calendar. Spark applications that > read files created by the upgraded component were interpreting encoded or > marker dates incorrectly, and vice versa. Also, their data now had a mix of > calendars (hybrid and proleptic Gregorian) with no metadata to indicate which > file used which calendar. > Both sites had enormous amounts of existing data, so re-encoding the dates > using some other scheme was not a feasible solution. 
> This is relevant to Spark 3: > Any Spark 2 application that uses such date-encoding schemes may run into > trouble when run on Spark 3. The application may not properly interpret the > dates previously written by Spark 2. Also, once the Spark 3 version of the > application writes data, the tables will have a mix of calendars (hybrid and > proleptic gregorian) with no metadata to indicate which file uses which > calendar. > Similarly, sites might run with mixed Spark versions, resulting in data > written by one version that cannot be interpreted by the other. And as above, > the tables will now have a mix of calendars with no way to detect which file > uses which calendar. > As with the two real-life example cases, these applications may have enormous > amounts of legacy data, so re-encoding the dates using some other scheme may > not be feasible. > We might want to consider a configuration setting to allow the user to > specify the calendar for storing and retrieving date and timestamp values > (not sure how such a flag would affect other date and timestamp-related > functions). I realize the change is far bigger than just adding a > configuration setting. > Here's a quick example of where trouble may happen, using the real-life case > of the marker date. > In Spark 2.4: > {noformat} > scala> spark.read.orc(s"$home/data/datefile").filter("dt == > '1200-01-01'").count > res0: Long = 1 > scala> > {noformat} > In Spark 3.0 (reading from the same legacy file): > {noformat} > scala> spark.read.orc(s"$home/data/datefile").filter("dt == > '1200-01-01'").count > res0: Long = 0 > scala> > {noformat} > By the way, Hive had a similar problem. Hive switched from hybrid calendar to > proleptic Gregorian calendar between 2.x and 3.x. 
After some upgrade > headaches related to dates before 1582, the Hive community made the following > changes: > * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive > checks a configuration setting to determine which calendar to use. > * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive > stores the calendar type in the metadata. > * When reading date or timestamp data from ORC, Parquet, and Avro files, > Hive checks the metadata for the calendar type. > * When reading date or timestamp data from ORC, Parquet, and Avro files that > lack calendar metadata, Hive's behavior is determined by a configuration > setting. This allows Hive to read legacy data (note: if the data already > consists of a mix of calendar types with no metadata, there is no good > solution). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apach
[jira] [Updated] (SPARK-26293) Cast exception when having python udf in subquery
[ https://issues.apache.org/jira/browse/SPARK-26293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26293: - Fix Version/s: (was: 2.4.1) 2.4.6 > Cast exception when having python udf in subquery > - > > Key: SPARK-26293 > URL: https://issues.apache.org/jira/browse/SPARK-26293 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0, 2.4.6 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25121) Support multi-part column name for hint resolution
[ https://issues.apache.org/jira/browse/SPARK-25121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25121: -- Affects Version/s: (was: 3.1.0) 3.0.0 > Support multi-part column name for hint resolution > -- > > Key: SPARK-25121 > URL: https://issues.apache.org/jira/browse/SPARK-25121 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 3.0.0 > > > After supporting multi-part names in > https://github.com/apache/spark/pull/17185, we also need to consider how to > resolve the hints for broadcast hints. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30932) ML 3.0 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-30932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reassigned SPARK-30932: Assignee: zhengruifeng > ML 3.0 QA: API: Java compatibility, docs > > > Key: SPARK-30932 > URL: https://issues.apache.org/jira/browse/SPARK-30932 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Major > Attachments: 1_process_script.sh, added_ml_class, common_ml_class, > signature.diff > > > Check Java compatibility for this release: > * APIs in {{spark.ml}} > * New APIs in {{spark.mllib}} (There should be few, if any.) > Checking compatibility means: > * Checking for differences in how Scala and Java handle types. Some items to > look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. These may not > be understood in Java, or they may be accessible only via the weirdly named > Java types (with "$" or "#") which are generated by the Scala compiler. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. (In {{spark.ml}}, we have largely tried to avoid > using enumerations, and have instead favored plain strings.) > * Check for differences in generated Scala vs Java docs. E.g., one past > issue was that Javadocs did not respect Scala's package private modifier. > If you find issues, please comment here, or for larger items, create separate > JIRAs and link here as "requires". > * Remember that we should not break APIs from previous releases. 
If you find > a problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a java-friendly > version of the API). > * If needed for complex issues, create small Java unit tests which execute > each method. (Algorithmic correctness can be checked in Scala.) > Recommendations for how to complete this task: > * There are no great tools. In the past, this task has been done by: > ** Generating API docs > ** Building the JAR and outputting the Java class signatures for MLlib > ** Manually inspecting and searching the docs and class signatures for issues > * If you do have ideas for better tooling, please say so, so that we can make > this task easier in the future! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
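The signature-audit step described above (dump class signatures from two releases, then compare) can be sketched as a small diff script. This is a hedged stand-in for the ticket's attached `1_process_script.sh` workflow, not that script itself; the input format and the example class names are assumptions for illustration:

```python
# Compare class-signature dumps (e.g. produced with javap) from two releases
# and report what was added or removed, so each difference can be audited.

def diff_signatures(old, new):
    old_set, new_set = set(old), set(new)
    added = sorted(new_set - old_set)
    removed = sorted(old_set - new_set)
    return added, removed

# Illustrative one-line-per-class dumps; a real audit would read the
# javap output files for each release.
spark24 = ["class org.apache.spark.ml.feature.Word2Vec",
           "class org.apache.spark.ml.feature.OneHotEncoder"]
spark30 = ["class org.apache.spark.ml.feature.Word2Vec",
           "class org.apache.spark.ml.feature.RobustScaler"]

added, removed = diff_signatures(spark24, spark30)
print(added)    # new API to check for Java compatibility
print(removed)  # removed/renamed API to flag in the migration guide
```

Anything in `removed` is a candidate API break; anything in `added` needs the Java-compatibility checks listed above.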
[jira] [Resolved] (SPARK-25121) Support multi-part column name for hint resolution
[ https://issues.apache.org/jira/browse/SPARK-25121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-25121. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27935 [https://github.com/apache/spark/pull/27935] > Support multi-part column name for hint resolution > -- > > Key: SPARK-25121 > URL: https://issues.apache.org/jira/browse/SPARK-25121 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Xiao Li >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 3.0.0 > > > After supporting multi-part names in > https://github.com/apache/spark/pull/17185, we also need to consider how to > resolve the hints for broadcast hints. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30932) ML 3.0 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-30932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-30932. -- Resolution: Fixed > ML 3.0 QA: API: Java compatibility, docs > > > Key: SPARK-30932 > URL: https://issues.apache.org/jira/browse/SPARK-30932 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Major > Attachments: 1_process_script.sh, added_ml_class, common_ml_class, > signature.diff > > > Check Java compatibility for this release: > * APIs in {{spark.ml}} > * New APIs in {{spark.mllib}} (There should be few, if any.) > Checking compatibility means: > * Checking for differences in how Scala and Java handle types. Some items to > look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. These may not > be understood in Java, or they may be accessible only via the weirdly named > Java types (with "$" or "#") which are generated by the Scala compiler. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. (In {{spark.ml}}, we have largely tried to avoid > using enumerations, and have instead favored plain strings.) > * Check for differences in generated Scala vs Java docs. E.g., one past > issue was that Javadocs did not respect Scala's package private modifier. > If you find issues, please comment here, or for larger items, create separate > JIRAs and link here as "requires". > * Remember that we should not break APIs from previous releases. 
If you find > a problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a java-friendly > version of the API). > * If needed for complex issues, create small Java unit tests which execute > each method. (Algorithmic correctness can be checked in Scala.) > Recommendations for how to complete this task: > * There are no great tools. In the past, this task has been done by: > ** Generating API docs > ** Building the JAR and outputting the Java class signatures for MLlib > ** Manually inspecting and searching the docs and class signatures for issues > * If you do have ideas for better tooling, please say so, so that we can make > this task easier in the future! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25121) Support multi-part column name for hint resolution
[ https://issues.apache.org/jira/browse/SPARK-25121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-25121: - Assignee: Takeshi Yamamuro > Support multi-part column name for hint resolution > -- > > Key: SPARK-25121 > URL: https://issues.apache.org/jira/browse/SPARK-25121 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Xiao Li >Assignee: Takeshi Yamamuro >Priority: Major > > After supporting multi-part names in > https://github.com/apache/spark/pull/17185, we also need to consider how to > resolve the hints for broadcast hints. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30935) Update MLlib, GraphX websites for 3.0
[ https://issues.apache.org/jira/browse/SPARK-30935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reassigned SPARK-30935: Assignee: Huaxin Gao > Update MLlib, GraphX websites for 3.0 > - > > Key: SPARK-30935 > URL: https://issues.apache.org/jira/browse/SPARK-30935 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: Huaxin Gao >Priority: Critical > > Update the sub-projects' websites to include new features in this release. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30931) ML 3.0 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-30931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reassigned SPARK-30931: Assignee: Huaxin Gao > ML 3.0 QA: API: Python API coverage > --- > > Key: SPARK-30931 > URL: https://issues.apache.org/jira/browse/SPARK-30931 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: Huaxin Gao >Priority: Major > > For new public APIs added to MLlib ({{spark.ml}} only), we need to check the > generated HTML doc and compare the Scala & Python versions. > * *GOAL*: Audit and create JIRAs to fix in the next release. > * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues. > We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally > either necessary (intentional) or accidental. These must be recorded and > added in the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > *Please use a _separate_ JIRA (linked below as "requires") for this list of > to-do items.* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31170) Spark Cli does not respect hive-site.xml and spark.sql.warehouse.dir
[ https://issues.apache.org/jira/browse/SPARK-31170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31170: -- Fix Version/s: (was: 3.0.0) > Spark Cli does not respect hive-site.xml and spark.sql.warehouse.dir > > > Key: SPARK-31170 > URL: https://issues.apache.org/jira/browse/SPARK-31170 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > > In Spark CLI, we create a Hive CliSessionState and it does not load > hive-site.xml. So the configurations in hive-site.xml will not take effect > as they do in other Spark-Hive integration apps. > Also, the warehouse directory is not correctly picked. If the `default` > database does not exist, the CliSessionState will create one the first > time it talks to the metastore. The `Location` of the default DB will be > neither the value of spark.sql.warehouse.dir nor the user-specified value of > hive.metastore.warehouse.dir, but the default value of > hive.metastore.warehouse.dir, which will always be `/user/hive/warehouse`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-31170) Spark Cli does not respect hive-site.xml and spark.sql.warehouse.dir
[ https://issues.apache.org/jira/browse/SPARK-31170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-31170: --- This was reverted because it broke all `hive-1.2` profile Jenkins jobs (2 SBT / 2 Maven). > Spark Cli does not respect hive-site.xml and spark.sql.warehouse.dir > > > Key: SPARK-31170 > URL: https://issues.apache.org/jira/browse/SPARK-31170 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > In Spark CLI, we create a Hive CliSessionState and it does not load > hive-site.xml. So the configurations in hive-site.xml will not take effect > as they do in other Spark-Hive integration apps. > Also, the warehouse directory is not correctly picked. If the `default` > database does not exist, the CliSessionState will create one the first > time it talks to the metastore. The `Location` of the default DB will be > neither the value of spark.sql.warehouse.dir nor the user-specified value of > hive.metastore.warehouse.dir, but the default value of > hive.metastore.warehouse.dir, which will always be `/user/hive/warehouse`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
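As context for the bug, the warehouse settings involved look roughly like the sketch below. The paths are placeholders; per Spark's documented behavior since 2.0, `spark.sql.warehouse.dir` (when set) takes precedence over `hive.metastore.warehouse.dir` from hive-site.xml:

```xml
<!-- $SPARK_HOME/conf/hive-site.xml -- illustrative placeholder paths -->
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/apps/hive/warehouse</value>
  </property>
</configuration>
```

The reported bug is that the CLI's CliSessionState bypasses both this file and `spark.sql.warehouse.dir` (e.g. set in spark-defaults.conf), falling back to the hard-wired Hive default `/user/hive/warehouse` when it creates the `default` database.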
[jira] [Created] (SPARK-31198) Use graceful decommissioning as part of dynamic scaling
Holden Karau created SPARK-31198: Summary: Use graceful decommissioning as part of dynamic scaling Key: SPARK-31198 URL: https://issues.apache.org/jira/browse/SPARK-31198 Project: Spark Issue Type: Sub-task Components: Kubernetes Affects Versions: 3.1.0 Reporter: Holden Karau Assignee: Holden Karau -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31197) Exit the executor once all tasks & migrations are finished
Holden Karau created SPARK-31197: Summary: Exit the executor once all tasks & migrations are finished Key: SPARK-31197 URL: https://issues.apache.org/jira/browse/SPARK-31197 Project: Spark Issue Type: Sub-task Components: Kubernetes Affects Versions: 3.1.0 Reporter: Holden Karau -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31196) Server-side processing of History UI list of applications
[ https://issues.apache.org/jira/browse/SPARK-31196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pavol Vidlička updated SPARK-31196: --- Description: Loading the list of applications in the History UI does not scale well with a large number of applications. Fetching and rendering the list for 10k+ applications takes over a minute (much longer for more applications) and tends to freeze the browser. Using `spark.history.ui.maxApplications` is not a great solution, because (as the name implies) it limits the number of applications shown in the UI, which hinders usability of the History Server. A solution would be to use [server-side processing of the DataTable|https://datatables.net/examples/data_sources/server_side]. This would limit the amount of data sent to the client and processed by the browser. This proposed change plays nicely with the KVStore abstraction implemented in SPARK-18085, which was supposed to solve some of the scalability issues. It could also help solve History UI scalability issues reported for example in SPARK-21254, SPARK-17243, SPARK-17671 was: Loading the list of applications in the History UI does not scale well with a large number of applications. Fetching and rendering the list for 10k+ applications takes over a minute. Using `spark.history.ui.maxApplications` is not a great solution, because (as the name implies) it limits the number of applications shown in the UI, which hinders usability of the History Server. A solution would be to use [server-side processing of the DataTable|https://datatables.net/examples/data_sources/server_side]. This would limit the amount of data sent to the client and processed by the browser. This proposed change plays nicely with the KVStore abstraction implemented in SPARK-18085, which was supposed to solve some of the scalability issues. 
It could also help solve History UI scalability issues reported for example in SPARK-21254, SPARK-17243, SPARK-17671 > Server-side processing of History UI list of applications > - > > Key: SPARK-31196 > URL: https://issues.apache.org/jira/browse/SPARK-31196 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.0, 2.4.5 >Reporter: Pavol Vidlička >Priority: Minor > > Loading the list of applications in the History UI does not scale well with a > large number of applications. Fetching and rendering the list for 10k+ > applications takes over a minute (much longer for more applications) and > tends to freeze the browser. > Using `spark.history.ui.maxApplications` is not a great solution, because (as > the name implies) it limits the number of applications shown in the UI, > which hinders usability of the History Server. > A solution would be to use [server-side processing of the > DataTable|https://datatables.net/examples/data_sources/server_side]. This > would limit the amount of data sent to the client and processed by the browser. > This proposed change plays nicely with the KVStore abstraction implemented in > SPARK-18085, which was supposed to solve some of the scalability issues. It > could also help solve History UI scalability issues reported for > example in SPARK-21254, SPARK-17243, SPARK-17671 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
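The server-side model the ticket links to can be sketched independently of Spark. The handler below is a minimal stand-in: an in-memory list of application summaries replaces the History Server's KVStore, and the `draw`/`start`/`length`/`recordsTotal`/`recordsFiltered` names follow the DataTables server-side protocol. It is an illustration of the idea, not the proposed History Server change:

```python
# Minimal sketch of a DataTables server-side responder: the browser asks
# for one page at a time instead of downloading all 10k+ applications.

def datatables_page(apps, draw, start, length, search=""):
    """Return one page of rows plus the counts DataTables needs."""
    if search:
        filtered = [a for a in apps if search.lower() in a["id"].lower()]
    else:
        filtered = apps
    return {
        "draw": draw,                      # echoed so the client can drop stale responses
        "recordsTotal": len(apps),         # size before filtering
        "recordsFiltered": len(filtered),  # size after filtering
        "data": filtered[start:start + length],  # only the requested slice
    }

apps = [{"id": f"app-{i:05d}"} for i in range(10_000)]
page = datatables_page(apps, draw=1, start=100, length=25)
print(page["recordsTotal"], len(page["data"]))  # 10000 25
```

Only 25 rows cross the wire per request, which is what keeps the browser responsive regardless of how many applications the History Server knows about.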
[jira] [Updated] (SPARK-30836) Improve the decommissioning K8s integration tests
[ https://issues.apache.org/jira/browse/SPARK-30836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau updated SPARK-30836: - Fix Version/s: 3.1.0 > Improve the decommissioning K8s integration tests > - > > Key: SPARK-30836 > URL: https://issues.apache.org/jira/browse/SPARK-30836 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Minor > Fix For: 3.1.0 > > > See [https://github.com/apache/spark/pull/26440#discussion_r373155825] & > [https://github.com/apache/spark/pull/26440#discussion_r373153511] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30836) Improve the decommissioning K8s integration tests
[ https://issues.apache.org/jira/browse/SPARK-30836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062944#comment-17062944 ] Holden Karau commented on SPARK-30836: -- Resolved in [https://github.com/apache/spark/pull/27905] > Improve the decommissioning K8s integration tests > - > > Key: SPARK-30836 > URL: https://issues.apache.org/jira/browse/SPARK-30836 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Minor > Fix For: 3.1.0 > > > See [https://github.com/apache/spark/pull/26440#discussion_r373155825] & > [https://github.com/apache/spark/pull/26440#discussion_r373153511] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30836) Improve the decommissioning K8s integration tests
[ https://issues.apache.org/jira/browse/SPARK-30836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau resolved SPARK-30836. -- Assignee: Holden Karau Resolution: Fixed > Improve the decommissioning K8s integration tests > - > > Key: SPARK-30836 > URL: https://issues.apache.org/jira/browse/SPARK-30836 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Minor > > See [https://github.com/apache/spark/pull/26440#discussion_r373155825] & > [https://github.com/apache/spark/pull/26440#discussion_r373153511] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20629) Copy shuffle data when nodes are being shut down using PVs
[ https://issues.apache.org/jira/browse/SPARK-20629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau updated SPARK-20629:
Component/s: (was: Spark Core) Kubernetes

> Copy shuffle data when nodes are being shut down using PVs
> Key: SPARK-20629
> URL: https://issues.apache.org/jira/browse/SPARK-20629
> Project: Spark
> Issue Type: Sub-task
> Components: Kubernetes
> Affects Versions: 3.0.0
> Reporter: Holden Karau
> Priority: Major
>
> We decided not to do this for YARN, but for EC2/GCE and similar systems nodes
> may be shut down entirely without the ability to keep an AuxiliaryService
> around.
[jira] [Updated] (SPARK-20629) Copy shuffle data when nodes are being shut down using PVs
[ https://issues.apache.org/jira/browse/SPARK-20629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau updated SPARK-20629:
Description: We decided not to do this for YARN, but for Kubernetes and similar systems nodes may be shut down entirely without the ability to keep an AuxiliaryService around.
(was: We decided not to do this for YARN, but for EC2/GCE and similar systems nodes may be shut down entirely without the ability to keep an AuxiliaryService around.)
[jira] [Updated] (SPARK-20629) Copy shuffle data when nodes are being shut down using PVs
[ https://issues.apache.org/jira/browse/SPARK-20629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau updated SPARK-20629:
Summary: Copy shuffle data when nodes are being shut down using PVs (was: Copy shuffle data when nodes are being shut down)
[jira] [Resolved] (SPARK-30981) Fix flaky "Test basic decommissioning" test
[ https://issues.apache.org/jira/browse/SPARK-30981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau resolved SPARK-30981.
Fix Version/s: 3.1.0
Assignee: Holden Karau
Resolution: Fixed

> Fix flaky "Test basic decommissioning" test
> Key: SPARK-30981
> URL: https://issues.apache.org/jira/browse/SPARK-30981
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes, Tests
> Affects Versions: 3.1.0
> Reporter: Dongjoon Hyun
> Assignee: Holden Karau
> Priority: Major
> Fix For: 3.1.0
>
> - https://github.com/apache/spark/pull/27721
> {code}
> - Test basic decommissioning *** FAILED ***
> The code passed to eventually never returned normally. Attempted 126 times
> over 2.010095245067 minutes. Last failure message: "++ id -u
> {code}
[jira] [Commented] (SPARK-30981) Fix flaky "Test basic decommissioning" test
[ https://issues.apache.org/jira/browse/SPARK-30981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062942#comment-17062942 ] Holden Karau commented on SPARK-30981:
I believe this was resolved in https://github.com/apache/spark/pull/27905
[jira] [Updated] (SPARK-31196) Server-side processing of History UI list of applications
[ https://issues.apache.org/jira/browse/SPARK-31196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pavol Vidlička updated SPARK-31196:
Affects Version/s: 2.4.5

> Server-side processing of History UI list of applications
> Key: SPARK-31196
> URL: https://issues.apache.org/jira/browse/SPARK-31196
> Project: Spark
> Issue Type: Improvement
> Components: Web UI
> Affects Versions: 2.3.0, 2.4.5
> Reporter: Pavol Vidlička
> Priority: Minor
>
> Loading the list of applications in the History UI does not scale well with a
> large number of applications. Fetching and rendering the list for 10k+
> applications takes over a minute.
> Using `spark.history.ui.maxApplications` is not a great solution because, as
> the name implies, it limits the number of applications shown in the UI, which
> hinders usability of the History Server.
> A solution would be to use [server-side processing of the
> DataTable|https://datatables.net/examples/data_sources/server_side]. This
> would limit the amount of data sent to the client and processed by the
> browser.
> This proposed change plays nicely with the KVStore abstraction implemented in
> SPARK-18085, which was supposed to solve some of the scalability issues. It
> could also help resolve the History UI scalability issues reported, for
> example, in SPARK-21254, SPARK-17243, and SPARK-17671.
[jira] [Updated] (SPARK-31196) Server-side processing of History UI list of applications
[ https://issues.apache.org/jira/browse/SPARK-31196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pavol Vidlička updated SPARK-31196: --- Description: Loading the list of applications in the History UI does not scale well with a large number of applications. Fetching and rendering the list for 10k+ applications takes over a minute. Using `spark.history.ui.maxApplications` is not a great solution, because (as the name implies), it limits the number of applications shown in the UI, which hinders usability of the History Server. A solution would be to use server [side processing of the DataTable|https://datatables.net/examples/data_sources/server_side]. This would limit amount of data sent to the client and processed by the browser. This proposed change plays nicely with KVStore abstraction implemented in SPARK-18085, which was supposed to solve some of the scalability issues. It could also definitely solve History UI scalability issues reported for example in SPARK-21254, SPARK-17243, SPARK-17671 was: Loading the list of applications in the History UI does not scale well with a large number of applications. Fetching and rendering the list for 10k+ applications takes over a minute. Using `spark.history.ui.maxApplications` is not a great solution, because (as the name implies), it limits the number of applications shown in the UI, which hinders usability of the History Server. A solution would be to use server [side processing of the DataTable|https://datatables.net/examples/data_sources/server_side]. This would limit amount of data sent to the client and processed by the browser. This proposed change plays nicely with KVStore abstraction implemented in SPARK-18085, which was supposed to solve some of the scalability issues. 
[jira] [Updated] (SPARK-31196) Server-side processing of History UI list of applications
[ https://issues.apache.org/jira/browse/SPARK-31196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pavol Vidlička updated SPARK-31196: --- Description: Loading the list of applications in the History UI does not scale well with a large number of applications. Fetching and rendering the list for 10k+ applications takes over a minute. Using `spark.history.ui.maxApplications` is not a great solution, because (as the name implies), it limits the number of applications shown in the UI, which hinders usability of the History Server. A solution would be to use server [side processing of the DataTable|https://datatables.net/examples/data_sources/server_side]. This would limit amount of data sent to the client and processed by the browser. This proposed change plays nicely with KVStore abstraction implemented in SPARK-18085, which was supposed to solve some of the scalability issues. was: Loading the list of applications in the History UI does not scale well with a large number of applications. Fetching and rendering the list for 10k+ applications takes over a minute. Using `spark.history.ui.maxApplications` is not a great solution, because (as the name implies), it limits the number of applications shown in the UI, which hinders usability of the History Server. A solution would be to use [server side processing of the DataTable](https://datatables.net/examples/data_sources/server_side). This would limit amount of data sent to the client and processed by the browser. This proposed change plays nicely with KVStore abstraction implemented in SPARK-18085, which was supposed to solve some of the scalability issues. 
[jira] [Created] (SPARK-31196) Server-side processing of History UI list of applications
Pavol Vidlička created SPARK-31196:
Summary: Server-side processing of History UI list of applications
Key: SPARK-31196
URL: https://issues.apache.org/jira/browse/SPARK-31196
Project: Spark
Issue Type: Improvement
Components: Web UI
Affects Versions: 2.3.0
Reporter: Pavol Vidlička

Loading the list of applications in the History UI does not scale well with a large number of applications. Fetching and rendering the list for 10k+ applications takes over a minute.
Using `spark.history.ui.maxApplications` is not a great solution because, as the name implies, it limits the number of applications shown in the UI, which hinders usability of the History Server.
A solution would be to use [server side processing of the DataTable](https://datatables.net/examples/data_sources/server_side). This would limit the amount of data sent to the client and processed by the browser.
This proposed change plays nicely with the KVStore abstraction implemented in SPARK-18085, which was supposed to solve some of the scalability issues.
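The server-side approach proposed above can be sketched as a paging handler: the browser sends the DataTables `draw`/`start`/`length`/`search` parameters and the server returns only the requested slice. This is an illustrative sketch, not Spark History Server code; the in-memory list of apps stands in for the KVStore, and only the parameter names follow the DataTables server-side protocol.

```python
def datatables_page(apps, draw, start, length, search=""):
    """Return a DataTables-style server-side response for one page.

    `apps` stands in for the full application list (e.g. backed by the
    KVStore); only `length` rows starting at `start` go to the browser.
    """
    filtered = [a for a in apps if search.lower() in a["name"].lower()] if search else apps
    return {
        "draw": draw,                           # echoed back so the client can match responses
        "recordsTotal": len(apps),              # size before filtering
        "recordsFiltered": len(filtered),       # size after the search filter
        "data": filtered[start:start + length]  # only the visible page
    }
```

With 10k+ applications, each request then moves only one page of rows (e.g. 100) instead of the whole list, which is the scalability win the ticket describes.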
[jira] [Created] (SPARK-31195) Reuse days rebase functions of DateTimeUtils in DaysWritable
Maxim Gekk created SPARK-31195:
Summary: Reuse days rebase functions of DateTimeUtils in DaysWritable
Key: SPARK-31195
URL: https://issues.apache.org/jira/browse/SPARK-31195
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk

The functions rebaseJulianToGregorianDays() and rebaseGregorianToJulianDays() were added by the PR https://github.com/apache/spark/pull/27915. The ticket aims to replace similar code in org.apache.spark.sql.hive.DaysWritable with these functions in order to:
# Deduplicate code
# Reuse functions that are better tested and were cross-checked by reading parquet files saved by Spark 2.4
[jira] [Commented] (SPARK-29767) Core dump happening on executors while doing simple union of Data Frames
[ https://issues.apache.org/jira/browse/SPARK-29767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062898#comment-17062898 ] Udit Mehrotra commented on SPARK-29767:
[~hyukjin.kwon] Can you take a look? There has been no activity on this for months now. I have provided the executor dump. Please let me know if there is any more information I can provide to help drive this.

> Core dump happening on executors while doing simple union of Data Frames
> Key: SPARK-29767
> URL: https://issues.apache.org/jira/browse/SPARK-29767
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Spark Core
> Affects Versions: 2.4.4
> Environment: AWS EMR 5.27.0, Spark 2.4.4
> Reporter: Udit Mehrotra
> Priority: Major
> Attachments: coredump.zip, hs_err_pid13885.log,
> part-0-0189b5c2-7f7b-4d0e-bdb8-506380253597-c000.snappy.parquet
>
> Running a union operation on two DataFrames through both the Scala Spark
> shell and PySpark results in executor containers doing a *core dump* and
> exiting with exit code 134.
> The trace from the *Driver*:
> {noformat}
> Container exited with a non-zero exit code 134
> .
> 19/11/06 02:21:35 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times;
> aborting job
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in
> stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0
> (TID 5, ip-172-30-6-79.ec2.internal, executor 11): ExecutorLostFailure
> (executor 11 exited caused by one of the running tasks) Reason: Container
> from a bad node: container_1572981097605_0021_01_77 on host:
> ip-172-30-6-79.ec2.internal. Exit status: 134. Diagnostics: Exception from
> container-launch. 
> Container id: container_1572981097605_0021_01_77 > Exit code: 134 > Exception message: /bin/bash: line 1: 12611 Aborted > LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native" > /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' > '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' > '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' > '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' > -Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp > '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' > -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77 > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 > --executor-id 11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id > application_1572981097605_0021 --user-class-path > file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar > > > /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stdout > 2> > /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stderrStack > trace: ExitCodeException exitCode=134: /bin/bash: line 1: 12611 Aborted > > LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native" > /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' > 
'-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' > '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' > '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' > -Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp > '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' > -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77 > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 > --executor-id 11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id > application_1572981097605_0021 --user-class-path > file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar > > > /var/log/
[jira] [Commented] (SPARK-31173) Spark Kubernetes add tolerations and nodeName support
[ https://issues.apache.org/jira/browse/SPARK-31173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062785#comment-17062785 ] Jiaxin Shan commented on SPARK-31173:
I am trying to get more details. There are two levels of performance issues:
# Since every pod needs to be mutated by the webhook, it drags down overall throughput.
# Node selectors, tolerations, and node affinities have an impact on Kubernetes scheduler performance.
Could you confirm whether your benchmark difference reflects both of the points above? BTW, tolerations should be supported via the pod template in the 3.0.0 release.

> Spark Kubernetes add tolerations and nodeName support
> Key: SPARK-31173
> URL: https://issues.apache.org/jira/browse/SPARK-31173
> Project: Spark
> Issue Type: New Feature
> Components: Kubernetes
> Affects Versions: 3.1.0, 2.4.6
> Environment: Alibaba Cloud ACK with spark operator (v1beta2-1.1.0-2.4.5) and spark (2.4.5)
> Reporter: zhongwei liu
> Priority: Trivial
> Labels: features
> Original Estimate: 72h
> Remaining Estimate: 72h
>
> When you run Spark on a serverless Kubernetes cluster (virtual-kubelet), you
> need to specify nodeSelectors, tolerations, and even nodeName if you want to
> gain better scheduling performance. Currently Spark doesn't support
> tolerations. If you want to use this feature, you must use an admission
> controller webhook to decorate the pod, but its performance is extremely bad.
> Here is the benchmark:
> With webhook: Batch size: 500; pod creation: about 7 pods/s; all pods running: 5 min
> Without webhook: Batch size: 500; pod creation: more than 500 pods/s; all pods running: 45 s
> Adding tolerations and nodeName support in Spark would bring great help when
> running a large-scale job on a serverless Kubernetes cluster.
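To make the trade-off discussed above concrete, the sketch below models the per-pod work the webhook path performs: mutating each pod spec to append a toleration. This is an illustrative model using plain dictionaries, not the Kubernetes or Spark APIs, and the taint key shown is only an example; with pod-template or native support, the same fields would be set once at submission time instead of being patched into every pod by a webhook.

```python
def add_toleration(pod_spec, key, value, effect="NoSchedule"):
    """Return a copy of a pod spec with one toleration appended,
    mimicking what a mutating admission webhook would patch in."""
    patched = dict(pod_spec)
    tolerations = list(patched.get("tolerations", []))
    tolerations.append({"key": key, "operator": "Equal",
                        "value": value, "effect": effect})
    patched["tolerations"] = tolerations
    return patched
```

Running this logic inside an admission webhook puts it on the critical path of every pod creation, which is consistent with the reported drop from 500+ pods/s to about 7 pods/s.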
[jira] [Updated] (SPARK-31191) Spark SQL and hive metastore are incompatible
[ https://issues.apache.org/jira/browse/SPARK-31191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leishuiyu updated SPARK-31191: -- Environment: the spark version 2.3.0 the hive version 2.3.3 was: the spark version 2.3.0 the hive version 2.3.3 > Spark SQL and hive metastore are incompatible > - > > Key: SPARK-31191 > URL: https://issues.apache.org/jira/browse/SPARK-31191 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: the spark version 2.3.0 > the hive version 2.3.3 >Reporter: leishuiyu >Priority: Major > Fix For: 2.3.0 > > > # > h3. When I execute bin/spark-sql, an exception occurs > > {code:java} > Caused by: java.lang.RuntimeException: Unable to instantiate > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClientCaused by: > java.lang.RuntimeException: Unable to instantiate > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient at > org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1523) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:86) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104) > at > org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005) > at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024) at > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503) > ... 
12 moreCaused by: java.lang.reflect.InvocationTargetException at > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at > org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521) > ... 18 moreCaused by: MetaException(message:Hive Schema version 1.2.0 does > not match metastore's schema version 2.3.0 Metastore is not upgraded or > corrupt) at > org.apache.hadoop.hive.metastore.ObjectStore.checkSchema(ObjectStore.java:6679) > at > org.apache.hadoop.hive.metastore.ObjectStore.verifySchema(ObjectStore.java:6645) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:114) > at com.sun.proxy.$Proxy6.verifySchema(Unknown Source) at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:572) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:620) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:461) > at > org.apache.hadoop.hive.metastore.RetryingHMSHandler.(RetryingHMSHandler.java:66) > at > org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72) > at > org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:199) > at > 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74)
> ... 23 more
> {code}
> h3. 2. Find the reason
> Inspecting the source code, the Spark jars directory contains
> hive-metastore-1.2.1.spark2.jar. Client version 1.2.1 maps to schema version
> 1.2.0, which does not match the metastore's schema version 2.3.0, hence the
> exception:
> {code:java}
> private static final Map EQUIVALENT_VERSIONS =
> ImmutableMap.of("0.13.1", "0.13.0",
> "1.0.0", "0.14.0",
> "1.0.1", "1.0.0",
> "1.1.1", "1.1.0",
> "1.2.1", "1.2.0"
> );
> {code}
> h3. 3. Is there any solution to this problem
> One can edit hive-site.xml and set hive.metastore.schema.verification, but
> new problems may arise.
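The EQUIVALENT_VERSIONS map quoted above drives the schema check that fails here: the client's build version is first mapped to the schema version it expects, then compared with the version recorded in the metastore database. The sketch below is a minimal model of that comparison for illustration, not Hive's actual code.

```python
# Mirrors the EQUIVALENT_VERSIONS map quoted from the Hive metastore source.
EQUIVALENT_VERSIONS = {
    "0.13.1": "0.13.0",
    "1.0.0": "0.14.0",
    "1.0.1": "1.0.0",
    "1.1.1": "1.1.0",
    "1.2.1": "1.2.0",
}

def expected_schema_version(client_version):
    # A client built as 1.2.1 expects metastore schema version 1.2.0.
    return EQUIVALENT_VERSIONS.get(client_version, client_version)

def is_compatible(client_version, metastore_schema_version):
    return expected_schema_version(client_version) == metastore_schema_version
```

This reproduces the reported failure mode: Spark's bundled hive-metastore-1.2.1 client expects schema 1.2.0, while the metastore database was initialized by Hive 2.3.3 as schema 2.3.0.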
[jira] [Updated] (SPARK-31194) spark sql runs successfully with query not specifying condition next to where
[ https://issues.apache.org/jira/browse/SPARK-31194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ayoub Omari updated SPARK-31194:
Description: When having a SQL query as follows:
SELECT *
FROM people
WHERE
shouldn't we throw a parsing exception because of the unspecified condition?
(was: When having a sql query as follows: ``` SELECT * FROM people WHERE ``` shouldn't we throw a parsing exception because of the unspecified condition?)

> spark sql runs successfully with query not specifying condition next to where
> Key: SPARK-31194
> URL: https://issues.apache.org/jira/browse/SPARK-31194
> Project: Spark
> Issue Type: Story
> Components: SQL
> Affects Versions: 2.4.5
> Reporter: Ayoub Omari
> Priority: Major
>
> When having a SQL query as follows:
> SELECT *
> FROM people
> WHERE
> shouldn't we throw a parsing exception because of the unspecified condition?
[jira] [Created] (SPARK-31194) spark sql runs successfully with query not specifying condition next to where
Ayoub Omari created SPARK-31194:
Summary: spark sql runs successfully with query not specifying condition next to where
Key: SPARK-31194
URL: https://issues.apache.org/jira/browse/SPARK-31194
Project: Spark
Issue Type: Story
Components: SQL
Affects Versions: 2.4.5
Reporter: Ayoub Omari

When having a SQL query as follows:
```
SELECT *
FROM people
WHERE
```
shouldn't we throw a parsing exception because of the unspecified condition?
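For illustration, a toy check of the kind the reporter expects: reject a statement that ends in a bare WHERE with no condition. This is not Spark's ANTLR-based parser; one plausible (but unconfirmed here) reason Spark accepts the query is that WHERE is a non-reserved keyword, so it can be parsed as a table alias rather than the start of a filter clause.

```python
import re

def has_dangling_where(sql):
    """Toy validator: True if the statement ends with WHERE and no condition."""
    return re.search(r"\bWHERE\s*;?\s*$", sql, flags=re.IGNORECASE) is not None
```

Such a check would flag the reported query before execution while leaving complete queries untouched.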
[jira] [Updated] (SPARK-31193) set spark.master and spark.app.name conf default value
[ https://issues.apache.org/jira/browse/SPARK-31193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] daile updated SPARK-31193:
Description: I see the default value of the master setting in the spark-submit client:
{code:java}
// Global defaults. These should be keep to minimum to avoid confusing behavior.
master = Option(master).getOrElse("local[*]")
{code}
but during our development and debugging we run into this problem:
Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
This conflicts with the default setting:
{code:java}
// If we do
val sparkConf = new SparkConf().setAppName("app")
// When using the client to submit tasks to the cluster, the master will be overwritten by the local value
sparkConf.set("spark.master", "local[*]")
{code}
so we have to do this:
{code:java}
val sparkConf = new SparkConf().setAppName("app")
// Because a master set by the program takes priority, we first have to check whether the master is already set, to avoid breaking cluster submission.
sparkConf.set("spark.master", sparkConf.get("spark.master", "local[*]"))
{code}
The same applies to spark.app.name. Would it be better for these to get defaults the way the submit client handles them?
(was: the same description with the code blocks in raw ``` fences)

> set spark.master and spark.app.name conf default value
> Key: SPARK-31193
> URL: https://issues.apache.org/jira/browse/SPARK-31193
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.3.0, 2.3.3, 2.4.0, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 3.1.0
> Reporter: daile
> Priority: Major
> Fix For: 3.1.0
>
> I see the default value of the master setting in the spark-submit client:
> {code:java}
> // Global defaults. These should be keep to minimum to avoid confusing behavior.
> master = Option(master).getOrElse("local[*]")
> {code}
> but during our development and debugging we run into this problem:
> Exception in thread "main" org.apache.spark.SparkException: A master URL must
> be set in your configuration
> This conflicts with the default setting:
> {code:java}
> // If we do
> val sparkConf = new SparkConf().setAppName("app")
> // When using the client to submit tasks to the cluster, the master will be
> overwritten by the local value
> sparkConf.set("spark.master", "local[*]")
> {code}
> so we have to do this:
> {code:java}
> val sparkConf = new SparkConf().setAppName("app")
> // Because a master set by the program takes priority, we first have to check
> whether the master is already set, to avoid breaking cluster submission.
> sparkConf.set("spark.master", sparkConf.get("spark.master", "local[*]"))
> {code}
> The same applies to spark.app.name. Would it be better for these to get
> defaults the way the submit client handles them?
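The conflict the reporter describes comes from precedence: values set programmatically on SparkConf override whatever spark-submit supplied, so an unconditional default clobbers the cluster master, while the guarded form preserves it. A minimal model of that precedence using plain dictionaries (illustrative, not the Spark API):

```python
def effective_conf(submit_conf, program_sets):
    """Program-set values win over spark-submit values; this models why an
    unconditional set of spark.master hides the submitted cluster master."""
    conf = dict(submit_conf)
    conf.update(program_sets)
    return conf

# Unconditional default: the submitted master is lost.
lost = effective_conf({"spark.master": "yarn"}, {"spark.master": "local[*]"})

# Guarded default, as the report suggests: keep the master if already set.
submit = {"spark.master": "yarn"}
kept = effective_conf(
    submit, {"spark.master": submit.get("spark.master", "local[*]")})
```

In the first case a job submitted with `--master yarn` would silently run in local mode; in the second, the local default applies only when no master was supplied.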
[jira] [Updated] (SPARK-31193) set spark.master and spark.app.name conf default value
[ https://issues.apache.org/jira/browse/SPARK-31193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] daile updated SPARK-31193: -- Description: I see the default value of the master setting in the spark-submit client:

{code:java}
// Global defaults. These should be keep to minimum to avoid confusing behavior.
master = Option(master).getOrElse("local[*]")
{code}

but during development and debugging we run into this problem:

Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration

This conflicts with the default setting:

{code:java}
// If we do
val sparkConf = new SparkConf().setAppName("app")
// then, when using the client to submit tasks to the cluster, the master will be overwritten by the local value
sparkConf.set("spark.master", "local[*]")
{code}

so we have to do it like this:

{code:java}
val sparkConf = new SparkConf().setAppName("app")
// Because the master set in the program takes priority, we first check whether it is already set, so that cluster submission is not overridden.
sparkConf.set("spark.master", sparkConf.get("spark.master", "local[*]"))
{code}

The same applies to spark.app.name. Would it be better to handle this for users, as the spark-submit client does?

> set spark.master and spark.app.name conf default value
> --
>
> Key: SPARK-31193
> URL: https://issues.apache.org/jira/browse/SPARK-31193
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.3.0, 2.3.3, 2.4.0, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 3.1.0
> Reporter: daile
> Priority: Major
> Fix For: 3.1.0
>
> I see the default value of the master setting in the spark-submit client:
> {code:java}
> // Global defaults. These should be keep to minimum to avoid confusing behavior.
> master = Option(master).getOrElse("local[*]")
> {code}
> but during development and debugging we run into this problem:
> Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
> This conflicts with the default setting:
> {code:java}
> // If we do
> val sparkConf = new SparkConf().setAppName("app")
> // then, when using the client to submit tasks to the cluster, the master will be overwritten by the local value
> sparkConf.set("spark.master", "local[*]")
> {code}
> so we have to do it like this:
> {code:java}
> val sparkConf = new SparkConf().setAppName("app")
> // Because the master set in the program takes priority, we first check whether it is already set, so that cluster submission is not overridden.
> sparkConf.set("spark.master", sparkConf.get("spark.master", "local[*]"))
> {code}
> The same applies to spark.app.name. Would it be better to handle this for users, as the spark-submit client does? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
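The fallback pattern discussed in SPARK-31193 — apply a default only when the user (or spark-submit) has not already set a value — can be sketched without Spark. Below, a plain Python dict stands in for SparkConf; the helper name is hypothetical, not Spark's API:

```python
# Minimal sketch of the "default only if unset" pattern discussed above.
# A plain dict stands in for SparkConf; set_if_missing mirrors
# sparkConf.set("spark.master", sparkConf.get("spark.master", "local[*]")).

def set_if_missing(conf, key, default):
    """Apply a default without clobbering a value set elsewhere
    (e.g. by spark-submit on cluster submission)."""
    conf[key] = conf.get(key, default)
    return conf

# Local debugging: nothing set yet, so the default wins.
local_conf = set_if_missing({}, "spark.master", "local[*]")

# Cluster submission: the client already set the master; the default must not override it.
cluster_conf = set_if_missing({"spark.master": "yarn"}, "spark.master", "local[*]")
```

The same guard would apply to spark.app.name or any other key the submit client defaults.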
[jira] [Created] (SPARK-31193) set spark.master and spark.app.name conf default value
daile created SPARK-31193: - Summary: set spark.master and spark.app.name conf default value Key: SPARK-31193 URL: https://issues.apache.org/jira/browse/SPARK-31193 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.5, 2.4.4, 2.4.3, 2.4.2, 2.4.0, 2.3.3, 2.3.0, 3.1.0 Reporter: daile Fix For: 3.1.0 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31192) Introduce PushProjectThroughLimit
Ali Afroozeh created SPARK-31192: Summary: Introduce PushProjectThroughLimit Key: SPARK-31192 URL: https://issues.apache.org/jira/browse/SPARK-31192 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Ali Afroozeh Currently the {{CollapseProject}} rule does many things: not only does it collapse stacked projects, it also pushes projects down into limits, windows, etc. In this PR we factored the rules that push projects into limits out of {{CollapseProject}} and introduced a new rule called {{PushProjectThroughLimit}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
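The rewrite that {{PushProjectThroughLimit}} performs can be illustrated on a toy plan tree. The node classes below are hypothetical stand-ins, not Catalyst's actual operators; the sketch only shows the shape of the transformation:

```python
# Toy illustration of pushing a Project below a Limit:
#   Project(cols, Limit(n, child))  =>  Limit(n, Project(cols, child))
# Node classes here are made up for the sketch, not Spark's Catalyst classes.

from dataclasses import dataclass

@dataclass
class Scan:
    table: str

@dataclass
class Limit:
    n: int
    child: object

@dataclass
class Project:
    columns: list
    child: object

def push_project_through_limit(plan):
    """Rewrite Project-over-Limit into Limit-over-Project; leave other plans alone."""
    if isinstance(plan, Project) and isinstance(plan.child, Limit):
        limit = plan.child
        return Limit(limit.n, Project(plan.columns, limit.child))
    return plan

plan = Project(["x"], Limit(10, Scan("t")))
optimized = push_project_through_limit(plan)
# optimized is now Limit(10, Project(["x"], Scan("t")))
```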
[jira] [Created] (SPARK-31191) Spark SQL and hive metastore are incompatible
leishuiyu created SPARK-31191: - Summary: Spark SQL and hive metastore are incompatible Key: SPARK-31191 URL: https://issues.apache.org/jira/browse/SPARK-31191 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Environment: Spark version 2.3.0, Hive version 2.3.3 Reporter: leishuiyu Fix For: 2.3.0

h3. 1. When I execute bin/spark-sql, an exception occurs

{code:java}
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
 at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1523)
 at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)
 at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
 at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
 at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
 at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
 at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
 ... 12 more
Caused by: java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
 at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
 at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
 ... 18 more
Caused by: MetaException(message:Hive Schema version 1.2.0 does not match metastore's schema version 2.3.0 Metastore is not upgraded or corrupt)
 at org.apache.hadoop.hive.metastore.ObjectStore.checkSchema(ObjectStore.java:6679)
 at org.apache.hadoop.hive.metastore.ObjectStore.verifySchema(ObjectStore.java:6645)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:114)
 at com.sun.proxy.$Proxy6.verifySchema(Unknown Source)
 at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:572)
 at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:620)
 at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:461)
 at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:66)
 at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72)
 at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
 at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:199)
 at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
 ... 23 more
{code}

h3. 2. Find the reason

Looking at the source code: the Spark jars directory contains hive-metastore-1.2.1.spark2.jar, and version 1.2.1 is mapped to schema version 1.2.0, so the exception above is raised:

{code:java}
// code placeholder
private static final Map<String, String> EQUIVALENT_VERSIONS = ImmutableMap.of(
    "0.13.1", "0.13.0",
    "1.0.0", "0.14.0",
    "1.0.1", "1.0.0",
    "1.1.1", "1.1.0",
    "1.2.1", "1.2.0"
);
{code}

h3. 3. Is there any solution to this problem?

One can edit hive-site.xml and set hive.metastore.schema.verification to true, but new problems may arise. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
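The compatibility check behind the exception can be sketched as a lookup against the equivalence table quoted above. This is a Python stand-in that mirrors the logic, not Hive's actual implementation:

```python
# Sketch of the schema-version compatibility check: two versions are
# compatible if they are equal, or if the client version maps to the
# metastore version in the equivalence table quoted in the issue.
EQUIVALENT_VERSIONS = {
    "0.13.1": "0.13.0",
    "1.0.0": "0.14.0",
    "1.0.1": "1.0.0",
    "1.1.1": "1.1.0",
    "1.2.1": "1.2.0",
}

def is_compatible(client_version, metastore_version):
    if client_version == metastore_version:
        return True
    return EQUIVALENT_VERSIONS.get(client_version) == metastore_version

# hive-metastore-1.2.1.spark2.jar expects schema 1.2.0, so a 2.3.0
# metastore schema fails the check and raises the MetaException above.
```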
[jira] [Created] (SPARK-31190) ScalaReflection should erasure non user defined AnyVal type
wuyi created SPARK-31190: Summary: ScalaReflection should erasure non user defined AnyVal type Key: SPARK-31190 URL: https://issues.apache.org/jira/browse/SPARK-31190 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: wuyi We should skip erasure only for user-defined AnyVal types, but still do erasure for other types, e.g. Any, which could give a better error message for the end user. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31187) Sort the whole-stage codegen debug output by codegenStageId
[ https://issues.apache.org/jira/browse/SPARK-31187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-31187. -- Fix Version/s: 3.0.0 Assignee: Kris Mok Resolution: Fixed Resolved by [https://github.com/apache/spark/pull/27955] > Sort the whole-stage codegen debug output by codegenStageId > --- > > Key: SPARK-31187 > URL: https://issues.apache.org/jira/browse/SPARK-31187 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 3.0.0 >Reporter: Kris Mok >Assignee: Kris Mok >Priority: Minor > Fix For: 3.0.0 > > > Spark SQL's whole-stage codegen (WSCG) supports dumping the generated code to > help with debugging. One way to get the generated code is through > {{df.queryExecution.debug.codegen}}, or SQL {{explain codegen}} statement. > The generated code is currently printed without specific ordering, which can > make debugging a bit annoying. This ticket tracks a minor improvement to sort > the codegen dump by the {{codegenStageId}}, ascending. > After this change, the following query: > {code} > spark.range(10).agg(sum('id)).queryExecution.debug.codegen > {code} > will always dump the generated code in a natural, stable order. > The number of codegen stages within a single SQL query tends to be very > small, most likely < 50, so the overhead of adding the sorting shouldn't be > significant. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
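The change SPARK-31187 proposes amounts to sorting the dump entries ascending by stage id before printing; a minimal sketch with made-up (id, code) pairs, not Spark's internal types:

```python
# Sketch: sort (codegenStageId, generatedCode) pairs ascending by id
# so the whole-stage codegen debug dump prints in a stable, natural order.
dumps = [
    (3, "// generated code for stage 3"),
    (1, "// generated code for stage 1"),
    (2, "// generated code for stage 2"),
]
ordered = sorted(dumps, key=lambda pair: pair[0])
ids = [stage_id for stage_id, _ in ordered]
```

Since a single query rarely has more than a few dozen codegen stages, the sort cost is negligible, as the ticket notes.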
[jira] [Comment Edited] (SPARK-31188) Spark shell version mismatch
[ https://issues.apache.org/jira/browse/SPARK-31188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062472#comment-17062472 ] Timmanna Channal edited comment on SPARK-31188 at 3/19/20, 11:30 AM: - I am very new to the spark issue space. Should I resolve the ticket? was (Author: timmanna): I am very new the the spark issue space. Should I resolve the ticket ?. > Spark shell version mismatch > -- > > Key: SPARK-31188 > URL: https://issues.apache.org/jira/browse/SPARK-31188 > Project: Spark > Issue Type: Bug > Components: Spark Shell > Affects Versions: 3.0.0 > Environment: I tried on a standalone Ubuntu machine. > Reporter: Timmanna Channal > Priority: Blocker > Attachments: screenshot-1.png > > > Hi Team, > I downloaded the latest Spark 3.x tarball from the Spark website. > When I tried to run spark-shell, I got version 2.4.4. Attaching the screenshot. > > !screenshot-1.png! > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31188) Spark shell version mismatch
[ https://issues.apache.org/jira/browse/SPARK-31188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062472#comment-17062472 ] Timmanna Channal commented on SPARK-31188: -- I am very new to the spark issue space. Should I resolve the ticket? > Spark shell version mismatch > -- > > Key: SPARK-31188 > URL: https://issues.apache.org/jira/browse/SPARK-31188 > Project: Spark > Issue Type: Bug > Components: Spark Shell > Affects Versions: 3.0.0 > Environment: I tried on a standalone Ubuntu machine. > Reporter: Timmanna Channal > Priority: Blocker > Attachments: screenshot-1.png > > > Hi Team, > I downloaded the latest Spark 3.x tarball from the Spark website. > When I tried to run spark-shell, I got version 2.4.4. Attaching the screenshot. > > !screenshot-1.png! > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31188) Spark shell version mismatch
[ https://issues.apache.org/jira/browse/SPARK-31188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062471#comment-17062471 ] Timmanna Channal commented on SPARK-31188: -- Hi Kent, thanks, that worked. I had set SPARK_HOME to the Spark 2.4.4 installation. But as you can see in the attachment I was inside the Spark 3.x folder, so I didn't understand why it was picking up the spark-2.4.4 scripts. > Spark shell version mismatch > -- > > Key: SPARK-31188 > URL: https://issues.apache.org/jira/browse/SPARK-31188 > Project: Spark > Issue Type: Bug > Components: Spark Shell > Affects Versions: 3.0.0 > Environment: I tried on a standalone Ubuntu machine. > Reporter: Timmanna Channal > Priority: Blocker > Attachments: screenshot-1.png > > > Hi Team, > I downloaded the latest Spark 3.x tarball from the Spark website. > When I tried to run spark-shell, I got version 2.4.4. Attaching the screenshot. > > !screenshot-1.png! > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31188) Spark shell version mismatch
[ https://issues.apache.org/jira/browse/SPARK-31188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062457#comment-17062457 ] Kent Yao commented on SPARK-31188: -- I guess you may have set your SPARK_HOME to the wrong place. > Spark shell version mismatch > -- > > Key: SPARK-31188 > URL: https://issues.apache.org/jira/browse/SPARK-31188 > Project: Spark > Issue Type: Bug > Components: Spark Shell > Affects Versions: 3.0.0 > Environment: I tried on a standalone Ubuntu machine. > Reporter: Timmanna Channal > Priority: Blocker > Attachments: screenshot-1.png > > > Hi Team, > I downloaded the latest Spark 3.x tarball from the Spark website. > When I tried to run spark-shell, I got version 2.4.4. Attaching the screenshot. > > !screenshot-1.png! > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30989) TABLE.COLUMN reference doesn't work with new columns created by UDF
[ https://issues.apache.org/jira/browse/SPARK-30989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062355#comment-17062355 ] Wenchen Fan commented on SPARK-30989: - Doesn't https://github.com/apache/spark/pull/27916 fix it? I don't have a strong opinion, as there is no clear rule about how we retain the df alias after many transformations. > TABLE.COLUMN reference doesn't work with new columns created by UDF > --- > > Key: SPARK-30989 > URL: https://issues.apache.org/jira/browse/SPARK-30989 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.4.4 > Reporter: Chris Suchanek > Priority: Major > > When a dataframe is created with an alias (`.as("...")`) its columns can be > referred to as `TABLE.COLUMN`, but this doesn't work for new columns created with a UDF. > {code:java} > // code placeholder > val df1 = sc.parallelize(l).toDF("x","y").as("cat") > val squared = udf((s: Int) => s * s) > val df2 = df1.withColumn("z", squared(col("y"))) > df2.columns // Array[String] = Array(x, y, z) > df2.select("cat.x") // works > df2.select("cat.z") // doesn't work > // org.apache.spark.sql.AnalysisException: cannot resolve '`cat.z`' given input > // columns: [cat.x, cat.y, z];; > {code} > Might be related to: https://issues.apache.org/jira/browse/SPARK-30532 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31189) Fix errors and missing parts for datetime pattern document
Kent Yao created SPARK-31189: Summary: Fix errors and missing parts for datetime pattern document Key: SPARK-31189 URL: https://issues.apache.org/jira/browse/SPARK-31189 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Kent Yao Fix errors and missing parts for datetime pattern document -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30989) TABLE.COLUMN reference doesn't work with new columns created by UDF
[ https://issues.apache.org/jira/browse/SPARK-30989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062332#comment-17062332 ] hemanth meka commented on SPARK-30989: -- [~cloud_fan] or [~viirya], can you confirm whether this needs a fix? I can work on it if needed. > TABLE.COLUMN reference doesn't work with new columns created by UDF > --- > > Key: SPARK-30989 > URL: https://issues.apache.org/jira/browse/SPARK-30989 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.4.4 > Reporter: Chris Suchanek > Priority: Major > > When a dataframe is created with an alias (`.as("...")`) its columns can be > referred to as `TABLE.COLUMN`, but this doesn't work for new columns created with a UDF. > {code:java} > // code placeholder > val df1 = sc.parallelize(l).toDF("x","y").as("cat") > val squared = udf((s: Int) => s * s) > val df2 = df1.withColumn("z", squared(col("y"))) > df2.columns // Array[String] = Array(x, y, z) > df2.select("cat.x") // works > df2.select("cat.z") // doesn't work > // org.apache.spark.sql.AnalysisException: cannot resolve '`cat.z`' given input > // columns: [cat.x, cat.y, z];; > {code} > Might be related to: https://issues.apache.org/jira/browse/SPARK-30532 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25004) Add spark.executor.pyspark.memory config to set resource.RLIMIT_AS
[ https://issues.apache.org/jira/browse/SPARK-25004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062333#comment-17062333 ] Xiaochen Ouyang commented on SPARK-25004: - [~rdblue] This configuration can only control the worker.py process; the maximum memory of the derived child processes cannot be controlled. Worker (JVM) --> Executor --> python daemon --> python daemon: the last python daemon process cannot be controlled by this configuration. > Add spark.executor.pyspark.memory config to set resource.RLIMIT_AS > -- > > Key: SPARK-25004 > URL: https://issues.apache.org/jira/browse/SPARK-25004 > Project: Spark > Issue Type: Bug > Components: PySpark > Affects Versions: 2.3.0 > Reporter: Ryan Blue > Assignee: Ryan Blue > Priority: Major > Fix For: 2.4.0 > > > Some platforms support limiting Python's addressable memory space by limiting > [{{resource.RLIMIT_AS}}|https://docs.python.org/3/library/resource.html#resource.RLIMIT_AS]. > We've found that adding a limit is very useful when running in YARN because > when Python doesn't know about memory constraints, it doesn't know when to > garbage collect and will continue using memory when it doesn't need to. > Adding a limit reduces PySpark memory consumption and avoids YARN killing > containers because Python hasn't cleaned up memory. 
> This also improves error messages for users, allowing them to see when Python > is allocating too much memory instead of YARN killing the container: > {code:lang=python} > File "build/bdist.linux-x86_64/egg/package/library.py", line 265, in > fe_engineer > fe_eval_rec.update(f(src_rec_prep, mat_rec_prep)) > File "build/bdist.linux-x86_64/egg/package/library.py", line 163, in fe_comp > comparisons = EvaluationUtils.leven_list_compare(src_rec_prep.get(item, > []), mat_rec_prep.get(item, [])) > File "build/bdist.linux-x86_64/egg/package/evaluationutils.py", line 25, in > leven_list_compare > permutations = sorted(permutations, reverse=True) > MemoryError > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
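The mechanism under discussion can be sketched with Python's resource module (Unix-only). resource.setrlimit with resource.RLIMIT_AS is the real API the feature relies on; the size-parsing helper and the function names below are hypothetical. Note the limit applies to the process that sets it (and to children forked afterwards, which inherit it), which is exactly the per-process gap the comment above points out:

```python
# Sketch of limiting a Python process's addressable memory with RLIMIT_AS.
# The helper names are hypothetical; resource.setrlimit/getrlimit are real.
import resource

def mem_str_to_bytes(s):
    """Parse a size string like '512m' or '2g' into bytes (hypothetical helper)."""
    units = {"k": 1 << 10, "m": 1 << 20, "g": 1 << 30}
    s = s.strip().lower()
    if s and s[-1] in units:
        return int(s[:-1]) * units[s[-1]]
    return int(s)

def apply_address_space_limit(limit_str):
    """Cap this process's addressable memory, as a worker would on start-up.

    The limit is per process: children forked after this call inherit it,
    but it does not reach into already-running child processes."""
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    new_soft = mem_str_to_bytes(limit_str)
    # Never raise the soft limit above the existing hard limit.
    if hard != resource.RLIM_INFINITY:
        new_soft = min(new_soft, hard)
    resource.setrlimit(resource.RLIMIT_AS, (new_soft, hard))
```

Once the limit is in place, an oversized allocation raises MemoryError inside Python, which is the improved error message the ticket describes.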