[jira] [Updated] (SPARK-33677) LikeSimplification should be skipped if pattern contains any escapeChar
[ https://issues.apache.org/jira/browse/SPARK-33677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lu Lu updated SPARK-33677: -- Summary: LikeSimplification should be skipped if pattern contains any escapeChar (was: LikeSimplification should be skipped if escape is a wildcard character) > LikeSimplification should be skipped if pattern contains any escapeChar > --- > > Key: SPARK-33677 > URL: https://issues.apache.org/jira/browse/SPARK-33677 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Lu Lu >Assignee: Lu Lu >Priority: Major > > LikeSimplification rule does not work correctly for many cases that have > patterns containing escape characters: > {code:sql} > SELECT s LIKE 'm%aca' ESCAPE '%' from t; > SELECT s LIKE 'maacaa' ESCAPE 'a' FROM t; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
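For context, a minimal Scala sketch of the guard the new title suggests; the names below are illustrative and not the actual Catalyst code. The prefix/suffix rewrites that LikeSimplification performs assume the wildcard characters are not escaped, so the rule can simply bail out whenever the pattern contains the user-supplied escape character:

{code:scala}
// Illustrative guard only, not the real LikeSimplification rule.
def shouldSkipSimplification(pattern: String, escapeChar: Char): Boolean =
  pattern.contains(escapeChar)

// 'm%aca' ESCAPE '%': '%' escapes the following 'a' instead of acting as a wildcard.
assert(shouldSkipSimplification("m%aca", '%'))
// 'maacaa' ESCAPE 'a': every 'a' escapes the character after it.
assert(shouldSkipSimplification("maacaa", 'a'))
// A pattern that never uses the escape character can still be simplified safely.
assert(!shouldSkipSimplification("abc%", '\\'))
{code}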
[jira] [Updated] (SPARK-33677) LikeSimplification should be skipped if escape is a wildcard character
[ https://issues.apache.org/jira/browse/SPARK-33677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lu Lu updated SPARK-33677: -- Description: LikeSimplification rule does not work correctly for many cases that have patterns containing escape characters: {code:sql} SELECT s LIKE 'm%aca' ESCAPE '%' from t; SELECT s LIKE 'maacaa' ESCAPE 'a' FROM t; {code} was: Spark SQL should throw exceptions when pattern string is invalid: {code:sql} SELECT a LIKE 'm%aca' ESCAPE '%' from t; {code} > LikeSimplification should be skipped if escape is a wildcard character > -- > > Key: SPARK-33677 > URL: https://issues.apache.org/jira/browse/SPARK-33677 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Lu Lu >Assignee: Lu Lu >Priority: Major > > LikeSimplification rule does not work correctly for many cases that have > patterns containing escape characters: > {code:sql} > SELECT s LIKE 'm%aca' ESCAPE '%' from t; > SELECT s LIKE 'maacaa' ESCAPE 'a' FROM t; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33677) LikeSimplification should be skipped if escape is a wildcard character
[ https://issues.apache.org/jira/browse/SPARK-33677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lu Lu updated SPARK-33677: -- Description: Spark SQL should throw exceptions when pattern string is invalid: {code:sql} SELECT a LIKE 'm%aca' ESCAPE '%' from t; {code} was: In ANSI mode, schema string parsing should fail if the schema uses ANSI reserved keyword as attribute name: {code:scala} spark.conf.set("spark.sql.ansi.enabled", "true") spark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MM/'));""").show output: Cannot parse the data type: no viable alternative at input 'time'(line 1, pos 0) == SQL == time Timestamp ^^^ {code} But this query may accidentally succeed in certain cases cause the DataType parser sticks to the configs of the first created session in the current thread: {code:scala} DataType.fromDDL("time Timestamp") val newSpark = spark.newSession() newSpark.conf.set("spark.sql.ansi.enabled", "true") newSpark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MM/'));""").show output: ++ |from_json({"time":"26/10/2015"})| ++ |{2015-10-26 00:00...| ++ {code} > LikeSimplification should be skipped if escape is a wildcard character > -- > > Key: SPARK-33677 > URL: https://issues.apache.org/jira/browse/SPARK-33677 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Lu Lu >Assignee: Lu Lu >Priority: Major > > Spark SQL should throw exceptions when pattern string is invalid: > {code:sql} > SELECT a LIKE 'm%aca' ESCAPE '%' from t; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33677) LikeSimplification should be skipped if escape is a wildcard character
[ https://issues.apache.org/jira/browse/SPARK-33677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lu Lu updated SPARK-33677: -- Fix Version/s: (was: 3.1.0) Affects Version/s: (was: 3.0.1) 3.1.0 > LikeSimplification should be skipped if escape is a wildcard character > -- > > Key: SPARK-33677 > URL: https://issues.apache.org/jira/browse/SPARK-33677 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Lu Lu >Assignee: Lu Lu >Priority: Major > > In ANSI mode, schema string parsing should fail if the schema uses ANSI > reserved keyword as attribute name: > {code:scala} > spark.conf.set("spark.sql.ansi.enabled", "true") > spark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', > map('timestampFormat', 'dd/MM/'));""").show > output: > Cannot parse the data type: > no viable alternative at input 'time'(line 1, pos 0) > == SQL == > time Timestamp > ^^^ > {code} > But this query may accidentally succeed in certain cases cause the DataType > parser sticks to the configs of the first created session in the current > thread: > {code:scala} > DataType.fromDDL("time Timestamp") > val newSpark = spark.newSession() > newSpark.conf.set("spark.sql.ansi.enabled", "true") > newSpark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', > map('timestampFormat', 'dd/MM/'));""").show > output: > ++ > |from_json({"time":"26/10/2015"})| > ++ > |{2015-10-26 00:00...| > ++ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33677) LikeSimplification should be skipped if escape is a wildcard character
Lu Lu created SPARK-33677: - Summary: LikeSimplification should be skipped if escape is a wildcard character Key: SPARK-33677 URL: https://issues.apache.org/jira/browse/SPARK-33677 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.1 Reporter: Lu Lu Assignee: Lu Lu Fix For: 3.1.0 In ANSI mode, schema string parsing should fail if the schema uses ANSI reserved keyword as attribute name: {code:scala} spark.conf.set("spark.sql.ansi.enabled", "true") spark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MM/'));""").show output: Cannot parse the data type: no viable alternative at input 'time'(line 1, pos 0) == SQL == time Timestamp ^^^ {code} But this query may accidentally succeed in certain cases cause the DataType parser sticks to the configs of the first created session in the current thread: {code:scala} DataType.fromDDL("time Timestamp") val newSpark = spark.newSession() newSpark.conf.set("spark.sql.ansi.enabled", "true") newSpark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MM/'));""").show output: ++ |from_json({"time":"26/10/2015"})| ++ |{2015-10-26 00:00...| ++ {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33614) Fix the constant folding rule to skip it if the expression fails to execute
Lu Lu created SPARK-33614: - Summary: Fix the constant folding rule to skip it if the expression fails to execute Key: SPARK-33614 URL: https://issues.apache.org/jira/browse/SPARK-33614 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Lu Lu -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
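The sub-task has no description yet; as a hedged sketch of the idea in the title (not the actual ConstantFolding code), folding could be attempted eagerly and abandoned when evaluation throws, leaving the original expression in the plan so any error surfaces at execution time instead of at optimization time:

{code:scala}
import scala.util.control.NonFatal

import org.apache.spark.sql.catalyst.expressions.{Expression, Literal}

// Hypothetical helper, not Spark's ConstantFolding rule: replace an
// expression (assumed foldable) with a Literal only if evaluating it
// does not throw; otherwise keep the original expression in the plan.
def tryFold(expr: Expression): Expression = {
  try {
    Literal.create(expr.eval(), expr.dataType)
  } catch {
    case NonFatal(_) => expr  // keep the original expression on failure
  }
}
{code}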
[jira] [Updated] (SPARK-33432) SQL parser should use active SQLConf
[ https://issues.apache.org/jira/browse/SPARK-33432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lu Lu updated SPARK-33432: -- Description: In ANSI mode, schema string parsing should fail if the schema uses ANSI reserved keyword as attribute name: {code:scala} spark.conf.set("spark.sql.ansi.enabled", "true") spark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MM/'));""").show output: Cannot parse the data type: no viable alternative at input 'time'(line 1, pos 0) == SQL == time Timestamp ^^^ {code} But this query may accidentally succeed in certain cases cause the DataType parser sticks to the configs of the first created session in the current thread: {code:scala} DataType.fromDDL("time Timestamp") val newSpark = spark.newSession() newSpark.conf.set("spark.sql.ansi.enabled", "true") newSpark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MM/'));""").show output: ++ |from_json({"time":"26/10/2015"})| ++ |{2015-10-26 00:00...| ++ {code} was: In ANSI mode, schema string parsing should fail if the schema uses ANSI reserved keyword as attribute name: {code:scala} spark.conf.set("spark.sql.ansi.enabled", "true") spark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MM/'));""").show output: Cannot parse the data type: no viable alternative at input 'time'(line 1, pos 0) == SQL == time Timestamp ^^^ {code} But this query may succeed in certain cases cause the DataType parser sticks to the configs of the first created session in the current thread: {code:scala} DataType.fromDDL("time Timestamp") val newSpark = spark.newSession() newSpark.conf.set("spark.sql.ansi.enabled", "true") newSpark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MM/'));""").show output: ++ |from_json({"time":"26/10/2015"})| ++ |{2015-10-26 00:00...| ++ {code} > SQL parser should use active SQLConf > > > Key: SPARK-33432 > URL: https://issues.apache.org/jira/browse/SPARK-33432 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Lu Lu >Priority: Major > > In ANSI mode, schema string parsing should fail if the schema uses ANSI > reserved keyword as attribute name: > {code:scala} > spark.conf.set("spark.sql.ansi.enabled", "true") > spark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', > map('timestampFormat', 'dd/MM/'));""").show > output: > Cannot parse the data type: > no viable alternative at input 'time'(line 1, pos 0) > == SQL == > time Timestamp > ^^^ > {code} > But this query may accidentally succeed in certain cases cause the DataType > parser sticks to the configs of the first created session in the current > thread: > {code:scala} > DataType.fromDDL("time Timestamp") > val newSpark = spark.newSession() > newSpark.conf.set("spark.sql.ansi.enabled", "true") > newSpark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', > map('timestampFormat', 'dd/MM/'));""").show > output: > ++ > |from_json({"time":"26/10/2015"})| > ++ > |{2015-10-26 00:00...| > ++ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
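As a hedged illustration of the fix the title asks for (the class names below are made up, not Spark's parser classes): a parser that captures a SQLConf at construction keeps the settings of whichever session built it first, while reading SQLConf.get inside each call picks up the active session's configuration:

{code:scala}
import org.apache.spark.sql.internal.SQLConf

// Captures the conf once; later sessions that flip spark.sql.ansi.enabled
// are ignored. This mirrors the sticky behaviour described above.
class StickyParserSketch(capturedConf: SQLConf) {
  def ansiEnabled: Boolean = capturedConf.ansiEnabled
}

// Reads the per-thread active conf on every call, so a new session's
// settings take effect immediately.
object ActiveConfParserSketch {
  def ansiEnabled: Boolean = SQLConf.get.ansiEnabled
}
{code}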
[jira] [Updated] (SPARK-33432) SQL parser should use active SQLConf
[ https://issues.apache.org/jira/browse/SPARK-33432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lu Lu updated SPARK-33432: -- Description: In ANSI mode, schema string parsing should fail if the schema uses ANSI reserved keyword as attribute name: {code:scala} spark.conf.set("spark.sql.ansi.enabled", "true") spark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MM/'));""").show output: Cannot parse the data type: no viable alternative at input 'time'(line 1, pos 0) == SQL == time Timestamp ^^^ {code} But this query may succeed in certain cases cause the DataType parser sticks to the configs of the first created session in the current thread: {code:scala} DataType.fromDDL("time Timestamp") val newSpark = spark.newSession() newSpark.conf.set("spark.sql.ansi.enabled", "true") newSpark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MM/'));""").show output: ++ |from_json({"time":"26/10/2015"})| ++ |{2015-10-26 00:00...| ++ {code} was: In ANSI mode, schema string parsing should fail if the schema uses ANSI reserved keyword as attribute name: {code:scala} spark.conf.set("spark.sql.ansi.enabled", "true") spark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MM/'));""").show output: Cannot parse the data type: no viable alternative at input 'time'(line 1, pos 0) == SQL == time Timestamp ^^^ {code} But this query may succeed in certain cases cause the DataType parser sticks to the configs of the first created session in the current thread: {code:scala} DataType.fromDDL("time Timestamp") val newSpark = spark.newSession() newSpark.conf.set("spark.sql.ansi.enabled", "true") newSpark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MM/'));""").show output: ++ |from_json({"time":"26/10/2015"})| ++ |{2015-10-26 00:00...| ++ {code} > SQL parser should use active SQLConf > > > Key: SPARK-33432 > URL: https://issues.apache.org/jira/browse/SPARK-33432 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Lu Lu >Priority: Major > > In ANSI mode, schema string parsing should fail if the schema uses ANSI > reserved keyword as attribute name: > {code:scala} > spark.conf.set("spark.sql.ansi.enabled", "true") > spark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', > map('timestampFormat', 'dd/MM/'));""").show > output: > Cannot parse the data type: > no viable alternative at input 'time'(line 1, pos 0) > == SQL == > time Timestamp > ^^^ > {code} > But this query may succeed in certain cases cause the DataType parser sticks > to the configs of the first created session in the current thread: > {code:scala} > DataType.fromDDL("time Timestamp") > val newSpark = spark.newSession() > newSpark.conf.set("spark.sql.ansi.enabled", "true") > newSpark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', > map('timestampFormat', 'dd/MM/'));""").show > output: > ++ > |from_json({"time":"26/10/2015"})| > ++ > |{2015-10-26 00:00...| > ++ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33432) SQL parser should use active SQLConf
[ https://issues.apache.org/jira/browse/SPARK-33432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lu Lu updated SPARK-33432: -- Summary: SQL parser should use active SQLConf (was: DataType parser should use active SQLConf) > SQL parser should use active SQLConf > > > Key: SPARK-33432 > URL: https://issues.apache.org/jira/browse/SPARK-33432 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Lu Lu >Priority: Major > > In ANSI mode, schema string parsing should fail if the schema uses ANSI > reserved keyword as attribute name: > {code:scala} > spark.conf.set("spark.sql.ansi.enabled", "true") > spark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', > map('timestampFormat', 'dd/MM/'));""").show > output: > Cannot parse the data type: > no viable alternative at input 'time'(line 1, pos 0) > == SQL == > time Timestamp > ^^^ > {code} > But this query may succeed in certain cases cause the DataType parser sticks > to the configs of the first created session in the current thread: > {code:scala} > DataType.fromDDL("time Timestamp") > val newSpark = spark.newSession() > newSpark.conf.set("spark.sql.ansi.enabled", "true") > newSpark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', > map('timestampFormat', 'dd/MM/'));""").show > output: > ++ > |from_json({"time":"26/10/2015"})| > ++ > |{2015-10-26 00:00...| > ++ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33432) DataType parser should use active SQLConf
[ https://issues.apache.org/jira/browse/SPARK-33432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lu Lu updated SPARK-33432: -- Description: In ANSI mode, schema string parsing should fail if the schema uses ANSI reserved keyword as attribute name: {code:scala} spark.conf.set("spark.sql.ansi.enabled", "true") spark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MM/'));""").show output: Cannot parse the data type: no viable alternative at input 'time'(line 1, pos 0) == SQL == time Timestamp ^^^ {code} But this query may succeed in certain cases cause the DataType parser sticks to the configs of the first created session in the current thread: {code:scala} DataType.fromDDL("time Timestamp") val newSpark = spark.newSession() newSpark.conf.set("spark.sql.ansi.enabled", "true") newSpark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MM/'));""").show output: ++ |from_json({"time":"26/10/2015"})| ++ |{2015-10-26 00:00...| ++ {code} was: In ANSI mode, schema string parsing should fail if the schema uses ANSI reserved keyword as attribute name: {code:scala} spark.conf.set("spark.sql.ansi.enabled", "true") spark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MM/'));""").show output: Cannot parse the data type: no viable alternative at input 'time'(line 1, pos 0) == SQL == time Timestamp ^^^ {code} But this query may succeed in certain cases cause the DataType parser sticks to the configs of the first created session in the current thread: {code:scala} DataType.fromDDL("time Timestamp") val newSpark = spark.newSession() newSpark.conf.set("spark.sql.ansi.enabled", "true") newSpark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MM/'));""").show output: ++ |from_json({"time":"26/10/2015"})| ++ | {2015-10-26 00:00...| ++ {code} > DataType parser should use active SQLConf > - > > Key: SPARK-33432 > URL: https://issues.apache.org/jira/browse/SPARK-33432 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Lu Lu >Priority: Major > > In ANSI mode, schema string parsing should fail if the schema uses ANSI > reserved keyword as attribute name: > {code:scala} > spark.conf.set("spark.sql.ansi.enabled", "true") > spark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', > map('timestampFormat', 'dd/MM/'));""").show > output: > Cannot parse the data type: > no viable alternative at input 'time'(line 1, pos 0) > == SQL == > time Timestamp > ^^^ > {code} > But this query may succeed in certain cases cause the DataType parser sticks > to the configs of the first created session in the current thread: > {code:scala} > DataType.fromDDL("time Timestamp") > val newSpark = spark.newSession() > newSpark.conf.set("spark.sql.ansi.enabled", "true") > newSpark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', > map('timestampFormat', 'dd/MM/'));""").show > output: > ++ > |from_json({"time":"26/10/2015"})| > ++ > |{2015-10-26 00:00...| > ++ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33432) DataType parser should use active SQLConf
Lu Lu created SPARK-33432: - Summary: DataType parser should use active SQLConf Key: SPARK-33432 URL: https://issues.apache.org/jira/browse/SPARK-33432 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.1 Reporter: Lu Lu In ANSI mode, schema string parsing should fail if the schema uses ANSI reserved keyword as attribute name: {code:scala} spark.conf.set("spark.sql.ansi.enabled", "true") spark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MM/'));""").show output: Cannot parse the data type: no viable alternative at input 'time'(line 1, pos 0) == SQL == time Timestamp ^^^ {code} But this query may succeed in certain cases cause the DataType parser sticks to the configs of the first created session in the current thread: {code:scala} DataType.fromDDL("time Timestamp") val newSpark = spark.newSession() newSpark.conf.set("spark.sql.ansi.enabled", "true") newSpark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MM/'));""").show output: ++ |from_json({"time":"26/10/2015"})| ++ | {2015-10-26 00:00...| ++ {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33389) make internal classes of SparkSession always using active SQLConf
Lu Lu created SPARK-33389: - Summary: make internal classes of SparkSession always using active SQLConf Key: SPARK-33389 URL: https://issues.apache.org/jira/browse/SPARK-33389 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Lu Lu -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33140) make all sub-classes of Rule[QueryPlan] use SQLConf.get
[ https://issues.apache.org/jira/browse/SPARK-33140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lu Lu updated SPARK-33140: -- Summary: make all sub-classes of Rule[QueryPlan] use SQLConf.get (was: make Analyzer rules using SQLConf.get) > make all sub-classes of Rule[QueryPlan] use SQLConf.get > --- > > Key: SPARK-33140 > URL: https://issues.apache.org/jira/browse/SPARK-33140 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Leanken.Lin >Assignee: Leanken.Lin >Priority: Major > Fix For: 3.1.0 > > > TODO -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
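The description is still a TODO; a hedged sketch of the pattern the summary asks for (this is not an actual Spark rule): a rule reads the per-thread active conf through SQLConf.get inside apply(), rather than holding a conf that was captured when the rule was constructed:

{code:scala}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.internal.SQLConf

// Illustrative no-op rule: the point is only where the conf is read from.
object ActiveConfRuleSketch extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    // SQLConf.get resolves to the currently active session's conf.
    val caseSensitive = SQLConf.get.caseSensitiveAnalysis
    logDebug(s"caseSensitive=$caseSensitive")  // Rule mixes in Logging
    plan
  }
}
{code}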
[jira] [Commented] (SPARK-33008) Division by zero on divide-like operations returns incorrect result
[ https://issues.apache.org/jira/browse/SPARK-33008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223341#comment-17223341 ] Lu Lu commented on SPARK-33008: --- Please assign this to me. [~cloud_fan] > Division by zero on divide-like operations returns incorrect result > --- > > Key: SPARK-33008 > URL: https://issues.apache.org/jira/browse/SPARK-33008 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Lu Lu >Priority: Major > Fix For: 3.1.0 > > > Spark SQL: > {code:sql} > spark-sql> SELECT 1/0; > NULL > Time taken: 3.002 seconds, Fetched 1 row(s) > {code} > PostgreSQL: > {code:sql} > postgres=# SELECT 1/0; > ERROR: division by zero > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33008) Division by zero on divide-like operations returns incorrect result
[ https://issues.apache.org/jira/browse/SPARK-33008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lu Lu updated SPARK-33008: -- Summary: Division by zero on divide-like operations returns incorrect result (was: Throw exception on division by zero) > Division by zero on divide-like operations returns incorrect result > --- > > Key: SPARK-33008 > URL: https://issues.apache.org/jira/browse/SPARK-33008 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Lu Lu >Priority: Major > > Spark SQL: > {code:sql} > spark-sql> SELECT 1/0; > NULL > Time taken: 3.002 seconds, Fetched 1 row(s) > {code} > PostgreSQL: > {code:sql} > postgres=# SELECT 1/0; > ERROR: division by zero > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
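For reference, a small hedged example: in Spark 3.x the spark.sql.ansi.enabled flag switches divide-like operations from returning NULL to raising an error, which is the behaviour the PostgreSQL comparison above illustrates:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("div-by-zero").getOrCreate()

// Default (non-ANSI) behaviour: division by zero yields NULL.
spark.sql("SELECT 1/0").show()

// ANSI mode: the same query is expected to fail with a divide-by-zero
// error instead of silently returning NULL.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT 1/0").show()  // throws at execution time under ANSI mode
{code}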
[jira] [Created] (SPARK-33008) Throw exception on division by zero
Lu Lu created SPARK-33008: - Summary: Throw exception on division by zero Key: SPARK-33008 URL: https://issues.apache.org/jira/browse/SPARK-33008 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Lu Lu Spark SQL: {code:java} spark-sql> SELECT 1/0; NULL Time taken: 3.002 seconds, Fetched 1 row(s) {code} PostgreSQL: {code:java} postgres=# SELECT 1/0; ERROR: division by zero {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33008) Throw exception on division by zero
[ https://issues.apache.org/jira/browse/SPARK-33008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lu Lu updated SPARK-33008: -- Description: Spark SQL: {code:sql} spark-sql> SELECT 1/0; NULL Time taken: 3.002 seconds, Fetched 1 row(s) {code} PostgreSQL: {code:sql} postgres=# SELECT 1/0; ERROR: division by zero {code} was: Spark SQL: {code:sql} spark-sql> SELECT 1/0; NULL Time taken: 3.002 seconds, Fetched 1 row(s) {code:sql} PostgreSQL: {code:java} postgres=# SELECT 1/0; ERROR: division by zero {code} > Throw exception on division by zero > --- > > Key: SPARK-33008 > URL: https://issues.apache.org/jira/browse/SPARK-33008 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Lu Lu >Priority: Major > > Spark SQL: > {code:sql} > spark-sql> SELECT 1/0; > NULL > Time taken: 3.002 seconds, Fetched 1 row(s) > {code} > PostgreSQL: > {code:sql} > postgres=# SELECT 1/0; > ERROR: division by zero > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33008) Throw exception on division by zero
[ https://issues.apache.org/jira/browse/SPARK-33008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lu Lu updated SPARK-33008: -- Description: Spark SQL: {code:sql} spark-sql> SELECT 1/0; NULL Time taken: 3.002 seconds, Fetched 1 row(s) {code:sql} PostgreSQL: {code:java} postgres=# SELECT 1/0; ERROR: division by zero {code} was: Spark SQL: {code:java} spark-sql> SELECT 1/0; NULL Time taken: 3.002 seconds, Fetched 1 row(s) {code} PostgreSQL: {code:java} postgres=# SELECT 1/0; ERROR: division by zero {code} > Throw exception on division by zero > --- > > Key: SPARK-33008 > URL: https://issues.apache.org/jira/browse/SPARK-33008 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Lu Lu >Priority: Major > > Spark SQL: > {code:sql} > spark-sql> SELECT 1/0; > NULL > Time taken: 3.002 seconds, Fetched 1 row(s) > {code:sql} > PostgreSQL: > {code:java} > postgres=# SELECT 1/0; > ERROR: division by zero > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4115) [GraphX] add overridden count for EdgeRDD
[ https://issues.apache.org/jira/browse/SPARK-4115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lu Lu updated SPARK-4115: - Summary: [GraphX] add overridden count for EdgeRDD (was: add overridden count for EdgeRDD) > [GraphX] add overridden count for EdgeRDD > > > Key: SPARK-4115 > URL: https://issues.apache.org/jira/browse/SPARK-4115 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 1.1.0 >Reporter: Lu Lu >Priority: Minor > Fix For: 1.1.1 > > > Add an overridden count for edge counting of EdgeRDD. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
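A hedged sketch of the idea behind the override, expressed with plain RDDs rather than GraphX internals: when the edges are already grouped into per-partition structures, counting can sum partition sizes instead of iterating over every edge object:

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setMaster("local[*]").setAppName("count-by-partition-size"))

// One array per partition stands in for GraphX's per-partition edge storage.
val perPartitionEdges = sc.parallelize(1 to 1000000, 8)
  .mapPartitions(it => Iterator(it.toArray))

// Count by summing partition sizes instead of counting individual elements.
val edgeCount = perPartitionEdges.map(_.length.toLong).reduce(_ + _)
println(edgeCount)  // 1000000
{code}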
[jira] [Created] (SPARK-4115) add overridden count for EdgeRDD
Lu Lu created SPARK-4115: Summary: add overridden count for EdgeRDD Key: SPARK-4115 URL: https://issues.apache.org/jira/browse/SPARK-4115 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 1.1.0 Reporter: Lu Lu Priority: Minor Fix For: 1.1.1 Add an overridden count for edge counting of EdgeRDD. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4109) Task.stageId is not deserialized correctly
Lu Lu created SPARK-4109: Summary: Task.stageId is not deserialized correctly Key: SPARK-4109 URL: https://issues.apache.org/jira/browse/SPARK-4109 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.0.0 Reporter: Lu Lu Fix For: 1.0.3 The two subclasses of Task, ShuffleMapTask and ResultTask, do not correctly deserialize stageId. As a result, accessing TaskContext.stageId always returns zero to the user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2818) Improve joinning RDDs that transformed from the same parent RDD
[ https://issues.apache.org/jira/browse/SPARK-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lu Lu updated SPARK-2818: - Description: If the joining RDDs originate from the same cached RDD, the DAGScheduler will submit redundant stages to compute and cache the common parent. For example: {code} val edges = sc.textFile(...).cache() val bigSrc = edges.groupByKey().filter(...) val reversed = edges.map(edge => (edge._2, edge._1)) val bigDst = reversed.groupByKey().filter(...) bigSrc.join(bigDst).count {code} The final count action will trigger two stages that both compute the edges RDD. This results in two performance problems: (1) if the resources are sufficient, the two stages will run concurrently and read the same HDFS file at the same time. (2) if the two stages run one after another, the tasks of the latter stage can read the cached blocks of the edges RDD immediately, but they cannot achieve data locality because the block location information is unavailable when the stages are submitted. was: if the joinning RDDs are originating from a same cached RDD, the DAGScheduler will submit redundant stages to compute and cache the common parent. For example: {code} val edges = sc.textFile(...).cache() val bigSrc = edges.groupByKey().filter(...) val reversed = edges.map(edge => (edge._2, edge._1)) val bigDst = reversed.groupByKey().filter(...) bigSrc.join(bigDst).count {code} The final count action will trigger two stages both to compute the edges RDD. It will result to two performance problerm: (1) if the resources are sufficient, these two stages will be running concurrently and read the same HDFS file at the same time. (2) if the two stages run one by one, the tasks of the latter stage can read the cached blocks of the edges RDD immediately. But it cannot achieve data-locality for these tasks because that the block location information are unavailable when submiting the stages. Summary: Improve joinning RDDs that transformed from the same parent RDD (was: Improve joinning RDDs that transformed from the same cached RDD) > Improve joinning RDDs that transformed from the same parent RDD > --- > > Key: SPARK-2818 > URL: https://issues.apache.org/jira/browse/SPARK-2818 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Lu Lu > > If the joining RDDs originate from the same cached RDD, the DAGScheduler > will submit redundant stages to compute and cache the common parent. > For example: > {code} > val edges = sc.textFile(...).cache() > val bigSrc = edges.groupByKey().filter(...) > val reversed = edges.map(edge => (edge._2, edge._1)) > val bigDst = reversed.groupByKey().filter(...) > bigSrc.join(bigDst).count > {code} > The final count action will trigger two stages that both compute the edges RDD. > This results in two performance problems: > (1) if the resources are sufficient, the two stages will run > concurrently and read the same HDFS file at the same time. > (2) if the two stages run one after another, the tasks of the latter stage can read > the cached blocks of the edges RDD immediately, but they cannot achieve > data locality because the block location information is > unavailable when the stages are submitted. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
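As a hedged workaround under the behaviour described above (this is not a scheduler change; the data and filter predicates below are placeholders): materialising the shared parent once, before defining the two branches, lets both later stages read cached blocks whose locations are already known:

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setMaster("local[*]").setAppName("shared-parent-join"))

val edges = sc.parallelize(Seq((1, 2), (2, 3), (2, 4), (3, 1))).cache()
edges.count()  // force the parent to be computed and cached exactly once

val bigSrc = edges.groupByKey().filter(_._2.size >= 1)               // placeholder predicate
val bigDst = edges.map(_.swap).groupByKey().filter(_._2.size >= 1)   // placeholder predicate
println(bigSrc.join(bigDst).count())  // both stages now read the cached blocks
{code}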
[jira] [Updated] (SPARK-2818) Improve joinning RDDs that transformed from the same cached RDD
[ https://issues.apache.org/jira/browse/SPARK-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lu Lu updated SPARK-2818: - Description: if the joinning RDDs are originating from a same cached RDD, the DAGScheduler will submit redundant stages to compute and cache the common parent. For example: {code} val edges = sc.textFile(...).cache() val bigSrc = edges.groupByKey().filter(...) val reversed = edges.map(edge => (edge._2, edge._1)) val bigDst = reversed.groupByKey().filter(...) bigSrc.join(bigDst).count {code} The final count action will trigger two stages both to compute the edges RDD. It will result to two performance problerm: (1) if the resources are sufficient, these two stages will be running concurrently and read the same HDFS file at the same time. (2) if the two stages run one by one, the tasks of the latter stage can read the cached blocks of the edges RDD immediately. But it cannot achieve data-locality for these tasks because that the block location information are unavailable when submiting the stages. was: if the joinning RDDs are originating from a same cached RDD, the DAGScheduler will submit redundant stages to compute and cache the common parent. For example: {code} val edges = sc.textFile(...).cache() val bigSrc = edges.groupByKey().filter(...) val reversed = edges.map(edge => (edge._2, edge._1)) val bigDst = reversed.groupByKey().filter(...) bigSrc.join(bigDst).count {code} The final count action will trigger two stages both to compute the edges RDD. It will result to two performance problerm: (1) if the resources are sufficient, these two stages will be running concurrently and read the same HDFS file at the same time. (2) if the two stages run one by one, the tasks of the latter stage can read the cached blocks of the edges RDD directly. But it cannot achieve data-locality for the latter stage because that the block location information are not known when submiting the stages. > Improve joinning RDDs that transformed from the same cached RDD > --- > > Key: SPARK-2818 > URL: https://issues.apache.org/jira/browse/SPARK-2818 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Lu Lu > > if the joinning RDDs are originating from a same cached RDD, the DAGScheduler > will submit redundant stages to compute and cache the common parent. > For example: > {code} > val edges = sc.textFile(...).cache() > val bigSrc = edges.groupByKey().filter(...) > val reversed = edges.map(edge => (edge._2, edge._1)) > val bigDst = reversed.groupByKey().filter(...) > bigSrc.join(bigDst).count > {code} > The final count action will trigger two stages both to compute the edges RDD. > It will result to two performance problerm: > (1) if the resources are sufficient, these two stages will be running > concurrently and read the same HDFS file at the same time. > (2) if the two stages run one by one, the tasks of the latter stage can read > the cached blocks of the edges RDD immediately. But it cannot achieve > data-locality for these tasks because that the block location information are > unavailable when submiting the stages. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2818) Improve joinning RDDs that transformed from the same cached RDD
[ https://issues.apache.org/jira/browse/SPARK-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lu Lu updated SPARK-2818: - Description: if the joinning RDDs are originating from a same cached RDD, the DAGScheduler will submit redundant stages to compute and cache the common parent. For example: {code} val edges = sc.textFile(...).cache() val bigSrc = edges.groupByKey().filter(...) val reversed = edges.map(edge => (edge._2, edge._1)) val bigDst = reversed.groupByKey().filter(...) bigSrc.join(bigDst).count {code} The final count action will trigger two stages both to compute the edges RDD. It will result to two performance problerm: (1) if the resources are sufficient, these two stages will be running concurrently and read the same HDFS file at the same time. (2) if the two stages run one by one, the tasks of the latter stage can read the cached blocks of the edges RDD directly. But it cannot achieve data-locality for the latter stage because that the block location information are not known when submiting the stages. was: if the joinning RDDs are originating from a same cached RDD a, the DAGScheduler will submit redundant stages to compute and cache the RDD a. For example: {code} val edges = sc.textFile(...).cache() val bigSrc = edges.groupByKey().filter(...) val reversed = edges.map(edge => (edge._2, edge._1)) val bigDst = reversed.groupByKey().filter(...) bigSrc.join(bigDst).count {code} The final count action will trigger two stages both to compute the edges RDD. It will result to two performance problerm: (1) if the resources are sufficient, these two stages will be running concurrently and read the same HDFS file at the same time. (2) if the two stages run one by one, the tasks of the latter stage can read the cached blocks of the edges RDD directly. But it cannot achieve data-locality for the latter stage because that the block location information are not known when submiting the stages. > Improve joinning RDDs that transformed from the same cached RDD > --- > > Key: SPARK-2818 > URL: https://issues.apache.org/jira/browse/SPARK-2818 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Lu Lu > > if the joinning RDDs are originating from a same cached RDD, the DAGScheduler > will submit redundant stages to compute and cache the common parent. > For example: > {code} > val edges = sc.textFile(...).cache() > val bigSrc = edges.groupByKey().filter(...) > val reversed = edges.map(edge => (edge._2, edge._1)) > val bigDst = reversed.groupByKey().filter(...) > bigSrc.join(bigDst).count > {code} > The final count action will trigger two stages both to compute the edges RDD. > It will result to two performance problerm: > (1) if the resources are sufficient, these two stages will be running > concurrently and read the same HDFS file at the same time. > (2) if the two stages run one by one, the tasks of the latter stage can read > the cached blocks of the edges RDD directly. But it cannot achieve > data-locality for the latter stage because that the block location > information are not known when submiting the stages. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2827) Add DegreeDist function support
Lu Lu created SPARK-2827: Summary: Add DegreeDist function support Key: SPARK-2827 URL: https://issues.apache.org/jira/browse/SPARK-2827 Project: Spark Issue Type: New Feature Components: GraphX Reporter: Lu Lu Add degree distribution operators in GraphOps for GraphX. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
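A hedged sketch of what such an operator might compute, written against the existing GraphOps API since the ticket gives no signature (it assumes an active SparkContext named sc):

{code:scala}
import org.apache.spark.graphx.{Edge, Graph}

// A small triangle graph; assumes an existing SparkContext `sc`.
val graph = Graph.fromEdges(
  sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(1L, 3L, 1), Edge(2L, 3L, 1))),
  defaultValue = 0)

// graph.degrees is an RDD of (vertexId, degree) pairs; the distribution is
// simply the number of vertices observed at each degree.
val degreeDist = graph.degrees
  .map { case (_, degree) => (degree, 1L) }
  .reduceByKey(_ + _)

degreeDist.collect().foreach(println)  // e.g. (2,3) for this triangle
{code}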
[jira] [Updated] (SPARK-2818) Improve joinning RDDs that transformed from the same cached RDD
[ https://issues.apache.org/jira/browse/SPARK-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lu Lu updated SPARK-2818: - Description: if the joinning RDDs are originating from a same cached RDD a, the DAGScheduler will submit redundant stages to compute and cache the RDD a. For example: {code} val edges = sc.textFile(...).cache() val bigSrc = edges.groupByKey().filter(...) val reversed = edges.map(edge => (edge._2, edge._1)) val bigDst = reversed.groupByKey().filter(...) bigSrc.join(bigDst).count {code} The final count action will trigger two stages both to compute the edges RDD. It will result to two performance problerm: (1) if the resources are sufficient, these two stages will be running concurrently and read the same HDFS file at the same time. (2) if the two stages run one by one, the tasks of the latter stage can read the cached blocks of the edges RDD directly. But it cannot achieve data-locality for the latter stage because that the block location information are not known when submiting the stages. was: if the joinning RDDs are originating from a same cached RDD a, the DAGScheduler will submit redundant stages to compute and cache the RDD a. For example: ``` val edges = sc.textFile(...).cache() val bigSrc = edges.groupByKey().filter(...) val reversed = edges.map(edge => (edge._2, edge._1)) val bigDst = reversed.groupByKey().filter(...) bigSrc.join(bigDst).count ``` The final count action will trigger two stages both to compute the edges RDD. It will result to two performance problerm: (1) if the resources are sufficient, these two stages will be running concurrently and read the same HDFS file at the same time. (2) if the two stages run one by one, the tasks of the latter stage can read the cached blocks of the edges RDD directly. But it cannot achieve data-locality for the latter stage because that the block location information are not known when submiting the stages. > Improve joinning RDDs that transformed from the same cached RDD > --- > > Key: SPARK-2818 > URL: https://issues.apache.org/jira/browse/SPARK-2818 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Lu Lu > > if the joinning RDDs are originating from a same cached RDD a, the > DAGScheduler will submit redundant stages to compute and cache the RDD a. > For example: > {code} > val edges = sc.textFile(...).cache() > val bigSrc = edges.groupByKey().filter(...) > val reversed = edges.map(edge => (edge._2, edge._1)) > val bigDst = reversed.groupByKey().filter(...) > bigSrc.join(bigDst).count > {code} > The final count action will trigger two stages both to compute the edges RDD. > It will result to two performance problerm: > (1) if the resources are sufficient, these two stages will be running > concurrently and read the same HDFS file at the same time. > (2) if the two stages run one by one, the tasks of the latter stage can read > the cached blocks of the edges RDD directly. But it cannot achieve > data-locality for the latter stage because that the block location > information are not known when submiting the stages. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2818) Improve joinning RDDs that transformed from the same cached RDD
[ https://issues.apache.org/jira/browse/SPARK-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lu Lu updated SPARK-2818: - Description: if the joinning RDDs are originating from a same cached RDD a, the DAGScheduler will submit redundant stages to compute and cache the RDD a. For example: ``` val edges = sc.textFile(...).cache() val bigSrc = edges.groupByKey().filter(...) val reversed = edges.map(edge => (edge._2, edge._1)) val bigDst = reversed.groupByKey().filter(...) bigSrc.join(bigDst).count ``` The final count action will trigger two stages both to compute the edges RDD. It will result to two performance problerm: (1) if the resources are sufficient, these two stages will be running concurrently and read the same HDFS file at the same time. (2) if the two stages run one by one, the tasks of the latter stage can read the cached blocks of the edges RDD directly. But it cannot achieve data-locality for the latter stage because that the block location information are not known when submiting the stages. was: if the joinning RDDs are originating from a same cached RDD a, the DAGScheduler will submit redundant stages to compute and cache the RDD a. For example: val edges = sc.textFile(...).cache() val bigSrc = edges.groupByKey().filter(...) val reversed = edges.map(edge => (edge._2, edge._1)) val bigDst = reversed.groupByKey().filter(...) bigSrc.join(bigDst).count The final count action will trigger two stages both to compute the edges RDD. It will result to two performance problerm: (1) if the resources are sufficient, these two stages will be running concurrently and read the same HDFS file at the same time. (2) if the two stages run one by one, the tasks of the latter stage can read the cached blocks of the edges RDD directly. But it cannot achieve data-locality for the latter stage because that the block location information are not known when submiting the stages. > Improve joinning RDDs that transformed from the same cached RDD > --- > > Key: SPARK-2818 > URL: https://issues.apache.org/jira/browse/SPARK-2818 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Lu Lu > > if the joinning RDDs are originating from a same cached RDD a, the > DAGScheduler will submit redundant stages to compute and cache the RDD a. > For example: > ``` > val edges = sc.textFile(...).cache() > val bigSrc = edges.groupByKey().filter(...) > val reversed = edges.map(edge => (edge._2, edge._1)) > val bigDst = reversed.groupByKey().filter(...) > bigSrc.join(bigDst).count > ``` > The final count action will trigger two stages both to compute the edges RDD. > It will result to two performance problerm: > (1) if the resources are sufficient, these two stages will be running > concurrently and read the same HDFS file at the same time. > (2) if the two stages run one by one, the tasks of the latter stage can read > the cached blocks of the edges RDD directly. But it cannot achieve > data-locality for the latter stage because that the block location > information are not known when submiting the stages. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2823) GraphX jobs throw IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lu Lu updated SPARK-2823: - Description: If the users set “spark.default.parallelism” and the value is different with the EdgeRDD partition number, GraphX jobs will throw IllegalArgumentException: 14/07/26 21:06:51 WARN DAGScheduler: Creating new stage failed due to exception - job: 1 java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:60) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:54) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getShuffleMapStage(DAGScheduler.scala:1 97) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.s cala:272) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.s cala:269) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.s cala:274) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.s cala:269) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.s cala:274) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.s cala:269) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269) at org.apache.spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:279) at org.apache.spark.scheduler.DAGScheduler.newStage(DAGScheduler.scala:219) at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:672) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1184) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) was: If the users set “spark.default.parallelism” and the value is different with the EdgeRDD partition number, GraphX jobs will throw IllegalArgumentException: 14/07/26 21:06:51 WARN DAGScheduler: Creating new stage failed due to exception - job: 1 .lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:60) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.sp
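A hedged reproduction sketch based on the report above (the values are placeholders, and whether a particular operation hits the zip is version dependent): build a graph whose edge RDD has a partition count different from spark.default.parallelism, then run an operation that zips vertex and edge partition RDDs:

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("graphx-parallelism-mismatch")
  .set("spark.default.parallelism", "7")  // deliberately != 4 below

val sc = new SparkContext(conf)
val edges = sc.parallelize(
  Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)), numSlices = 4)

val graph = Graph.fromEdges(edges, defaultValue = 0)
// In the affected versions, operations that zip vertex and edge partition
// RDDs could fail with "Can't zip RDDs with unequal numbers of partitions".
println(graph.degrees.count())
{code}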
[jira] [Created] (SPARK-2823) GraphX jobs throw IllegalArgumentException
Lu Lu created SPARK-2823: Summary: GraphX jobs throw IllegalArgumentException Key: SPARK-2823 URL: https://issues.apache.org/jira/browse/SPARK-2823 Project: Spark Issue Type: Bug Components: GraphX Reporter: Lu Lu If the users set “spark.default.parallelism” and the value is different with the EdgeRDD partition number, GraphX jobs will throw IllegalArgumentException: 14/07/26 21:06:51 WARN DAGScheduler: Creating new stage failed due to exception - job: 1 .lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:60) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:54) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getShuffleMapStage(DAGScheduler.scala:1 97) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.s cala:272) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.s cala:269) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.s cala:274) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.s cala:269) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.s cala:274) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.s cala:269) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269) at org.apache.spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:279) at org.apache.spark.scheduler.DAGScheduler.newStage(DAGScheduler.scala:219) at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:672) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1184) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at 
akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2818) Improve joinning RDDs that transformed from the same cached RDD
[ https://issues.apache.org/jira/browse/SPARK-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lu Lu updated SPARK-2818: - Component/s: Spark Core Description: if the joinning RDDs are originating from a same cached RDD a, the DAGScheduler will submit redundant stages to compute and cache the RDD a. For example: val edges = sc.textFile(...).cache() val bigSrc = edges.groupByKey().filter(...) val reversed = edges.map(edge => (edge._2, edge._1)) val bigDst = reversed.groupByKey().filter(...) bigSrc.join(bigDst).count The final count action will trigger two stages both to compute the edges RDD. It will result to two performance problerm: (1) if the resources are sufficient, these two stages will be running concurrently and read the same HDFS file at the same time. (2) if the two stages run one by one, the tasks of the latter stage can read the cached blocks of the edges RDD directly. But it cannot achieve data-locality for the latter stage because that the block location information are not known when submiting the stages. > Improve joinning RDDs that transformed from the same cached RDD > --- > > Key: SPARK-2818 > URL: https://issues.apache.org/jira/browse/SPARK-2818 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Lu Lu > > if the joinning RDDs are originating from a same cached RDD a, the > DAGScheduler will submit redundant stages to compute and cache the RDD a. > For example: > val edges = sc.textFile(...).cache() > val bigSrc = edges.groupByKey().filter(...) > val reversed = edges.map(edge => (edge._2, edge._1)) > val bigDst = reversed.groupByKey().filter(...) > bigSrc.join(bigDst).count > The final count action will trigger two stages both to compute the edges RDD. > It will result to two performance problerm: > (1) if the resources are sufficient, these two stages will be running > concurrently and read the same HDFS file at the same time. > (2) if the two stages run one by one, the tasks of the latter stage can read > the cached blocks of the edges RDD directly. But it cannot achieve > data-locality for the latter stage because that the block location > information are not known when submiting the stages. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2818) Improve joinning RDDs that transformed from the same cached RDD
Lu Lu created SPARK-2818: Summary: Improve joinning RDDs that transformed from the same cached RDD Key: SPARK-2818 URL: https://issues.apache.org/jira/browse/SPARK-2818 Project: Spark Issue Type: Improvement Reporter: Lu Lu -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org