[jira] [Updated] (SPARK-33677) LikeSimplification should be skipped if pattern contains any escapeChar

2020-12-06 Thread Lu Lu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Lu updated SPARK-33677:
--
Summary: LikeSimplification should be skipped if pattern contains any 
escapeChar  (was: LikeSimplification should be skipped if escape is a wildcard 
character)

> LikeSimplification should be skipped if pattern contains any escapeChar
> ---
>
> Key: SPARK-33677
> URL: https://issues.apache.org/jira/browse/SPARK-33677
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Lu Lu
>Assignee: Lu Lu
>Priority: Major
>
> The LikeSimplification rule does not work correctly in many cases where the
> pattern contains escape characters:
> {code:sql}
> SELECT s LIKE 'm%aca' ESCAPE '%' from t;
> SELECT s LIKE 'maacaa' ESCAPE 'a' FROM t;
> {code}
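
A minimal sketch of the guard this implies (illustrative only, not the actual
Optimizer code; the helper name is made up): skip the rewrite whenever the
pattern contains the escape character at all, since such patterns cannot be
safely turned into startsWith/endsWith/contains checks.
{code:scala}
// Hypothetical helper, for illustration only: LikeSimplification would bail out
// whenever the LIKE pattern contains the escape character, because patterns such as
// 'm%aca' ESCAPE '%' or 'maacaa' ESCAPE 'a' cannot be rewritten into simple
// startsWith/endsWith/contains checks.
def canSimplifyLike(pattern: String, escapeChar: Char): Boolean =
  !pattern.contains(escapeChar)

assert(!canSimplifyLike("m%aca", '%'))  // keep the original Like expression
assert(!canSimplifyLike("maacaa", 'a')) // keep the original Like expression
assert(canSimplifyLike("abc%", '\\'))   // safe to rewrite to a startsWith check
{code}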






[jira] [Updated] (SPARK-33677) LikeSimplification should be skipped if escape is a wildcard character

2020-12-06 Thread Lu Lu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Lu updated SPARK-33677:
--
Description: 
The LikeSimplification rule does not work correctly in many cases where the 
pattern contains escape characters:
{code:sql}
SELECT s LIKE 'm%aca' ESCAPE '%' from t;
SELECT s LIKE 'maacaa' ESCAPE 'a' FROM t;
{code}

  was:
Spark SQL should throw an exception when the pattern string is invalid:
{code:sql}
SELECT a LIKE 'm%aca' ESCAPE '%' from t;
{code}


> LikeSimplification should be skipped if escape is a wildcard character
> --
>
> Key: SPARK-33677
> URL: https://issues.apache.org/jira/browse/SPARK-33677
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Lu Lu
>Assignee: Lu Lu
>Priority: Major
>
> The LikeSimplification rule does not work correctly in many cases where the
> pattern contains escape characters:
> {code:sql}
> SELECT s LIKE 'm%aca' ESCAPE '%' from t;
> SELECT s LIKE 'maacaa' ESCAPE 'a' FROM t;
> {code}






[jira] [Updated] (SPARK-33677) LikeSimplification should be skipped if escape is a wildcard character

2020-12-06 Thread Lu Lu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Lu updated SPARK-33677:
--
Description: 
Spark SQL should throw an exception when the pattern string is invalid:
{code:sql}
SELECT a LIKE 'm%aca' ESCAPE '%' from t;
{code}

  was:
In ANSI mode, schema string parsing should fail if the schema uses ANSI 
reserved keyword as attribute name:
{code:scala}
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', 
map('timestampFormat', 'dd/MM/'));""").show


output:

Cannot parse the data type: 
no viable alternative at input 'time'(line 1, pos 0)

== SQL ==
time Timestamp
^^^
{code}

But this query may accidentally succeed in certain cases cause the DataType 
parser sticks to the configs of the first created session in the current thread:

{code:scala}
DataType.fromDDL("time Timestamp")
val newSpark = spark.newSession()
newSpark.conf.set("spark.sql.ansi.enabled", "true")
newSpark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', 
map('timestampFormat', 'dd/MM/'));""").show


output:

++
|from_json({"time":"26/10/2015"})|
++
|{2015-10-26 00:00...|
++
{code}


> LikeSimplification should be skipped if escape is a wildcard character
> --
>
> Key: SPARK-33677
> URL: https://issues.apache.org/jira/browse/SPARK-33677
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Lu Lu
>Assignee: Lu Lu
>Priority: Major
>
> Spark SQL should throw an exception when the pattern string is invalid:
> {code:sql}
> SELECT a LIKE 'm%aca' ESCAPE '%' from t;
> {code}






[jira] [Updated] (SPARK-33677) LikeSimplification should be skipped if escape is a wildcard character

2020-12-06 Thread Lu Lu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Lu updated SPARK-33677:
--
Fix Version/s: (was: 3.1.0)
Affects Version/s: (was: 3.0.1)
   3.1.0

> LikeSimplification should be skipped if escape is a wildcard character
> --
>
> Key: SPARK-33677
> URL: https://issues.apache.org/jira/browse/SPARK-33677
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Lu Lu
>Assignee: Lu Lu
>Priority: Major
>
> In ANSI mode, schema string parsing should fail if the schema uses an ANSI 
> reserved keyword as an attribute name:
> {code:scala}
> spark.conf.set("spark.sql.ansi.enabled", "true")
> spark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', 
> map('timestampFormat', 'dd/MM/'));""").show
> output:
> Cannot parse the data type: 
> no viable alternative at input 'time'(line 1, pos 0)
> == SQL ==
> time Timestamp
> ^^^
> {code}
> But this query may accidentally succeed in certain cases because the DataType 
> parser sticks to the configs of the first session created in the current thread:
> {code:scala}
> DataType.fromDDL("time Timestamp")
> val newSpark = spark.newSession()
> newSpark.conf.set("spark.sql.ansi.enabled", "true")
> newSpark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', 
> map('timestampFormat', 'dd/MM/'));""").show
> output:
> ++
> |from_json({"time":"26/10/2015"})|
> ++
> |{2015-10-26 00:00...|
> ++
> {code}






[jira] [Created] (SPARK-33677) LikeSimplification should be skipped if escape is a wildcard character

2020-12-06 Thread Lu Lu (Jira)
Lu Lu created SPARK-33677:
-

 Summary: LikeSimplification should be skipped if escape is a 
wildcard character
 Key: SPARK-33677
 URL: https://issues.apache.org/jira/browse/SPARK-33677
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1
Reporter: Lu Lu
Assignee: Lu Lu
 Fix For: 3.1.0


In ANSI mode, schema string parsing should fail if the schema uses an ANSI 
reserved keyword as an attribute name:
{code:scala}
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', 
map('timestampFormat', 'dd/MM/'));""").show


output:

Cannot parse the data type: 
no viable alternative at input 'time'(line 1, pos 0)

== SQL ==
time Timestamp
^^^
{code}

But this query may accidentally succeed in certain cases because the DataType 
parser sticks to the configs of the first session created in the current thread:

{code:scala}
DataType.fromDDL("time Timestamp")
val newSpark = spark.newSession()
newSpark.conf.set("spark.sql.ansi.enabled", "true")
newSpark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', 
map('timestampFormat', 'dd/MM/'));""").show


output:

++
|from_json({"time":"26/10/2015"})|
++
|{2015-10-26 00:00...|
++
{code}






[jira] [Created] (SPARK-33614) Fix the constant folding rule to skip it if the expression fails to execute

2020-11-30 Thread Lu Lu (Jira)
Lu Lu created SPARK-33614:
-

 Summary: Fix the constant folding rule to skip it if the 
expression fails to execute
 Key: SPARK-33614
 URL: https://issues.apache.org/jira/browse/SPARK-33614
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Lu Lu
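
There is no description yet. As a rough, hedged illustration of the intent (an
assumption about the approach, not the actual ConstantFolding patch; the helper
name is made up): evaluate a foldable expression eagerly, but keep the original
expression when evaluation throws, so the error surfaces at execution time
rather than during optimization.
{code:scala}
import scala.util.Try

// Illustrative sketch only: `evaluate` stands in for eagerly evaluating a foldable
// expression; if it throws (e.g. a division by zero under ANSI mode), the rule keeps
// the original expression instead of failing the whole optimization phase.
def foldIfEvaluable[E](expr: E)(evaluate: E => E): E =
  Try(evaluate(expr)).getOrElse(expr)
{code}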









[jira] [Updated] (SPARK-33432) SQL parser should use active SQLConf

2020-11-12 Thread Lu Lu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Lu updated SPARK-33432:
--
Description: 
In ANSI mode, schema string parsing should fail if the schema uses an ANSI 
reserved keyword as an attribute name:
{code:scala}
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', 
map('timestampFormat', 'dd/MM/'));""").show


output:

Cannot parse the data type: 
no viable alternative at input 'time'(line 1, pos 0)

== SQL ==
time Timestamp
^^^
{code}

But this query may accidentally succeed in certain cases because the DataType 
parser sticks to the configs of the first session created in the current thread:

{code:scala}
DataType.fromDDL("time Timestamp")
val newSpark = spark.newSession()
newSpark.conf.set("spark.sql.ansi.enabled", "true")
newSpark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', 
map('timestampFormat', 'dd/MM/'));""").show


output:

++
|from_json({"time":"26/10/2015"})|
++
|{2015-10-26 00:00...|
++
{code}

  was:
In ANSI mode, schema string parsing should fail if the schema uses ANSI 
reserved keyword as attribute name:
{code:scala}
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', 
map('timestampFormat', 'dd/MM/'));""").show


output:

Cannot parse the data type: 
no viable alternative at input 'time'(line 1, pos 0)

== SQL ==
time Timestamp
^^^
{code}

But this query may succeed in certain cases cause the DataType parser sticks to 
the configs of the first created session in the current thread:

{code:scala}
DataType.fromDDL("time Timestamp")
val newSpark = spark.newSession()
newSpark.conf.set("spark.sql.ansi.enabled", "true")
newSpark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', 
map('timestampFormat', 'dd/MM/'));""").show


output:

++
|from_json({"time":"26/10/2015"})|
++
|{2015-10-26 00:00...|
++
{code}


> SQL parser should use active SQLConf
> 
>
> Key: SPARK-33432
> URL: https://issues.apache.org/jira/browse/SPARK-33432
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Lu Lu
>Priority: Major
>
> In ANSI mode, schema string parsing should fail if the schema uses an ANSI 
> reserved keyword as an attribute name:
> {code:scala}
> spark.conf.set("spark.sql.ansi.enabled", "true")
> spark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', 
> map('timestampFormat', 'dd/MM/'));""").show
> output:
> Cannot parse the data type: 
> no viable alternative at input 'time'(line 1, pos 0)
> == SQL ==
> time Timestamp
> ^^^
> {code}
> But this query may accidentally succeed in certain cases because the DataType 
> parser sticks to the configs of the first session created in the current thread:
> {code:scala}
> DataType.fromDDL("time Timestamp")
> val newSpark = spark.newSession()
> newSpark.conf.set("spark.sql.ansi.enabled", "true")
> newSpark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', 
> map('timestampFormat', 'dd/MM/'));""").show
> output:
> ++
> |from_json({"time":"26/10/2015"})|
> ++
> |{2015-10-26 00:00...|
> ++
> {code}
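
To make the intent concrete, here is a self-contained sketch of the bug pattern
versus the fix pattern (the classes ActiveConf, CapturedConfParser and
ActiveConfParser are hypothetical stand-ins, not Spark's actual parser types):
the parser must read the active, per-session configuration on every call instead
of the configuration captured when the first parser instance was built.
{code:scala}
// Hypothetical stand-ins for illustration only.
object ActiveConf {
  private val current = new ThreadLocal[Map[String, String]] {
    override def initialValue(): Map[String, String] = Map("spark.sql.ansi.enabled" -> "false")
  }
  def get: Map[String, String] = current.get()
  def set(conf: Map[String, String]): Unit = current.set(conf)
}

// Bug pattern: the conf is frozen when the parser is first created, so a new
// session's ANSI setting is never seen.
class CapturedConfParser(conf: Map[String, String]) {
  def ansiEnabled: Boolean = conf("spark.sql.ansi.enabled").toBoolean
}

// Fix pattern: always consult the active conf at parse time.
class ActiveConfParser {
  def ansiEnabled: Boolean = ActiveConf.get("spark.sql.ansi.enabled").toBoolean
}
{code}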






[jira] [Updated] (SPARK-33432) SQL parser should use active SQLConf

2020-11-12 Thread Lu Lu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Lu updated SPARK-33432:
--
Description: 
In ANSI mode, schema string parsing should fail if the schema uses an ANSI 
reserved keyword as an attribute name:
{code:scala}
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', 
map('timestampFormat', 'dd/MM/'));""").show


output:

Cannot parse the data type: 
no viable alternative at input 'time'(line 1, pos 0)

== SQL ==
time Timestamp
^^^
{code}

But this query may succeed in certain cases because the DataType parser sticks 
to the configs of the first session created in the current thread:

{code:scala}
DataType.fromDDL("time Timestamp")
val newSpark = spark.newSession()
newSpark.conf.set("spark.sql.ansi.enabled", "true")
newSpark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', 
map('timestampFormat', 'dd/MM/'));""").show


output:

++
|from_json({"time":"26/10/2015"})|
++
|{2015-10-26 00:00...|
++
{code}

  was:
In ANSI mode, schema string parsing should fail if the schema uses ANSI 
reserved keyword as attribute name:
{code:scala}
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', 
map('timestampFormat', 'dd/MM/'));""").show


output:

Cannot parse the data type: 
no viable alternative at input 'time'(line 1, pos 0)

== SQL ==
time Timestamp
^^^
{code}
But this query may succeed in certain cases cause the DataType parser sticks to 
the configs of the first created session in the current thread:
{code:scala}
DataType.fromDDL("time Timestamp")
val newSpark = spark.newSession()
newSpark.conf.set("spark.sql.ansi.enabled", "true")
newSpark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', 
map('timestampFormat', 'dd/MM/'));""").show


output:

++
|from_json({"time":"26/10/2015"})|
++
|{2015-10-26 00:00...|
++
{code}


> SQL parser should use active SQLConf
> 
>
> Key: SPARK-33432
> URL: https://issues.apache.org/jira/browse/SPARK-33432
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Lu Lu
>Priority: Major
>
> In ANSI mode, schema string parsing should fail if the schema uses an ANSI 
> reserved keyword as an attribute name:
> {code:scala}
> spark.conf.set("spark.sql.ansi.enabled", "true")
> spark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', 
> map('timestampFormat', 'dd/MM/'));""").show
> output:
> Cannot parse the data type: 
> no viable alternative at input 'time'(line 1, pos 0)
> == SQL ==
> time Timestamp
> ^^^
> {code}
> But this query may succeed in certain cases because the DataType parser sticks 
> to the configs of the first session created in the current thread:
> {code:scala}
> DataType.fromDDL("time Timestamp")
> val newSpark = spark.newSession()
> newSpark.conf.set("spark.sql.ansi.enabled", "true")
> newSpark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', 
> map('timestampFormat', 'dd/MM/'));""").show
> output:
> ++
> |from_json({"time":"26/10/2015"})|
> ++
> |{2015-10-26 00:00...|
> ++
> {code}






[jira] [Updated] (SPARK-33432) SQL parser should use active SQLConf

2020-11-12 Thread Lu Lu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Lu updated SPARK-33432:
--
Summary: SQL parser should use active SQLConf  (was: DataType parser should 
use active SQLConf)

> SQL parser should use active SQLConf
> 
>
> Key: SPARK-33432
> URL: https://issues.apache.org/jira/browse/SPARK-33432
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Lu Lu
>Priority: Major
>
> In ANSI mode, schema string parsing should fail if the schema uses an ANSI 
> reserved keyword as an attribute name:
> {code:scala}
> spark.conf.set("spark.sql.ansi.enabled", "true")
> spark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', 
> map('timestampFormat', 'dd/MM/'));""").show
> output:
> Cannot parse the data type: 
> no viable alternative at input 'time'(line 1, pos 0)
> == SQL ==
> time Timestamp
> ^^^
> {code}
> But this query may succeed in certain cases because the DataType parser sticks 
> to the configs of the first session created in the current thread:
> {code:scala}
> DataType.fromDDL("time Timestamp")
> val newSpark = spark.newSession()
> newSpark.conf.set("spark.sql.ansi.enabled", "true")
> newSpark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', 
> map('timestampFormat', 'dd/MM/'));""").show
> output:
> ++
> |from_json({"time":"26/10/2015"})|
> ++
> |{2015-10-26 00:00...|
> ++
> {code}






[jira] [Updated] (SPARK-33432) DataType parser should use active SQLConf

2020-11-12 Thread Lu Lu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Lu updated SPARK-33432:
--
Description: 
In ANSI mode, schema string parsing should fail if the schema uses an ANSI 
reserved keyword as an attribute name:
{code:scala}
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', 
map('timestampFormat', 'dd/MM/'));""").show


output:

Cannot parse the data type: 
no viable alternative at input 'time'(line 1, pos 0)

== SQL ==
time Timestamp
^^^
{code}
But this query may succeed in certain cases because the DataType parser sticks 
to the configs of the first session created in the current thread:
{code:scala}
DataType.fromDDL("time Timestamp")
val newSpark = spark.newSession()
newSpark.conf.set("spark.sql.ansi.enabled", "true")
newSpark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', 
map('timestampFormat', 'dd/MM/'));""").show


output:

++
|from_json({"time":"26/10/2015"})|
++
|{2015-10-26 00:00...|
++
{code}

  was:
In ANSI mode, schema string parsing should fail if the schema uses ANSI 
reserved keyword as attribute name:
{code:scala}
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', 
map('timestampFormat', 'dd/MM/'));""").show


output:

Cannot parse the data type: 
no viable alternative at input 'time'(line 1, pos 0)

== SQL ==
time Timestamp
^^^
{code}
But this query may succeed in certain cases cause the DataType parser sticks to 
the configs of the first created session in the current thread:
{code:scala}
DataType.fromDDL("time Timestamp")
val newSpark = spark.newSession()
newSpark.conf.set("spark.sql.ansi.enabled", "true")
newSpark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', 
map('timestampFormat', 'dd/MM/'));""").show


output:

++
|from_json({"time":"26/10/2015"})|
++
|   {2015-10-26 00:00...|
++
{code}


> DataType parser should use active SQLConf
> -
>
> Key: SPARK-33432
> URL: https://issues.apache.org/jira/browse/SPARK-33432
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Lu Lu
>Priority: Major
>
> In ANSI mode, schema string parsing should fail if the schema uses an ANSI 
> reserved keyword as an attribute name:
> {code:scala}
> spark.conf.set("spark.sql.ansi.enabled", "true")
> spark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', 
> map('timestampFormat', 'dd/MM/'));""").show
> output:
> Cannot parse the data type: 
> no viable alternative at input 'time'(line 1, pos 0)
> == SQL ==
> time Timestamp
> ^^^
> {code}
> But this query may succeed in certain cases because the DataType parser sticks 
> to the configs of the first session created in the current thread:
> {code:scala}
> DataType.fromDDL("time Timestamp")
> val newSpark = spark.newSession()
> newSpark.conf.set("spark.sql.ansi.enabled", "true")
> newSpark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', 
> map('timestampFormat', 'dd/MM/'));""").show
> output:
> ++
> |from_json({"time":"26/10/2015"})|
> ++
> |{2015-10-26 00:00...|
> ++
> {code}






[jira] [Created] (SPARK-33432) DataType parser should use active SQLConf

2020-11-12 Thread Lu Lu (Jira)
Lu Lu created SPARK-33432:
-

 Summary: DataType parser should use active SQLConf
 Key: SPARK-33432
 URL: https://issues.apache.org/jira/browse/SPARK-33432
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1
Reporter: Lu Lu


In ANSI mode, schema string parsing should fail if the schema uses an ANSI 
reserved keyword as an attribute name:
{code:scala}
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', 
map('timestampFormat', 'dd/MM/'));""").show


output:

Cannot parse the data type: 
no viable alternative at input 'time'(line 1, pos 0)

== SQL ==
time Timestamp
^^^
{code}
But this query may succeed in certain cases because the DataType parser sticks 
to the configs of the first session created in the current thread:
{code:scala}
DataType.fromDDL("time Timestamp")
val newSpark = spark.newSession()
newSpark.conf.set("spark.sql.ansi.enabled", "true")
newSpark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', 
map('timestampFormat', 'dd/MM/'));""").show


output:

++
|from_json({"time":"26/10/2015"})|
++
|   {2015-10-26 00:00...|
++
{code}






[jira] [Created] (SPARK-33389) make internal classes of SparkSession always using active SQLConf

2020-11-08 Thread Lu Lu (Jira)
Lu Lu created SPARK-33389:
-

 Summary: make internal classes of SparkSession always using active 
SQLConf
 Key: SPARK-33389
 URL: https://issues.apache.org/jira/browse/SPARK-33389
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Lu Lu









[jira] [Updated] (SPARK-33140) make all sub-class of Rule[QueryPlan] using SQLConf.get

2020-11-08 Thread Lu Lu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Lu updated SPARK-33140:
--
Summary: make all sub-class of Rule[QueryPlan] using SQLConf.get  (was: 
make Analyzer rules using SQLConf.get)

> make all sub-class of Rule[QueryPlan] using SQLConf.get
> ---
>
> Key: SPARK-33140
> URL: https://issues.apache.org/jira/browse/SPARK-33140
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Assignee: Leanken.Lin
>Priority: Major
> Fix For: 3.1.0
>
>
> TODO






[jira] [Commented] (SPARK-33008) Division by zero on divide-like operations returns incorrect result

2020-10-29 Thread Lu Lu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223341#comment-17223341
 ] 

Lu Lu commented on SPARK-33008:
---

Please assign this to me. [~cloud_fan]

> Division by zero on divide-like operations returns incorrect result
> ---
>
> Key: SPARK-33008
> URL: https://issues.apache.org/jira/browse/SPARK-33008
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Lu Lu
>Priority: Major
> Fix For: 3.1.0
>
>
> Spark SQL:
> {code:sql}
> spark-sql> SELECT 1/0;
> NULL
> Time taken: 3.002 seconds, Fetched 1 row(s)
> {code}
> PostgreSQL:
> {code:sql}
> postgres=# SELECT 1/0;
> ERROR:  division by zero
> {code}
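
For reference, the behavior Spark later gates behind ANSI mode can be sketched
as follows (illustrative; it assumes a `spark` session as in spark-shell, and
the exact exception class and message vary across 3.x versions):
{code:scala}
// Under ANSI mode, integer division by zero raises an error instead of returning NULL.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT 1/0").show()   // expected: ArithmeticException: divide by zero
{code}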






[jira] [Updated] (SPARK-33008) Division by zero on divide-like operations returns incorrect result

2020-09-26 Thread Lu Lu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Lu updated SPARK-33008:
--
Summary: Division by zero on divide-like operations returns incorrect 
result  (was: Throw exception on division by zero)

> Division by zero on divide-like operations returns incorrect result
> ---
>
> Key: SPARK-33008
> URL: https://issues.apache.org/jira/browse/SPARK-33008
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Lu Lu
>Priority: Major
>
> Spark SQL:
> {code:sql}
> spark-sql> SELECT 1/0;
> NULL
> Time taken: 3.002 seconds, Fetched 1 row(s)
> {code}
> PostgreSQL:
> {code:sql}
> postgres=# SELECT 1/0;
> ERROR:  division by zero
> {code}






[jira] [Created] (SPARK-33008) Throw exception on division by zero

2020-09-26 Thread Lu Lu (Jira)
Lu Lu created SPARK-33008:
-

 Summary: Throw exception on division by zero
 Key: SPARK-33008
 URL: https://issues.apache.org/jira/browse/SPARK-33008
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Lu Lu


Spark SQL:
{code:java}
spark-sql> SELECT 1/0;
NULL
Time taken: 3.002 seconds, Fetched 1 row(s)
{code}
PostgreSQL:
{code:java}
postgres=# SELECT 1/0;
ERROR:  division by zero
{code}






[jira] [Updated] (SPARK-33008) Throw exception on division by zero

2020-09-26 Thread Lu Lu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Lu updated SPARK-33008:
--
Description: 
Spark SQL:
{code:sql}
spark-sql> SELECT 1/0;
NULL
Time taken: 3.002 seconds, Fetched 1 row(s)
{code}
PostgreSQL:
{code:sql}
postgres=# SELECT 1/0;
ERROR:  division by zero
{code}

  was:
Spark SQL:
{code:sql}
spark-sql> SELECT 1/0;
NULL
Time taken: 3.002 seconds, Fetched 1 row(s)
{code:sql}
PostgreSQL:
{code:java}
postgres=# SELECT 1/0;
ERROR:  division by zero
{code}


> Throw exception on division by zero
> ---
>
> Key: SPARK-33008
> URL: https://issues.apache.org/jira/browse/SPARK-33008
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Lu Lu
>Priority: Major
>
> Spark SQL:
> {code:sql}
> spark-sql> SELECT 1/0;
> NULL
> Time taken: 3.002 seconds, Fetched 1 row(s)
> {code}
> PostgreSQL:
> {code:sql}
> postgres=# SELECT 1/0;
> ERROR:  division by zero
> {code}






[jira] [Updated] (SPARK-33008) Throw exception on division by zero

2020-09-26 Thread Lu Lu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Lu updated SPARK-33008:
--
Description: 
Spark SQL:
{code:sql}
spark-sql> SELECT 1/0;
NULL
Time taken: 3.002 seconds, Fetched 1 row(s)
{code:sql}
PostgreSQL:
{code:java}
postgres=# SELECT 1/0;
ERROR:  division by zero
{code}

  was:
Spark SQL:
{code:java}
spark-sql> SELECT 1/0;
NULL
Time taken: 3.002 seconds, Fetched 1 row(s)
{code}
PostgreSQL:
{code:java}
postgres=# SELECT 1/0;
ERROR:  division by zero
{code}


> Throw exception on division by zero
> ---
>
> Key: SPARK-33008
> URL: https://issues.apache.org/jira/browse/SPARK-33008
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Lu Lu
>Priority: Major
>
> Spark SQL:
> {code:sql}
> spark-sql> SELECT 1/0;
> NULL
> Time taken: 3.002 seconds, Fetched 1 row(s)
> {code:sql}
> PostgreSQL:
> {code:java}
> postgres=# SELECT 1/0;
> ERROR:  division by zero
> {code}






[jira] [Updated] (SPARK-4115) [GraphX] add overrided count for EdgeRDD

2014-10-28 Thread Lu Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Lu updated SPARK-4115:
-
Summary: [GraphX] add overrided count for EdgeRDD  (was: add overrided 
count for EdgeRDD)

> [GraphX] add overrided count for EdgeRDD
> 
>
> Key: SPARK-4115
> URL: https://issues.apache.org/jira/browse/SPARK-4115
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 1.1.0
>Reporter: Lu Lu
>Priority: Minor
> Fix For: 1.1.1
>
>
> Add an overridden count for edge counting of EdgeRDD.






[jira] [Created] (SPARK-4115) add overrided count for EdgeRDD

2014-10-28 Thread Lu Lu (JIRA)
Lu Lu created SPARK-4115:


 Summary: add overrided count for EdgeRDD
 Key: SPARK-4115
 URL: https://issues.apache.org/jira/browse/SPARK-4115
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 1.1.0
Reporter: Lu Lu
Priority: Minor
 Fix For: 1.1.1


Add an overridden count for edge counting of EdgeRDD.
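
A rough sketch of the idea, using a hypothetical stand-in for EdgeRDD's internal
partitionsRDD (per-partition edge blocks): count by summing block sizes rather
than iterating over individual edges.
{code:scala}
import org.apache.spark.rdd.RDD

// Hypothetical layout for illustration: each element is (partitionId, edge block).
// An overridden count only needs the size of each block, not the edges themselves.
def overriddenCount(partitionsRDD: RDD[(Int, Array[(Long, Long)])]): Long =
  partitionsRDD.map { case (_, edges) => edges.length.toLong }.reduce(_ + _)
{code}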






[jira] [Created] (SPARK-4109) Task.stageId is not been deserialized correctly

2014-10-27 Thread Lu Lu (JIRA)
Lu Lu created SPARK-4109:


 Summary: Task.stageId is not been deserialized correctly
 Key: SPARK-4109
 URL: https://issues.apache.org/jira/browse/SPARK-4109
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2, 1.0.0
Reporter: Lu Lu
 Fix For: 1.0.3


The two subclasses of Task, ShuffleMapTask and ResultTask, do not correctly 
deserialize stageId. Therefore, accessing TaskContext.stageId always returns 
zero to the user.
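
A minimal way to observe the symptom (a sketch that assumes an active `sc` as in
spark-shell and uses TaskContext.get, the later accessor; the 1.0.x API differed):
{code:scala}
import org.apache.spark.TaskContext

// With the bug, every task reports stage id 0 even when the job actually runs
// in a later stage.
val observedStageIds = sc.parallelize(1 to 4, 2)
  .map(_ => TaskContext.get.stageId)
  .distinct()
  .collect()
println(observedStageIds.mkString(", "))
{code}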






[jira] [Updated] (SPARK-2818) Improve joinning RDDs that transformed from the same parent RDD

2014-08-04 Thread Lu Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Lu updated SPARK-2818:
-

Description: 
if the joinning RDDs are originating from a same cached RDD, the DAGScheduler 
will submit redundant stages to compute and cache the common parent.
For example:

{code}
val edges = sc.textFile(...).cache()
val bigSrc = edges.groupByKey().filter(...)
val reversed = edges.map(edge => (edge._2, edge._1))
val bigDst = reversed.groupByKey().filter(...)
bigSrc.join(bigDst).count
{code}

The final count action will trigger two stages that both compute the edges RDD. 
This results in two performance problems:
(1) if the resources are sufficient, the two stages will run concurrently and 
read the same HDFS file at the same time;
(2) if the two stages run one after another, the tasks of the latter stage can 
read the cached blocks of the edges RDD immediately, but they cannot achieve 
data locality because the block location information is unavailable when the 
stages are submitted.

  was:
if the joinning RDDs are originating from a same cached RDD, the DAGScheduler 
will submit redundant stages to compute and cache the common parent.
For example:

{code}
val edges = sc.textFile(...).cache()
val bigSrc = edges.groupByKey().filter(...)
val reversed = edges.map(edge => (edge._2, edge._1))
val bigDst = reversed.groupByKey().filter(...)
bigSrc.join(bigDst).count
{code}

The final count action will trigger two stages both to compute the edges RDD. 
It will result to two performance problerm:
(1) if the resources are sufficient, these two stages will be running 
concurrently and read the same HDFS file at the same time.
(2) if the two stages run one by one, the tasks of the latter stage can read 
the cached blocks of the edges RDD immediately. But it cannot achieve 
data-locality for these tasks because that the block location information are 
unavailable when submiting the stages.

Summary: Improve joinning RDDs that transformed from the same parent 
RDD  (was: Improve joinning RDDs that transformed from the same cached RDD)

> Improve joinning RDDs that transformed from the same parent RDD
> ---
>
> Key: SPARK-2818
> URL: https://issues.apache.org/jira/browse/SPARK-2818
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Lu Lu
>
> If the joining RDDs originate from the same cached RDD, the DAGScheduler will 
> submit redundant stages to compute and cache the common parent.
> For example:
> {code}
> val edges = sc.textFile(...).cache()
> val bigSrc = edges.groupByKey().filter(...)
> val reversed = edges.map(edge => (edge._2, edge._1))
> val bigDst = reversed.groupByKey().filter(...)
> bigSrc.join(bigDst).count
> {code}
> The final count action will trigger two stages that both compute the edges RDD. 
> This results in two performance problems:
> (1) if the resources are sufficient, the two stages will run concurrently and 
> read the same HDFS file at the same time;
> (2) if the two stages run one after another, the tasks of the latter stage can 
> read the cached blocks of the edges RDD immediately, but they cannot achieve 
> data locality because the block location information is unavailable when the 
> stages are submitted.
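
One common workaround today, independent of any scheduler change (a sketch: the
path, the parseEdge helper and the filter threshold are illustrative, not from
the issue): materialize the shared parent first, so both downstream stages read
already-cached blocks whose locations are known.
{code:scala}
// Illustrative only; assumes an active `sc` as in spark-shell.
def parseEdge(line: String): (Long, Long) = {
  val Array(src, dst) = line.split("\\s+")
  (src.toLong, dst.toLong)
}

val edges = sc.textFile("hdfs:///path/to/edges").map(parseEdge).cache()
edges.count()   // force the cache up front, before building the two downstream stages

val bigSrc = edges.groupByKey().filter { case (_, dsts) => dsts.size > 100 }
val reversed = edges.map { case (src, dst) => (dst, src) }
val bigDst = reversed.groupByKey().filter { case (_, srcs) => srcs.size > 100 }
bigSrc.join(bigDst).count()
{code}
The improvement proposed here would make such manual materialization unnecessary.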






[jira] [Updated] (SPARK-2818) Improve joinning RDDs that transformed from the same cached RDD

2014-08-04 Thread Lu Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Lu updated SPARK-2818:
-

Description: 
if the joinning RDDs are originating from a same cached RDD, the DAGScheduler 
will submit redundant stages to compute and cache the common parent.
For example:

{code}
val edges = sc.textFile(...).cache()
val bigSrc = edges.groupByKey().filter(...)
val reversed = edges.map(edge => (edge._2, edge._1))
val bigDst = reversed.groupByKey().filter(...)
bigSrc.join(bigDst).count
{code}

The final count action will trigger two stages both to compute the edges RDD. 
It will result to two performance problerm:
(1) if the resources are sufficient, these two stages will be running 
concurrently and read the same HDFS file at the same time.
(2) if the two stages run one by one, the tasks of the latter stage can read 
the cached blocks of the edges RDD immediately. But it cannot achieve 
data-locality for these tasks because that the block location information are 
unavailable when submiting the stages.

  was:
if the joinning RDDs are originating from a same cached RDD, the DAGScheduler 
will submit redundant stages to compute and cache the common parent.
For example:

{code}
val edges = sc.textFile(...).cache()
val bigSrc = edges.groupByKey().filter(...)
val reversed = edges.map(edge => (edge._2, edge._1))
val bigDst = reversed.groupByKey().filter(...)
bigSrc.join(bigDst).count
{code}

The final count action will trigger two stages both to compute the edges RDD. 
It will result to two performance problerm:
(1) if the resources are sufficient, these two stages will be running 
concurrently and read the same HDFS file at the same time.
(2) if the two stages run one by one, the tasks of the latter stage can read 
the cached blocks of the edges RDD directly. But it cannot achieve 
data-locality for the latter stage because that the block location information 
are not known when submiting the stages.


> Improve joinning RDDs that transformed from the same cached RDD
> ---
>
> Key: SPARK-2818
> URL: https://issues.apache.org/jira/browse/SPARK-2818
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Lu Lu
>
> if the joinning RDDs are originating from a same cached RDD, the DAGScheduler 
> will submit redundant stages to compute and cache the common parent.
> For example:
> {code}
> val edges = sc.textFile(...).cache()
> val bigSrc = edges.groupByKey().filter(...)
> val reversed = edges.map(edge => (edge._2, edge._1))
> val bigDst = reversed.groupByKey().filter(...)
> bigSrc.join(bigDst).count
> {code}
> The final count action will trigger two stages both to compute the edges RDD. 
> It will result to two performance problerm:
> (1) if the resources are sufficient, these two stages will be running 
> concurrently and read the same HDFS file at the same time.
> (2) if the two stages run one by one, the tasks of the latter stage can read 
> the cached blocks of the edges RDD immediately. But it cannot achieve 
> data-locality for these tasks because that the block location information are 
> unavailable when submiting the stages.






[jira] [Updated] (SPARK-2818) Improve joinning RDDs that transformed from the same cached RDD

2014-08-04 Thread Lu Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Lu updated SPARK-2818:
-

Description: 
if the joinning RDDs are originating from a same cached RDD, the DAGScheduler 
will submit redundant stages to compute and cache the common parent.
For example:

{code}
val edges = sc.textFile(...).cache()
val bigSrc = edges.groupByKey().filter(...)
val reversed = edges.map(edge => (edge._2, edge._1))
val bigDst = reversed.groupByKey().filter(...)
bigSrc.join(bigDst).count
{code}

The final count action will trigger two stages both to compute the edges RDD. 
It will result to two performance problerm:
(1) if the resources are sufficient, these two stages will be running 
concurrently and read the same HDFS file at the same time.
(2) if the two stages run one by one, the tasks of the latter stage can read 
the cached blocks of the edges RDD directly. But it cannot achieve 
data-locality for the latter stage because that the block location information 
are not known when submiting the stages.

  was:
if the joinning RDDs are originating from a same cached RDD a, the DAGScheduler 
will submit redundant stages to compute and cache the RDD a.
For example:

{code}
val edges = sc.textFile(...).cache()
val bigSrc = edges.groupByKey().filter(...)
val reversed = edges.map(edge => (edge._2, edge._1))
val bigDst = reversed.groupByKey().filter(...)
bigSrc.join(bigDst).count
{code}

The final count action will trigger two stages both to compute the edges RDD. 
It will result to two performance problerm:
(1) if the resources are sufficient, these two stages will be running 
concurrently and read the same HDFS file at the same time.
(2) if the two stages run one by one, the tasks of the latter stage can read 
the cached blocks of the edges RDD directly. But it cannot achieve 
data-locality for the latter stage because that the block location information 
are not known when submiting the stages.


> Improve joinning RDDs that transformed from the same cached RDD
> ---
>
> Key: SPARK-2818
> URL: https://issues.apache.org/jira/browse/SPARK-2818
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Lu Lu
>
> if the joinning RDDs are originating from a same cached RDD, the DAGScheduler 
> will submit redundant stages to compute and cache the common parent.
> For example:
> {code}
> val edges = sc.textFile(...).cache()
> val bigSrc = edges.groupByKey().filter(...)
> val reversed = edges.map(edge => (edge._2, edge._1))
> val bigDst = reversed.groupByKey().filter(...)
> bigSrc.join(bigDst).count
> {code}
> The final count action will trigger two stages both to compute the edges RDD. 
> It will result to two performance problerm:
> (1) if the resources are sufficient, these two stages will be running 
> concurrently and read the same HDFS file at the same time.
> (2) if the two stages run one by one, the tasks of the latter stage can read 
> the cached blocks of the edges RDD directly. But it cannot achieve 
> data-locality for the latter stage because that the block location 
> information are not known when submiting the stages.






[jira] [Created] (SPARK-2827) Add DegreeDist function support

2014-08-04 Thread Lu Lu (JIRA)
Lu Lu created SPARK-2827:


 Summary: Add DegreeDist function support
 Key: SPARK-2827
 URL: https://issues.apache.org/jira/browse/SPARK-2827
 Project: Spark
  Issue Type: New Feature
  Components: GraphX
Reporter: Lu Lu


Add degree distribution operators in GraphOps for GraphX.
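
A sketch of what such an operator might look like (illustrative signature, not
the proposed API): map each vertex to its degree and count how many vertices
share each degree.
{code:scala}
import scala.reflect.ClassTag
import org.apache.spark.graphx.Graph

// Illustrative helper: the result maps a degree to the number of vertices
// having that degree.
def degreeDistribution[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): Map[Int, Long] =
  graph.degrees.map { case (_, degree) => degree }.countByValue().toMap
{code}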



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2818) Improve joinning RDDs that transformed from the same cached RDD

2014-08-04 Thread Lu Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Lu updated SPARK-2818:
-

Description: 
if the joinning RDDs are originating from a same cached RDD a, the DAGScheduler 
will submit redundant stages to compute and cache the RDD a.
For example:

{code}
val edges = sc.textFile(...).cache()
val bigSrc = edges.groupByKey().filter(...)
val reversed = edges.map(edge => (edge._2, edge._1))
val bigDst = reversed.groupByKey().filter(...)
bigSrc.join(bigDst).count
{code}

The final count action will trigger two stages both to compute the edges RDD. 
It will result to two performance problerm:
(1) if the resources are sufficient, these two stages will be running 
concurrently and read the same HDFS file at the same time.
(2) if the two stages run one by one, the tasks of the latter stage can read 
the cached blocks of the edges RDD directly. But it cannot achieve 
data-locality for the latter stage because that the block location information 
are not known when submiting the stages.

  was:
if the joinning RDDs are originating from a same cached RDD a, the DAGScheduler 
will submit redundant stages to compute and cache the RDD a.
For example:

```
val edges = sc.textFile(...).cache()
val bigSrc = edges.groupByKey().filter(...)
val reversed = edges.map(edge => (edge._2, edge._1))
val bigDst = reversed.groupByKey().filter(...)
bigSrc.join(bigDst).count
```

The final count action will trigger two stages both to compute the edges RDD. 
It will result to two performance problerm:
(1) if the resources are sufficient, these two stages will be running 
concurrently and read the same HDFS file at the same time.
(2) if the two stages run one by one, the tasks of the latter stage can read 
the cached blocks of the edges RDD directly. But it cannot achieve 
data-locality for the latter stage because that the block location information 
are not known when submiting the stages.


> Improve joinning RDDs that transformed from the same cached RDD
> ---
>
> Key: SPARK-2818
> URL: https://issues.apache.org/jira/browse/SPARK-2818
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Lu Lu
>
> if the joinning RDDs are originating from a same cached RDD a, the 
> DAGScheduler will submit redundant stages to compute and cache the RDD a.
> For example:
> {code}
> val edges = sc.textFile(...).cache()
> val bigSrc = edges.groupByKey().filter(...)
> val reversed = edges.map(edge => (edge._2, edge._1))
> val bigDst = reversed.groupByKey().filter(...)
> bigSrc.join(bigDst).count
> {code}
> The final count action will trigger two stages both to compute the edges RDD. 
> It will result to two performance problerm:
> (1) if the resources are sufficient, these two stages will be running 
> concurrently and read the same HDFS file at the same time.
> (2) if the two stages run one by one, the tasks of the latter stage can read 
> the cached blocks of the edges RDD directly. But it cannot achieve 
> data-locality for the latter stage because that the block location 
> information are not known when submiting the stages.






[jira] [Updated] (SPARK-2818) Improve joinning RDDs that transformed from the same cached RDD

2014-08-04 Thread Lu Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Lu updated SPARK-2818:
-

Description: 
if the joinning RDDs are originating from a same cached RDD a, the DAGScheduler 
will submit redundant stages to compute and cache the RDD a.
For example:

```
val edges = sc.textFile(...).cache()
val bigSrc = edges.groupByKey().filter(...)
val reversed = edges.map(edge => (edge._2, edge._1))
val bigDst = reversed.groupByKey().filter(...)
bigSrc.join(bigDst).count
```

The final count action will trigger two stages both to compute the edges RDD. 
It will result to two performance problerm:
(1) if the resources are sufficient, these two stages will be running 
concurrently and read the same HDFS file at the same time.
(2) if the two stages run one by one, the tasks of the latter stage can read 
the cached blocks of the edges RDD directly. But it cannot achieve 
data-locality for the latter stage because that the block location information 
are not known when submiting the stages.

  was:
if the joinning RDDs are originating from a same cached RDD a, the DAGScheduler 
will submit redundant stages to compute and cache the RDD a.
For example:

val edges = sc.textFile(...).cache()
val bigSrc = edges.groupByKey().filter(...)
val reversed = edges.map(edge => (edge._2, edge._1))
val bigDst = reversed.groupByKey().filter(...)
bigSrc.join(bigDst).count

The final count action will trigger two stages both to compute the edges RDD. 
It will result to two performance problerm:
(1) if the resources are sufficient, these two stages will be running 
concurrently and read the same HDFS file at the same time.
(2) if the two stages run one by one, the tasks of the latter stage can read 
the cached blocks of the edges RDD directly. But it cannot achieve 
data-locality for the latter stage because that the block location information 
are not known when submiting the stages.


> Improve joinning RDDs that transformed from the same cached RDD
> ---
>
> Key: SPARK-2818
> URL: https://issues.apache.org/jira/browse/SPARK-2818
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Lu Lu
>
> if the joinning RDDs are originating from a same cached RDD a, the 
> DAGScheduler will submit redundant stages to compute and cache the RDD a.
> For example:
> ```
> val edges = sc.textFile(...).cache()
> val bigSrc = edges.groupByKey().filter(...)
> val reversed = edges.map(edge => (edge._2, edge._1))
> val bigDst = reversed.groupByKey().filter(...)
> bigSrc.join(bigDst).count
> ```
> The final count action will trigger two stages both to compute the edges RDD. 
> It will result to two performance problerm:
> (1) if the resources are sufficient, these two stages will be running 
> concurrently and read the same HDFS file at the same time.
> (2) if the two stages run one by one, the tasks of the latter stage can read 
> the cached blocks of the edges RDD directly. But it cannot achieve 
> data-locality for the latter stage because that the block location 
> information are not known when submiting the stages.






[jira] [Updated] (SPARK-2823) GraphX jobs throw IllegalArgumentException

2014-08-04 Thread Lu Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Lu updated SPARK-2823:
-

Description: 
If the user sets “spark.default.parallelism” and its value differs from the 
EdgeRDD partition number, GraphX jobs will throw an IllegalArgumentException:

14/07/26 21:06:51 WARN DAGScheduler: Creating new stage failed due to exception 
- job: 1
java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of 
partitions
at 
org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at 
org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:54)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getShuffleMapStage(DAGScheduler.scala:197)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:272)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:269)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:274)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:269)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:274)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:269)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269)
at 
org.apache.spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:279)
at 
org.apache.spark.scheduler.DAGScheduler.newStage(DAGScheduler.scala:219)
at 
org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:672)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1184)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

  was:
If the users set “spark.default.parallelism” and the value is different with 
the EdgeRDD partition number, GraphX jobs will throw IllegalArgumentException:

14/07/26 21:06:51 WARN DAGScheduler: Creating new stage failed due to exception 
- job: 1
.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of 
partitions
at 
org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.sp

[jira] [Created] (SPARK-2823) GraphX jobs throw IllegalArgumentException

2014-08-03 Thread Lu Lu (JIRA)
Lu Lu created SPARK-2823:


 Summary: GraphX jobs throw IllegalArgumentException
 Key: SPARK-2823
 URL: https://issues.apache.org/jira/browse/SPARK-2823
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Reporter: Lu Lu


If the user sets “spark.default.parallelism” and its value differs from the 
EdgeRDD partition number, GraphX jobs will throw an IllegalArgumentException:

14/07/26 21:06:51 WARN DAGScheduler: Creating new stage failed due to exception 
- job: 1
java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of 
partitions
at 
org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at 
org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:54)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getShuffleMapStage(DAGScheduler.scala:197)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:272)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:269)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:274)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:269)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:274)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:269)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269)
at 
org.apache.spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:279)
at 
org.apache.spark.scheduler.DAGScheduler.newStage(DAGScheduler.scala:219)
at 
org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:672)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1184)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
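
A reproduction sketch under stated assumptions (the path, app name, master and
parallelism value are illustrative; whether a given job hits the zip failure
depends on how its shuffles pick partition counts):
{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader

// Illustrative: a default parallelism that differs from the edge file's partition
// count can lead downstream GraphX stages to zip RDDs with unequal partition numbers.
val conf = new SparkConf()
  .setAppName("graphx-default-parallelism-repro")
  .setMaster("local[2]")
  .set("spark.default.parallelism", "7")
val sc = new SparkContext(conf)

val graph = GraphLoader.edgeListFile(sc, "hdfs:///path/to/edges.txt")
println(graph.degrees.count())
{code}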






[jira] [Updated] (SPARK-2818) Improve joinning RDDs that transformed from the same cached RDD

2014-08-03 Thread Lu Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Lu updated SPARK-2818:
-

Component/s: Spark Core
Description: 
if the joinning RDDs are originating from a same cached RDD a, the DAGScheduler 
will submit redundant stages to compute and cache the RDD a.
For example:

val edges = sc.textFile(...).cache()
val bigSrc = edges.groupByKey().filter(...)
val reversed = edges.map(edge => (edge._2, edge._1))
val bigDst = reversed.groupByKey().filter(...)
bigSrc.join(bigDst).count

The final count action will trigger two stages both to compute the edges RDD. 
It will result to two performance problerm:
(1) if the resources are sufficient, these two stages will be running 
concurrently and read the same HDFS file at the same time.
(2) if the two stages run one by one, the tasks of the latter stage can read 
the cached blocks of the edges RDD directly. But it cannot achieve 
data-locality for the latter stage because that the block location information 
are not known when submiting the stages.

> Improve joinning RDDs that transformed from the same cached RDD
> ---
>
> Key: SPARK-2818
> URL: https://issues.apache.org/jira/browse/SPARK-2818
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Lu Lu
>
> if the joinning RDDs are originating from a same cached RDD a, the 
> DAGScheduler will submit redundant stages to compute and cache the RDD a.
> For example:
> val edges = sc.textFile(...).cache()
> val bigSrc = edges.groupByKey().filter(...)
> val reversed = edges.map(edge => (edge._2, edge._1))
> val bigDst = reversed.groupByKey().filter(...)
> bigSrc.join(bigDst).count
> The final count action will trigger two stages both to compute the edges RDD. 
> It will result to two performance problerm:
> (1) if the resources are sufficient, these two stages will be running 
> concurrently and read the same HDFS file at the same time.
> (2) if the two stages run one by one, the tasks of the latter stage can read 
> the cached blocks of the edges RDD directly. But it cannot achieve 
> data-locality for the latter stage because that the block location 
> information are not known when submiting the stages.






[jira] [Created] (SPARK-2818) Improve joinning RDDs that transformed from the same cached RDD

2014-08-03 Thread Lu Lu (JIRA)
Lu Lu created SPARK-2818:


 Summary: Improve joinning RDDs that transformed from the same 
cached RDD
 Key: SPARK-2818
 URL: https://issues.apache.org/jira/browse/SPARK-2818
 Project: Spark
  Issue Type: Improvement
Reporter: Lu Lu





