[jira] [Commented] (SPARK-24395) Fix Behavior of NOT IN with Literals Containing NULL

2018-05-29 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494769#comment-16494769
 ] 

Xiao Li commented on SPARK-24395:
-

I think Oracle returns a different answer. We should fix this. 

> Fix Behavior of NOT IN with Literals Containing NULL
> 
>
> Key: SPARK-24395
> URL: https://issues.apache.org/jira/browse/SPARK-24395
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Miles Yucht
>Priority: Major
>
> Spark does not return the correct answer when evaluating NOT IN in some 
> cases. For example:
> {code:java}
> CREATE TEMPORARY VIEW m AS SELECT * FROM VALUES
>   (null, null)
>   AS m(a, b);
> SELECT *
> FROM   m
> WHERE  a IS NULL AND b IS NULL
>    AND (a, b) NOT IN ((0, 1.0), (2, 3.0), (4, CAST(null AS DECIMAL(2, 1))));
> {code}
> According to the semantics of null-aware anti-join, this should return no 
> rows. However, it actually returns the row {{NULL NULL}}. This was found by 
> inspecting the unit tests added for SPARK-24381 
> ([https://github.com/apache/spark/pull/21425#pullrequestreview-123421822]).
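> To spell out the expected semantics (a hedged illustration of standard three-valued logic, not Spark's actual null-aware anti-join rewrite): any equality comparison involving NULL evaluates to UNKNOWN, so the whole NOT IN predicate is UNKNOWN rather than TRUE, and the WHERE clause drops the row.
> {code:sql}
> -- NULL on the probe side makes the comparison UNKNOWN, never TRUE:
> SELECT CAST(null AS INT) NOT IN (0, 2, 4);   -- yields NULL (UNKNOWN)
> -- A NULL in the IN-list has the same effect for a non-matching probe value:
> SELECT 5 NOT IN (0, 2, CAST(null AS INT));   -- yields NULL (UNKNOWN)
> -- WHERE keeps only rows whose predicate is TRUE, so such rows are filtered out.
> {code}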
> *Acceptance Criteria*:
>  * We should be able to add the following test cases back to 
> {{subquery/in-subquery/not-in-unit-test-multi-column-literal.sql}}:
> {code:java}
>   -- Case 2
>   -- (subquery contains a row with null in all columns -> row not returned)
> SELECT *
> FROM   m
> WHERE  (a, b) NOT IN ((CAST (null AS INT), CAST (null AS DECIMAL(2, 1))));
>   -- Case 3
>   -- (probe-side columns are all null -> row not returned)
> SELECT *
> FROM   m
> WHERE  a IS NULL AND b IS NULL -- Matches only (null, null)
>    AND (a, b) NOT IN ((0, 1.0), (2, 3.0), (4, CAST(null AS DECIMAL(2, 1))));
>   -- Case 4
>   -- (one column null, other column matches a row in the subquery result -> 
> row not returned)
> SELECT *
> FROM   m
> WHERE  b = 1.0 -- Matches (null, 1.0)
>    AND (a, b) NOT IN ((0, 1.0), (2, 3.0), (4, CAST(null AS DECIMAL(2, 1))));
> {code}
>  
> cc [~smilegator] [~juliuszsompolski]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24423) Add a new option `query` for JDBC sources

2018-05-29 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24423:

Description: 
Currently, our JDBC connector provides the option `dbtable` for users to 
specify the to-be-loaded JDBC source table. 
{code} 
 val jdbcDf = spark.read
   .format("jdbc")
   .option("*dbtable*", "dbName.tableName")
   .options(jdbcCredentials: Map)
   .load()
{code} 
 Normally, users do not fetch the whole JDBC table, because of the poor 
performance/throughput of JDBC; they typically fetch only a small subset of the 
table. Advanced users can do this by passing a subquery as the option:
{code} 
 val query = """ (select * from tableName limit 10) as tmp """
 val jdbcDf = spark.read
   .format("jdbc")
   .option("*dbtable*", query)
   .options(jdbcCredentials: Map)
   .load()
{code} 
 However, this is not straightforward for end users. We should simply allow users to 
specify the query via a new option `query` and handle the complexity for them. 
{code} 
 val query = """select * from tableName limit 10"""
 val jdbcDf = spark.read
   .format("jdbc")
   .option("*{color:#ff}query{color}*", query)
   .options(jdbcCredentials: Map)
   .load()
{code} 
 Users are not allowed to specify query and dbtable at the same time. 
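
For reference, the same connector options also surface through Spark SQL DDL. A hedged sketch of how the proposed option could look there is below; the connection values are placeholders, and `query` is the new option proposed by this ticket (today only `dbtable` is accepted):
{code:sql}
-- url/user/password are placeholders; `query` is the proposed option, not yet supported.
CREATE TEMPORARY VIEW tmpJdbcView
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:postgresql://dbserver:5432/dbName",
  query "select * from tableName limit 10",
  user "dbUser",
  password "dbPassword"
);
{code}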

  was:
Currently, our JDBC connector provides the option `dbtable` for users to 
specify the to-be-loaded JDBC source table. 
{code} 
 val jdbcDf = spark.read
   .format("jdbc")
   .option("*dbtable*", "dbName.tableName")
   .options(jdbcCredentials: Map)
   .load()
{code} 
  
 Normally, users do not fetch the whole JDBC table due to the poor 
performance/throughput of JDBC. Thus, they normally just fetch a small set of 
tables. For advanced users, they can pass a subquery as the option. 
  
{code} 
 val query = """ (select * from tableName limit 10) as tmp """
 val jdbcDf = spark.read
   .format("jdbc")
   .option("*dbtable*", query)
   .options(jdbcCredentials: Map)
   .load()
{code} 
  
 However, this is straightforward to end users. We should simply allow users to 
specify the query by a new option `query`. We will handle the complexity for 
them. 
  
{code} 
 val query = """select * from tableName limit 10"""
 val jdbcDf = spark.read
   .format("jdbc")
   .option("*{color:#ff}query{color}*", query)
   .options(jdbcCredentials: Map)
   .load()
{code} 
  
 Users are not allowed to specify query and dbtable at the same time. 


> Add a new option `query` for JDBC sources
> -
>
> Key: SPARK-24423
> URL: https://issues.apache.org/jira/browse/SPARK-24423
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Currently, our JDBC connector provides the option `dbtable` for users to 
> specify the to-be-loaded JDBC source table. 
> {code} 
>  val jdbcDf = spark.read
>    .format("jdbc")
>    .option("*dbtable*", "dbName.tableName")
>    .options(jdbcCredentials: Map)
>    .load()
> {code} 
>  Normally, users do not fetch the whole JDBC table, because of the poor 
> performance/throughput of JDBC; they typically fetch only a small subset of the 
> table. Advanced users can do this by passing a subquery as the option:
> {code} 
>  val query = """ (select * from tableName limit 10) as tmp """
>  val jdbcDf = spark.read
>    .format("jdbc")
>    .option("*dbtable*", query)
>    .options(jdbcCredentials: Map)
>    .load()
> {code} 
>  However, this is not straightforward for end users. We should simply allow users 
> to specify the query via a new option `query` and handle the complexity for them. 
> {code} 
>  val query = """select * from tableName limit 10"""
>  val jdbcDf = spark.read
>    .format("jdbc")
>    .option("*{color:#ff}query{color}*", query)
>    .options(jdbcCredentials: Map)
>    .load()
> {code} 
>  Users are not allowed to specify query and dbtable at the same time. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24423) Add a new option `query` for JDBC sources

2018-05-29 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24423:

Description: 
Currently, our JDBC connector provides the option `dbtable` for users to 
specify the to-be-loaded JDBC source table. 
{code} 
 val jdbcDf = spark.read
   .format("jdbc")
   .option("*dbtable*", "dbName.tableName")
   .options(jdbcCredentials: Map)
   .load()
{code} 
  
 Normally, users do not fetch the whole JDBC table, because of the poor 
performance/throughput of JDBC; they typically fetch only a small subset of the 
table. Advanced users can do this by passing a subquery as the option:
  
{code} 
 val query = """ (select * from tableName limit 10) as tmp """
 val jdbcDf = spark.read
   .format("jdbc")
   .option("*dbtable*", query)
   .options(jdbcCredentials: Map)
   .load()
{code} 
  
 However, this is not straightforward for end users. We should simply allow users to 
specify the query via a new option `query` and handle the complexity for them. 
  
{code} 
 val query = """select * from tableName limit 10"""
 val jdbcDf = spark.read
   .format("jdbc")
   .option("*{color:#ff}query{color}*", query)
   .options(jdbcCredentials: Map)
   .load()
{code} 
  
 Users are not allowed to specify query and dbtable at the same time. 

  was:
Currently, our JDBC connector provides the option `dbtable` for users to 
specify the to-be-loaded JDBC source table. 
 
val jdbcDf = spark.read
  .format("jdbc")
  .option("*dbtable*", "dbName.tableName")
  .options(jdbcCredentials: Map)
  .load()
 
Normally, users do not fetch the whole JDBC table due to the poor 
performance/throughput of JDBC. Thus, they normally just fetch a small set of 
tables. For advanced users, they can pass a subquery as the option. 
 
val query = """ (select * from tableName limit 10) as tmp """
val jdbcDf = spark.read
  .format("jdbc")
  .option("*dbtable*", query)
  .options(jdbcCredentials: Map)
  .load()
 
However, this is straightforward to end users. We should simply allow users to 
specify the query by a new option `query`. We will handle the complexity for 
them. 
 
val query = """select * from tableName limit 10"""
val jdbcDf = spark.read
  .format("jdbc")
  .option("*{color:#ff}query{color}*", query)
  .options(jdbcCredentials: Map)
  .load()
 
Users are not allowed to specify query and dbtable at the same time. 


> Add a new option `query` for JDBC sources
> -
>
> Key: SPARK-24423
> URL: https://issues.apache.org/jira/browse/SPARK-24423
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Currently, our JDBC connector provides the option `dbtable` for users to 
> specify the to-be-loaded JDBC source table. 
> {code} 
>  val jdbcDf = spark.read
>    .format("jdbc")
>    .option("*dbtable*", "dbName.tableName")
>    .options(jdbcCredentials: Map)
>    .load()
> {code} 
>   
>  Normally, users do not fetch the whole JDBC table, because of the poor 
> performance/throughput of JDBC; they typically fetch only a small subset of the 
> table. Advanced users can do this by passing a subquery as the option:
>   
> {code} 
>  val query = """ (select * from tableName limit 10) as tmp """
>  val jdbcDf = spark.read
>    .format("jdbc")
>    .option("*dbtable*", query)
>    .options(jdbcCredentials: Map)
>    .load()
> {code} 
>   
>  However, this is not straightforward for end users. We should simply allow users 
> to specify the query via a new option `query` and handle the complexity for them. 
>   
> {code} 
>  val query = """select * from tableName limit 10"""
>  val jdbcDf = spark.read
>    .format("jdbc")
>    .option("*{color:#ff}query{color}*", query)
>    .options(jdbcCredentials: Map)
>    .load()
> {code} 
>   
>  Users are not allowed to specify query and dbtable at the same time. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24424) Support ANSI-SQL compliant syntax for ROLLUP, CUBE and GROUPING SET

2018-05-29 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494754#comment-16494754
 ] 

Xiao Li commented on SPARK-24424:
-

Also cc [~dkbiswal] 

> Support ANSI-SQL compliant syntax for ROLLUP, CUBE and GROUPING SET
> ---
>
> Key: SPARK-24424
> URL: https://issues.apache.org/jira/browse/SPARK-24424
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Currently, our Group By clause follows Hive 
> [https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup].
>  However, this does not comply with the ANSI SQL standard. The proposal is to update 
> our parser and analyzer for ANSI compliance. 
>  For example,
> {code:java}
> GROUP BY col1, col2 WITH ROLLUP
> GROUP BY col1, col2 WITH CUBE
> GROUP BY col1, col2 GROUPING SET ...
> {code}
> It would be nice to also support the ANSI SQL syntax:
> {code:java}
> GROUP BY ROLLUP(col1, col2)
> GROUP BY CUBE(col1, col2)
> GROUP BY GROUPING SET(...) 
> {code}
> Note, we only need to support one-level grouping set in this stage. That 
> means, nested grouping set is not supported.
> Note, we should not break the existing syntax. The parser changes should be 
> like
> {code:sql}
> group-by-expressions
> >>-GROUP BY+-hive-sql-group-by-expressions-+---><
>'-ansi-sql-grouping-set-expressions-'
> hive-sql-group-by-expressions
> '--GROUPING SETS--(--grouping-set-expressions--)--'
>.-,--.   +--WITH CUBE--+
>V|   +--WITH ROLLUP+
> >>---+-expression-+-+---+-+-><
> grouping-expressions-list
>.-,--.  
>V|  
> >>---+-expression-+-+--><
> grouping-set-expressions
> .-,.
> |  .-,--.  |
> |  V|  |
> V '-(--expression---+-)-'  |
> >>+-expression--+--+-><
> ansi-sql-grouping-set-expressions
> >>-+-ROLLUP--(--grouping-expression-list--)-+--><
>+-CUBE--(--grouping-expression-list--)---+   
>'-GROUPING SETS--(--grouping-set-expressions--)--'  
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24423) Add a new option `query` for JDBC sources

2018-05-29 Thread Dilip Biswal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494753#comment-16494753
 ] 

Dilip Biswal commented on SPARK-24423:
--

[~smilegator] Thanks Sean for pinging me. I would like to take a look at this 
one.

> Add a new option `query` for JDBC sources
> -
>
> Key: SPARK-24423
> URL: https://issues.apache.org/jira/browse/SPARK-24423
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Currently, our JDBC connector provides the option `dbtable` for users to 
> specify the to-be-loaded JDBC source table. 
>  
> val jdbcDf = spark.read
>   .format("jdbc")
>   .option("*dbtable*", "dbName.tableName")
>   .options(jdbcCredentials: Map)
>   .load()
>  
> Normally, users do not fetch the whole JDBC table, because of the poor 
> performance/throughput of JDBC; they typically fetch only a small subset of the 
> table. Advanced users can do this by passing a subquery as the option:
>  
> val query = """ (select * from tableName limit 10) as tmp """
> val jdbcDf = spark.read
>   .format("jdbc")
>   .option("*dbtable*", query)
>   .options(jdbcCredentials: Map)
>   .load()
>  
> However, this is not straightforward for end users. We should simply allow users 
> to specify the query via a new option `query` and handle the complexity for them. 
>  
> val query = """select * from tableName limit 10"""
> val jdbcDf = spark.read
>   .format("jdbc")
>   .option("*{color:#ff}query{color}*", query)
>   .options(jdbcCredentials: Map)
>   .load()
>  
> Users are not allowed to specify query and dbtable at the same time. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24424) Support ANSI-SQL compliant syntax for ROLLUP, CUBE and GROUPING SET

2018-05-29 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24424:

Description: 
Currently, our Group By clause follows Hive 
[https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup].
 However, this does not comply with the ANSI SQL standard. The proposal is to update 
our parser and analyzer for ANSI compliance. 
 For example,
{code:java}
GROUP BY col1, col2 WITH ROLLUP

GROUP BY col1, col2 WITH CUBE

GROUP BY col1, col2 GROUPING SET ...
{code}
It would be nice to also support the ANSI SQL syntax:
{code:java}
GROUP BY ROLLUP(col1, col2)

GROUP BY CUBE(col1, col2)

GROUP BY GROUPING SET(...) 
{code}
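As a concrete example of the intended equivalence (a hedged sketch assuming a table t(col1, col2, v); the ROLLUP call form is the proposed ANSI syntax), both statements below would be expected to produce the grouping sets ((col1, col2), (col1), ()):
{code:sql}
-- Existing Hive-style syntax, already accepted by the parser:
SELECT col1, col2, SUM(v) FROM t GROUP BY col1, col2 WITH ROLLUP;

-- Proposed ANSI-style equivalent:
SELECT col1, col2, SUM(v) FROM t GROUP BY ROLLUP(col1, col2);
{code}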
Note that we only need to support one-level grouping sets at this stage; nested grouping sets are not supported.

Note that we should not break the existing syntax. The parser changes should look like the following:
{code:sql}
group-by-expressions

>>-GROUP BY+-hive-sql-group-by-expressions-+---><
   '-ansi-sql-grouping-set-expressions-'

hive-sql-group-by-expressions

'--GROUPING SETS--(--grouping-set-expressions--)--'
   .-,--.   +--WITH CUBE--+
   V|   +--WITH ROLLUP+
>>---+-expression-+-+---+-+-><

grouping-expressions-list

   .-,--.  
   V|  
>>---+-expression-+-+--><


grouping-set-expressions

.-,.
|  .-,--.  |
|  V|  |
V '-(--expression---+-)-'  |
>>+-expression--+--+-><


ansi-sql-grouping-set-expressions

>>-+-ROLLUP--(--grouping-expression-list--)-+--><
   +-CUBE--(--grouping-expression-list--)---+   
   '-GROUPING SETS--(--grouping-set-expressions--)--'  
{code}
 

  was:
Currently, our Group By clause follows Hive 
[https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup]
 :
 However, this does not match ANSI SQL compliance. The proposal is to update 
our parser and analyzer for ANSI compliance. 
 For example,
{code:java}
GROUP BY col1, col2 WITH ROLLUP

GROUP BY col1, col2 WITH CUBE

GROUP BY col1, col2 GROUPING SET ...
{code}
It is nice to support ANSI SQL syntax at the same time.
{code:java}
GROUP BY ROLLUP(col1, col2)

GROUP BY CUBE(col1, col2)

GROUP BY GROUPING SET(...) 
{code}
Note, we only need to support one-level grouping set in this stage. That means, 
nested grouping set is not supported.

The parser changes should be like
{code:SQL}

group-by-expressions

>>-GROUP BY+-hive-sql-group-by-expressions-+---><
   '-ansi-sql-grouping-set-expressions-'

hive-sql-group-by-expressions

'--GROUPING SETS--(--grouping-set-expressions--)--'
   .-,--.   +--WITH CUBE--+
   V|   +--WITH ROLLUP+
>>---+-expression-+-+---+-+-><

grouping-expressions-list

   .-,--.  
   V|  
>>---+-expression-+-+--><


grouping-set-expressions

.-,.
|  .-,--.  |
|  V|  |
V '-(--expression---+-)-'  |
>>+-expression--+--+-><


ansi-sql-grouping-set-expressions

>>-+-ROLLUP--(--grouping-expression-list--)-+--><
   +-CUBE--(--grouping-expression-list--)---+   
   '-GROUPING SETS--(--grouping-set-expressions--)--'  
{code}
 


> Support ANSI-SQL compliant syntax for ROLLUP, CUBE and GROUPING SET
> ---
>
> Key: SPARK-24424
> URL: https://issues.apache.org/jira/browse/SPARK-24424
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Currently, our Group By clause follows Hive 
> [https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup]
>  :
>  However, this does not match ANSI SQL compliance. The proposal is to update 
> our parser and analyzer for ANSI compliance. 
>  For example,
> {code:java}
> GROUP BY col1, col2 WITH ROLLUP
> GROUP BY col1, col2 WITH CUBE
> GROUP BY col1, col2 GROUPING SET ...
> {code}
> It is nice to support ANSI SQL syntax at the same time.
> {code:java}
> GROUP BY ROLLUP(col1, col2)
> GROUP BY CUBE(col1, col2)
> GROUP BY GROUPING SET(...) 
> {code}
> Note, we only need to support one-level grouping set in this stage. That 
> means, nested grouping set is not supported.
> Note, we should not break the existing syntax. The parser changes should be 
> like
> {code:sql}
> group

[jira] [Updated] (SPARK-24424) Support ANSI-SQL compliant syntax for ROLLUP, CUBE and GROUPING SET

2018-05-29 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24424:

Description: 
Currently, our Group By clause follows Hive 
[https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup]
 :
 However, this does not match ANSI SQL compliance. The proposal is to update 
our parser and analyzer for ANSI compliance. 
 For example,
{code:java}
GROUP BY col1, col2 WITH ROLLUP

GROUP BY col1, col2 WITH CUBE

GROUP BY col1, col2 GROUPING SET ...
{code}
It is nice to support ANSI SQL syntax at the same time.
{code:java}
GROUP BY ROLLUP(col1, col2)

GROUP BY CUBE(col1, col2)

GROUP BY GROUPING SET(...) 
{code}
Note, we only need to support one-level grouping set in this stage. That means, 
nested grouping set is not supported.

The parser changes should be like
{code:SQL}

group-by-expressions

>>-GROUP BY+-hive-sql-group-by-expressions-+---><
   '-ansi-sql-grouping-set-expressions-'

hive-sql-group-by-expressions

'--GROUPING SETS--(--grouping-set-expressions--)--'
   .-,--.   +--WITH CUBE--+
   V|   +--WITH ROLLUP+
>>---+-expression-+-+---+-+-><

grouping-expressions-list

   .-,--.  
   V|  
>>---+-expression-+-+--><


grouping-set-expressions

.-,.
|  .-,--.  |
|  V|  |
V '-(--expression---+-)-'  |
>>+-expression--+--+-><


ansi-sql-grouping-set-expressions

>>-+-ROLLUP--(--grouping-expression-list--)-+--><
   +-CUBE--(--grouping-expression-list--)---+   
   '-GROUPING SETS--(--grouping-set-expressions--)--'  
{code}
 

  was:
Currently, our Group By clause follows Hive 
[https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup]
 :
 However, this does not match ANSI SQL compliance. The proposal is to update 
our parser and analyzer for ANSI compliance. 
 For example,
{code:java}
GROUP BY col1, col2 WITH ROLLUP

GROUP BY col1, col2 WITH CUBE

GROUP BY col1, col2 GROUPING SET ...
{code}
It is nice to support ANSI SQL syntax at the same time.
{code:java}
GROUP BY ROLLUP(col1, col2)

GROUP BY CUBE(col1, col2)

GROUP BY GROUPING SET(...) 
{code}
Note, we only need to support one-level grouping set in this stage. That means, 
nested grouping set is not supported.

The parser changes should be like
{code}
group-by-expressions

>>-GROUP BY+-hive-sql-group-by-expressions-+---><
   '-ansi-sql-grouping-set-expressions-'

hive-sql-group-by-expressions

'--GROUPING SETS--(--grouping-set-expressions--)--'
   .-,--.   +--WITH CUBE--+
   V|   +--WITH ROLLUP+
>>---+-expression-+-+---+-+-><

grouping-expressions-list

   .-,--.  
   V|  
>>---+-expression-+-+--><


grouping-set-expressions

.-,.
|  .-,--.  |
|  V|  |
V '-(--expression---+-)-'  |
>>+-expression--+--+-><


ansi-sql-grouping-set-expressions

>>-+-ROLLUP--(--grouping-expression-list--)-+--><
   +-CUBE--(--grouping-expression-list--)---+   
   '-GROUPING SETS--(--grouping-set-expressions--)--'  
{code}
 


> Support ANSI-SQL compliant syntax for ROLLUP, CUBE and GROUPING SET
> ---
>
> Key: SPARK-24424
> URL: https://issues.apache.org/jira/browse/SPARK-24424
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Currently, our Group By clause follows Hive 
> [https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup]
>  :
>  However, this does not match ANSI SQL compliance. The proposal is to update 
> our parser and analyzer for ANSI compliance. 
>  For example,
> {code:java}
> GROUP BY col1, col2 WITH ROLLUP
> GROUP BY col1, col2 WITH CUBE
> GROUP BY col1, col2 GROUPING SET ...
> {code}
> It is nice to support ANSI SQL syntax at the same time.
> {code:java}
> GROUP BY ROLLUP(col1, col2)
> GROUP BY CUBE(col1, col2)
> GROUP BY GROUPING SET(...) 
> {code}
> Note, we only need to support one-level grouping set in this stage. That 
> means, nested grouping set is not supported.
> The parser changes should be like
> {code:SQL}
> group-by-expressions
> >>-GROUP BY+-hive-sql-group-by-expressions-+---><
>'-ansi-s

[jira] [Updated] (SPARK-24424) Support ANSI-SQL compliant syntax for ROLLUP, CUBE and GROUPING SET

2018-05-29 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24424:

Description: 
Currently, our Group By clause follows Hive 
[https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup]
 :
 However, this does not match ANSI SQL compliance. The proposal is to update 
our parser and analyzer for ANSI compliance. 
 For example,
{code:java}
GROUP BY col1, col2 WITH ROLLUP

GROUP BY col1, col2 WITH CUBE

GROUP BY col1, col2 GROUPING SET ...
{code}
It is nice to support ANSI SQL syntax at the same time.
{code:java}
GROUP BY ROLLUP(col1, col2)

GROUP BY CUBE(col1, col2)

GROUP BY GROUPING SET(...) 
{code}
Note, we only need to support one-level grouping set in this stage. That means, 
nested grouping set is not supported.

The parser changes should be like
{code}
group-by-expressions

>>-GROUP BY+-hive-sql-group-by-expressions-+---><
   '-ansi-sql-grouping-set-expressions-'

hive-sql-group-by-expressions

'--GROUPING SETS--(--grouping-set-expressions--)--'
   .-,--.   +--WITH CUBE--+
   V|   +--WITH ROLLUP+
>>---+-expression-+-+---+-+-><

grouping-expressions-list

   .-,--.  
   V|  
>>---+-expression-+-+--><


grouping-set-expressions

.-,.
|  .-,--.  |
|  V|  |
V '-(--expression---+-)-'  |
>>+-expression--+--+-><


ansi-sql-grouping-set-expressions

>>-+-ROLLUP--(--grouping-expression-list--)-+--><
   +-CUBE--(--grouping-expression-list--)---+   
   '-GROUPING SETS--(--grouping-set-expressions--)--'  
{code}
 

  was:
Currently, our Group By clause follows Hive 
[https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup]
 :
 However, this does not match ANSI SQL compliance. The proposal is to update 
our parser and analyzer for ANSI compliance. 
 For example,
{code:java}
GROUP BY col1, col2 WITH ROLLUP

GROUP BY col1, col2 WITH CUBE

GROUP BY col1, col2 GROUPING SET ...
{code}
It is nice to support ANSI SQL syntax at the same time.
{code:java}
GROUP BY ROLLUP(col1, col2)

GROUP BY CUBE(col1, col2)

GROUP BY GROUPING SET(...) 
{code}
Note, we only need to support one-level grouping set in this stage. That means, 
nested grouping set is not supported.

The parser changes should be like

group-by-expressions

>>-GROUP BY+-hive-sql-group-by-expressions-+---><
 '-ansi-sql-grouping-set-expressions-'

hive-sql-group-by-expressions

'--GROUPING SETS--(--grouping-set-expressions--)--'
 .-,--. +--WITH CUBE--+
 V | +--WITH ROLLUP+
>>---+-expression-+-+---+-+-><

grouping-expressions-list

.-,--. 
 V | 
>>---+-expression-+-+--><


grouping-set-expressions

.-,.
 | .-,--. |
 | V | |
 V '-(--expression---+-)-' |
>>+-expression--+--+-><


ansi-sql-grouping-set-expressions

>>-+-ROLLUP--(--grouping-expression-list--)-+--><
 +-CUBE--(--grouping-expression-list--)---+ 
 '-GROUPING SETS--(--grouping-set-expressions--)--'


> Support ANSI-SQL compliant syntax for ROLLUP, CUBE and GROUPING SET
> ---
>
> Key: SPARK-24424
> URL: https://issues.apache.org/jira/browse/SPARK-24424
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Currently, our Group By clause follows Hive 
> [https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup]
>  :
>  However, this does not match ANSI SQL compliance. The proposal is to update 
> our parser and analyzer for ANSI compliance. 
>  For example,
> {code:java}
> GROUP BY col1, col2 WITH ROLLUP
> GROUP BY col1, col2 WITH CUBE
> GROUP BY col1, col2 GROUPING SET ...
> {code}
> It is nice to support ANSI SQL syntax at the same time.
> {code:java}
> GROUP BY ROLLUP(col1, col2)
> GROUP BY CUBE(col1, col2)
> GROUP BY GROUPING SET(...) 
> {code}
> Note, we only need to support one-level grouping set in this stage. That 
> means, nested grouping set is not supported.
> The parser changes should be like
> {code}
> group-by-expressions
> >>-GROUP BY+-hive-sql-group-by-expressions-+---><
>'-ansi-sql-grouping-set-expressions-'
> hive-sql-group-by-expressions
> '--GROUPING SETS--(--grouping-set-expressions--)--'
>.-,--. 

[jira] [Updated] (SPARK-24424) Support ANSI-SQL compliant syntax for ROLLUP, CUBE and GROUPING SET

2018-05-29 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24424:

Description: 
Currently, our Group By clause follows Hive 
[https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup]
 :
 However, this does not match ANSI SQL compliance. The proposal is to update 
our parser and analyzer for ANSI compliance. 
 For example,
{code:java}
GROUP BY col1, col2 WITH ROLLUP

GROUP BY col1, col2 WITH CUBE

GROUP BY col1, col2 GROUPING SET ...
{code}
It is nice to support ANSI SQL syntax at the same time.
{code:java}
GROUP BY ROLLUP(col1, col2)

GROUP BY CUBE(col1, col2)

GROUP BY GROUPING SET(...) 
{code}
Note, we only need to support one-level grouping set in this stage. That means, 
nested grouping set is not supported.

The parser changes should be like

group-by-expressions

>>-GROUP BY+-hive-sql-group-by-expressions-+---><
 '-ansi-sql-grouping-set-expressions-'

hive-sql-group-by-expressions

'--GROUPING SETS--(--grouping-set-expressions--)--'
 .-,--. +--WITH CUBE--+
 V | +--WITH ROLLUP+
>>---+-expression-+-+---+-+-><

grouping-expressions-list

.-,--. 
 V | 
>>---+-expression-+-+--><


grouping-set-expressions

.-,.
 | .-,--. |
 | V | |
 V '-(--expression---+-)-' |
>>+-expression--+--+-><


ansi-sql-grouping-set-expressions

>>-+-ROLLUP--(--grouping-expression-list--)-+--><
 +-CUBE--(--grouping-expression-list--)---+ 
 '-GROUPING SETS--(--grouping-set-expressions--)--'

  was:
Currently, our Group By clause follows Hive 
[https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup]
 :
 However, this does not match ANSI SQL compliance. The proposal is to update 
our parser and analyzer for ANSI compliance. 
 For example,
{code:java}
GROUP BY col1, col2 WITH ROLLUP

GROUP BY col1, col2 WITH CUBE

GROUP BY col1, col2 GROUPING SET ...
{code}
It is nice to support ANSI SQL syntax at the same time.
{code:java}
GROUP BY ROLLUP(col1, col2)

GROUP BY CUBE(col1, col2)

GROUP BY GROUPING SET(...) 
{code}
Note, we only need to support one-level grouping set in this stage. That means, 
nested grouping set is not supported.

The parser changes should be like

 
group-by-expressions
 
>>-GROUP BY+-hive-sql-group-by-expressions-+---><
               *{color:#ff}'-ansi-sql-grouping-set-expressions-'{color}*    
 
hive-sql-group-by-expressions
 
                        '--GROUPING SETS--(--grouping-set-expressions--)--'
   .-,--.   +--WITH CUBE--+
   V                |   +--WITH ROLLUP+
>>---+-expression-+-+---+-+-><
 
grouping-expressions-list
 
   .-,--.  
   V                |  
>>---+-expression-+-+--><
 
 
grouping-set-expressions
 
    .-,.
    |      .-,--.      |
    |      V                |      |
    V '-(--expression---+-)-'  |
>>+-expression--+--+-><
 
 
{color:#ff}ansi-sql-grouping-set-expressions{color}
{color:#ff} {color}
{color:#ff}>>-+-ROLLUP--(--grouping-expression-list--)-+--><{color}
{color:#ff}   +-CUBE--(--grouping-expression-list--)---+   {color}
{color:#ff}   '-GROUPING SETS--(--grouping-set-expressions--)--' {color}


> Support ANSI-SQL compliant syntax for ROLLUP, CUBE and GROUPING SET
> ---
>
> Key: SPARK-24424
> URL: https://issues.apache.org/jira/browse/SPARK-24424
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Currently, our Group By clause follows Hive 
> [https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup]
>  :
>  However, this does not match ANSI SQL compliance. The proposal is to update 
> our parser and analyzer for ANSI compliance. 
>  For example,
> {code:java}
> GROUP BY col1, col2 WITH ROLLUP
> GROUP BY col1, col2 WITH CUBE
> GROUP BY col1, col2 GROUPING SET ...
> {code}
> It is nice to support ANSI SQL syntax at the same time.
> {code:java}
> GROUP BY ROLLUP(col1, col2)
> GROUP BY CUBE(col1, col2)
> GROUP BY GROUPING SET(...) 
> {code}
> Note, we only need to support one-level grouping set in this stage. That 
> means, nested grouping set is not supported.
> The parser changes should be like
> group-by-expressions
> >>-GROUP BY+-hive-sql-group-by-expressions-+---><
>  '-ansi-sql-grouping-set-expressions-'
> hive-sql-group-by-expression

[jira] [Created] (SPARK-24424) Support ANSI-SQL compliant syntax for ROLLUP, CUBE and GROUPING SET

2018-05-29 Thread Xiao Li (JIRA)
Xiao Li created SPARK-24424:
---

 Summary: Support ANSI-SQL compliant syntax for ROLLUP, CUBE and 
GROUPING SET
 Key: SPARK-24424
 URL: https://issues.apache.org/jira/browse/SPARK-24424
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Xiao Li


Currently, our Group By clause follows Hive 
[https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup].
 However, this does not comply with the ANSI SQL standard. The proposal is to update 
our parser and analyzer for ANSI compliance. 
 For example,
{code:java}
GROUP BY col1, col2 WITH ROLLUP

GROUP BY col1, col2 WITH CUBE

GROUP BY col1, col2 GROUPING SET ...
{code}
It would be nice to also support the ANSI SQL syntax:
{code:java}
GROUP BY ROLLUP(col1, col2)

GROUP BY CUBE(col1, col2)

GROUP BY GROUPING SET(...) 
{code}
Note, we only need to support one-level grouping set in this stage. That means, 
nested grouping set is not supported.

The parser changes should be like

 
group-by-expressions
 
>>-GROUP BY+-hive-sql-group-by-expressions-+---><
               *{color:#ff}'-ansi-sql-grouping-set-expressions-'{color}*    
 
hive-sql-group-by-expressions
 
                        '--GROUPING SETS--(--grouping-set-expressions--)--'
   .-,--.   +--WITH CUBE--+
   V                |   +--WITH ROLLUP+
>>---+-expression-+-+---+-+-><
 
grouping-expressions-list
 
   .-,--.  
   V                |  
>>---+-expression-+-+--><
 
 
grouping-set-expressions
 
    .-,.
    |      .-,--.      |
    |      V                |      |
    V '-(--expression---+-)-'  |
>>+-expression--+--+-><
 
 
{color:#ff}ansi-sql-grouping-set-expressions{color}
{color:#ff} {color}
{color:#ff}>>-+-ROLLUP--(--grouping-expression-list--)-+--><{color}
{color:#ff}   +-CUBE--(--grouping-expression-list--)---+   {color}
{color:#ff}   '-GROUPING SETS--(--grouping-set-expressions--)--' {color}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24331) Add arrays_overlap / array_repeat / map_entries

2018-05-29 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-24331.
--
  Resolution: Fixed
   Fix Version/s: 2.4.0
Target Version/s: 2.4.0

> Add arrays_overlap / array_repeat / map_entries  
> -
>
> Key: SPARK-24331
> URL: https://issues.apache.org/jira/browse/SPARK-24331
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Marek Novotny
>Assignee: Marek Novotny
>Priority: Major
> Fix For: 2.4.0
>
>
> Add SparkR equivalents of the following Spark SQL functions (a brief SQL-side sketch follows the list):
>  * arrays_overlap - SPARK-23922
>  * array_repeat - SPARK-23925
>  * map_entries - SPARK-23935
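> For reference, a hedged SQL-side sketch of what these functions do (the SparkR wrappers added here mirror the same-named Spark SQL built-ins; exact output formatting may differ):
> {code:sql}
> SELECT arrays_overlap(array(1, 2, 3), array(3, 4));  -- true: the arrays share a non-null element
> SELECT array_repeat('ab', 3);                        -- ["ab","ab","ab"]
> SELECT map_entries(map(1, 'a', 2, 'b'));             -- an array of (key, value) structs
> {code}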



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24331) Add arrays_overlap / array_repeat / map_entries

2018-05-29 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung reassigned SPARK-24331:


Assignee: Marek Novotny

> Add arrays_overlap / array_repeat / map_entries  
> -
>
> Key: SPARK-24331
> URL: https://issues.apache.org/jira/browse/SPARK-24331
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Marek Novotny
>Assignee: Marek Novotny
>Priority: Major
> Fix For: 2.4.0
>
>
> Add SparkR equivalent to:
>  * arrays_overlap - SPARK-23922
>  * array_repeat - SPARK-23925
>  * map_entries - SPARK-23935



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24423) Add a new option `query` for JDBC sources

2018-05-29 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494739#comment-16494739
 ] 

Xiao Li commented on SPARK-24423:
-

cc [~dkbiswal] Is your team interested in this task?

> Add a new option `query` for JDBC sources
> -
>
> Key: SPARK-24423
> URL: https://issues.apache.org/jira/browse/SPARK-24423
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Currently, our JDBC connector provides the option `dbtable` for users to 
> specify the to-be-loaded JDBC source table. 
>  
> val jdbcDf = spark.read
>   .format("jdbc")
>   .option("*dbtable*", "dbName.tableName")
>   .options(jdbcCredentials: Map)
>   .load()
>  
> Normally, users do not fetch the whole JDBC table, because of the poor 
> performance/throughput of JDBC; they typically fetch only a small subset of the 
> table. Advanced users can do this by passing a subquery as the option:
>  
> val query = """ (select * from tableName limit 10) as tmp """
> val jdbcDf = spark.read
>   .format("jdbc")
>   .option("*dbtable*", query)
>   .options(jdbcCredentials: Map)
>   .load()
>  
> However, this is not straightforward for end users. We should simply allow users 
> to specify the query via a new option `query` and handle the complexity for them. 
>  
> val query = """select * from tableName limit 10"""
> val jdbcDf = spark.read
>   .format("jdbc")
>   .option("*{color:#ff}query{color}*", query)
>   .options(jdbcCredentials: Map)
>   .load()
>  
> Users are not allowed to specify query and dbtable at the same time. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24423) Add a new option `query` for JDBC sources

2018-05-29 Thread Xiao Li (JIRA)
Xiao Li created SPARK-24423:
---

 Summary: Add a new option `query` for JDBC sources
 Key: SPARK-24423
 URL: https://issues.apache.org/jira/browse/SPARK-24423
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Xiao Li


Currently, our JDBC connector provides the option `dbtable` for users to 
specify the to-be-loaded JDBC source table. 
 
val jdbcDf = spark.read
  .format("jdbc")
  .option("*dbtable*", "dbName.tableName")
  .options(jdbcCredentials: Map)
  .load()
 
Normally, users do not fetch the whole JDBC table, because of the poor 
performance/throughput of JDBC; they typically fetch only a small subset of the 
table. Advanced users can do this by passing a subquery as the option:
 
val query = """ (select * from tableName limit 10) as tmp """
val jdbcDf = spark.read
  .format("jdbc")
  .option("*dbtable*", query)
  .options(jdbcCredentials: Map)
  .load()
 
However, this is not straightforward for end users. We should simply allow users to 
specify the query via a new option `query` and handle the complexity for them. 
 
val query = """select * from tableName limit 10"""
val jdbcDf = spark.read
  .format("jdbc")
  .option("*{color:#ff}query{color}*", query)
  .options(jdbcCredentials: Map)
  .load()
 
Users are not allowed to specify query and dbtable at the same time. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18165) Kinesis support in Structured Streaming

2018-05-29 Thread Vikram Agrawal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494732#comment-16494732
 ] 

Vikram Agrawal commented on SPARK-18165:


[~mail2sivan...@gmail.com] - This library has been developed and tested against 
Spark 2.2.x. I understand that you are trying it against Spark 2.3.0. 

Could you please raise an issue in the kinesis-sql repo 
(https://github.com/qubole/kinesis-sql) so we can continue the discussion there?

> Kinesis support in Structured Streaming
> ---
>
> Key: SPARK-18165
> URL: https://issues.apache.org/jira/browse/SPARK-18165
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Lauren Moos
>Priority: Major
>
> Implement Kinesis based sources and sinks for Structured Streaming



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24365) Add data source write benchmark

2018-05-29 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24365.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21409
[https://github.com/apache/spark/pull/21409]

> Add data source write benchmark
> ---
>
> Key: SPARK-24365
> URL: https://issues.apache.org/jira/browse/SPARK-24365
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> Add data source write benchmark. So that it would be easier to measure the 
> writer performance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24365) Add data source write benchmark

2018-05-29 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-24365:
---

Assignee: Gengliang Wang

> Add data source write benchmark
> ---
>
> Key: SPARK-24365
> URL: https://issues.apache.org/jira/browse/SPARK-24365
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Add data source write benchmark. So that it would be easier to measure the 
> writer performance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24403) reuse r worker

2018-05-29 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494707#comment-16494707
 ] 

Felix Cheung edited comment on SPARK-24403 at 5/30/18 5:38 AM:
---

Worker reuse (the daemon process) is actually supported and is the default for SparkR.

The specific use case you have linked to, R UDF, is a fairly different code 
path, and that might be a different issue altogether.

Please refer back to the original issue  - don't open a new JIRA. Thanks. 


was (Author: felixcheung):
Reuse worker (daemon process) is actually supported and the default for SparkR.

The specific use case you have linked to R UDF might be a different issue all 
together.

Please refer back to the original issue  - don't open a new JIRA. Thanks. 

> reuse r worker
> --
>
> Key: SPARK-24403
> URL: https://issues.apache.org/jira/browse/SPARK-24403
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Deepansh
>Priority: Major
>  Labels: sparkR
>
> Currently, SparkR doesn't support reuse of its workers, so broadcasts and 
> closures are transferred to the workers each time. Can we add the idea of Python 
> worker reuse to SparkR as well, to enhance its performance?
> performance issues reference 
> [https://issues.apache.org/jira/browse/SPARK-23650]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23650) Slow SparkR udf (dapply)

2018-05-29 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494708#comment-16494708
 ] 

Felix Cheung commented on SPARK-23650:
--

sorry, I really don't have time/resource to investigate this for now.

hopefully later...

> Slow SparkR udf (dapply)
> 
>
> Key: SPARK-23650
> URL: https://issues.apache.org/jira/browse/SPARK-23650
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, SparkR, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Deepansh
>Priority: Major
> Attachments: packageReload.txt, read_model_in_udf.txt, 
> sparkR_log2.txt, sparkRlag.txt
>
>
> For example, I am getting streams from Kafka and I want to apply a model built 
> in R to those streams. For this, I am using dapply.
> My code is:
> iris_model <- readRDS("./iris_model.rds")
> randomBr <- SparkR:::broadcast(sc, iris_model)
> kafka <- read.stream("kafka", subscribe = "source",
>                      kafka.bootstrap.servers = "localhost:9092", topic = "source")
> lines <- select(kafka, cast(kafka$value, "string"))
> schema <- schema(lines)
> df1 <- dapply(lines, function(x) {
>   i_model <- SparkR:::value(randomMatBr)
>   for (row in 1:nrow(x)) {
>     y <- fromJSON(as.character(x[row, "value"]))
>     y$predict <- predict(i_model, y)
>     y <- toJSON(y)
>     x[row, "value"] <- y
>   }
>   x
> }, schema)
> Every time Kafka streams are fetched, the dapply method creates a new runner 
> thread and ships the variables again, which causes a huge lag (~2s for shipping 
> the model) each time. I even tried without broadcast variables, but it takes the 
> same time to ship the variables. Can some other techniques be applied to improve 
> its performance?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24420) Upgrade ASM to 6.x to support JDK9+

2018-05-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24420:


Assignee: DB Tsai  (was: Apache Spark)

> Upgrade ASM to 6.x to support JDK9+
> ---
>
> Key: SPARK-24420
> URL: https://issues.apache.org/jira/browse/SPARK-24420
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24403) reuse r worker

2018-05-29 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494707#comment-16494707
 ] 

Felix Cheung commented on SPARK-24403:
--

Worker reuse (the daemon process) is actually supported and is the default for SparkR.

The specific use case you have linked to, R UDF, might be a different issue 
altogether.

Please refer back to the original issue  - don't open a new JIRA. Thanks. 

> reuse r worker
> --
>
> Key: SPARK-24403
> URL: https://issues.apache.org/jira/browse/SPARK-24403
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Deepansh
>Priority: Major
>  Labels: sparkR
>
> Currently, SparkR doesn't support reuse of its workers, so broadcasts and 
> closures are transferred to the workers each time. Can we add the idea of Python 
> worker reuse to SparkR as well, to enhance its performance?
> performance issues reference 
> [https://issues.apache.org/jira/browse/SPARK-23650]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24420) Upgrade ASM to 6.x to support JDK9+

2018-05-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24420:


Assignee: Apache Spark  (was: DB Tsai)

> Upgrade ASM to 6.x to support JDK9+
> ---
>
> Key: SPARK-24420
> URL: https://issues.apache.org/jira/browse/SPARK-24420
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24420) Upgrade ASM to 6.x to support JDK9+

2018-05-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494706#comment-16494706
 ] 

Apache Spark commented on SPARK-24420:
--

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/21459

> Upgrade ASM to 6.x to support JDK9+
> ---
>
> Key: SPARK-24420
> URL: https://issues.apache.org/jira/browse/SPARK-24420
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24419) Upgrade SBT to 0.13.17 with Scala 2.10.7

2018-05-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24419:


Assignee: DB Tsai  (was: Apache Spark)

> Upgrade SBT to 0.13.17 with Scala 2.10.7
> 
>
> Key: SPARK-24419
> URL: https://issues.apache.org/jira/browse/SPARK-24419
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24419) Upgrade SBT to 0.13.17 with Scala 2.10.7

2018-05-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24419:


Assignee: Apache Spark  (was: DB Tsai)

> Upgrade SBT to 0.13.17 with Scala 2.10.7
> 
>
> Key: SPARK-24419
> URL: https://issues.apache.org/jira/browse/SPARK-24419
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: Apache Spark
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24419) Upgrade SBT to 0.13.17 with Scala 2.10.7

2018-05-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494705#comment-16494705
 ] 

Apache Spark commented on SPARK-24419:
--

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/21458

> Upgrade SBT to 0.13.17 with Scala 2.10.7
> 
>
> Key: SPARK-24419
> URL: https://issues.apache.org/jira/browse/SPARK-24419
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23904) Big execution plan cause OOM

2018-05-29 Thread Izek Greenfield (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494701#comment-16494701
 ] 

Izek Greenfield commented on SPARK-23904:
-

[~RBerenguel] Regarding `setting completeString to no-op`, what do you mean by that? After I 
comment out the generation of that string in the code, I don't get the OOM.

> Big execution plan cause OOM
> 
>
> Key: SPARK-23904
> URL: https://issues.apache.org/jira/browse/SPARK-23904
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Izek Greenfield
>Priority: Major
>  Labels: SQL, query
>
> I create a question in 
> [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big]
>  
> Spark creates the text representation of the query in any case, even if I don't 
> need it.
> That creates many garbage objects and unneeded GC... 
>  [Gist with code to 
> reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24120) Show `Jobs` page when `jobId` is missing

2018-05-29 Thread Jongyoul Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jongyoul Lee updated SPARK-24120:
-
Fix Version/s: 0.8.1
   0.9.0

> Show `Jobs` page when `jobId` is missing
> 
>
> Key: SPARK-24120
> URL: https://issues.apache.org/jira/browse/SPARK-24120
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Jongyoul Lee
>Priority: Minor
>
> Currently, when users try to open the {{job}} page without a {{jobid}}, the Spark UI 
> shows only an error page. That is not incorrect, but it is unhelpful to users. It 
> would be better to redirect to the `jobs` page so they can select the proper job. 
> This actually happens in YARN mode: because of a YARN bug (YARN-6615), some 
> parameters aren't passed to Spark's driver UI, even with the latest version of 
> YARN. It's also mentioned in SPARK-20772.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24120) Show `Jobs` page when `jobId` is missing

2018-05-29 Thread Jongyoul Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jongyoul Lee updated SPARK-24120:
-
Fix Version/s: (was: 0.9.0)
   (was: 0.8.1)

> Show `Jobs` page when `jobId` is missing
> 
>
> Key: SPARK-24120
> URL: https://issues.apache.org/jira/browse/SPARK-24120
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Jongyoul Lee
>Priority: Minor
>
> Currently, when users try to open the {{job}} page without a {{jobid}}, the Spark UI 
> shows only an error page. That is not incorrect, but it is unhelpful to users. It 
> would be better to redirect to the `jobs` page so they can select the proper job. 
> This actually happens in YARN mode: because of a YARN bug (YARN-6615), some 
> parameters aren't passed to Spark's driver UI, even with the latest version of 
> YARN. It's also mentioned in SPARK-20772.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24409) exception when sending large list in filter(col(x).isin(list))

2018-05-29 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494647#comment-16494647
 ] 

Liang-Chi Hsieh commented on SPARK-24409:
-

It seems you are using the AWS Glue Data Catalog as the metastore for Hive, and the 
overly long partition-filtering expression causes this exception.

> exception when sending large list in filter(col(x).isin(list))
> --
>
> Key: SPARK-24409
> URL: https://issues.apache.org/jira/browse/SPARK-24409
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.1
>Reporter: Janet Levin
>Priority: Major
>
> This is the error we get:
>  
> /mnt/yarn/usercache/hadoop/appcache/application_1526466002571_8701/container_1526466002571_8701_01_01/pyspark.zip/pyspark/sql/dataframe.py",
>  line 88, in rdd
>  File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1526466002571_8701/container_1526466002571_8701_01_01/py4j-0.10.6-src.zip/py4j/java_gateway.py",
>  line 1160, in __call__
>  File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1526466002571_8701/container_1526466002571_8701_01_01/pyspark.zip/pyspark/sql/utils.py",
>  line 63, in deco
>  File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1526466002571_8701/container_1526466002571_8701_01_01/py4j-0.10.6-src.zip/py4j/protocol.py",
>  line 320, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o605.javaToPython.
> : java.lang.RuntimeException: Caught Hive MetaException attempting to get 
> partition metadata by filter from Hive. You can set the Spark configuration 
> setting spark.sql.hive.manageFilesourcePartitions to false to work around 
> this problem, however this will result in degraded performance. Please report 
> a bug: https://issues.apache.org/jira/browse/SPARK
>  at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:741)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:655)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:653)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:272)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:210)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:209)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:255)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:653)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1218)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1211)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1211)
>  at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:925)
>  at 
> org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:61)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:27)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.sc

[jira] [Commented] (SPARK-8659) Spark SQL Thrift Server does NOT honour hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory

2018-05-29 Thread L (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-8659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494620#comment-16494620
 ] 

L commented on SPARK-8659:
--

I am hitting this problem too. I chose Sentry for authorization on the Spark 
Thrift Server, but it does not work. Is there any solution for this?

> Spark SQL Thrift Server does NOT honour 
> hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory
>  
> ---
>
> Key: SPARK-8659
> URL: https://issues.apache.org/jira/browse/SPARK-8659
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
> Environment: Linux
>Reporter: Premchandra Preetham Kukillaya
>Priority: Major
>
> It seems that when pointing a JDBC/ODBC driver at the Spark SQL Thrift Server, 
> Hive's SQL standard based authorization feature is not working. It ignores the 
> security settings passed through the command line. The command-line arguments 
> are given below for reference.
> The problem is that user X can run SELECT on a table belonging to user Y, even 
> though permissions for the table are explicitly defined, which is a data 
> security risk.
> I am using Hive 0.13.1 and Spark 1.3.1; here is the list of arguments passed 
> to the Spark SQL Thrift Server.
> ./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --hiveconf 
> hostname.compute.amazonaws.com --hiveconf 
> hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator
>  --hiveconf 
> hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory
>  --hiveconf hive.server2.enable.doAs=false --hiveconf 
> hive.security.authorization.enabled=true --hiveconf 
> javax.jdo.option.ConnectionURL=jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true
>  --hiveconf javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver 
> --hiveconf javax.jdo.option.ConnectionUserName=hive --hiveconf 
> javax.jdo.option.ConnectionPassword=hive



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24409) exception when sending large list in filter(col(x).isin(list))

2018-05-29 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494590#comment-16494590
 ] 

Hyukjin Kwon commented on SPARK-24409:
--

Mind sharing a reproducer if you already have one?

> exception when sending large list in filter(col(x).isin(list))
> --
>
> Key: SPARK-24409
> URL: https://issues.apache.org/jira/browse/SPARK-24409
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.1
>Reporter: Janet Levin
>Priority: Major
>
> This is the error we get:
>  
> /mnt/yarn/usercache/hadoop/appcache/application_1526466002571_8701/container_1526466002571_8701_01_01/pyspark.zip/pyspark/sql/dataframe.py",
>  line 88, in rdd
>  File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1526466002571_8701/container_1526466002571_8701_01_01/py4j-0.10.6-src.zip/py4j/java_gateway.py",
>  line 1160, in __call__
>  File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1526466002571_8701/container_1526466002571_8701_01_01/pyspark.zip/pyspark/sql/utils.py",
>  line 63, in deco
>  File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1526466002571_8701/container_1526466002571_8701_01_01/py4j-0.10.6-src.zip/py4j/protocol.py",
>  line 320, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o605.javaToPython.
> : java.lang.RuntimeException: Caught Hive MetaException attempting to get 
> partition metadata by filter from Hive. You can set the Spark configuration 
> setting spark.sql.hive.manageFilesourcePartitions to false to work around 
> this problem, however this will result in degraded performance. Please report 
> a bug: https://issues.apache.org/jira/browse/SPARK
>  at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:741)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:655)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:653)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:272)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:210)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:209)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:255)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:653)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1218)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1211)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1211)
>  at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:925)
>  at 
> org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:61)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:27)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:26)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$

[jira] [Resolved] (SPARK-24376) compiling spark with scala-2.10 should use the -P parameter instead of -D

2018-05-29 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24376.
--
Resolution: Won't Fix

> compiling spark with scala-2.10 should use the -P parameter instead of -D
> -
>
> Key: SPARK-24376
> URL: https://issues.apache.org/jira/browse/SPARK-24376
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.2.0
>Reporter: Yu Wang
>Priority: Minor
> Attachments: SPARK-24376_1.patch
>
>
> [INFO] Scanning for projects...
> [ERROR] [ERROR] Some problems were encountered while processing the POMs:
> [ERROR] 'dependencyManagement.dependencies.dependency.groupId' for 
> ${jline.groupid}:jline:jar with value '${jline.groupid}' does not match a 
> valid id pattern. @ line 1875, column 18
>  @ 
> [ERROR] The build could not read 1 project -> [Help 1]
> [ERROR] 
> [ERROR] The project org.apache.spark:spark-parent_2.10:2.2.0 
> (/data4/wangyu/bch/Spark/pom.xml) has 1 error
> [ERROR] 'dependencyManagement.dependencies.dependency.groupId' for 
> ${jline.groupid}:jline:jar with value '${jline.groupid}' does not match a 
> valid id pattern. @ line 1875, column 18
> [ERROR] 
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
> switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24417) Build and Run Spark on JDK9+

2018-05-29 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-24417:

Description: 
This is an umbrella JIRA for Apache Spark to support JDK9+

As Java 8 is going away soon, JDK11 will be LTS and GA this September, and 
companies are testing JDK9 or JDK10 to prepare for JDK11. It's best to start the 
traumatic process of supporting newer versions of the JDK in Apache Spark as a 
background activity. 

The subtasks are what have to be done to support JDK9+.

  was:
This is an umbrella JIRA for Apache Spark to support JDK9+

As JDK11 will be LTS and GA this Sep, and companies are testing JDK9 or JDK10 
to prepare for JDK11, we should start the process of supporting JDK9+ in Apache 
Spark.


> Build and Run Spark on JDK9+
> 
>
> Key: SPARK-24417
> URL: https://issues.apache.org/jira/browse/SPARK-24417
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Priority: Major
> Fix For: 2.4.0
>
>
> This is an umbrella JIRA for Apache Spark to support JDK9+
> As Java 8 is going away soon, JDK11 will be LTS and GA this September, and 
> companies are testing JDK9 or JDK10 to prepare for JDK11. It's best to start 
> the traumatic process of supporting newer versions of the JDK in Apache Spark 
> as a background activity. 
> The subtasks are what have to be done to support JDK9+.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24422) Add JDK9+ in our Jenkins' build servers

2018-05-29 Thread DB Tsai (JIRA)
DB Tsai created SPARK-24422:
---

 Summary: Add JDK9+ in our Jenkins' build servers
 Key: SPARK-24422
 URL: https://issues.apache.org/jira/browse/SPARK-24422
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 2.3.0
Reporter: DB Tsai






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24419) Upgrade SBT to 0.13.17 with Scala 2.10.7

2018-05-29 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-24419:
---

Assignee: DB Tsai

> Upgrade SBT to 0.13.17 with Scala 2.10.7
> 
>
> Key: SPARK-24419
> URL: https://issues.apache.org/jira/browse/SPARK-24419
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24420) Upgrade ASM to 6.x to support JDK9+

2018-05-29 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-24420:
---

Assignee: DB Tsai

> Upgrade ASM to 6.x to support JDK9+
> ---
>
> Key: SPARK-24420
> URL: https://issues.apache.org/jira/browse/SPARK-24420
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24418) Upgrade to Scala 2.11.12

2018-05-29 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-24418:
---

Assignee: DB Tsai

> Upgrade to Scala 2.11.12
> 
>
> Key: SPARK-24418
> URL: https://issues.apache.org/jira/browse/SPARK-24418
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 2.4.0
>
>
> Scala 2.11.12+ will support JDK9+. However, this is not going to be a simple 
> version bump. 
> *loadFiles()* in *ILoop* was removed in Scala 2.11.12. We use it as a hack to 
> initialize Spark before the REPL sees any files.
> An issue has been filed with the Scala community:
> https://github.com/scala/bug/issues/10913



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24421) sun.misc.Unsafe in JDK9+

2018-05-29 Thread DB Tsai (JIRA)
DB Tsai created SPARK-24421:
---

 Summary: sun.misc.Unsafe in JDK9+
 Key: SPARK-24421
 URL: https://issues.apache.org/jira/browse/SPARK-24421
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Affects Versions: 2.3.0
Reporter: DB Tsai


Many internal APIs such as Unsafe are encapsulated in JDK9+; see 
http://openjdk.java.net/jeps/260 for details.

To use Unsafe, we need to add *jdk.unsupported* to our code’s module 
declaration:

{code:java}
module java9unsafe {
    requires jdk.unsupported;
}
{code}
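
As a hedged follow-up sketch (not from the original issue): one common way to 
obtain the {{Unsafe}} instance is via reflection on its private {{theUnsafe}} 
field, since {{Unsafe.getUnsafe()}} throws for classes not loaded by the 
bootstrap class loader:

{code}
// Sketch only (Scala): grab the Unsafe singleton reflectively instead of
// calling Unsafe.getUnsafe(), which rejects ordinary application classes.
import sun.misc.Unsafe

val field = classOf[Unsafe].getDeclaredField("theUnsafe")
field.setAccessible(true)
val unsafe = field.get(null).asInstanceOf[Unsafe]

// Example use: the platform's address size in bytes (4 or 8).
println(unsafe.addressSize())
{code}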




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23904) Big execution plan cause OOM

2018-05-29 Thread Ruben Berenguel (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494513#comment-16494513
 ] 

Ruben Berenguel commented on SPARK-23904:
-

[~igreenfi] After a few more tries at reproducing, I'm not getting an OOM due to 
the query plan string being too large, but just a tree blow-up in the Catalyst 
analysis. Are you able to reproduce the plan-string OOM with your linked code 
sample on 2.3, and does setting completeString to a no-op make the OOM go away?

> Big execution plan cause OOM
> 
>
> Key: SPARK-23904
> URL: https://issues.apache.org/jira/browse/SPARK-23904
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Izek Greenfield
>Priority: Major
>  Labels: SQL, query
>
> I create a question in 
> [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big]
>  
> Spark creates the text representation of the query in every case, even if I 
> don't need it.
> That creates many garbage objects and causes unneeded GC. 
>  [Gist with code to 
> reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24420) Upgrade ASM to 6.x to support JDK9+

2018-05-29 Thread DB Tsai (JIRA)
DB Tsai created SPARK-24420:
---

 Summary: Upgrade ASM to 6.x to support JDK9+
 Key: SPARK-24420
 URL: https://issues.apache.org/jira/browse/SPARK-24420
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Affects Versions: 2.3.0
Reporter: DB Tsai






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23904) Big execution plan cause OOM

2018-05-29 Thread Ruben Berenguel (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494462#comment-16494462
 ] 

Ruben Berenguel commented on SPARK-23904:
-

Finally managed to reproduce it (takes a long while, even after reducing driver 
and executor memory). Working on it!

> Big execution plan cause OOM
> 
>
> Key: SPARK-23904
> URL: https://issues.apache.org/jira/browse/SPARK-23904
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Izek Greenfield
>Priority: Major
>  Labels: SQL, query
>
> I create a question in 
> [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big]
>  
> Spark creates the text representation of the query in every case, even if I 
> don't need it.
> That creates many garbage objects and causes unneeded GC. 
>  [Gist with code to 
> reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24419) Upgrade SBT to 0.13.17 with Scala 2.10.7

2018-05-29 Thread DB Tsai (JIRA)
DB Tsai created SPARK-24419:
---

 Summary: Upgrade SBT to 0.13.17 with Scala 2.10.7
 Key: SPARK-24419
 URL: https://issues.apache.org/jira/browse/SPARK-24419
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Affects Versions: 2.3.0
Reporter: DB Tsai
 Fix For: 2.4.0






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24418) Upgrade to Scala 2.11.12

2018-05-29 Thread DB Tsai (JIRA)
DB Tsai created SPARK-24418:
---

 Summary: Upgrade to Scala 2.11.12
 Key: SPARK-24418
 URL: https://issues.apache.org/jira/browse/SPARK-24418
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Affects Versions: 2.3.0
Reporter: DB Tsai
 Fix For: 2.4.0


Scala 2.11.12+ will support JDK9+. However, this is not going to be a simple 
version bump. 

*loadFiles()* in *ILoop* was removed in Scala 2.11.12. We use it as a hack to 
initialize Spark before the REPL sees any files.

An issue has been filed with the Scala community:
https://github.com/scala/bug/issues/10913
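
Purely for illustration (not part of the original issue), the hook being 
referred to looks roughly like the sketch below; the class name and the body of 
initializeSpark() are placeholders, and the exact ILoop API may differ across 
2.11.x releases:

{code}
// Rough sketch of the hack described above: override ILoop.loadFiles() so that
// Spark is initialized before the REPL processes any user files. loadFiles()
// no longer exists in Scala 2.11.12, which is why the bump is not trivial.
import scala.tools.nsc.Settings
import scala.tools.nsc.interpreter.ILoop

class SparkILoopSketch extends ILoop {
  private def initializeSpark(): Unit = {
    // placeholder: create the SparkSession/SparkContext and bind them into the REPL
  }

  override def loadFiles(settings: Settings): Unit = {
    initializeSpark()
    super.loadFiles(settings)
  }
}
{code}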



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24417) Build and Run Spark on JDK9+

2018-05-29 Thread DB Tsai (JIRA)
DB Tsai created SPARK-24417:
---

 Summary: Build and Run Spark on JDK9+
 Key: SPARK-24417
 URL: https://issues.apache.org/jira/browse/SPARK-24417
 Project: Spark
  Issue Type: New Feature
  Components: Build
Affects Versions: 2.3.0
Reporter: DB Tsai
 Fix For: 2.4.0


This is an umbrella JIRA for Apache Spark to support JDK9+

As JDK11 will be LTS and GA this Sep, and companies are testing JDK9 or JDK10 
to prepare for JDK11, we should start the process of supporting JDK9+ in Apache 
Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18165) Kinesis support in Structured Streaming

2018-05-29 Thread sivanesh selvanataraj (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494403#comment-16494403
 ] 

sivanesh selvanataraj commented on SPARK-18165:
---

[~itsvikramagr] I got this error when I execute:

{code:java}
kinesis
  .selectExpr("CAST(data AS STRING)").as[(String)]
  .groupBy("data").count()
  .writeStream
  .format("console")
  .outputMode("complete")
  .start()
  .awaitTermination()
{code}


ERROR MicroBatchExecution:91 - Query [id = 
52e761d3-02f7-4352-9c8a-d1f59d7938bb, runId = 
f82e0f00-9c88-4d52-ae26-9ff54d8267c3] terminated with error
java.lang.AbstractMethodError
 at 
org.apache.spark.internal.Logging$class.initializeLogIfNecessary(Logging.scala:99)
 at 
org.apache.spark.sql.kinesis.KinesisSource.initializeLogIfNecessary(KinesisSource.scala:51)
 at org.apache.spark.internal.Logging$class.log(Logging.scala:46)
 at org.apache.spark.sql.kinesis.KinesisSource.log(KinesisSource.scala:51)
 at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
 at org.apache.spark.sql.kinesis.KinesisSource.logInfo(KinesisSource.scala:51)
 at 
org.apache.spark.sql.kinesis.KinesisSource.prevBatchShardInfo(KinesisSource.scala:211)
 at 
org.apache.spark.sql.kinesis.KinesisSource.getOffset(KinesisSource.scala:105)
 at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$5$$anonfun$apply$5.apply(MicroBatchExecution.scala:268)
 at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$5$$anonfun$apply$5.apply(MicroBatchExecution.scala:268)
 at 
org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
 at 
org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
 at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$5.apply(MicroBatchExecution.scala:267)
 at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$5.apply(MicroBatchExecution.scala:264)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
 at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
 at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
 at scala.collection.AbstractTraversable.map(Traversable.scala:104)
 at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch(MicroBatchExecution.scala:264)
 at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$populateStartOffsets(MicroBatchExecution.scala:237)
 at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:124)
 at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
 at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
 at 
org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
 at 
org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
 at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121)
 at 
org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
 at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117)
 at 
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
 at 
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)

> Kinesis support in Structured Streaming
> ---
>
> Key: SPARK-18165
> URL: https://issues.apache.org/jira/browse/SPARK-18165
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Lauren Moos
>Priority: Major
>
> Implement Kinesis based sources and sinks for Structured Streaming



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24410) Missing optimization for Union on bucketed tables

2018-05-29 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494391#comment-16494391
 ] 

Liang-Chi Hsieh commented on SPARK-24410:
-

[~cloud_fan] Thanks for pinging me. I'll look into this.

> Missing optimization for Union on bucketed tables
> -
>
> Key: SPARK-24410
> URL: https://issues.apache.org/jira/browse/SPARK-24410
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ohad Raviv
>Priority: Major
>
> A common use case we have is a partially aggregated table plus daily 
> increments that we need to further aggregate. We do this by unioning the two 
> tables and aggregating again.
> We tried to optimize this process by bucketing the tables, but currently it 
> seems that the union operator doesn't leverage the tables being bucketed 
> (the way the join operator does).
> For example, for two bucketed tables a1, a2:
> {code}
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
> .repartition(col("key"))
> .write
>   .mode(SaveMode.Overwrite)
> .bucketBy(3, "key")
> .sortBy("t1")
> .saveAsTable("a1")
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
>   .repartition(col("key"))
>   .write.mode(SaveMode.Overwrite)
>   .bucketBy(3, "key")
>   .sortBy("t1")
>   .saveAsTable("a2")
> {code}
> for the join query we get the "SortMergeJoin"
> {code}
> select * from a1 join a2 on (a1.key=a2.key)
> == Physical Plan ==
> *(3) SortMergeJoin [key#24L], [key#27L], Inner
> :- *(1) Sort [key#24L ASC NULLS FIRST], false, 0
> :  +- *(1) Project [key#24L, t1#25L, t2#26L]
> : +- *(1) Filter isnotnull(key#24L)
> :+- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], 
> PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct
> +- *(2) Sort [key#27L ASC NULLS FIRST], false, 0
>+- *(2) Project [key#27L, t1#28L, t2#29L]
>   +- *(2) Filter isnotnull(key#27L)
>  +- *(2) FileScan parquet default.a2[key#27L,t1#28L,t2#29L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], 
> PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct
> {code}
> but for aggregation after union we get a shuffle:
> {code}
> select key,count(*) from (select * from a1 union all select * from a2)z group 
> by key
> == Physical Plan ==
> *(4) HashAggregate(keys=[key#25L], functions=[count(1)], output=[key#25L, 
> count(1)#36L])
> +- Exchange hashpartitioning(key#25L, 1)
>+- *(3) HashAggregate(keys=[key#25L], functions=[partial_count(1)], 
> output=[key#25L, count#38L])
>   +- Union
>  :- *(1) Project [key#25L]
>  :  +- *(1) FileScan parquet default.a1[key#25L] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
>  +- *(2) Project [key#28L]
> +- *(2) FileScan parquet default.a2[key#28L] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24413) Executor Blacklisting shouldn't immediately fail the application if dynamic allocation is enabled and no active executors

2018-05-29 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-24413.
---
Resolution: Duplicate

> Executor Blacklisting shouldn't immediately fail the application if dynamic 
> allocation is enabled and no active executors
> -
>
> Key: SPARK-24413
> URL: https://issues.apache.org/jira/browse/SPARK-24413
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Major
>
> Currently, with executor blacklisting enabled, dynamic allocation on, and only 
> 1 active executor (the spark.blacklist.killBlacklistedExecutors setting 
> doesn't matter in this case; it can be on or off), a task failure that 
> results in that 1 executor getting blacklisted makes your entire 
> application fail.  The error you get is something like:
> Aborting TaskSet 0.0 because task 9 (partition 9)
> cannot run anywhere due to node and executor blacklist.
> This is very undesirable behavior because you may have a huge job where one 
> task is the long tail, and if it happens to hit a bad node that blacklists 
> it, the entire job fails.
> Ideally, since dynamic allocation is on, the scheduler should not immediately 
> fail but should let dynamic allocation try to get more executors. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24414) Stages page doesn't show all task attempts when failures

2018-05-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24414:


Assignee: Apache Spark

> Stages page doesn't show all task attempts when failures
> 
>
> Key: SPARK-24414
> URL: https://issues.apache.org/jira/browse/SPARK-24414
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Apache Spark
>Priority: Critical
>
> If you have task failures, the StagePage doesn't render all the task attempts 
> properly.  It seems to make the table the size of the total number of 
> successful tasks rather than including all the failed tasks.
> Even though the table size is smaller, if you sort by various columns you can 
> see all the tasks are actually there, it just seems the size of the table is 
> wrong.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24414) Stages page doesn't show all task attempts when failures

2018-05-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24414:


Assignee: (was: Apache Spark)

> Stages page doesn't show all task attempts when failures
> 
>
> Key: SPARK-24414
> URL: https://issues.apache.org/jira/browse/SPARK-24414
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Critical
>
> If you have task failures, the StagePage doesn't render all the task attempts 
> properly.  It seems to make the table the size of the total number of 
> successful tasks rather than including all the failed tasks.
> Even though the table size is smaller, if you sort by various columns you can 
> see all the tasks are actually there, it just seems the size of the table is 
> wrong.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24414) Stages page doesn't show all task attempts when failures

2018-05-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494300#comment-16494300
 ] 

Apache Spark commented on SPARK-24414:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/21457

> Stages page doesn't show all task attempts when failures
> 
>
> Key: SPARK-24414
> URL: https://issues.apache.org/jira/browse/SPARK-24414
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Critical
>
> If you have task failures, the StagePage doesn't render all the task attempts 
> properly.  It seems to make the table the size of the total number of 
> successful tasks rather than including all the failed tasks.
> Even though the table size is smaller, if you sort by various columns you can 
> see all the tasks are actually there, it just seems the size of the table is 
> wrong.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24392) Mark pandas_udf as Experimental

2018-05-29 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-24392:
-
Fix Version/s: 2.4.0

> Mark pandas_udf as Experimental
> ---
>
> Key: SPARK-24392
> URL: https://issues.apache.org/jira/browse/SPARK-24392
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Blocker
> Fix For: 2.3.1, 2.4.0
>
>
> This functionality is still evolving and has introduced some bugs .  It was 
> an oversight to not mark it as experimental before it was released in 2.3.0.  
> Not sure if it is a good idea to change this after the fact, but I'm opening 
> this to continue discussion from 
> https://github.com/apache/spark/pull/21427#issuecomment-391967423



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24413) Executor Blacklisting shouldn't immediately fail the application if dynamic allocation is enabled and no active executors

2018-05-29 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494279#comment-16494279
 ] 

Thomas Graves commented on SPARK-24413:
---

Thanks for linking those; we can just dup this to SPARK-22148.

> Executor Blacklisting shouldn't immediately fail the application if dynamic 
> allocation is enabled and no active executors
> -
>
> Key: SPARK-24413
> URL: https://issues.apache.org/jira/browse/SPARK-24413
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Major
>
> Currently, with executor blacklisting enabled, dynamic allocation on, and only 
> 1 active executor (the spark.blacklist.killBlacklistedExecutors setting 
> doesn't matter in this case; it can be on or off), a task failure that 
> results in that 1 executor getting blacklisted makes your entire 
> application fail.  The error you get is something like:
> Aborting TaskSet 0.0 because task 9 (partition 9)
> cannot run anywhere due to node and executor blacklist.
> This is very undesirable behavior because you may have a huge job where one 
> task is the long tail, and if it happens to hit a bad node that blacklists 
> it, the entire job fails.
> Ideally, since dynamic allocation is on, the scheduler should not immediately 
> fail but should let dynamic allocation try to get more executors. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24414) Stages page doesn't show all task attempts when failures

2018-05-29 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494276#comment-16494276
 ] 

Marcelo Vanzin commented on SPARK-24414:


After a quick look at the code they don't seem related.

> Stages page doesn't show all task attempts when failures
> 
>
> Key: SPARK-24414
> URL: https://issues.apache.org/jira/browse/SPARK-24414
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Critical
>
> If you have task failures, the StagePage doesn't render all the task attempts 
> properly.  It seems to make the table the size of the total number of 
> successful tasks rather than including all the failed tasks.
> Even though the table size is smaller, if you sort by various columns you can 
> see all the tasks are actually there, it just seems the size of the table is 
> wrong.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24414) Stages page doesn't show all task attempts when failures

2018-05-29 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494265#comment-16494265
 ] 

Thomas Graves commented on SPARK-24414:
---

Also, just an FYI: I filed SPARK-24415 as well; not sure if they are related, as 
I haven't dug into that one yet.  

> Stages page doesn't show all task attempts when failures
> 
>
> Key: SPARK-24414
> URL: https://issues.apache.org/jira/browse/SPARK-24414
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Critical
>
> If you have task failures, the StagePage doesn't render all the task attempts 
> properly.  It seems to make the table the size of the total number of 
> successful tasks rather than including all the failed tasks.
> Even though the table size is smaller, if you sort by various columns you can 
> see all the tasks are actually there, it just seems the size of the table is 
> wrong.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24414) Stages page doesn't show all task attempts when failures

2018-05-29 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494239#comment-16494239
 ] 

Marcelo Vanzin commented on SPARK-24414:


Yeah that's the direction I ended up in. Taking the chance to clean up the code 
a bit around this area...

> Stages page doesn't show all task attempts when failures
> 
>
> Key: SPARK-24414
> URL: https://issues.apache.org/jira/browse/SPARK-24414
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Critical
>
> If you have task failures, the StagePage doesn't render all the task attempts 
> properly.  It seems to make the table the size of the total number of 
> successful tasks rather than including all the failed tasks.
> Even though the table size is smaller, if you sort by various columns you can 
> see all the tasks are actually there, it just seems the size of the table is 
> wrong.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24414) Stages page doesn't show all task attempts when failures

2018-05-29 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494237#comment-16494237
 ] 

Thomas Graves commented on SPARK-24414:
---

I am looking to see if we can just return an empty table in the case where the 
tasks aren't initialized yet. If you get to it first, that's fine, or maybe you 
had something else in mind. 

> Stages page doesn't show all task attempts when failures
> 
>
> Key: SPARK-24414
> URL: https://issues.apache.org/jira/browse/SPARK-24414
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Critical
>
> If you have task failures, the StagePage doesn't render all the task attempts 
> properly.  It seems to make the table the size of the total number of 
> successful tasks rather than including all the failed tasks.
> Even though the table size is smaller, if you sort by various columns you can 
> see all the tasks are actually there, it just seems the size of the table is 
> wrong.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24416) Update configuration definition for spark.blacklist.killBlacklistedExecutors

2018-05-29 Thread Sanket Reddy (JIRA)
Sanket Reddy created SPARK-24416:


 Summary: Update configuration definition for 
spark.blacklist.killBlacklistedExecutors
 Key: SPARK-24416
 URL: https://issues.apache.org/jira/browse/SPARK-24416
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Sanket Reddy


spark.blacklist.killBlacklistedExecutors is defined as 

(Experimental) If set to "true", allow Spark to automatically kill, and attempt 
to re-create, executors when they are blacklisted. Note that, when an entire 
node is added to the blacklist, all of the executors on that node will be 
killed.

I presume the killing of blacklisted executors only happens after the stage 
completes successfully and all tasks have completed, or on fetch failures 
(updateBlacklistForFetchFailure/updateBlacklistForSuccessfulTaskSet). It is 
confusing because the definition states that Spark will attempt to re-create the 
executor as soon as it is blacklisted. This is not true: while the stage is in 
progress and an executor is blacklisted, Spark will not attempt to clean it up 
until the stage finishes.
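
As a hedged aside (not in the original report), the setting under discussion is 
enabled roughly like the sketch below; spark.blacklist.enabled is an assumption 
about the companion flag needed for blacklisting at all:

{code}
// Sketch only: enabling blacklisting together with the kill behavior discussed
// above. When executors are actually killed is what this issue is about.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("blacklist-kill-sketch")
  .config("spark.blacklist.enabled", "true")
  .config("spark.blacklist.killBlacklistedExecutors", "true")
  .getOrCreate()
{code}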



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24414) Stages page doesn't show all task attempts when failures

2018-05-29 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494206#comment-16494206
 ] 

Marcelo Vanzin commented on SPARK-24414:


I was going to take a stab at this next.

> Stages page doesn't show all task attempts when failures
> 
>
> Key: SPARK-24414
> URL: https://issues.apache.org/jira/browse/SPARK-24414
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Critical
>
> If you have task failures, the StagePage doesn't render all the task attempts 
> properly.  It seems to make the table the size of the total number of 
> successful tasks rather than including all the failed tasks.
> Even though the table size is smaller, if you sort by various columns you can 
> see all the tasks are actually there, it just seems the size of the table is 
> wrong.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24414) Stages page doesn't show all task attempts when failures

2018-05-29 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494199#comment-16494199
 ] 

Thomas Graves commented on SPARK-24414:
---

looks like this was broken by SPARK-23147, so we probably need to find a 
different solution.

[~vanzin] [~jerryshao]

> Stages page doesn't show all task attempts when failures
> 
>
> Key: SPARK-24414
> URL: https://issues.apache.org/jira/browse/SPARK-24414
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Critical
>
> If you have task failures, the StagePage doesn't render all the task attempts 
> properly.  It seems to make the table the size of the total number of 
> successful tasks rather than including all the failed tasks.
> Even though the table size is smaller, if you sort by various columns you can 
> see all the tasks are actually there, it just seems the size of the table is 
> wrong.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24356) Duplicate strings in File.path managed by FileSegmentManagedBuffer

2018-05-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24356:


Assignee: Apache Spark

> Duplicate strings in File.path managed by FileSegmentManagedBuffer
> --
>
> Key: SPARK-24356
> URL: https://issues.apache.org/jira/browse/SPARK-24356
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.3.0
>Reporter: Misha Dmitriev
>Assignee: Apache Spark
>Priority: Major
> Attachments: SPARK-24356.01.patch
>
>
> I recently analyzed a heap dump of Yarn Node Manager that was suffering from 
> high GC pressure due to high object churn. Analysis was done with the jxray 
> tool ([www.jxray.com|http://www.jxray.com/]) that checks a heap dump for a 
> number of well-known memory issues. One problem that it found in this dump is 
> 19.5% of memory wasted due to duplicate strings. Of these duplicates, more 
> than a half come from {{FileInputStream.path}} and {{File.path}}. All the 
> {{FileInputStream}} objects that JXRay shows are garbage - looks like they 
> are used for a very short period and then discarded (I guess there is a 
> separate question of whether that's a good pattern). But {{File}} instances 
> are traceable to 
> {{org.apache.spark.network.buffer.FileSegmentManagedBuffer.file}} field. Here 
> is the full reference chain:
>  
> {code:java}
> ↖java.io.File.path
> ↖org.apache.spark.network.buffer.FileSegmentManagedBuffer.file
> ↖{j.u.ArrayList}
> ↖j.u.ArrayList$Itr.this$0
> ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.buffers
> ↖{java.util.concurrent.ConcurrentHashMap}.values
> ↖org.apache.spark.network.server.OneForOneStreamManager.streams
> ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager
> ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler
> ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance
> {code}
>  
> Values of these {{File.path}}'s and {{FileInputStream.path}}'s look very 
> similar, so I think {{FileInputStream}}s are generated by the 
> {{FileSegmentManagedBuffer}} code. Instances of {{File}}, in turn, likely 
> come from 
> [https://github.com/apache/spark/blob/master/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java#L258-L263]
>  
> To avoid duplicate strings in {{File.path}}'s in this case, it is suggested 
> that in the above code we create a File with a complete, normalized pathname, 
> that has been already interned. This will prevent the code inside 
> {{java.io.File}} from modifying this string, and thus it will use the 
> interned copy, and will pass it to FileInputStream. Essentially the current 
> line
> {code:java}
> return new File(new File(localDir, String.format("%02x", subDirId)), 
> filename);{code}
> should be replaced with something like
> {code:java}
> String pathname = localDir + File.separator + String.format(...) + 
> File.separator + filename;
> pathname = fileSystem.normalize(pathname).intern();
> return new File(pathname);{code}
>  
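
For readers skimming the suggestion above, a self-contained sketch of the 
interning idea (not the actual patch; it uses java.nio.file for normalization 
rather than the pseudocode's fileSystem.normalize) could look like:

{code}
// Sketch of the suggested fix: build one complete, normalized path string,
// intern it so duplicate paths share a single String, then wrap it in a File.
import java.io.File
import java.nio.file.Paths

def getFile(localDir: String, subDirId: Int, filename: String): File = {
  val pathname = Paths.get(localDir, f"$subDirId%02x", filename)
    .normalize()
    .toString
    .intern()
  new File(pathname)
}
{code}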



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24356) Duplicate strings in File.path managed by FileSegmentManagedBuffer

2018-05-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494184#comment-16494184
 ] 

Apache Spark commented on SPARK-24356:
--

User 'countmdm' has created a pull request for this issue:
https://github.com/apache/spark/pull/21456

> Duplicate strings in File.path managed by FileSegmentManagedBuffer
> --
>
> Key: SPARK-24356
> URL: https://issues.apache.org/jira/browse/SPARK-24356
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.3.0
>Reporter: Misha Dmitriev
>Priority: Major
> Attachments: SPARK-24356.01.patch
>
>
> I recently analyzed a heap dump of Yarn Node Manager that was suffering from 
> high GC pressure due to high object churn. Analysis was done with the jxray 
> tool ([www.jxray.com|http://www.jxray.com/]) that checks a heap dump for a 
> number of well-known memory issues. One problem that it found in this dump is 
> 19.5% of memory wasted due to duplicate strings. Of these duplicates, more 
> than a half come from {{FileInputStream.path}} and {{File.path}}. All the 
> {{FileInputStream}} objects that JXRay shows are garbage - looks like they 
> are used for a very short period and then discarded (I guess there is a 
> separate question of whether that's a good pattern). But {{File}} instances 
> are traceable to 
> {{org.apache.spark.network.buffer.FileSegmentManagedBuffer.file}} field. Here 
> is the full reference chain:
>  
> {code:java}
> ↖java.io.File.path
> ↖org.apache.spark.network.buffer.FileSegmentManagedBuffer.file
> ↖{j.u.ArrayList}
> ↖j.u.ArrayList$Itr.this$0
> ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.buffers
> ↖{java.util.concurrent.ConcurrentHashMap}.values
> ↖org.apache.spark.network.server.OneForOneStreamManager.streams
> ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager
> ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler
> ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance
> {code}
>  
> Values of these {{File.path}}'s and {{FileInputStream.path}}'s look very 
> similar, so I think {{FileInputStream}}s are generated by the 
> {{FileSegmentManagedBuffer}} code. Instances of {{File}}, in turn, likely 
> come from 
> [https://github.com/apache/spark/blob/master/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java#L258-L263]
>  
> To avoid duplicate strings in {{File.path}}'s in this case, it is suggested 
> that in the above code we create a File with a complete, normalized pathname, 
> that has been already interned. This will prevent the code inside 
> {{java.io.File}} from modifying this string, and thus it will use the 
> interned copy, and will pass it to FileInputStream. Essentially the current 
> line
> {code:java}
> return new File(new File(localDir, String.format("%02x", subDirId)), 
> filename);{code}
> should be replaced with something like
> {code:java}
> String pathname = localDir + File.separator + String.format(...) + 
> File.separator + filename;
> pathname = fileSystem.normalize(pathname).intern();
> return new File(pathname);{code}
>  
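For illustration, here is a minimal Java sketch of the interning approach described above. It is not the actual patch: the helper and its localDir/subDirId/filename parameters just mirror the quoted snippet, and new File(pathname).getPath() stands in for the internal FileSystem.normalize call mentioned there.
{code:java}
import java.io.File;

public class InternedPathExample {

  // Hypothetical helper mirroring the quoted snippet, not the real
  // ExternalShuffleBlockResolver code.
  static File getFile(String localDir, int subDirId, String filename) {
    // Build the complete pathname up front instead of nesting File objects.
    String pathname = localDir + File.separator
        + String.format("%02x", subDirId) + File.separator + filename;
    // Normalize, then intern, so every buffer referring to this segment shares
    // one String instance for File.path. Since the string is already in normal
    // form, the final File keeps the interned instance.
    pathname = new File(pathname).getPath().intern();
    return new File(pathname);
  }

  public static void main(String[] args) {
    File f = getFile("/tmp/spark-local", 3, "shuffle_0_0_0.data");
    System.out.println(f.getPath());
  }
}
{code}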



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24413) Executor Blacklisting shouldn't immediately fail the application if dynamic allocation is enabled and no active executors

2018-05-29 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494183#comment-16494183
 ] 

Imran Rashid commented on SPARK-24413:
--

Yeah, I agree with this. I linked two related jiras that are very close. I put 
down some thoughts on those jiras earlier about good ways to do this, but I 
haven't had time to work on it.

> Executor Blacklisting shouldn't immediately fail the application if dynamic 
> allocation is enabled and no active executors
> -
>
> Key: SPARK-24413
> URL: https://issues.apache.org/jira/browse/SPARK-24413
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Major
>
> Currently with executor blacklisting enabled, dynamic allocation on, and you 
> only have 1 active executor (spark.blacklist.killBlacklistedExecutors setting 
> doesn't matter in this case, can be on or off), if you have a task fail that 
> results in the 1 executor you have getting blacklisted, then your entire 
> application will fail.  The error you get is something like:
> Aborting TaskSet 0.0 because task 9 (partition 9)
> cannot run anywhere due to node and executor blacklist.
> This is very undesirable behavior because you may have a huge job where one 
> task is the long tail, and if it happens to hit a bad node that would 
> blacklist it, the entire job fails.
> Ideally, since dynamic allocation is on, the scheduler should not immediately 
> fail but should let dynamic allocation try to get more executors. 
>  
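For reference, the configuration combination described above written out as a SparkConf sketch; the application name is made up, and whether the scheduler then waits for new executors instead of aborting is exactly what this ticket proposes to change.
{code:java}
import org.apache.spark.SparkConf;

public class BlacklistWithDynamicAllocation {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("blacklist-dynamic-allocation-example")
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.shuffle.service.enabled", "true")   // required for dynamic allocation on YARN
        .set("spark.blacklist.enabled", "true")
        // Per the description, this setting does not change the failure mode:
        .set("spark.blacklist.killBlacklistedExecutors", "true");
    System.out.println(conf.toDebugString());
  }
}
{code}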



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24356) Duplicate strings in File.path managed by FileSegmentManagedBuffer

2018-05-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24356:


Assignee: (was: Apache Spark)

> Duplicate strings in File.path managed by FileSegmentManagedBuffer
> --
>
> Key: SPARK-24356
> URL: https://issues.apache.org/jira/browse/SPARK-24356
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.3.0
>Reporter: Misha Dmitriev
>Priority: Major
> Attachments: SPARK-24356.01.patch
>
>
> I recently analyzed a heap dump of Yarn Node Manager that was suffering from 
> high GC pressure due to high object churn. Analysis was done with the jxray 
> tool ([www.jxray.com|http://www.jxray.com/]) that checks a heap dump for a 
> number of well-known memory issues. One problem that it found in this dump is 
> 19.5% of memory wasted due to duplicate strings. Of these duplicates, more 
> than half come from {{FileInputStream.path}} and {{File.path}}. All the 
> {{FileInputStream}} objects that JXRay shows are garbage - looks like they 
> are used for a very short period and then discarded (I guess there is a 
> separate question of whether that's a good pattern). But {{File}} instances 
> are traceable to 
> {{org.apache.spark.network.buffer.FileSegmentManagedBuffer.file}} field. Here 
> is the full reference chain:
>  
> {code:java}
> ↖java.io.File.path
> ↖org.apache.spark.network.buffer.FileSegmentManagedBuffer.file
> ↖{j.u.ArrayList}
> ↖j.u.ArrayList$Itr.this$0
> ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.buffers
> ↖{java.util.concurrent.ConcurrentHashMap}.values
> ↖org.apache.spark.network.server.OneForOneStreamManager.streams
> ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager
> ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler
> ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance
> {code}
>  
> Values of these {{File.path}}'s and {{FileInputStream.path}}'s look very 
> similar, so I think {{FileInputStream}}s are generated by the 
> {{FileSegmentManagedBuffer}} code. Instances of {{File}}, in turn, likely 
> come from 
> [https://github.com/apache/spark/blob/master/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java#L258-L263]
>  
> To avoid duplicate strings in {{File.path}}'s in this case, it is suggested 
> that in the above code we create a File with a complete, normalized pathname, 
> that has been already interned. This will prevent the code inside 
> {{java.io.File}} from modifying this string, and thus it will use the 
> interned copy, and will pass it to FileInputStream. Essentially the current 
> line
> {code:java}
> return new File(new File(localDir, String.format("%02x", subDirId)), 
> filename);{code}
> should be replaced with something like
> {code:java}
> String pathname = localDir + File.separator + String.format(...) + 
> File.separator + filename;
> pathname = fileSystem.normalize(pathname).intern();
> return new File(pathname);{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24395) Fix Behavior of NOT IN with Literals Containing NULL

2018-05-29 Thread Juliusz Sompolski (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494181#comment-16494181
 ] 

Juliusz Sompolski commented on SPARK-24395:
---

The question is whether the literals should be treated as structs, or unpacked?

If like structs, then the current behavior is correct, I think.

But when a similar query is IN / NOT IN subquery, it is currently treated as if 
the left hand side was unpacked into independent columns.

cc [~mgaido] [~hvanhovell]

> Fix Behavior of NOT IN with Literals Containing NULL
> 
>
> Key: SPARK-24395
> URL: https://issues.apache.org/jira/browse/SPARK-24395
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Miles Yucht
>Priority: Major
>
> Spark does not return the correct answer when evaluating NOT IN in some 
> cases. For example:
> {code:java}
> CREATE TEMPORARY VIEW m AS SELECT * FROM VALUES
>   (null, null)
>   AS m(a, b);
> SELECT *
> FROM   m
> WHERE  a IS NULL AND b IS NULL
>AND (a, b) NOT IN ((0, 1.0), (2, 3.0), (4, CAST(null AS DECIMAL(2, 
> 1;{code}
> According to the semantics of null-aware anti-join, this should return no 
> rows. However, it actually returns the row {{NULL NULL}}. This was found by 
> inspecting the unit tests added for SPARK-24381 
> ([https://github.com/apache/spark/pull/21425#pullrequestreview-123421822).]
> *Acceptance Criteria*:
>  * We should be able to add the following test cases back to 
> {{subquery/in-subquery/not-in-unit-test-multi-column-literal.sql}}:
> {code:java}
>   -- Case 2
>   -- (subquery contains a row with null in all columns -> row not returned)
> SELECT *
> FROM   m
> WHERE  (a, b) NOT IN ((CAST (null AS INT), CAST (null AS DECIMAL(2, 1;
>   -- Case 3
>   -- (probe-side columns are all null -> row not returned)
> SELECT *
> FROM   m
> WHERE  a IS NULL AND b IS NULL -- Matches only (null, null)
>AND (a, b) NOT IN ((0, 1.0), (2, 3.0), (4, CAST(null AS DECIMAL(2, 
> 1;
>   -- Case 4
>   -- (one column null, other column matches a row in the subquery result -> 
> row not returned)
> SELECT *
> FROM   m
> WHERE  b = 1.0 -- Matches (null, 1.0)
>AND (a, b) NOT IN ((0, 1.0), (2, 3.0), (4, CAST(null AS DECIMAL(2, 
> 1; 
> {code}
>  
> cc [~smilegator] [~juliuszsompolski]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24093) Make some fields of KafkaStreamWriter/InternalRowMicroBatchWriter visible to outside of the classes

2018-05-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494152#comment-16494152
 ] 

Apache Spark commented on SPARK-24093:
--

User 'merlintang' has created a pull request for this issue:
https://github.com/apache/spark/pull/21455

> Make some fields of KafkaStreamWriter/InternalRowMicroBatchWriter visible to 
> outside of the classes
> ---
>
> Key: SPARK-24093
> URL: https://issues.apache.org/jira/browse/SPARK-24093
> Project: Spark
>  Issue Type: Wish
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Weiqing Yang
>Priority: Minor
>
> To let third parties get information about the streaming writer, for 
> example the "writer" and the "topic" that streaming data are written into, 
> this jira is created to make the relevant fields of KafkaStreamWriter and 
> InternalRowMicroBatchWriter visible outside of the classes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24093) Make some fields of KafkaStreamWriter/InternalRowMicroBatchWriter visible to outside of the classes

2018-05-29 Thread Mingjie Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494154#comment-16494154
 ] 

Mingjie Tang commented on SPARK-24093:
--

I made a PR for this 
https://github.com/apache/spark/pull/21455


> Make some fields of KafkaStreamWriter/InternalRowMicroBatchWriter visible to 
> outside of the classes
> ---
>
> Key: SPARK-24093
> URL: https://issues.apache.org/jira/browse/SPARK-24093
> Project: Spark
>  Issue Type: Wish
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Weiqing Yang
>Priority: Minor
>
> To let third parties get information about the streaming writer, for 
> example the "writer" and the "topic" that streaming data are written into, 
> this jira is created to make the relevant fields of KafkaStreamWriter and 
> InternalRowMicroBatchWriter visible outside of the classes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24093) Make some fields of KafkaStreamWriter/InternalRowMicroBatchWriter visible to outside of the classes

2018-05-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24093:


Assignee: Apache Spark

> Make some fields of KafkaStreamWriter/InternalRowMicroBatchWriter visible to 
> outside of the classes
> ---
>
> Key: SPARK-24093
> URL: https://issues.apache.org/jira/browse/SPARK-24093
> Project: Spark
>  Issue Type: Wish
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Weiqing Yang
>Assignee: Apache Spark
>Priority: Minor
>
> To let third parties get information about the streaming writer, for 
> example the "writer" and the "topic" that streaming data are written into, 
> this jira is created to make the relevant fields of KafkaStreamWriter and 
> InternalRowMicroBatchWriter visible outside of the classes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24093) Make some fields of KafkaStreamWriter/InternalRowMicroBatchWriter visible to outside of the classes

2018-05-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24093:


Assignee: (was: Apache Spark)

> Make some fields of KafkaStreamWriter/InternalRowMicroBatchWriter visible to 
> outside of the classes
> ---
>
> Key: SPARK-24093
> URL: https://issues.apache.org/jira/browse/SPARK-24093
> Project: Spark
>  Issue Type: Wish
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Weiqing Yang
>Priority: Minor
>
> To let third parties get information about the streaming writer, for 
> example the "writer" and the "topic" that streaming data are written into, 
> this jira is created to make the relevant fields of KafkaStreamWriter and 
> InternalRowMicroBatchWriter visible outside of the classes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24415) Stage page aggregated executor metrics wrong when failures

2018-05-29 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-24415:
--
Description: 
Running with Spark 2.3 on YARN and having task failures and blacklisting, the 
aggregated metrics by executor are not correct.  In my example it should have 2 
failed tasks but it only shows one.  Note I tested with the master branch to 
verify it's not fixed.

I will attach screen shot.

To reproduce:

$SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client 
--executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" 
--conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf 
"spark.blacklist.stage.maxFailedExecutorsPerNode=1"  --conf 
"spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf 
"spark.blacklist.killBlacklistedExecutors=true"

 

sc.parallelize(1 to 1, 10).map { x =>
 | if (SparkEnv.get.executorId.toInt >= 1 && SparkEnv.get.executorId.toInt <= 
4) throw new RuntimeException("Bad executor")
 | else (x % 3, x)
 | }.reduceByKey((a, b) => a + b).collect()

  was:
Running with spark 2.3 on yarn and having task failures and blacklisting, the 
aggregated metrics by executor are not correct.  In my example it should have 2 
failed tasks but it only shows one.  

I will attach screen shot.

To reproduce:

$SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client 
--executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" 
--conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf 
"spark.blacklist.stage.maxFailedExecutorsPerNode=1"  --conf 
"spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf 
"spark.blacklist.killBlacklistedExecutors=true"

 

sc.parallelize(1 to 1, 10).map { x =>
 | if (SparkEnv.get.executorId.toInt >= 1 && SparkEnv.get.executorId.toInt <= 
4) throw new RuntimeException("Bad executor")
 | else (x % 3, x)
 | }.reduceByKey((a, b) => a + b).collect()


> Stage page aggregated executor metrics wrong when failures 
> ---
>
> Key: SPARK-24415
> URL: https://issues.apache.org/jira/browse/SPARK-24415
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Major
> Attachments: Screen Shot 2018-05-29 at 2.15.38 PM.png
>
>
> Running with Spark 2.3 on YARN and having task failures and blacklisting, the 
> aggregated metrics by executor are not correct.  In my example it should have 
> 2 failed tasks but it only shows one.  Note I tested with the master branch to 
> verify it's not fixed.
> I will attach screen shot.
> To reproduce:
> $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client 
> --executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" 
> --conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf 
> "spark.blacklist.stage.maxFailedExecutorsPerNode=1"  --conf 
> "spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf 
> "spark.blacklist.killBlacklistedExecutors=true"
>  
> sc.parallelize(1 to 1, 10).map { x =>
>  | if (SparkEnv.get.executorId.toInt >= 1 && SparkEnv.get.executorId.toInt <= 
> 4) throw new RuntimeException("Bad executor")
>  | else (x % 3, x)
>  | }.reduceByKey((a, b) => a + b).collect()



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24415) Stage page aggregated executor metrics wrong when failures

2018-05-29 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-24415:
-

 Summary: Stage page aggregated executor metrics wrong when 
failures 
 Key: SPARK-24415
 URL: https://issues.apache.org/jira/browse/SPARK-24415
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.3.0
Reporter: Thomas Graves
 Attachments: Screen Shot 2018-05-29 at 2.15.38 PM.png

Running with spark 2.3 on yarn and having task failures and blacklisting, the 
aggregated metrics by executor are not correct.  In my example it should have 2 
failed tasks but it only shows one.  

I will attach screen shot.

To reproduce:

$SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client 
--executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" 
--conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf 
"spark.blacklist.stage.maxFailedExecutorsPerNode=1"  --conf 
"spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf 
"spark.blacklist.killBlacklistedExecutors=true"

 

sc.parallelize(1 to 1, 10).map { x =>
 | if (SparkEnv.get.executorId.toInt >= 1 && SparkEnv.get.executorId.toInt <= 
4) throw new RuntimeException("Bad executor")
 | else (x % 3, x)
 | }.reduceByKey((a, b) => a + b).collect()



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24415) Stage page aggregated executor metrics wrong when failures

2018-05-29 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-24415:
--
Attachment: Screen Shot 2018-05-29 at 2.15.38 PM.png

> Stage page aggregated executor metrics wrong when failures 
> ---
>
> Key: SPARK-24415
> URL: https://issues.apache.org/jira/browse/SPARK-24415
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Major
> Attachments: Screen Shot 2018-05-29 at 2.15.38 PM.png
>
>
> Running with spark 2.3 on yarn and having task failures and blacklisting, the 
> aggregated metrics by executor are not correct.  In my example it should have 
> 2 failed tasks but it only shows one.  
> I will attach screen shot.
> To reproduce:
> $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client 
> --executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" 
> --conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf 
> "spark.blacklist.stage.maxFailedExecutorsPerNode=1"  --conf 
> "spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf 
> "spark.blacklist.killBlacklistedExecutors=true"
>  
> sc.parallelize(1 to 1, 10).map { x =>
>  | if (SparkEnv.get.executorId.toInt >= 1 && SparkEnv.get.executorId.toInt <= 
> 4) throw new RuntimeException("Bad executor")
>  | else (x % 3, x)
>  | }.reduceByKey((a, b) => a + b).collect()



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24414) Stages page doesn't show all task attempts when failures

2018-05-29 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494083#comment-16494083
 ] 

Thomas Graves commented on SPARK-24414:
---

to reproduce this simply start a shell:

$SPARK_HOME/bin/spark-shell --num-executors 5  --master yarn --deploy-mode 
client

Run something that gets some task failures but not all:

sc.parallelize(1 to 1, 10).map { x =>
 | if (SparkEnv.get.executorId.toInt >= 1 && SparkEnv.get.executorId.toInt <= 
4) throw new RuntimeException("Bad executor")
 | else (x % 3, x)
 | }.reduceByKey((a, b) => a + b).collect()

 

Go to the stages page and you will only see 10 tasks rendered when it should 
have 21 total between succeeded and failed. 

> Stages page doesn't show all task attempts when failures
> 
>
> Key: SPARK-24414
> URL: https://issues.apache.org/jira/browse/SPARK-24414
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Critical
>
> If you have task failures, the StagePage doesn't render all the task attempts 
> properly.  It seems to make the table the size of the total number of 
> successful tasks rather than including all the failed tasks.
> Even though the table size is smaller, if you sort by various columns you can 
> see all the tasks are actually there, it just seems the size of the table is 
> wrong.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24414) Stages page doesn't show all task attempts when failures

2018-05-29 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-24414:
-

 Summary: Stages page doesn't show all task attempts when failures
 Key: SPARK-24414
 URL: https://issues.apache.org/jira/browse/SPARK-24414
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.3.0
Reporter: Thomas Graves


If you have task failures, the StagePage doesn't render all the task attempts 
properly.  It seems to make the table the size of the total number of 
successful tasks rather than including all the failed tasks.

Even though the table size is smaller, if you sort by various columns you can 
see all the tasks are actually there, it just seems the size of the table is 
wrong.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24356) Duplicate strings in File.path managed by FileSegmentManagedBuffer

2018-05-29 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494058#comment-16494058
 ] 

Ruslan Dautkhanov commented on SPARK-24356:
---

Another improvement for YARN NodeManagers that we saw could decrease GC 
pressure is to lower io.netty.allocator.maxOrder from the default of 11 down to 
8, which will decrease the netty buffers from 16 MB to 2 MB. 

Thanks to [~mi...@cloudera.com] for helping to identify this one too.

{quote}
Here is the Netty code responsible for the highly underutilized buffers that we 
discussed. Long story short, I think I found the variables that control these 
byte[] arrays referenced by io.netty.buffer.PoolChunk.memory. Check the code of 
http://netty.io/4.0/xref/io/netty/buffer/PooledByteBufAllocator.html: lines 
39-40 look like:

private static final int DEFAULT_PAGE_SIZE;
private static final int DEFAULT_MAX_ORDER; // 8192 << 11 = 16 MiB per chunk

A little below you can see:

int defaultPageSize = SystemPropertyUtil.getInt("io.netty.allocator.pageSize", 
8192);
... // Some validation
DEFAULT_PAGE_SIZE = defaultPageSize;

int defaultMaxOrder = SystemPropertyUtil.getInt("io.netty.allocator.maxOrder", 
11);
... // Some validation
DEFAULT_MAX_ORDER = defaultMaxOrder;

And then from the rest of the code in this class, as well as PoolChunk, 
PoolChunkList and PoolArena, it is clear that the size of the said buffers is 
set as pageSize * (2^maxOrder), with the default values as above. 8192b * 
(2^11) = 16MB, which agrees with the buffer size obtained from the jxray 
report, that I previously mentioned.

So it looks like, to reduce the amount of memory wasted by these underutilized 
netty buffers, it's best to run the Yarn NM JVM with the 
"io.netty.allocator.maxOrder" property explicitly set to something less than the 
default value of 11. Decreasing this number by 1 halves the amount of memory 
consumed by these buffers. I would suggest starting with a value of 9 
or 8 - that seems like a reasonable balance between savings and safety.

{quote}

I was surprised to learn that the YARN NM actually uses some Spark code (e.g. 
org.apache.spark.network.yarn.YarnShuffleService), so this issue could be common 
to the YARN NM and the Spark shuffle service. However, we did not check whether 
the underutilized netty buffers apply to the Spark shuffle service too - it 
might be a good idea to open another jira. 

jxray seems to be a great tool for finding issues like these.
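For reference, a small Java sketch of how the chunk size follows from the two system properties discussed above; the YARN_NODEMANAGER_OPTS line at the end is an assumption about how the flag might be passed and should be adjusted to the deployment.
{code:java}
public class NettyChunkSize {
  public static void main(String[] args) {
    int pageSize = Integer.getInteger("io.netty.allocator.pageSize", 8192);
    int maxOrder = Integer.getInteger("io.netty.allocator.maxOrder", 11);
    // chunkSize = pageSize * 2^maxOrder: 8192 << 11 = 16 MiB, 8192 << 8 = 2 MiB
    long chunkSize = (long) pageSize << maxOrder;
    System.out.println("netty chunk size: " + chunkSize + " bytes");
  }
}
// Run with e.g.:  java -Dio.netty.allocator.maxOrder=8 NettyChunkSize
// For the NodeManager JVM (assumption, adjust to your deployment):
//   export YARN_NODEMANAGER_OPTS="$YARN_NODEMANAGER_OPTS -Dio.netty.allocator.maxOrder=8"
{code}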

> Duplicate strings in File.path managed by FileSegmentManagedBuffer
> --
>
> Key: SPARK-24356
> URL: https://issues.apache.org/jira/browse/SPARK-24356
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.3.0
>Reporter: Misha Dmitriev
>Priority: Major
> Attachments: SPARK-24356.01.patch
>
>
> I recently analyzed a heap dump of Yarn Node Manager that was suffering from 
> high GC pressure due to high object churn. Analysis was done with the jxray 
> tool ([www.jxray.com|http://www.jxray.com/]) that checks a heap dump for a 
> number of well-known memory issues. One problem that it found in this dump is 
> 19.5% of memory wasted due to duplicate strings. Of these duplicates, more 
> than half come from {{FileInputStream.path}} and {{File.path}}. All the 
> {{FileInputStream}} objects that JXRay shows are garbage - looks like they 
> are used for a very short period and then discarded (I guess there is a 
> separate question of whether that's a good pattern). But {{File}} instances 
> are traceable to 
> {{org.apache.spark.network.buffer.FileSegmentManagedBuffer.file}} field. Here 
> is the full reference chain:
>  
> {code:java}
> ↖java.io.File.path
> ↖org.apache.spark.network.buffer.FileSegmentManagedBuffer.file
> ↖{j.u.ArrayList}
> ↖j.u.ArrayList$Itr.this$0
> ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.buffers
> ↖{java.util.concurrent.ConcurrentHashMap}.values
> ↖org.apache.spark.network.server.OneForOneStreamManager.streams
> ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager
> ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler
> ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance
> {code}
>  
> Values of these {{File.path}}'s and {{FileInputStream.path}}'s look very 
> similar, so I think {{FileInputStream}}s are generated by the 
> {{FileSegmentManagedBuffer}} code. Instances of {{File}}, in turn, likely 
> come from 
> [https://github.com/apache/spark/blob/master/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java#L258-L263]
>  
> To avoid duplicate strings in {{File.path}}'s in this case, it is suggested 
> that in the above code we create a File with a complete, normalized pathname, 
> that has been already interned.

[jira] [Updated] (SPARK-22666) Spark datasource for image format

2018-05-29 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-22666:
--
Summary: Spark datasource for image format  (was: Spark reader source for 
image format)

> Spark datasource for image format
> -
>
> Key: SPARK-22666
> URL: https://issues.apache.org/jira/browse/SPARK-22666
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Timothy Hunter
>Priority: Major
>
> The current API for the new image format is implemented as a standalone 
> feature, in order to make it reside within the mllib package. As discussed in 
> SPARK-21866, users should be able to load images through the more common 
> spark source reader interface.
> This ticket is concerned with adding image reading support in the spark 
> source API, through either of the following interfaces:
>  - {{spark.read.format("image")...}}
>  - {{spark.read.image}}
> The output is a dataframe that contains images (and the file names for 
> example), following the semantics discussed already in SPARK-21866.
> A few technical notes:
> * since the functionality is implemented in {{mllib}}, calling this function 
> may fail at runtime if users have not imported the {{spark-mllib}} dependency
> * How to deal with very flat directories? It is common to have millions of 
> files in a single "directory" (like in S3), which seems to have caused 
> issues for some users. If this issue is too complex to handle in this ticket, 
> it can be dealt with separately.
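For illustration, a Java sketch of how the first interface listed above could be used once the "image" source exists; the column layout follows the schema discussed in SPARK-21866 and is not guaranteed here, and the input path is a placeholder.
{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ImageSourceExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("image-source-example").getOrCreate();
    // May fail at runtime if spark-mllib is not on the classpath, as noted above.
    Dataset<Row> images = spark.read().format("image").load("/path/to/images");
    images.printSchema();                          // expected: a struct column with origin, height, width, ...
    images.select("image.origin").show(5, false);  // file names, per the proposed semantics
    spark.stop();
  }
}
{code}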



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24337) Improve the error message for invalid SQL conf value

2018-05-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494038#comment-16494038
 ] 

Apache Spark commented on SPARK-24337:
--

User 'PenguinToast' has created a pull request for this issue:
https://github.com/apache/spark/pull/21454

> Improve the error message for invalid SQL conf value
> 
>
> Key: SPARK-24337
> URL: https://issues.apache.org/jira/browse/SPARK-24337
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Assignee: Xiao Li
>Priority: Major
>
> Right now Spark will throw the following error message when a config is set 
> to an invalid value. It would be great if the error message contains the 
> config key so that it's easy to tell which one is wrong.
> {code}
> Size must be specified as bytes (b), kibibytes (k), mebibytes (m), gibibytes 
> (g), tebibytes (t), or pebibytes(p). E.g. 50b, 100k, or 250m.
> Fractional values are not supported. Input was: 1.6
>   at 
> org.apache.spark.network.util.JavaUtils.byteStringAs(JavaUtils.java:291)
>   at 
> org.apache.spark.internal.config.ConfigHelpers$.byteFromString(ConfigBuilder.scala:66)
>   at 
> org.apache.spark.internal.config.ConfigBuilder$$anonfun$bytesConf$1.apply(ConfigBuilder.scala:234)
>   at 
> org.apache.spark.internal.config.ConfigBuilder$$anonfun$bytesConf$1.apply(ConfigBuilder.scala:234)
>   at 
> org.apache.spark.sql.internal.SQLConf.setConfString(SQLConf.scala:1300)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$mergeSparkConf$1.apply(BaseSessionStateBuilder.scala:78)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$mergeSparkConf$1.apply(BaseSessionStateBuilder.scala:77)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.mergeSparkConf(BaseSessionStateBuilder.scala:77)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.conf$lzycompute(BaseSessionStateBuilder.scala:90)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.conf(BaseSessionStateBuilder.scala:88)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:289)
>   at 
> org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1071)
>   ... 59 more
> {code}
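For illustration only, one possible shape of the improvement (an assumption, not the actual fix): wrap the parse failure so the offending config key is part of the message.
{code:java}
import org.apache.spark.network.util.JavaUtils;

public class ConfValueErrorExample {

  static long parseBytesConf(String key, String value) {
    try {
      return JavaUtils.byteStringAsBytes(value);
    } catch (IllegalArgumentException e) {
      // NumberFormatException (seen in the stack trace above) is a subclass of
      // IllegalArgumentException, so it is caught here and re-thrown with the key attached.
      throw new IllegalArgumentException(
          "Invalid value for config '" + key + "': " + value, e);
    }
  }

  public static void main(String[] args) {
    System.out.println(parseBytesConf("spark.sql.files.maxPartitionBytes", "128m"));
    parseBytesConf("spark.sql.files.maxPartitionBytes", "1.6");  // fails, now naming the key
  }
}
{code}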



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24337) Improve the error message for invalid SQL conf value

2018-05-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24337:


Assignee: Apache Spark  (was: Xiao Li)

> Improve the error message for invalid SQL conf value
> 
>
> Key: SPARK-24337
> URL: https://issues.apache.org/jira/browse/SPARK-24337
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Major
>
> Right now Spark will throw the following error message when a config is set 
> to an invalid value. It would be great if the error message contains the 
> config key so that it's easy to tell which one is wrong.
> {code}
> Size must be specified as bytes (b), kibibytes (k), mebibytes (m), gibibytes 
> (g), tebibytes (t), or pebibytes(p). E.g. 50b, 100k, or 250m.
> Fractional values are not supported. Input was: 1.6
>   at 
> org.apache.spark.network.util.JavaUtils.byteStringAs(JavaUtils.java:291)
>   at 
> org.apache.spark.internal.config.ConfigHelpers$.byteFromString(ConfigBuilder.scala:66)
>   at 
> org.apache.spark.internal.config.ConfigBuilder$$anonfun$bytesConf$1.apply(ConfigBuilder.scala:234)
>   at 
> org.apache.spark.internal.config.ConfigBuilder$$anonfun$bytesConf$1.apply(ConfigBuilder.scala:234)
>   at 
> org.apache.spark.sql.internal.SQLConf.setConfString(SQLConf.scala:1300)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$mergeSparkConf$1.apply(BaseSessionStateBuilder.scala:78)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$mergeSparkConf$1.apply(BaseSessionStateBuilder.scala:77)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.mergeSparkConf(BaseSessionStateBuilder.scala:77)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.conf$lzycompute(BaseSessionStateBuilder.scala:90)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.conf(BaseSessionStateBuilder.scala:88)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:289)
>   at 
> org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1071)
>   ... 59 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24337) Improve the error message for invalid SQL conf value

2018-05-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24337:


Assignee: Xiao Li  (was: Apache Spark)

> Improve the error message for invalid SQL conf value
> 
>
> Key: SPARK-24337
> URL: https://issues.apache.org/jira/browse/SPARK-24337
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Assignee: Xiao Li
>Priority: Major
>
> Right now Spark will throw the following error message when a config is set 
> to an invalid value. It would be great if the error message contains the 
> config key so that it's easy to tell which one is wrong.
> {code}
> Size must be specified as bytes (b), kibibytes (k), mebibytes (m), gibibytes 
> (g), tebibytes (t), or pebibytes(p). E.g. 50b, 100k, or 250m.
> Fractional values are not supported. Input was: 1.6
>   at 
> org.apache.spark.network.util.JavaUtils.byteStringAs(JavaUtils.java:291)
>   at 
> org.apache.spark.internal.config.ConfigHelpers$.byteFromString(ConfigBuilder.scala:66)
>   at 
> org.apache.spark.internal.config.ConfigBuilder$$anonfun$bytesConf$1.apply(ConfigBuilder.scala:234)
>   at 
> org.apache.spark.internal.config.ConfigBuilder$$anonfun$bytesConf$1.apply(ConfigBuilder.scala:234)
>   at 
> org.apache.spark.sql.internal.SQLConf.setConfString(SQLConf.scala:1300)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$mergeSparkConf$1.apply(BaseSessionStateBuilder.scala:78)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$mergeSparkConf$1.apply(BaseSessionStateBuilder.scala:77)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.mergeSparkConf(BaseSessionStateBuilder.scala:77)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.conf$lzycompute(BaseSessionStateBuilder.scala:90)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.conf(BaseSessionStateBuilder.scala:88)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:289)
>   at 
> org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1071)
>   ... 59 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24413) Executor Blacklisting shouldn't immediately fail the application if dynamic allocation is enabled and no active executors

2018-05-29 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16493978#comment-16493978
 ] 

Thomas Graves commented on SPARK-24413:
---

[~imranr]  thoughts on this?

> Executor Blacklisting shouldn't immediately fail the application if dynamic 
> allocation is enabled and no active executors
> -
>
> Key: SPARK-24413
> URL: https://issues.apache.org/jira/browse/SPARK-24413
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Major
>
> Currently with executor blacklisting enabled, dynamic allocation on, and you 
> only have 1 active executor (spark.blacklist.killBlacklistedExecutors setting 
> doesn't matter in this case, can be on or off), if you have a task fail that 
> results in the 1 executor you have getting blacklisted, then your entire 
> application will fail.  The error you get is something like:
> Aborting TaskSet 0.0 because task 9 (partition 9)
> cannot run anywhere due to node and executor blacklist.
> This is very undesirable behavior because you may have a huge job where one 
> task is the long tail, and if it happens to hit a bad node that would 
> blacklist it, the entire job fails.
> Ideally, since dynamic allocation is on, the scheduler should not immediately 
> fail but should let dynamic allocation try to get more executors. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24413) Executor Blacklisting shouldn't immediately fail the application if dynamic allocation is enabled and no active executors

2018-05-29 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-24413:
--
Summary: Executor Blacklisting shouldn't immediately fail the application 
if dynamic allocation is enabled and no active executors  (was: Executor 
Blacklisting shouldn't immediately fail the application if dynamic allocation 
is enabled and it doesn't have any other active executors )

> Executor Blacklisting shouldn't immediately fail the application if dynamic 
> allocation is enabled and no active executors
> -
>
> Key: SPARK-24413
> URL: https://issues.apache.org/jira/browse/SPARK-24413
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Major
>
> Currently with executor blacklisting enabled, dynamic allocation on, and you 
> only have 1 active executor (spark.blacklist.killBlacklistedExecutors setting 
> doesn't matter in this case, can be on or off), if you have a task fail that 
> results in the 1 executor you have getting blacklisted, then your entire 
> application will fail.  The error you get is something like:
> Aborting TaskSet 0.0 because task 9 (partition 9)
> cannot run anywhere due to node and executor blacklist.
> This is very undesirable behavior because you may have a huge job where one 
> task is the long tail, and if it happens to hit a bad node that would 
> blacklist it, the entire job fails.
> Ideally, since dynamic allocation is on, the scheduler should not immediately 
> fail but should let dynamic allocation try to get more executors. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24413) Executor Blacklisting shouldn't immediately fail the application if dynamic allocation is enabled and it doesn't have any other active executors

2018-05-29 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-24413:
-

 Summary: Executor Blacklisting shouldn't immediately fail the 
application if dynamic allocation is enabled and it doesn't have any other 
active executors 
 Key: SPARK-24413
 URL: https://issues.apache.org/jira/browse/SPARK-24413
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 2.3.0
Reporter: Thomas Graves


Currently with executor blacklisting enabled, dynamic allocation on, and you 
only have 1 active executor (spark.blacklist.killBlacklistedExecutors setting 
doesn't matter in this case, can be on or off), if you have a task fail that 
results in the 1 executor you have getting blacklisted, then your entire 
application will fail.  The error you get is something like:

Aborting TaskSet 0.0 because task 9 (partition 9)
cannot run anywhere due to node and executor blacklist.

This is very undesirable behavior because you may have a huge job where one task 
is the long tail, and if it happens to hit a bad node that would blacklist it, 
the entire job fails.

Ideally, since dynamic allocation is on, the scheduler should not immediately 
fail but should let dynamic allocation try to get more executors. 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24412) Adding docs about automagical type casting in `isin` and `isInCollection` APIs

2018-05-29 Thread DB Tsai (JIRA)
DB Tsai created SPARK-24412:
---

 Summary: Adding docs about automagical type casting in `isin` and 
`isInCollection` APIs
 Key: SPARK-24412
 URL: https://issues.apache.org/jira/browse/SPARK-24412
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.3.0
Reporter: DB Tsai
 Fix For: 2.4.0


We should let users know that these two APIs compare data with automatic type casting.

See https://github.com/apache/spark/pull/21416#discussion_r191491943 for detail.
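For illustration, a minimal Java sketch of the casting behavior the docs should call out, mirroring the mixed-type example from SPARK-24371 later in this digest; the exact results are governed by Spark's type coercion rules, not by this sketch.
{code:java}
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IsinCastingExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[1]").appName("isin-casting-example").getOrCreate();
    // An integer profileID column with values 1..7.
    Dataset<Row> profileDF = spark.range(1, 8)
        .select(col("id").cast("int").alias("profileID"));
    // Mixed literal types: the short, the long and the string are all compared
    // after implicit casts, so 3, 6 and 7 are expected to match.
    profileDF.withColumn("isValid", col("profileID").isin(6, (short) 7, 8L, "3")).show();
    spark.stop();
  }
}
{code}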



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24411) Adding native Java tests for `isInCollection`

2018-05-29 Thread DB Tsai (JIRA)
DB Tsai created SPARK-24411:
---

 Summary: Adding native Java tests for `isInCollection`
 Key: SPARK-24411
 URL: https://issues.apache.org/jira/browse/SPARK-24411
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.3.0
Reporter: DB Tsai
 Fix For: 2.4.0


In the past, some of our Java-facing APIs have been difficult to call from Java. 
We should add tests directly in Java to make sure they work.
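A hypothetical shape of such a test follows; the class, method and assertion are illustrative, not the actual test added for this ticket.
{code:java}
import static org.apache.spark.sql.functions.col;

import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.junit.Assert;
import org.junit.Test;

public class JavaIsInCollectionSuite {

  @Test
  public void isInCollectionIsCallableFromJava() {
    SparkSession spark = SparkSession.builder()
        .master("local[1]").appName("isInCollection-java-test").getOrCreate();
    try {
      Dataset<Row> df = spark.range(1, 5).toDF("id");
      // The point of the test: a plain java.util.List works as the argument from Java.
      long matches = df.filter(col("id").isInCollection(Arrays.asList(2L, 3L))).count();
      Assert.assertEquals(2L, matches);
    } finally {
      spark.stop();
    }
  }
}
{code}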



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24371) Added isInCollection in DataFrame API for Scala and Java.

2018-05-29 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-24371.
-
Resolution: Fixed

> Added isInCollection in DataFrame API for Scala and Java.
> -
>
> Key: SPARK-24371
> URL: https://issues.apache.org/jira/browse/SPARK-24371
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 2.4.0
>
>
> Implemented *{{isInCollection}}* in DataFrame API for both Scala and Java, so 
> users can do
> {code}
>  val profileDF = Seq(
>  Some(1), Some(2), Some(3), Some(4),
>  Some(5), Some(6), Some(7), None
>  ).toDF("profileID")
> val validUsers: Set[Any] = Set(6, 7.toShort, 8L, "3")
> val result = profileDF.withColumn("isValid", 
> $"profileID".isInCollection(validUsers))
> result.show(10)
>  """
>  +---------+-------+
>  |profileID|isValid|
>  +---------+-------+
>  |        1|  false|
>  |        2|  false|
>  |        3|   true|
>  |        4|  false|
>  |        5|  false|
>  |        6|   true|
>  |        7|   true|
>  |     null|   null|
>  +---------+-------+
>  """.stripMargin
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24296) Support replicating blocks larger than 2 GB

2018-05-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24296:


Assignee: (was: Apache Spark)

> Support replicating blocks larger than 2 GB
> ---
>
> Key: SPARK-24296
> URL: https://issues.apache.org/jira/browse/SPARK-24296
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Priority: Major
>
> Replicating blocks sends the entire block data in one frame.  This results in 
> a failure on the receiving end for blocks larger than 2GB.
> We should change block replication to send the block data as a stream when 
> the block is large (building on the network changes in SPARK-6237).  This can 
> use the conf spark.maxRemoteBlockSizeFetchToMem to decide when to replicate 
> as a stream, the same as we do for fetching shuffle blocks and fetching 
> remote RDD blocks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24296) Support replicating blocks larger than 2 GB

2018-05-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16493864#comment-16493864
 ] 

Apache Spark commented on SPARK-24296:
--

User 'squito' has created a pull request for this issue:
https://github.com/apache/spark/pull/21451

> Support replicating blocks larger than 2 GB
> ---
>
> Key: SPARK-24296
> URL: https://issues.apache.org/jira/browse/SPARK-24296
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Priority: Major
>
> Replicating blocks sends the entire block data in one frame.  This results in 
> a failure on the receiving end for blocks larger than 2GB.
> We should change block replication to send the block data as a stream when 
> the block is large (building on the network changes in SPARK-6237).  This can 
> use the conf spark.maxRemoteBlockSizeFetchToMem to decide when to replicate 
> as a stream, the same as we do for fetching shuffle blocks and fetching 
> remote RDD blocks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24296) Support replicating blocks larger than 2 GB

2018-05-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24296:


Assignee: Apache Spark

> Support replicating blocks larger than 2 GB
> ---
>
> Key: SPARK-24296
> URL: https://issues.apache.org/jira/browse/SPARK-24296
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Assignee: Apache Spark
>Priority: Major
>
> Replicating blocks sends the entire block data in one frame.  This results in 
> a failure on the receiving end for blocks larger than 2GB.
> We should change block replication to send the block data as a stream when 
> the block is large (building on the network changes in SPARK-6237).  This can 
> use the conf spark.maxRemoteBlockSizeFetchToMem to decide when to replicate 
> as a stream, the same as we do for fetching shuffle blocks and fetching 
> remote RDD blocks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22947) SPIP: as-of join in Spark SQL

2018-05-29 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16493815#comment-16493815
 ] 

Li Jin edited comment on SPARK-22947 at 5/29/18 4:34 PM:
-

Hi [~TomaszGaweda], thanks for your interest! Yes, I am willing to work on this 
but need some help and engagement from Spark committers and PMCs to help move 
the discussion forward. 


was (Author: icexelloss):
Hi [~TomaszGaweda] thanks for your interest! Yes I am willing to work on this 
needs some help and engagement from Spark committers and PMCs to help move 
discussion forward. 

> SPIP: as-of join in Spark SQL
> -
>
> Key: SPARK-22947
> URL: https://issues.apache.org/jira/browse/SPARK-22947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Li Jin
>Priority: Major
> Attachments: SPIP_ as-of join in Spark SQL (1).pdf
>
>
> h2. Background and Motivation
> Time series analysis is one of the most common analysis on financial data. In 
> time series analysis, as-of join is a very common operation. Supporting as-of 
> join in Spark SQL will allow many use cases of using Spark SQL for time 
> series analysis.
> As-of join is “join on time” with inexact time matching criteria. Various 
> libraries have implemented as-of join or similar functionality:
> Kdb: https://code.kx.com/wiki/Reference/aj
> Pandas: 
> http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof
> R: This functionality is called “Last Observation Carried Forward”
> https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf
> JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin
> Flint: https://github.com/twosigma/flint#temporal-join-functions
> This proposal advocates introducing new API in Spark SQL to support as-of 
> join.
> h2. Target Personas
> Data scientists, data engineers
> h2. Goals
> * New API in Spark SQL that allows as-of join
> * As-of join of multiple tables (>2) should be performant, because it’s very 
> common that users need to join multiple data sources together for further 
> analysis.
> * Define Distribution, Partitioning and shuffle strategy for ordered time 
> series data
> h2. Non-Goals
> These are out of scope for the existing SPIP, should be considered in future 
> SPIP as improvement to Spark’s time series analysis ability:
> * Utilize partition information from data source, i.e, begin/end of each 
> partition to reduce sorting/shuffling
> * Define API for user to implement asof join time spec in business calendar 
> (i.e. lookback one business day, this is very common in financial data 
> analysis because of market calendars)
> * Support broadcast join
> h2. Proposed API Changes
> h3. TimeContext
> TimeContext is an object that defines the time scope of the analysis, it has 
> begin time (inclusive) and end time (exclusive). User should be able to 
> change the time scope of the analysis (i.e, from one month to five year) by 
> just changing the TimeContext. 
> To Spark engine, TimeContext is a hint that:
> can be used to repartition data for join
> serve as a predicate that can be pushed down to storage layer
> Time context is similar to filtering time by begin/end, the main difference 
> is that time context can be expanded based on the operation taken (see 
> example in as-of join).
> Time context example:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> {code}
> h3. asofJoin
> h4. User Case A (join without key)
> Join two DataFrames on time, with one day lookback:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = JoinSpec(timeContext).on("time").tolerance("-1day")
> result = dfA.asofJoin(dfB, joinSpec)
> {code}
> Example input/output:
> {code:java}
> dfA:
> time, quantity
> 20160101, 100
> 20160102, 50
> 20160104, -50
> 20160105, 100
> dfB:
> time, price
> 20151231, 100.0
> 20160104, 105.0
> 20160105, 102.0
> output:
> time, quantity, price
> 20160101, 100, 100.0
> 20160102, 50, null
> 20160104, -50, 105.0
> 20160105, 100, 102.0
> {code}
> Note row (20160101, 100) of dfA is joined with (20151231, 100.0) of dfB. This 
> is an important illustration of the time context - it is able to expand the 
> context to 20151231 on dfB because of the 1 day lookback.
> h4. Use Case B (join with key)
> To join on time and another key (for instance, id), we use “by” to specify 
> the key.
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = 
> JoinSpec(timeContext).on("time").by("id").tolerance("-1day")
> result = dfA.asofJoin(dfB, joinSpec)
> {code}
> Example input/output:
> {code:java}
> dfA:
> time, id, quantity
> 20160101, 1, 100
> 2016010

[jira] [Comment Edited] (SPARK-22947) SPIP: as-of join in Spark SQL

2018-05-29 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16493815#comment-16493815
 ] 

Li Jin edited comment on SPARK-22947 at 5/29/18 4:34 PM:
-

Hi [~TomaszGaweda], thanks for your interest! Yes, I am willing to work on this 
but need some help and involvement from Spark committers and PMCs to help move 
the discussion forward. 


was (Author: icexelloss):
Hi [~TomaszGaweda] thanks for your interest! Yes I am willing to work on this 
but needs some help and engagement from Spark committers and PMCs to help move 
discussion forward. 

> SPIP: as-of join in Spark SQL
> -
>
> Key: SPARK-22947
> URL: https://issues.apache.org/jira/browse/SPARK-22947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Li Jin
>Priority: Major
> Attachments: SPIP_ as-of join in Spark SQL (1).pdf
>
>
> h2. Background and Motivation
> Time series analysis is one of the most common analysis on financial data. In 
> time series analysis, as-of join is a very common operation. Supporting as-of 
> join in Spark SQL will allow many use cases of using Spark SQL for time 
> series analysis.
> As-of join is “join on time” with inexact time matching criteria. Various 
> libraries have implemented as-of join or similar functionality:
> Kdb: https://code.kx.com/wiki/Reference/aj
> Pandas: 
> http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof
> R: This functionality is called “Last Observation Carried Forward”
> https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf
> JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin
> Flint: https://github.com/twosigma/flint#temporal-join-functions
> This proposal advocates introducing a new API in Spark SQL to support as-of 
> joins.
> h2. Target Personas
> Data scientists, data engineers
> h2. Goals
> * A new API in Spark SQL that allows as-of joins
> * As-of joins of multiple tables (>2) should be performant, because it’s very 
> common that users need to join multiple data sources together for further 
> analysis.
> * Define Distribution, Partitioning and shuffle strategy for ordered time 
> series data
> h2. Non-Goals
> These are out of scope for this SPIP and should be considered in a future SPIP 
> as improvements to Spark’s time series analysis ability:
> * Utilize partition information from the data source, i.e., the begin/end of 
> each partition, to reduce sorting/shuffling
> * Define an API for users to implement the as-of join time spec on a business 
> calendar (e.g., look back one business day; this is very common in financial 
> data analysis because of market calendars)
> * Support broadcast join
> h2. Proposed API Changes
> h3. TimeContext
> TimeContext is an object that defines the time scope of the analysis. It has a 
> begin time (inclusive) and an end time (exclusive). Users should be able to 
> change the time scope of the analysis (e.g., from one month to five years) by 
> just changing the TimeContext. 
> To the Spark engine, TimeContext is a hint that:
> * can be used to repartition data for joins
> * can serve as a predicate that can be pushed down to the storage layer
> Time context is similar to filtering time by begin/end; the main difference 
> is that the time context can be expanded based on the operation taken (see the 
> example in as-of join).
> Time context example:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> {code}
> h3. asofJoin
> h4. Use Case A (join without key)
> Join two DataFrames on time, with one day lookback:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = JoinSpec(timeContext).on("time").tolerance("-1day")
> result = dfA.asofJoin(dfB, joinSpec)
> {code}
> Example input/output:
> {code:java}
> dfA:
> time, quantity
> 20160101, 100
> 20160102, 50
> 20160104, -50
> 20160105, 100
> dfB:
> time, price
> 20151231, 100.0
> 20160104, 105.0
> 20160105, 102.0
> output:
> time, quantity, price
> 20160101, 100, 100.0
> 20160102, 50, null
> 20160104, -50, 105.0
> 20160105, 100, 102.0
> {code}
> Note row (20160101, 100) of dfA is joined with (20151231, 100.0) of dfB. This 
> is an important illustration of the time context - it is able to expand the 
> context to 20151231 on dfB because of the 1 day lookback.
> h4. Use Case B (join with key)
> To join on time and another key (for instance, id), we use “by” to specify 
> the key.
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = 
> JoinSpec(timeContext).on("time").by("id").tolerance("-1day")
> result = dfA.asofJoin(dfB, joinSpec)
> {code}
> Example input/output:
> {code:java}
> dfA:
> time, id, quantity
> 20160101, 1, 100
> 20

[jira] [Commented] (SPARK-22947) SPIP: as-of join in Spark SQL

2018-05-29 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16493815#comment-16493815
 ] 

Li Jin commented on SPARK-22947:


Hi [~TomaszGaweda], thanks for your interest! Yes, I am willing to work on this 
but need some help and engagement from Spark committers and PMCs to help move 
the discussion forward. 
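
In the meantime, the one-day-lookback example in the description quoted below can 
be approximated with a plain range join followed by a window that keeps the most 
recent match per left-hand row. A minimal sketch (assuming a spark-shell session, 
distinct left-hand timestamps, and illustrative column names; this is not the 
proposed API):

{code:java}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val dfA = Seq(("20160101", 100), ("20160102", 50), ("20160104", -50), ("20160105", 100))
  .toDF("time", "quantity")
  .withColumn("ts", to_date($"time", "yyyyMMdd"))

val dfB = Seq(("20151231", 100.0), ("20160104", 105.0), ("20160105", 102.0))
  .toDF("btime", "price")
  .withColumn("bts", to_date($"btime", "yyyyMMdd"))

// Range join: keep every dfB row inside the one-day lookback window of each dfA row.
val candidates = dfA.join(dfB, $"bts" <= $"ts" && $"bts" >= date_sub($"ts", 1), "left")

// "As of" semantics: keep only the most recent dfB match per dfA row.
val w = Window.partitionBy($"ts").orderBy($"bts".desc)
val result = candidates
  .withColumn("rn", row_number().over(w))
  .where($"rn" === 1)
  .select($"time", $"quantity", $"price")

result.show()
{code}

This reproduces the expected output of Use Case A, but it materializes every 
candidate match and shuffles without any notion of a TimeContext, which is exactly 
the gap a native asofJoin would close.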

> SPIP: as-of join in Spark SQL
> -
>
> Key: SPARK-22947
> URL: https://issues.apache.org/jira/browse/SPARK-22947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Li Jin
>Priority: Major
> Attachments: SPIP_ as-of join in Spark SQL (1).pdf
>
>
> h2. Background and Motivation
> Time series analysis is one of the most common kinds of analysis on financial 
> data. In time series analysis, as-of join is a very common operation. Supporting 
> as-of join in Spark SQL will allow many use cases of using Spark SQL for time 
> series analysis.
> As-of join is “join on time” with inexact time matching criteria. Various 
> libraries have implemented as-of join or similar functionality:
> Kdb: https://code.kx.com/wiki/Reference/aj
> Pandas: 
> http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof
> R: This functionality is called “Last Observation Carried Forward”
> https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf
> JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin
> Flint: https://github.com/twosigma/flint#temporal-join-functions
> This proposal advocates introducing a new API in Spark SQL to support as-of 
> joins.
> h2. Target Personas
> Data scientists, data engineers
> h2. Goals
> * A new API in Spark SQL that allows as-of joins
> * As-of joins of multiple tables (>2) should be performant, because it’s very 
> common that users need to join multiple data sources together for further 
> analysis.
> * Define Distribution, Partitioning and shuffle strategy for ordered time 
> series data
> h2. Non-Goals
> These are out of scope for this SPIP and should be considered in a future SPIP 
> as improvements to Spark’s time series analysis ability:
> * Utilize partition information from the data source, i.e., the begin/end of 
> each partition, to reduce sorting/shuffling
> * Define an API for users to implement the as-of join time spec on a business 
> calendar (e.g., look back one business day; this is very common in financial 
> data analysis because of market calendars)
> * Support broadcast join
> h2. Proposed API Changes
> h3. TimeContext
> TimeContext is an object that defines the time scope of the analysis. It has a 
> begin time (inclusive) and an end time (exclusive). Users should be able to 
> change the time scope of the analysis (e.g., from one month to five years) by 
> just changing the TimeContext. 
> To the Spark engine, TimeContext is a hint that:
> * can be used to repartition data for joins
> * can serve as a predicate that can be pushed down to the storage layer
> Time context is similar to filtering time by begin/end; the main difference 
> is that the time context can be expanded based on the operation taken (see the 
> example in as-of join).
> Time context example:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> {code}
> h3. asofJoin
> h4. Use Case A (join without key)
> Join two DataFrames on time, with one day lookback:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = JoinSpec(timeContext).on("time").tolerance("-1day")
> result = dfA.asofJoin(dfB, joinSpec)
> {code}
> Example input/output:
> {code:java}
> dfA:
> time, quantity
> 20160101, 100
> 20160102, 50
> 20160104, -50
> 20160105, 100
> dfB:
> time, price
> 20151231, 100.0
> 20160104, 105.0
> 20160105, 102.0
> output:
> time, quantity, price
> 20160101, 100, 100.0
> 20160102, 50, null
> 20160104, -50, 105.0
> 20160105, 100, 102.0
> {code}
> Note row (20160101, 100) of dfA is joined with (20151231, 100.0) of dfB. This 
> is an important illustration of the time context - it is able to expand the 
> context to 20151231 on dfB because of the 1 day lookback.
> h4. Use Case B (join with key)
> To join on time and another key (for instance, id), we use “by” to specify 
> the key.
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = 
> JoinSpec(timeContext).on("time").by("id").tolerance("-1day")
> result = dfA.asofJoin(dfB, joinSpec)
> {code}
> Example input/output:
> {code:java}
> dfA:
> time, id, quantity
> 20160101, 1, 100
> 20160101, 2, 50
> 20160102, 1, -50
> 20160102, 2, 50
> dfB:
> time, id, price
> 20151231, 1, 100.0
> 20160102, 1, 105.0
> 20160102, 2, 195.0
> Output:
> time, id, quantity, price
> 20160101, 1, 100, 100.0
> 20160101, 2, 50, null
> 20160102, 1, -50, 105.0
> 2016010

[jira] [Comment Edited] (SPARK-22947) SPIP: as-of join in Spark SQL

2018-05-29 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16449935#comment-16449935
 ] 

Li Jin edited comment on SPARK-22947 at 5/29/18 4:33 PM:
-

I came across this blog today:

[https://databricks.com/blog/2018/03/13/introducing-stream-stream-joins-in-apache-spark-2-3.html]

And realized that the Ad Monetization example in the blog pretty much describes 
the as-of join case in streaming mode.
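
For reference, the time-range join condition from that blog post can already 
express the lookback part of an as-of join on streams; what it cannot express is 
"keep only the most recent match per row", which is the piece a native asofJoin 
would add. A minimal sketch (assuming a spark-shell session; the rate source and 
all column names are only illustrative stand-ins for real trade/quote streams):

{code:java}
import org.apache.spark.sql.functions.expr

// Two illustrative streams with event-time columns and watermarks.
val trades = spark.readStream.format("rate").load()
  .withColumnRenamed("timestamp", "tradeTime")
  .withColumnRenamed("value", "tradeId")
  .withWatermark("tradeTime", "1 day")

val quotes = spark.readStream.format("rate").load()
  .withColumnRenamed("timestamp", "quoteTime")
  .withColumnRenamed("value", "quoteId")
  .withWatermark("quoteTime", "1 day")

// Stream-stream join on a key plus a time-range condition: all quotes within a
// one-day lookback of each trade. An as-of join would additionally keep only the
// most recent quote per trade.
val joined = trades.join(
  quotes,
  expr("tradeId = quoteId AND quoteTime <= tradeTime AND quoteTime >= tradeTime - interval 1 day"))

// joined.writeStream.format("console").start()
{code}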

 

 


was (Author: icexelloss):
I came across this blog today:

[https://databricks.com/blog/2018/03/13/introducing-stream-stream-joins-in-apache-spark-2-3.html]

And realized that the Ad Monetization example in the blog pretty much describes 
the as-of join case in streaming mode.

 

 

> SPIP: as-of join in Spark SQL
> -
>
> Key: SPARK-22947
> URL: https://issues.apache.org/jira/browse/SPARK-22947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Li Jin
>Priority: Major
> Attachments: SPIP_ as-of join in Spark SQL (1).pdf
>
>
> h2. Background and Motivation
> Time series analysis is one of the most common kinds of analysis on financial 
> data. In time series analysis, as-of join is a very common operation. Supporting 
> as-of join in Spark SQL will allow many use cases of using Spark SQL for time 
> series analysis.
> As-of join is “join on time” with inexact time matching criteria. Various 
> libraries have implemented as-of join or similar functionality:
> Kdb: https://code.kx.com/wiki/Reference/aj
> Pandas: 
> http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof
> R: This functionality is called “Last Observation Carried Forward”
> https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf
> JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin
> Flint: https://github.com/twosigma/flint#temporal-join-functions
> This proposal advocates introducing a new API in Spark SQL to support as-of 
> joins.
> h2. Target Personas
> Data scientists, data engineers
> h2. Goals
> * A new API in Spark SQL that allows as-of joins
> * As-of joins of multiple tables (>2) should be performant, because it’s very 
> common that users need to join multiple data sources together for further 
> analysis.
> * Define Distribution, Partitioning and shuffle strategy for ordered time 
> series data
> h2. Non-Goals
> These are out of scope for this SPIP and should be considered in a future SPIP 
> as improvements to Spark’s time series analysis ability:
> * Utilize partition information from the data source, i.e., the begin/end of 
> each partition, to reduce sorting/shuffling
> * Define an API for users to implement the as-of join time spec on a business 
> calendar (e.g., look back one business day; this is very common in financial 
> data analysis because of market calendars)
> * Support broadcast join
> h2. Proposed API Changes
> h3. TimeContext
> TimeContext is an object that defines the time scope of the analysis. It has a 
> begin time (inclusive) and an end time (exclusive). Users should be able to 
> change the time scope of the analysis (e.g., from one month to five years) by 
> just changing the TimeContext. 
> To the Spark engine, TimeContext is a hint that:
> * can be used to repartition data for joins
> * can serve as a predicate that can be pushed down to the storage layer
> Time context is similar to filtering time by begin/end; the main difference 
> is that the time context can be expanded based on the operation taken (see the 
> example in as-of join).
> Time context example:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> {code}
> h3. asofJoin
> h4. Use Case A (join without key)
> Join two DataFrames on time, with one day lookback:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = JoinSpec(timeContext).on("time").tolerance("-1day")
> result = dfA.asofJoin(dfB, joinSpec)
> {code}
> Example input/output:
> {code:java}
> dfA:
> time, quantity
> 20160101, 100
> 20160102, 50
> 20160104, -50
> 20160105, 100
> dfB:
> time, price
> 20151231, 100.0
> 20160104, 105.0
> 20160105, 102.0
> output:
> time, quantity, price
> 20160101, 100, 100.0
> 20160102, 50, null
> 20160104, -50, 105.0
> 20160105, 100, 102.0
> {code}
> Note row (20160101, 100) of dfA is joined with (20151231, 100.0) of dfB. This 
> is an important illustration of the time context - it is able to expand the 
> context to 20151231 on dfB because of the 1 day lookback.
> h4. Use Case B (join with key)
> To join on time and another key (for instance, id), we use “by” to specify 
> the key.
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = 
> JoinSpec(timeContext).on("time").by("id").tolerance("-1day")
> resu

[jira] [Commented] (SPARK-24319) run-example can not print usage

2018-05-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16493759#comment-16493759
 ] 

Apache Spark commented on SPARK-24319:
--

User 'gaborgsomogyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/21450

> run-example can not print usage
> ---
>
> Key: SPARK-24319
> URL: https://issues.apache.org/jira/browse/SPARK-24319
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Priority: Minor
>
> Running "bin/run-example" with no args or with "–help" will not print usage 
> and just gives the error
> {noformat}
> $ bin/run-example
> Exception in thread "main" java.lang.IllegalArgumentException: Missing 
> application resource.
>     at 
> org.apache.spark.launcher.CommandBuilderUtils.checkArgument(CommandBuilderUtils.java:241)
>     at 
> org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitArgs(SparkSubmitCommandBuilder.java:181)
>     at 
> org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitCommand(SparkSubmitCommandBuilder.java:296)
>     at 
> org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(SparkSubmitCommandBuilder.java:162)
>     at org.apache.spark.launcher.Main.main(Main.java:86){noformat}
> It looks like there is an env var in the script that shows usage, but it's 
> getting preempted by something else.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24319) run-example can not print usage

2018-05-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24319:


Assignee: Apache Spark

> run-example can not print usage
> ---
>
> Key: SPARK-24319
> URL: https://issues.apache.org/jira/browse/SPARK-24319
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Apache Spark
>Priority: Minor
>
> Running "bin/run-example" with no args or with "–help" will not print usage 
> and just gives the error
> {noformat}
> $ bin/run-example
> Exception in thread "main" java.lang.IllegalArgumentException: Missing 
> application resource.
>     at 
> org.apache.spark.launcher.CommandBuilderUtils.checkArgument(CommandBuilderUtils.java:241)
>     at 
> org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitArgs(SparkSubmitCommandBuilder.java:181)
>     at 
> org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitCommand(SparkSubmitCommandBuilder.java:296)
>     at 
> org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(SparkSubmitCommandBuilder.java:162)
>     at org.apache.spark.launcher.Main.main(Main.java:86){noformat}
> It looks like there is an env var in the script that shows usage, but it's 
> getting preempted by something else.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24319) run-example can not print usage

2018-05-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24319:


Assignee: (was: Apache Spark)

> run-example can not print usage
> ---
>
> Key: SPARK-24319
> URL: https://issues.apache.org/jira/browse/SPARK-24319
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Priority: Minor
>
> Running "bin/run-example" with no args or with "–help" will not print usage 
> and just gives the error
> {noformat}
> $ bin/run-example
> Exception in thread "main" java.lang.IllegalArgumentException: Missing 
> application resource.
>     at 
> org.apache.spark.launcher.CommandBuilderUtils.checkArgument(CommandBuilderUtils.java:241)
>     at 
> org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitArgs(SparkSubmitCommandBuilder.java:181)
>     at 
> org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitCommand(SparkSubmitCommandBuilder.java:296)
>     at 
> org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(SparkSubmitCommandBuilder.java:162)
>     at org.apache.spark.launcher.Main.main(Main.java:86){noformat}
> It looks like there is an env var in the script that shows usage, but it's 
> getting preempted by something else.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24410) Missing optimization for Union on bucketed tables

2018-05-29 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16493637#comment-16493637
 ] 

Wenchen Fan commented on SPARK-24410:
-

The `UnionExec#outputPartitioning` should be smarter and propagate the 
children's output partitioning if possible. cc [~viirya]
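
A quick way to see the current behavior from the user side (a sketch, assuming a 
spark-shell session with the a1/a2 tables from the description below, and that 
spark.sql.sources.bucketing.enabled is left at its default):

{code:java}
// The children should report HashPartitioning on "key" (from the bucketed scans),
// while the union node itself still falls back to an unknown partitioning today.
val unioned = spark.table("a1").union(spark.table("a2"))
val plan = unioned.queryExecution.executedPlan
println(plan.outputPartitioning)
plan.children.foreach(c => println(c.outputPartitioning))
{code}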

> Missing optimization for Union on bucketed tables
> -
>
> Key: SPARK-24410
> URL: https://issues.apache.org/jira/browse/SPARK-24410
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ohad Raviv
>Priority: Major
>
> A common use case we have is a partially aggregated table plus daily 
> increments that we need to further aggregate. We do this by unioning the two 
> tables and aggregating again.
> We tried to optimize this process by bucketing the tables, but currently it 
> seems that the union operator doesn't leverage the bucketing of the tables 
> (the way the join operator does).
> for example, for two bucketed tables a1,a2:
> {code}
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
> .repartition(col("key"))
> .write
>   .mode(SaveMode.Overwrite)
> .bucketBy(3, "key")
> .sortBy("t1")
> .saveAsTable("a1")
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
>   .repartition(col("key"))
>   .write.mode(SaveMode.Overwrite)
>   .bucketBy(3, "key")
>   .sortBy("t1")
>   .saveAsTable("a2")
> {code}
> for the join query we get the "SortMergeJoin"
> {code}
> select * from a1 join a2 on (a1.key=a2.key)
> == Physical Plan ==
> *(3) SortMergeJoin [key#24L], [key#27L], Inner
> :- *(1) Sort [key#24L ASC NULLS FIRST], false, 0
> :  +- *(1) Project [key#24L, t1#25L, t2#26L]
> : +- *(1) Filter isnotnull(key#24L)
> :+- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], 
> PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct
> +- *(2) Sort [key#27L ASC NULLS FIRST], false, 0
>+- *(2) Project [key#27L, t1#28L, t2#29L]
>   +- *(2) Filter isnotnull(key#27L)
>  +- *(2) FileScan parquet default.a2[key#27L,t1#28L,t2#29L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], 
> PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct
> {code}
> but for aggregation after union we get a shuffle:
> {code}
> select key,count(*) from (select * from a1 union all select * from a2)z group 
> by key
> == Physical Plan ==
> *(4) HashAggregate(keys=[key#25L], functions=[count(1)], output=[key#25L, 
> count(1)#36L])
> +- Exchange hashpartitioning(key#25L, 1)
>+- *(3) HashAggregate(keys=[key#25L], functions=[partial_count(1)], 
> output=[key#25L, count#38L])
>   +- Union
>  :- *(1) Project [key#25L]
>  :  +- *(1) FileScan parquet default.a1[key#25L] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
>  +- *(2) Project [key#28L]
> +- *(2) FileScan parquet default.a2[key#28L] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24373) "df.cache() df.count()" no longer eagerly caches data when the analyzed plans are different after re-analyzing the plans

2018-05-29 Thread Wenbo Zhao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16493569#comment-16493569
 ] 

Wenbo Zhao commented on SPARK-24373:


[~mgaido] Thanks. I didn't read the comment carefully. 
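
For anyone verifying this, a quick way to check whether a derived query will 
actually read the cached data (a minimal sketch, assuming the spark-shell 
reproduction quoted below has been run, so df1 is cached) is to look for 
InMemoryTableScan in its executed plan:

{code:java}
// df1 comes from the reproduction below; df1.cache and an action have already run.
val plan = df1.groupBy().count().queryExecution.executedPlan.toString
println(s"reads cached data: ${plan.contains("InMemoryTableScan")}")
{code}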

> "df.cache() df.count()" no longer eagerly caches data when the analyzed plans 
> are different after re-analyzing the plans
> 
>
> Key: SPARK-24373
> URL: https://issues.apache.org/jira/browse/SPARK-24373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenbo Zhao
>Assignee: Marco Gaido
>Priority: Blocker
> Fix For: 2.3.1, 2.4.0
>
>
> Here is the code to reproduce in local mode
> {code:java}
> scala> val df = sc.range(1, 2).toDF
> df: org.apache.spark.sql.DataFrame = [value: bigint]
> scala> val myudf = udf({x: Long => println(""); x + 1})
> myudf: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,LongType,Some(List(LongType)))
> scala> val df1 = df.withColumn("value1", myudf(col("value")))
> df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint]
> scala> df1.cache
> res0: df1.type = [value: bigint, value1: bigint]
> scala> df1.count
> res1: Long = 1 
> scala> df1.count
> res2: Long = 1
> scala> df1.count
> res3: Long = 1
> {code}
>  
> In Spark 2.2, you could see that it prints "". 
> In the above example, when you do explain(), you could see:
> {code:java}
> scala> df1.explain(true)
> == Parsed Logical Plan ==
> 'Project [value#2L, UDF('value) AS value1#5]
> +- AnalysisBarrier
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> value: bigint, value1: bigint
> Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> == Physical Plan ==
> *(1) InMemoryTableScan [value#2L, value1#5L]
> +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
> but the InMemoryTableScan is missing in the following explain():
> {code:java}
> scala> df1.groupBy().count().explain(true)
> == Parsed Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 
> value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> count: bigint
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) 
> null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Physical Plan ==
> *(2) HashAggregate(keys=[], functions=[count(1)], output=[count#170L])
> +- Exchange SinglePartition
> +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#175L])
> +- *(1) Project
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-24373) "df.cache() df.count()" no longer eagerly caches data when the analyzed plans are different after re-analyzing the plans

2018-05-29 Thread Wenbo Zhao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenbo Zhao updated SPARK-24373:
---
Comment: was deleted

(was: Same question as [~icexelloss]. Also, is there any plan to make your fix 
more complete, e.g., to also fix {{flatMapGroupsInPandas}}?)

> "df.cache() df.count()" no longer eagerly caches data when the analyzed plans 
> are different after re-analyzing the plans
> 
>
> Key: SPARK-24373
> URL: https://issues.apache.org/jira/browse/SPARK-24373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenbo Zhao
>Assignee: Marco Gaido
>Priority: Blocker
> Fix For: 2.3.1, 2.4.0
>
>
> Here is the code to reproduce in local mode
> {code:java}
> scala> val df = sc.range(1, 2).toDF
> df: org.apache.spark.sql.DataFrame = [value: bigint]
> scala> val myudf = udf({x: Long => println(""); x + 1})
> myudf: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,LongType,Some(List(LongType)))
> scala> val df1 = df.withColumn("value1", myudf(col("value")))
> df1: org.apache.spark.sql.DataFrame = [value: bigint, value1: bigint]
> scala> df1.cache
> res0: df1.type = [value: bigint, value1: bigint]
> scala> df1.count
> res1: Long = 1 
> scala> df1.count
> res2: Long = 1
> scala> df1.count
> res3: Long = 1
> {code}
>  
> In Spark 2.2, you could see that it prints "". 
> In the above example, when you do explain(), you could see:
> {code:java}
> scala> df1.explain(true)
> == Parsed Logical Plan ==
> 'Project [value#2L, UDF('value) AS value1#5]
> +- AnalysisBarrier
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> value: bigint, value1: bigint
> Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> == Physical Plan ==
> *(1) InMemoryTableScan [value#2L, value1#5L]
> +- InMemoryRelation [value#2L, value1#5L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> +- *(1) Project [value#2L, UDF(value#2L) AS value1#5L]
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
> but the InMemoryTableScan is missing in the following explain():
> {code:java}
> scala> df1.groupBy().count().explain(true)
> == Parsed Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else UDF(value#2L) AS 
> value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Analyzed Logical Plan ==
> count: bigint
> Aggregate [count(1) AS count#170L]
> +- Project [value#2L, if (isnull(value#2L)) null else if (isnull(value#2L)) 
> null else UDF(value#2L) AS value1#5L]
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Optimized Logical Plan ==
> Aggregate [count(1) AS count#170L]
> +- Project
> +- SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- ExternalRDD [obj#1L]
> == Physical Plan ==
> *(2) HashAggregate(keys=[], functions=[count(1)], output=[count#170L])
> +- Exchange SinglePartition
> +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#175L])
> +- *(1) Project
> +- *(1) SerializeFromObject [input[0, bigint, false] AS value#2L]
> +- Scan ExternalRDDScan[obj#1L]
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24410) Missing optimization for Union on bucketed tables

2018-05-29 Thread Ohad Raviv (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16493561#comment-16493561
 ] 

Ohad Raviv commented on SPARK-24410:


[~sowen], [~cloud_fan] - could you please check whether my assessment is correct? 
Thanks!

> Missing optimization for Union on bucketed tables
> -
>
> Key: SPARK-24410
> URL: https://issues.apache.org/jira/browse/SPARK-24410
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ohad Raviv
>Priority: Major
>
> A common use case we have is a partially aggregated table plus daily 
> increments that we need to further aggregate. We do this by unioning the two 
> tables and aggregating again.
> We tried to optimize this process by bucketing the tables, but currently it 
> seems that the union operator doesn't leverage the bucketing of the tables 
> (the way the join operator does).
> for example, for two bucketed tables a1,a2:
> {code}
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
> .repartition(col("key"))
> .write
>   .mode(SaveMode.Overwrite)
> .bucketBy(3, "key")
> .sortBy("t1")
> .saveAsTable("a1")
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
>   .repartition(col("key"))
>   .write.mode(SaveMode.Overwrite)
>   .bucketBy(3, "key")
>   .sortBy("t1")
>   .saveAsTable("a2")
> {code}
> for the join query we get the "SortMergeJoin"
> {code}
> select * from a1 join a2 on (a1.key=a2.key)
> == Physical Plan ==
> *(3) SortMergeJoin [key#24L], [key#27L], Inner
> :- *(1) Sort [key#24L ASC NULLS FIRST], false, 0
> :  +- *(1) Project [key#24L, t1#25L, t2#26L]
> : +- *(1) Filter isnotnull(key#24L)
> :+- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], 
> PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct
> +- *(2) Sort [key#27L ASC NULLS FIRST], false, 0
>+- *(2) Project [key#27L, t1#28L, t2#29L]
>   +- *(2) Filter isnotnull(key#27L)
>  +- *(2) FileScan parquet default.a2[key#27L,t1#28L,t2#29L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], 
> PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct
> {code}
> but for aggregation after union we get a shuffle:
> {code}
> select key,count(*) from (select * from a1 union all select * from a2)z group 
> by key
> == Physical Plan ==
> *(4) HashAggregate(keys=[key#25L], functions=[count(1)], output=[key#25L, 
> count(1)#36L])
> +- Exchange hashpartitioning(key#25L, 1)
>+- *(3) HashAggregate(keys=[key#25L], functions=[partial_count(1)], 
> output=[key#25L, count#38L])
>   +- Union
>  :- *(1) Project [key#25L]
>  :  +- *(1) FileScan parquet default.a1[key#25L] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
>  +- *(2) Project [key#28L]
> +- *(2) FileScan parquet default.a2[key#28L] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24410) Missing optimization for Union on bucketed tables

2018-05-29 Thread Ohad Raviv (JIRA)
Ohad Raviv created SPARK-24410:
--

 Summary: Missing optimization for Union on bucketed tables
 Key: SPARK-24410
 URL: https://issues.apache.org/jira/browse/SPARK-24410
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Ohad Raviv


A common use case we have is a partially aggregated table plus daily 
increments that we need to further aggregate. We do this by unioning the two 
tables and aggregating again.
We tried to optimize this process by bucketing the tables, but currently it 
seems that the union operator doesn't leverage the bucketing of the tables (the 
way the join operator does).

for example, for two bucketed tables a1,a2:

{code}
sparkSession.range(N).selectExpr(
  "id as key",
  "id % 2 as t1",
  "id % 3 as t2")
.repartition(col("key"))
.write
  .mode(SaveMode.Overwrite)
.bucketBy(3, "key")
.sortBy("t1")
.saveAsTable("a1")

sparkSession.range(N).selectExpr(
  "id as key",
  "id % 2 as t1",
  "id % 3 as t2")
  .repartition(col("key"))
  .write.mode(SaveMode.Overwrite)
  .bucketBy(3, "key")
  .sortBy("t1")
  .saveAsTable("a2")

{code}
for the join query we get the "SortMergeJoin"
{code}
select * from a1 join a2 on (a1.key=a2.key)

== Physical Plan ==
*(3) SortMergeJoin [key#24L], [key#27L], Inner
:- *(1) Sort [key#24L ASC NULLS FIRST], false, 0
:  +- *(1) Project [key#24L, t1#25L, t2#26L]
: +- *(1) Filter isnotnull(key#24L)
:+- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: 
true, Format: Parquet, Location: 
InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], 
PushedFilters: [IsNotNull(key)], ReadSchema: 
struct
+- *(2) Sort [key#27L ASC NULLS FIRST], false, 0
   +- *(2) Project [key#27L, t1#28L, t2#29L]
  +- *(2) Filter isnotnull(key#27L)
 +- *(2) FileScan parquet default.a2[key#27L,t1#28L,t2#29L] Batched: 
true, Format: Parquet, Location: 
InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], 
PushedFilters: [IsNotNull(key)], ReadSchema: 
struct
{code}

but for aggregation after union we get a shuffle:
{code}
select key,count(*) from (select * from a1 union all select * from a2)z group 
by key

== Physical Plan ==
*(4) HashAggregate(keys=[key#25L], functions=[count(1)], output=[key#25L, 
count(1)#36L])
+- Exchange hashpartitioning(key#25L, 1)
   +- *(3) HashAggregate(keys=[key#25L], functions=[partial_count(1)], 
output=[key#25L, count#38L])
  +- Union
 :- *(1) Project [key#25L]
 :  +- *(1) FileScan parquet default.a1[key#25L] Batched: true, Format: 
Parquet, Location: InMemoryFileIndex[file:/some/where/spark-warehouse/a1], 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct
 +- *(2) Project [key#28L]
+- *(2) FileScan parquet default.a2[key#28L] Batched: true, Format: 
Parquet, Location: InMemoryFileIndex[file:/some/where/spark-warehouse/a2], 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct
{code}
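
Until the union operator propagates the children's partitioning, one partial 
workaround is to aggregate each bucketed table on its own (those per-table 
aggregations should be able to reuse the bucketing and avoid an Exchange) and only 
shuffle the already-aggregated rows for the final merge. A sketch, assuming the 
a1/a2 tables created above:

{code:java}
import org.apache.spark.sql.functions.sum

// Per-table aggregation on the bucketing column "key"; no Exchange is expected here.
val agg1 = sparkSession.table("a1").groupBy("key").count()
val agg2 = sparkSession.table("a2").groupBy("key").count()

// The final merge still shuffles, but only the pre-aggregated rows.
val merged = agg1.union(agg2)
  .groupBy("key")
  .agg(sum("count").as("count"))

merged.explain(true)
{code}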





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


