[jira] [Commented] (SPARK-23711) Add fallback to interpreted execution logic

2018-04-18 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443588#comment-16443588
 ] 

Takeshi Yamamuro commented on SPARK-23711:
--

+1, I also think so.

> Add fallback to interpreted execution logic
> ---
>
> Key: SPARK-23711
> URL: https://issues.apache.org/jira/browse/SPARK-23711
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>







[jira] [Commented] (SPARK-23711) Add fallback to interpreted execution logic

2018-04-18 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443573#comment-16443573
 ] 

Liang-Chi Hsieh commented on SPARK-23711:
-

About this, I suppose we already have logic in {{WholeStageCodegenExec}} to 
fall back to interpreted mode? Are there other places where we need to add this 
kind of fallback?
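
For illustration, a minimal sketch of the codegen-with-fallback pattern under 
discussion; this is not Spark's actual implementation, and the helper name and 
the placeholder projection functions are hypothetical:

{code:java}
// Hypothetical helper showing the fallback pattern: try the code-generated path,
// and fall back to the interpreted path if code generation or compilation fails.
def withCodegenFallback[T](codegenPath: => T)(interpretedPath: => T): T = {
  try {
    codegenPath
  } catch {
    case e: Exception =>
      // In practice this would be logged and possibly guarded by a config flag.
      interpretedPath
  }
}

// Usage sketch: prefer a compiled projection, otherwise an interpreted one.
// `compileProjection()` and `interpretProjection()` are placeholders.
// val projection = withCodegenFallback(compileProjection())(interpretProjection())
{code}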

> Add fallback to interpreted execution logic
> ---
>
> Key: SPARK-23711
> URL: https://issues.apache.org/jira/browse/SPARK-23711
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>







[jira] [Updated] (SPARK-24008) SQL/Hive Context fails with NullPointerException

2018-04-18 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated SPARK-24008:
--
Description: 
SQL / Hive Context fails with a NullPointerException while getting configuration 
from SQLConf. This happens when the MemoryStore is filled with many broadcasts 
and has started dropping blocks, and then a SQL / Hive Context is created and 
broadcast. Using this Context to access a table then fails with the 
NullPointerException below.

A repro is attached: a Spark example that fills the MemoryStore with broadcasts 
and then creates and accesses a SQL Context.

{code}
java.lang.NullPointerException
at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638)
at org.apache.spark.sql.SQLConf.defaultDataSourceName(SQLConf.scala:558)
at 
org.apache.spark.sql.DataFrameReader.(DataFrameReader.scala:362)
at org.apache.spark.sql.SQLContext.read(SQLContext.scala:623)
at SparkHiveExample$.main(SparkHiveExample.scala:76)
at SparkHiveExample.main(SparkHiveExample.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)


18/04/06 14:17:42 ERROR ApplicationMaster: User class threw exception: 
java.lang.NullPointerException 
java.lang.NullPointerException 
at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638) 
at org.apache.spark.sql.SQLContext.getConf(SQLContext.scala:153) 
at 
org.apache.spark.sql.hive.HiveContext.hiveMetastoreVersion(HiveContext.scala:166)
 
at 
org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:258)
 
at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:255) 
at org.apache.spark.sql.hive.HiveContext$$anon$2.(HiveContext.scala:475) 
at 
org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:475) 
at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:474) 
at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:90) 
at org.apache.spark.sql.SQLContext.table(SQLContext.scala:831) 
at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827) 

{code}

The MemoryStore filled up and started dropping blocks.
{code}
18/04/17 08:03:43 INFO MemoryStore: 2 blocks selected for dropping
18/04/17 08:03:43 INFO MemoryStore: Block broadcast_14 stored as values in 
memory (estimated size 78.1 MB, free 64.4 MB)
18/04/17 08:03:43 INFO MemoryStore: Block broadcast_14_piece0 stored as bytes 
in memory (estimated size 1522.0 B, free 64.4 MB)
18/04/17 08:03:43 INFO MemoryStore: Block broadcast_15 stored as values in 
memory (estimated size 350.9 KB, free 64.1 MB)
18/04/17 08:03:43 INFO MemoryStore: Block broadcast_15_piece0 stored as bytes 
in memory (estimated size 29.9 KB, free 64.0 MB)
18/04/17 08:03:43 INFO MemoryStore: 10 blocks selected for dropping
18/04/17 08:03:43 INFO MemoryStore: Block broadcast_16 stored as values in 
memory (estimated size 78.1 MB, free 64.7 MB)
18/04/17 08:03:43 INFO MemoryStore: Block broadcast_16_piece0 stored as bytes 
in memory (estimated size 1522.0 B, free 64.7 MB)
18/04/17 08:03:43 INFO MemoryStore: Block broadcast_1 stored as values in 
memory (estimated size 136.0 B, free 64.7 MB)
18/04/17 08:03:44 INFO MemoryStore: MemoryStore cleared
18/04/17 08:03:20 INFO MemoryStore: MemoryStore started with capacity 511.1 MB
18/04/17 08:03:44 INFO MemoryStore: MemoryStore cleared
18/04/17 08:03:57 INFO MemoryStore: MemoryStore started with capacity 511.1 MB
18/04/17 08:04:23 INFO MemoryStore: MemoryStore cleared
{code}


The fix is to stop broadcasting the SQL/Hive Context, or to increase the driver memory.
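
For illustration, a sketch of the usage pattern that avoids the problem, based 
on the fix described above. The table name and lookup data are made up, and the 
code uses the Spark 1.6-style context API:

{code:java}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object NoContextBroadcast {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("no-context-broadcast"))
    // Keep the HiveContext on the driver; do not wrap it in sc.broadcast(...).
    val hiveContext = new HiveContext(sc)

    // Broadcast plain data (e.g. a lookup map) instead of the context itself.
    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

    // Use the driver-side context directly to access tables.
    val df = hiveContext.table("some_table")   // hypothetical table name
    println(df.count())
    println(lookup.value.getOrElse("a", 0))

    sc.stop()
  }
}
{code}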


  was:
SQL / Hive Context fails with NullPointerException while getting configuration 
from SQLConf. This happens when the MemoryStore is filled with lot of broadcast 
and started dropping and then SQL / Hive Context is created and broadcast. When 
using this Context to access a table fails with below NullPointerException.

Repro is attached - the Spark Example which fills the MemoryStore with 
broadcasts and then creates and accesses a SQL Context.

{code}
java.lang.NullPointerException
at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638)
at org.apache.spark.sql.SQLConf.defaultDataSourceName(SQLConf.scala:558)
at 
org.apache.spark.sql.DataFrameReader.(DataFrameReader.scala:362)
at org.apache.spark.sql.SQLContext.read(SQLContext.scala:623)
at SparkHiveExample$.main(SparkHiveExample.scala:76)
at SparkHiveExample.main(SparkHiveExample.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 

[jira] [Commented] (SPARK-24015) Does method SchedulerBackend.isReady is really well implemented?

2018-04-18 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443509#comment-16443509
 ] 

Hyukjin Kwon commented on SPARK-24015:
--

[~chenk], is this a question? I think you would get a better answer if you ask 
it on the mailing list; we don't usually open a JIRA for a question. It's 
better to ask on the mailing list first before describing it here as a bug.

> Does method SchedulerBackend.isReady is really well implemented?
> 
>
> Key: SPARK-24015
> URL: https://issues.apache.org/jira/browse/SPARK-24015
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.0
>Reporter: chenk
>Priority: Minor
>
> The method *isReady* in the interface *SchedulerBackend*, which is implemented 
> by *CoarseGrainedSchedulerBackend*, has a default value of true.
> The implementation in *CoarseGrainedSchedulerBackend* uses the 
> *sufficientResourcesRegistered* method to check it, which also defaults to 
> true.
> So, what is the meaning of the *isReady* method? It will always return true.
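
For context, a minimal standalone sketch (not Spark's code; all names are 
illustrative) of how a backend can make isReady meaningful by overriding 
sufficientResourcesRegistered, e.g. to wait for a fraction of the expected 
executors:

{code:java}
// Illustrative only: the default sufficientResourcesRegistered() returns true,
// so isReady() is a no-op until a subclass-style check like this is supplied.
class SketchBackend(expectedExecutors: Int, minRegisteredRatio: Double = 0.8) {
  private var registeredExecutors = 0

  def executorRegistered(): Unit = synchronized { registeredExecutors += 1 }

  // A cluster-manager-specific backend would replace the always-true default
  // with a real readiness check such as this one.
  def sufficientResourcesRegistered(): Boolean = synchronized {
    registeredExecutors >= math.ceil(expectedExecutors * minRegisteredRatio).toInt
  }

  def isReady(): Boolean = sufficientResourcesRegistered()
}
{code}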






[jira] [Resolved] (SPARK-24015) Does method SchedulerBackend.isReady is really well implemented?

2018-04-18 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24015.
--
Resolution: Invalid

Let me leave this resolved but please let me know if I am mistaken.

> Does method SchedulerBackend.isReady is really well implemented?
> 
>
> Key: SPARK-24015
> URL: https://issues.apache.org/jira/browse/SPARK-24015
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.0
>Reporter: chenk
>Priority: Minor
>
> The method *isReady* in the interface *SchedulerBackend*, which is implemented 
> by *CoarseGrainedSchedulerBackend*, has a default value of true.
> The implementation in *CoarseGrainedSchedulerBackend* uses the 
> *sufficientResourcesRegistered* method to check it, which also defaults to 
> true.
> So, what is the meaning of the *isReady* method? It will always return true.






[jira] [Resolved] (SPARK-23919) High-order function: array_position(x, element) → bigint

2018-04-18 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-23919.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21037
[https://github.com/apache/spark/pull/21037]

> High-order function: array_position(x, element) → bigint
> 
>
> Key: SPARK-23919
> URL: https://issues.apache.org/jira/browse/SPARK-23919
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.4.0
>
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Returns the position of the first occurrence of the element in array x (or 0 
> if not found).
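
For illustration, the expected SQL usage once this lands in Spark 2.4.0, 
following the Presto semantics quoted above (1-based position, 0 if absent); 
`spark` is assumed to be an active SparkSession:

{code:java}
spark.sql("SELECT array_position(array(3, 1, 4, 1, 5), 1) AS pos").show()
// pos = 2  (first occurrence, 1-based)
spark.sql("SELECT array_position(array(3, 1, 4, 1, 5), 9) AS pos").show()
// pos = 0  (element not found)
{code}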






[jira] [Assigned] (SPARK-23919) High-order function: array_position(x, element) → bigint

2018-04-18 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin reassigned SPARK-23919:
-

Assignee: Kazuaki Ishizaki

> High-order function: array_position(x, element) → bigint
> 
>
> Key: SPARK-23919
> URL: https://issues.apache.org/jira/browse/SPARK-23919
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Kazuaki Ishizaki
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Returns the position of the first occurrence of the element in array x (or 0 
> if not found).






[jira] [Updated] (SPARK-24021) Fix bug in BlacklistTracker's updateBlacklistForFetchFailure

2018-04-18 Thread wuyi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-24021:
-
Description: 
There's a mistake in BlacklistTracker's updateBlacklistForFetchFailure:

 
{code:java}
val blacklistedExecsOnNode =
    nodeToBlacklistedExecs.getOrElseUpdate(exec, HashSet[String]())
blacklistedExecsOnNode += exec{code}
 

where the first *exec* should be *host*.
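
A self-contained sketch of the intended behaviour after the fix (illustrative 
values; not the actual BlacklistTracker code):

{code:java}
import scala.collection.mutable.{HashMap, HashSet}

val nodeToBlacklistedExecs = HashMap.empty[String, HashSet[String]]
val host = "host-1"   // the node the failed executor runs on (illustrative)
val exec = "exec-7"   // the executor id to blacklist (illustrative)

// Key the outer map by host, then add the executor id to that host's set.
val blacklistedExecsOnNode =
  nodeToBlacklistedExecs.getOrElseUpdate(host, HashSet[String]())
blacklistedExecsOnNode += exec
// nodeToBlacklistedExecs == Map("host-1" -> Set("exec-7"))
{code}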

  was:
There‘s a miswrite in BlacklistTracker's updateBlacklistForFetchFailure:

```

val blacklistedExecsOnNode =

    nodeToBlacklistedExecs.getOrElseUpdate(exec, HashSet[String]())
blacklistedExecsOnNode += exec

```

where first **exec** should be **host**.


> Fix bug in BlacklistTracker's updateBlacklistForFetchFailure
> 
>
> Key: SPARK-24021
> URL: https://issues.apache.org/jira/browse/SPARK-24021
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: wuyi
>Priority: Major
>  Labels: easyfix
>
> There's a mistake in BlacklistTracker's updateBlacklistForFetchFailure:
>  
> {code:java}
> val blacklistedExecsOnNode =
>     nodeToBlacklistedExecs.getOrElseUpdate(exec, HashSet[String]())
> blacklistedExecsOnNode += exec{code}
>  
> where the first *exec* should be *host*.






[jira] [Assigned] (SPARK-24021) Fix bug in BlacklistTracker's updateBlacklistForFetchFailure

2018-04-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24021:


Assignee: Apache Spark

> Fix bug in BlacklistTracker's updateBlacklistForFetchFailure
> 
>
> Key: SPARK-24021
> URL: https://issues.apache.org/jira/browse/SPARK-24021
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: wuyi
>Assignee: Apache Spark
>Priority: Major
>  Labels: easyfix
>
> There‘s a miswrite in BlacklistTracker's updateBlacklistForFetchFailure:
> ```
> val blacklistedExecsOnNode =
>     nodeToBlacklistedExecs.getOrElseUpdate(exec, HashSet[String]())
> blacklistedExecsOnNode += exec
> ```
> where first **exec** should be **host**.






[jira] [Assigned] (SPARK-24021) Fix bug in BlacklistTracker's updateBlacklistForFetchFailure

2018-04-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24021:


Assignee: (was: Apache Spark)

> Fix bug in BlacklistTracker's updateBlacklistForFetchFailure
> 
>
> Key: SPARK-24021
> URL: https://issues.apache.org/jira/browse/SPARK-24021
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: wuyi
>Priority: Major
>  Labels: easyfix
>
> There‘s a miswrite in BlacklistTracker's updateBlacklistForFetchFailure:
> ```
> val blacklistedExecsOnNode =
>     nodeToBlacklistedExecs.getOrElseUpdate(exec, HashSet[String]())
> blacklistedExecsOnNode += exec
> ```
> where first **exec** should be **host**.






[jira] [Commented] (SPARK-24021) Fix bug in BlacklistTracker's updateBlacklistForFetchFailure

2018-04-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443496#comment-16443496
 ] 

Apache Spark commented on SPARK-24021:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/21104

> Fix bug in BlacklistTracker's updateBlacklistForFetchFailure
> 
>
> Key: SPARK-24021
> URL: https://issues.apache.org/jira/browse/SPARK-24021
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: wuyi
>Priority: Major
>  Labels: easyfix
>
> There‘s a miswrite in BlacklistTracker's updateBlacklistForFetchFailure:
> ```
> val blacklistedExecsOnNode =
>     nodeToBlacklistedExecs.getOrElseUpdate(exec, HashSet[String]())
> blacklistedExecsOnNode += exec
> ```
> where first **exec** should be **host**.






[jira] [Created] (SPARK-24021) Fix bug in BlacklistTracker's updateBlacklistForFetchFailure

2018-04-18 Thread wuyi (JIRA)
wuyi created SPARK-24021:


 Summary: Fix bug in BlacklistTracker's 
updateBlacklistForFetchFailure
 Key: SPARK-24021
 URL: https://issues.apache.org/jira/browse/SPARK-24021
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: wuyi


There‘s a miswrite in BlacklistTracker's updateBlacklistForFetchFailure:

```

val blacklistedExecsOnNode =

    nodeToBlacklistedExecs.getOrElseUpdate(exec, HashSet[String]())
blacklistedExecsOnNode += exec

```

where first **exec** should be **host**.






[jira] [Commented] (SPARK-24008) SQL/Hive Context fails with NullPointerException

2018-04-18 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443495#comment-16443495
 ] 

Saisai Shao commented on SPARK-24008:
-

The case you provided doesn't seem valid. You're trying to broadcast the SQL 
entry point and RDDs, which should not be broadcast. I'm not sure whether the 
issue here is anything more than this invalid usage pattern. If you hit the 
issue in another way, would you please provide a more meaningful case?

> SQL/Hive Context fails with NullPointerException 
> -
>
> Key: SPARK-24008
> URL: https://issues.apache.org/jira/browse/SPARK-24008
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.3
>Reporter: Prabhu Joseph
>Priority: Major
> Attachments: Repro
>
>
> SQL / Hive Context fails with NullPointerException while getting 
> configuration from SQLConf. This happens when the MemoryStore is filled with 
> lot of broadcast and started dropping and then SQL / Hive Context is created 
> and broadcast. When using this Context to access a table fails with below 
> NullPointerException.
> Repro is attached - the Spark Example which fills the MemoryStore with 
> broadcasts and then creates and accesses a SQL Context.
> {code}
> java.lang.NullPointerException
> at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638)
> at 
> org.apache.spark.sql.SQLConf.defaultDataSourceName(SQLConf.scala:558)
> at 
> org.apache.spark.sql.DataFrameReader.(DataFrameReader.scala:362)
> at org.apache.spark.sql.SQLContext.read(SQLContext.scala:623)
> at SparkHiveExample$.main(SparkHiveExample.scala:76)
> at SparkHiveExample.main(SparkHiveExample.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> 18/04/06 14:17:42 ERROR ApplicationMaster: User class threw exception: 
> java.lang.NullPointerException 
> java.lang.NullPointerException 
> at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638) 
> at org.apache.spark.sql.SQLContext.getConf(SQLContext.scala:153) 
> at 
> org.apache.spark.sql.hive.HiveContext.hiveMetastoreVersion(HiveContext.scala:166)
>  
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:258)
>  
> at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:255) 
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$2.(HiveContext.scala:475) 
> at 
> org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:475)
>  
> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:474) 
> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:90) 
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:831) 
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827) 
> {code}
> MemoryStore got filled and started dropping the blocks.
> {code}
> 18/04/17 08:03:43 INFO MemoryStore: 2 blocks selected for dropping
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_14 stored as values in 
> memory (estimated size 78.1 MB, free 64.4 MB)
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_14_piece0 stored as bytes 
> in memory (estimated size 1522.0 B, free 64.4 MB)
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_15 stored as values in 
> memory (estimated size 350.9 KB, free 64.1 MB)
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_15_piece0 stored as bytes 
> in memory (estimated size 29.9 KB, free 64.0 MB)
> 18/04/17 08:03:43 INFO MemoryStore: 10 blocks selected for dropping
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_16 stored as values in 
> memory (estimated size 78.1 MB, free 64.7 MB)
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_16_piece0 stored as bytes 
> in memory (estimated size 1522.0 B, free 64.7 MB)
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_1 stored as values in 
> memory (estimated size 136.0 B, free 64.7 MB)
> 18/04/17 08:03:44 INFO MemoryStore: MemoryStore cleared
> 18/04/17 08:03:20 INFO MemoryStore: MemoryStore started with capacity 511.1 MB
> 18/04/17 08:03:44 INFO MemoryStore: MemoryStore cleared
> 18/04/17 08:03:57 INFO MemoryStore: MemoryStore started with capacity 511.1 MB
> 18/04/17 08:04:23 INFO MemoryStore: MemoryStore cleared
> {code}






[jira] [Commented] (SPARK-24008) SQL/Hive Context fails with NullPointerException

2018-04-18 Thread Prabhu Joseph (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443484#comment-16443484
 ] 

Prabhu Joseph commented on SPARK-24008:
---

Yes, it's driver-specific, but it would be better to handle this case instead of failing.

> SQL/Hive Context fails with NullPointerException 
> -
>
> Key: SPARK-24008
> URL: https://issues.apache.org/jira/browse/SPARK-24008
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.3
>Reporter: Prabhu Joseph
>Priority: Major
> Attachments: Repro
>
>
> SQL / Hive Context fails with NullPointerException while getting 
> configuration from SQLConf. This happens when the MemoryStore is filled with 
> lot of broadcast and started dropping and then SQL / Hive Context is created 
> and broadcast. When using this Context to access a table fails with below 
> NullPointerException.
> Repro is attached - the Spark Example which fills the MemoryStore with 
> broadcasts and then creates and accesses a SQL Context.
> {code}
> java.lang.NullPointerException
> at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638)
> at 
> org.apache.spark.sql.SQLConf.defaultDataSourceName(SQLConf.scala:558)
> at 
> org.apache.spark.sql.DataFrameReader.(DataFrameReader.scala:362)
> at org.apache.spark.sql.SQLContext.read(SQLContext.scala:623)
> at SparkHiveExample$.main(SparkHiveExample.scala:76)
> at SparkHiveExample.main(SparkHiveExample.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> 18/04/06 14:17:42 ERROR ApplicationMaster: User class threw exception: 
> java.lang.NullPointerException 
> java.lang.NullPointerException 
> at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638) 
> at org.apache.spark.sql.SQLContext.getConf(SQLContext.scala:153) 
> at 
> org.apache.spark.sql.hive.HiveContext.hiveMetastoreVersion(HiveContext.scala:166)
>  
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:258)
>  
> at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:255) 
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$2.(HiveContext.scala:475) 
> at 
> org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:475)
>  
> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:474) 
> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:90) 
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:831) 
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827) 
> {code}
> MemoryStore got filled and started dropping the blocks.
> {code}
> 18/04/17 08:03:43 INFO MemoryStore: 2 blocks selected for dropping
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_14 stored as values in 
> memory (estimated size 78.1 MB, free 64.4 MB)
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_14_piece0 stored as bytes 
> in memory (estimated size 1522.0 B, free 64.4 MB)
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_15 stored as values in 
> memory (estimated size 350.9 KB, free 64.1 MB)
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_15_piece0 stored as bytes 
> in memory (estimated size 29.9 KB, free 64.0 MB)
> 18/04/17 08:03:43 INFO MemoryStore: 10 blocks selected for dropping
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_16 stored as values in 
> memory (estimated size 78.1 MB, free 64.7 MB)
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_16_piece0 stored as bytes 
> in memory (estimated size 1522.0 B, free 64.7 MB)
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_1 stored as values in 
> memory (estimated size 136.0 B, free 64.7 MB)
> 18/04/17 08:03:44 INFO MemoryStore: MemoryStore cleared
> 18/04/17 08:03:20 INFO MemoryStore: MemoryStore started with capacity 511.1 MB
> 18/04/17 08:03:44 INFO MemoryStore: MemoryStore cleared
> 18/04/17 08:03:57 INFO MemoryStore: MemoryStore started with capacity 511.1 MB
> 18/04/17 08:04:23 INFO MemoryStore: MemoryStore cleared
> {code}






[jira] [Commented] (SPARK-24008) SQL/Hive Context fails with NullPointerException

2018-04-18 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443475#comment-16443475
 ] 

Saisai Shao commented on SPARK-24008:
-

Why do you need to broadcast {{SQLContext}} or {{HiveContext}}?

> SQL/Hive Context fails with NullPointerException 
> -
>
> Key: SPARK-24008
> URL: https://issues.apache.org/jira/browse/SPARK-24008
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.3
>Reporter: Prabhu Joseph
>Priority: Major
> Attachments: Repro
>
>
> SQL / Hive Context fails with NullPointerException while getting 
> configuration from SQLConf. This happens when the MemoryStore is filled with 
> lot of broadcast and started dropping and then SQL / Hive Context is created 
> and broadcast. When using this Context to access a table fails with below 
> NullPointerException.
> Repro is attached - the Spark Example which fills the MemoryStore with 
> broadcasts and then creates and accesses a SQL Context.
> {code}
> java.lang.NullPointerException
> at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638)
> at 
> org.apache.spark.sql.SQLConf.defaultDataSourceName(SQLConf.scala:558)
> at 
> org.apache.spark.sql.DataFrameReader.(DataFrameReader.scala:362)
> at org.apache.spark.sql.SQLContext.read(SQLContext.scala:623)
> at SparkHiveExample$.main(SparkHiveExample.scala:76)
> at SparkHiveExample.main(SparkHiveExample.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> 18/04/06 14:17:42 ERROR ApplicationMaster: User class threw exception: 
> java.lang.NullPointerException 
> java.lang.NullPointerException 
> at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638) 
> at org.apache.spark.sql.SQLContext.getConf(SQLContext.scala:153) 
> at 
> org.apache.spark.sql.hive.HiveContext.hiveMetastoreVersion(HiveContext.scala:166)
>  
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:258)
>  
> at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:255) 
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$2.(HiveContext.scala:475) 
> at 
> org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:475)
>  
> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:474) 
> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:90) 
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:831) 
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827) 
> {code}
> MemoryStore got filled and started dropping the blocks.
> {code}
> 18/04/17 08:03:43 INFO MemoryStore: 2 blocks selected for dropping
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_14 stored as values in 
> memory (estimated size 78.1 MB, free 64.4 MB)
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_14_piece0 stored as bytes 
> in memory (estimated size 1522.0 B, free 64.4 MB)
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_15 stored as values in 
> memory (estimated size 350.9 KB, free 64.1 MB)
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_15_piece0 stored as bytes 
> in memory (estimated size 29.9 KB, free 64.0 MB)
> 18/04/17 08:03:43 INFO MemoryStore: 10 blocks selected for dropping
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_16 stored as values in 
> memory (estimated size 78.1 MB, free 64.7 MB)
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_16_piece0 stored as bytes 
> in memory (estimated size 1522.0 B, free 64.7 MB)
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_1 stored as values in 
> memory (estimated size 136.0 B, free 64.7 MB)
> 18/04/17 08:03:44 INFO MemoryStore: MemoryStore cleared
> 18/04/17 08:03:20 INFO MemoryStore: MemoryStore started with capacity 511.1 MB
> 18/04/17 08:03:44 INFO MemoryStore: MemoryStore cleared
> 18/04/17 08:03:57 INFO MemoryStore: MemoryStore started with capacity 511.1 MB
> 18/04/17 08:04:23 INFO MemoryStore: MemoryStore cleared
> {code}






[jira] [Resolved] (SPARK-24014) Add onStreamingStarted method to StreamingListener

2018-04-18 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-24014.
-
Resolution: Fixed
Fix Version/s: 2.3.1, 2.4.0

Issue resolved by pull request 21098
[https://github.com/apache/spark/pull/21098]

> Add onStreamingStarted method to StreamingListener
> --
>
> Key: SPARK-24014
> URL: https://issues.apache.org/jira/browse/SPARK-24014
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Trivial
> Fix For: 2.4.0, 2.3.1
>
>
> The {{StreamingListener}} on the PySpark side seems to lack the 
> {{onStreamingStarted}} method.
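
For reference, a minimal sketch of the corresponding Scala-side hook that this 
ticket mirrors into PySpark; `ssc` is assumed to be an existing 
StreamingContext:

{code:java}
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerStreamingStarted}

class StartLogger extends StreamingListener {
  // Called once when the StreamingContext starts.
  override def onStreamingStarted(started: StreamingListenerStreamingStarted): Unit = {
    println("Streaming started: " + started)
  }
}

// ssc.addStreamingListener(new StartLogger)   // register on an existing StreamingContext
{code}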






[jira] [Assigned] (SPARK-24014) Add onStreamingStarted method to StreamingListener

2018-04-18 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-24014:
---

Assignee: Liang-Chi Hsieh

> Add onStreamingStarted method to StreamingListener
> --
>
> Key: SPARK-24014
> URL: https://issues.apache.org/jira/browse/SPARK-24014
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Trivial
>
> The {{StreamingListener}} on the PySpark side seems to lack the 
> {{onStreamingStarted}} method.






[jira] [Commented] (SPARK-17901) NettyRpcEndpointRef: Error sending message and Caused by: java.util.ConcurrentModificationException

2018-04-18 Thread zhangzhaolong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443449#comment-16443449
 ] 

zhangzhaolong commented on SPARK-17901:
---

Hi [~srowen], has this issue come back on master? We encountered the same problem.

> NettyRpcEndpointRef: Error sending message and Caused by: 
> java.util.ConcurrentModificationException
> ---
>
> Key: SPARK-17901
> URL: https://issues.apache.org/jira/browse/SPARK-17901
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1
>Reporter: Harish
>Priority: Major
>
> I have 2 data frames, one with 10K rows and 10,000 columns and another with 4M 
> rows and 50 columns. I joined them and am trying to find the mean of the merged 
> data set.
> I calculated the mean using a lambda with the Python mean() function; I can't 
> write it in PySpark due to the 64KB code limit issue.
> After calculating the mean I did rdd.take(2), which works. But creating the DF 
> from the RDD and calling DF.show has been in progress for more than 2 hours (I 
> stopped the process) with the message below (102 GB, 6 cores per node -- 10 
> nodes + 1 master)
> 16/10/13 04:36:31 WARN NettyRpcEndpointRef: Error sending message [message = 
> Heartbeat(9,[Lscala.Tuple2;@68eedafb,BlockManagerId(9, 10.63.136.103, 
> 35729))] in 1 attempts
> org.apache.spark.SparkException: Exception thrown in awaitResult
>   at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
>   at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
>   at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>   at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>   at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>   at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
>   at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1877)
>   at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.ConcurrentModificationException
>   at java.util.ArrayList.writeObject(ArrayList.java:766)
>   at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at 
> java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
>   at 
> java.util.Collections$SynchronizedCollection.writeObject(Collections.java:2081)
>   at sun.reflect.GeneratedMethodAccessor38.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>   at 
> 

[jira] [Commented] (SPARK-24020) Sort-merge join inner range optimization

2018-04-18 Thread Henry Robinson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443392#comment-16443392
 ] 

Henry Robinson commented on SPARK-24020:


This sounds like a 'band join' (e.g. 
http://pages.cs.wisc.edu/~dewitt/includes/paralleldb/vldb91.pdf, also see 
[Oracle's 
documentation|https://docs.oracle.com/en/database/oracle/oracle-database/12.2/sqlrf/Joins.html#GUID-568EC26F-199A-4339-BFD9-C4A0B9588937]).
 

Does your implementation also handle non-equi joins? e.g. {{WHERE t1.Y BETWEEN 
t2.Y -d AND t2.Y +d}}, with no equality clause in the join predicates. 

> Sort-merge join inner range optimization
> 
>
> Key: SPARK-24020
> URL: https://issues.apache.org/jira/browse/SPARK-24020
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Petar Zecevic
>Priority: Major
>
> The problem we are solving is the case where you have two big tables 
> partitioned by X column, but also sorted by Y column (within partitions) and 
> you need to calculate an expensive function on the joined rows. During a 
> sort-merge join, Spark will do cross-joins of all rows that have the same X 
> values and calculate the function's value on all of them. If the two tables 
> have a large number of rows per X, this can result in a huge number of 
> calculations.
> We hereby propose an optimization that would allow you to reduce the number 
> of matching rows per X using a range condition on Y columns of the two 
> tables. Something like:
> ... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d
> The way SMJ is currently implemented, these extra conditions have no 
> influence on the number of rows (per X) being checked because these extra 
> conditions are put in the same block with the function being calculated.
> Here we propose a change to the sort-merge join so that, when these extra 
> conditions are specified, a queue is used instead of the 
> ExternalAppendOnlyUnsafeRowArray class. This queue would then be used as a 
> moving window across the values from the right relation as the left row 
> changes. You could call this a combination of an equi-join and a theta join 
> (we call it "sort-merge inner range join").
> Potential use-cases for this are joins based on spatial or temporal distance 
> calculations.
> The optimization should be triggered automatically when an equi-join 
> expression is present AND lower and upper range conditions on a secondary 
> column are specified. If the tables aren't sorted by both columns, 
> appropriate sorts should be added.
> To limit the impact of this change we also propose adding a new parameter 
> (tentatively named "spark.sql.join.smj.useInnerRangeOptimization") which 
> could be used to switch off the optimization entirely.
>  






[jira] [Created] (SPARK-24020) Sort-merge join inner range optimization

2018-04-18 Thread Petar Zecevic (JIRA)
Petar Zecevic created SPARK-24020:
-

 Summary: Sort-merge join inner range optimization
 Key: SPARK-24020
 URL: https://issues.apache.org/jira/browse/SPARK-24020
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Petar Zecevic


The problem we are solving is the case where you have two big tables 
partitioned by X column, but also sorted by Y column (within partitions) and 
you need to calculate an expensive function on the joined rows. During a 
sort-merge join, Spark will do cross-joins of all rows that have the same X 
values and calculate the function's value on all of them. If the two tables 
have a large number of rows per X, this can result in a huge number of 
calculations.

We hereby propose an optimization that would allow you to reduce the number of 
matching rows per X using a range condition on Y columns of the two tables. 
Something like:

... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d

The way SMJ is currently implemented, these extra conditions have no influence 
on the number of rows (per X) being checked because these extra conditions are 
put in the same block with the function being calculated.

Here we propose a change to the sort-merge join so that, when these extra 
conditions are specified, a queue is used instead of the 
ExternalAppendOnlyUnsafeRowArray class. This queue would then be used as a moving 
window across the values from the right relation as the left row changes. You 
could call this a combination of an equi-join and a theta join (we call it 
"sort-merge inner range join").

Potential use-cases for this are joins based on spatial or temporal distance 
calculations.

The optimization should be triggered automatically when an equi-join expression 
is present AND lower and upper range conditions on a secondary column are 
specified. If the tables aren't sorted by both columns, appropriate sorts 
should be added.

To limit the impact of this change we also propose adding a new parameter 
(tentatively named "spark.sql.join.smj.useInnerRangeOptimization") which could 
be used to switch off the optimization entirely.
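
For illustration, the kind of query this proposal targets, expressed in Spark 
SQL. The temp views t1 and t2 are assumed to exist, and the config key is the 
tentative name from the proposal above, not an existing Spark setting:

{code:java}
// Tentative switch from the proposal above -- not a real Spark conf key today.
spark.conf.set("spark.sql.join.smj.useInnerRangeOptimization", "true")

// Equi-join on X plus a band condition on Y: per matching X, only rows of t2
// whose Y is within +/- 10 of t1.Y would be kept in the moving window.
val joined = spark.sql("""
  SELECT *
  FROM t1 JOIN t2
    ON t1.X = t2.X
   AND t1.Y BETWEEN t2.Y - 10 AND t2.Y + 10
""")
joined.explain()
{code}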

 






[jira] [Resolved] (SPARK-23775) Flaky test: DataFrameRangeSuite

2018-04-18 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-23775.

Resolution: Fixed
Fix Version/s: 2.3.1, 2.4.0

Issue resolved by pull request 20888
[https://github.com/apache/spark/pull/20888]

> Flaky test: DataFrameRangeSuite
> ---
>
> Key: SPARK-23775
> URL: https://issues.apache.org/jira/browse/SPARK-23775
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
> Fix For: 2.4.0, 2.3.1
>
> Attachments: filtered.log, filtered_more_logs.log
>
>
> DataFrameRangeSuite.test("Cancelling stage in a query with Range.") sometimes 
> stays in an infinite loop and times out the build.
> I presume the original intention of this test is to start a job with range 
> and just cancel it.
> The submitted job has 2 stages but I think the author tried to cancel the 
> first stage with ID 0 which is not the case here:
> {code:java}
> eventually(timeout(10.seconds), interval(1.millis)) {
>   assert(DataFrameRangeSuite.stageToKill > 0)
> }
> {code}
> All in all, if the first stage is slower than 10 seconds it throws 
> TestFailedDueToTimeoutException and cancelStage will never be called.






[jira] [Assigned] (SPARK-23775) Flaky test: DataFrameRangeSuite

2018-04-18 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-23775:
--

Assignee: Gabor Somogyi

> Flaky test: DataFrameRangeSuite
> ---
>
> Key: SPARK-23775
> URL: https://issues.apache.org/jira/browse/SPARK-23775
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
> Attachments: filtered.log, filtered_more_logs.log
>
>
> DataFrameRangeSuite.test("Cancelling stage in a query with Range.") sometimes 
> stays in an infinite loop and times out the build.
> I presume the original intention of this test is to start a job with range 
> and just cancel it.
> The submitted job has 2 stages but I think the author tried to cancel the 
> first stage with ID 0 which is not the case here:
> {code:java}
> eventually(timeout(10.seconds), interval(1.millis)) {
>   assert(DataFrameRangeSuite.stageToKill > 0)
> }
> {code}
> All in all, if the first stage is slower than 10 seconds it throws 
> TestFailedDueToTimeoutException and cancelStage will never be called.






[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0

2018-04-18 Thread Jordan Moore (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443239#comment-16443239
 ] 

Jordan Moore commented on SPARK-18057:
--

Thanks, [~c...@koeninger.org]. I will forward that information to the relevant 
parties. 

> Update structured streaming kafka from 0.10.0.1 to 1.1.0
> 
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Priority: Major
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html






[jira] [Reopened] (SPARK-19941) Spark should not schedule tasks on executors on decommissioning YARN nodes

2018-04-18 Thread Karthik Palaniappan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Palaniappan reopened SPARK-19941:
-

This should be marked as a duplicate of 
https://issues.apache.org/jira/browse/SPARK-20628.

> Spark should not schedule tasks on executors on decommissioning YARN nodes
> --
>
> Key: SPARK-19941
> URL: https://issues.apache.org/jira/browse/SPARK-19941
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, YARN
>Affects Versions: 2.1.0
> Environment: Hadoop 2.8.0-rc1
>Reporter: Karthik Palaniappan
>Priority: Major
>
> Hadoop 2.8 added a mechanism to gracefully decommission Node Managers in 
> YARN: https://issues.apache.org/jira/browse/YARN-914
> Essentially you can mark nodes to be decommissioned, and let them a) finish 
> work in progress and b) finish serving shuffle data. But no new work will be 
> scheduled on the node.
> Spark should respect when NMs are set to decommissioned, and similarly 
> decommission executors on those nodes by not scheduling any more tasks on 
> them.
> It looks like in the future YARN may inform the app master when containers 
> will be killed: https://issues.apache.org/jira/browse/YARN-3784. However, I 
> don't think Spark should schedule based on a timeout. We should gracefully 
> decommission the executor as fast as possible (which is the spirit of 
> YARN-914). The app master can query the RM for NM statuses (if it doesn't 
> already have them) and stop scheduling on executors on NMs that are 
> decommissioning.
> Stretch feature: The timeout may be useful in determining whether running 
> further tasks on the executor is even helpful. Spark may be able to tell that 
> shuffle data will not be consumed by the time the node is decommissioned, so 
> it is not worth computing. The executor can be killed immediately.






[jira] [Commented] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name

2018-04-18 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443154#comment-16443154
 ] 

Franck Tago commented on SPARK-23519:
-

Thanks for the suggestion, [~shahid].

The issue with your suggestion is that I dynamically generate the create view 
statement; moreover, the select statement is somewhat opaque to me because it is 
provided by the customer.

It would be nice if Spark could fix such a simple case.
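
A sketch of the aliasing workaround discussed above, assuming the generated 
SELECT can be wrapped or rewritten so that the underlying output names are 
unique (this works around the assertion; it does not fix it):

{code:java}
// Aliasing the duplicated column gives the view's child plan unique output names,
// so the duplicate-column assertion is not triggered. Names taken from the repro.
spark.sql("""
  CREATE VIEW default.aview (int1, int2) AS
  SELECT col1 AS c1, col1 AS c2 FROM atable
""")
{code}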

> Create View Commands Fails with  The view output (col1,col1) contains 
> duplicate column name
> ---
>
> Key: SPARK-23519
> URL: https://issues.apache.org/jira/browse/SPARK-23519
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: Franck Tago
>Priority: Critical
>
> 1- create and populate a hive table  . I did this in a hive cli session .[ 
> not that this matters ]
> create table  atable (col1 int) ;
> insert  into atable values (10 ) , (100)  ;
> 2. create a view from the table.  
> [These actions were performed from a spark shell ]
> spark.sql("create view  default.aview  (int1 , int2 ) as select  col1 , col1 
> from atable ")
>  java.lang.AssertionError: assertion failed: The view output (col1,col1) 
> contains duplicate column name.
>  at scala.Predef$.assert(Predef.scala:170)
>  at 
> org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>  at org.apache.spark.sql.Dataset.(Dataset.scala:183)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632)






[jira] [Created] (SPARK-24019) AnalysisException for Window function expression to compute derivative

2018-04-18 Thread Barry Becker (JIRA)
Barry Becker created SPARK-24019:


 Summary: AnalysisException for Window function expression to 
compute derivative
 Key: SPARK-24019
 URL: https://issues.apache.org/jira/browse/SPARK-24019
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.1
 Environment: Ubuntu, spark 2.1.1, standalone.
Reporter: Barry Becker


I am using spark 2.1.1 currently.

I created an expression to compute the derivative of some series data using a 
window function.

I have a simple reproducible case of the error.

I'm only filing this bug because the error message says "Please file a bug 
report with this error message, stack trace, and the query."

Here they are:
{code:java}
((coalesce(lead(value#9, 1, null) windowspecdefinition(category#8, 
sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING), 
value#9) - coalesce(lag(value#9, 1, null) windowspecdefinition(category#8, 
sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING), 
value#9)) / cast((coalesce(lead(sequence_num#7, 1, null) 
windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 
FOLLOWING AND 1 FOLLOWING), sequence_num#7) - coalesce(lag(sequence_num#7, 1, 
null) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS 
BETWEEN 1 PRECEDING AND 1 PRECEDING), sequence_num#7)) as double)) 
windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, RANGE BETWEEN 
UNBOUNDED PRECEDING AND CURRENT ROW) AS derivative#14 has multiple Window 
Specifications (ArrayBuffer(windowspecdefinition(category#8, sequence_num#7 ASC 
NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), 
windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 
FOLLOWING AND 1 FOLLOWING), windowspecdefinition(category#8, sequence_num#7 ASC 
NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING))).

Please file a bug report with this error message, stack trace, and the query.;
org.apache.spark.sql.AnalysisException: ((coalesce(lead(value#9, 1, null) 
windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 
FOLLOWING AND 1 FOLLOWING), value#9) - coalesce(lag(value#9, 1, null) 
windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 
PRECEDING AND 1 PRECEDING), value#9)) / cast((coalesce(lead(sequence_num#7, 1, 
null) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS 
BETWEEN 1 FOLLOWING AND 1 FOLLOWING), sequence_num#7) - 
coalesce(lag(sequence_num#7, 1, null) windowspecdefinition(category#8, 
sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING), 
sequence_num#7)) as double)) windowspecdefinition(category#8, sequence_num#7 
ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS 
derivative#14 has multiple Window Specifications 
(ArrayBuffer(windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, 
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), 
windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 
FOLLOWING AND 1 FOLLOWING), windowspecdefinition(category#8, sequence_num#7 ASC 
NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING))).
Please file a bug report with this error message, stack trace, and the query.;
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$78.apply(Analyzer.scala:1772){code}
And here is a simple unit test that can be used to reproduce the problem:
{code:java}
import com.mineset.spark.testsupport.SparkTestCase.SPARK_SESSION
import org.apache.spark.sql.Column
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.scalatest.FunSuite
import com.mineset.spark.testsupport.SparkTestCase._


/**
* Test to see that window functions work as expected on spark.
* @author Barry Becker
*/
class WindowFunctionSuite extends FunSuite {

val simpleDf = createSimpleData()


test("Window function for finding derivatives for 2 series") {

val window =
Window.partitionBy("category").orderBy("sequence_num")//.rangeBetween(-1, 1)
// Consider the three sequential series points (Xlag, Ylag), (X, Y), (Xlead, 
Ylead).
// This defines the derivative as (Ylead - Ylag) / (Xlead - Xlag)
// If the lead or lag points are null, then we fall back on using the middle 
point.
val yLead = coalesce(lead("value", 1).over(window), col("value"))
val yLag = coalesce(lag("value", 1).over(window), col("value"))
val xLead = coalesce(lead("sequence_num", 1).over(window), col("sequence_num"))
val xLag = coalesce(lag("sequence_num", 1).over(window), col("sequence_num"))
val derivative: Column = (yLead - yLag) / (xLead - xLag)

val resultDf = 

[jira] [Updated] (SPARK-24018) Spark-without-hadoop package fails to create or read parquet files with snappy compression

2018-04-18 Thread Jean-Francis Roy (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Francis Roy updated SPARK-24018:
-
Description: 
On a brand-new installation of Spark 2.3.0 with a user-provided hadoop-2.8.3, 
Spark fails to read or write dataframes in parquet format with snappy 
compression.

This is due to an incompatibility between the snappy-java version that is 
required by parquet (parquet is provided in Spark jars but snappy isn't) and 
the version that is available from hadoop-2.8.3.

 

Steps to reproduce:
 * Download and extract hadoop-2.8.3
 * Download and extract spark-2.3.0-without-hadoop
 * export JAVA_HOME, HADOOP_HOME, SPARK_HOME, PATH
 * Following instructions from 
[https://spark.apache.org/docs/latest/hadoop-provided.html], set 
SPARK_DIST_CLASSPATH=$(hadoop classpath) in spark-env.sh
 * Start a spark-shell, enter the following:

 
{code:java}
import spark.implicits._
val df = List(1, 2, 3, 4).toDF
df.write
  .format("parquet")
  .option("compression", "snappy")
  .mode("overwrite")
  .save("test.parquet")
{code}
 

 

This fails with the following:
{noformat}
java.lang.UnsatisfiedLinkError: 
org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method)
at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316)
at 
org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67)
at 
org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)
at 
org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92)
at 
org.apache.parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:112)
at 
org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:93)
at 
org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:150)
at 
org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:238)
at 
org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:121)
at 
org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:167)
at 
org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:109)
at 
org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:163)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.releaseResources(FileFormatWriter.scala:405)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:396)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at 
org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748){noformat}
 

  Downloading snappy-java-1.1.2.6.jar and placing it in Spark's jars folder 
solves the issue.
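
A quick way to confirm which snappy-java build actually wins on the classpath is to 
ask the JVM where the Snappy class was loaded from. A minimal diagnostic sketch for 
spark-shell (not part of the fix itself):
{code}
// Prints the jar that provides org.xerial.snappy.Snappy. If this points at the
// hadoop-provided snappy-java, the UnsatisfiedLinkError above is the expected symptom.
val snappyJar = classOf[org.xerial.snappy.Snappy]
  .getProtectionDomain.getCodeSource.getLocation
println(s"snappy-java loaded from: $snappyJar")
{code}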

  was:
On a brand-new installation of Spark 2.3.0 with a user-provided hadoop-2.8.3, 
Spark fails to read or write dataframes in parquet format with snappy 
compression.

This is due to an incompatibility between the snappy-java version that is 
required by parquet (parquet is provided in Spark jars but snappy isn't) and 
the version that is available from hadoop-2.8.3.

 

Steps to reproduce:
 * Download and extract hadoop-2.8.3
 * Download and extract spark-2.3.0-without-hadoop
 * export JAVA_HOME, HADOOP_HOME, SPARK_HOME, PATH
 * Following instructions from 
https://spark.apache.org/docs/latest/hadoop-provided.html, set 
SPARK_DIST_CLASSPATH=$(hadoop classpath) in spark-env.sh
 * Start a spark-shell, 

[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics

2018-04-18 Thread Edwina Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443025#comment-16443025
 ] 

Edwina Lu commented on SPARK-23206:
---

We are planning a design discussion for early next week. Please let me know if 
you are interested in attending, and I will post once there is a set date/time.

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: ExecutorsTab.png, ExecutorsTab2.png, 
> MemoryTuningMetricsDesignDoc.pdf, SPARK-23206 Design Doc.pdf, StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used for caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.
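
As a rough illustration of the proposed bookkeeping (a sketch only, not the actual 
listener changes; names and numbers are made up), tracking a peak per (stage, 
executor) just needs a max-update on every heartbeat:
{code}
import scala.collection.mutable

// Hypothetical peak tracker keyed by (stageId, executorId); keeps the max value seen so far.
val peakJvmUsed = mutable.Map.empty[(Int, String), Long]

def recordHeartbeat(stageId: Int, executorId: String, jvmUsedBytes: Long): Unit = {
  val key = (stageId, executorId)
  if (jvmUsedBytes > peakJvmUsed.getOrElse(key, 0L)) peakJvmUsed(key) = jvmUsedBytes
}

// Example heartbeats with made-up values.
recordHeartbeat(0, "exec-1", 512L * 1024 * 1024)
recordHeartbeat(0, "exec-1", 768L * 1024 * 1024)
println(peakJvmUsed((0, "exec-1")))  // prints the peak (805306368), not the latest value
{code}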



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24018) Spark-without-hadoop package fails to create or read parquet files with snappy compression

2018-04-18 Thread Jean-Francis Roy (JIRA)
Jean-Francis Roy created SPARK-24018:


 Summary: Spark-without-hadoop package fails to create or read 
parquet files with snappy compression
 Key: SPARK-24018
 URL: https://issues.apache.org/jira/browse/SPARK-24018
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 2.3.0
Reporter: Jean-Francis Roy


On a brand-new installation of Spark 2.3.0 with a user-provided hadoop-2.8.3, 
Spark fails to read or write dataframes in parquet format with snappy 
compression.

This is due to an incompatibility between the snappy-java version that is 
required by parquet (parquet is provided in Spark jars but snappy isn't) and 
the version that is available from hadoop-2.8.3.

 

Steps to reproduce:
 * Download and extract hadoop-2.8.3
 * Download and extract spark-2.3.0-without-hadoop
 * export JAVA_HOME, HADOOP_HOME, SPARK_HOME, PATH
 * Following instructions from 
https://spark.apache.org/docs/latest/hadoop-provided.html, set 
SPARK_DIST_CLASSPATH=$(hadoop classpath) in spark-env.sh
 * Start a spark-shell, enter the following:

 
{code:java}
import spark.implicits._
val df = List(1, 2, 3, 4).toDF
df.write
  .format("parquet")
  .option("compression", "snappy")
  .mode("overwrite")
  .save("test.parquet")
{code}
 

 

This fails with the following:
{noformat}
java.lang.UnsatisfiedLinkError: 
org.xerial.snappy.SnappyNative.maxCompressedLength(I)I at 
org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method) at 
org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316) at 
org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67)
 at 
org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)
 at 
org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92) 
at 
org.apache.parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:112)
 at 
org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:93)
 at 
org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:150)
 at 
org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:238) at 
org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:121)
 at 
org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:167)
 at 
org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:109)
 at 
org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:163)
 at 
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.releaseResources(FileFormatWriter.scala:405)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:396)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
 at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at 
org.apache.spark.scheduler.Task.run(Task.scala:109) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)
{noformat}
Downloading snappy-java-1.1.2.6.jar and placing it in Spark's jars folder 
solves the issue.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23915) High-order function: array_except(x, y) → array

2018-04-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442960#comment-16442960
 ] 

Apache Spark commented on SPARK-23915:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21103

> High-order function: array_except(x, y) → array
> ---
>
> Key: SPARK-23915
> URL: https://issues.apache.org/jira/browse/SPARK-23915
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Returns an array of elements in x but not in y, without duplicates.
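
Until a native expression lands, the intended semantics can be approximated with a 
UDF. A minimal sketch, assuming both columns are array<int> (the name arrayExcept and 
the column names are hypothetical):
{code}
import org.apache.spark.sql.functions.{col, udf}

// Stand-in for array_except(x, y): elements of x that are not in y, de-duplicated,
// keeping the order of first appearance in x.
val arrayExcept = udf { (x: Seq[Int], y: Seq[Int]) =>
  if (x == null) null
  else x.distinct.filterNot(e => y != null && y.contains(e))
}
// Usage: df.select(arrayExcept(col("x"), col("y")))
{code}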



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23915) High-order function: array_except(x, y) → array

2018-04-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23915:


Assignee: Apache Spark

> High-order function: array_except(x, y) → array
> ---
>
> Key: SPARK-23915
> URL: https://issues.apache.org/jira/browse/SPARK-23915
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Returns an array of elements in x but not in y, without duplicates.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23915) High-order function: array_except(x, y) → array

2018-04-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23915:


Assignee: (was: Apache Spark)

> High-order function: array_except(x, y) → array
> ---
>
> Key: SPARK-23915
> URL: https://issues.apache.org/jira/browse/SPARK-23915
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Returns an array of elements in x but not in y, without duplicates.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24017) Refactor ExternalCatalog to be an interface

2018-04-18 Thread Xiao Li (JIRA)
Xiao Li created SPARK-24017:
---

 Summary: Refactor ExternalCatalog to be an interface
 Key: SPARK-24017
 URL: https://issues.apache.org/jira/browse/SPARK-24017
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Xiao Li
Assignee: Xiao Li


This refactors the external catalog to be an interface. It will make future work 
on catalog federation easier.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23443) Spark with Glue as external catalog

2018-04-18 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442948#comment-16442948
 ] 

Xiao Li commented on SPARK-23443:
-

We need to clean up the ExternalCatalog interface first, before the catalog 
federation work.

> Spark with Glue as external catalog
> ---
>
> Key: SPARK-23443
> URL: https://issues.apache.org/jira/browse/SPARK-23443
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Ameen Tayyebi
>Priority: Major
>
> AWS Glue Catalog is an external Hive metastore backed by a web service. It 
> allows permanent storage of catalog data for BigData use cases.
> To find out more information about AWS Glue, please consult:
>  * AWS Glue - [https://aws.amazon.com/glue/]
>  * Using Glue as a Metastore catalog for Spark - 
> [https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html]
> Today, the integration of Glue and Spark is through the Hive layer. Glue 
> implements the IMetaStore interface of Hive and for installations of Spark 
> that contain Hive, Glue can be used as the metastore.
> The feature set that Glue supports does not align 1-1 with the set of 
> features that the latest version of Spark supports. For example, Glue 
> interface supports more advanced partition pruning that the latest version of 
> Hive embedded in Spark.
> To enable a more natural integration with Spark and to allow leveraging 
> latest features of Glue, without being coupled to Hive, a direct integration 
> through Spark's own Catalog API is proposed. This Jira tracks this work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics

2018-04-18 Thread Edwina Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442928#comment-16442928
 ] 

Edwina Lu commented on SPARK-23206:
---

[~assia6], there is an initial PR:  [[Github] Pull Request #20940 
(edwinalu)|https://github.com/apache/spark/pull/20940], and I am working on 
making changes based on comments.

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: ExecutorsTab.png, ExecutorsTab2.png, 
> MemoryTuningMetricsDesignDoc.pdf, SPARK-23206 Design Doc.pdf, StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used for caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17508) Setting weightCol to None in ML library causes an error

2018-04-18 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-17508.
--
Resolution: Won't Fix

Resolving this for now unless there is more interest in the fix

> Setting weightCol to None in ML library causes an error
> ---
>
> Key: SPARK-17508
> URL: https://issues.apache.org/jira/browse/SPARK-17508
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Evan Zamir
>Priority: Minor
>
> The following code runs without error:
> {code}
> spark = SparkSession.builder.appName('WeightBug').getOrCreate()
> df = spark.createDataFrame(
> [
> (1.0, 1.0, Vectors.dense(1.0)),
> (0.0, 1.0, Vectors.dense(-1.0))
> ],
> ["label", "weight", "features"])
> lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight")
> model = lr.fit(df)
> {code}
> My expectation from reading the documentation is that setting weightCol=None 
> should treat all weights as 1.0 (regardless of whether a column exists). 
> However, the same code with weightCol set to None causes the following errors:
> Traceback (most recent call last):
>   File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in 
> 
> model = lr.fit(df)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line 
> 64, in fit
> return self._fit(dataset)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", 
> line 213, in _fit
> java_model = self._fit_java(dataset)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", 
> line 210, in _fit_java
> return self._java_obj.fit(dataset._jdf)
>   File 
> "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
>  line 933, in __call__
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
>  line 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit.
> : java.lang.NullPointerException
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:211)
>   at java.lang.Thread.run(Thread.java:745)
> Process finished with exit code 1



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23963) Queries on text-based Hive tables grow disproportionately slower as the number of columns increase

2018-04-18 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23963:

Fix Version/s: 2.3.1
   2.2.2

> Queries on text-based Hive tables grow disproportionately slower as the 
> number of columns increase
> --
>
> Key: SPARK-23963
> URL: https://issues.apache.org/jira/browse/SPARK-23963
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Minor
> Fix For: 2.2.2, 2.3.1, 2.4.0
>
>
> TableReader gets disproportionately slower as the number of columns in the 
> query increase.
> For example, reading a table with 6000 columns is 4 times more expensive per 
> record than reading a table with 3000 columns, rather than twice as expensive.
> The increase in processing time is due to several Lists (fieldRefs, 
> fieldOrdinals, and unwrappers), each of which the reader accesses by column 
> number for each column in a record. Because each List has O\(n\) time for 
> lookup by column number, these lookups grow increasingly expensive as the 
> column count increases.
> When I patched the code to change those 3 Lists to Arrays, the query times 
> became proportional.
>  
>  
>  
>  
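
The slowdown is just List's linear indexed access versus Array's constant-time access; 
a small standalone sketch of the access pattern (not the TableReader code itself):
{code}
// Per-record column loop: list(i) walks i cells, array(i) is a direct lookup.
val numCols = 6000
val asList: List[Int] = List.tabulate(numCols)(identity)
val asArray: Array[Int] = Array.tabulate(numCols)(identity)

def sumByIndex(lookup: Int => Int): Long = {
  var acc = 0L
  var i = 0
  while (i < numCols) { acc += lookup(i); i += 1 }
  acc
}

sumByIndex(i => asList(i))   // ~numCols^2 / 2 cell hops per "record"
sumByIndex(i => asArray(i))  // ~numCols direct lookups per "record"
{code}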



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0

2018-04-18 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442838#comment-16442838
 ] 

Cody Koeninger commented on SPARK-18057:


[~cricket007] here's a branch with spark 2.1.1 / kafka 0.11.0.2 if you want to 
try it out and see if it fixes your issue  
[https://github.com/koeninger/spark-1/tree/kafka-dependency-update-2.1.1]

I assume you'll need to skip tests in order to build it; I didn't do anything 
more than update the dependency and fix compile errors.

> Update structured streaming kafka from 0.10.0.1 to 1.1.0
> 
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Priority: Major
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24010) Select from table needs read access on DB folder when storage based auth is enabled

2018-04-18 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442802#comment-16442802
 ] 

Reynold Xin commented on SPARK-24010:
-

It's better to throw an error message telling the user that the db doesn't exist, 
rather than that the table doesn't exist. Also, other catalog implementations in the 
future might not support this.

 

> Select from table needs read access on DB folder when storage based auth is 
> enabled
> ---
>
> Key: SPARK-24010
> URL: https://issues.apache.org/jira/browse/SPARK-24010
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Rui Li
>Priority: Major
>
> When HMS enables storage based authorization, SparkSQL requires read access 
> on DB folder in order to select from a table. Such requirement doesn't seem 
> necessary and is not required in Hive.
> The reason is when Analyzer tries to resolve a relation, it calls 
> [SessionCatalog::databaseExists|https://github.com/apache/spark/blob/v2.1.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L469].
>  This will call the metastore get_database API which will perform 
> authorization check.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24006) ExecutorAllocationManager.onExecutorAdded is an O(n) operation

2018-04-18 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442787#comment-16442787
 ] 

Hyukjin Kwon commented on SPARK-24006:
--

Yea, I wouldn't. I was just wondering how much it could affect in practice, and I 
would appreciate it. It's even better if there's an actual case to check.
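
For a rough feel of the effect, here is a standalone timing sketch (not the 
ExecutorAllocationManager code path): scanning the whole executor set on every add is 
quadratic in the number of executors, while constant work per add stays linear.
{code}
// Registers n executor ids; addOne is the per-add work. Returns elapsed milliseconds.
def timeIt(n: Int)(addOne: (Set[String], String) => Unit): Long = {
  var ids = Set.empty[String]
  val start = System.nanoTime()
  (1 to n).foreach { i =>
    val id = s"exec-$i"
    ids += id
    addOne(ids, id)
  }
  (System.nanoTime() - start) / 1000000
}

val fullScan = timeIt(10000) { (ids, _) => ids.foreach(_ => ()) }   // O(n) work per add
val constant = timeIt(10000) { (_, id) => require(id.nonEmpty) }    // O(1) work per add
println(s"full scan per add: $fullScan ms, constant work per add: $constant ms")
{code}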

> ExecutorAllocationManager.onExecutorAdded is an O(n) operation
> --
>
> Key: SPARK-24006
> URL: https://issues.apache.org/jira/browse/SPARK-24006
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Xianjin YE
>Priority: Major
>
> The ExecutorAllocationManager.onExecutorAdded is an O(n) operation. I 
> believe it will be a problem when scaling out with a large number of executors, 
> as it effectively makes adding N executors O(N^2) in time complexity.
>  
> I propose to invoke onExecutorIdle guarded by 
> {code:java}
> if (executorIds.size - executorsPendingToRemove.size >= minNumExecutors +1) { 
> // Since we only need to re-mark idle executors when near the lower bound
> executorIds.filter(listener.isExecutorIdle).foreach(onExecutorIdle)
> } else {
> onExecutorIdle(executorId)
> }{code}
> cc [~zsxwing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23963) Queries on text-based Hive tables grow disproportionately slower as the number of columns increase

2018-04-18 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442783#comment-16442783
 ] 

Ruslan Dautkhanov commented on SPARK-23963:
---

I thought it's just a matter of a Spark committer committing the same PR 
[https://github.com/apache/spark/pull/21043] to a different branch? Spark 2.2 in 
this case.
This PR gives a 24x improvement on 6000 columns, as you discovered, so I think 
this one-line change should be accepted into Spark 2.2 fairly easily.

> Queries on text-based Hive tables grow disproportionately slower as the 
> number of columns increase
> --
>
> Key: SPARK-23963
> URL: https://issues.apache.org/jira/browse/SPARK-23963
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Minor
> Fix For: 2.4.0
>
>
> TableReader gets disproportionately slower as the number of columns in the 
> query increase.
> For example, reading a table with 6000 columns is 4 times more expensive per 
> record than reading a table with 3000 columns, rather than twice as expensive.
> The increase in processing time is due to several Lists (fieldRefs, 
> fieldOrdinals, and unwrappers), each of which the reader accesses by column 
> number for each column in a record. Because each List has O\(n\) time for 
> lookup by column number, these lookups grow increasingly expensive as the 
> column count increases.
> When I patched the code to change those 3 Lists to Arrays, the query times 
> became proportional.
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24006) ExecutorAllocationManager.onExecutorAdded is an O(n) operation

2018-04-18 Thread Xianjin YE (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442781#comment-16442781
 ] 

Xianjin YE commented on SPARK-24006:


Could you leave this issue open for a while? At my company, we are planning to 
scale Spark applications with large data and a lot of executors in the near 
future.

I will report back once I have first-hand experience of how this impacts dynamic 
allocation.

> ExecutorAllocationManager.onExecutorAdded is an O(n) operation
> --
>
> Key: SPARK-24006
> URL: https://issues.apache.org/jira/browse/SPARK-24006
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Xianjin YE
>Priority: Major
>
> The ExecutorAllocationManager.onExecutorAdded is an O(n) operation. I 
> believe it will be a problem when scaling out with a large number of executors, 
> as it effectively makes adding N executors O(N^2) in time complexity.
>  
> I propose to invoke onExecutorIdle guarded by 
> {code:java}
> if (executorIds.size - executorsPendingToRemove.size >= minNumExecutors +1) { 
> // Since we only need to re-mark idle executors when near the lower bound
> executorIds.filter(listener.isExecutorIdle).foreach(onExecutorIdle)
> } else {
> onExecutorIdle(executorId)
> }{code}
> cc [~zsxwing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23933) High-order function: map(array, array) → map<K,V>

2018-04-18 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436976#comment-16436976
 ] 

Kazuaki Ishizaki edited comment on SPARK-23933 at 4/18/18 4:22 PM:
---

[~smilegator] [~ueshin] Could you give us your thoughts?
Spark SQL already uses the {{map}} function syntax for a different purpose.

Even if we limit the argument list to two arrays, this new feature may conflict 
with creating a map that has a single entry whose key and value are both arrays. 
Do you have any good ideas?


{code}
@ExpressionDescription(
  usage = "_FUNC_(key0, value0, key1, value1, ...) - Creates a map with the 
given key/value pairs.",
  examples = """
Examples:
  > SELECT _FUNC_(1.0, '2', 3.0, '4');
   {1.0:"2",3.0:"4"}
  """)
case class CreateMap(children: Seq[Expression]) extends Expression {
...
{code}


was (Author: kiszk):
[~smilegator] [~ueshin] Could you give us your thoughts?
Spark SQL already uses the {{map}} function syntax for a similar purpose.

Even if we limit the argument list to two arrays, this new feature may conflict 
with creating a map that has a single entry whose key and value are both arrays. 
Do you have any good ideas?


{code}
@ExpressionDescription(
  usage = "_FUNC_(key0, value0, key1, value1, ...) - Creates a map with the 
given key/value pairs.",
  examples = """
Examples:
  > SELECT _FUNC_(1.0, '2', 3.0, '4');
   {1.0:"2",3.0:"4"}
  """)
case class CreateMap(children: Seq[Expression]) extends Expression {
...
{code}

> High-order function: map(array, array) → map
> ---
>
> Key: SPARK-23933
> URL: https://issues.apache.org/jira/browse/SPARK-23933
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns a map created using the given key/value arrays.
> {noformat}
> SELECT map(ARRAY[1,3], ARRAY[2,4]); -- {1 -> 2, 3 -> 4}
> {noformat}
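
The requested behavior is essentially a positional zip of the two arrays. A minimal 
UDF sketch for int keys and values, only to illustrate the semantics (mapFromArrays 
is a hypothetical name, not a proposed implementation):
{code}
import org.apache.spark.sql.functions.udf

// Pairs keys(i) with values(i); both arrays must have the same length.
val mapFromArrays = udf { (keys: Seq[Int], values: Seq[Int]) =>
  require(keys.length == values.length, "key and value arrays must have the same length")
  keys.zip(values).toMap
}
// map(ARRAY[1,3], ARRAY[2,4]) would then yield Map(1 -> 2, 3 -> 4).
{code}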



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23913) High-order function: array_intersect(x, y) → array

2018-04-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23913:


Assignee: Apache Spark

> High-order function: array_intersect(x, y) → array
> --
>
> Key: SPARK-23913
> URL: https://issues.apache.org/jira/browse/SPARK-23913
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Returns an array of the elements in the intersection of x and y, without 
> duplicates.
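
As with array_except, the semantics can be approximated with a UDF in the meantime. 
A minimal sketch, assuming array<int> columns (arrayIntersect is a hypothetical name):
{code}
import org.apache.spark.sql.functions.{col, udf}

// Stand-in for array_intersect(x, y): elements present in both arrays, de-duplicated.
val arrayIntersect = udf { (x: Seq[Int], y: Seq[Int]) =>
  if (x == null || y == null) null
  else x.distinct.filter(y.contains)
}
// Usage: df.select(arrayIntersect(col("x"), col("y")))
{code}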



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23913) High-order function: array_intersect(x, y) → array

2018-04-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23913:


Assignee: (was: Apache Spark)

> High-order function: array_intersect(x, y) → array
> --
>
> Key: SPARK-23913
> URL: https://issues.apache.org/jira/browse/SPARK-23913
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Returns an array of the elements in the intersection of x and y, without 
> duplicates.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23913) High-order function: array_intersect(x, y) → array

2018-04-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442766#comment-16442766
 ] 

Apache Spark commented on SPARK-23913:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21102

> High-order function: array_intersect(x, y) → array
> --
>
> Key: SPARK-23913
> URL: https://issues.apache.org/jira/browse/SPARK-23913
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Returns an array of the elements in the intersection of x and y, without 
> duplicates.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24006) ExecutorAllocationManager.onExecutorAdded is an O(n) operation

2018-04-18 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442735#comment-16442735
 ] 

Hyukjin Kwon commented on SPARK-24006:
--

It would be good to make a small test with the logic above and see how much it 
improves. If the difference is negligible in practice, I would leave this resolved.

> ExecutorAllocationManager.onExecutorAdded is an O(n) operation
> --
>
> Key: SPARK-24006
> URL: https://issues.apache.org/jira/browse/SPARK-24006
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Xianjin YE
>Priority: Major
>
> The ExecutorAllocationManager.onExecutorAdded is an O(n) operation. I 
> believe it will be a problem when scaling out with a large number of executors, 
> as it effectively makes adding N executors O(N^2) in time complexity.
>  
> I propose to invoke onExecutorIdle guarded by 
> {code:java}
> if (executorIds.size - executorsPendingToRemove.size >= minNumExecutors +1) { 
> // Since we only need to re-mark idle executors when near the lower bound
> executorIds.filter(listener.isExecutorIdle).foreach(onExecutorIdle)
> } else {
> onExecutorIdle(executorId)
> }{code}
> cc [~zsxwing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24015) Does method SchedulerBackend.isReady is really well implemented?

2018-04-18 Thread chenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenk updated SPARK-24015:
--
  Docs Text:   (was: The mothod isReady in the interface 
SchedulerBackend,which implement by CoarseGrainedSchedulerBackend , as well as 
has default value is True.

And, the implemention in  CoarseGrainedSchedulerBackend use 
sufficientResourcesRegistered mothed to check it , which default value is True 
too.

So,What does the meaning of the isReady method,because it will alway return 
true actually.)
Description: 
The method *isReady* in the *SchedulerBackend* interface, which is implemented by 
*CoarseGrainedSchedulerBackend*, has a default value of true.

And the implementation in *CoarseGrainedSchedulerBackend* uses the 
*sufficientResourcesRegistered* method to check it, which also defaults to 
true.

So what is the meaning of the *isReady* method, given that it will always return 
true in practice?

> Does method SchedulerBackend.isReady is really well implemented?
> 
>
> Key: SPARK-24015
> URL: https://issues.apache.org/jira/browse/SPARK-24015
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.0
>Reporter: chenk
>Priority: Minor
>
> The method *isReady* in the *SchedulerBackend* interface, which is implemented by 
> *CoarseGrainedSchedulerBackend*, has a default value of true.
> And the implementation in *CoarseGrainedSchedulerBackend* uses the 
> *sufficientResourcesRegistered* method to check it, which also defaults to 
> true.
> So what is the meaning of the *isReady* method, given that it will always 
> return true in practice?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24016) Yarn does not update node blacklist in static allocation

2018-04-18 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-24016:


 Summary: Yarn does not update node blacklist in static allocation
 Key: SPARK-24016
 URL: https://issues.apache.org/jira/browse/SPARK-24016
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler, YARN
Affects Versions: 2.3.0
Reporter: Imran Rashid


Task-based blacklisting keeps track of bad nodes, and updates YARN with that 
set of nodes so that Spark will not receive more containers on those nodes.  
However, that only happens with dynamic allocation.  Though it's far more 
important with dynamic allocation, even with static allocation this matters; if 
executors die, or if the cluster was too busy at the original resource request 
to give all the containers, the Spark application will add new containers in 
the middle.  And we want an updated node blacklist for that.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24015) Does method SchedulerBackend.isReady is really well implemented?

2018-04-18 Thread chenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenk updated SPARK-24015:
--
Summary: Does method SchedulerBackend.isReady is really well implemented?  
(was: Does method SchedulerBackend.isReady is really well implemented)

> Does method SchedulerBackend.isReady is really well implemented?
> 
>
> Key: SPARK-24015
> URL: https://issues.apache.org/jira/browse/SPARK-24015
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.0
>Reporter: chenk
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24015) Does method SchedulerBackend.isReady is really well implemented

2018-04-18 Thread chenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenk updated SPARK-24015:
--
Summary: Does method SchedulerBackend.isReady is really well implemented  
(was: Does method BackendReady.isReady is really well implemented)

> Does method SchedulerBackend.isReady is really well implemented
> ---
>
> Key: SPARK-24015
> URL: https://issues.apache.org/jira/browse/SPARK-24015
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.0
>Reporter: chenk
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24007) EqualNullSafe for FloatType and DoubleType might generate a wrong result by codegen.

2018-04-18 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24007.
-
   Resolution: Fixed
Fix Version/s: 2.4.0
   2.3.1
   2.2.2

> EqualNullSafe for FloatType and DoubleType might generate a wrong result by 
> codegen.
> 
>
> Key: SPARK-24007
> URL: https://issues.apache.org/jira/browse/SPARK-24007
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
>  Labels: correctness
> Fix For: 2.2.2, 2.3.1, 2.4.0
>
>
> {{EqualNullSafe}} for {{FloatType}} and {{DoubleType}} might generate a wrong 
> result by codegen.
> {noformat}
> scala> val df = Seq((Some(-1.0d), None), (None, Some(-1.0d))).toDF()
> df: org.apache.spark.sql.DataFrame = [_1: double, _2: double]
> scala> df.show()
> +++
> |  _1|  _2|
> +++
> |-1.0|null|
> |null|-1.0|
> +++
> scala> df.filter("_1 <=> _2").show()
> +++
> |  _1|  _2|
> +++
> |-1.0|null|
> |null|-1.0|
> +++
> {noformat}
> The result should be empty but the result remains two rows.
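
As a cross-check of the intended semantics, null-safe equality spelled out with plain 
Column operators on the df from the reproduction above should give the expected empty 
result (sketch only; it bypasses the EqualNullSafe codegen path in question):
{code}
import org.apache.spark.sql.functions.col

// Both null, or both non-null and equal: the definition of <=>.
val nullSafeEq = (col("_1").isNull && col("_2").isNull) || (col("_1") === col("_2"))
df.filter(nullSafeEq).show()  // expected: no rows
{code}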



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24015) Does method BackendReady.isReady is really well implemented

2018-04-18 Thread chenk (JIRA)
chenk created SPARK-24015:
-

 Summary: Does method BackendReady.isReady is really well 
implemented
 Key: SPARK-24015
 URL: https://issues.apache.org/jira/browse/SPARK-24015
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 2.3.0
Reporter: chenk






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten

2018-04-18 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442568#comment-16442568
 ] 

Wenchen Fan edited comment on SPARK-23989 at 4/18/18 2:47 PM:
--

OK now I see the problem, `ShuffleExchangeExec.needToCopyObjectsBeforeShuffle` 
doesn't catch all the cases, so we may produce wrong result.


was (Author: cloud_fan):
OK now I see the problem, `ShuffleExchangeExec.needToCopyObjectsBeforeShuffle` 
doesn't catch all the cases, so we may reproduce wrong result.

> When using `SortShuffleWriter`, the data will be overwritten
> 
>
> Key: SPARK-23989
> URL: https://issues.apache.org/jira/browse/SPARK-23989
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Critical
>
> When using `SortShuffleWriter`, we only insert 'AnyRef' into 
> 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'.
> For this function:
> override def write(records: Iterator[Product2[K, V]])
> the value of 'records' is `UnsafeRow`, so the value will be overwritten.
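
The underlying hazard is plain object reuse: if the writer buffers references to a 
mutable row that the upstream iterator recycles, every buffered entry ends up aliasing 
the last row. A stripped-down sketch of that failure mode (not the Spark classes):
{code}
import scala.collection.mutable.ArrayBuffer

// A mutable "row" that the iterator reuses for every element, the way UnsafeRow can be reused.
final class MutableRow(var value: Int)

val reused = new MutableRow(0)
val records = Iterator.tabulate(3) { i => reused.value = i; reused }

// Buffering the references directly: every entry aliases the same object, so the data
// appears "overwritten" by the last record. A defensive copy before buffering avoids this.
val buffered = ArrayBuffer.empty[MutableRow]
records.foreach(buffered += _)
println(buffered.map(_.value))  // ArrayBuffer(2, 2, 2)
{code}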



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten

2018-04-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23989:


Assignee: (was: Apache Spark)

> When using `SortShuffleWriter`, the data will be overwritten
> 
>
> Key: SPARK-23989
> URL: https://issues.apache.org/jira/browse/SPARK-23989
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Critical
>
> When using `SortShuffleWriter`, we only insert 'AnyRef' into 
> 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'.
> For this function:
> override def write(records: Iterator[Product2[K, V]])
> the value of 'records' is `UnsafeRow`, so the value will be overwritten.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten

2018-04-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23989:


Assignee: Apache Spark

> When using `SortShuffleWriter`, the data will be overwritten
> 
>
> Key: SPARK-23989
> URL: https://issues.apache.org/jira/browse/SPARK-23989
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Assignee: Apache Spark
>Priority: Critical
>
> When using `SortShuffleWriter`, we only insert 'AnyRef' into 
> 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'.
> For this function:
> override def write(records: Iterator[Product2[K, V]])
> the value of 'records' is `UnsafeRow`, so the value will be overwritten.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten

2018-04-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442620#comment-16442620
 ] 

Apache Spark commented on SPARK-23989:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/21101

> When using `SortShuffleWriter`, the data will be overwritten
> 
>
> Key: SPARK-23989
> URL: https://issues.apache.org/jira/browse/SPARK-23989
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Critical
>
> When using `SortShuffleWriter`, we only insert 'AnyRef' into 
> 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'.
> For this function:
> override def write(records: Iterator[Product2[K, V]])
> the value of 'records' is `UnsafeRow`, so the value will be overwritten.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24012) Union of map and other compatible column

2018-04-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24012:


Assignee: Apache Spark

> Union of map and other compatible column
> 
>
> Key: SPARK-24012
> URL: https://issues.apache.org/jira/browse/SPARK-24012
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.1
>Reporter: Lijia Liu
>Assignee: Apache Spark
>Priority: Major
>
> Union of map and other compatible column result in unresolved operator 
> 'Union; exception
> Reproduction
> spark-sql>select map(1,2), 'str' union all select map(1,2,3,null), 1
> Output:
> Error in query: unresolved operator 'Union;;
> 'Union
> :- Project [map(1, 2) AS map(1, 2)#106, str AS str#107]
> : +- OneRowRelation$
> +- Project [map(1, cast(2 as int), 3, cast(null as int)) AS map(1, CAST(2 AS 
> INT), 3, CAST(NULL AS INT))#109, 1 AS 1#108]
>  +- OneRowRelation$
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24012) Union of map and other compatible column

2018-04-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24012:


Assignee: (was: Apache Spark)

> Union of map and other compatible column
> 
>
> Key: SPARK-24012
> URL: https://issues.apache.org/jira/browse/SPARK-24012
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.1
>Reporter: Lijia Liu
>Priority: Major
>
> Union of map and other compatible column result in unresolved operator 
> 'Union; exception
> Reproduction
> spark-sql>select map(1,2), 'str' union all select map(1,2,3,null), 1
> Output:
> Error in query: unresolved operator 'Union;;
> 'Union
> :- Project [map(1, 2) AS map(1, 2)#106, str AS str#107]
> : +- OneRowRelation$
> +- Project [map(1, cast(2 as int), 3, cast(null as int)) AS map(1, CAST(2 AS 
> INT), 3, CAST(NULL AS INT))#109, 1 AS 1#108]
>  +- OneRowRelation$
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24012) Union of map and other compatible column

2018-04-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442615#comment-16442615
 ] 

Apache Spark commented on SPARK-24012:
--

User 'liutang123' has created a pull request for this issue:
https://github.com/apache/spark/pull/21100

> Union of map and other compatible column
> 
>
> Key: SPARK-24012
> URL: https://issues.apache.org/jira/browse/SPARK-24012
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.1
>Reporter: Lijia Liu
>Priority: Major
>
> Union of map and other compatible column result in unresolved operator 
> 'Union; exception
> Reproduction
> spark-sql>select map(1,2), 'str' union all select map(1,2,3,null), 1
> Output:
> Error in query: unresolved operator 'Union;;
> 'Union
> :- Project [map(1, 2) AS map(1, 2)#106, str AS str#107]
> : +- OneRowRelation$
> +- Project [map(1, cast(2 as int), 3, cast(null as int)) AS map(1, CAST(2 AS 
> INT), 3, CAST(NULL AS INT))#109, 1 AS 1#108]
>  +- OneRowRelation$
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24012) Union of map and other compatible column

2018-04-18 Thread Lijia Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lijia Liu updated SPARK-24012:
--
Description: 
Union of map and other compatible column result in unresolved operator 'Union; 
exception

Reproduction

spark-sql>select map(1,2), 'str' union all select map(1,2,3,null), 1

Output:

Error in query: unresolved operator 'Union;;
'Union
:- Project [map(1, 2) AS map(1, 2)#106, str AS str#107]
: +- OneRowRelation$
+- Project [map(1, cast(2 as int), 3, cast(null as int)) AS map(1, CAST(2 AS 
INT), 3, CAST(NULL AS INT))#109, 1 AS 1#108]
 +- OneRowRelation$

 

  was:
Union of map and other compatible column result in unresolved operator 'Union; 
exception

Reproduction

spark-sql>select map(1,2), 'str' union all select map(1,2,3,null), 1

Output:

'Union

:- Project map(1, 2) AS map(1, 2)#6655, str AS str#6656

:  +- OneRowRelation

+- Project map(1, cast(2 as int), 3, cast(null as int)) AS map(1, CAST(2 AS 
INT), 3, CAST(NULL AS INT))#6658, 1 AS 1#6657

   +- OneRowRelation

 


> Union of map and other compatible column
> 
>
> Key: SPARK-24012
> URL: https://issues.apache.org/jira/browse/SPARK-24012
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.1
>Reporter: Lijia Liu
>Priority: Major
>
> Union of map and other compatible column result in unresolved operator 
> 'Union; exception
> Reproduction
> spark-sql>select map(1,2), 'str' union all select map(1,2,3,null), 1
> Output:
> Error in query: unresolved operator 'Union;;
> 'Union
> :- Project [map(1, 2) AS map(1, 2)#106, str AS str#107]
> : +- OneRowRelation$
> +- Project [map(1, cast(2 as int), 3, cast(null as int)) AS map(1, CAST(2 AS 
> INT), 3, CAST(NULL AS INT))#109, 1 AS 1#108]
>  +- OneRowRelation$
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23875) Create IndexedSeq wrapper for ArrayData

2018-04-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442602#comment-16442602
 ] 

Apache Spark commented on SPARK-23875:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/21099

> Create IndexedSeq wrapper for ArrayData
> ---
>
> Key: SPARK-23875
> URL: https://issues.apache.org/jira/browse/SPARK-23875
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> We don't have a good way to sequentially access {{UnsafeArrayData}} with a 
> common interface such as Seq. An example is {{MapObject}} where we need to 
> access several sequence collection types together. But {{UnsafeArrayData}} 
> doesn't implement {{ArrayData.array}}. Calling {{toArray}} will copy the 
> entire array. We can provide an {{IndexedSeq}} wrapper for {{ArrayData}}, so 
> we can avoid copying the entire array.
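
A rough sketch of the idea (class and method names here are illustrative, not 
necessarily the merged API): wrap {{ArrayData}} in an {{IndexedSeq}} whose 
{{apply}} reads one element at a time instead of materializing the whole array.

{code}
import org.apache.spark.sql.catalyst.util.ArrayData
import org.apache.spark.sql.types.DataType

// Read-only IndexedSeq view over ArrayData; elements are fetched lazily on access,
// so the backing array is never copied as a whole.
class ArrayDataIndexedSeq[T](arrayData: ArrayData, elementType: DataType)
    extends IndexedSeq[T] {
  override def length: Int = arrayData.numElements()
  override def apply(idx: Int): T = arrayData.get(idx, elementType).asInstanceOf[T]
}
{code}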



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten

2018-04-18 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442568#comment-16442568
 ] 

Wenchen Fan commented on SPARK-23989:
-

OK, now I see the problem: `ShuffleExchangeExec.needToCopyObjectsBeforeShuffle` 
doesn't catch all the cases, so we may produce wrong results.
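
To make the object-reuse hazard concrete, here is a minimal standalone Scala sketch 
(illustrative only, not the Spark shuffle code path): an iterator that reuses one 
mutable object, the way `UnsafeRow` is reused, must have each record copied before 
it is buffered.

{code}
import scala.collection.mutable.ArrayBuffer

// Stand-in for a mutable, reused record such as UnsafeRow.
final class MutableRecord(var value: Int) {
  def copyRecord(): MutableRecord = new MutableRecord(value)
}

object ReuseHazard {
  def main(args: Array[String]): Unit = {
    val reused = new MutableRecord(0)
    // The iterator hands back the same object for every element, mutating it in place.
    val records = Iterator.tabulate(3) { i => reused.value = i; reused }

    val buffered = ArrayBuffer.empty[MutableRecord]
    // Copy on insert; without copyRecord() the buffer would hold three references to
    // one object whose final value is 2, i.e. earlier records get overwritten.
    records.foreach(r => buffered += r.copyRecord())
    println(buffered.map(_.value).mkString(", "))  // prints 0, 1, 2
  }
}
{code}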

> When using `SortShuffleWriter`, the data will be overwritten
> 
>
> Key: SPARK-23989
> URL: https://issues.apache.org/jira/browse/SPARK-23989
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Critical
>
> When using `SortShuffleWriter`, we only insert 'AnyRef' into 
> 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'.
> For this function:
> override def write(records: Iterator[Product2[K, V]])
> the value of 'records' is `UnsafeRow`, so the value will be overwritten.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24014) Add onStreamingStarted method to StreamingListener

2018-04-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24014:


Assignee: (was: Apache Spark)

> Add onStreamingStarted method to StreamingListener
> --
>
> Key: SPARK-24014
> URL: https://issues.apache.org/jira/browse/SPARK-24014
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Priority: Trivial
>
> The {{StreamingListener}} on the PySpark side seems to lack an 
> {{onStreamingStarted}} method.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24014) Add onStreamingStarted method to StreamingListener

2018-04-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442553#comment-16442553
 ] 

Apache Spark commented on SPARK-24014:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/21098

> Add onStreamingStarted method to StreamingListener
> --
>
> Key: SPARK-24014
> URL: https://issues.apache.org/jira/browse/SPARK-24014
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Priority: Trivial
>
> The {{StreamingListener}} on the PySpark side seems to lack an 
> {{onStreamingStarted}} method.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24014) Add onStreamingStarted method to StreamingListener

2018-04-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24014:


Assignee: Apache Spark

> Add onStreamingStarted method to StreamingListener
> --
>
> Key: SPARK-24014
> URL: https://issues.apache.org/jira/browse/SPARK-24014
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>Priority: Trivial
>
> The {{StreamingListener}} on the PySpark side seems to lack an 
> {{onStreamingStarted}} method.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24014) Add onStreamingStarted method to StreamingListener

2018-04-18 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-24014:
---

 Summary: Add onStreamingStarted method to StreamingListener
 Key: SPARK-24014
 URL: https://issues.apache.org/jira/browse/SPARK-24014
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.4.0
Reporter: Liang-Chi Hsieh


The {{StreamingListener}} on the PySpark side seems to lack an 
{{onStreamingStarted}} method.
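
For reference, the Scala-side listener already exposes this hook; a sketch of what 
the PySpark listener would mirror (assuming the usual Scala-side signature) is:

{code}
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerStreamingStarted}

// Scala-side callback that the PySpark StreamingListener would need to mirror.
class StartLogger extends StreamingListener {
  override def onStreamingStarted(streamingStarted: StreamingListenerStreamingStarted): Unit =
    println(s"streaming started at ${streamingStarted.time}")
}
{code}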



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten

2018-04-18 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442514#comment-16442514
 ] 

Wenchen Fan commented on SPARK-23989:
-

it goes to `SortShuffleWriter` and then? We get the correct result, don't we?

> When using `SortShuffleWriter`, the data will be overwritten
> 
>
> Key: SPARK-23989
> URL: https://issues.apache.org/jira/browse/SPARK-23989
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Critical
>
> When using `SortShuffleWriter`, we only insert 'AnyRef' into 
> 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'.
> For this function:
> override def write(records: Iterator[Product2[K, V]])
> the value of 'records' is `UnsafeRow`, so the value will be overwritten.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23830) Spark on YARN in cluster deploy mode fail with NullPointerException when a Spark application is a Scala class not object

2018-04-18 Thread Jacek Laskowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442488#comment-16442488
 ] 

Jacek Laskowski commented on SPARK-23830:
-

It's about how easy it is to find out that the issue is `class` vs `object`. If a 
single extra check could report that to end users and help them, I think it's 
worth it.
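
A minimal illustration of such a check (standalone sketch with made-up names, not 
the actual ApplicationMaster patch):

{code}
import java.lang.reflect.Modifier

object MainMethodCheck {
  // Fail fast with a clear message when the user application is a Scala class whose
  // main method is not static, instead of hitting an NPE at invocation time.
  def checkAndInvoke(clazz: Class[_], args: Array[String]): Unit = {
    val mainMethod = clazz.getMethod("main", classOf[Array[String]])
    if (!Modifier.isStatic(mainMethod.getModifiers)) {
      throw new IllegalArgumentException(
        s"${clazz.getName} should be a Scala object with a main method, not a class")
    }
    mainMethod.invoke(null, args)
  }
}
{code}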

> Spark on YARN in cluster deploy mode fail with NullPointerException when a 
> Spark application is a Scala class not object
> 
>
> Key: SPARK-23830
> URL: https://issues.apache.org/jira/browse/SPARK-23830
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> As reported on StackOverflow in [Why does Spark on YARN fail with “Exception 
> in thread ”Driver“ 
> java.lang.NullPointerException”?|https://stackoverflow.com/q/49564334/1305344]
>  the following Spark application fails with {{Exception in thread "Driver" 
> java.lang.NullPointerException}} with Spark on YARN in cluster deploy mode:
> {code}
> class MyClass {
>   def main(args: Array[String]): Unit = {
> val c = new MyClass()
> c.process()
>   }
>   def process(): Unit = {
> val sparkConf = new SparkConf().setAppName("my-test")
> val sparkSession: SparkSession = 
> SparkSession.builder().config(sparkConf).getOrCreate()
> import sparkSession.implicits._
> 
>   }
>   ...
> }
> {code}
> The exception is as follows:
> {code}
> 18/03/29 20:07:52 INFO ApplicationMaster: Starting the user application in a 
> separate Thread
> 18/03/29 20:07:52 INFO ApplicationMaster: Waiting for spark context 
> initialization...
> Exception in thread "Driver" java.lang.NullPointerException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
> {code}
> I think the reason for the exception {{Exception in thread "Driver" 
> java.lang.NullPointerException}} is due to [the following 
> code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L700-L701]:
> {code}
> val mainMethod = userClassLoader.loadClass(args.userClass)
>   .getMethod("main", classOf[Array[String]])
> {code}
> So when {{mainMethod}} is used in [the following 
> code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L706]
>  it simply gives NPE.
> {code}
> mainMethod.invoke(null, userArgs.toArray)
> {code}
> That could easily be avoided with an extra check on {{mainMethod}} and a message 
> telling the user what the likely reason is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24013) ApproximatePercentile grinds to a halt on sorted input.

2018-04-18 Thread Juliusz Sompolski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442477#comment-16442477
 ] 

Juliusz Sompolski commented on SPARK-24013:
---

This hits when trying to create histogram statistics (SPARK-21975) on columns 
like monotonically increasing id - histograms cannot be created in reasonable 
time.

> ApproximatePercentile grinds to a halt on sorted input.
> ---
>
> Key: SPARK-24013
> URL: https://issues.apache.org/jira/browse/SPARK-24013
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> Running
> {code}
> sql("select approx_percentile(rid, array(0.1)) from (select rand() as rid 
> from range(1000))").collect()
> {code}
> takes 7 seconds, while
> {code}
> sql("select approx_percentile(id, array(0.1)) from range(1000)").collect()
> {code}
> grinds to a halt - processes the first million rows quickly, and then slows 
> down to a few thousands rows / second (4m rows processed after 20 minutes).
> Thread dumps show that it spends time in QuantileSummary.compress.
> Seems it hits some edge case inefficiency when dealing with sorted data?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24013) ApproximatePercentile grinds to a halt on sorted input.

2018-04-18 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-24013:
-

 Summary: ApproximatePercentile grinds to a halt on sorted input.
 Key: SPARK-24013
 URL: https://issues.apache.org/jira/browse/SPARK-24013
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Juliusz Sompolski


Running
{code}
sql("select approx_percentile(rid, array(0.1)) from (select rand() as rid from 
range(1000))").collect()
{code}
takes 7 seconds, while
{code}
sql("select approx_percentile(id, array(0.1)) from range(1000)").collect()
{code}
grinds to a halt - processes the first million rows quickly, and then slows 
down to a few thousands rows / second (4m rows processed after 20 minutes).

Thread dumps show that it spends time in QuantileSummary.compress.
Seems it hits some edge case inefficiency when dealing with sorted data?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics

2018-04-18 Thread assia ydroudj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442379#comment-16442379
 ] 

assia ydroudj commented on SPARK-23206:
---

[~elu], thank you for the shared docs, it works for me.

Is there a final PR to get the executor metrics?

 

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: ExecutorsTab.png, ExecutorsTab2.png, 
> MemoryTuningMetricsDesignDoc.pdf, SPARK-23206 Design Doc.pdf, StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23990) Instruments logging improvements - ML regression package

2018-04-18 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442370#comment-16442370
 ] 

Weichen Xu commented on SPARK-23990:


[~josephkb] I agree that option 2b looks better. Updating PR...

> Instruments logging improvements - ML regression package
> 
>
> Key: SPARK-23990
> URL: https://issues.apache.org/jira/browse/SPARK-23990
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
> Environment: Instruments logging improvements - ML regression package
>Reporter: Weichen Xu
>Priority: Major
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14682) Provide evaluateEachIteration method or equivalent for spark.ml GBTs

2018-04-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442368#comment-16442368
 ] 

Apache Spark commented on SPARK-14682:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/21097

> Provide evaluateEachIteration method or equivalent for spark.ml GBTs
> 
>
> Key: SPARK-14682
> URL: https://issues.apache.org/jira/browse/SPARK-14682
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> spark.mllib GradientBoostedTrees provide an evaluateEachIteration method.  We 
> should provide that or an equivalent for spark.ml.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14682) Provide evaluateEachIteration method or equivalent for spark.ml GBTs

2018-04-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14682:


Assignee: (was: Apache Spark)

> Provide evaluateEachIteration method or equivalent for spark.ml GBTs
> 
>
> Key: SPARK-14682
> URL: https://issues.apache.org/jira/browse/SPARK-14682
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> spark.mllib GradientBoostedTrees provide an evaluateEachIteration method.  We 
> should provide that or an equivalent for spark.ml.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14682) Provide evaluateEachIteration method or equivalent for spark.ml GBTs

2018-04-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14682:


Assignee: Apache Spark

> Provide evaluateEachIteration method or equivalent for spark.ml GBTs
> 
>
> Key: SPARK-14682
> URL: https://issues.apache.org/jira/browse/SPARK-14682
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>
> spark.mllib GradientBoostedTrees provide an evaluateEachIteration method.  We 
> should provide that or an equivalent for spark.ml.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24012) Union of map and other compatible column

2018-04-18 Thread Lijia Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lijia Liu updated SPARK-24012:
--
Description: 
Union of map and other compatible column result in unresolved operator 'Union; 
exception

Reproduction

spark-sql>select map(1,2), 'str' union all select map(1,2,3,null), 1

Output:

'Union

:- Project map(1, 2) AS map(1, 2)#6655, str AS str#6656

:  +- OneRowRelation

+- Project map(1, cast(2 as int), 3, cast(null as int)) AS map(1, CAST(2 AS 
INT), 3, CAST(NULL AS INT))#6658, 1 AS 1#6657

   +- OneRowRelation

 

  was:
Union of map and other compatible column result in unresolved operator 'Union; 
exception

 

Reproduction

spark-sql>*select map(1,2), 'str' union all select map(1,2,3,null), 1*

Output:

'Union

:- Project [map(1, 2) AS map(1, 2)#6655, str AS str#6656]

:  +- OneRowRelation

+- Project [map(1, cast(2 as int), 3, cast(null as int)) AS map(1, CAST(2 AS 
INT), 3, CAST(NULL AS INT))#6658, 1 AS 1#6657]

   +- OneRowRelation


> Union of map and other compatible column
> 
>
> Key: SPARK-24012
> URL: https://issues.apache.org/jira/browse/SPARK-24012
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.1
>Reporter: Lijia Liu
>Priority: Major
>
> Union of map and other compatible column result in unresolved operator 
> 'Union; exception
> Reproduction
> spark-sql>select map(1,2), 'str' union all select map(1,2,3,null), 1
> Output:
> 'Union
> :- Project map(1, 2) AS map(1, 2)#6655, str AS str#6656
> :  +- OneRowRelation
> +- Project map(1, cast(2 as int), 3, cast(null as int)) AS map(1, CAST(2 AS 
> INT), 3, CAST(NULL AS INT))#6658, 1 AS 1#6657
>    +- OneRowRelation
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24012) Union of map and other compatible column

2018-04-18 Thread Lijia Liu (JIRA)
Lijia Liu created SPARK-24012:
-

 Summary: Union of map and other compatible column
 Key: SPARK-24012
 URL: https://issues.apache.org/jira/browse/SPARK-24012
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.1, 2.1.0
Reporter: Lijia Liu


Union of map and other compatible column result in unresolved operator 'Union; 
exception

 

Reproduction

spark-sql>*select map(1,2), 'str' union all select map(1,2,3,null), 1*

Output:

'Union

:- Project [map(1, 2) AS map(1, 2)#6655, str AS str#6656]

:  +- OneRowRelation

+- Project [map(1, cast(2 as int), 3, cast(null as int)) AS map(1, CAST(2 AS 
INT), 3, CAST(NULL AS INT))#6658, 1 AS 1#6657]

   +- OneRowRelation



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24011) Cache rdd's immediate parent ShuffleDependencies to accelerate getShuffleDependencies()

2018-04-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442259#comment-16442259
 ] 

Apache Spark commented on SPARK-24011:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/21096

> Cache rdd's immediate parent ShuffleDependencies to accelerate 
> getShuffleDependencies()
> ---
>
> Key: SPARK-24011
> URL: https://issues.apache.org/jira/browse/SPARK-24011
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: wuyi
>Priority: Minor
>
> When creating stages for jobs, we need to find an RDD's (except the final RDD's) 
> immediate parent ShuffleDependencies via getShuffleDependencies() at least twice 
> (first in getMissingAncestorShuffleDependencies(), and second in 
> getOrCreateParentStages()).
> So, we can cache the result the first time we call getShuffleDependencies().
> This helps cut the time spent when there are many NarrowDependencies between the 
> RDD and its immediate parent ShuffleDependencies, or when the RDD has a large 
> number of immediate parent ShuffleDependencies.
>  
> There's an exception for checkpointed RDDs: if an RDD is checkpointed, its 
> immediate parent ShuffleDependencies should be adjusted to empty.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24011) Cache rdd's immediate parent ShuffleDependencies to accelerate getShuffleDependencies()

2018-04-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24011:


Assignee: Apache Spark

> Cache rdd's immediate parent ShuffleDependencies to accelerate 
> getShuffleDependencies()
> ---
>
> Key: SPARK-24011
> URL: https://issues.apache.org/jira/browse/SPARK-24011
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: wuyi
>Assignee: Apache Spark
>Priority: Minor
>
> When creating stages for jobs, we need to find an RDD's (except the final RDD's) 
> immediate parent ShuffleDependencies via getShuffleDependencies() at least twice 
> (first in getMissingAncestorShuffleDependencies(), and second in 
> getOrCreateParentStages()).
> So, we can cache the result the first time we call getShuffleDependencies().
> This helps cut the time spent when there are many NarrowDependencies between the 
> RDD and its immediate parent ShuffleDependencies, or when the RDD has a large 
> number of immediate parent ShuffleDependencies.
>  
> There's an exception for checkpointed RDDs: if an RDD is checkpointed, its 
> immediate parent ShuffleDependencies should be adjusted to empty.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24011) Cache rdd's immediate parent ShuffleDependencies to accelerate getShuffleDependencies()

2018-04-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24011:


Assignee: (was: Apache Spark)

> Cache rdd's immediate parent ShuffleDependencies to accelerate 
> getShuffleDependencies()
> ---
>
> Key: SPARK-24011
> URL: https://issues.apache.org/jira/browse/SPARK-24011
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: wuyi
>Priority: Minor
>
> When creating stages for jobs, we need to find an RDD's (except the final RDD's) 
> immediate parent ShuffleDependencies via getShuffleDependencies() at least twice 
> (first in getMissingAncestorShuffleDependencies(), and second in 
> getOrCreateParentStages()).
> So, we can cache the result the first time we call getShuffleDependencies().
> This helps cut the time spent when there are many NarrowDependencies between the 
> RDD and its immediate parent ShuffleDependencies, or when the RDD has a large 
> number of immediate parent ShuffleDependencies.
>  
> There's an exception for checkpointed RDDs: if an RDD is checkpointed, its 
> immediate parent ShuffleDependencies should be adjusted to empty.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24011) Cache rdd's immediate parent ShuffleDependencies to accelerate getShuffleDependencies()

2018-04-18 Thread wuyi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-24011:
-
Summary: Cache rdd's immediate parent ShuffleDependencies to accelerate 
getShuffleDependencies()  (was: Cache rdd's immediate parent ShuffleDependency 
to accelerate getShuffleDependencies())

> Cache rdd's immediate parent ShuffleDependencies to accelerate 
> getShuffleDependencies()
> ---
>
> Key: SPARK-24011
> URL: https://issues.apache.org/jira/browse/SPARK-24011
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: wuyi
>Priority: Minor
>
> When creating stages for jobs, we need to find an RDD's (except the final RDD's) 
> immediate parent ShuffleDependencies via getShuffleDependencies() at least twice 
> (first in getMissingAncestorShuffleDependencies(), and second in 
> getOrCreateParentStages()).
> So, we can cache the result the first time we call getShuffleDependencies().
> This helps cut the time spent when there are many NarrowDependencies between the 
> RDD and its immediate parent ShuffleDependencies, or when the RDD has a large 
> number of immediate parent ShuffleDependencies.
>  
> There's an exception for checkpointed RDDs: if an RDD is checkpointed, its 
> immediate parent ShuffleDependencies should be adjusted to empty.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24011) Cache rdd's immediate parent ShuffleDependency to accelerate getShuffleDependencies()

2018-04-18 Thread wuyi (JIRA)
wuyi created SPARK-24011:


 Summary: Cache rdd's immediate parent ShuffleDependency to 
accelerate getShuffleDependencies()
 Key: SPARK-24011
 URL: https://issues.apache.org/jira/browse/SPARK-24011
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: wuyi


When creating stages for jobs, we need to find an RDD's (except the final RDD's) 
immediate parent ShuffleDependencies via getShuffleDependencies() at least twice 
(first in getMissingAncestorShuffleDependencies(), and second in 
getOrCreateParentStages()).

So, we can cache the result the first time we call getShuffleDependencies().

This helps cut the time spent when there are many NarrowDependencies between the 
RDD and its immediate parent ShuffleDependencies, or when the RDD has a large 
number of immediate parent ShuffleDependencies.

There's an exception for checkpointed RDDs: if an RDD is checkpointed, its 
immediate parent ShuffleDependencies should be adjusted to empty.
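
A standalone sketch of the memoization idea (the types and names below are 
simplified stand-ins, not the DAGScheduler code):

{code}
import scala.collection.mutable

final case class ShuffleDep(shuffleId: Int)
// Left = narrow dependency on a parent RDD, Right = immediate shuffle dependency.
final case class Rdd(id: Int, parents: Seq[Either[Rdd, ShuffleDep]])

object ShuffleDepCache {
  private val cache = mutable.HashMap.empty[Int, Set[ShuffleDep]]

  // Walk through narrow dependencies until the first shuffle boundary, caching the
  // result per RDD id so repeated calls for the same RDD become O(1).
  def shuffleDependencies(rdd: Rdd): Set[ShuffleDep] =
    cache.getOrElseUpdate(rdd.id, {
      val visited = mutable.HashSet.empty[Int]
      val found = mutable.HashSet.empty[ShuffleDep]
      var waiting: List[Rdd] = List(rdd)
      while (waiting.nonEmpty) {
        val current = waiting.head
        waiting = waiting.tail
        if (visited.add(current.id)) {
          current.parents.foreach {
            case Right(dep)   => found += dep               // stop at the shuffle boundary
            case Left(narrow) => waiting = narrow :: waiting // keep walking narrow deps
          }
        }
      }
      found.toSet
    })
}
{code}

Per the checkpointing caveat above, a checkpointed RDD's cached entry would need to 
be reset to empty.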



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24006) ExecutorAllocationManager.onExecutorAdded is an O(n) operation

2018-04-18 Thread Xianjin YE (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442224#comment-16442224
 ] 

Xianjin YE commented on SPARK-24006:


I haven't launched a large enough job to confirm my assumption.

But roughly I guess 5,000-10,000 executors would show some impact, and 100,000 
executors and above would cause an actual problem.

> ExecutorAllocationManager.onExecutorAdded is an O(n) operation
> --
>
> Key: SPARK-24006
> URL: https://issues.apache.org/jira/browse/SPARK-24006
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Xianjin YE
>Priority: Major
>
> The ExecutorAllocationManager.onExecutorAdded method is an O(n) operation; I 
> believe it will be a problem when scaling out with a large number of executors, 
> as it effectively makes adding N executors an O(N^2) operation overall.
>  
> I propose to invoke onExecutorIdle guarded by 
> {code:java}
> if (executorIds.size - executorsPendingToRemove.size >= minNumExecutors +1) { 
> // Since we only need to re-mark idle executors when near the lower bound
> executorIds.filter(listener.isExecutorIdle).foreach(onExecutorIdle)
> } else {
> onExecutorIdle(executorId)
> }{code}
> cc [~zsxwing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23926) High-order function: reverse(x) → array

2018-04-18 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-23926.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21034
[https://github.com/apache/spark/pull/21034]

> High-order function: reverse(x) → array
> ---
>
> Key: SPARK-23926
> URL: https://issues.apache.org/jira/browse/SPARK-23926
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marek Novotny
>Priority: Major
> Fix For: 2.4.0
>
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Returns an array which has the reversed order of array x.
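
A usage sketch once this lands (assumes a SparkSession named spark on Spark 2.4+):

{code}
// reverse() applied to an array column via SQL; the single result row holds [3, 2, 1].
spark.sql("SELECT reverse(array(1, 2, 3)) AS reversed").show()
{code}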



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23926) High-order function: reverse(x) → array

2018-04-18 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin reassigned SPARK-23926:
-

Assignee: Marek Novotny

> High-order function: reverse(x) → array
> ---
>
> Key: SPARK-23926
> URL: https://issues.apache.org/jira/browse/SPARK-23926
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marek Novotny
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Returns an array which has the reversed order of array x.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten

2018-04-18 Thread liuxian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442176#comment-16442176
 ] 

liuxian edited comment on SPARK-23989 at 4/18/18 9:21 AM:
--

test("groupBy") {
  spark.conf.set("spark.sql.shuffle.partitions", 16777217)

  val df1 = Seq(("a", 1, 0, "b"), ("b", 2, 4, "c"), ("a", 2, 3, "d"))
    .toDF("key", "value1", "value2", "rest")

  checkAnswer(
    df1.groupBy("key").min("value2"),
    Seq(Row("a", 0), Row("b", 4))
  )
}

Because the number of partitions is too large, it will run for a long time.

The number of partitions is set this large so that the shuffle goes through 
`SortShuffleWriter`.

 


was (Author: 10110346):
test("groupBy") {
  spark.conf.set("spark.sql.shuffle.partitions", 16777217)

  val df1 = Seq(("a", 1, 0, "b"), ("b", 2, 4, "c"), ("a", 2, 3, "d"))
    .toDF("key", "value1", "value2", "rest")

  checkAnswer(
    df1.groupBy("key").min("value2"),
    Seq(Row("a", 0), Row("b", 4))
  )
}

Because the number of partitions is too large, it will run for a long time.

The number of partitions is set this large so that the shuffle goes through 
`SortShuffleWriter`.

 

> When using `SortShuffleWriter`, the data will be overwritten
> 
>
> Key: SPARK-23989
> URL: https://issues.apache.org/jira/browse/SPARK-23989
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Critical
>
> When using `SortShuffleWriter`, we only insert 'AnyRef' into 
> 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'.
> For this function:
> override def write(records: Iterator[Product2[K, V]])
> the value of 'records' is `UnsafeRow`, so the value will be overwritten.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten

2018-04-18 Thread liuxian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442176#comment-16442176
 ] 

liuxian commented on SPARK-23989:
-

test("groupBy") {
  spark.conf.set("spark.sql.shuffle.partitions", 16777217)

  val df1 = Seq(("a", 1, 0, "b"), ("b", 2, 4, "c"), ("a", 2, 3, "d"))
    .toDF("key", "value1", "value2", "rest")

  checkAnswer(
    df1.groupBy("key").min("value2"),
    Seq(Row("a", 0), Row("b", 4))
  )
}

Because the number of partitions is too large, it will run for a long time.

The number of partitions is set this large so that the shuffle goes through 
`SortShuffleWriter`.

 

> When using `SortShuffleWriter`, the data will be overwritten
> 
>
> Key: SPARK-23989
> URL: https://issues.apache.org/jira/browse/SPARK-23989
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Critical
>
> When using `SortShuffleWriter`, we only insert 'AnyRef' into 
> 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'.
> For this function:
> override def write(records: Iterator[Product2[K, V]])
> the value of 'records' is `UnsafeRow`, so the value will be overwritten.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24009) spark2.3.0 INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/aaaaab'

2018-04-18 Thread ANDY GUAN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442165#comment-16442165
 ] 

ANDY GUAN commented on SPARK-24009:
---

Try configuring *hive.exec.scratchdir* and *hive.exec.stagingdir* in 
hive-site.xml. Make sure your current user has the right to write into the 
configured directory.

> spark2.3.0 INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab' 
> -
>
> Key: SPARK-24009
> URL: https://issues.apache.org/jira/browse/SPARK-24009
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: chris_j
>Priority: Major
>
> 1.spark-sql -e "INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab'row 
> format delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from 
> default.dim_date"  write local directory successful
> 3.spark-sql  --master yarn -e "INSERT OVERWRITE DIRECTORY 'ab'row format 
> delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from 
> default.dim_date"  write hdfs successful
> 2.spark-sql --master yarn -e "INSERT OVERWRITE LOCAL DIRECTORY 
> '/home/spark/ab'row format delimited FIELDS TERMINATED BY '\t' STORED AS 
> TEXTFILE select * from default.dim_date"  on yarn write local directory failed
>  
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.io.IOException: Mkdirs failed to create 
> [file:/home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0|file://home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0]
>  (exists=false, 
> cwd=[file:/data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02|file://data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02])
>  at 
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)
>  at 
> org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:123)
>  at 
> org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:367)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:378)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
>  at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
>  ... 8 more
>  Caused by: java.io.IOException: Mkdirs failed to create 
> [file:/home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0|file://home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0]
>  (exists=false, 
> cwd=[file:/data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02|file://data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02])
>  at 
> org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:447)
>  at 
> org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
>  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
>  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801)
>  at 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat.getHiveRecordWriter(HiveIgnoreKeyTextOutputFormat.java:80)
>  at 
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:261)
>  at 
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:246)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-24008) SQL/Hive Context fails with NullPointerException

2018-04-18 Thread ANDY GUAN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ANDY GUAN updated SPARK-24008:
--
Comment: was deleted

(was: Try to config *hive.exec.scratchdir* and *hive.exec.stagingdir* in 
hive-site.xml. Make sure your current user has the right to write into the 
configured directory.)

> SQL/Hive Context fails with NullPointerException 
> -
>
> Key: SPARK-24008
> URL: https://issues.apache.org/jira/browse/SPARK-24008
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.3
>Reporter: Prabhu Joseph
>Priority: Major
> Attachments: Repro
>
>
> SQL / Hive Context fails with NullPointerException while getting 
> configuration from SQLConf. This happens when the MemoryStore is filled with 
> a lot of broadcasts and starts dropping them, and then a SQL / Hive Context is 
> created and broadcast. Using this Context to access a table then fails with the 
> NullPointerException below.
> Repro is attached - the Spark Example which fills the MemoryStore with 
> broadcasts and then creates and accesses a SQL Context.
> {code}
> java.lang.NullPointerException
> at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638)
> at 
> org.apache.spark.sql.SQLConf.defaultDataSourceName(SQLConf.scala:558)
> at 
> org.apache.spark.sql.DataFrameReader.(DataFrameReader.scala:362)
> at org.apache.spark.sql.SQLContext.read(SQLContext.scala:623)
> at SparkHiveExample$.main(SparkHiveExample.scala:76)
> at SparkHiveExample.main(SparkHiveExample.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> 18/04/06 14:17:42 ERROR ApplicationMaster: User class threw exception: 
> java.lang.NullPointerException 
> java.lang.NullPointerException 
> at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638) 
> at org.apache.spark.sql.SQLContext.getConf(SQLContext.scala:153) 
> at 
> org.apache.spark.sql.hive.HiveContext.hiveMetastoreVersion(HiveContext.scala:166)
>  
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:258)
>  
> at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:255) 
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$2.(HiveContext.scala:475) 
> at 
> org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:475)
>  
> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:474) 
> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:90) 
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:831) 
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827) 
> {code}
> MemoryStore got filled and started dropping the blocks.
> {code}
> 18/04/17 08:03:43 INFO MemoryStore: 2 blocks selected for dropping
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_14 stored as values in 
> memory (estimated size 78.1 MB, free 64.4 MB)
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_14_piece0 stored as bytes 
> in memory (estimated size 1522.0 B, free 64.4 MB)
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_15 stored as values in 
> memory (estimated size 350.9 KB, free 64.1 MB)
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_15_piece0 stored as bytes 
> in memory (estimated size 29.9 KB, free 64.0 MB)
> 18/04/17 08:03:43 INFO MemoryStore: 10 blocks selected for dropping
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_16 stored as values in 
> memory (estimated size 78.1 MB, free 64.7 MB)
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_16_piece0 stored as bytes 
> in memory (estimated size 1522.0 B, free 64.7 MB)
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_1 stored as values in 
> memory (estimated size 136.0 B, free 64.7 MB)
> 18/04/17 08:03:44 INFO MemoryStore: MemoryStore cleared
> 18/04/17 08:03:20 INFO MemoryStore: MemoryStore started with capacity 511.1 MB
> 18/04/17 08:03:44 INFO MemoryStore: MemoryStore cleared
> 18/04/17 08:03:57 INFO MemoryStore: MemoryStore started with capacity 511.1 MB
> 18/04/17 08:04:23 INFO MemoryStore: MemoryStore cleared
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24008) SQL/Hive Context fails with NullPointerException

2018-04-18 Thread ANDY GUAN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442164#comment-16442164
 ] 

ANDY GUAN commented on SPARK-24008:
---

Try configuring *hive.exec.scratchdir* and *hive.exec.stagingdir* in 
hive-site.xml, and make sure the user running the job has permission to write 
into the configured directories.
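
If editing hive-site.xml is not convenient, the same two properties can also be 
set programmatically. The sketch below is only illustrative, not part of this 
issue: it assumes a Spark 1.6-style HiveContext, and the directory paths are 
placeholders that must point somewhere the submitting user can write.

{code:scala}
import org.apache.spark.sql.hive.HiveContext

object ScratchDirConfig {
  // Placeholder paths - substitute directories the job's user can write to.
  def configure(hiveContext: HiveContext): Unit = {
    hiveContext.setConf("hive.exec.scratchdir", "/tmp/hive-scratch")
    hiveContext.setConf("hive.exec.stagingdir", "/tmp/hive-staging")
  }
}
{code}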

> SQL/Hive Context fails with NullPointerException 
> -
>
> Key: SPARK-24008
> URL: https://issues.apache.org/jira/browse/SPARK-24008
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.3
>Reporter: Prabhu Joseph
>Priority: Major
> Attachments: Repro
>
>
> SQL / Hive Context fails with a NullPointerException while reading 
> configuration from SQLConf. This happens when the MemoryStore is filled with 
> many broadcasts and starts dropping blocks, and a SQL / Hive Context is then 
> created and broadcast. Using that Context to access a table fails with the 
> NullPointerException below.
> Repro is attached - a Spark example which fills the MemoryStore with 
> broadcasts and then creates and accesses a SQL Context (a hedged sketch of 
> such a repro follows after this quoted description).
> {code}
> java.lang.NullPointerException
> at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638)
> at 
> org.apache.spark.sql.SQLConf.defaultDataSourceName(SQLConf.scala:558)
> at 
> org.apache.spark.sql.DataFrameReader.<init>(DataFrameReader.scala:362)
> at org.apache.spark.sql.SQLContext.read(SQLContext.scala:623)
> at SparkHiveExample$.main(SparkHiveExample.scala:76)
> at SparkHiveExample.main(SparkHiveExample.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> 18/04/06 14:17:42 ERROR ApplicationMaster: User class threw exception: 
> java.lang.NullPointerException 
> java.lang.NullPointerException 
> at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638) 
> at org.apache.spark.sql.SQLContext.getConf(SQLContext.scala:153) 
> at 
> org.apache.spark.sql.hive.HiveContext.hiveMetastoreVersion(HiveContext.scala:166)
>  
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:258)
>  
> at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:255) 
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$2.<init>(HiveContext.scala:475) 
> at 
> org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:475)
>  
> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:474) 
> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:90) 
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:831) 
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827) 
> {code}
> MemoryStore got filled and started dropping the blocks.
> {code}
> 18/04/17 08:03:43 INFO MemoryStore: 2 blocks selected for dropping
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_14 stored as values in 
> memory (estimated size 78.1 MB, free 64.4 MB)
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_14_piece0 stored as bytes 
> in memory (estimated size 1522.0 B, free 64.4 MB)
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_15 stored as values in 
> memory (estimated size 350.9 KB, free 64.1 MB)
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_15_piece0 stored as bytes 
> in memory (estimated size 29.9 KB, free 64.0 MB)
> 18/04/17 08:03:43 INFO MemoryStore: 10 blocks selected for dropping
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_16 stored as values in 
> memory (estimated size 78.1 MB, free 64.7 MB)
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_16_piece0 stored as bytes 
> in memory (estimated size 1522.0 B, free 64.7 MB)
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_1 stored as values in 
> memory (estimated size 136.0 B, free 64.7 MB)
> 18/04/17 08:03:44 INFO MemoryStore: MemoryStore cleared
> 18/04/17 08:03:20 INFO MemoryStore: MemoryStore started with capacity 511.1 MB
> 18/04/17 08:03:44 INFO MemoryStore: MemoryStore cleared
> 18/04/17 08:03:57 INFO MemoryStore: MemoryStore started with capacity 511.1 MB
> 18/04/17 08:04:23 INFO MemoryStore: MemoryStore cleared
> {code}
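
For readers without access to the attached repro, the following is a minimal 
sketch of the scenario the description outlines, not the attached code itself: 
broadcast large arrays until the MemoryStore starts dropping blocks, then 
create a HiveContext and touch a table. The array size, loop count, and table 
name are placeholders.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object MemoryStorePressureSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SPARK-24008-sketch"))

    // Fill the MemoryStore: watch the logs for "blocks selected for dropping".
    val broadcasts =
      (1 to 20).map(_ => sc.broadcast(new Array[Byte](64 * 1024 * 1024)))

    // Per the report, creating and using a SQL/Hive Context at this point
    // fails inside SQLConf.getConf with a NullPointerException.
    val hiveContext = new HiveContext(sc)
    hiveContext.table("some_table").show(1) // placeholder Hive table name

    broadcasts.foreach(_.destroy())
    sc.stop()
  }
}
{code}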



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24010) Select from table needs read access on DB folder when storage based auth is enabled

2018-04-18 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442146#comment-16442146
 ] 

Rui Li commented on SPARK-24010:


Hi [~rxin], I think the databaseExists check was added in SPARK-14869. Do you 
remember why we need to check whether the DB exists? In 
{{HiveClientImpl::tableExists}}, we're telling Hive not to throw exceptions, so 
there shouldn't be an exception if the DB doesn't exist.
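
For context, below is a rough, non-authoritative sketch of the pattern the 
comment refers to. It is not a quote of Spark's source; that the Hive client 
call {{getTable(db, table, throwException = false)}} returns null when the 
table or its database is missing is an assumption about the metastore client 
API.

{code:scala}
import org.apache.hadoop.hive.ql.metadata.Hive

object TableExistsSketch {
  // Probe the table directly, with Hive told not to throw, instead of first
  // calling databaseExists (which goes through get_database and therefore
  // through the storage-based authorization check on the DB folder).
  def tableExists(client: Hive, db: String, table: String): Boolean =
    Option(client.getTable(db, table, false)).isDefined
}
{code}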

> Select from table needs read access on DB folder when storage based auth is 
> enabled
> ---
>
> Key: SPARK-24010
> URL: https://issues.apache.org/jira/browse/SPARK-24010
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Rui Li
>Priority: Major
>
> When HMS enables storage-based authorization, SparkSQL requires read access 
> on the DB folder in order to select from a table. Such a requirement doesn't 
> seem necessary and is not required by Hive.
> The reason is that when the Analyzer tries to resolve a relation, it calls 
> [SessionCatalog::databaseExists|https://github.com/apache/spark/blob/v2.1.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L469].
>  This calls the metastore get_database API, which performs the authorization 
> check.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24006) ExecutorAllocationManager.onExecutorAdded is an O(n) operation

2018-04-18 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442147#comment-16442147
 ] 

Hyukjin Kwon commented on SPARK-24006:
--

Roughly how many executors would you guess it takes before this becomes an 
actual problem?

> ExecutorAllocationManager.onExecutorAdded is an O(n) operation
> --
>
> Key: SPARK-24006
> URL: https://issues.apache.org/jira/browse/SPARK-24006
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Xianjin YE
>Priority: Major
>
> ExecutorAllocationManager.onExecutorAdded is an O(n) operation. I believe it 
> will be a problem when scaling out with a large number of executors, as it 
> effectively makes adding N executors an O(N^2) operation (a toy illustration 
> of that cost follows after this quoted block).
>  
> I propose to invoke onExecutorIdle guarded by 
> {code:java}
> if (executorIds.size - executorsPendingToRemove.size >= minNumExecutors + 1) {
>   // Since we only need to re-mark idle executors when at the lower bound
>   executorIds.filter(listener.isExecutorIdle).foreach(onExecutorIdle)
> } else {
>   onExecutorIdle(executorId)
> }{code}
> cc [~zsxwing]
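
A toy back-of-the-envelope illustration of the quadratic cost described in the 
quoted report, under the simplifying assumption that every onExecutorAdded call 
rescans all registered executors; it is not Spark code.

{code:scala}
object QuadraticRegistrationCost {
  // If the k-th registration rescans all k executors, registering n executors
  // costs 1 + 2 + ... + n = n(n + 1)/2 checks, i.e. O(n^2).
  def totalChecks(n: Long): Long = n * (n + 1) / 2

  def main(args: Array[String]): Unit = {
    // ~50 million checks for 10,000 executors, versus 10,000 if each add
    // only had to look at the newly added executor.
    println(totalChecks(10000L))
  }
}
{code}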



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24010) Select from table needs read access on DB folder when storage based auth is enabled

2018-04-18 Thread Rui Li (JIRA)
Rui Li created SPARK-24010:
--

 Summary: Select from table needs read access on DB folder when 
storage based auth is enabled
 Key: SPARK-24010
 URL: https://issues.apache.org/jira/browse/SPARK-24010
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Rui Li


When HMS enables storage-based authorization, SparkSQL requires read access on 
the DB folder in order to select from a table. Such a requirement doesn't seem 
necessary and is not required by Hive.
The reason is that when the Analyzer tries to resolve a relation, it calls 
[SessionCatalog::databaseExists|https://github.com/apache/spark/blob/v2.1.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L469].
 This calls the metastore get_database API, which performs the authorization 
check.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24009) spark2.3.0 INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/aaaaab'

2018-04-18 Thread chris_j (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chris_j updated SPARK-24009:

Description: 
1.spark-sql -e "INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab'row format 
delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from 
default.dim_date"  write local directory successful

3.spark-sql  --master yarn -e "INSERT OVERWRITE DIRECTORY 'ab'row format 
delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from 
default.dim_date"  write hdfs successful

2.spark-sql --master yarn -e "INSERT OVERWRITE LOCAL DIRECTORY 
'/home/spark/ab'row format delimited FIELDS TERMINATED BY '\t' STORED AS 
TEXTFILE select * from default.dim_date"  on yarn writr local directory failed

 

Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
java.io.IOException: Mkdirs failed to create 
file:/home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0
 (exists=false, 
cwd=file:/data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02)
 at 
org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)
 at 
org.apache.spark.sql.hive.execution.HiveOutputWriter.<init>(HiveFileFormat.scala:123)
 at 
org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:367)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:378)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
 at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
 ... 8 more
 Caused by: java.io.IOException: Mkdirs failed to create 
file:/home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0
 (exists=false, 
cwd=file:/data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02)
 at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:447)
 at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801)
 at 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat.getHiveRecordWriter(HiveIgnoreKeyTextOutputFormat.java:80)
 at 
org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:261)
 at 
org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:246)

 

  was:
1.spark-sql -e "INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab'row format 
delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from 
default.dim_date"  successful

2.spark-sql --master yarn -e "INSERT OVERWRITE LOCAL DIRECTORY 
'/home/spark/ab'row format delimited FIELDS TERMINATED BY '\t' STORED AS 
TEXTFILE select * from default.dim_date"  failed

 

Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
java.io.IOException: Mkdirs failed to create 
file:/home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0
 (exists=false, 
cwd=file:/data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02)
 at 
org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)
 at 

[jira] [Commented] (SPARK-23529) Specify hostpath volume and mount the volume in Spark driver and executor pods in Kubernetes

2018-04-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442050#comment-16442050
 ] 

Apache Spark commented on SPARK-23529:
--

User 'madanadit' has created a pull request for this issue:
https://github.com/apache/spark/pull/21095

> Specify hostpath volume and mount the volume in Spark driver and executor 
> pods in Kubernetes
> 
>
> Key: SPARK-23529
> URL: https://issues.apache.org/jira/browse/SPARK-23529
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Suman Somasundar
>Assignee: Anirudh Ramanathan
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24009) spark2.3.0 INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/aaaaab'

2018-04-18 Thread chris_j (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chris_j updated SPARK-24009:

Description: 
1.spark-sql -e "INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab'row format 
delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from 
default.dim_date"  successful

2.spark-sql --master yarn -e "INSERT OVERWRITE LOCAL DIRECTORY 
'/home/spark/ab'row format delimited FIELDS TERMINATED BY '\t' STORED AS 
TEXTFILE select * from default.dim_date"  failed

 

Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
java.io.IOException: Mkdirs failed to create 
file:/home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0
 (exists=false, 
cwd=file:/data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02)
 at 
org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)
 at 
org.apache.spark.sql.hive.execution.HiveOutputWriter.<init>(HiveFileFormat.scala:123)
 at 
org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:367)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:378)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
 at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
 ... 8 more
Caused by: java.io.IOException: Mkdirs failed to create 
file:/home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0
 (exists=false, 
cwd=file:/data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02)
 at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:447)
 at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801)
 at 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat.getHiveRecordWriter(HiveIgnoreKeyTextOutputFormat.java:80)
 at 
org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:261)
 at 
org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:246)

 

  was:INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab'row format delimited 
FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from default.dim_date


> spark2.3.0 INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab' 
> -
>
> Key: SPARK-24009
> URL: https://issues.apache.org/jira/browse/SPARK-24009
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: chris_j
>Priority: Major
>
> 1.spark-sql -e "INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab'row 
> format delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from 
> default.dim_date"  successful
> 2.spark-sql --master yarn -e "INSERT OVERWRITE LOCAL DIRECTORY 
> '/home/spark/ab'row format delimited FIELDS TERMINATED BY '\t' STORED AS 
> TEXTFILE select * from default.dim_date"  failed
>  
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.io.IOException: Mkdirs failed to create 
> file:/home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0
>  (exists=false, 
> cwd=file:/data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02)
>  at 
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)
>  at 
> org.apache.spark.sql.hive.execution.HiveOutputWriter.<init>(HiveFileFormat.scala:123)
>  at 
> org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
>  at 
> 

[jira] [Created] (SPARK-24009) spark2.3.0 INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/aaaaab'

2018-04-18 Thread chris_j (JIRA)
chris_j created SPARK-24009:
---

 Summary: spark2.3.0 INSERT OVERWRITE LOCAL DIRECTORY 
'/home/spark/ab' 
 Key: SPARK-24009
 URL: https://issues.apache.org/jira/browse/SPARK-24009
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: chris_j


INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab'row format delimited 
FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from default.dim_date



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten

2018-04-18 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441997#comment-16441997
 ] 

Wenchen Fan commented on SPARK-23989:
-

You have to provide an end-to-end use case (e.g. a SQL/DataFrame query) to 
convince us this is a bug. If you fork Spark, change some code, and break 
something, that is not a bug.

> When using `SortShuffleWriter`, the data will be overwritten
> 
>
> Key: SPARK-23989
> URL: https://issues.apache.org/jira/browse/SPARK-23989
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Critical
>
> When using `SortShuffleWriter`, we only insert 'AnyRef' into 
> 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'.
> For this function: {{override def write(records: Iterator[Product2[K, V]])}}, 
> the value of 'records' is `UnsafeRow`, so the value will be overwritten 
> (a toy illustration of this reuse pitfall follows after this quoted 
> description).
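
To make the reuse concern concrete, here is a toy illustration that has nothing 
to do with Spark's internals: an iterator that keeps returning the same mutable 
object, combined with a consumer that buffers references rather than copies, 
ends up with every buffered entry showing the last value written.

{code:scala}
object ReusePitfallSketch {
  final class MutableRow { var value: Int = 0 }

  def main(args: Array[String]): Unit = {
    val reused = new MutableRow
    val rows = (1 to 3).iterator.map { i => reused.value = i; reused }
    // Buffering the same reference three times: prints 3,3,3
    println(rows.toArray.map(_.value).mkString(","))

    val reused2 = new MutableRow
    val rows2 = (1 to 3).iterator.map { i => reused2.value = i; reused2 }
    // Copying the payload out before buffering: prints 1,2,3
    println(rows2.map(_.value).toArray.mkString(","))
  }
}
{code}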



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten

2018-04-18 Thread liuxian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-23989:

Attachment: (was: 无标题2.png)

> When using `SortShuffleWriter`, the data will be overwritten
> 
>
> Key: SPARK-23989
> URL: https://issues.apache.org/jira/browse/SPARK-23989
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Critical
>
> When using `SortShuffleWriter`, we only insert 'AnyRef' into 
> 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'.
> For this function: {{override def write(records: Iterator[Product2[K, V]])}}, 
> the value of 'records' is `UnsafeRow`, so the value will be overwritten.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


