[jira] [Closed] (SPARK-22057) Remove applications from the HistoryPage whose event log files no longer exist

2017-09-20 Thread zuotingbing (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zuotingbing closed SPARK-22057.
---
Resolution: Later

See [https://issues.apache.org/jira/browse/SPARK-20642].
I think we can wait until the linked change is finished.

> Remove applications from the HistoryPage whose event log files no longer 
> exist
> --
>
> Key: SPARK-22057
> URL: https://issues.apache.org/jira/browse/SPARK-22057
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: Linux
>Reporter: zuotingbing
>Priority: Minor
>
> 1. There are several event log files 
> (app-20170918104353-0003, app-20170918105022-0005, app-20170919160046-0006) 
> under the event log directory.
> 2. Delete the log files of app-20170918104353-0003 under the event log 
> directory.
> 3. Check the History Page: it should not display app-20170918104353-0003, 
> since its event log files no longer exist.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14878) Support Trim characters in the string trim function

2017-09-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174316#comment-16174316
 ] 

Apache Spark commented on SPARK-14878:
--

User 'kevinyu98' has created a pull request for this issue:
https://github.com/apache/spark/pull/19302

> Support Trim characters in the string trim function
> ---
>
> Key: SPARK-14878
> URL: https://issues.apache.org/jira/browse/SPARK-14878
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: kevin yu
>Assignee: kevin yu
> Fix For: 2.3.0
>
>
> Spark SQL does not currently support trim characters in the string trim 
> function, which are part of the ANSI SQL:2003 standard. For example, IBM DB2 
> fully supports them, as shown in 
> https://www.ibm.com/support/knowledgecenter/SS6NHC/com.ibm.swg.im.dashdb.sql.ref.doc/doc/r0023198.html.
>  We propose to implement this in this JIRA.
> The ANSI SQL:2003 TRIM syntax:
> {noformat}
> <trim function> ::= TRIM <left paren> <trim operands> <right paren>
> <trim operands> ::= [ [ <trim specification> ] [ <trim character> ] FROM ] <trim source>
> <trim source> ::= <character value expression>
> <trim specification> ::=
>     LEADING
>   | TRAILING
>   | BOTH
> <trim character> ::= <character value expression>
> {noformat}
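> For illustration only, here are hypothetical queries of the kind the proposed 
> trim-character support would allow, following the syntax above (these examples 
> are not from the original report; the exact Spark syntax is whatever the 
> linked pull request implements):
> {code:sql}
> -- trim a specified character instead of whitespace
> SELECT TRIM(BOTH '0' FROM '00012300');    -- '123'
> SELECT TRIM(LEADING 'x' FROM 'xxabc');    -- 'abc'
> SELECT TRIM(TRAILING 'x' FROM 'abcxx');   -- 'abc'
> {code}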



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22084) Performance regression in aggregation strategy

2017-09-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22084:


Assignee: (was: Apache Spark)

> Performance regression in aggregation strategy
> --
>
> Key: SPARK-22084
> URL: https://issues.apache.org/jira/browse/SPARK-22084
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.2.0
>Reporter: StanZhai
>  Labels: performance
>
> {code:sql}
> SELECT a, SUM(b) AS b0, SUM(b) AS b1 
> FROM VALUES(1, 1), (2, 2) AS (a, b) 
> GROUP BY a
> {code}
> There are two identical SUM(b) expressions in the SQL above, and the following 
> is the physical plan in Spark 2.x.
> {code}
> == Physical Plan ==
> *HashAggregate(keys=[a#11], functions=[sum(cast(b#12 as bigint)), 
> sum(cast(b#12 as bigint))])
> +- Exchange hashpartitioning(a#11, 200)
>+- *HashAggregate(keys=[a#11], functions=[partial_sum(cast(b#12 as 
> bigint)), partial_sum(cast(b#12 as bigint))])
>   +- LocalTableScan [a#11, b#12]
> {code}
> The functions in the partial aggregate should be just: 
> functions=[partial_sum(cast(b#12 as bigint))]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22084) Performance regression in aggregation strategy

2017-09-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174314#comment-16174314
 ] 

Apache Spark commented on SPARK-22084:
--

User 'stanzhai' has created a pull request for this issue:
https://github.com/apache/spark/pull/19301

> Performance regression in aggregation strategy
> --
>
> Key: SPARK-22084
> URL: https://issues.apache.org/jira/browse/SPARK-22084
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.2.0
>Reporter: StanZhai
>  Labels: performance
>
> {code:sql}
> SELECT a, SUM(b) AS b0, SUM(b) AS b1 
> FROM VALUES(1, 1), (2, 2) AS (a, b) 
> GROUP BY a
> {code}
> There are two identical SUM(b) expressions in the SQL above, and the following 
> is the physical plan in Spark 2.x.
> {code}
> == Physical Plan ==
> *HashAggregate(keys=[a#11], functions=[sum(cast(b#12 as bigint)), 
> sum(cast(b#12 as bigint))])
> +- Exchange hashpartitioning(a#11, 200)
>+- *HashAggregate(keys=[a#11], functions=[partial_sum(cast(b#12 as 
> bigint)), partial_sum(cast(b#12 as bigint))])
>   +- LocalTableScan [a#11, b#12]
> {code}
> The functions in the partial aggregate should be just: 
> functions=[partial_sum(cast(b#12 as bigint))]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22084) Performance regression in aggregation strategy

2017-09-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22084:


Assignee: Apache Spark

> Performance regression in aggregation strategy
> --
>
> Key: SPARK-22084
> URL: https://issues.apache.org/jira/browse/SPARK-22084
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.2.0
>Reporter: StanZhai
>Assignee: Apache Spark
>  Labels: performance
>
> {code:sql}
> SELECT a, SUM(b) AS b0, SUM(b) AS b1 
> FROM VALUES(1, 1), (2, 2) AS (a, b) 
> GROUP BY a
> {code}
> There are two identical SUM(b) expressions in the SQL above, and the following 
> is the physical plan in Spark 2.x.
> {code}
> == Physical Plan ==
> *HashAggregate(keys=[a#11], functions=[sum(cast(b#12 as bigint)), 
> sum(cast(b#12 as bigint))])
> +- Exchange hashpartitioning(a#11, 200)
>+- *HashAggregate(keys=[a#11], functions=[partial_sum(cast(b#12 as 
> bigint)), partial_sum(cast(b#12 as bigint))])
>   +- LocalTableScan [a#11, b#12]
> {code}
> The functions in the partial aggregate should be just: 
> functions=[partial_sum(cast(b#12 as bigint))]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21934) Expose Netty memory usage via Metrics System

2017-09-20 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-21934:
---

Assignee: Saisai Shao

> Expose Netty memory usage via Metrics System
> 
>
> Key: SPARK-21934
> URL: https://issues.apache.org/jira/browse/SPARK-21934
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
> Fix For: 2.3.0
>
>
> This is follow-up work for SPARK-9104 to expose Netty memory usage to the 
> MetricsSystem. My initial thought is to expose only shuffle memory usage, 
> since shuffle is a major part of memory usage in network communication 
> compared to RPC, the file server, and block transfer. 
> If users also want Netty memory usage exposed for other modules, we can add 
> more metrics later.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21934) Expose Netty memory usage via Metrics System

2017-09-20 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-21934.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19160
[https://github.com/apache/spark/pull/19160]

> Expose Netty memory usage via Metrics System
> 
>
> Key: SPARK-21934
> URL: https://issues.apache.org/jira/browse/SPARK-21934
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
> Fix For: 2.3.0
>
>
> This is follow-up work for SPARK-9104 to expose Netty memory usage to the 
> MetricsSystem. My initial thought is to expose only shuffle memory usage, 
> since shuffle is a major part of memory usage in network communication 
> compared to RPC, the file server, and block transfer. 
> If users also want Netty memory usage exposed for other modules, we can add 
> more metrics later.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22084) Performance regression in aggregation strategy

2017-09-20 Thread StanZhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai updated SPARK-22084:
-
Labels: performance  (was: )

> Performance regression in aggregation strategy
> --
>
> Key: SPARK-22084
> URL: https://issues.apache.org/jira/browse/SPARK-22084
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.2.0
>Reporter: StanZhai
>  Labels: performance
>
> {code:sql}
> SELECT a, SUM(b) AS b0, SUM(b) AS b1 
> FROM VALUES(1, 1), (2, 2) AS (a, b) 
> GROUP BY a
> {code}
> There are two identical SUM(b) expressions in the SQL above, and the following 
> is the physical plan in Spark 2.x.
> {code}
> == Physical Plan ==
> *HashAggregate(keys=[a#11], functions=[sum(cast(b#12 as bigint)), 
> sum(cast(b#12 as bigint))])
> +- Exchange hashpartitioning(a#11, 200)
>+- *HashAggregate(keys=[a#11], functions=[partial_sum(cast(b#12 as 
> bigint)), partial_sum(cast(b#12 as bigint))])
>   +- LocalTableScan [a#11, b#12]
> {code}
> The functions in the partial aggregate should be just: 
> functions=[partial_sum(cast(b#12 as bigint))]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22084) Performance regression in aggregation strategy

2017-09-20 Thread StanZhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai updated SPARK-22084:
-
Description: 
{code:sql}
SELECT a, SUM(b) AS b0, SUM(b) AS b1 
FROM VALUES(1, 1), (2, 2) AS (a, b) 
GROUP BY a
{code}

There are two identical SUM(b) expressions in the SQL above, and the following is 
the physical plan in Spark 2.x.

{code}
== Physical Plan ==
*HashAggregate(keys=[a#11], functions=[sum(cast(b#12 as bigint)), sum(cast(b#12 
as bigint))])
+- Exchange hashpartitioning(a#11, 200)
   +- *HashAggregate(keys=[a#11], functions=[partial_sum(cast(b#12 as bigint)), 
partial_sum(cast(b#12 as bigint))])
  +- LocalTableScan [a#11, b#12]
{code}

The functions in the partial aggregate should be just: functions=[partial_sum(cast(b#12 as bigint))]

  was:
{code:sql}
SELECT a, SUM(b) AS b0, SUM(b) AS b1 
FROM VALUES(1, 1), (2, 2) AS (a, b) 
GROUP BY a
{code}

There are two identical SUM(b) expressions in the SQL above, and the following is 
the physical plan in Spark 2.x.

== Physical Plan ==
*HashAggregate(keys=[a#11], functions=[sum(cast(b#12 as bigint)), sum(cast(b#12 
as bigint))])
+- Exchange hashpartitioning(a#11, 200)
   +- *HashAggregate(keys=[a#11], functions=[partial_sum(cast(b#12 as bigint)), 
partial_sum(cast(b#12 as bigint))])
  +- LocalTableScan [a#11, b#12]

The functions in the partial aggregate should be just: functions=[partial_sum(cast(b#12 as bigint))]


> Performance regression in aggregation strategy
> --
>
> Key: SPARK-22084
> URL: https://issues.apache.org/jira/browse/SPARK-22084
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.2.0
>Reporter: StanZhai
>
> {code:sql}
> SELECT a, SUM(b) AS b0, SUM(b) AS b1 
> FROM VALUES(1, 1), (2, 2) AS (a, b) 
> GROUP BY a
> {code}
> There are two identical SUM(b) expressions in the SQL above, and the following 
> is the physical plan in Spark 2.x.
> {code}
> == Physical Plan ==
> *HashAggregate(keys=[a#11], functions=[sum(cast(b#12 as bigint)), 
> sum(cast(b#12 as bigint))])
> +- Exchange hashpartitioning(a#11, 200)
>+- *HashAggregate(keys=[a#11], functions=[partial_sum(cast(b#12 as 
> bigint)), partial_sum(cast(b#12 as bigint))])
>   +- LocalTableScan [a#11, b#12]
> {code}
> The functions in the partial aggregate should be just: 
> functions=[partial_sum(cast(b#12 as bigint))]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22084) Performance regression in aggregation strategy

2017-09-20 Thread StanZhai (JIRA)
StanZhai created SPARK-22084:


 Summary: Performance regression in aggregation strategy
 Key: SPARK-22084
 URL: https://issues.apache.org/jira/browse/SPARK-22084
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0, 2.1.1, 2.1.0, 2.0.2, 2.0.1, 2.0.0
Reporter: StanZhai


{code:sql}
SELECT a, SUM(b) AS b0, SUM(b) AS b1 
FROM VALUES(1, 1), (2, 2) AS (a, b) 
GROUP BY a
{code}

There are two identical SUM(b) expressions in the SQL above, and the following is 
the physical plan in Spark 2.x.

== Physical Plan ==
*HashAggregate(keys=[a#11], functions=[sum(cast(b#12 as bigint)), sum(cast(b#12 
as bigint))])
+- Exchange hashpartitioning(a#11, 200)
   +- *HashAggregate(keys=[a#11], functions=[partial_sum(cast(b#12 as bigint)), 
partial_sum(cast(b#12 as bigint))])
  +- LocalTableScan [a#11, b#12]

The functions in the partial aggregate should be just: functions=[partial_sum(cast(b#12 as bigint))]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22083) When dropping multiple blocks to disk, Spark should release all locks on a failure

2017-09-20 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-22083:
-
Summary: When dropping multiple blocks to disk, Spark should release all 
locks on a failure  (was: When dropping multiple locks to disk, Spark should 
release all locks on a failure)

> When dropping multiple blocks to disk, Spark should release all locks on a 
> failure
> --
>
> Key: SPARK-22083
> URL: https://issues.apache.org/jira/browse/SPARK-22083
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Imran Rashid
>
> {{MemoryStore.evictBlocksToFreeSpace}} first [acquires writer locks on all 
> the blocks it intends to evict | 
> https://github.com/apache/spark/blob/55d5fa79db883e4d93a9c102a94713c9d2d1fb55/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L520].
>   However, if there is an exception while dropping blocks, there is no 
> {{finally}} block to release all the locks.
> If there is only one block being dropped, this isn't a problem (probably).  
> Usually the call stack goes from {{MemoryStore.evictBlocksToFreeSpace --> 
> dropBlocks --> BlockManager.dropFromMemory --> DiskStore.put}}.  And 
> {{DiskStore.put}} does do a [{{removeBlock()}} in a {{finally}} 
> block|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/DiskStore.scala#L83],
>  which cleans up the locks.
> I ran into this from the serialization issue in SPARK-21928.  In that, a 
> netty thread ends up trying to evict some blocks from memory to disk, and 
> fails.  When there is only one block that needs to be evicted and the error 
> occurs, there isn't any real problem; I assume that netty thread is dead, but 
> the executor threads seem fine.  However, in the cases where two blocks get 
> dropped, one task gets completely stuck.  Unfortunately I don't have a stack 
> trace from the stuck executor, but I assume it just waits forever on this 
> lock that never gets released.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21928) ML LogisticRegression training occasionally produces java.lang.ClassNotFoundException when attempting to load custom Kryo registrator class

2017-09-20 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174247#comment-16174247
 ] 

Imran Rashid commented on SPARK-21928:
--

I believe I figured out why things get stuck sometimes, filed SPARK-22083.  
[~jbrock], if you happen to still have logs from a case where you see this, can 
you check if the stuck executor shows something like "INFO MemoryStore: 2 
blocks selected for dropping" (or maybe more than 2)?

> ML LogisticRegression training occasionally produces 
> java.lang.ClassNotFoundException when attempting to load custom Kryo 
> registrator class
> ---
>
> Key: SPARK-21928
> URL: https://issues.apache.org/jira/browse/SPARK-21928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: John Brock
>
> I unfortunately can't reliably reproduce this bug; it happens only 
> occasionally, when training a logistic regression model with very large 
> datasets. The training will often proceed through several {{treeAggregate}} 
> calls without any problems, and then suddenly workers will start running into 
> this {{java.lang.ClassNotFoundException}}.
> After doing some debugging, it seems that whenever this error happens, Spark 
> is trying to use the {{sun.misc.Launcher$AppClassLoader}} {{ClassLoader}} 
> instance instead of the usual 
> {{org.apache.spark.util.MutableURLClassLoader}}. {{MutableURLClassLoader}} 
> can see my custom Kryo registrator, but the {{AppClassLoader}} instance can't.
> When this error does pop up, it's usually accompanied by the task seeming to 
> hang, and I need to kill Spark manually.
> I'm running a Spark application in cluster mode via spark-submit, and I have 
> a custom Kryo registrator. The JAR is built with {{sbt assembly}}.
> Exception message:
> {noformat}
> 17/08/29 22:39:04 ERROR TransportRequestHandler: Error opening block 
> StreamChunkId{streamId=542074019336, chunkIndex=0} for request from 
> /10.0.29.65:34332
> org.apache.spark.SparkException: Failed to register classes with Kryo
> at 
> org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:139)
> at 
> org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:292)
> at 
> org.apache.spark.serializer.KryoSerializerInstance.<init>(KryoSerializer.scala:277)
> at 
> org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:186)
> at 
> org.apache.spark.serializer.SerializerManager.dataSerializeStream(SerializerManager.scala:169)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$dropFromMemory$3.apply(BlockManager.scala:1382)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$dropFromMemory$3.apply(BlockManager.scala:1377)
> at org.apache.spark.storage.DiskStore.put(DiskStore.scala:69)
> at 
> org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:1377)
> at 
> org.apache.spark.storage.memory.MemoryStore.org$apache$spark$storage$memory$MemoryStore$$dropBlock$1(MemoryStore.scala:524)
> at 
> org.apache.spark.storage.memory.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:545)
> at 
> org.apache.spark.storage.memory.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:539)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.storage.memory.MemoryStore.evictBlocksToFreeSpace(MemoryStore.scala:539)
> at 
> org.apache.spark.memory.StorageMemoryPool.acquireMemory(StorageMemoryPool.scala:92)
> at 
> org.apache.spark.memory.StorageMemoryPool.acquireMemory(StorageMemoryPool.scala:73)
> at 
> org.apache.spark.memory.StaticMemoryManager.acquireStorageMemory(StaticMemoryManager.scala:72)
> at 
> org.apache.spark.storage.memory.MemoryStore.putBytes(MemoryStore.scala:147)
> at 
> org.apache.spark.storage.BlockManager.maybeCacheDiskBytesInMemory(BlockManager.scala:1143)
> at 
> org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doGetLocalBytes(BlockManager.scala:594)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:559)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:559)
> at scala.Option.map(Option.scala:146)
> at 
> org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:559)
> at 
> org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:353)
> at 
> org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$1.apply(NettyBlockRpcServer.scala:61)
> 

[jira] [Created] (SPARK-22083) When dropping multiple locks to disk, Spark should release all locks on a failure

2017-09-20 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-22083:


 Summary: When dropping multiple locks to disk, Spark should 
release all locks on a failure
 Key: SPARK-22083
 URL: https://issues.apache.org/jira/browse/SPARK-22083
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Spark Core
Affects Versions: 2.2.0, 2.1.1
Reporter: Imran Rashid


{{MemoryStore.evictBlocksToFreeSpace}} first [acquires writer locks on all the 
blocks it intends to evict | 
https://github.com/apache/spark/blob/55d5fa79db883e4d93a9c102a94713c9d2d1fb55/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L520].
  However, if there is an exception while dropping blocks, there is no 
{{finally}} block to release all the locks.

If there is only one block being dropped, this isn't a problem (probably).  Usually 
the call stack goes from {{MemoryStore.evictBlocksToFreeSpace --> dropBlocks 
--> BlockManager.dropFromMemory --> DiskStore.put}}.  And {{DiskStore.put}} 
does do a [{{removeBlock()}} in a {{finally}} 
block|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/DiskStore.scala#L83],
 which cleans up the locks.

I ran into this from the serialization issue in SPARK-21928.  In that, a netty 
thread ends up trying to evict some blocks from memory to disk, and fails.  
When there is only one block that needs to be evicted and the error occurs, there 
isn't any real problem; I assume that netty thread is dead, but the executor 
threads seem fine.  However, in the cases where two blocks get dropped, one 
task gets completely stuck.  Unfortunately I don't have a stack trace from the 
stuck executor, but I assume it just waits forever on this lock that never gets 
released.
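
A minimal sketch of the missing guard, for illustration only (this is not the 
actual Spark code; {{blocks}}, {{drop}} and {{unlock}} are placeholders for the 
real MemoryStore/BlockInfoManager members):

{code:scala}
object EvictionSketch {
  // Illustrative only: drop() stands in for MemoryStore's per-block drop and
  // unlock() for releasing the write lock taken when the block was selected.
  def evictAll(blocks: Seq[String])(drop: String => Unit)(unlock: String => Unit): Unit = {
    var dropped = 0
    try {
      blocks.foreach { id => drop(id); dropped += 1 }  // drop(id) may throw part-way
    } finally {
      // blocks that were not successfully dropped may still hold our write lock:
      // release them so no task waits forever on a lock that is never freed
      blocks.drop(dropped).foreach(unlock)
    }
  }
}
{code}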



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22070) Spark SQL filter comparisons failing with timestamps and ISO-8601 strings

2017-09-20 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174220#comment-16174220
 ] 

Yuming Wang commented on SPARK-22070:
-

cc [~smilegator]

> Spark SQL filter comparisons failing with timestamps and ISO-8601 strings
> -
>
> Key: SPARK-22070
> URL: https://issues.apache.org/jira/browse/SPARK-22070
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Vishal Doshi
>Priority: Minor
>
> The filter behavior seems to ignore the time portion of the ISO-8601 string. 
> See below for code to reproduce:
> {code}
> import datetime
> from pyspark.sql import SparkSession
> from pyspark.sql.types import StructType, StructField, TimestampType
> spark = SparkSession.builder.getOrCreate()
> data = [{"dates": datetime.datetime(2017, 1, 1, 12)}]
> schema = StructType([StructField("dates", TimestampType())])
> df = spark.createDataFrame(data, schema=schema)
> # df.head() returns (correctly):
> #   Row(dates=datetime.datetime(2017, 1, 1, 12, 0))
> df.filter(df["dates"] > datetime.datetime(2017, 1, 1, 11).isoformat()).count()
> # should return 1, instead returns 0
> # datetime.datetime(2017, 1, 1, 11).isoformat() returns '2017-01-01T11:00:00'
> df.filter(df["dates"] > datetime.datetime(2016, 12, 31, 
> 11).isoformat()).count()
> # this one works
> {code}
> Of course, the simple workaround is to use the datetime objects themselves 
> in the query expression, but in practice this means using dateutil to parse 
> some data, which is not ideal.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22070) Spark SQL filter comparisons failing with timestamps and ISO-8601 strings

2017-09-20 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174212#comment-16174212
 ] 

Yuming Wang commented on SPARK-22070:
-

This can be fixed by my PR: https://github.com/apache/spark/pull/18853



{noformat}
Yumings-MacBook-Pro:spark-2.3.0-SNAPSHOT-bin-2.6.5 ming$ bin/pyspark  --conf 
spark.sql.typeCoercion.mode=hive
Python 2.7.10 (default, Feb  7 2017, 00:08:15) 
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
17/09/21 11:27:02 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
      /_/

Using Python version 2.7.10 (default, Feb  7 2017 00:08:15)
SparkSession available as 'spark'.
>>> import datetime
>>> 
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql.types import StructType, StructField, TimestampType
>>> 
>>> spark = SparkSession.builder.getOrCreate()
>>> 
>>> data = [{"dates": datetime.datetime(2017, 1, 1, 12)}]
>>> schema = StructType([StructField("dates", TimestampType())])
>>> df = spark.createDataFrame(data, schema=schema)
17/09/21 11:27:35 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException
>>> 
>>> df.filter(df["dates"] > datetime.datetime(2017, 1, 1, 
>>> 11).isoformat()).count()
1   
>>> 
{noformat}





> Spark SQL filter comparisons failing with timestamps and ISO-8601 strings
> -
>
> Key: SPARK-22070
> URL: https://issues.apache.org/jira/browse/SPARK-22070
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Vishal Doshi
>Priority: Minor
>
> The filter behavior seems to ignore the time portion of the ISO-8601 string. 
> See below for code to reproduce:
> {code}
> import datetime
> from pyspark.sql import SparkSession
> from pyspark.sql.types import StructType, StructField, TimestampType
> spark = SparkSession.builder.getOrCreate()
> data = [{"dates": datetime.datetime(2017, 1, 1, 12)}]
> schema = StructType([StructField("dates", TimestampType())])
> df = spark.createDataFrame(data, schema=schema)
> # df.head() returns (correctly):
> #   Row(dates=datetime.datetime(2017, 1, 1, 12, 0))
> df.filter(df["dates"] > datetime.datetime(2017, 1, 1, 11).isoformat()).count()
> # should return 1, instead returns 0
> # datetime.datetime(2017, 1, 1, 11).isoformat() returns '2017-01-01T11:00:00'
> df.filter(df["dates"] > datetime.datetime(2016, 12, 31, 
> 11).isoformat()).count()
> # this one works
> {code}
> Of course, the simple workaround is to use the datetime objects themselves 
> in the query expression, but in practice this means using dateutil to parse 
> some data, which is not ideal.
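> For illustration, a minimal sketch of that workaround on the DataFrame built 
> above (assuming the same {{df}}; the comparison value is the {{datetime}} 
> object itself rather than its ISO-8601 string form):
> {code}
> # hedged sketch: compare against the datetime object instead of .isoformat()
> df.filter(df["dates"] > datetime.datetime(2017, 1, 1, 11)).count()
> # expected to return 1
> {code}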



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22082) Spelling mistake: choosen in API doc of R

2017-09-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22082.
---
Resolution: Invalid

As mentioned before, don't make JIRAs for a single typo, the most trivial of 
possible changes

> Spelling mistake: choosen in API doc of R 
> --
>
> Key: SPARK-22082
> URL: https://issues.apache.org/jira/browse/SPARK-22082
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: zuotingbing
>Priority: Trivial
> Attachments: 2017-09-21_100940.png
>
>
> "choosen" should be "chosen" in API doc of R.
> http://spark.apache.org/docs/latest/api/R/index.html , see 
> {code:java}
> spark.kmeans
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22082) Spelling mistake: choosen in API doc of R

2017-09-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22082:


Assignee: Apache Spark

> Spelling mistake: choosen in API doc of R 
> --
>
> Key: SPARK-22082
> URL: https://issues.apache.org/jira/browse/SPARK-22082
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: zuotingbing
>Assignee: Apache Spark
>Priority: Trivial
> Attachments: 2017-09-21_100940.png
>
>
> "choosen" should be "chosen" in API doc of R.
> http://spark.apache.org/docs/latest/api/R/index.html , see 
> {code:java}
> spark.kmeans
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22082) Spelling mistake: choosen in API doc of R

2017-09-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174132#comment-16174132
 ] 

Apache Spark commented on SPARK-22082:
--

User 'zuotingbing' has created a pull request for this issue:
https://github.com/apache/spark/pull/19300

> Spelling mistake: choosen in API doc of R 
> --
>
> Key: SPARK-22082
> URL: https://issues.apache.org/jira/browse/SPARK-22082
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: zuotingbing
>Priority: Trivial
> Attachments: 2017-09-21_100940.png
>
>
> "choosen" should be "chosen" in API doc of R.
> http://spark.apache.org/docs/latest/api/R/index.html , see 
> {code:java}
> spark.kmeans
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22082) Spelling mistake: choosen in API doc of R

2017-09-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22082:


Assignee: (was: Apache Spark)

> Spelling mistake: choosen in API doc of R 
> --
>
> Key: SPARK-22082
> URL: https://issues.apache.org/jira/browse/SPARK-22082
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: zuotingbing
>Priority: Trivial
> Attachments: 2017-09-21_100940.png
>
>
> "choosen" should be "chosen" in API doc of R.
> http://spark.apache.org/docs/latest/api/R/index.html , see 
> {code:java}
> spark.kmeans
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22082) Spelling mistake: choosen in API doc of R

2017-09-20 Thread zuotingbing (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zuotingbing updated SPARK-22082:

Component/s: (was: MLlib)
 SparkR

> Spelling mistake: choosen in API doc of R 
> --
>
> Key: SPARK-22082
> URL: https://issues.apache.org/jira/browse/SPARK-22082
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: zuotingbing
>Priority: Trivial
> Attachments: 2017-09-21_100940.png
>
>
> "choosen" should be "chosen" in API doc of R.
> http://spark.apache.org/docs/latest/api/R/index.html , see 
> {code:java}
> spark.kmeans
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22082) Spelling mistake: choosen in API doc of R

2017-09-20 Thread zuotingbing (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zuotingbing updated SPARK-22082:

Description: 
"choosen" should be "chosen" in API doc of R.
http://spark.apache.org/docs/latest/api/R/index.html , see 
{code:java}
spark.kmeans
{code}


  was:
"choosen" should be "chosen" in API doc of R and source codes.
http://spark.apache.org/docs/latest/api/R/index.html , see 
{code:java}
spark.kmeans
{code}



> Spelling mistake: choosen in API doc of R 
> --
>
> Key: SPARK-22082
> URL: https://issues.apache.org/jira/browse/SPARK-22082
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: zuotingbing
>Priority: Trivial
> Attachments: 2017-09-21_100940.png
>
>
> "choosen" should be "chosen" in API doc of R.
> http://spark.apache.org/docs/latest/api/R/index.html , see 
> {code:java}
> spark.kmeans
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22082) Spelling mistake: choosen in API doc of R

2017-09-20 Thread zuotingbing (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zuotingbing updated SPARK-22082:

Attachment: 2017-09-21_100940.png

> Spelling mistake: choosen in API doc of R 
> --
>
> Key: SPARK-22082
> URL: https://issues.apache.org/jira/browse/SPARK-22082
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: zuotingbing
>Priority: Trivial
> Attachments: 2017-09-21_100940.png
>
>
> "choosen" should be "chosen" in API doc of R and source codes.
> http://spark.apache.org/docs/latest/api/R/index.html , see 
> {code:java}
> spark.kmeans
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22082) Spelling mistake: choosen in API doc of R

2017-09-20 Thread zuotingbing (JIRA)
zuotingbing created SPARK-22082:
---

 Summary: Spelling mistake: choosen in API doc of R 
 Key: SPARK-22082
 URL: https://issues.apache.org/jira/browse/SPARK-22082
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 2.2.0
Reporter: zuotingbing
Priority: Trivial


"choosen" should be "chosen" in API doc of R and source codes.
http://spark.apache.org/docs/latest/api/R/index.html , see 
{code:java}
spark.kmeans
{code}




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22076) Expand.projections should not be a Stream

2017-09-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174040#comment-16174040
 ] 

Apache Spark commented on SPARK-22076:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/19298

> Expand.projections should not be a Stream
> -
>
> Key: SPARK-22076
> URL: https://issues.apache.org/jira/browse/SPARK-22076
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.2.1, 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21384) Spark 2.2 + YARN without spark.yarn.jars / spark.yarn.archive fails

2017-09-20 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-21384.

   Resolution: Fixed
 Assignee: Devaraj K
Fix Version/s: 2.3.0
   2.2.1

> Spark 2.2 + YARN without spark.yarn.jars / spark.yarn.archive fails
> ---
>
> Key: SPARK-21384
> URL: https://issues.apache.org/jira/browse/SPARK-21384
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: holdenk
>Assignee: Devaraj K
> Fix For: 2.2.1, 2.3.0
>
>
> In making the updated version for Spark 2.2 + YARN, it seems that the automatic 
> packaging of JARs based on SPARK_HOME isn't quite working (which results in a 
> warning anyway). You can see the build failure in Travis at 
> https://travis-ci.org/holdenk/spark-testing-base/builds/252656109 (I've 
> reproduced it locally).
> This results in an exception like:
> {code}
> 17/07/12 03:14:11 WARN ResourceLocalizationService: { 
> file:/tmp/spark-0dc9dd59-dd7f-48fc-be2c-11a1bbd57d70/__spark_libs__8035392745283841054.zip,
>  1499829249000, ARCHIVE, null } failed: File 
> file:/tmp/spark-0dc9dd59-dd7f-48fc-be2c-11a1bbd57d70/__spark_libs__8035392745283841054.zip
>  does not exist
> java.io.FileNotFoundException: File 
> file:/tmp/spark-0dc9dd59-dd7f-48fc-be2c-11a1bbd57d70/__spark_libs__8035392745283841054.zip
>  does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>   at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
>   at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
>   at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
>   at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:359)
>   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 17/07/12 03:14:11 WARN NMAuditLogger: USER=travis OPERATION=Container 
> Finished - Failed   TARGET=ContainerImplRESULT=FAILURE  
> DESCRIPTION=Container failed with state: LOCALIZATION_FAILED
> APPID=application_1499829231193_0001
> CONTAINERID=container_1499829231193_0001_01_01
> 17/07/12 03:14:11 WARN DefaultContainerExecutor: delete returned false for 
> path: 
> [/home/travis/build/holdenk/spark-testing-base/target/com.holdenkarau.spark.testing.YARNCluster/com.holdenkarau.spark.testing.YARNCluster-localDir-nm-0_0/usercache/travis/filecache/11]
> 17/07/12 03:14:11 WARN DefaultContainerExecutor: delete returned false for 
> path: 
> [/home/travis/build/holdenk/spark-testing-base/target/com.holdenkarau.spark.testing.YARNCluster/com.holdenkarau.spark.testing.YARNCluster-localDir-nm-0_0/usercache/travis/filecache/11_tmp]
> 17/07/12 03:14:13 WARN ResourceLocalizationService: { 
> file:/tmp/spark-0dc9dd59-dd7f-48fc-be2c-11a1bbd57d70/__spark_libs__8035392745283841054.zip,
>  1499829249000, ARCHIVE, null } failed: File 
> file:/tmp/spark-0dc9dd59-dd7f-48fc-be2c-11a1bbd57d70/__spark_libs__8035392745283841054.zip
>  does not exist
> java.io.FileNotFoundException: File 
> file:/tmp/spark-0dc9dd59-dd7f-48fc-be2c-11a1bbd57d70/__spark_libs__8035392745283841054.zip
>  does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>   at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
>   at org.apache.hadoop.yarn.util.FSDow

[jira] [Updated] (SPARK-22081) Generalized Reduced Error Logistic Regression

2017-09-20 Thread Ruslan Dautkhanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-22081:
--
Description: 
Our SAS gurus are saying they would love to have "Generalized Reduced Error 
Logistic Regression" before they would consider moving over to Spark ML.

http://www.riceanalytics.com/db3/00232/riceanalytics.com/_download/ricejsm2008proceedingspaper.pdf

We have a lot of production modeling built on this methodology.

!RELR.GIF!



  was:
Our SAS gurus are saying they would love to have "Generalized Reduced Error 
Logistic Regression" before they would consider moving over to Spark ML.

http://www.riceanalytics.com/db3/00232/riceanalytics.com/_download/ricejsm2008proceedingspaper.pdf

We have a lot of production modeling built on this methodology.




> Generalized Reduced Error Logistic Regression
> -
>
> Key: SPARK-22081
> URL: https://issues.apache.org/jira/browse/SPARK-22081
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Ruslan Dautkhanov
>  Labels: machine_learning, ml, mllib
> Attachments: RELR.GIF
>
>
> Our SAS gurus are saying they would love to have "Generalized Reduced Error 
> Logistic Regression" before they would consider moving over to Spark ML.
> http://www.riceanalytics.com/db3/00232/riceanalytics.com/_download/ricejsm2008proceedingspaper.pdf
> We have a lot of production modeling built on this methodology.
> !RELR.GIF!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22081) Generalized Reduced Error Logistic Regression

2017-09-20 Thread Ruslan Dautkhanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-22081:
--
Attachment: RELR.GIF

> Generalized Reduced Error Logistic Regression
> -
>
> Key: SPARK-22081
> URL: https://issues.apache.org/jira/browse/SPARK-22081
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Ruslan Dautkhanov
>  Labels: machine_learning, ml, mllib
> Attachments: RELR.GIF
>
>
> Our SAS gurus are saying they would love to have "Generalized Reduced Error 
> Logistic Regression" before they would consider moving over to Spark ML.
> http://www.riceanalytics.com/db3/00232/riceanalytics.com/_download/ricejsm2008proceedingspaper.pdf
> We have a lot of production modeling built on this methodology.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs

2017-09-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16173900#comment-16173900
 ] 

Apache Spark commented on SPARK-18838:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/19297

> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>Assignee: Marcelo Vanzin
> Fix For: 2.3.0
>
> Attachments: perfResults.pdf, SparkListernerComputeTime.xlsx
>
>
> Currently we are observing very high event processing delay in the driver's 
> `ListenerBus` for large jobs with many tasks. Many critical components of the 
> scheduler, such as `ExecutorAllocationManager` and `HeartbeatReceiver`, depend 
> on the `ListenerBus` events, and this delay might hurt job performance 
> significantly or even fail the job.  For example, a significant delay in 
> receiving `SparkListenerTaskStart` might cause the `ExecutorAllocationManager` 
> to mistakenly remove an executor which is not idle.  
> The problem is that the event processor in `ListenerBus` is a single thread 
> which loops through all the listeners for each event and processes each event 
> synchronously: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94.
> This single-threaded processor often becomes the bottleneck for large jobs.  
> Also, if one of the listeners is very slow, all the listeners pay the price of 
> the delay incurred by the slow one. In addition, a slow listener can cause 
> events to be dropped from the event queue, which might be fatal to the job.
> To solve the above problems, we propose to get rid of the shared event queue 
> and the single-threaded event processor. Instead, each listener will have its 
> own dedicated single-threaded executor service. Whenever an event is posted, 
> it will be submitted to the executor service of every listener. The 
> single-threaded executor service guarantees in-order processing of events per 
> listener.  The queue used for each executor service will be bounded to 
> guarantee that memory does not grow indefinitely. The downside of this 
> approach is that a separate event queue per listener will increase the driver 
> memory footprint. 
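> As a rough, illustrative sketch of this proposal (not the actual patch; the 
> class and method names below are invented), each listener could be wrapped in 
> a bounded single-threaded executor:
> {code:scala}
> import java.util.concurrent.{ArrayBlockingQueue, ThreadPoolExecutor, TimeUnit}
>
> // One of these per listener: events run in order on a dedicated thread, and the
> // bounded queue keeps per-listener memory from growing indefinitely.
> class PerListenerExecutor(queueCapacity: Int) {
>   private val executor = new ThreadPoolExecutor(
>     1, 1, 0L, TimeUnit.MILLISECONDS,
>     new ArrayBlockingQueue[Runnable](queueCapacity))
>
>   def post(handleEvent: () => Unit): Unit =
>     executor.execute(new Runnable { def run(): Unit = handleEvent() })
> }
> {code}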



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22081) Generalized Reduced Error Logistic Regression

2017-09-20 Thread Ruslan Dautkhanov (JIRA)
Ruslan Dautkhanov created SPARK-22081:
-

 Summary: Generalized Reduced Error Logistic Regression
 Key: SPARK-22081
 URL: https://issues.apache.org/jira/browse/SPARK-22081
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Affects Versions: 2.2.0, 2.3.0
Reporter: Ruslan Dautkhanov


Our SAS gurus are saying they would love to have "Generalized Reduced Error 
Logistic Regression" before they would consider moving over to Spark ML.

http://www.riceanalytics.com/db3/00232/riceanalytics.com/_download/ricejsm2008proceedingspaper.pdf

We have a lot of production modeling built on this methodology.





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21549) Spark fails to complete job correctly in case of OutputFormat which do not write into hdfs

2017-09-20 Thread Sergey Zhemzhitsky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16173755#comment-16173755
 ] 

Sergey Zhemzhitsky commented on SPARK-21549:


[~ste...@apache.org] I've updated the PR to avoid using FileSystems at all. 
Instead, there is just an additional check for whether there are absolute files to 
rename during commit.

> Spark fails to complete job correctly in case of OutputFormat which do not 
> write into hdfs
> --
>
> Key: SPARK-21549
> URL: https://issues.apache.org/jira/browse/SPARK-21549
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: spark 2.2.0
> scala 2.11
>Reporter: Sergey Zhemzhitsky
>
> Spark fails to complete jobs correctly for custom OutputFormat 
> implementations.
> There are OutputFormat implementations which do not need to use the standard 
> Hadoop property *mapreduce.output.fileoutputformat.outputdir*.
> [But Spark reads this property from the 
> configuration|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopMapReduceWriter.scala#L79]
>  while setting up an OutputCommitter:
> {code:scala}
> val committer = FileCommitProtocol.instantiate(
>   className = classOf[HadoopMapReduceCommitProtocol].getName,
>   jobId = stageId.toString,
>   outputPath = conf.value.get("mapreduce.output.fileoutputformat.outputdir"),
>   isAppend = false).asInstanceOf[HadoopMapReduceCommitProtocol]
> committer.setupJob(jobContext)
> {code}
> ... and then uses this property later on while [committing the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L132],
>  [aborting the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L141],
>  [creating task's temp 
> path|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L95]
> In such cases, when the job completes, the following exception is thrown:
> {code}
> Can not create a Path from a null string
> java.lang.IllegalArgumentException: Can not create a Path from a null string
>   at org.apache.hadoop.fs.Path.checkPathArg(Path.java:123)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:135)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:89)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:58)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:141)
>   at 
> org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:106)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1084)
>   ...
> {code}
> So it seems that all jobs that use OutputFormats which don't write data into 
> HDFS-compatible file systems are broken.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12823) Cannot create UDF with StructType input

2017-09-20 Thread Simeon H.K. Fitch (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16173702#comment-16173702
 ] 

Simeon H.K. Fitch edited comment on SPARK-12823 at 9/20/17 7:16 PM:


As a coda, it's interesting to note that you can *return* a `Product` type from 
a UDF, and the conversion works fine:

{code:java}
  val litValue = udf(() ⇒ KV(4l, "four"))

  ds.select(litValue()).show
  //  +--------+
  //  |   UDF()|
  //  +--------+
  //  |[4,four]|
  //  |[4,four]|
  //  +--------+

  println(ds.select(litValue().as[KV]).first)
  // KV(4,four)

{code}

So there's a weird asymmetry to it as well.


was (Author: metasim):
As a coda, it's interesting to note that you can *return* a `Product` type from 
a UDF, and the conversion works fine:

{code:java}
  val litValue = udf(() ⇒ KV(4l, "four"))

  ds.select(litValue()).show
  //  +--------+
  //  |   UDF()|
  //  +--------+
  //  |[4,four]|
  //  |[4,four]|
  //  +--------+
{code}

So there's a weird asymmetry to it as well.

> Cannot create UDF with StructType input
> ---
>
> Key: SPARK-12823
> URL: https://issues.apache.org/jira/browse/SPARK-12823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Frank Rosner
>
> h5. Problem
> It is not possible to apply a UDF to a column that has a struct data type. 
> Two previous requests to the mailing list remained unanswered.
> h5. How-To-Reproduce
> {code}
> val sql = new org.apache.spark.sql.SQLContext(sc)
> import sql.implicits._
> case class KV(key: Long, value: String)
> case class Row(kv: KV)
> val df = sc.parallelize(List(Row(KV(1L, "a")), Row(KV(5L, "b")))).toDF
> val udf1 = org.apache.spark.sql.functions.udf((kv: KV) => kv.value)
> df.select(udf1(df("kv"))).show
> // java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast 
> to $line78.$read$$iwC$$iwC$KV
> val udf2 = org.apache.spark.sql.functions.udf((kv: (Long, String)) => kv._2)
> df.select(udf2(df("kv"))).show
> // org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(kv)' due to 
> data type mismatch: argument 1 requires struct<_1:bigint,_2:string> type, 
> however, 'kv' is of struct<key:bigint,value:string> type.;
> {code}
> h5. Mailing List Entries
> - 
> https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3CCACUahd8M=ipCbFCYDyein_=vqyoantn-tpxe6sq395nh10g...@mail.gmail.com%3E
> - https://www.mail-archive.com/user@spark.apache.org/msg43092.html
> h5. Possible Workaround
> If you create a {{UserDefinedFunction}} manually, not using the {{udf}} 
> helper functions, it works. See https://github.com/FRosner/struct-udf, which 
> exposes the {{UserDefinedFunction}} constructor (public from package 
> private). However, then you have to work with a {{Row}}, because it does not 
> automatically convert the row to a case class / tuple.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12823) Cannot create UDF with StructType input

2017-09-20 Thread Simeon H.K. Fitch (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16173702#comment-16173702
 ] 

Simeon H.K. Fitch commented on SPARK-12823:
---

As a coda, it's interesting to note that you can *return* a `Product` type from 
a UDF, and the conversion works fine:

{code:java}
  val litValue = udf(() ⇒ KV(4l, "four"))

  ds.select(litValue()).show
  //  +--------+
  //  |   UDF()|
  //  +--------+
  //  |[4,four]|
  //  |[4,four]|
  //  +--------+
{code}

So there's a weird asymmetry to it as well.

> Cannot create UDF with StructType input
> ---
>
> Key: SPARK-12823
> URL: https://issues.apache.org/jira/browse/SPARK-12823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Frank Rosner
>
> h5. Problem
> It is not possible to apply a UDF to a column that has a struct data type. 
> Two previous requests to the mailing list remained unanswered.
> h5. How-To-Reproduce
> {code}
> val sql = new org.apache.spark.sql.SQLContext(sc)
> import sql.implicits._
> case class KV(key: Long, value: String)
> case class Row(kv: KV)
> val df = sc.parallelize(List(Row(KV(1L, "a")), Row(KV(5L, "b")))).toDF
> val udf1 = org.apache.spark.sql.functions.udf((kv: KV) => kv.value)
> df.select(udf1(df("kv"))).show
> // java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast 
> to $line78.$read$$iwC$$iwC$KV
> val udf2 = org.apache.spark.sql.functions.udf((kv: (Long, String)) => kv._2)
> df.select(udf2(df("kv"))).show
> // org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(kv)' due to 
> data type mismatch: argument 1 requires struct<_1:bigint,_2:string> type, 
> however, 'kv' is of struct<key:bigint,value:string> type.;
> {code}
> h5. Mailing List Entries
> - 
> https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3CCACUahd8M=ipCbFCYDyein_=vqyoantn-tpxe6sq395nh10g...@mail.gmail.com%3E
> - https://www.mail-archive.com/user@spark.apache.org/msg43092.html
> h5. Possible Workaround
> If you create a {{UserDefinedFunction}} manually, not using the {{udf}} 
> helper functions, it works. See https://github.com/FRosner/struct-udf, which 
> exposes the {{UserDefinedFunction}} constructor (public from package 
> private). However, then you have to work with a {{Row}}, because it does not 
> automatically convert the row to a case class / tuple.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.

2017-09-20 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16173701#comment-16173701
 ] 

Imran Rashid commented on SPARK-12297:
--

In case anyone took a look at the design doc I posted a few days ago, I just 
made some changes after posting the implementation.  In particular I changed 
the proposal for csv & json to be more consistent across all formats, which I 
believe is consistent with the earlier discussion.

> Add work-around for Parquet/Hive int96 timestamp bug.
> -
>
> Key: SPARK-12297
> URL: https://issues.apache.org/jira/browse/SPARK-12297
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Reporter: Ryan Blue
>
> Spark copied Hive's behavior for parquet, but this was inconsistent with 
> other file formats, and inconsistent with Impala (which is the original 
> source of putting a timestamp as an int96 in parquet, I believe).  This made 
> timestamps in parquet act more like timestamps with timezones, while in other 
> file formats, timestamps have no time zone, they are a "floating time".
> The easiest way to see this issue is to write out a table with timestamps in 
> multiple different formats from one timezone, then try to read them back in 
> another timezone.  Eg., here I write out a few timestamps to parquet and 
> textfile hive tables, and also just as a json file, all in the 
> "America/Los_Angeles" timezone:
> {code}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> val tblPrefix = args(0)
> val schema = new StructType().add("ts", TimestampType)
> val rows = sc.parallelize(Seq(
>   "2015-12-31 23:50:59.123",
>   "2015-12-31 22:49:59.123",
>   "2016-01-01 00:39:59.123",
>   "2016-01-01 01:29:59.123"
> ).map { x => Row(java.sql.Timestamp.valueOf(x)) })
> val rawData = spark.createDataFrame(rows, schema).toDF()
> rawData.show()
> Seq("parquet", "textfile").foreach { format =>
>   val tblName = s"${tblPrefix}_$format"
>   spark.sql(s"DROP TABLE IF EXISTS $tblName")
>   spark.sql(
> raw"""CREATE TABLE $tblName (
>   |  ts timestamp
>   | )
>   | STORED AS $format
>  """.stripMargin)
>   rawData.write.insertInto(tblName)
> }
> rawData.write.json(s"${tblPrefix}_json")
> {code}
> Then I start a spark-shell in "America/New_York" timezone, and read the data 
> back from each table:
> {code}
> scala> spark.sql("select * from la_parquet").collect().foreach{println}
> [2016-01-01 02:50:59.123]
> [2016-01-01 01:49:59.123]
> [2016-01-01 03:39:59.123]
> [2016-01-01 04:29:59.123]
> scala> spark.sql("select * from la_textfile").collect().foreach{println}
> [2015-12-31 23:50:59.123]
> [2015-12-31 22:49:59.123]
> [2016-01-01 00:39:59.123]
> [2016-01-01 01:29:59.123]
> scala> spark.read.json("la_json").collect().foreach{println}
> [2015-12-31 23:50:59.123]
> [2015-12-31 22:49:59.123]
> [2016-01-01 00:39:59.123]
> [2016-01-01 01:29:59.123]
> scala> spark.read.json("la_json").join(spark.sql("select * from 
> la_textfile"), "ts").show()
> +--------------------+
> |                  ts|
> +--------------------+
> |2015-12-31 23:50:...|
> |2015-12-31 22:49:...|
> |2016-01-01 00:39:...|
> |2016-01-01 01:29:...|
> +--------------------+
> scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), 
> "ts").show()
> +---+
> | ts|
> +---+
> +---+
> {code}
> The textfile and json based data shows the same times, and can be joined 
> against each other, while the times from the parquet data have changed (and 
> obviously joins fail).
> This is a big problem for any organization that may try to read the same data 
> (say in S3) with clusters in multiple timezones.  It can also be a nasty 
> surprise as an organization tries to migrate file formats.  Finally, it's a 
> source of incompatibility between Hive, Impala, and Spark.
> HIVE-12767 aims to fix this by introducing a table property which indicates 
> the "storage timezone" for the table.  Spark should add the same to ensure 
> consistency between file formats, and with Hive & Impala.
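
Until a table-property based fix exists, one rough per-query stopgap (assuming you know the zone the writing cluster used) is to shift the parquet values back explicitly. A sketch using the zones from the example above, run from the "America/New_York" session; this is a workaround illustration, not the fix proposed in this ticket:

{code}
import org.apache.spark.sql.functions.{col, from_utc_timestamp, to_utc_timestamp}

// Treat the session-local rendering as New York wall time, convert it to UTC,
// then re-interpret that UTC value in the writer's zone (Los Angeles), which
// recovers the wall-clock values the textfile/json copies show.
val fixed = spark.table("la_parquet").withColumn(
  "ts",
  from_utc_timestamp(to_utc_timestamp(col("ts"), "America/New_York"), "America/Los_Angeles"))
fixed.show(false)
{code}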



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14371) OnlineLDAOptimizer should not collect stats for each doc in mini-batch to driver

2017-09-20 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14371:
--
Shepherd: Joseph K. Bradley

> OnlineLDAOptimizer should not collect stats for each doc in mini-batch to 
> driver
> 
>
> Key: SPARK-14371
> URL: https://issues.apache.org/jira/browse/SPARK-14371
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> See this line: 
> https://github.com/apache/spark/blob/5743c6476dbef50852b7f9873112a2d299966ebd/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L437
> The second element in each row of "stats" is a list with one Vector for each 
> document in the mini-batch.  Those are collected to the driver in this line:
> https://github.com/apache/spark/blob/5743c6476dbef50852b7f9873112a2d299966ebd/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L456
> We should not collect those to the driver.  Rather, we should do the 
> necessary maps and aggregations in a distributed manner.  This will involve 
> modifying the Dirichlet expectation implementation.  (This JIRA should be done 
> by someone knowledgeable about online LDA and Spark.)
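
For illustration, here is a generic sketch of the kind of distributed aggregation this is asking for (not the actual LDAOptimizer change): the per-document vectors stay on the executors and only the reduced sum reaches the driver.

{code}
import org.apache.spark.rdd.RDD

// Sum per-document statistics of dimension `dim` without collecting them.
def sumStats(stats: RDD[Array[Double]], dim: Int): Array[Double] = {
  stats.treeAggregate(new Array[Double](dim))(
    // seqOp: fold each document's vector into the partition-local accumulator
    (acc, v) => { var i = 0; while (i < dim) { acc(i) += v(i); i += 1 }; acc },
    // combOp: merge partition-local accumulators
    (a, b) => { var i = 0; while (i < dim) { a(i) += b(i); i += 1 }; a }
  )
}
{code}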



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21549) Spark fails to complete job correctly in case of OutputFormat which do not write into hdfs

2017-09-20 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16173449#comment-16173449
 ] 

Steve Loughran commented on SPARK-21549:


The {{newTaskTempFileAbsPath()}} method is an interesting spot of code... I'm 
still trying to work out when it is actually used. Some committers like 
{{ManifestFileCommitProtocol}} don't support it at all.

However, if it is used, then your patch is going to cause problems if the dest 
FS != the default FS, because then the bit of the protocol which takes that 
list of temp files and renames() them into their destination is going to fail. 
I think you'd be better off having the committer fail fast when an absolute 
path is asked for.
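
A rough sketch of that "fail fast" idea, assuming the 2.2 {{FileCommitProtocol}} signatures (the class name and message are hypothetical, not a proposed patch):

{code}
import org.apache.hadoop.mapreduce.TaskAttemptContext
import org.apache.spark.internal.io.HadoopMapReduceCommitProtocol

// Refuse absolute-path staging up front instead of failing later during
// commit/abort with a confusing "Can not create a Path from a null string".
class NoAbsolutePathCommitProtocol(jobId: String, path: String)
  extends HadoopMapReduceCommitProtocol(jobId, path) {

  override def newTaskTempFileAbsPath(
      taskContext: TaskAttemptContext, absoluteDir: String, ext: String): String = {
    throw new UnsupportedOperationException(
      s"Absolute output locations are not supported (requested: $absoluteDir)")
  }
}
{code}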

> Spark fails to complete job correctly in case of OutputFormat which do not 
> write into hdfs
> --
>
> Key: SPARK-21549
> URL: https://issues.apache.org/jira/browse/SPARK-21549
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: spark 2.2.0
> scala 2.11
>Reporter: Sergey Zhemzhitsky
>
> Spark fails to complete job correctly in case of custom OutputFormat 
> implementations.
> There are OutputFormat implementations which do not need to use 
> *mapreduce.output.fileoutputformat.outputdir* standard hadoop property.
> [But spark reads this property from the 
> configuration|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopMapReduceWriter.scala#L79]
>  while setting up an OutputCommitter
> {code:javascript}
> val committer = FileCommitProtocol.instantiate(
>   className = classOf[HadoopMapReduceCommitProtocol].getName,
>   jobId = stageId.toString,
>   outputPath = conf.value.get("mapreduce.output.fileoutputformat.outputdir"),
>   isAppend = false).asInstanceOf[HadoopMapReduceCommitProtocol]
> committer.setupJob(jobContext)
> {code}
> ... and then uses this property later on while [committing the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L132],
>  [aborting the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L141],
>  [creating task's temp 
> path|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L95]
> In such cases, when the job completes, the following exception is thrown:
> {code}
> Can not create a Path from a null string
> java.lang.IllegalArgumentException: Can not create a Path from a null string
>   at org.apache.hadoop.fs.Path.checkPathArg(Path.java:123)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:135)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:89)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:58)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:141)
>   at 
> org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:106)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1084)
>   ...
> {code}
> So it seems that all the jobs which use OutputFormats which don't write data 
> into HDFS-compatible file systems are broken.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22076) Expand.projections should not be a Stream

2017-09-20 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22076.
-
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.1

> Expand.projections should not be a Stream
> -
>
> Key: SPARK-22076
> URL: https://issues.apache.org/jira/browse/SPARK-22076
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.2.1, 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22080) Allow developers to add pre-optimisation rules

2017-09-20 Thread Sathiya Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16173313#comment-16173313
 ] 

Sathiya Kumar commented on SPARK-22080:
---

Here is a PR with the proposed changes: 
https://github.com/apache/spark/pull/19295

> Allow developers to add pre-optimisation rules
> --
>
> Key: SPARK-22080
> URL: https://issues.apache.org/jira/browse/SPARK-22080
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Sathiya Kumar
>Priority: Minor
>
> [SPARK-9843] added support for adding custom rules for optimising 
> LogicalPlan, but the added rules are applied only after all of Spark's 
> native rules have run. Allowing users to plug in pre-optimisation rules 
> would facilitate some advanced optimisations.
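
For contrast, a minimal sketch of the existing SPARK-9843 hook, which only runs after Spark's own optimizer batches (the rule here is a deliberate no-op, just to show where user rules currently plug in):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A no-op optimizer rule, purely for illustration.
object MyNoOpRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

val spark = SparkSession.builder().appName("extra-optimizations-demo").getOrCreate()
// These rules run only AFTER the built-in optimizer batches; this ticket asks
// for an equivalent hook that runs BEFORE them.
spark.experimental.extraOptimizations = Seq(MyNoOpRule)
{code}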



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22080) Allow developers to add pre-optimisation rules

2017-09-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16173309#comment-16173309
 ] 

Apache Spark commented on SPARK-22080:
--

User 'sathiyapk' has created a pull request for this issue:
https://github.com/apache/spark/pull/19295

> Allow developers to add pre-optimisation rules
> --
>
> Key: SPARK-22080
> URL: https://issues.apache.org/jira/browse/SPARK-22080
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Sathiya Kumar
>Priority: Minor
>
> [SPARK-9843] added support for adding custom rules for optimising 
> LogicalPlan, but the added rules are applied only after all of Spark's 
> native rules have run. Allowing users to plug in pre-optimisation rules 
> would facilitate some advanced optimisations.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22080) Allow developers to add pre-optimisation rules

2017-09-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22080:


Assignee: Apache Spark

> Allow developers to add pre-optimisation rules
> --
>
> Key: SPARK-22080
> URL: https://issues.apache.org/jira/browse/SPARK-22080
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Sathiya Kumar
>Assignee: Apache Spark
>Priority: Minor
>
> [SPARK-9843] added support for adding custom rules for optimising 
> LogicalPlan, but the added rules are applied only after all of Spark's 
> native rules have run. Allowing users to plug in pre-optimisation rules 
> would facilitate some advanced optimisations.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22080) Allow developers to add pre-optimisation rules

2017-09-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22080:


Assignee: (was: Apache Spark)

> Allow developers to add pre-optimisation rules
> --
>
> Key: SPARK-22080
> URL: https://issues.apache.org/jira/browse/SPARK-22080
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Sathiya Kumar
>Priority: Minor
>
> [SPARK-9843] added support for adding custom rules for optimising 
> LogicalPlan, but the added rules are applied only after all of Spark's 
> native rules have run. Allowing users to plug in pre-optimisation rules 
> would facilitate some advanced optimisations.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22072) Allow the same shell params to be used for all of the different steps in release-build

2017-09-20 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-22072:

Description: The jenkins script currently sets SPARK_VERSION to different 
values depending on what action is being performed. To simplify the scripts use 
SPARK_PACKAGE_VERSION in release-publish.  (was: The jenkins scripts currently 
do version rewriting to drop the leading v when triggered with publish-release, 
move that inside the release-build scripts.)

> Allow the same shell params to be used for all of the different steps in 
> release-build
> --
>
> Key: SPARK-22072
> URL: https://issues.apache.org/jira/browse/SPARK-22072
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.1.2, 2.3.0
>Reporter: holdenk
>Assignee: holdenk
>
> The jenkins script currently sets SPARK_VERSION to different values depending 
> on what action is being performed. To simplify the scripts use 
> SPARK_PACKAGE_VERSION in release-publish.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22072) Allow the same shell params to be used for all of the different steps in release-build

2017-09-20 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-22072:

Summary: Allow the same shell params to be used for all of the different 
steps in release-build  (was: Automatically rewrite the version string for 
publish-release)

> Allow the same shell params to be used for all of the different steps in 
> release-build
> --
>
> Key: SPARK-22072
> URL: https://issues.apache.org/jira/browse/SPARK-22072
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.1.2, 2.3.0
>Reporter: holdenk
>Assignee: holdenk
>
> The jenkins scripts currently do version rewriting to drop the leading v when 
> triggered with publish-release, move that inside the release-build scripts.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22080) Allow developers to add pre-optimisation rules

2017-09-20 Thread Sathiya Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sathiya Kumar updated SPARK-22080:
--
Description: [SPARK-9843] added support for adding custom rules for 
optimising LogicalPlan, but the added rules are only applied only after all the 
spark's native rules are applied. Allowing users to plug in pre-optimisation 
rules facilitate some advanced optimisation.  (was: 
[SPARK-9843](https://issues.apache.org/jira/browse/SPARK-9843) added support 
for adding custom rules for optimising LogicalPlan, but the added rules are 
only applied only after all the spark's native rules are applied. Allowing 
users to plug in pre-optimisation rules facilitate some advanced optimisation.)

> Allow developers to add pre-optimisation rules
> --
>
> Key: SPARK-22080
> URL: https://issues.apache.org/jira/browse/SPARK-22080
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Sathiya Kumar
>Priority: Minor
>
> [SPARK-9843] added support for adding custom rules for optimising 
> LogicalPlan, but the added rules are applied only after all of Spark's 
> native rules have run. Allowing users to plug in pre-optimisation rules 
> would facilitate some advanced optimisations.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22080) Allow developers to add pre-optimisation rules

2017-09-20 Thread Sathiya Kumar (JIRA)
Sathiya Kumar created SPARK-22080:
-

 Summary: Allow developers to add pre-optimisation rules
 Key: SPARK-22080
 URL: https://issues.apache.org/jira/browse/SPARK-22080
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.2.0, 2.1.1
Reporter: Sathiya Kumar
Priority: Minor


[SPARK-9843](https://issues.apache.org/jira/browse/SPARK-9843) added support 
for adding custom rules for optimising LogicalPlan, but the added rules are 
only applied only after all the spark's native rules are applied. Allowing 
users to plug in pre-optimisation rules facilitate some advanced optimisation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21935) Pyspark UDF causing ExecutorLostFailure

2017-09-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16173249#comment-16173249
 ] 

Sean Owen commented on SPARK-21935:
---

Yes, it seems like your tasks need more memory than the configuration was giving 
them. Sounds like you need more like 4GB per 3 cores, in general. So you could 
allocate executors of that size (or a multiple of it) with dynamic allocation, 
and it should work out.
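
To make that concrete, a purely illustrative {{spark-defaults.conf}} sketch of "roughly 4GB per 3 cores" executors; the numbers are examples, not values recommended in this thread, and the memoryOverhead in particular needs tuning for the pyspark worker processes:

{code}
spark.dynamicAllocation.enabled      true
spark.executor.cores                 3
spark.executor.memory                3g
spark.yarn.executor.memoryOverhead   1024
{code}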

> Pyspark UDF causing ExecutorLostFailure 
> 
>
> Key: SPARK-21935
> URL: https://issues.apache.org/jira/browse/SPARK-21935
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Nikolaos Tsipas
>  Labels: pyspark, udf
> Attachments: cpu.png, Screen Shot 2017-09-06 at 11.30.28.png, Screen 
> Shot 2017-09-06 at 11.31.13.png, Screen Shot 2017-09-06 at 11.31.31.png
>
>
> Hi,
> I'm using spark 2.1.0 on AWS EMR (Yarn) and trying to use a UDF in python as 
> follows:
> {code}
> from pyspark.sql.functions import col, udf
> from pyspark.sql.types import StringType
> path = 's3://some/parquet/dir/myfile.parquet'
> df = spark.read.load(path)
> def _test_udf(useragent):
> return useragent.upper()
> test_udf = udf(_test_udf, StringType())
> df = df.withColumn('test_field', test_udf(col('std_useragent')))
> df.write.parquet('/output.parquet')
> {code}
> The following config is used in {{spark-defaults.conf}} (using 
> {{maximizeResourceAllocation}} in EMR)
> {code}
> ...
> spark.executor.instances 4
> spark.executor.cores 8
> spark.driver.memory  8G
> spark.executor.memory9658M
> spark.default.parallelism64
> spark.driver.maxResultSize   3G
> ...
> {code}
> The cluster has 4 worker nodes (+1 master) with the following specs: 8 vCPU, 
> 15 GiB memory, 160 SSD GB storage
> The above example fails every single time with errors like the following:
> {code}
> 17/09/06 09:58:08 WARN TaskSetManager: Lost task 26.1 in stage 1.0 (TID 50, 
> ip-172-31-7-125.eu-west-1.compute.internal, executor 10): ExecutorLostFailure 
> (executor 10 exited caused by one of the running tasks) Reason: Container 
> killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical 
> memory used. Consider boosting spark.yarn.executor.memoryOverhead.
> {code}
> I tried to increase the  {{spark.yarn.executor.memoryOverhead}} to 3000 which 
> delays the errors but eventually I get them before the end of the job. The 
> job eventually fails.
> !Screen Shot 2017-09-06 at 11.31.31.png|width=800!
> If I run the above job in scala everything works as expected (without having 
> to adjust the memoryOverhead)
> {code}
> import org.apache.spark.sql.functions.udf
> val upper: String => String = _.toUpperCase
> val df = spark.read.load("s3://some/parquet/dir/myfile.parquet")
> val upperUDF = udf(upper)
> val newdf = df.withColumn("test_field", upperUDF(col("std_useragent")))
> newdf.write.parquet("/output.parquet")
> {code}
> !Screen Shot 2017-09-06 at 11.31.13.png|width=800!
> Cpu utilisation is very bad with pyspark
> !cpu.png|width=800!
> Is this a known bug with pyspark and udfs or is it a matter of bad 
> configuration? 
> Looking forward to suggestions. Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21549) Spark fails to complete job correctly in case of OutputFormat which do not write into hdfs

2017-09-20 Thread Sergey Zhemzhitsky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16173184#comment-16173184
 ] 

Sergey Zhemzhitsky commented on SPARK-21549:


[~mridulm80], [~WeiqingYang], [~ste...@apache.org], 

I've implemented the fix in this PR 
(https://github.com/apache/spark/pull/19294), which sets the user's current 
working directory (typically their home directory on distributed filesystems) 
as the output directory.

The patch allows using OutputFormats which write to external systems, 
databases, etc. by means of the RDD API.
As far as I understand, the requirement for an output path to be specified is 
only there to allow files to be committed to an absolute output location, which 
is not the case for output formats that write data to external systems. 
So using the user's working directory in such situations seems to be ok.
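
A rough sketch of the fallback described above (not the actual PR code; the helper name is hypothetical):

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Use the configured output dir when present, otherwise fall back to the
// user's working directory so committers don't trip over a null path when
// the OutputFormat writes to an external system.
def resolveOutputDir(conf: Configuration): Path = {
  Option(conf.get("mapreduce.output.fileoutputformat.outputdir"))
    .map(new Path(_))
    .getOrElse(FileSystem.get(conf).getWorkingDirectory)
}
{code}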
 


> Spark fails to complete job correctly in case of OutputFormat which do not 
> write into hdfs
> --
>
> Key: SPARK-21549
> URL: https://issues.apache.org/jira/browse/SPARK-21549
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: spark 2.2.0
> scala 2.11
>Reporter: Sergey Zhemzhitsky
>
> Spark fails to complete job correctly in case of custom OutputFormat 
> implementations.
> There are OutputFormat implementations which do not need to use 
> *mapreduce.output.fileoutputformat.outputdir* standard hadoop property.
> [But spark reads this property from the 
> configuration|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopMapReduceWriter.scala#L79]
>  while setting up an OutputCommitter
> {code:javascript}
> val committer = FileCommitProtocol.instantiate(
>   className = classOf[HadoopMapReduceCommitProtocol].getName,
>   jobId = stageId.toString,
>   outputPath = conf.value.get("mapreduce.output.fileoutputformat.outputdir"),
>   isAppend = false).asInstanceOf[HadoopMapReduceCommitProtocol]
> committer.setupJob(jobContext)
> {code}
> ... and then uses this property later on while [committing the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L132],
>  [aborting the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L141],
>  [creating task's temp 
> path|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L95]
> In such cases, when the job completes, the following exception is thrown:
> {code}
> Can not create a Path from a null string
> java.lang.IllegalArgumentException: Can not create a Path from a null string
>   at org.apache.hadoop.fs.Path.checkPathArg(Path.java:123)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:135)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:89)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:58)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:141)
>   at 
> org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:106)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1084)
>   ...
> {code}
> So it seems that all the jobs which use OutputFormats which don't write data 
> into HDFS-compatible file systems are broken.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21549) Spark fails to complete job correctly in case of OutputFormat which do not write into hdfs

2017-09-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16173175#comment-16173175
 ] 

Apache Spark commented on SPARK-21549:
--

User 'szhem' has created a pull request for this issue:
https://github.com/apache/spark/pull/19294

> Spark fails to complete job correctly in case of OutputFormat which do not 
> write into hdfs
> --
>
> Key: SPARK-21549
> URL: https://issues.apache.org/jira/browse/SPARK-21549
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: spark 2.2.0
> scala 2.11
>Reporter: Sergey Zhemzhitsky
>
> Spark fails to complete job correctly in case of custom OutputFormat 
> implementations.
> There are OutputFormat implementations which do not need to use 
> *mapreduce.output.fileoutputformat.outputdir* standard hadoop property.
> [But spark reads this property from the 
> configuration|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopMapReduceWriter.scala#L79]
>  while setting up an OutputCommitter
> {code:javascript}
> val committer = FileCommitProtocol.instantiate(
>   className = classOf[HadoopMapReduceCommitProtocol].getName,
>   jobId = stageId.toString,
>   outputPath = conf.value.get("mapreduce.output.fileoutputformat.outputdir"),
>   isAppend = false).asInstanceOf[HadoopMapReduceCommitProtocol]
> committer.setupJob(jobContext)
> {code}
> ... and then uses this property later on while [committing the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L132],
>  [aborting the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L141],
>  [creating task's temp 
> path|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L95]
> In such cases, when the job completes, the following exception is thrown:
> {code}
> Can not create a Path from a null string
> java.lang.IllegalArgumentException: Can not create a Path from a null string
>   at org.apache.hadoop.fs.Path.checkPathArg(Path.java:123)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:135)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:89)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:58)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:141)
>   at 
> org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:106)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1084)
>   ...
> {code}
> So it seems that all the jobs which use OutputFormats which don't write data 
> into HDFS-compatible file systems are broken.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21549) Spark fails to complete job correctly in case of OutputFormat which do not write into hdfs

2017-09-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21549:


Assignee: (was: Apache Spark)

> Spark fails to complete job correctly in case of OutputFormat which do not 
> write into hdfs
> --
>
> Key: SPARK-21549
> URL: https://issues.apache.org/jira/browse/SPARK-21549
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: spark 2.2.0
> scala 2.11
>Reporter: Sergey Zhemzhitsky
>
> Spark fails to complete job correctly in case of custom OutputFormat 
> implementations.
> There are OutputFormat implementations which do not need to use 
> *mapreduce.output.fileoutputformat.outputdir* standard hadoop property.
> [But spark reads this property from the 
> configuration|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopMapReduceWriter.scala#L79]
>  while setting up an OutputCommitter
> {code:javascript}
> val committer = FileCommitProtocol.instantiate(
>   className = classOf[HadoopMapReduceCommitProtocol].getName,
>   jobId = stageId.toString,
>   outputPath = conf.value.get("mapreduce.output.fileoutputformat.outputdir"),
>   isAppend = false).asInstanceOf[HadoopMapReduceCommitProtocol]
> committer.setupJob(jobContext)
> {code}
> ... and then uses this property later on while [committing the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L132],
>  [aborting the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L141],
>  [creating task's temp 
> path|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L95]
> In such cases, when the job completes, the following exception is thrown:
> {code}
> Can not create a Path from a null string
> java.lang.IllegalArgumentException: Can not create a Path from a null string
>   at org.apache.hadoop.fs.Path.checkPathArg(Path.java:123)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:135)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:89)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:58)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:141)
>   at 
> org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:106)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1084)
>   ...
> {code}
> So it seems that all the jobs which use OutputFormats which don't write data 
> into HDFS-compatible file systems are broken.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21549) Spark fails to complete job correctly in case of OutputFormat which do not write into hdfs

2017-09-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21549:


Assignee: Apache Spark

> Spark fails to complete job correctly in case of OutputFormat which do not 
> write into hdfs
> --
>
> Key: SPARK-21549
> URL: https://issues.apache.org/jira/browse/SPARK-21549
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: spark 2.2.0
> scala 2.11
>Reporter: Sergey Zhemzhitsky
>Assignee: Apache Spark
>
> Spark fails to complete job correctly in case of custom OutputFormat 
> implementations.
> There are OutputFormat implementations which do not need to use 
> *mapreduce.output.fileoutputformat.outputdir* standard hadoop property.
> [But spark reads this property from the 
> configuration|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopMapReduceWriter.scala#L79]
>  while setting up an OutputCommitter
> {code:javascript}
> val committer = FileCommitProtocol.instantiate(
>   className = classOf[HadoopMapReduceCommitProtocol].getName,
>   jobId = stageId.toString,
>   outputPath = conf.value.get("mapreduce.output.fileoutputformat.outputdir"),
>   isAppend = false).asInstanceOf[HadoopMapReduceCommitProtocol]
> committer.setupJob(jobContext)
> {code}
> ... and then uses this property later on while [committing the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L132],
>  [aborting the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L141],
>  [creating task's temp 
> path|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L95]
> In such cases, when the job completes, the following exception is thrown:
> {code}
> Can not create a Path from a null string
> java.lang.IllegalArgumentException: Can not create a Path from a null string
>   at org.apache.hadoop.fs.Path.checkPathArg(Path.java:123)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:135)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:89)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:58)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:141)
>   at 
> org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:106)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1084)
>   ...
> {code}
> So it seems that all the jobs which use OutputFormats which don't write data 
> into HDFS-compatible file systems are broken.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10925) Exception when joining DataFrames

2017-09-20 Thread peay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16173139#comment-16173139
 ] 

peay commented on SPARK-10925:
--

Same issue on Spark 2.1.0. I have been working around this using 
`df.rdd.toDF()` before using `join`, but this is really far from ideal.
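
For reference, a minimal Scala sketch of that workaround (the column names are made up): re-creating the aggregated side from its RDD and schema gives it fresh attribute ids, so the subsequent join analyses cleanly.

{code}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val df = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3))).toDF("name", "value")
val agg = df.groupBy("name").count()

// Equivalent of df.rdd.toDF() in pyspark: rebuild the DataFrame from rows + schema.
val aggFresh = sqlContext.createDataFrame(agg.rdd, agg.schema)
val joined = df.join(aggFresh, "name")
joined.show()
{code}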

> Exception when joining DataFrames
> -
>
> Key: SPARK-10925
> URL: https://issues.apache.org/jira/browse/SPARK-10925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Tested with Spark 1.5.0 and Spark 1.5.1
>Reporter: Alexis Seigneurin
> Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase2.scala, 
> TestCase.scala
>
>
> I get an exception when joining a DataFrame with another DataFrame. The 
> second DataFrame was created by performing an aggregation on the first 
> DataFrame.
> My complete workflow is:
> # read the DataFrame
> # apply an UDF on column "name"
> # apply an UDF on column "surname"
> # apply an UDF on column "birthDate"
> # aggregate on "name" and re-join with the DF
> # aggregate on "surname" and re-join with the DF
> If I remove one step, the process completes normally.
> Here is the exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
> attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in 
> operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS 
> birthDate_cleaned#8];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:520)
>   at TestCase2$.main(TestCase2.scala:51)
>   at TestCase2.main(TestCase2.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:4

[jira] [Comment Edited] (SPARK-21935) Pyspark UDF causing ExecutorLostFailure

2017-09-20 Thread Nikolaos Tsipas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16173110#comment-16173110
 ] 

Nikolaos Tsipas edited comment on SPARK-21935 at 9/20/17 12:45 PM:
---

After a few trial-and-error iterations we managed to find a set of config params 
(executor count, executor memory, memoryOverhead) that finishes the job 
without failing. In the process we realised the following:
- It looks like our problem had to do with the amount of memory available per 
task in the executor. Specifically, on an 8-core 15GB slave node we were using 
1 executor with 8 tasks/cores, 4GB heap and 6GB off heap. This always ended 
with ExecutorLostFailure errors. After the number of tasks was reduced to 3 (of 
course we were underutilising cpu resources in this case) the job completed 
successfully.
-- Theoretically the above could also be fixed by increasing parallelism and 
consequently reducing the amount of memory required per task, but that approach 
didn't work when we investigated it.
- Dynamic allocation is based on the amount of MEMORY available on the slave 
nodes, i.e. if 15GB and 8 cores are available and you ask for 4 executors with 
1GB and 2 cores each, spark/yarn will give you back 15 executors with 2 cores and 
1GB memory, because it will try to optimise resource utilisation for memory 
(i.e. 1GB * 15 executors = 15GB). Of course you will then be running 15*2=30 tasks 
in parallel, which is too much for the 8 cores you have available. Keeping 
things balanced is a trial-and-error procedure. (Yarn can be tuned in spark to 
take the available cores into account as well, and iirc I've done it in the past, 
but I can't find it now.)

[~srowen] any feedback on the above would be appreciated. Cheers. Nik


was (Author: nicktgr15):
After a few trial and error iterations managed to find a set of config params 
(executors number, executor memory, memoryOverhead) that finishes the job 
without failing. In the process realised the following:
- It looks like our problem had to do with the amount of memory available per 
task in the executor. Specifically, on an 8 core 15GB slave node we were using 
1 executor with 8 tasks/cores, 4GB heap and 6GB off heap. This always ended 
with ExecutorLostFailure errors. After the number of tasks was reduced to 3 (of 
course we were underutilising cpu resources in this case) the job completed 
successfully. 
-- Theoretically the above could be also fixed by increasing parallelism and 
consequently reducing the amount of memory required per task but that approach 
didn't work when it was investigated.
- Dynamic allocation is based on the amount of MEMORY available on the slave 
nodes. i.e. if 15GB and 8 cores are available and you ask for 4 executors with 
1GB and 2 cores spark/yarn will give you back 15 executors with 2 cores and 1 
GB memory because it will try to optimise resources utilisation for memory 
(i.e. 1GB * 15 executors = 15GB). Of course you will be running 15*2=30 tasks 
in parallel which is too much for the 8 cores you have available. Keeping 
things balanced is a trial and error procedure. (Yarn can be tuned in spark to 
take into account the available cores as well and iirc I've done it in the past 
but I can't find it now)

> Pyspark UDF causing ExecutorLostFailure 
> 
>
> Key: SPARK-21935
> URL: https://issues.apache.org/jira/browse/SPARK-21935
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Nikolaos Tsipas
>  Labels: pyspark, udf
> Attachments: cpu.png, Screen Shot 2017-09-06 at 11.30.28.png, Screen 
> Shot 2017-09-06 at 11.31.13.png, Screen Shot 2017-09-06 at 11.31.31.png
>
>
> Hi,
> I'm using spark 2.1.0 on AWS EMR (Yarn) and trying to use a UDF in python as 
> follows:
> {code}
> from pyspark.sql.functions import col, udf
> from pyspark.sql.types import StringType
> path = 's3://some/parquet/dir/myfile.parquet'
> df = spark.read.load(path)
> def _test_udf(useragent):
> return useragent.upper()
> test_udf = udf(_test_udf, StringType())
> df = df.withColumn('test_field', test_udf(col('std_useragent')))
> df.write.parquet('/output.parquet')
> {code}
> The following config is used in {{spark-defaults.conf}} (using 
> {{maximizeResourceAllocation}} in EMR)
> {code}
> ...
> spark.executor.instances 4
> spark.executor.cores 8
> spark.driver.memory  8G
> spark.executor.memory9658M
> spark.default.parallelism64
> spark.driver.maxResultSize   3G
> ...
> {code}
> The cluster has 4 worker nodes (+1 master) with the following specs: 8 vCPU, 
> 15 GiB memory, 160 SSD GB storage
> The above example fails every single time with errors like the following:
> {code}
> 17/09/06 09:58:08 WARN TaskSetManager: Lost task 26.

[jira] [Commented] (SPARK-21935) Pyspark UDF causing ExecutorLostFailure

2017-09-20 Thread Nikolaos Tsipas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16173110#comment-16173110
 ] 

Nikolaos Tsipas commented on SPARK-21935:
-

After a few trial-and-error iterations we managed to find a set of config params 
(executor count, executor memory, memoryOverhead) that finishes the job 
without failing. In the process we realised the following:
- It looks like our problem had to do with the amount of memory available per 
task in the executor. Specifically, on an 8-core 15GB slave node we were using 
1 executor with 8 tasks/cores, 4GB heap and 6GB off heap. This always ended 
with ExecutorLostFailure errors. After the number of tasks was reduced to 3 (of 
course we were underutilising cpu resources in this case) the job completed 
successfully.
-- Theoretically the above could also be fixed by increasing parallelism and 
consequently reducing the amount of memory required per task, but that approach 
didn't work when we investigated it.
- Dynamic allocation is based on the amount of MEMORY available on the slave 
nodes, i.e. if 15GB and 8 cores are available and you ask for 4 executors with 
1GB and 2 cores each, spark/yarn will give you back 15 executors with 2 cores and 
1GB memory, because it will try to optimise resource utilisation for memory 
(i.e. 1GB * 15 executors = 15GB). Of course you will then be running 15*2=30 tasks 
in parallel, which is too much for the 8 cores you have available. Keeping 
things balanced is a trial-and-error procedure. (Yarn can be tuned in spark to 
take the available cores into account as well, and iirc I've done it in the past, 
but I can't find it now.)

> Pyspark UDF causing ExecutorLostFailure 
> 
>
> Key: SPARK-21935
> URL: https://issues.apache.org/jira/browse/SPARK-21935
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Nikolaos Tsipas
>  Labels: pyspark, udf
> Attachments: cpu.png, Screen Shot 2017-09-06 at 11.30.28.png, Screen 
> Shot 2017-09-06 at 11.31.13.png, Screen Shot 2017-09-06 at 11.31.31.png
>
>
> Hi,
> I'm using spark 2.1.0 on AWS EMR (Yarn) and trying to use a UDF in python as 
> follows:
> {code}
> from pyspark.sql.functions import col, udf
> from pyspark.sql.types import StringType
> path = 's3://some/parquet/dir/myfile.parquet'
> df = spark.read.load(path)
> def _test_udf(useragent):
> return useragent.upper()
> test_udf = udf(_test_udf, StringType())
> df = df.withColumn('test_field', test_udf(col('std_useragent')))
> df.write.parquet('/output.parquet')
> {code}
> The following config is used in {{spark-defaults.conf}} (using 
> {{maximizeResourceAllocation}} in EMR)
> {code}
> ...
> spark.executor.instances 4
> spark.executor.cores 8
> spark.driver.memory  8G
> spark.executor.memory9658M
> spark.default.parallelism64
> spark.driver.maxResultSize   3G
> ...
> {code}
> The cluster has 4 worker nodes (+1 master) with the following specs: 8 vCPU, 
> 15 GiB memory, 160 SSD GB storage
> The above example fails every single time with errors like the following:
> {code}
> 17/09/06 09:58:08 WARN TaskSetManager: Lost task 26.1 in stage 1.0 (TID 50, 
> ip-172-31-7-125.eu-west-1.compute.internal, executor 10): ExecutorLostFailure 
> (executor 10 exited caused by one of the running tasks) Reason: Container 
> killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical 
> memory used. Consider boosting spark.yarn.executor.memoryOverhead.
> {code}
> I tried to increase the  {{spark.yarn.executor.memoryOverhead}} to 3000 which 
> delays the errors but eventually I get them before the end of the job. The 
> job eventually fails.
> !Screen Shot 2017-09-06 at 11.31.31.png|width=800!
> If I run the above job in scala everything works as expected (without having 
> to adjust the memoryOverhead)
> {code}
> import org.apache.spark.sql.functions.udf
> val upper: String => String = _.toUpperCase
> val df = spark.read.load("s3://some/parquet/dir/myfile.parquet")
> val upperUDF = udf(upper)
> val newdf = df.withColumn("test_field", upperUDF(col("std_useragent")))
> newdf.write.parquet("/output.parquet")
> {code}
> !Screen Shot 2017-09-06 at 11.31.13.png|width=800!
> Cpu utilisation is very bad with pyspark
> !cpu.png|width=800!
> Is this a known bug with pyspark and udfs or is it a matter of bad 
> configuration? 
> Looking forward to suggestions. Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22079) Serializer in HiveOutputWriter miss loading job configuration

2017-09-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22079:


Assignee: (was: Apache Spark)

> Serializer in HiveOutputWriter miss loading job configuration
> -
>
> Key: SPARK-22079
> URL: https://issues.apache.org/jira/browse/SPARK-22079
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Lantao Jin
>
> The serializer in HiveOutputWriter misses loading the job configuration. This 
> breaks customized SerDe objects which extend 
> {{org.apache.hadoop.hive.serde2.AbstractSerDe}} and override its 
> initialize(Configuration,...) method.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22079) Serializer in HiveOutputWriter miss loading job configuration

2017-09-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16173057#comment-16173057
 ] 

Apache Spark commented on SPARK-22079:
--

User 'LantaoJin' has created a pull request for this issue:
https://github.com/apache/spark/pull/19293

> Serializer in HiveOutputWriter miss loading job configuration
> -
>
> Key: SPARK-22079
> URL: https://issues.apache.org/jira/browse/SPARK-22079
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Lantao Jin
>
> The serializer in HiveOutputWriter misses loading the job configuration. This 
> breaks customized SerDe objects which extend 
> {{org.apache.hadoop.hive.serde2.AbstractSerDe}} and override its 
> initialize(Configuration,...) method.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22079) Serializer in HiveOutputWriter miss loading job configuration

2017-09-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22079:


Assignee: Apache Spark

> Serializer in HiveOutputWriter miss loading job configuration
> -
>
> Key: SPARK-22079
> URL: https://issues.apache.org/jira/browse/SPARK-22079
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Lantao Jin
>Assignee: Apache Spark
>
> The serializer in HiveOutputWriter misses loading the job configuration. This 
> breaks customized SerDe objects which extend 
> {{org.apache.hadoop.hive.serde2.AbstractSerDe}} and override its 
> initialize(Configuration,...) method.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22049) Confusing behavior of from_utc_timestamp and to_utc_timestamp

2017-09-20 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22049.
--
   Resolution: Fixed
 Assignee: Sean Owen
Fix Version/s: 2.3.0

Fixed in https://github.com/apache/spark/pull/19276

> Confusing behavior of from_utc_timestamp and to_utc_timestamp
> -
>
> Key: SPARK-22049
> URL: https://issues.apache.org/jira/browse/SPARK-22049
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 2.1.0
>Reporter: Felipe Olmos
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.3.0
>
>
> Hello everyone,
> I am confused about the behavior of the functions {{from_utc_timestamp}} and 
> {{to_utc_timestamp}}. As an example, take the following code to a spark shell
> {code:java}
> import java.sql.Timestamp
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> // 2017-07-14 02:40 UTC
> val rdd = sc.parallelize(Row(new Timestamp(1500000000000L)) :: Nil)
> val df = spark.createDataFrame(rdd, StructType(StructField("date", TimestampType) :: Nil))
> df.select(df("date"), from_utc_timestamp(df("date"), "GMT+01:00") as "from_utc", to_utc_timestamp(df("date"), "GMT+01:00") as "to_utc").show(1, false)
> // Date format printing is dependent on the timezone of the machine.
> // The following is in UTC
> // +---------------------+---------------------+---------------------+
> // |date                 |from_utc             |to_utc               |
> // +---------------------+---------------------+---------------------+
> // |2017-07-14 02:40:00.0|2017-07-14 03:40:00.0|2017-07-14 01:40:00.0|
> // +---------------------+---------------------+---------------------+
> df.select(unix_timestamp(df("date")) as "date", unix_timestamp(from_utc_timestamp(df("date"), "GMT+01:00")) as "from_utc", unix_timestamp(to_utc_timestamp(df("date"), "GMT+01:00")) as "to_utc").show(1, false)
> // +----------+----------+----------+
> // |date      |from_utc  |to_utc    |
> // +----------+----------+----------+
> // |1500000000|1500003600|1499996400|
> // +----------+----------+----------+
> {code}
> So, if I interpret this correctly, {{from_utc_timestamp}} took {{02:40 UTC}}, 
> interpreted it as {{03:40 GMT+1}} (the same instant) and transformed it to 
> {{03:40 UTC}}. However, the description of {{from_utc_timestamp}} says
> bq. Given a timestamp, which corresponds to a certain time of day in UTC, 
> returns another timestamp that corresponds to the same time of day in the 
> given timezone. 
> I would therefore have expected the function to take {{02:40 UTC}} and return 
> {{02:40 GMT+1 = 01:40 UTC}}. In fact, I think the descriptions of 
> {{from_utc_timestamp}} and {{to_utc_timestamp}} are inverted.
> Am I interpreting this right?
> Thanks in advance
> Felipe
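
For context (not part of the original report), the observed numbers boil down to a 
fixed offset shift on the underlying epoch value; a minimal sketch using the values 
above, assuming a GMT+01:00 offset of 3600 seconds:

{code}
// The numbers from the report: from_utc_timestamp added the zone offset to the
// internal epoch value, to_utc_timestamp subtracted it.
val epoch  = 1500000000L      // 2017-07-14 02:40:00 UTC
val offset = 3600L            // GMT+01:00

val fromUtc = epoch + offset  // 1500003600 -> renders as 2017-07-14 03:40:00
val toUtc   = epoch - offset  // 1499996400 -> renders as 2017-07-14 01:40:00
{code}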



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22079) Serializer in HiveOutputWriter miss loading job configuration

2017-09-20 Thread Lantao Jin (JIRA)
Lantao Jin created SPARK-22079:
--

 Summary: Serializer in HiveOutputWriter miss loading job 
configuration
 Key: SPARK-22079
 URL: https://issues.apache.org/jira/browse/SPARK-22079
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Lantao Jin


The serializer in HiveOutputWriter misses loading the job configuration. This breaks 
customized SerDe objects which extend 
{{org.apache.hadoop.hive.serde2.AbstractSerDe}} and override its 
initialize(Configuration,...) method.
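
To illustrate the failure mode, a hedged sketch of the kind of custom SerDe that is 
affected; the class name and configuration key are made up for illustration, only the 
{{AbstractSerDe.initialize}} signature comes from Hive:

{code}
import java.util.Properties

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hive.serde2.AbstractSerDe

// Declared abstract so only the relevant method is shown; serialize(),
// deserialize(), getObjectInspector() etc. are omitted from this sketch.
abstract class MyConfiguredSerDe extends AbstractSerDe {

  private var delimiter: String = _

  override def initialize(conf: Configuration, tableProps: Properties): Unit = {
    // Relies on the real job configuration being passed in. If Spark hands the
    // serializer an empty Configuration, this silently falls back to the
    // default and produces wrong output.
    delimiter = conf.get("my.serde.delimiter", ",")
  }
}
{code}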



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-22020) Support session local timezone

2017-09-20 Thread Navya Krishnappa (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Navya Krishnappa reopened SPARK-22020:
--

This is not working as expected. Please refer to the above-mentioned description.

> Support session local timezone
> --
>
> Key: SPARK-22020
> URL: https://issues.apache.org/jira/browse/SPARK-22020
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Navya Krishnappa
>
> As of Spark 2.1, Spark SQL assumes the machine timezone for datetime 
> manipulation, which is bad if users are not in the same timezones as the 
> machines, or if different users have different timezones.
> Input data:
> Date,SparkDate,SparkDate1,SparkDate2
> 04/22/2017T03:30:02,2017-03-21T03:30:02,2017-03-21T03:30:02.02Z,2017-03-21T00:00:00Z
> I have set the value below to force the timeZone to UTC, but Spark still adds 
> the current machine timeZone offset even though the input is already in UTC format.
> spark.conf.set("spark.sql.session.timeZone", "UTC")
> Expected: the time should remain the same as the input, since it is already in 
> UTC format.
> var df1 = spark.read.option("delimiter", ",").option("qualifier", 
> "\"").option("inferSchema","true").option("header", "true").option("mode", 
> "PERMISSIVE").option("timestampFormat","MM/dd/'T'HH:mm:ss.SSS").option("dateFormat",
>  "MM/dd/'T'HH:mm:ss").csv("DateSpark.csv");
> df1: org.apache.spark.sql.DataFrame = [Name: string, Age: int ... 5 more 
> fields]
> scala> df1.show(false);
> +----+---+----+-------------------+-------------------+----------------------+-------------------+
> |Name|Age|Add |Date               |SparkDate          |SparkDate1            |SparkDate2         |
> +----+---+----+-------------------+-------------------+----------------------+-------------------+
> |abc |21 |bvxc|04/22/2017T03:30:02|2017-03-21 03:30:02|2017-03-21 09:00:02.02|2017-03-21 05:30:00|
> +----+---+----+-------------------+-------------------+----------------------+-------------------+
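
For context (not part of the original report), a minimal sketch of how to check 
whether the session time zone is honoured when an instant is rendered; the epoch 
value 1490067002 corresponds to 2017-03-21 03:30:02 UTC, and the expected outputs in 
the comments are an assumption to verify rather than a confirmed result:

{code}
// Same internal instant (2017-03-21 03:30:02 UTC), rendered to a string under
// two different session time zones.
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.sql("SELECT CAST(CAST(1490067002 AS TIMESTAMP) AS STRING) AS ts").show(false)
// expected (if the session time zone is honoured): 2017-03-21 03:30:02

spark.conf.set("spark.sql.session.timeZone", "Asia/Kolkata")   // UTC+05:30
spark.sql("SELECT CAST(CAST(1490067002 AS TIMESTAMP) AS STRING) AS ts").show(false)
// expected (if the session time zone is honoured): 2017-03-21 09:00:02
{code}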



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17159) Improve FileInputDStream.findNewFiles list performance

2017-09-20 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran resolved SPARK-17159.

Resolution: Won't Fix

Based on the feedback on https://github.com/apache/spark/pull/14731, I'm doing 
all my spark+cloud support elsewhere. 

Closing as WONTFIX.

> Improve FileInputDStream.findNewFiles list performance
> --
>
> Key: SPARK-17159
> URL: https://issues.apache.org/jira/browse/SPARK-17159
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.0.0
> Environment: spark against object stores
>Reporter: Steve Loughran
>Priority: Minor
>
> {{FileInputDStream.findNewFiles()}} does a globStatus with a filter that 
> calls getFileStatus() on every file, then takes the output and does listStatus() 
> on it.
> This is going to suffer on object stores, as directory listing and getFileStatus 
> calls are so expensive there. It's clear this is a problem, as the method already 
> has code to detect timeouts in the window and warn about them.
> It should be possible to make this faster.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22061) Add pipeline model of SVM

2017-09-20 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172983#comment-16172983
 ] 

Weichen Xu commented on SPARK-22061:


We already have `LinearSVC`, implemented with LBFGS, which is faster than 
`SVMWithSGD` in mllib.
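
For reference, a minimal sketch (not from this thread) of the existing {{LinearSVC}} 
used as a pipeline stage; the column names and parameter values are illustrative:

{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LinearSVC
import org.apache.spark.ml.feature.VectorAssembler

// Assemble illustrative feature columns f1, f2 and feed them to LinearSVC.
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))
  .setOutputCol("features")

val svc = new LinearSVC()
  .setMaxIter(10)
  .setRegParam(0.1)

val pipeline = new Pipeline().setStages(Array(assembler, svc))
// val model = pipeline.fit(trainingDf)   // trainingDf assumed to have f1, f2 and label columns
{code}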

> Add pipeline model of SVM
> -
>
> Key: SPARK-22061
> URL: https://issues.apache.org/jira/browse/SPARK-22061
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Jiaming Shu
>  Labels: features
> Fix For: 2.3.0
>
>
> add pipeline implementation of SVM



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22075) GBTs forgot to unpersist datasets cached by Checkpointer

2017-09-20 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172859#comment-16172859
 ] 

zhengruifeng edited comment on SPARK-22075 at 9/20/17 9:48 AM:
---

-The same issue seems to exist in {{Pregel}}: each call of {{connectedComponents}} 
will generate two new cached RDDs

{code}
scala> import org.apache.spark.graphx.GraphLoader
import org.apache.spark.graphx.GraphLoader

scala> val graph = GraphLoader.edgeListFile(sc, 
"data/graphx/followers.txt").persist()
graph: org.apache.spark.graphx.Graph[Int,Int] = 
org.apache.spark.graphx.impl.GraphImpl@3c20abd6

scala> sc.getPersistentRDDs
res0: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map(8 -> 
VertexRDD, VertexRDD MapPartitionsRDD[8] at mapPartitions at 
VertexRDD.scala:345, 2 -> GraphLoader.edgeListFile - edges 
(data/graphx/followers.txt), EdgeRDD, EdgeRDD MapPartitionsRDD[2] at 
mapPartitionsWithIndex at GraphLoader.scala:75)

scala> val cc = graph.connectedComponents()
17/09/20 15:33:39 WARN BlockManager: Block rdd_11_0 already exists on this 
machine; not re-adding it
cc: org.apache.spark.graphx.Graph[org.apache.spark.graphx.VertexId,Int] = 
org.apache.spark.graphx.impl.GraphImpl@2cc3b0a7

scala> sc.getPersistentRDDs
res1: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map(8 -> 
VertexRDD, VertexRDD MapPartitionsRDD[8] at mapPartitions at 
VertexRDD.scala:345, 2 -> GraphLoader.edgeListFile - edges 
(data/graphx/followers.txt), EdgeRDD, EdgeRDD MapPartitionsRDD[2] at 
mapPartitionsWithIndex at GraphLoader.scala:75, 38 -> EdgeRDD 
ZippedPartitionsRDD2[38] at zipPartitions at ReplicatedVertexView.scala:112, 32 
-> VertexRDD ZippedPartitionsRDD2[32] at zipPartitions at 
VertexRDDImpl.scala:156)

scala> val cc = graph.connectedComponents()
cc: org.apache.spark.graphx.Graph[org.apache.spark.graphx.VertexId,Int] = 
org.apache.spark.graphx.impl.GraphImpl@69abff1d

scala> sc.getPersistentRDDs
res2: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map(38 -> EdgeRDD 
ZippedPartitionsRDD2[38] at zipPartitions at ReplicatedVertexView.scala:112, 70 
-> VertexRDD ZippedPartitionsRDD2[70] at zipPartitions at 
VertexRDDImpl.scala:156, 2 -> GraphLoader.edgeListFile - edges 
(data/graphx/followers.txt), EdgeRDD, EdgeRDD MapPartitionsRDD[2] at 
mapPartitionsWithIndex at GraphLoader.scala:75, 32 -> VertexRDD 
ZippedPartitionsRDD2[32] at zipPartitions at VertexRDDImpl.scala:156, 76 -> 
EdgeRDD ZippedPartitionsRDD2[76] at zipPartitions at 
ReplicatedVertexView.scala:112, 8 -> VertexRDD, VertexRDD MapPartitionsRDD[8] 
at mapPartitions at VertexRDD.scala:345)

{code}-

Pregel is OK; the intermediate RDDs are unpersisted outside of the checkpointer.
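
As a side note (not from the original comment), a hedged workaround sketch for the 
GBT case until the leak is fixed: snapshot the cached RDD IDs before training and 
unpersist whatever the checkpointer left behind afterwards. {{gbt}}, {{dataset}} and 
{{sc}} are assumed to be set up as in the reproduction in the description.

{code}
// Cached RDD IDs present before training.
val before = sc.getPersistentRDDs.keySet

val model = gbt.fit(dataset)

// Unpersist only the RDDs that appeared during fit() and are still cached.
sc.getPersistentRDDs
  .filterKeys(id => !before.contains(id))
  .values
  .foreach(_.unpersist(blocking = false))
{code}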


was (Author: podongfeng):
The same issue seems to exist in {{Pregel}}: each call of {{connectedComponents}} will 
generate two new cached RDDs

{code}
scala> import org.apache.spark.graphx.GraphLoader
import org.apache.spark.graphx.GraphLoader

scala> val graph = GraphLoader.edgeListFile(sc, 
"data/graphx/followers.txt").persist()
graph: org.apache.spark.graphx.Graph[Int,Int] = 
org.apache.spark.graphx.impl.GraphImpl@3c20abd6

scala> sc.getPersistentRDDs
res0: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map(8 -> 
VertexRDD, VertexRDD MapPartitionsRDD[8] at mapPartitions at 
VertexRDD.scala:345, 2 -> GraphLoader.edgeListFile - edges 
(data/graphx/followers.txt), EdgeRDD, EdgeRDD MapPartitionsRDD[2] at 
mapPartitionsWithIndex at GraphLoader.scala:75)

scala> val cc = graph.connectedComponents()
17/09/20 15:33:39 WARN BlockManager: Block rdd_11_0 already exists on this 
machine; not re-adding it
cc: org.apache.spark.graphx.Graph[org.apache.spark.graphx.VertexId,Int] = 
org.apache.spark.graphx.impl.GraphImpl@2cc3b0a7

scala> sc.getPersistentRDDs
res1: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map(8 -> 
VertexRDD, VertexRDD MapPartitionsRDD[8] at mapPartitions at 
VertexRDD.scala:345, 2 -> GraphLoader.edgeListFile - edges 
(data/graphx/followers.txt), EdgeRDD, EdgeRDD MapPartitionsRDD[2] at 
mapPartitionsWithIndex at GraphLoader.scala:75, 38 -> EdgeRDD 
ZippedPartitionsRDD2[38] at zipPartitions at ReplicatedVertexView.scala:112, 32 
-> VertexRDD ZippedPartitionsRDD2[32] at zipPartitions at 
VertexRDDImpl.scala:156)

scala> val cc = graph.connectedComponents()
cc: org.apache.spark.graphx.Graph[org.apache.spark.graphx.VertexId,Int] = 
org.apache.spark.graphx.impl.GraphImpl@69abff1d

scala> sc.getPersistentRDDs
res2: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map(38 -> EdgeRDD 
ZippedPartitionsRDD2[38] at zipPartitions at ReplicatedVertexView.scala:112, 70 
-> VertexRDD ZippedPartitionsRDD2[70] at zipPartitions at 
VertexRDDImpl.scala:156, 2 -> GraphLoader.edgeListFile - edges 
(data/graphx/followers.txt), EdgeRDD, EdgeRDD MapPartitionsRDD[2] at 
mapPartitionsWithIndex at GraphLoader.scala:75, 32 -> VertexRDD 
ZippedPartitionsRDD2[32] at zipPartitions

[jira] [Updated] (SPARK-22075) GBTs forgot to unpersist datasets cached by Checkpointer

2017-09-20 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-22075:
-
Summary: GBTs forgot to unpersist datasets cached by Checkpointer  (was: 
GBTs/Pregel forgot to unpersist datasets cached by Checkpointer)

> GBTs forgot to unpersist datasets cached by Checkpointer
> 
>
> Key: SPARK-22075
> URL: https://issues.apache.org/jira/browse/SPARK-22075
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>
> {{PeriodicRDDCheckpointer}} automatically persists the last 3 datasets 
> passed to {{PeriodicRDDCheckpointer.update}}.
> In GBTs, these last 3 intermediate RDDs are still cached after {{fit()}}
> {code}
> scala> val dataset = 
> spark.read.format("libsvm").load("./data/mllib/sample_kmeans_data.txt")
> dataset: org.apache.spark.sql.DataFrame = [label: double, features: vector]   
>   
> scala> dataset.persist()
> res0: dataset.type = [label: double, features: vector]
> scala> dataset.count
> res1: Long = 6
> scala> sc.getPersistentRDDs
> res2: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] =
> Map(8 -> *FileScan libsvm [label#0,features#1] Batched: false, Format: 
> LibSVM, Location: 
> InMemoryFileIndex[file:/Users/zrf/.dev/spark-2.2.0-bin-hadoop2.7/data/mllib/sample_kmeans_data.txt],
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct,values:array>>
>  MapPartitionsRDD[8] at persist at :26)
> scala> import org.apache.spark.ml.regression._
> import org.apache.spark.ml.regression._
> scala> val model = gbt.fit(dataset)
> :28: error: not found: value gbt
>val model = gbt.fit(dataset)
>^
> scala> val gbt = new GBTRegressor()
> gbt: org.apache.spark.ml.regression.GBTRegressor = gbtr_da1fe371a25e
> scala> val model = gbt.fit(dataset)
> 17/09/20 14:05:33 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:35 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:35 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:35 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:35 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:35 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:36 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:36 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:36 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:36 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:36 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:36 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:37 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:37 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:37 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:37 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:37 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:37 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:38 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:38 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> model: org.apache.spark.ml.regression.GBTRegressionModel = GBTRegressionModel 
> (uid=gbtr_da1fe371a25e) with 20 trees
> scala> sc.getPersistentRDDs
> res3: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] =
> Map(322 -> MapPartitionsRDD[322] at mapPartitions at 
> GradientBoostedTrees.scala:134, 307 -> MapPartitionsRDD[307] at mapPartitions 
> at GradientBoostedTrees.scala:134, 8 -> *FileScan libsvm [label#0,features#1] 
> Batched: false, Format:

[jira] [Updated] (SPARK-22075) GBTs forgot to unpersist datasets cached by Checkpointer

2017-09-20 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-22075:
-
Component/s: (was: GraphX)

> GBTs forgot to unpersist datasets cached by Checkpointer
> 
>
> Key: SPARK-22075
> URL: https://issues.apache.org/jira/browse/SPARK-22075
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>
> {{PeriodicRDDCheckpointer}} automatically persists the last 3 datasets 
> passed to {{PeriodicRDDCheckpointer.update}}.
> In GBTs, these last 3 intermediate RDDs are still cached after {{fit()}}
> {code}
> scala> val dataset = 
> spark.read.format("libsvm").load("./data/mllib/sample_kmeans_data.txt")
> dataset: org.apache.spark.sql.DataFrame = [label: double, features: vector]   
>   
> scala> dataset.persist()
> res0: dataset.type = [label: double, features: vector]
> scala> dataset.count
> res1: Long = 6
> scala> sc.getPersistentRDDs
> res2: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] =
> Map(8 -> *FileScan libsvm [label#0,features#1] Batched: false, Format: 
> LibSVM, Location: 
> InMemoryFileIndex[file:/Users/zrf/.dev/spark-2.2.0-bin-hadoop2.7/data/mllib/sample_kmeans_data.txt],
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct,values:array>>
>  MapPartitionsRDD[8] at persist at :26)
> scala> import org.apache.spark.ml.regression._
> import org.apache.spark.ml.regression._
> scala> val model = gbt.fit(dataset)
> :28: error: not found: value gbt
>val model = gbt.fit(dataset)
>^
> scala> val gbt = new GBTRegressor()
> gbt: org.apache.spark.ml.regression.GBTRegressor = gbtr_da1fe371a25e
> scala> val model = gbt.fit(dataset)
> 17/09/20 14:05:33 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:35 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:35 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:35 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:35 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:35 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:36 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:36 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:36 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:36 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:36 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:36 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:37 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:37 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:37 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:37 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:37 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:37 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:38 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:38 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> model: org.apache.spark.ml.regression.GBTRegressionModel = GBTRegressionModel 
> (uid=gbtr_da1fe371a25e) with 20 trees
> scala> sc.getPersistentRDDs
> res3: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] =
> Map(322 -> MapPartitionsRDD[322] at mapPartitions at 
> GradientBoostedTrees.scala:134, 307 -> MapPartitionsRDD[307] at mapPartitions 
> at GradientBoostedTrees.scala:134, 8 -> *FileScan libsvm [label#0,features#1] 
> Batched: false, Format: LibSVM, Location: 
> InMemoryFileIndex[file:/Users/zrf/.dev/spark-2.2.0-bin-hadoop2.7/data/mllib/sample_kme

[jira] [Commented] (SPARK-22066) Update checkstyle to 8.2, enable it, fix violations

2017-09-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172951#comment-16172951
 ] 

Apache Spark commented on SPARK-22066:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/19292

> Update checkstyle to 8.2, enable it, fix violations
> ---
>
> Key: SPARK-22066
> URL: https://issues.apache.org/jira/browse/SPARK-22066
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.3.0
>
>
> While working on Scala 2.12 changes, I noted we could update various build 
> plugins to their latest version, such as the scala-maven-plugin.
> While doing that I noticed that the checkstyle plugin config had some bogus 
> config, and that it was actually disabled and could be enabled now (build 
> prints ERRORs but doesn't fail). This noise was making it a little hard to 
> figure out what errors are real and new in the Scala 2.12 changes I'm working 
> on, so it should be fixed and stay fixed by being treated as errors.
> And then I noted we could update checkstyle itself to 8.2, and fix existing 
> and additional style errors.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22063) Upgrade lintr to latest commit sha1 ID

2017-09-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22063:


Assignee: (was: Apache Spark)

> Upgrade lintr to latest commit sha1 ID
> --
>
> Key: SPARK-22063
> URL: https://issues.apache.org/jira/browse/SPARK-22063
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, we set lintr to {{jimhester/lintr@a769c0b}} (see [this 
> pr|https://github.com/apache/spark/commit/7d1175011c976756efcd4e4e4f70a8fd6f287026])
>  and SPARK-14074.
> Today, I tried to upgrade to the latest commit, 
> https://github.com/jimhester/lintr/commit/5431140ffea65071f1327625d4a8de9688fa7e72
> This fixes many bugs and now finds many instances that I have observed from 
> time to time and thought should be caught:
> {code}
> inst/worker/worker.R:71:10: style: Remove spaces before the left parenthesis 
> in a function call.
>   return (output)
>  ^
> R/column.R:241:1: style: Lines should not be more than 100 characters.
> #'
> \href{https://spark.apache.org/docs/latest/sparkr.html#data-type-mapping-between-r-and-spark}{
> ^~~~
> R/context.R:332:1: style: Variable and function names should not be longer 
> than 30 characters.
> spark.getSparkFilesRootDirectory <- function() {
> ^~~~
> R/DataFrame.R:1912:1: style: Lines should not be more than 100 characters.
> #' @param j,select expression for the single Column or a list of columns to 
> select from the SparkDataFrame.
> ^~~
> R/DataFrame.R:1918:1: style: Lines should not be more than 100 characters.
> #' @return A new SparkDataFrame containing only the rows that meet the 
> condition with selected columns.
> ^~~
> R/DataFrame.R:2597:22: style: Remove spaces before the left parenthesis in a 
> function call.
>   return (joinRes)
>  ^
> R/DataFrame.R:2652:1: style: Variable and function names should not be longer 
> than 30 characters.
> generateAliasesForIntersectedCols <- function (x, intersectedColNames, 
> suffix) {
> ^
> R/DataFrame.R:2652:47: style: Remove spaces before the left parenthesis in a 
> function call.
> generateAliasesForIntersectedCols <- function (x, intersectedColNames, 
> suffix) {
>   ^
> R/DataFrame.R:2660:14: style: Remove spaces before the left parenthesis in a 
> function call.
> stop ("The following column name: ", newJoin, " occurs more than once 
> in the 'DataFrame'.",
>  ^
> R/DataFrame.R:3047:1: style: Lines should not be more than 100 characters.
> #' @note The statistics provided by \code{summary} were change in 2.3.0 use 
> \link{describe} for previous defaults.
> ^~
> R/DataFrame.R:3754:1: style: Lines should not be more than 100 characters.
> #' If grouping expression is missing \code{cube} creates a single global 
> aggregate and is equivalent to
> ^~~
> R/DataFrame.R:3789:1: style: Lines should not be more than 100 characters.
> #' If grouping expression is missing \code{rollup} creates a single global 
> aggregate and is equivalent to
> ^
> R/deserialize.R:46:10: style: Remove spaces before the left parenthesis in a 
> function call.
>   switch (type,
>  ^
> R/functions.R:41:1: style: Lines should not be more than 100 characters.
> #' @param x Column to compute on. In \code{window}, it must be a time Column 
> of \code{TimestampType}.
> ^
> R/functions.R:93:1: style: Lines should not be more than 100 characters.
> #' @param x Column to compute on. In \code{shiftLeft}, \code{shiftRight} and 
> \code{shiftRightUnsigned},
> ^~~
> R/functions.R:483:52: style: Remove spaces before the left parenthesis in a 
> function call.
> jcols <- lapply(list(x, ...), function (x) {
>^
> R/functions.R:679:52: style: Remove spaces before the left parenthesis in a 
> function cal

[jira] [Assigned] (SPARK-22063) Upgrade lintr to latest commit sha1 ID

2017-09-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22063:


Assignee: Apache Spark

> Upgrade lintr to latest commit sha1 ID
> --
>
> Key: SPARK-22063
> URL: https://issues.apache.org/jira/browse/SPARK-22063
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, we set lintr to {{jimhester/lintr@a769c0b}} (see [this 
> pr|https://github.com/apache/spark/commit/7d1175011c976756efcd4e4e4f70a8fd6f287026])
>  and SPARK-14074.
> Today, I tried to upgrade to the latest commit, 
> https://github.com/jimhester/lintr/commit/5431140ffea65071f1327625d4a8de9688fa7e72
> This fixes many bugs and now finds many instances that I have observed from 
> time to time and thought should be caught:
> {code}
> inst/worker/worker.R:71:10: style: Remove spaces before the left parenthesis 
> in a function call.
>   return (output)
>  ^
> R/column.R:241:1: style: Lines should not be more than 100 characters.
> #'
> \href{https://spark.apache.org/docs/latest/sparkr.html#data-type-mapping-between-r-and-spark}{
> ^~~~
> R/context.R:332:1: style: Variable and function names should not be longer 
> than 30 characters.
> spark.getSparkFilesRootDirectory <- function() {
> ^~~~
> R/DataFrame.R:1912:1: style: Lines should not be more than 100 characters.
> #' @param j,select expression for the single Column or a list of columns to 
> select from the SparkDataFrame.
> ^~~
> R/DataFrame.R:1918:1: style: Lines should not be more than 100 characters.
> #' @return A new SparkDataFrame containing only the rows that meet the 
> condition with selected columns.
> ^~~
> R/DataFrame.R:2597:22: style: Remove spaces before the left parenthesis in a 
> function call.
>   return (joinRes)
>  ^
> R/DataFrame.R:2652:1: style: Variable and function names should not be longer 
> than 30 characters.
> generateAliasesForIntersectedCols <- function (x, intersectedColNames, 
> suffix) {
> ^
> R/DataFrame.R:2652:47: style: Remove spaces before the left parenthesis in a 
> function call.
> generateAliasesForIntersectedCols <- function (x, intersectedColNames, 
> suffix) {
>   ^
> R/DataFrame.R:2660:14: style: Remove spaces before the left parenthesis in a 
> function call.
> stop ("The following column name: ", newJoin, " occurs more than once 
> in the 'DataFrame'.",
>  ^
> R/DataFrame.R:3047:1: style: Lines should not be more than 100 characters.
> #' @note The statistics provided by \code{summary} were change in 2.3.0 use 
> \link{describe} for previous defaults.
> ^~
> R/DataFrame.R:3754:1: style: Lines should not be more than 100 characters.
> #' If grouping expression is missing \code{cube} creates a single global 
> aggregate and is equivalent to
> ^~~
> R/DataFrame.R:3789:1: style: Lines should not be more than 100 characters.
> #' If grouping expression is missing \code{rollup} creates a single global 
> aggregate and is equivalent to
> ^
> R/deserialize.R:46:10: style: Remove spaces before the left parenthesis in a 
> function call.
>   switch (type,
>  ^
> R/functions.R:41:1: style: Lines should not be more than 100 characters.
> #' @param x Column to compute on. In \code{window}, it must be a time Column 
> of \code{TimestampType}.
> ^
> R/functions.R:93:1: style: Lines should not be more than 100 characters.
> #' @param x Column to compute on. In \code{shiftLeft}, \code{shiftRight} and 
> \code{shiftRightUnsigned},
> ^~~
> R/functions.R:483:52: style: Remove spaces before the left parenthesis in a 
> function call.
> jcols <- lapply(list(x, ...), function (x) {
>^
> R/functions.R:679:52: style: Remove spaces before the left parenth

[jira] [Commented] (SPARK-22063) Upgrade lintr to latest commit sha1 ID

2017-09-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172950#comment-16172950
 ] 

Apache Spark commented on SPARK-22063:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/19290

> Upgrade lintr to latest commit sha1 ID
> --
>
> Key: SPARK-22063
> URL: https://issues.apache.org/jira/browse/SPARK-22063
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, we set lintr to {{jimhester/lintr@a769c0b}} (see [this 
> pr|https://github.com/apache/spark/commit/7d1175011c976756efcd4e4e4f70a8fd6f287026])
>  and SPARK-14074.
> Today, I tried to upgrade to the latest commit, 
> https://github.com/jimhester/lintr/commit/5431140ffea65071f1327625d4a8de9688fa7e72
> This fixes many bugs and now finds many instances that I have observed from 
> time to time and thought should be caught:
> {code}
> inst/worker/worker.R:71:10: style: Remove spaces before the left parenthesis 
> in a function call.
>   return (output)
>  ^
> R/column.R:241:1: style: Lines should not be more than 100 characters.
> #'
> \href{https://spark.apache.org/docs/latest/sparkr.html#data-type-mapping-between-r-and-spark}{
> ^~~~
> R/context.R:332:1: style: Variable and function names should not be longer 
> than 30 characters.
> spark.getSparkFilesRootDirectory <- function() {
> ^~~~
> R/DataFrame.R:1912:1: style: Lines should not be more than 100 characters.
> #' @param j,select expression for the single Column or a list of columns to 
> select from the SparkDataFrame.
> ^~~
> R/DataFrame.R:1918:1: style: Lines should not be more than 100 characters.
> #' @return A new SparkDataFrame containing only the rows that meet the 
> condition with selected columns.
> ^~~
> R/DataFrame.R:2597:22: style: Remove spaces before the left parenthesis in a 
> function call.
>   return (joinRes)
>  ^
> R/DataFrame.R:2652:1: style: Variable and function names should not be longer 
> than 30 characters.
> generateAliasesForIntersectedCols <- function (x, intersectedColNames, 
> suffix) {
> ^
> R/DataFrame.R:2652:47: style: Remove spaces before the left parenthesis in a 
> function call.
> generateAliasesForIntersectedCols <- function (x, intersectedColNames, 
> suffix) {
>   ^
> R/DataFrame.R:2660:14: style: Remove spaces before the left parenthesis in a 
> function call.
> stop ("The following column name: ", newJoin, " occurs more than once 
> in the 'DataFrame'.",
>  ^
> R/DataFrame.R:3047:1: style: Lines should not be more than 100 characters.
> #' @note The statistics provided by \code{summary} were change in 2.3.0 use 
> \link{describe} for previous defaults.
> ^~
> R/DataFrame.R:3754:1: style: Lines should not be more than 100 characters.
> #' If grouping expression is missing \code{cube} creates a single global 
> aggregate and is equivalent to
> ^~~
> R/DataFrame.R:3789:1: style: Lines should not be more than 100 characters.
> #' If grouping expression is missing \code{rollup} creates a single global 
> aggregate and is equivalent to
> ^
> R/deserialize.R:46:10: style: Remove spaces before the left parenthesis in a 
> function call.
>   switch (type,
>  ^
> R/functions.R:41:1: style: Lines should not be more than 100 characters.
> #' @param x Column to compute on. In \code{window}, it must be a time Column 
> of \code{TimestampType}.
> ^
> R/functions.R:93:1: style: Lines should not be more than 100 characters.
> #' @param x Column to compute on. In \code{shiftLeft}, \code{shiftRight} and 
> \code{shiftRightUnsigned},
> ^~~
> R/functions.R:483:52: style: Remove spaces before the left parenthesis in a 
> function call.
> jcols <- lapply(list(x, ...), function (x) {
> 

[jira] [Resolved] (SPARK-22066) Update checkstyle to 8.2, enable it, fix violations

2017-09-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22066.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19282
[https://github.com/apache/spark/pull/19282]

> Update checkstyle to 8.2, enable it, fix violations
> ---
>
> Key: SPARK-22066
> URL: https://issues.apache.org/jira/browse/SPARK-22066
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.3.0
>
>
> While working on Scala 2.12 changes, I noted we could update various build 
> plugins to their latest version, such as the scala-maven-plugin.
> While doing that I noticed that the checkstyle plugin config had some bogus 
> config, and that it was actually disabled and could be enabled now (build 
> prints ERRORs but doesn't fail). This noise was making it a little hard to 
> figure out what errors are real and new in the Scala 2.12 changes I'm working 
> on, so it should be fixed and stay fixed by being treated as errors.
> And then I noted we could update checkstyle itself to 8.2, and fix existing 
> and additional style errors.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22028) spark-submit trips over environment variables

2017-09-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22028.
---
Resolution: Not A Problem

> spark-submit trips over environment variables
> -
>
> Key: SPARK-22028
> URL: https://issues.apache.org/jira/browse/SPARK-22028
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.1.1
> Environment: Operating System: Windows 10
> Shell: CMD or bash.exe, both with the same result
>Reporter: Franz Wimmer
>  Labels: windows
>
> I have a strange environment variable in my Windows operating system:
> {code:none}
> C:\Path>set ""
> =::=::\
> {code}
> According to [this issue at 
> stackexchange|https://unix.stackexchange.com/a/251215/251326], this is some 
> sort of old MS-DOS relic that interacts with cygwin shells.
> Leaving that aside for a moment, Spark tries to read environment variables on 
> submit and trips over it: 
> {code:none}
> ./spark-submit.cmd
> Running Spark using the REST application submission protocol.
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 17/09/15 15:57:51 INFO RestSubmissionClient: Submitting a request to launch 
> an application in spark://:31824.
> 17/09/15 15:58:01 WARN RestSubmissionClient: Unable to connect to server 
> spark://***:31824.
> Warning: Master endpoint spark://:31824 was not a REST server. 
> Falling back to legacy submission gateway instead.
> 17/09/15 15:58:02 ERROR Shell: Failed to locate the winutils binary in the 
> hadoop binary path
> [ ... ]
> 17/09/15 15:58:02 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 17/09/15 15:58:08 ERROR ClientEndpoint: Exception from cluster was: 
> java.lang.IllegalArgumentException: Invalid environment variable name: "=::"
> java.lang.IllegalArgumentException: Invalid environment variable name: "=::"
> at 
> java.lang.ProcessEnvironment.validateVariable(ProcessEnvironment.java:114)
> at java.lang.ProcessEnvironment.access$200(ProcessEnvironment.java:61)
> at 
> java.lang.ProcessEnvironment$Variable.valueOf(ProcessEnvironment.java:170)
> at 
> java.lang.ProcessEnvironment$StringEnvironment.put(ProcessEnvironment.java:242)
> at 
> java.lang.ProcessEnvironment$StringEnvironment.put(ProcessEnvironment.java:221)
> at 
> org.apache.spark.deploy.worker.CommandUtils$$anonfun$buildProcessBuilder$2.apply(CommandUtils.scala:55)
> at 
> org.apache.spark.deploy.worker.CommandUtils$$anonfun$buildProcessBuilder$2.apply(CommandUtils.scala:54)
> at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
> at 
> scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221)
> at 
> scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
> at 
> scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
> at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
> at 
> org.apache.spark.deploy.worker.CommandUtils$.buildProcessBuilder(CommandUtils.scala:54)
> at 
> org.apache.spark.deploy.worker.DriverRunner.prepareAndRunDriver(DriverRunner.scala:181)
> at 
> org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:91)
> {code}
> Please note that _spark-submit.cmd_ is in this case my own script calling the 
> _spark-submit.cmd_ from the spark distribution.
> I think that shouldn't happen. Spark should handle such a malformed 
> environment variable gracefully.
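
As an aside (not from the report), a hedged sketch of the kind of defensive handling 
being asked for: drop environment entries whose names the JVM's ProcessEnvironment 
would reject instead of letting the driver launch fail. The helper below is 
hypothetical and is not Spark's actual CommandUtils code.

{code}
// Copy only environment entries with names that ProcessBuilder will accept:
// non-empty and not containing '=', which rules out Windows' "=::" variable.
def putEnvSafely(pb: ProcessBuilder, env: Map[String, String]): Unit = {
  val target = pb.environment()
  env.foreach { case (name, value) =>
    if (name.nonEmpty && !name.contains("=")) {
      target.put(name, value)
    }
    // else: skip names that ProcessEnvironment would reject with
    // IllegalArgumentException, e.g. "=::".
  }
}
{code}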



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22077) RpcEndpointAddress fails to parse spark URL if it is an ipv6 address.

2017-09-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172889#comment-16172889
 ] 

Sean Owen commented on SPARK-22077:
---

I'm not sure whether it's that IPv6 doesn't really work here in general, or 
just the parsing is overly aggressive. You can see the code that parses the 
URI, which parses successfully, but then it's not happy that it lacks a host or 
port or name or something.

> RpcEndpointAddress fails to parse spark URL if it is an ipv6 address.
> -
>
> Key: SPARK-22077
> URL: https://issues.apache.org/jira/browse/SPARK-22077
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.0.0
>Reporter: Eric Vandenberg
>Priority: Minor
>
> RpcEndpointAddress fails to parse spark URL if it is an ipv6 address.
> For example, 
> sparkUrl = "spark://HeartbeatReceiver@2401:db00:2111:40a1:face:0:21:0:35243"
> is parsed as:
> host = null
> port = -1
> name = null
> While sparkUrl = spark://HeartbeatReceiver@localhost:55691 is parsed properly.
> This is happening on our production machines and causing spark to not start 
> up.
> org.apache.spark.SparkException: Invalid Spark URL: 
> spark://HeartbeatReceiver@2401:db00:2111:40a1:face:0:21:0:35243
>   at 
> org.apache.spark.rpc.RpcEndpointAddress$.apply(RpcEndpointAddress.scala:65)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.asyncSetupEndpointRefByURI(NettyRpcEnv.scala:133)
>   at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88)
>   at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96)
>   at org.apache.spark.util.RpcUtils$.makeDriverRef(RpcUtils.scala:32)
>   at org.apache.spark.executor.Executor.<init>(Executor.scala:121)
>   at 
> org.apache.spark.scheduler.local.LocalEndpoint.<init>(LocalSchedulerBackend.scala:59)
>   at 
> org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:126)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:173)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:507)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2283)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:833)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:825)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:825)
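
For reference (not from the report), a minimal sketch of the parsing behaviour 
underneath, assuming {{RpcEndpointAddress}} delegates to {{java.net.URI}} as the 
error path suggests: an unbracketed IPv6 literal leaves host and port unset, while 
the bracketed RFC 3986 form parses as expected.

{code}
import java.net.URI

// Unbracketed IPv6 literal: java.net.URI cannot split user info, host and port.
val raw = new URI("spark://HeartbeatReceiver@2401:db00:2111:40a1:face:0:21:0:35243")
println(s"host=${raw.getHost} port=${raw.getPort}")              // host=null port=-1

// Bracketed form: host and port come back as expected.
val bracketed = new URI("spark://HeartbeatReceiver@[2401:db00:2111:40a1:face:0:21:0]:35243")
println(s"host=${bracketed.getHost} port=${bracketed.getPort}")  // host=[2401:...:21:0] port=35243
{code}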



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22075) GBTs/Pregel forgot to unpersist datasets cached by Checkpointer

2017-09-20 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-22075:
-
Summary: GBTs/Pregel forgot to unpersist datasets cached by Checkpointer  
(was: GBTs forgot to unpersist datasets cached by PeriodicRDDCheckpointer)

> GBTs/Pregel forgot to unpersist datasets cached by Checkpointer
> ---
>
> Key: SPARK-22075
> URL: https://issues.apache.org/jira/browse/SPARK-22075
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX, ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>
> {{PeriodicRDDCheckpointer}} automatically persists the last 3 datasets 
> passed to {{PeriodicRDDCheckpointer.update}}.
> In GBTs, these last 3 intermediate RDDs are still cached after {{fit()}}
> {code}
> scala> val dataset = 
> spark.read.format("libsvm").load("./data/mllib/sample_kmeans_data.txt")
> dataset: org.apache.spark.sql.DataFrame = [label: double, features: vector]   
>   
> scala> dataset.persist()
> res0: dataset.type = [label: double, features: vector]
> scala> dataset.count
> res1: Long = 6
> scala> sc.getPersistentRDDs
> res2: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] =
> Map(8 -> *FileScan libsvm [label#0,features#1] Batched: false, Format: 
> LibSVM, Location: 
> InMemoryFileIndex[file:/Users/zrf/.dev/spark-2.2.0-bin-hadoop2.7/data/mllib/sample_kmeans_data.txt],
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct,values:array>>
>  MapPartitionsRDD[8] at persist at :26)
> scala> import org.apache.spark.ml.regression._
> import org.apache.spark.ml.regression._
> scala> val model = gbt.fit(dataset)
> :28: error: not found: value gbt
>val model = gbt.fit(dataset)
>^
> scala> val gbt = new GBTRegressor()
> gbt: org.apache.spark.ml.regression.GBTRegressor = gbtr_da1fe371a25e
> scala> val model = gbt.fit(dataset)
> 17/09/20 14:05:33 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:35 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:35 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:35 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:35 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:35 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:36 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:36 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:36 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:36 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:36 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:36 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:37 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:37 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:37 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:37 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:37 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:37 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:38 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:38 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> model: org.apache.spark.ml.regression.GBTRegressionModel = GBTRegressionModel 
> (uid=gbtr_da1fe371a25e) with 20 trees
> scala> sc.getPersistentRDDs
> res3: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] =
> Map(322 -> MapPartitionsRDD[322] at mapPartitions at 
> GradientBoostedTrees.scala:134, 307 -> MapPartitionsRDD[307] at mapPartitions 
> at GradientBoostedTrees.scala:134, 8 -> *FileScan libsvm [label#0,featu

[jira] [Updated] (SPARK-22075) GBTs forgot to unpersist datasets cached by PeriodicRDDCheckpointer

2017-09-20 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-22075:
-
Component/s: GraphX

> GBTs forgot to unpersist datasets cached by PeriodicRDDCheckpointer
> ---
>
> Key: SPARK-22075
> URL: https://issues.apache.org/jira/browse/SPARK-22075
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX, ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>
> {{PeriodicRDDCheckpointer}} automatically persists the last 3 datasets 
> passed to {{PeriodicRDDCheckpointer.update}}.
> In GBTs, these last 3 intermediate RDDs are still cached after {{fit()}}
> {code}
> scala> val dataset = 
> spark.read.format("libsvm").load("./data/mllib/sample_kmeans_data.txt")
> dataset: org.apache.spark.sql.DataFrame = [label: double, features: vector]   
>   
> scala> dataset.persist()
> res0: dataset.type = [label: double, features: vector]
> scala> dataset.count
> res1: Long = 6
> scala> sc.getPersistentRDDs
> res2: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] =
> Map(8 -> *FileScan libsvm [label#0,features#1] Batched: false, Format: 
> LibSVM, Location: 
> InMemoryFileIndex[file:/Users/zrf/.dev/spark-2.2.0-bin-hadoop2.7/data/mllib/sample_kmeans_data.txt],
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct,values:array>>
>  MapPartitionsRDD[8] at persist at :26)
> scala> import org.apache.spark.ml.regression._
> import org.apache.spark.ml.regression._
> scala> val model = gbt.fit(dataset)
> :28: error: not found: value gbt
>val model = gbt.fit(dataset)
>^
> scala> val gbt = new GBTRegressor()
> gbt: org.apache.spark.ml.regression.GBTRegressor = gbtr_da1fe371a25e
> scala> val model = gbt.fit(dataset)
> 17/09/20 14:05:33 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:35 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:35 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:35 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:35 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:35 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:36 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:36 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:36 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:36 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:36 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:36 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:37 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:37 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:37 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:37 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:37 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:37 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:38 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:38 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> model: org.apache.spark.ml.regression.GBTRegressionModel = GBTRegressionModel 
> (uid=gbtr_da1fe371a25e) with 20 trees
> scala> sc.getPersistentRDDs
> res3: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] =
> Map(322 -> MapPartitionsRDD[322] at mapPartitions at 
> GradientBoostedTrees.scala:134, 307 -> MapPartitionsRDD[307] at mapPartitions 
> at GradientBoostedTrees.scala:134, 8 -> *FileScan libsvm [label#0,features#1] 
> Batched: false, Format: LibSVM, Location: 
> InMemoryFileIndex[file:/Users/zrf/.dev/spark-2.2.0-bin-hadoop2.7/da

[jira] [Resolved] (SPARK-22025) Speeding up fromInternal for StructField

2017-09-20 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22025.
--
Resolution: Won't Fix

> Speeding up fromInternal for StructField
> 
>
> Key: SPARK-22025
> URL: https://issues.apache.org/jira/browse/SPARK-22025
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Maciej Bryński
>
> The current code for StructField is a simple function call.
> https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L434
> We can change it to a function reference.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22075) GBTs forgot to unpersist datasets cached by PeriodicRDDCheckpointer

2017-09-20 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172859#comment-16172859
 ] 

zhengruifeng commented on SPARK-22075:
--

The same issue seems to exist in {{Pregel}}: each call of {{connectedComponents}} 
generates two new cached RDDs.

{code}
scala> import org.apache.spark.graphx.GraphLoader
import org.apache.spark.graphx.GraphLoader

scala> val graph = GraphLoader.edgeListFile(sc, 
"data/graphx/followers.txt").persist()
graph: org.apache.spark.graphx.Graph[Int,Int] = 
org.apache.spark.graphx.impl.GraphImpl@3c20abd6

scala> sc.getPersistentRDDs
res0: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map(8 -> 
VertexRDD, VertexRDD MapPartitionsRDD[8] at mapPartitions at 
VertexRDD.scala:345, 2 -> GraphLoader.edgeListFile - edges 
(data/graphx/followers.txt), EdgeRDD, EdgeRDD MapPartitionsRDD[2] at 
mapPartitionsWithIndex at GraphLoader.scala:75)

scala> val cc = graph.connectedComponents()
17/09/20 15:33:39 WARN BlockManager: Block rdd_11_0 already exists on this 
machine; not re-adding it
cc: org.apache.spark.graphx.Graph[org.apache.spark.graphx.VertexId,Int] = 
org.apache.spark.graphx.impl.GraphImpl@2cc3b0a7

scala> sc.getPersistentRDDs
res1: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map(8 -> 
VertexRDD, VertexRDD MapPartitionsRDD[8] at mapPartitions at 
VertexRDD.scala:345, 2 -> GraphLoader.edgeListFile - edges 
(data/graphx/followers.txt), EdgeRDD, EdgeRDD MapPartitionsRDD[2] at 
mapPartitionsWithIndex at GraphLoader.scala:75, 38 -> EdgeRDD 
ZippedPartitionsRDD2[38] at zipPartitions at ReplicatedVertexView.scala:112, 32 
-> VertexRDD ZippedPartitionsRDD2[32] at zipPartitions at 
VertexRDDImpl.scala:156)

scala> val cc = graph.connectedComponents()
cc: org.apache.spark.graphx.Graph[org.apache.spark.graphx.VertexId,Int] = 
org.apache.spark.graphx.impl.GraphImpl@69abff1d

scala> sc.getPersistentRDDs
res2: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map(38 -> EdgeRDD 
ZippedPartitionsRDD2[38] at zipPartitions at ReplicatedVertexView.scala:112, 70 
-> VertexRDD ZippedPartitionsRDD2[70] at zipPartitions at 
VertexRDDImpl.scala:156, 2 -> GraphLoader.edgeListFile - edges 
(data/graphx/followers.txt), EdgeRDD, EdgeRDD MapPartitionsRDD[2] at 
mapPartitionsWithIndex at GraphLoader.scala:75, 32 -> VertexRDD 
ZippedPartitionsRDD2[32] at zipPartitions at VertexRDDImpl.scala:156, 76 -> 
EdgeRDD ZippedPartitionsRDD2[76] at zipPartitions at 
ReplicatedVertexView.scala:112, 8 -> VertexRDD, VertexRDD MapPartitionsRDD[8] 
at mapPartitions at VertexRDD.scala:345)

{code}
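
A user-side cleanup is possible from the driver even before a fix lands: 
sc.getPersistentRDDs exposes every cached RDD by id, so leftovers from 
connectedComponents (or gbt.fit) can be swept. The sketch below is only an 
illustration, not the proposed fix; the helper name is made up, and blindly 
unpersisting RDDs that the returned graph or model still references may force 
recomputation later.

{code}
// Sketch: drop cached RDDs that the caller did not persist itself.
// "keep" holds the ids of RDDs we cached deliberately before the call.
import org.apache.spark.SparkContext

def unpersistLeftovers(sc: SparkContext, keep: Set[Int]): Unit = {
  sc.getPersistentRDDs.foreach { case (id, rdd) =>
    if (!keep.contains(id)) {
      rdd.unpersist(blocking = false)  // non-blocking; blocks are removed asynchronously
    }
  }
}

// Usage against the session above:
// val keep = sc.getPersistentRDDs.keySet.toSet   // the graph's own VertexRDD/EdgeRDD, etc.
// val cc = graph.connectedComponents()
// unpersistLeftovers(sc, keep)
{code}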

> GBTs forgot to unpersist datasets cached by PeriodicRDDCheckpointer
> ---
>
> Key: SPARK-22075
> URL: https://issues.apache.org/jira/browse/SPARK-22075
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>
> {{PeriodicRDDCheckpointer}} automatically persists the last 3 datasets 
> passed to {{PeriodicRDDCheckpointer.update}}.
> In GBTs, these last 3 intermediate RDDs are still cached after {{fit()}}.
> {code}
> scala> val dataset = 
> spark.read.format("libsvm").load("./data/mllib/sample_kmeans_data.txt")
> dataset: org.apache.spark.sql.DataFrame = [label: double, features: vector]   
>   
> scala> dataset.persist()
> res0: dataset.type = [label: double, features: vector]
> scala> dataset.count
> res1: Long = 6
> scala> sc.getPersistentRDDs
> res2: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] =
> Map(8 -> *FileScan libsvm [label#0,features#1] Batched: false, Format: 
> LibSVM, Location: 
> InMemoryFileIndex[file:/Users/zrf/.dev/spark-2.2.0-bin-hadoop2.7/data/mllib/sample_kmeans_data.txt],
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct<label:double,features:struct<type:tinyint,size:int,indices:array<int>,values:array<double>>>
>  MapPartitionsRDD[8] at persist at <console>:26)
> scala> import org.apache.spark.ml.regression._
> import org.apache.spark.ml.regression._
> scala> val model = gbt.fit(dataset)
> <console>:28: error: not found: value gbt
>val model = gbt.fit(dataset)
>^
> scala> val gbt = new GBTRegressor()
> gbt: org.apache.spark.ml.regression.GBTRegressor = gbtr_da1fe371a25e
> scala> val model = gbt.fit(dataset)
> 17/09/20 14:05:33 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:35 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:35 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:35 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/09/20 14:05:35 WARN DecisionTreeMetadata: DecisionTree reducing maxBins 
> from 32 to 6 (= number of training instances)
> 17/0

[jira] [Created] (SPARK-22078) clarify exception behaviors for all data source v2 interfaces

2017-09-20 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-22078:
---

 Summary: clarify exception behaviors for all data source v2 
interfaces
 Key: SPARK-22078
 URL: https://issues.apache.org/jira/browse/SPARK-22078
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22077) RpcEndpointAddress fails to parse spark URL if it is an ipv6 address.

2017-09-20 Thread Eric Vandenberg (JIRA)
Eric Vandenberg created SPARK-22077:
---

 Summary: RpcEndpointAddress fails to parse spark URL if it is an 
ipv6 address.
 Key: SPARK-22077
 URL: https://issues.apache.org/jira/browse/SPARK-22077
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 2.0.0
Reporter: Eric Vandenberg
Priority: Minor


RpcEndpointAddress fails to parse a Spark URL if it contains an IPv6 address.

For example, 
sparkUrl = "spark://HeartbeatReceiver@2401:db00:2111:40a1:face:0:21:0:35243"

is parsed as:
host = null
port = -1
name = null

By contrast, sparkUrl = spark://HeartbeatReceiver@localhost:55691 is parsed properly.

This is happening on our production machines and prevents Spark from starting up.

org.apache.spark.SparkException: Invalid Spark URL: 
spark://HeartbeatReceiver@2401:db00:2111:40a1:face:0:21:0:35243
at 
org.apache.spark.rpc.RpcEndpointAddress$.apply(RpcEndpointAddress.scala:65)
at 
org.apache.spark.rpc.netty.NettyRpcEnv.asyncSetupEndpointRefByURI(NettyRpcEnv.scala:133)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96)
at org.apache.spark.util.RpcUtils$.makeDriverRef(RpcUtils.scala:32)
at org.apache.spark.executor.Executor.<init>(Executor.scala:121)
at 
org.apache.spark.scheduler.local.LocalEndpoint.<init>(LocalSchedulerBackend.scala:59)
at 
org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:126)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:173)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:507)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2283)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:833)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:825)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:825)
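
The parse failure can be reproduced outside Spark with java.net.URI, which 
(assuming the 2.x implementation) RpcEndpointAddress relies on for this parsing: 
a bare IPv6 literal leaves host/port/userinfo unset, while the RFC 3986 
bracketed form parses. A minimal sketch using the address from this report 
(exact results may vary slightly by JDK):

{code}
import java.net.URI

// Unbracketed IPv6 literal: the authority cannot be parsed as userinfo@host:port,
// so host/port/userinfo come back as null / -1 / null, matching the report above.
val bad = new URI("spark://HeartbeatReceiver@2401:db00:2111:40a1:face:0:21:0:35243")
println((bad.getUserInfo, bad.getHost, bad.getPort))

// Bracketed IPv6 literal (RFC 3986 authority syntax): host and port are recovered.
val good = new URI("spark://HeartbeatReceiver@[2401:db00:2111:40a1:face:0:21:0]:35243")
println((good.getUserInfo, good.getHost, good.getPort))
{code}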



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21977) SinglePartition optimizations break certain Streaming Stateful Aggregation requirements

2017-09-20 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-21977.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> SinglePartition optimizations break certain Streaming Stateful Aggregation 
> requirements
> ---
>
> Key: SPARK-21977
> URL: https://issues.apache.org/jira/browse/SPARK-21977
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 2.3.0
>
>
> This is a bit hard to explain as there are several issues here; I'll try my 
> best. Here are the requirements:
>   1. A StructuredStreaming Source that can generate empty RDDs with 0 
> partitions
>   2. A StructuredStreaming query that uses the above source, performs a 
> stateful aggregation (mapGroupsWithState, groupBy.count, ...), and coalesces 
> to 1 partition
> The crux of the problem is that when a dataset has a `coalesce(1)` call, it 
> receives a `SinglePartition` partitioning scheme. This scheme satisfies most 
> required distributions used for aggregations such as HashAggregateExec. This 
> causes a world of problems:
>  Symptom 1. If the input RDD has 0 partitions, the whole lineage will receive 
> 0 partitions, nothing will be executed, and the state store will not create any 
> delta files. When this happens, the next trigger fails, because the 
> StateStore fails to load the delta file for the previous trigger.
>  Symptom 2. Let's say that there was data. In this case, if you stop 
> your stream, change `coalesce(1)` to `coalesce(2)`, and then restart your 
> stream, the stream will fail, because `spark.sql.shuffle.partitions - 1` 
> of the StateStores will fail to find their delta files.
> To fix the issues above, we must check that the partitioning of the child of 
> a `StatefulOperator` satisfies:
>   If the grouping expressions are empty:
> a) AllTuples distribution
> b) Single physical partition
>   If the grouping expressions are non-empty:
> a) Clustered distribution
> b) spark.sql.shuffle.partition # of partitions
> regardless of whether coalesce(1) exists in the plan, and regardless of whether 
> the input RDD for the trigger has any data.
> Once you fix the above problem by adding an Exchange to the plan, you come 
> across the following bug:
> If you call `coalesce(1).groupBy().count()` on a Streaming DataFrame, and if 
> you have a trigger with no data, `StateStoreRestoreExec` doesn't return the 
> prior state. However, for this specific aggregation, `HashAggregateExec` 
> after the restore returns a (0, 0) row, since we're performing a count, and 
> there is no data. Then this data gets stored in `StateStoreSaveExec` causing 
> the previous counts to be overwritten and lost.
>   
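
For orientation, a minimal sketch of the query shape described above: a 
streaming source, coalesce(1), and a global stateful aggregation. The rate 
source and checkpoint path are stand-ins for this sketch only; the built-in 
rate source does not necessarily deliver the empty 0-partition batches behind 
Symptom 1, so this illustrates the plan shape that picks up SinglePartition 
rather than a guaranteed reproduction.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("spark-21977-shape")
  .getOrCreate()

// coalesce(1) gives the plan a SinglePartition output partitioning, which already
// satisfies the distribution required by the global (empty grouping) aggregation,
// so no shuffle is inserted before the stateful operators.
val counts = spark.readStream
  .format("rate").option("rowsPerSecond", "1").load()
  .coalesce(1)
  .groupBy()
  .count()

val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "/tmp/spark-21977-checkpoint")  // placeholder path
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()

// query.awaitTermination()
{code}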



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org