[jira] [Assigned] (SPARK-13844) Generate better code for filters with a non-nullable column
[ https://issues.apache.org/jira/browse/SPARK-13844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13844: Assignee: (was: Apache Spark) > Generate better code for filters with a non-nullable column > --- > > Key: SPARK-13844 > URL: https://issues.apache.org/jira/browse/SPARK-13844 > Project: Spark > Issue Type: Improvement > Components: SQL > Reporter: Kazuaki Ishizaki > > We can generate smaller code for > # and/or with a non-nullable column > # divide/remainder with non-zero values > # whole code generation with non-nullable columns -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
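Background for readers following SPARK-13844: a minimal Scala sketch of why nullability matters for and/or code generation. This is only an illustration of the idea, not Spark's actual generated code; both function names are hypothetical.
{code}
// Three-valued AND over nullable inputs: the code must carry a null flag
// and branch on it (illustration only, not Spark codegen output).
def nullableAnd(a: java.lang.Boolean, b: java.lang.Boolean): java.lang.Boolean = {
  if (a != null && !a.booleanValue) java.lang.Boolean.FALSE        // false AND x == false
  else if (b != null && !b.booleanValue) java.lang.Boolean.FALSE   // x AND false == false
  else if (a != null && b != null) java.lang.Boolean.valueOf(a.booleanValue && b.booleanValue)
  else null                                                        // result unknown (null)
}

// With both columns statically non-nullable, all null handling disappears:
def nonNullableAnd(a: Boolean, b: Boolean): Boolean = a && b
{code}
The same reasoning applies to divide/remainder with provably non-zero divisors: the divide-by-zero branch can be dropped from the generated code.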
[jira] [Commented] (SPARK-13844) Generate better code for filters with a non-nullable column
[ https://issues.apache.org/jira/browse/SPARK-13844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192223#comment-15192223 ] Apache Spark commented on SPARK-13844: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/11684 > Generate better code for filters with a non-nullable column > --- > > Key: SPARK-13844 > URL: https://issues.apache.org/jira/browse/SPARK-13844 > Project: Spark > Issue Type: Improvement > Components: SQL > Reporter: Kazuaki Ishizaki > > We can generate smaller code for > # and/or with a non-nullable column > # divide/remainder with non-zero values > # whole code generation with non-nullable columns -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13844) Generate better code for filters with a non-nullable column
[ https://issues.apache.org/jira/browse/SPARK-13844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13844: Assignee: Apache Spark > Generate better code for filters with a non-nullable column > --- > > Key: SPARK-13844 > URL: https://issues.apache.org/jira/browse/SPARK-13844 > Project: Spark > Issue Type: Improvement > Components: SQL > Reporter: Kazuaki Ishizaki > Assignee: Apache Spark > > We can generate smaller code for > # and/or with a non-nullable column > # divide/remainder with non-zero values > # whole code generation with non-nullable columns -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13845) Driver OOM after few days when running streaming
jeanlyn created SPARK-13845: --- Summary: Driver OOM after few days when running streaming Key: SPARK-13845 URL: https://issues.apache.org/jira/browse/SPARK-13845 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.6.1, 1.5.2 Reporter: jeanlyn We have a streaming job using *FlumePollInputStream* whose driver always OOMs after a few days. Here is part of a driver heap histogram taken before the OOM:
{noformat}
 num     #instances         #bytes  class name
----------------------------------------------
   1:      13845916      553836640  org.apache.spark.storage.BlockStatus
   2:      14020324      336487776  org.apache.spark.storage.StreamBlockId
   3:      13883881      333213144  scala.collection.mutable.DefaultEntry
   4:          8907       89043952  [Lscala.collection.mutable.HashEntry;
   5:         62360       65107352  [B
   6:        163368       24453904  [Ljava.lang.Object;
   7:        293651       20342664  [C
 ...
{noformat}
*BlockStatus* and *StreamBlockId* keep growing, and the driver eventually OOMs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
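A hedged way to watch this growth from the driver before it OOMs. The sketch below assumes Spark 1.x, where {{SparkContext.getExecutorStorageStatus}} is available as a DeveloperApi, and assumes {{ssc}} is the application's running StreamingContext:
{code}
// Periodically count how many stream blocks the driver still tracks, to
// confirm the leak while the job runs. `ssc` is assumed to be the
// application's StreamingContext; getExecutorStorageStatus is Spark 1.x API.
import org.apache.spark.storage.StreamBlockId

val trackedStreamBlocks = ssc.sparkContext.getExecutorStorageStatus
  .iterator
  .flatMap(_.blocks.keysIterator)
  .count(_.isInstanceOf[StreamBlockId])
println(s"driver currently tracks $trackedStreamBlocks stream blocks")
{code}
A heap histogram like the one in the report is typically captured with {{jmap -histo <driver-pid>}}.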
[jira] [Assigned] (SPARK-13845) Driver OOM after few days when running streaming
[ https://issues.apache.org/jira/browse/SPARK-13845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13845: Assignee: (was: Apache Spark) > Driver OOM after few days when running streaming > > > Key: SPARK-13845 > URL: https://issues.apache.org/jira/browse/SPARK-13845 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 1.5.2, 1.6.1 > Reporter: jeanlyn > > We have a streaming job using *FlumePollInputStream* whose driver always OOMs after a few days. Here is part of a driver heap histogram taken before the OOM: > {noformat} > num #instances #bytes class name > -- > 1: 13845916 553836640 org.apache.spark.storage.BlockStatus > 2: 14020324 336487776 org.apache.spark.storage.StreamBlockId > 3: 13883881 333213144 scala.collection.mutable.DefaultEntry > 4: 8907 89043952 [Lscala.collection.mutable.HashEntry; > 5: 62360 65107352 [B > 6: 163368 24453904 [Ljava.lang.Object; > 7: 293651 20342664 [C > ... > {noformat} > *BlockStatus* and *StreamBlockId* keep growing, and the driver eventually OOMs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13845) Driver OOM after few days when running streaming
[ https://issues.apache.org/jira/browse/SPARK-13845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192227#comment-15192227 ] Apache Spark commented on SPARK-13845: -- User 'jeanlyn' has created a pull request for this issue: https://github.com/apache/spark/pull/11679 > Driver OOM after few days when running streaming > > > Key: SPARK-13845 > URL: https://issues.apache.org/jira/browse/SPARK-13845 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 1.5.2, 1.6.1 > Reporter: jeanlyn > > We have a streaming job using *FlumePollInputStream* whose driver always OOMs after a few days. Here is part of a driver heap histogram taken before the OOM: > {noformat} > num #instances #bytes class name > -- > 1: 13845916 553836640 org.apache.spark.storage.BlockStatus > 2: 14020324 336487776 org.apache.spark.storage.StreamBlockId > 3: 13883881 333213144 scala.collection.mutable.DefaultEntry > 4: 8907 89043952 [Lscala.collection.mutable.HashEntry; > 5: 62360 65107352 [B > 6: 163368 24453904 [Ljava.lang.Object; > 7: 293651 20342664 [C > ... > {noformat} > *BlockStatus* and *StreamBlockId* keep growing, and the driver eventually OOMs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13845) Driver OOM after few days when running streaming
[ https://issues.apache.org/jira/browse/SPARK-13845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13845: Assignee: Apache Spark > Driver OOM after few days when running streaming > > > Key: SPARK-13845 > URL: https://issues.apache.org/jira/browse/SPARK-13845 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 1.5.2, 1.6.1 > Reporter: jeanlyn > Assignee: Apache Spark > > We have a streaming job using *FlumePollInputStream* whose driver always OOMs after a few days. Here is part of a driver heap histogram taken before the OOM: > {noformat} > num #instances #bytes class name > -- > 1: 13845916 553836640 org.apache.spark.storage.BlockStatus > 2: 14020324 336487776 org.apache.spark.storage.StreamBlockId > 3: 13883881 333213144 scala.collection.mutable.DefaultEntry > 4: 8907 89043952 [Lscala.collection.mutable.HashEntry; > 5: 62360 65107352 [B > 6: 163368 24453904 [Ljava.lang.Object; > 7: 293651 20342664 [C > ... > {noformat} > *BlockStatus* and *StreamBlockId* keep growing, and the driver eventually OOMs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10775) add search keywords in history page ui
[ https://issues.apache.org/jira/browse/SPARK-10775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta resolved SPARK-10775. Resolution: Duplicate Fix Version/s: 2.0.0 This issue is resolved by SPARK-10873. > add search keywords in history page ui > -- > > Key: SPARK-10775 > URL: https://issues.apache.org/jira/browse/SPARK-10775 > Project: Spark > Issue Type: Improvement > Components: Web UI > Reporter: Lianhui Wang > Priority: Minor > Fix For: 2.0.0 > > > Add a search button to the history page UI so that we can search applications by keyword, for example by date or time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13845) Driver OOM after few days when running streaming
[ https://issues.apache.org/jira/browse/SPARK-13845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192259#comment-15192259 ] Sean Owen commented on SPARK-13845: --- [~jeanlyn] You didn't write a description, and the title does not reflect the cause you identified. Please update it, and read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark > Driver OOM after few days when running streaming > > > Key: SPARK-13845 > URL: https://issues.apache.org/jira/browse/SPARK-13845 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 1.5.2, 1.6.1 > Reporter: jeanlyn > > We have a streaming job using *FlumePollInputStream* whose driver always OOMs after a few days. Here is part of a driver heap histogram taken before the OOM: > {noformat} > num #instances #bytes class name > -- > 1: 13845916 553836640 org.apache.spark.storage.BlockStatus > 2: 14020324 336487776 org.apache.spark.storage.StreamBlockId > 3: 13883881 333213144 scala.collection.mutable.DefaultEntry > 4: 8907 89043952 [Lscala.collection.mutable.HashEntry; > 5: 62360 65107352 [B > 6: 163368 24453904 [Ljava.lang.Object; > 7: 293651 20342664 [C > ... > {noformat} > *BlockStatus* and *StreamBlockId* keep growing, and the driver eventually OOMs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13846) VectorIndexer output on unknown feature should be more descriptive
Dmitry Spikhalskiy created SPARK-13846: -- Summary: VectorIndexer output on unknown feature should be more descriptive Key: SPARK-13846 URL: https://issues.apache.org/jira/browse/SPARK-13846 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.6.1 Reporter: Dmitry Spikhalskiy Priority: Minor I got the following exception, which looks like it's related to an unknown categorical variable value being passed to indexing:
{noformat}
java.util.NoSuchElementException: key not found: 20.0
    at scala.collection.MapLike$class.default(MapLike.scala:228)
    at scala.collection.AbstractMap.default(Map.scala:58)
    at scala.collection.MapLike$class.apply(MapLike.scala:141)
    at scala.collection.AbstractMap.apply(Map.scala:58)
    at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10$$anonfun$apply$4.apply(VectorIndexer.scala:316)
    at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10$$anonfun$apply$4.apply(VectorIndexer.scala:315)
    at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
    at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
    at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10.apply(VectorIndexer.scala:315)
    at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10.apply(VectorIndexer.scala:309)
    at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$11.apply(VectorIndexer.scala:351)
    at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$11.apply(VectorIndexer.scala:351)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalExpr2$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
    at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51)
    at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
{noformat}
The VectorIndexer was created like this:
{code}
val featureIndexer = new VectorIndexer()
  .setInputCol(DataFrameColumns.FEATURES)
  .setOutputCol("indexedFeatures")
  .setMaxCategories(5)
  .fit(trainingDF)
{code}
The output should not be just the default java.util.NoSuchElementException, but something specific like UnknownCategoricalValue, with information that could help find the source element of the vector (the element index in the vector, maybe). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
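A sketch of the error reporting the reporter is asking for. The exception class and helper function below are hypothetical, not part of the Spark API; the point is only that the lookup failure should name the unseen value and the feature's position in the vector:
{code}
// Hypothetical: a descriptive replacement for the bare NoSuchElementException.
class UnknownCategoricalValueException(message: String)
  extends RuntimeException(message)

// Hypothetical helper: look up a category index, failing with context.
def indexCategory(categoryMap: Map[Double, Int],
                  value: Double,
                  featureIndex: Int): Int =
  categoryMap.getOrElse(value, throw new UnknownCategoricalValueException(
    s"Unknown categorical value $value for feature at vector index $featureIndex " +
    s"(known values: ${categoryMap.keys.toSeq.sorted.mkString(", ")})"))
{code}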
[jira] [Updated] (SPARK-13846) VectorIndexer output on unknown feature should be more descriptive
[ https://issues.apache.org/jira/browse/SPARK-13846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitry Spikhalskiy updated SPARK-13846: --- Description: I got the following exception, which looks like it's related to an unknown categorical variable value being passed to indexing:
{noformat}
java.util.NoSuchElementException: key not found: 20.0
    at scala.collection.MapLike$class.default(MapLike.scala:228)
    at scala.collection.AbstractMap.default(Map.scala:58)
    at scala.collection.MapLike$class.apply(MapLike.scala:141)
    at scala.collection.AbstractMap.apply(Map.scala:58)
    at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10$$anonfun$apply$4.apply(VectorIndexer.scala:316)
    at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10$$anonfun$apply$4.apply(VectorIndexer.scala:315)
    at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
    at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
    at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10.apply(VectorIndexer.scala:315)
    at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10.apply(VectorIndexer.scala:309)
    at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$11.apply(VectorIndexer.scala:351)
    at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$11.apply(VectorIndexer.scala:351)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalExpr2$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
    at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51)
    at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
{noformat}
The VectorIndexer was created like this:
{code}
val featureIndexer = new VectorIndexer()
  .setInputCol(DataFrameColumns.FEATURES)
  .setOutputCol("indexedFeatures")
  .setMaxCategories(5)
  .fit(trainingDF)
{code}
The output should not be just the default java.util.NoSuchElementException, but something specific like UnknownCategoricalValue, with information that could help find the source element of the vector (the element index in the vector, maybe).

was: I got an exception and looks like it's related to unknown categorical variable value passed indexing.
{noformat}
java.util.NoSuchElementException: key not found: 20.0
    at scala.collection.MapLike$class.default(MapLike.scala:228)
    at scala.collection.AbstractMap.default(Map.scala:58)
    at scala.collection.MapLike$class.apply(MapLike.scala:141)
    at scala.collection.AbstractMap.apply(Map.scala:58)
    at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10$$anonfun$apply$4.apply(VectorIndexer.scala:316)
    at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10$$anonfun$apply$4.apply(VectorIndexer.scala:315)
    at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
    at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
    at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10.apply(VectorIndexer.scala:315)
    at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10.apply(VectorIndexer.scala:309)
    at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$11.apply(VectorIndexer.scala:351)
    at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$11.apply(VectorIndexer.scala:351)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalExpr2$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
    at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51)
    at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
{noformat}
VectorIndexer created like
{code}
val featureIndexer = new VectorIndexer()
  .setInputCol(DataFrameColumns.FEATURES)
  .setOutputCol("indexedFeatures")
  .setMaxCategories(5)
  .fit(trainingDF)
{code}
Output should be not just default java.util.NoSuchElementException, but something specific like UnknownCategoricalValue with information, that could help to find the source element of vector (element index in vector maybe). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13846) VectorIndexer output on unknown feature should be more descriptive
[ https://issues.apache.org/jira/browse/SPARK-13846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13846: -- Issue Type: Improvement (was: Bug) > VectorIndexer output on unknown feature should be more descriptive > -- > > Key: SPARK-13846 > URL: https://issues.apache.org/jira/browse/SPARK-13846 > Project: Spark > Issue Type: Improvement > Components: ML > Affects Versions: 1.6.1 > Reporter: Dmitry Spikhalskiy > Priority: Minor > >
> I got the following exception, which looks like it's related to an unknown categorical variable value being passed to indexing:
> {noformat}
> java.util.NoSuchElementException: key not found: 20.0
>     at scala.collection.MapLike$class.default(MapLike.scala:228)
>     at scala.collection.AbstractMap.default(Map.scala:58)
>     at scala.collection.MapLike$class.apply(MapLike.scala:141)
>     at scala.collection.AbstractMap.apply(Map.scala:58)
>     at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10$$anonfun$apply$4.apply(VectorIndexer.scala:316)
>     at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10$$anonfun$apply$4.apply(VectorIndexer.scala:315)
>     at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
>     at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
>     at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10.apply(VectorIndexer.scala:315)
>     at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10.apply(VectorIndexer.scala:309)
>     at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$11.apply(VectorIndexer.scala:351)
>     at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$11.apply(VectorIndexer.scala:351)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalExpr2$(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>     at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51)
>     at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49)
>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>     at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>     at org.apache.spark.scheduler.Task.run(Task.scala:89)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The VectorIndexer was created like this:
> {code}
> val featureIndexer = new VectorIndexer()
>   .setInputCol(DataFrameColumns.FEATURES)
>   .setOutputCol("indexedFeatures")
>   .setMaxCategories(5)
>   .fit(trainingDF)
> {code}
> The output should not be just the default java.util.NoSuchElementException, but something specific like UnknownCategoricalValue, with information that could help find the source element of the vector (the element index in the vector, maybe). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13810) Add Port Configuration Suggestions on Bind Exceptions
[ https://issues.apache.org/jira/browse/SPARK-13810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13810: -- Assignee: Bjorn Jonsson > Add Port Configuration Suggestions on Bind Exceptions > - > > Key: SPARK-13810 > URL: https://issues.apache.org/jira/browse/SPARK-13810 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Bjorn Jonsson >Assignee: Bjorn Jonsson >Priority: Minor > Labels: supportability > Fix For: 1.6.2, 2.0.0 > > > When a java.net.BindException is thrown in startServiceOnPort in Utils.scala, > it displays the following message, irrespective of the port being used: > java.net.BindException: Address already in use: Service '$serviceName' failed > after 16 retries! > For supportability, it would be useful to add port configuration suggestions > for the related port being used. For example, if the Spark UI exceeds > spark.port.maxRetries, users should see: > java.net.BindException: Address already in use: Service 'SparkUI' failed > after 16 retries! Consider setting spark.ui.port to an available port or > increasing spark.port.maxRetries. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13810) Add Port Configuration Suggestions on Bind Exceptions
[ https://issues.apache.org/jira/browse/SPARK-13810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-13810. --- Resolution: Fixed Fix Version/s: 1.6.2 2.0.0 Issue resolved by pull request 11644 [https://github.com/apache/spark/pull/11644] > Add Port Configuration Suggestions on Bind Exceptions > - > > Key: SPARK-13810 > URL: https://issues.apache.org/jira/browse/SPARK-13810 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Bjorn Jonsson >Priority: Minor > Labels: supportability > Fix For: 2.0.0, 1.6.2 > > > When a java.net.BindException is thrown in startServiceOnPort in Utils.scala, > it displays the following message, irrespective of the port being used: > java.net.BindException: Address already in use: Service '$serviceName' failed > after 16 retries! > For supportability, it would be useful to add port configuration suggestions > for the related port being used. For example, if the Spark UI exceeds > spark.port.maxRetries, users should see: > java.net.BindException: Address already in use: Service 'SparkUI' failed > after 16 retries! Consider setting spark.ui.port to an available port or > increasing spark.port.maxRetries. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
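A minimal sketch of the idea behind this fix. The service-name-to-configuration mapping below is an assumption for illustration only; see the merged pull request for the actual change:
{code}
// Map a failing service to the configuration key that controls its port,
// so the BindException message can suggest a concrete fix. These map
// entries are illustrative, not the full mapping from the PR.
def bindFailureMessage(serviceName: String, maxRetries: Int): String = {
  val portConf = Map(
    "SparkUI"     -> "spark.ui.port",
    "sparkDriver" -> "spark.driver.port"
  ).getOrElse(serviceName, s"the port configuration for '$serviceName'")
  s"Address already in use: Service '$serviceName' failed after $maxRetries retries! " +
    s"Consider setting $portConf to an available port or increasing spark.port.maxRetries."
}
{code}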
[jira] [Commented] (SPARK-12216) Spark failed to delete temp directory
[ https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192281#comment-15192281 ] Guram Savinov commented on SPARK-12216: --- I have the same problem when exiting spark-shell on Windows 7. It seems it's not a permissions problem, because I start the console as admin and have no problem removing these directories manually. Maybe the directory is locked by some thread when the shutdown hook executes. > Spark failed to delete temp directory > -- > > Key: SPARK-12216 > URL: https://issues.apache.org/jira/browse/SPARK-12216 > Project: Spark > Issue Type: Bug > Components: Spark Shell > Environment: windows 7 64 bit > Spark 1.52 > Java 1.8.0.65 > PATH includes: > C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin > C:\ProgramData\Oracle\Java\javapath > C:\Users\Stefan\scala\bin > SYSTEM variables set are: > JAVA_HOME=C:\Program Files\Java\jre1.8.0_65 > HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin > (where the bin\winutils resides) > both \tmp and \tmp\hive have permissions > drwxrwxrwx as detected by winutils ls > Reporter: stefan > Priority: Minor > > The mailing list archives have no obvious solution to this: > scala> :q > Stopping spark context. > 15/12/08 16:24:22 ERROR ShutdownHookManager: Exception while deleting Spark > temp dir: > C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff > java.io.IOException: Failed to delete: > C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:60) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234) > at > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234) > at scala.util.Try$.apply(Try.scala:161) > at > org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216) > at > org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
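The locked-file theory is easy to check in isolation. A self-contained sketch (the behavior shown is Windows-specific; on Unix the first delete would succeed because the OS allows deleting open files):
{code}
// Demonstrates the suspected cause: on Windows, java.io.File#delete
// returns false while any handle to the file is still open, which is
// what a shutdown hook would hit if some thread still holds a stream
// into the temp directory.
import java.io.{File, FileOutputStream}

object DeleteWhileOpen extends App {
  val f = File.createTempFile("spark-lock-demo", ".tmp")
  val out = new FileOutputStream(f)                    // keep a handle open
  println(s"delete while handle open: ${f.delete()}")  // false on Windows
  out.close()
  println(s"delete after close: ${f.delete()}")        // true
}
{code}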
[jira] [Comment Edited] (SPARK-12216) Spark failed to delete temp directory
[ https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192281#comment-15192281 ] Guram Savinov edited comment on SPARK-12216 at 3/13/16 10:49 AM: - I have the same problem when exiting spark-shell on Windows 7. It seems it's not a permissions problem, because I start the console as admin and have no problem removing these directories manually. Maybe the directory is locked by some thread when the shutdown hook executes. Take a look at this post; it has details about a possible directory lock: http://jakzaprogramowac.pl/pytanie/12478,how-to-find-which-java-scala-thread-has-locked-a-file was (Author: gsavinov): I have the same problem when exiting spark-shell on Windows 7. It seems it's not a permissions problem, because I start the console as admin and have no problem removing these directories manually. Maybe the directory is locked by some thread when the shutdown hook executes. > Spark failed to delete temp directory > -- > > Key: SPARK-12216 > URL: https://issues.apache.org/jira/browse/SPARK-12216 > Project: Spark > Issue Type: Bug > Components: Spark Shell > Environment: windows 7 64 bit > Spark 1.52 > Java 1.8.0.65 > PATH includes: > C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin > C:\ProgramData\Oracle\Java\javapath > C:\Users\Stefan\scala\bin > SYSTEM variables set are: > JAVA_HOME=C:\Program Files\Java\jre1.8.0_65 > HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin > (where the bin\winutils resides) > both \tmp and \tmp\hive have permissions > drwxrwxrwx as detected by winutils ls > Reporter: stefan > Priority: Minor > > The mailing list archives have no obvious solution to this: > scala> :q > Stopping spark context. > 15/12/08 16:24:22 ERROR ShutdownHookManager: Exception while deleting Spark > temp dir: > C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff > java.io.IOException: Failed to delete: > C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:60) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234) > at > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234) > at scala.util.Try$.apply(Try.scala:161) > at > org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216) > at > org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13073) creating R like summary for logistic Regression in Spark - Scala
[ https://issues.apache.org/jira/browse/SPARK-13073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192290#comment-15192290 ] Mohamed Baddar commented on SPARK-13073: [~josephkb] After more investigation in the code, and in order to keep the changes minimal, my previous suggestion may not be suitable. I think we can implement a toString version for BinaryLogisticRegressionSummary that gives different information than R's summary. It will create a string representation of the following members: precision, recall, fmeasure. [~josephkb] Is there any comment before I start the PR? > creating R like summary for logistic Regression in Spark - Scala > > > Key: SPARK-13073 > URL: https://issues.apache.org/jira/browse/SPARK-13073 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib > Reporter: Samsudhin > Priority: Minor > > Currently Spark ML provides coefficients for logistic regression. To evaluate the trained model, tests like the Wald test and chi-square test should be run, and their results summarized and displayed like R's GLM summary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13073) creating R like summary for logistic Regression in Spark - Scala
[ https://issues.apache.org/jira/browse/SPARK-13073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192290#comment-15192290 ] Mohamed Baddar edited comment on SPARK-13073 at 3/13/16 11:03 AM: -- [~josephkb] After more investigation in the code, and in order to keep the changes minimal, my previous suggestion may not be suitable. I think we can implement a toString version for BinaryLogisticRegressionSummary that gives different information than R's summary. It will create a string representation of the following members: precision, recall, fmeasure. Is there any comment before I start the PR? was (Author: mbaddar1): [~josephkb] After more investigation in the code, and in order to keep the changes minimal, my previous suggestion may not be suitable. I think we can implement a toString version for BinaryLogisticRegressionSummary that gives different information than R's summary. It will create a string representation of the following members: precision, recall, fmeasure. [~josephkb] Is there any comment before I start the PR? > creating R like summary for logistic Regression in Spark - Scala > > > Key: SPARK-13073 > URL: https://issues.apache.org/jira/browse/SPARK-13073 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib > Reporter: Samsudhin > Priority: Minor > > Currently Spark ML provides coefficients for logistic regression. To evaluate the trained model, tests like the Wald test and chi-square test should be run, and their results summarized and displayed like R's GLM summary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
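A sketch of what the proposed toString could print. The case class below is a stand-in for illustration only; BinaryLogisticRegressionSummary itself exposes these metrics differently (e.g. as per-threshold DataFrames):
{code}
// Hypothetical container showing one possible toString layout for the
// three metrics named above; not the actual summary class.
case class SummarySketch(precision: Double, recall: Double, fMeasure: Double) {
  override def toString: String =
    f"""BinaryLogisticRegressionSummary
       |  precision: $precision%.4f
       |  recall:    $recall%.4f
       |  F-measure: $fMeasure%.4f""".stripMargin
}

// Example: println(SummarySketch(0.91, 0.87, 0.89))
{code}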
[jira] [Commented] (SPARK-13718) Scheduler "creating" straggler node
[ https://issues.apache.org/jira/browse/SPARK-13718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192309#comment-15192309 ] Sean Owen commented on SPARK-13718: --- What do you mean that it tries to assign without an available core (slot?). I understand that if the data-local nodes are all busy, and the task is I/O-intensive, the reading remotely is not only going to be slower but put more load on the busy nodes. But, they may not be I/O-bound, just have all slots occupied. Or the job may not be I/O-intensive at all in which case data locality doesn't help. In this case, not scheduling the task is suboptimal. But, when is it better to not schedule the task at all? you're saying it creates a straggler, but all you're saying is things take a while when resources are constrained. What is the better scheduling decision, even given omniscience? > Scheduler "creating" straggler node > > > Key: SPARK-13718 > URL: https://issues.apache.org/jira/browse/SPARK-13718 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Spark Core >Affects Versions: 1.3.1 > Environment: Spark 1.3.1 > MapR-FS > Single Rack > Standalone mode scheduling > 8 node cluster > 48 cores & 512 RAM / node > Data Replication factor of 3 > Each Node has 4 Spark executors configured with 12 cores each and 22GB of RAM. >Reporter: Ioannis Deligiannis >Priority: Minor > Attachments: TestIssue.java, spark_struggler.jpg > > > *Data:* > * Assume an even distribution of data across the cluster with a replication > factor of 3. > * In-memory data are partitioned in 128 chunks (384 cores in total, i.e. 3 > requests can be executed concurrently(-ish) ) > *Action:* > * Action is a simple sequence of map/filter/reduce. > * The action operates upon and returns a small subset of data (following the > full map over the data). > * Data are 1 x cached serialized in memory (Kryo), so calling the action > should not hit the disk under normal conditions. > * Action network usage is low as it returns a small number of aggregated > results and does not require excessive shuffling > * Under low or moderate load, each action is expected to complete in less > than 2 seconds > *H/W Outlook* > When the action is called in high numbers, initially the cluster CPU gets > close to 100% (which is expected & intended). > After a while, the cluster utilization reduces significantly with only one > (struggler) node having 100% CPU and fully utilized network. > *Diagnosis:* > 1. Attached a profiler to the driver and executors to monitor GC or I/O > issues and everything is normal under low or heavy usage. > 2. Cluster network usage is very low > 3. No issues on Spark UI except that tasks begin to move from LOCAL to ANY > *Cause (Corrected as found details in code):* > 1. Node 'H' is doing marginally more work than the rest (being a little > slower and at almost 100% CPU) > 2. Scheduler hits the default 3000 millis spark.locality.wait and assigns the > task to other nodes > 3. One of the nodes 'X' that accepted the task will try to access the data > from node 'H' HDD. This adds Network I/O to node and also some extra CPU for > I/O. > 4. 'X' time to complete increases ~5x as it goes over Network > 5. 
Eventually, every node will have a task that is waiting to fetch that > specific partition from node 'H', so the cluster is basically blocked on a single > node. > What I managed to figure out from the code is that this is because if an RDD > is cached, it will make use of BlockManager.getRemote() and will not > recompute the DAG part that resulted in this RDD, and hence always hits the > node that has cached the RDD. > *Proposed Fix* > I have not worked with the Scala & Spark source code enough to propose a code > fix, but on a high level, when a task hits the 'spark.locality.wait' timeout, > it could make use of a new configuration, e.g. > recomputeRddAfterLocalityTimeout, instead of always trying to get the cached > RDD. This would be very useful if it could also be manually set on the RDD. > *Workaround* > Playing with 'spark.locality.wait' values, there is a deterministic value > depending on partitions and config where the problem ceases to exist. > *PS1*: Don't have enough Scala skills to follow the issue or propose a fix, > but I hope that this has enough information to make sense. > *PS2*: Debugging this issue made me realize that there can be a lot of > use-cases that trigger this behaviour -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
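For anyone hitting this, the workaround described in the report is driven by standard configuration keys. A sketch (the values are examples, not recommendations; the right setting depends on partition count and cluster configuration, as the reporter notes):
{code}
// spark.locality.wait governs how long the scheduler waits before
// relaxing a task's locality level; it can also be tuned per level.
// Values are in milliseconds in Spark 1.3.x.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.locality.wait", "1000")       // all locality levels
  .set("spark.locality.wait.node", "1000")  // NODE_LOCAL step specifically
{code}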
[jira] [Commented] (SPARK-13298) DAG visualization does not render correctly for jobs
[ https://issues.apache.org/jira/browse/SPARK-13298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192324#comment-15192324 ] Todd Leo commented on SPARK-13298: -- Pls also see SPARK-13645: DAG Diagram not shown properly in Chrome > DAG visualization does not render correctly for jobs > > > Key: SPARK-13298 > URL: https://issues.apache.org/jira/browse/SPARK-13298 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.0 >Reporter: Lucas Woltmann >Assignee: Shixiong Zhu > Fix For: 1.6.1, 2.0.0 > > Attachments: dag_full.png, dag_viz.png, no-dag.png > > > Whenever I try to open the DAG for a job, I get something like this: > !dag_viz.png! > Obviously the svg doesn't get resized, but if I resize it manually, only the > first of four stages in the DAG is shown. > The js console says (variable v is null in peg$c34): > {code:javascript} > Uncaught TypeError: Cannot read property '3' of null > peg$c34 @ graphlib-dot.min.js:1 > peg$parseidDef @ graphlib-dot.min.js:1 > peg$parseaList @ graphlib-dot.min.js:1 > peg$parseattrListBlock @ graphlib-dot.min.js:1 > peg$parseattrList @ graphlib-dot.min.js:1 > peg$parsenodeStmt @ graphlib-dot.min.js:1 > peg$parsestmt @ graphlib-dot.min.js:1 > peg$parsestmtList @ graphlib-dot.min.js:1 > peg$parsesubgraphStmt @ graphlib-dot.min.js:1 > peg$parsenodeIdOrSubgraph @ graphlib-dot.min.js:1 > peg$parseedgeStmt @ graphlib-dot.min.js:1 > peg$parsestmt @ graphlib-dot.min.js:1 > peg$parsestmtList @ graphlib-dot.min.js:1 > peg$parsesubgraphStmt @ graphlib-dot.min.js:1 > peg$parsenodeIdOrSubgraph @ graphlib-dot.min.js:1 > peg$parseedgeStmt @ graphlib-dot.min.js:1 > peg$parsestmt @ graphlib-dot.min.js:1 > peg$parsestmtList @ graphlib-dot.min.js:1 > peg$parsegraphStmt @ graphlib-dot.min.js:1 > parse @ graphlib-dot.min.js:2 > readOne @ graphlib-dot.min.js:2 > renderDot @ spark-dag-viz.js:281 > (anonymous function) @ spark-dag-viz.js:248 > (anonymous function) @ d3.min.js: > 3Y @ d3.min.js:1 > _a.each @ d3.min.js:3 > renderDagVizForJob @ spark-dag-viz.js:207 > renderDagViz @ spark-dag-viz.js:163 > toggleDagViz @ spark-dag-viz.js:100 > onclick @ ?id=2:153 > {code} > (tested in FIrefox 44.0.1 and Chromium 48.0.2564.103) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6378) srcAttr in graph.triplets don't update when the size of graph is huge
[ https://issues.apache.org/jira/browse/SPARK-6378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhaokang Wang updated SPARK-6378: - Attachment: TripletsViewDonotUpdate.scala I have met a similar problem with triplets updates in GraphX. I think I have a code demo that reproduces the situation in this issue, on a small toy graph with only 3 vertices. My demo code is attached as [^TripletsViewDonotUpdate.scala]. The steps to reproduce the issue:
1. Construct a small graph ({{purGraph}} in the code) with only 3 vertices. The edges of the graph are: 2->1, 3->1, 2->3.
2. Conduct the collect-neighbors operation to get the {{inNeighborGraph}} of the {{purGraph}}.
3. Outer join the {{inNeighborGraph}} vertices on {{purGraph}} to get the {{dataGraph}}. In {{dataGraph}}, each vertex stores an ArrayBuffer of its in-neighbors' vertex ids.
4. Now examine the {{inNeighbor}} attribute in the {{dataGraph.vertices}} view and the {{dataGraph.triplets}} view. We can see from the output that the two views are inconsistent on vertex 3's {{inNeighbor}} property:
{quote}
> dataGraph.vertices
vid: 1, inNeighbor:2,3
vid: 3, inNeighbor:2
vid: 2, inNeighbor:
> dataGraph.triplets.srcAttr
vid: 2, inNeighbor:
vid: 2, inNeighbor:
vid: 3, inNeighbor:
{quote}
5. If we comment out the {{purGraph.triplets.count()}} statement in the code, the bug disappears:
{code}
val purGraph = Graph(dataVertex, dataEdge).persist()
// purGraph.triplets.count() // !!!comment this
val inNeighborGraph = purGraph.collectNeighbors(EdgeDirection.In)
// Now join the in-neighbor vertex id list to every vertex's property
val dataGraph = purGraph.outerJoinVertices(inNeighborGraph)((vid, property, inNeighborList) => {
  val inNeighborVertexIds = inNeighborList.getOrElse(Array[(VertexId, VertexProperty)]()).map(t => t._1)
  property.inNeighbor ++= inNeighborVertexIds.toBuffer
  property
})
{code}
It seems that the triplets view and the vertex view of the same graph may be inconsistent in some situations. > srcAttr in graph.triplets don't update when the size of graph is huge > - > > Key: SPARK-6378 > URL: https://issues.apache.org/jira/browse/SPARK-6378 > Project: Spark > Issue Type: Bug > Components: GraphX > Affects Versions: 1.2.1 > Reporter: zhangzhenyue > Attachments: TripletsViewDonotUpdate.scala > > > when the size of the graph is huge (0.2 billion vertices, 6 billion edges), the > srcAttr and dstAttr in graph.triplets don't update when using > Graph.outerJoinVertices (when the data in the vertex is changed). > the code and the log are as follows: > {quote} > g = graph.outerJoinVertices()... > g.vertices.count() > g.edges.count() > println("example edge " + g.triplets.filter(e => e.srcId == > 51L).collect() > .map(e =>(e.srcId + ":" + e.srcAttr + ", " + e.dstId + ":" + > e.dstAttr)).mkString("\n")) > println("example vertex " + g.vertices.filter(e => e._1 == > 51L).collect() > .map(e => (e._1 + "," + e._2)).mkString("\n")) > {quote} > the result: > {quote} > example edge 51:0, 2467451620:61 > 51:0, 1962741310:83 // attr of vertex 51 is 0 in > Graph.triplets > example vertex 51,2 // attr of vertex 51 is 2 in > Graph.vertices > {quote} > when the graph is smaller (10 million vertices), the code is OK; the triplets > will update when the vertex is changed -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6378) srcAttr in graph.triplets don't update when the size of graph is huge
[ https://issues.apache.org/jira/browse/SPARK-6378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192333#comment-15192333 ] Zhaokang Wang commented on SPARK-6378: -- I have reproduced this problem in a small toy-graph demo of about 50 lines of code. I have described the details in my previous comment; please take a look at it if you have time. Thanks a lot! > srcAttr in graph.triplets don't update when the size of graph is huge > - > > Key: SPARK-6378 > URL: https://issues.apache.org/jira/browse/SPARK-6378 > Project: Spark > Issue Type: Bug > Components: GraphX > Affects Versions: 1.2.1 > Reporter: zhangzhenyue > Attachments: TripletsViewDonotUpdate.scala > > > when the size of the graph is huge (0.2 billion vertices, 6 billion edges), the > srcAttr and dstAttr in graph.triplets don't update when using > Graph.outerJoinVertices (when the data in the vertex is changed). > the code and the log are as follows: > {quote} > g = graph.outerJoinVertices()... > g.vertices.count() > g.edges.count() > println("example edge " + g.triplets.filter(e => e.srcId == > 51L).collect() > .map(e =>(e.srcId + ":" + e.srcAttr + ", " + e.dstId + ":" + > e.dstAttr)).mkString("\n")) > println("example vertex " + g.vertices.filter(e => e._1 == > 51L).collect() > .map(e => (e._1 + "," + e._2)).mkString("\n")) > {quote} > the result: > {quote} > example edge 51:0, 2467451620:61 > 51:0, 1962741310:83 // attr of vertex 51 is 0 in > Graph.triplets > example vertex 51,2 // attr of vertex 51 is 2 in > Graph.vertices > {quote} > when the graph is smaller (10 million vertices), the code is OK; the triplets > will update when the vertex is changed -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12216) Spark failed to delete temp directory
[ https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-12216. --- Resolution: Invalid > Spark failed to delete temp directory > -- > > Key: SPARK-12216 > URL: https://issues.apache.org/jira/browse/SPARK-12216 > Project: Spark > Issue Type: Bug > Components: Spark Shell > Environment: windows 7 64 bit > Spark 1.52 > Java 1.8.0.65 > PATH includes: > C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin > C:\ProgramData\Oracle\Java\javapath > C:\Users\Stefan\scala\bin > SYSTEM variables set are: > JAVA_HOME=C:\Program Files\Java\jre1.8.0_65 > HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin > (where the bin\winutils resides) > both \tmp and \tmp\hive have permissions > drwxrwxrwx as detected by winutils ls >Reporter: stefan >Priority: Minor > > The mailing list archives have no obvious solution to this: > scala> :q > Stopping spark context. > 15/12/08 16:24:22 ERROR ShutdownHookManager: Exception while deleting Spark > temp dir: > C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff > java.io.IOException: Failed to delete: > C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:60) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234) > at > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234) > at scala.util.Try$.apply(Try.scala:161) > at > org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216) > at > org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6378) srcAttr in graph.triplets don't update when the size of graph is huge
[ https://issues.apache.org/jira/browse/SPARK-6378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192332#comment-15192332 ] Zhaokang Wang edited comment on SPARK-6378 at 3/13/16 1:08 PM: --- I have met a similar problem with triplets updates in GraphX. I think I have a code demo that reproduces the situation in this issue, on a small toy graph with only 3 vertices. My demo code is attached as [^TripletsViewDonotUpdate.scala]. The key code is the following:
{code}
// purGraph is a toy graph with edges: 2->1, 3->1, 2->3.
val purGraph = Graph(dataVertex, dataEdge).persist()
purGraph.triplets.count() // this operation will cause the bug.
val inNeighborGraph = purGraph.collectNeighbors(EdgeDirection.In)
val dataGraph = purGraph.outerJoinVertices(inNeighborGraph)((vid, property, inNeighborList) => {... })
// dataGraph's vertices view and triplets view will be inconsistent on vertex 3's inNeighbor attribute.
dataGraph.vertices.foreach {...}
dataGraph.triplets.foreach {...}
{code}
We can see from the output that the two views are inconsistent on *vertex 3*'s {{inNeighbor}} property:
{quote}
> dataGraph.vertices
vid: 1, inNeighbor:2,3
vid: 3, inNeighbor:2
vid: 2, inNeighbor:
> dataGraph.triplets.srcAttr
vid: 2, inNeighbor:
vid: 2, inNeighbor:
vid: 3, inNeighbor:
{quote}
If we comment out the {{purGraph.triplets.count()}} statement in the code, the bug disappears:
{code}
val purGraph = Graph(dataVertex, dataEdge).persist()
// purGraph.triplets.count() // !!!comment this
val inNeighborGraph = purGraph.collectNeighbors(EdgeDirection.In)
// Now join the in-neighbor vertex id list to every vertex's property
val dataGraph = purGraph.outerJoinVertices(inNeighborGraph)((vid, property, inNeighborList) => {...})
{code}
It seems that the triplets view and the vertex view of the same graph may be inconsistent in some situations.

was (Author: bsidb): I have met a similar problem with triplets updates in GraphX. I think I have a code demo that reproduces the situation in this issue, on a small toy graph with only 3 vertices. My demo code is attached as [^TripletsViewDonotUpdate.scala]. The steps to reproduce the issue:
1. Construct a small graph ({{purGraph}} in the code) with only 3 vertices. The edges of the graph are: 2->1, 3->1, 2->3.
2. Conduct the collect-neighbors operation to get the {{inNeighborGraph}} of the {{purGraph}}.
3. Outer join the {{inNeighborGraph}} vertices on {{purGraph}} to get the {{dataGraph}}. In {{dataGraph}}, each vertex stores an ArrayBuffer of its in-neighbors' vertex ids.
4. Now examine the {{inNeighbor}} attribute in the {{dataGraph.vertices}} view and the {{dataGraph.triplets}} view. We can see from the output that the two views are inconsistent on vertex 3's {{inNeighbor}} property:
{quote}
> dataGraph.vertices
vid: 1, inNeighbor:2,3
vid: 3, inNeighbor:2
vid: 2, inNeighbor:
> dataGraph.triplets.srcAttr
vid: 2, inNeighbor:
vid: 2, inNeighbor:
vid: 3, inNeighbor:
{quote}
5. If we comment out the {{purGraph.triplets.count()}} statement in the code, the bug disappears:
{code}
val purGraph = Graph(dataVertex, dataEdge).persist()
// purGraph.triplets.count() // !!!comment this
val inNeighborGraph = purGraph.collectNeighbors(EdgeDirection.In)
// Now join the in-neighbor vertex id list to every vertex's property
val dataGraph = purGraph.outerJoinVertices(inNeighborGraph)((vid, property, inNeighborList) => {
  val inNeighborVertexIds = inNeighborList.getOrElse(Array[(VertexId, VertexProperty)]()).map(t => t._1)
  property.inNeighbor ++= inNeighborVertexIds.toBuffer
  property
})
{code}
It seems that the triplets view and the vertex view of the same graph may be inconsistent in some situations. > srcAttr in graph.triplets don't update when the size of graph is huge > - > > Key: SPARK-6378 > URL: https://issues.apache.org/jira/browse/SPARK-6378 > Project: Spark > Issue Type: Bug > Components: GraphX > Affects Versions: 1.2.1 > Reporter: zhangzhenyue > Attachments: TripletsViewDonotUpdate.scala > > > when the size of the graph is huge (0.2 billion vertices, 6 billion edges), the > srcAttr and dstAttr in graph.triplets don't update when using > Graph.outerJoinVertices (when the data in the vertex is changed). > the code and the log are as follows: > {quote} > g = graph.outerJoinVertices()... > g.vertices.count() > g.edges.count() > println("example edge " + g.triplets.filter(e => e.srcId == > 51L).collect() > .map(e =>(e.srcId + ":" + e.srcAttr + ", " + e.dstId + ":" + > e.dstAttr)).mkString("\n")) > println("example vertex " + g.vertices.filter(e => e._1 == > 51L).collect() > .map(e => (e._1 + "," + e._2)).mkString("\n")) > {quote} > the result: > {quote} > example edge 51:0, 2467451620:61 > 51:0, 1962741310:83 // attr of vertex 51 is 0 in > Graph.triplets > example vertex 51,2 // attr of vertex 51 is 2 in > Graph.vertices > {quote} > when the graph is smaller (10 million vertices), the code is OK; the triplets > will update when the vertex is changed -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13847) Defer the variable evaluation for Limit codegen
Liang-Chi Hsieh created SPARK-13847: --- Summary: Defer the variable evaluation for Limit codegen Key: SPARK-13847 URL: https://issues.apache.org/jira/browse/SPARK-13847 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh We can defer the variable evaluation for Limit codegen as we did in Project. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13847) Defer the variable evaluation for Limit codegen
[ https://issues.apache.org/jira/browse/SPARK-13847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13847: Assignee: Apache Spark > Defer the variable evaluation for Limit codegen > --- > > Key: SPARK-13847 > URL: https://issues.apache.org/jira/browse/SPARK-13847 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark > > We can defer the variable evaluation for Limit codegen as we did in Project. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13847) Defer the variable evaluation for Limit codegen
[ https://issues.apache.org/jira/browse/SPARK-13847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13847: Assignee: (was: Apache Spark) > Defer the variable evaluation for Limit codegen > --- > > Key: SPARK-13847 > URL: https://issues.apache.org/jira/browse/SPARK-13847 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > > We can defer the variable evaluation for Limit codegen as we did in Project. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13847) Defer the variable evaluation for Limit codegen
[ https://issues.apache.org/jira/browse/SPARK-13847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192371#comment-15192371 ] Apache Spark commented on SPARK-13847: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/11685 > Defer the variable evaluation for Limit codegen > --- > > Key: SPARK-13847 > URL: https://issues.apache.org/jira/browse/SPARK-13847 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > > We can defer the variable evaluation for Limit codegen as we did in Project. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-13847) Defer the variable evaluation for Limit codegen
[ https://issues.apache.org/jira/browse/SPARK-13847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh closed SPARK-13847. --- Resolution: Not A Problem > Defer the variable evaluation for Limit codegen > --- > > Key: SPARK-13847 > URL: https://issues.apache.org/jira/browse/SPARK-13847 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > > We can defer the variable evaluation for Limit codegen as we did in Project. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13825) Upgrade to Scala 2.11.8
[ https://issues.apache.org/jira/browse/SPARK-13825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192389#comment-15192389 ] Sean Owen commented on SPARK-13825: --- While you're at it, I think it's fine to update the 2.10 line to 2.10.6. It includes only a very minor fix: http://www.scala-lang.org/news/2.10.6/ > Upgrade to Scala 2.11.8 > --- > > Key: SPARK-13825 > URL: https://issues.apache.org/jira/browse/SPARK-13825 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Priority: Minor > > Scala 2.11.8 is out so...time to upgrade before 2.0.0 is out -> > http://www.scala-lang.org/news/2.11.8/. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13825) Upgrade to Scala 2.11.8
[ https://issues.apache.org/jira/browse/SPARK-13825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192496#comment-15192496 ] Jacek Laskowski commented on SPARK-13825: - I can do it in a separate task. The title would not reflect the change (and changing it to Upgrade to Scala 2.10.6 and 2.11.8 would only confuse users). WDYT? > Upgrade to Scala 2.11.8 > --- > > Key: SPARK-13825 > URL: https://issues.apache.org/jira/browse/SPARK-13825 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Priority: Minor > > Scala 2.11.8 is out so...time to upgrade before 2.0.0 is out -> > http://www.scala-lang.org/news/2.11.8/. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13825) Upgrade to Scala 2.11.8
[ https://issues.apache.org/jira/browse/SPARK-13825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192498#comment-15192498 ] Sean Owen commented on SPARK-13825: --- I don't think it's confusing to bump maintenance releases of two Scala versions instead of one? titles can be changed. > Upgrade to Scala 2.11.8 > --- > > Key: SPARK-13825 > URL: https://issues.apache.org/jira/browse/SPARK-13825 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Priority: Minor > > Scala 2.11.8 is out so...time to upgrade before 2.0.0 is out -> > http://www.scala-lang.org/news/2.11.8/. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13848) Upgrade to Py4J 0.9.2
Josh Rosen created SPARK-13848: -- Summary: Upgrade to Py4J 0.9.2 Key: SPARK-13848 URL: https://issues.apache.org/jira/browse/SPARK-13848 Project: Spark Issue Type: Bug Components: PySpark Reporter: Josh Rosen Assignee: Josh Rosen We should upgrade to Py4J 0.9.2 so that we can fix SPARK-6047 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6722) Model import/export for StreamingKMeansModel
[ https://issues.apache.org/jira/browse/SPARK-6722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192533#comment-15192533 ] Furkan KAMACI commented on SPARK-6722: -- Is there a suggested roadmap for how to implement this? > Model import/export for StreamingKMeansModel > > > Key: SPARK-6722 > URL: https://issues.apache.org/jira/browse/SPARK-6722 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley > > CC: [~freeman-lab] Is this API stable enough to merit adding import/export > (which will require supporting the model format version from now on)? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13784) Model export/import for spark.ml: RandomForests
[ https://issues.apache.org/jira/browse/SPARK-13784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192537#comment-15192537 ] Gayathri Murali commented on SPARK-13784: - I can work on this, if no one else has started > Model export/import for spark.ml: RandomForests > --- > > Key: SPARK-13784 > URL: https://issues.apache.org/jira/browse/SPARK-13784 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > This JIRA is for both RandomForestClassifier and RandomForestRegressor. The > implementation should reuse the one for DecisionTree*. > It should augment NodeData with a tree ID so that all nodes can be stored in > a single DataFrame. It should reconstruct the trees in a distributed fashion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
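For reference, the augmentation the SPARK-13784 description asks for could look roughly like the following. The names are a hypothetical sketch of the idea, not the final implementation:

{code}
// Sketch of the proposed storage layout: tag each node with the tree it belongs
// to so an entire forest fits in one DataFrame. Field names are illustrative.
case class NodeData(id: Int, prediction: Double, impurity: Double,
                    split: Option[(Int, Double)], leftChild: Int, rightChild: Int)
case class EnsembleNodeData(treeID: Int, node: NodeData)

// On save: flatten every tree into a Seq[EnsembleNodeData] and write one DataFrame.
// On load: group the rows by treeID and rebuild each tree, which can be done in a
// distributed fashion as the description suggests.
{code}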
[jira] [Resolved] (SPARK-13812) Fix SparkR lint-r test errors
[ https://issues.apache.org/jira/browse/SPARK-13812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-13812. --- Resolution: Fixed Assignee: Sun Rui Fix Version/s: 2.0.0 Resolved by https://github.com/apache/spark/pull/11652 > Fix SparkR lint-r test errors > - > > Key: SPARK-13812 > URL: https://issues.apache.org/jira/browse/SPARK-13812 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.1 >Reporter: Sun Rui >Assignee: Sun Rui > Fix For: 2.0.0 > > > After getting updated from GitHub, the lintr package can detect errors that were > not detected by previous versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6047) pyspark - class loading on driver failing with --jars and --packages
[ https://issues.apache.org/jira/browse/SPARK-6047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192540#comment-15192540 ] Josh Rosen commented on SPARK-6047: --- Resolving this as a duplicate of SPARK-5185, since I'm about to open a PR to fix that issue. > pyspark - class loading on driver failing with --jars and --packages > > > Key: SPARK-6047 > URL: https://issues.apache.org/jira/browse/SPARK-6047 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Submit >Affects Versions: 1.3.0 >Reporter: Burak Yavuz > > Because py4j uses the system ClassLoader instead of the contextClassLoader of > the thread, the dynamically added jars in Spark Submit can't be loaded in the > driver. > This causes `Py4JError: Trying to call a package` errors. > Usually `--packages` are downloaded from some remote repo before runtime, > adding them explicitly to `--driver-class-path` is not an option, like we can > do with `--jars`. One solution is to move the fetching of `--packages` to the > SparkSubmitDriverBootstrapper, and add it to the driver class-path there. > A more complete solution can be achieved through [SPARK-4924]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6047) pyspark - class loading on driver failing with --jars and --packages
[ https://issues.apache.org/jira/browse/SPARK-6047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-6047. --- Resolution: Duplicate > pyspark - class loading on driver failing with --jars and --packages > > > Key: SPARK-6047 > URL: https://issues.apache.org/jira/browse/SPARK-6047 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Submit >Affects Versions: 1.3.0 >Reporter: Burak Yavuz > > Because py4j uses the system ClassLoader instead of the contextClassLoader of > the thread, the dynamically added jars in Spark Submit can't be loaded in the > driver. > This causes `Py4JError: Trying to call a package` errors. > Usually `--packages` are downloaded from some remote repo before runtime, > adding them explicitly to `--driver-class-path` is not an option, like we can > do with `--jars`. One solution is to move the fetching of `--packages` to the > SparkSubmitDriverBootstrapper, and add it to the driver class-path there. > A more complete solution can be achieved through [SPARK-4924]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5185) pyspark --jars does not add classes to driver class path
[ https://issues.apache.org/jira/browse/SPARK-5185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-5185: - Assignee: Josh Rosen (was: Andrew Or) > pyspark --jars does not add classes to driver class path > > > Key: SPARK-5185 > URL: https://issues.apache.org/jira/browse/SPARK-5185 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.2.0 >Reporter: Uri Laserson >Assignee: Josh Rosen > > I have some random class I want access to from an Spark shell, say > {{com.cloudera.science.throwaway.ThrowAway}}. You can find the specific > example I used here: > https://gist.github.com/laserson/e9e3bd265e1c7a896652 > I packaged it as {{throwaway.jar}}. > If I then run {{bin/spark-shell}} like so: > {code} > bin/spark-shell --master local[1] --jars throwaway.jar > {code} > I can execute > {code} > val a = new com.cloudera.science.throwaway.ThrowAway() > {code} > Successfully. > I now run PySpark like so: > {code} > PYSPARK_DRIVER_PYTHON=ipython bin/pyspark --master local[1] --jars > throwaway.jar > {code} > which gives me an error when I try to instantiate the class through Py4J: > {code} > In [1]: sc._jvm.com.cloudera.science.throwaway.ThrowAway() > --- > Py4JError Traceback (most recent call last) > in () > > 1 sc._jvm.com.cloudera.science.throwaway.ThrowAway() > /Users/laserson/repos/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py > in __getattr__(self, name) > 724 def __getattr__(self, name): > 725 if name == '__call__': > --> 726 raise Py4JError('Trying to call a package.') > 727 new_fqn = self._fqn + '.' + name > 728 command = REFLECTION_COMMAND_NAME +\ > Py4JError: Trying to call a package. > {code} > However, if I explicitly add the {{--driver-class-path}} to add the same jar > {code} > PYSPARK_DRIVER_PYTHON=ipython bin/pyspark --master local[1] --jars > throwaway.jar --driver-class-path throwaway.jar > {code} > it works > {code} > In [1]: sc._jvm.com.cloudera.science.throwaway.ThrowAway() > Out[1]: JavaObject id=o18 > {code} > However, the docs state that {{--jars}} should also set the driver class path. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5185) pyspark --jars does not add classes to driver class path
[ https://issues.apache.org/jira/browse/SPARK-5185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5185: -- Target Version/s: 2.0.0 > pyspark --jars does not add classes to driver class path > > > Key: SPARK-5185 > URL: https://issues.apache.org/jira/browse/SPARK-5185 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.2.0 >Reporter: Uri Laserson >Assignee: Josh Rosen > > I have some random class I want access to from an Spark shell, say > {{com.cloudera.science.throwaway.ThrowAway}}. You can find the specific > example I used here: > https://gist.github.com/laserson/e9e3bd265e1c7a896652 > I packaged it as {{throwaway.jar}}. > If I then run {{bin/spark-shell}} like so: > {code} > bin/spark-shell --master local[1] --jars throwaway.jar > {code} > I can execute > {code} > val a = new com.cloudera.science.throwaway.ThrowAway() > {code} > Successfully. > I now run PySpark like so: > {code} > PYSPARK_DRIVER_PYTHON=ipython bin/pyspark --master local[1] --jars > throwaway.jar > {code} > which gives me an error when I try to instantiate the class through Py4J: > {code} > In [1]: sc._jvm.com.cloudera.science.throwaway.ThrowAway() > --- > Py4JError Traceback (most recent call last) > in () > > 1 sc._jvm.com.cloudera.science.throwaway.ThrowAway() > /Users/laserson/repos/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py > in __getattr__(self, name) > 724 def __getattr__(self, name): > 725 if name == '__call__': > --> 726 raise Py4JError('Trying to call a package.') > 727 new_fqn = self._fqn + '.' + name > 728 command = REFLECTION_COMMAND_NAME +\ > Py4JError: Trying to call a package. > {code} > However, if I explicitly add the {{--driver-class-path}} to add the same jar > {code} > PYSPARK_DRIVER_PYTHON=ipython bin/pyspark --master local[1] --jars > throwaway.jar --driver-class-path throwaway.jar > {code} > it works > {code} > In [1]: sc._jvm.com.cloudera.science.throwaway.ThrowAway() > Out[1]: JavaObject id=o18 > {code} > However, the docs state that {{--jars}} should also set the driver class path. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13848) Upgrade to Py4J 0.9.2
[ https://issues.apache.org/jira/browse/SPARK-13848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192543#comment-15192543 ] Apache Spark commented on SPARK-13848: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/11687 > Upgrade to Py4J 0.9.2 > - > > Key: SPARK-13848 > URL: https://issues.apache.org/jira/browse/SPARK-13848 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Josh Rosen >Assignee: Josh Rosen > > We should upgrade to Py4J 0.9.2 so that we can fix SPARK-6047 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5185) pyspark --jars does not add classes to driver class path
[ https://issues.apache.org/jira/browse/SPARK-5185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192544#comment-15192544 ] Apache Spark commented on SPARK-5185: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/11687 > pyspark --jars does not add classes to driver class path > > > Key: SPARK-5185 > URL: https://issues.apache.org/jira/browse/SPARK-5185 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.2.0 >Reporter: Uri Laserson >Assignee: Josh Rosen > > I have some random class I want access to from an Spark shell, say > {{com.cloudera.science.throwaway.ThrowAway}}. You can find the specific > example I used here: > https://gist.github.com/laserson/e9e3bd265e1c7a896652 > I packaged it as {{throwaway.jar}}. > If I then run {{bin/spark-shell}} like so: > {code} > bin/spark-shell --master local[1] --jars throwaway.jar > {code} > I can execute > {code} > val a = new com.cloudera.science.throwaway.ThrowAway() > {code} > Successfully. > I now run PySpark like so: > {code} > PYSPARK_DRIVER_PYTHON=ipython bin/pyspark --master local[1] --jars > throwaway.jar > {code} > which gives me an error when I try to instantiate the class through Py4J: > {code} > In [1]: sc._jvm.com.cloudera.science.throwaway.ThrowAway() > --- > Py4JError Traceback (most recent call last) > in () > > 1 sc._jvm.com.cloudera.science.throwaway.ThrowAway() > /Users/laserson/repos/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py > in __getattr__(self, name) > 724 def __getattr__(self, name): > 725 if name == '__call__': > --> 726 raise Py4JError('Trying to call a package.') > 727 new_fqn = self._fqn + '.' + name > 728 command = REFLECTION_COMMAND_NAME +\ > Py4JError: Trying to call a package. > {code} > However, if I explicitly add the {{--driver-class-path}} to add the same jar > {code} > PYSPARK_DRIVER_PYTHON=ipython bin/pyspark --master local[1] --jars > throwaway.jar --driver-class-path throwaway.jar > {code} > it works > {code} > In [1]: sc._jvm.com.cloudera.science.throwaway.ThrowAway() > Out[1]: JavaObject id=o18 > {code} > However, the docs state that {{--jars}} should also set the driver class path. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13335) Optimize Data Frames collect_list and collect_set with declarative aggregates
[ https://issues.apache.org/jira/browse/SPARK-13335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192570#comment-15192570 ] Apache Spark commented on SPARK-13335: -- User 'mccheah' has created a pull request for this issue: https://github.com/apache/spark/pull/11688 > Optimize Data Frames collect_list and collect_set with declarative aggregates > - > > Key: SPARK-13335 > URL: https://issues.apache.org/jira/browse/SPARK-13335 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Matt Cheah >Priority: Minor > > Based on discussion from SPARK-9301, we can optimize collect_set and > collect_list with declarative aggregate expressions, as opposed to using Hive > UDAFs. The problem with Hive UDAFs is that they require converting the data > items from catalyst types back to external types repeatedly. We can get > around this by implementing declarative aggregate expressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
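For context, these are the standard DataFrame calls SPARK-13335 targets (shown for illustration; {{df}} and the column names are placeholders for this example):

{code}
// Illustration: the aggregate functions this issue proposes to optimize.
import org.apache.spark.sql.functions.{collect_list, collect_set}

val grouped = df.groupBy("userId")
  .agg(collect_set("itemId"), collect_list("eventTime"))
{code}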
[jira] [Assigned] (SPARK-13335) Optimize Data Frames collect_list and collect_set with declarative aggregates
[ https://issues.apache.org/jira/browse/SPARK-13335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13335: Assignee: (was: Apache Spark) > Optimize Data Frames collect_list and collect_set with declarative aggregates > - > > Key: SPARK-13335 > URL: https://issues.apache.org/jira/browse/SPARK-13335 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Matt Cheah >Priority: Minor > > Based on discussion from SPARK-9301, we can optimize collect_set and > collect_list with declarative aggregate expressions, as opposed to using Hive > UDAFs. The problem with Hive UDAFs is that they require converting the data > items from catalyst types back to external types repeatedly. We can get > around this by implementing declarative aggregate expressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13335) Optimize Data Frames collect_list and collect_set with declarative aggregates
[ https://issues.apache.org/jira/browse/SPARK-13335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13335: Assignee: Apache Spark > Optimize Data Frames collect_list and collect_set with declarative aggregates > - > > Key: SPARK-13335 > URL: https://issues.apache.org/jira/browse/SPARK-13335 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Matt Cheah >Assignee: Apache Spark >Priority: Minor > > Based on discussion from SPARK-9301, we can optimize collect_set and > collect_list with declarative aggregate expressions, as opposed to using Hive > UDAFs. The problem with Hive UDAFs is that they require converting the data > items from catalyst types back to external types repeatedly. We can get > around this by implementing declarative aggregate expressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13335) Optimize Data Frames collect_list and collect_set with declarative aggregates
[ https://issues.apache.org/jira/browse/SPARK-13335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192588#comment-15192588 ] Ruslan Dautkhanov commented on SPARK-13335: --- It would be great to have this optimization in. In our workflow we use Spark in 95% of cases; for COLLECT_SET, however, we switch to Hive because it works fine there, while in Spark 1.5 the executors keep running out of memory even with 8 GB each. The workload is an 11-billion-record dataset OUTER JOINed to a ~1-billion-record dataset. > Optimize Data Frames collect_list and collect_set with declarative aggregates > - > > Key: SPARK-13335 > URL: https://issues.apache.org/jira/browse/SPARK-13335 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Matt Cheah >Priority: Minor > > Based on discussion from SPARK-9301, we can optimize collect_set and > collect_list with declarative aggregate expressions, as opposed to using Hive > UDAFs. The problem with Hive UDAFs is that they require converting the data > items from catalyst types back to external types repeatedly. We can get > around this by implementing declarative aggregate expressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13849) REGEX Column Specification
Ruslan Dautkhanov created SPARK-13849: - Summary: REGEX Column Specification Key: SPARK-13849 URL: https://issues.apache.org/jira/browse/SPARK-13849 Project: Spark Issue Type: Wish Components: Spark Core Reporter: Ruslan Dautkhanov Priority: Critical It would be great to have feature parity with Hive on REGEX column specification: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-REGEXColumnSpecification A SELECT statement can take a regex-based column specification in Hive: {code:sql} SELECT `(ds|hr)?+.+` FROM sales {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
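Until such a feature lands, a rough equivalent is possible with the DataFrame API (a sketch, under the assumption that filtering column names on the client side is acceptable; it is not a parser-level feature like Hive's):

{code}
// Emulating Hive's regex column specification by filtering df.columns.
import org.apache.spark.sql.functions.col

val pattern = "(ds|hr)?+.+" // Hive-style: select everything except ds and hr
val selected = df.select(df.columns.filter(_.matches(pattern)).map(col): _*)
{code}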
[jira] [Resolved] (SPARK-13834) Update sbt and sbt plugins for 2.x.
[ https://issues.apache.org/jira/browse/SPARK-13834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-13834. - Resolution: Fixed Assignee: Dongjoon Hyun (was: Apache Spark) Fix Version/s: 2.0.0 > Update sbt and sbt plugins for 2.x. > --- > > Key: SPARK-13834 > URL: https://issues.apache.org/jira/browse/SPARK-13834 > Project: Spark > Issue Type: Improvement >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.0.0 > > > For 2.0.0, we should bring sbt and the sbt plugins up to date. This PR checks > the status of each plugin and bumps the following: > * sbt: 0.13.9 --> 0.13.11 > * sbteclipse-plugin: 2.2.0 --> 4.0.0 > * sbt-dependency-graph: 0.7.4 --> 0.8.2 > * sbt-mima-plugin: 0.1.6 --> 0.1.9 > * sbt-revolver: 0.7.2 --> 0.8.0 > All other plugins are up-to-date. (Note that sbt-avro seems to have changed from > 0.3.2 to 1.0.1, but it's not published in the repository.) > During the upgrade, this PR also updated the following MiMa exclusion. Note that the > related excluding filter is already registered correctly. This seems to be due to a > change in the problem type that MiMa reports. > {code:title=project/build.properties|borderStyle=solid} > // SPARK-12896 Send only accumulator updates to driver, not TaskMetrics > > ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.Accumulable.this"), > -ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.Accumulator.this"), > +ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.Accumulator.this"), > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13845) BlockStatus and StreamBlockId keep on growing result driver OOM
[ https://issues.apache.org/jira/browse/SPARK-13845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jeanlyn updated SPARK-13845: Summary: BlockStatus and StreamBlockId keep on growing result driver OOM (was: Driver OOM after few days when running streaming) > BlockStatus and StreamBlockId keep on growing result driver OOM > --- > > Key: SPARK-13845 > URL: https://issues.apache.org/jira/browse/SPARK-13845 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.2, 1.6.1 >Reporter: jeanlyn > > We have a streaming job using *FlumePollInputStream* whose driver always OOMs after > a few days; here is some driver heap dump output before the OOM > {noformat} > num #instances #bytes class name > -- >1: 13845916 553836640 org.apache.spark.storage.BlockStatus >2: 14020324 336487776 org.apache.spark.storage.StreamBlockId >3: 13883881 333213144 scala.collection.mutable.DefaultEntry >4: 8907 89043952 [Lscala.collection.mutable.HashEntry; >5: 62360 65107352 [B >6:163368 24453904 [Ljava.lang.Object; >7:293651 20342664 [C > ... > {noformat} > *BlockStatus* and *StreamBlockId* keep on growing, and the driver OOMs in the > end. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13724) Parameter maxMemoryInMB has gone missing in MlLib 1.6.0 DecisionTree.trainClassifier()
[ https://issues.apache.org/jira/browse/SPARK-13724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192718#comment-15192718 ] senthil gandhi commented on SPARK-13724: I was able to replicate this; here is the exception. How does one deal with this from the Python API? --- py4j.protocol.Py4JJavaError: An error occurred while calling o163.trainDecisionTreeModel. : java.lang.IllegalArgumentException: requirement failed: RandomForest/DecisionTree given maxMemoryInMB = 256, which is too small for the given features. Minimum value = 446 at scala.Predef$.require(Predef.scala:233) at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:186) at org.apache.spark.mllib.tree.DecisionTree.run(DecisionTree.scala:60) at org.apache.spark.mllib.tree.DecisionTree$.train(DecisionTree.scala:86) at org.apache.spark.mllib.api.python.PythonMLLibAPI.trainDecisionTreeModel(PythonMLLibAPI.scala:748) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > Parameter maxMemoryInMB has gone missing in MlLib 1.6.0 > DecisionTree.trainClassifier() > -- > > Key: SPARK-13724 > URL: https://issues.apache.org/jira/browse/SPARK-13724 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.0 >Reporter: senthil gandhi > > DecisionTree.trainClassifier() reports that maxMemoryInMB is too small during > training and stops. But when I tried to set it, I found that in MLlib of Spark > 1.6.0, pyspark.mllib.tree.DecisionTree no longer has this parameter in the > named parameter list. > (Also not sure if this is the place for this issue, kindly educate!) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
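On the Scala side, {{maxMemoryInMB}} is reachable through {{Strategy}}, so one possible workaround is to train via the strategy-based API; whether an equivalent knob can be exposed to PySpark is the open question in this ticket. A sketch of the Scala-side workaround:

{code}
// Scala-side workaround sketch: Strategy exposes maxMemoryInMB directly.
// trainingData is a placeholder RDD[LabeledPoint]; other parameters are examples.
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.configuration.{Algo, Strategy}
import org.apache.spark.mllib.tree.impurity.Gini

val strategy = new Strategy(algo = Algo.Classification, impurity = Gini,
  maxDepth = 5, numClasses = 2, maxMemoryInMB = 512)
val model = DecisionTree.train(trainingData, strategy)
{code}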
[jira] [Resolved] (SPARK-13823) Always specify Charset in String <-> byte[] conversions (and remaining Coverity items)
[ https://issues.apache.org/jira/browse/SPARK-13823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-13823. - Resolution: Fixed Fix Version/s: 2.0.0 > Always specify Charset in String <-> byte[] conversions (and remaining > Coverity items) > -- > > Key: SPARK-13823 > URL: https://issues.apache.org/jira/browse/SPARK-13823 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL, Streaming >Affects Versions: 2.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > Fix For: 2.0.0 > > > Most of the remaining items from the last Coverity scan concern using, for > example, the constructor {{new String(byte[])}} or the method > {{String.getBytes()}}, or similarly for constructors of {{InputStreamReader}} > and {{OutputStreamWriter}}. These use the platform default encoding, which > means their behavior may change in different locales, which should be > undesirable in all cases in Spark. > It makes sense to specify UTF-8 as the default everywhere; where already > specified, it's UTF-8 in 95% of cases. A few tests set US-ASCII, but UTF-8 is > a superset. > We should also consistently use {{StandardCharsets.UTF_8}} rather than > "UTF-8" or Guava's {{Charsets.UTF_8}} to specify this. > (Finally, we should touch up the other few remaining Coverity scan items, > which are trivial, while we're here.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
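Concretely, the pattern SPARK-13823 standardizes on looks like the following (shown for illustration):

{code}
// Always pass an explicit charset so behavior does not depend on the platform locale.
import java.nio.charset.StandardCharsets

val bytes = "héllo".getBytes(StandardCharsets.UTF_8)  // not "héllo".getBytes()
val str   = new String(bytes, StandardCharsets.UTF_8) // not new String(bytes)
{code}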
[jira] [Commented] (SPARK-13249) Filter null keys for inner join
[ https://issues.apache.org/jira/browse/SPARK-13249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192723#comment-15192723 ] Liang-Chi Hsieh commented on SPARK-13249: - I think this can be closed? > Filter null keys for inner join > --- > > Key: SPARK-13249 > URL: https://issues.apache.org/jira/browse/SPARK-13249 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu > > For inner join, the join key with null in it will not match each other, so we > could insert a Filter before inner join (could be pushed down), then we don't > need to check nullability of keys while joining. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
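The rewrite proposed in SPARK-13249 is conceptually the following (an illustration of the idea, not the optimizer rule itself; {{a}} and {{b}} are placeholder DataFrames with join key column "k"):

{code}
// An inner equi-join can never match rows whose key is null, so filtering
// nulls out of both sides first is semantics-preserving and can be pushed down.
import org.apache.spark.sql.functions.col

val pruned = a.filter(col("k").isNotNull)
  .join(b.filter(col("k").isNotNull), a("k") === b("k"))
// same result as a.join(b, a("k") === b("k")), minus null-key checks at join time
{code}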
[jira] [Updated] (SPARK-13764) Parse modes in JSON data source
[ https://issues.apache.org/jira/browse/SPARK-13764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-13764: - Description: Currently, the JSON data source just fails to read if some JSON documents are malformed. Therefore, if there are the two JSON documents below: {noformat} { "request": { "user": { "id": 123 } } } {noformat} {noformat} { "request": { "user": [] } } {noformat} This will fail, emitting the exception below: {noformat} Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: Lost task 7.3 in stage 0.0 (TID 10, 192.168.1.170): java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:50) at org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:247) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source) at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67) at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67) at org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:117) at org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:115) at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:97) at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119) at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {noformat} So, just like the parse modes in the CSV data source (see https://github.com/databricks/spark-csv), it would be great if there were some parse modes so that users do not have to filter or pre-process the data themselves. This happens only when a custom schema is set. When the schema is inferred, the field is inferred as {{StringType}}, which reads the data successfully anyway. was: Currently, JSON data source just fails to read if some JSON documents are malformed. 
Therefore, if there are two JSON documents below: {noformat} { "request": { "user": { "id": 123 } } } {noformat} {noformat} { "request": { "user": [] } } {noformat} This will fail emitting the exception below : {noformat} Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: Lost task 7.3 in stage 0.0 (TID 10, 192.168.1.170): java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:50) at org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:247) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source) at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67) at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67)
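A sketch of what the parse-mode API requested in SPARK-13764 could look like, mirroring spark-csv's options (the option name and values below are assumptions about the proposal, not an implemented API at the time of this ticket):

{code}
// Hypothetical parse-mode option for the JSON source, modeled on spark-csv:
// PERMISSIVE (null out bad fields), DROPMALFORMED (skip bad docs), FAILFAST.
val df = sqlContext.read
  .schema(customSchema)
  .option("mode", "DROPMALFORMED")
  .json("path/to/documents.json")
{code}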
[jira] [Created] (SPARK-13850) TimSort Comparison method violates its general contract
Sital Kedia created SPARK-13850: --- Summary: TimSort Comparison method violates its general contract Key: SPARK-13850 URL: https://issues.apache.org/jira/browse/SPARK-13850 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.6.0 Reporter: Sital Kedia While running a query which does a group by on a large dataset, the query fails with following stack trace. ``` Job aborted due to stage failure: Task 4077 in stage 1.3 failed 4 times, most recent failure: Lost task 4077.3 in stage 1.3 (TID 88702, hadoop3030.prn2.facebook.com): java.lang.IllegalArgumentException: Comparison method violates its general contract! at org.apache.spark.util.collection.TimSort$SortState.mergeLo(TimSort.java:794) at org.apache.spark.util.collection.TimSort$SortState.mergeAt(TimSort.java:525) at org.apache.spark.util.collection.TimSort$SortState.mergeCollapse(TimSort.java:453) at org.apache.spark.util.collection.TimSort$SortState.access$200(TimSort.java:325) at org.apache.spark.util.collection.TimSort.sort(TimSort.java:153) at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37) at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:228) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:186) at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:175) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:249) at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:112) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:318) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:333) at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:91) at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168) at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90) at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) ``` Please note that the same query used to succeed in Spark 1.5 so it seems like a regression in 1.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
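For readers unfamiliar with the error in SPARK-13850: TimSort throws it when the supplied ordering is internally inconsistent (violating antisymmetry or transitivity), or when the data being compared changes underneath the sort. A classic illustration, not Spark code:

{code}
// Illustration only: the subtraction comparator overflows, so
// compare(Int.MinValue, 1) > 0 ("MinValue greater than 1") while 1 > -1 and
// -1 > Int.MinValue also hold, a cycle that violates transitivity and that
// TimSort can detect at merge time with exactly this IllegalArgumentException.
val broken = new Ordering[Int] { def compare(a: Int, b: Int): Int = a - b }
// Seq(Int.MinValue, 1, -1).sorted(broken) // may throw:
// java.lang.IllegalArgumentException: Comparison method violates its general contract!
{code}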
[jira] [Updated] (SPARK-13850) TimSort Comparison method violates its general contract
[ https://issues.apache.org/jira/browse/SPARK-13850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sital Kedia updated SPARK-13850: Description: While running a query which does a group by on a large dataset, the query fails with following stack trace. {code} Job aborted due to stage failure: Task 4077 in stage 1.3 failed 4 times, most recent failure: Lost task 4077.3 in stage 1.3 (TID 88702, hadoop3030.prn2.facebook.com): java.lang.IllegalArgumentException: Comparison method violates its general contract! at org.apache.spark.util.collection.TimSort$SortState.mergeLo(TimSort.java:794) at org.apache.spark.util.collection.TimSort$SortState.mergeAt(TimSort.java:525) at org.apache.spark.util.collection.TimSort$SortState.mergeCollapse(TimSort.java:453) at org.apache.spark.util.collection.TimSort$SortState.access$200(TimSort.java:325) at org.apache.spark.util.collection.TimSort.sort(TimSort.java:153) at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37) at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:228) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:186) at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:175) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:249) at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:112) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:318) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:333) at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:91) at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168) at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90) at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} Please note that the same query used to succeed in Spark 1.5 so it seems like a regression in 1.6. was: While running a query which does a group by on a large dataset, the query fails with following stack trace. ``` Job aborted due to stage failure: Task 4077 in stage 1.3 failed 4 times, most recent failure: Lost task 4077.3 in stage 1.3 (TID 88702, hadoop3030.prn2.facebook.com): java.lang.IllegalArgumentException: Comparison method violates its general contract! at org.apache.spark.util.collection.TimSort$SortState.mergeLo(TimSort.java:794) at org.apache.spark.util.collection.TimSort$SortState.mergeAt(TimSort.java:525) at org.apache.spark.util.collection.TimSort$SortState.mergeCollapse(TimSort.java:453) at org.apache.spark.util.collection.TimSort$SortState.access$200(TimSort.java:325) at org.apache.spark.util.collection.TimSort.sort(TimSort.java:153) at org.apache.spark.util.col
[jira] [Created] (SPARK-13851) spark streaming web ui remains completed jobs as active jobs
t7s created SPARK-13851: --- Summary: spark streaming web ui remains completed jobs as active jobs Key: SPARK-13851 URL: https://issues.apache.org/jira/browse/SPARK-13851 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.1 Environment: yarn 2.3.0-cdh5.1.0 Reporter: t7s Priority: Minor !http://apache-spark-user-list.1001560.n3.nabble.com/file/n26474/34%E5%B1%8F%E5%B9%95%E6%88%AA%E5%9B%BE.png! env: yarn 2.3.0-cdh5.1.0 --master yarn-cluster --conf "spark.driver.cores=2" --conf "spark.akka.threads=16" I am sure these jobs completed according to the log. For example, job 9816. This is the info about job 9816 in the log: [2016-03-14 07:15:05,088] INFO Job 9816 finished: transform at ErrorStreaming2.scala:396, took 8.218500 s (org.apache.spark.scheduler.DAGScheduler) The stages in job 9816 are completed too, according to the log. But job 9816 is still shown as an active job in the web UI. Why? How can I clear these remaining jobs? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13851) spark streaming web ui remains completed jobs as active jobs
[ https://issues.apache.org/jira/browse/SPARK-13851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] t7s updated SPARK-13851: Attachment: try.png > spark streaming web ui remains completed jobs as active jobs > > > Key: SPARK-13851 > URL: https://issues.apache.org/jira/browse/SPARK-13851 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.4.1 > Environment: yarn 2.3.0-cdh5.1.0 >Reporter: t7s >Priority: Minor > Attachments: try.png > > > !http://apache-spark-user-list.1001560.n3.nabble.com/file/n26474/34%E5%B1%8F%E5%B9%95%E6%88%AA%E5%9B%BE.png! > env: yarn 2.3.0-cdh5.1.0 > --master yarn-cluster > --conf "spark.driver.cores=2" > --conf "spark.akka.threads=16" > I am sure these jobs completed according to the log. > For example, job 9816. > This is the info about job 9816 in the log: > [2016-03-14 07:15:05,088] INFO Job 9816 finished: transform at > ErrorStreaming2.scala:396, took 8.218500 s > (org.apache.spark.scheduler.DAGScheduler) > The stages in job 9816 are completed too, according to the log. > But job 9816 is still shown as an active job in the web UI. Why? > How can I clear these remaining jobs? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13851) spark streaming web ui remains completed jobs as active jobs
[ https://issues.apache.org/jira/browse/SPARK-13851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] t7s updated SPARK-13851: Description: !try.png! env: yarn 2.3.0-cdh5.1.0 --master yarn-cluster --conf "spark.driver.cores=2" --conf "spark.akka.threads=16" I am sure these jobs completed according to the log. For example, job 9816. This is the info about job 9816 in the log: [2016-03-14 07:15:05,088] INFO Job 9816 finished: transform at ErrorStreaming2.scala:396, took 8.218500 s (org.apache.spark.scheduler.DAGScheduler) The stages in job 9816 are completed too, according to the log. But job 9816 is still shown as an active job in the web UI. Why? How can I clear these remaining jobs? was: !http://apache-spark-user-list.1001560.n3.nabble.com/file/n26474/34%E5%B1%8F%E5%B9%95%E6%88%AA%E5%9B%BE.png! env: yarn 2.3.0-cdh5.1.0 --master yarn-cluster --conf "spark.driver.cores=2" --conf "spark.akka.threads=16" I am sure these jobs completed according to the log. For example, job 9816. This is the info about job 9816 in the log: [2016-03-14 07:15:05,088] INFO Job 9816 finished: transform at ErrorStreaming2.scala:396, took 8.218500 s (org.apache.spark.scheduler.DAGScheduler) The stages in job 9816 are completed too, according to the log. But job 9816 is still shown as an active job in the web UI. Why? How can I clear these remaining jobs? > spark streaming web ui remains completed jobs as active jobs > > > Key: SPARK-13851 > URL: https://issues.apache.org/jira/browse/SPARK-13851 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.4.1 > Environment: yarn 2.3.0-cdh5.1.0 >Reporter: t7s >Priority: Minor > Attachments: try.png > > > !try.png! > env: yarn 2.3.0-cdh5.1.0 > --master yarn-cluster > --conf "spark.driver.cores=2" > --conf "spark.akka.threads=16" > I am sure these jobs completed according to the log. > For example, job 9816. > This is the info about job 9816 in the log: > [2016-03-14 07:15:05,088] INFO Job 9816 finished: transform at > ErrorStreaming2.scala:396, took 8.218500 s > (org.apache.spark.scheduler.DAGScheduler) > The stages in job 9816 are completed too, according to the log. > But job 9816 is still shown as an active job in the web UI. Why? > How can I clear these remaining jobs? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13851) spark streaming web ui remains completed jobs as active jobs
[ https://issues.apache.org/jira/browse/SPARK-13851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] t7s updated SPARK-13851: Priority: Major (was: Minor) > spark streaming web ui remains completed jobs as active jobs > > > Key: SPARK-13851 > URL: https://issues.apache.org/jira/browse/SPARK-13851 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.4.1 > Environment: yarn 2.3.0-cdh5.1.0 >Reporter: t7s > Attachments: try.png > > > !try.png! > env: yarn 2.3.0-cdh5.1.0 > --master yarn-cluster > --conf "spark.driver.cores=2" > --conf "spark.akka.threads=16" > I am sure these jobs completed according to the log. > For example, job 9816. > This is the info about job 9816 in the log: > [2016-03-14 07:15:05,088] INFO Job 9816 finished: transform at > ErrorStreaming2.scala:396, took 8.218500 s > (org.apache.spark.scheduler.DAGScheduler) > The stages in job 9816 are completed too, according to the log. > But job 9816 is still shown as an active job in the web UI. Why? > How can I clear these remaining jobs? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13618) Make Streaming web UI display rate-limit lines in the statistics graph
[ https://issues.apache.org/jira/browse/SPARK-13618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liwei Lin updated SPARK-13618: -- Description: This JIRA proposes to make the Streaming web UI display rate-limit lines in the statistics graph. Please see this design doc for details: https://docs.google.com/document/d/1kGXQEcToNglDK-AbyEJGoYX9wwODimUVf0PetoxuU5M/edit [Screenshots] without back pressure: !https://cloud.githubusercontent.com/assets/15843379/13664195/d2264c48-e6e0-11e5-85e6-f13187d4cbde.png! with back pressure: !https://cloud.githubusercontent.com/assets/15843379/13664196/d2549c7e-e6e0-11e5-9f62-d7f1458f1c27.png! was: This JIRA proposes to make the Streaming web UI display rate-limit lines in the statistics graph. Please see the enclosed design doc for details. [Screenshots] without back pressure: !https://cloud.githubusercontent.com/assets/15843379/13664195/d2264c48-e6e0-11e5-85e6-f13187d4cbde.png! with back pressure: !https://cloud.githubusercontent.com/assets/15843379/13664196/d2549c7e-e6e0-11e5-9f62-d7f1458f1c27.png! > Make Streaming web UI display rate-limit lines in the statistics graph > -- > > Key: SPARK-13618 > URL: https://issues.apache.org/jira/browse/SPARK-13618 > Project: Spark > Issue Type: Improvement > Components: Streaming, Web UI >Affects Versions: 1.6.0, 2.0.0 >Reporter: Liwei Lin > Attachments: Spark-13618_Design_Doc_v1.pdf > > > This JIRA proposes to make the Streaming web UI display rate-limit lines in the > statistics graph. > Please see this design doc for details: > https://docs.google.com/document/d/1kGXQEcToNglDK-AbyEJGoYX9wwODimUVf0PetoxuU5M/edit > [Screenshots] > without back pressure: > !https://cloud.githubusercontent.com/assets/15843379/13664195/d2264c48-e6e0-11e5-85e6-f13187d4cbde.png! > with back pressure: > !https://cloud.githubusercontent.com/assets/15843379/13664196/d2549c7e-e6e0-11e5-9f62-d7f1458f1c27.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13642) Properly handle signal kill of ApplicationMaster
[ https://issues.apache.org/jira/browse/SPARK-13642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192828#comment-15192828 ] Apache Spark commented on SPARK-13642: -- User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/11690 > Properly handle signal kill of ApplicationMaster > > > Key: SPARK-13642 > URL: https://issues.apache.org/jira/browse/SPARK-13642 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.6.0 >Reporter: Saisai Shao >Assignee: Saisai Shao > > Currently when running Spark on Yarn with yarn cluster mode, the default > application final state is "SUCCEEDED"; if any exception occurs, this > final state is changed to "FAILED", triggering a reattempt if > possible. > This is OK in the normal case, but there is a race condition when the AM receives a > signal (SIGTERM) and no exception has occurred. In this situation, the > shutdown hook is invoked and marks this application as finished > successfully, and there is no further attempt. > In such a situation, from Spark's perspective this application failed > and needs another attempt, but from Yarn's perspective the application finished > successfully. > This can happen in an NM failure situation: the failure of the NM sends > SIGTERM to the AM, and the AM should mark this attempt as failed and rerun, not > invoke unregister. > So to increase the chance of this race condition, here is the reproduced code: > {code} > val sc = ... > Thread.sleep(3L) > sc.parallelize(1 to 100).collect() > {code} > If the AM is killed while sleeping, no exception is thrown, so from > Yarn's point of view this application finished successfully, but from Spark's > point of view, this application should be reattempted. 
> The log normally like this: > {noformat} > 16/03/03 12:44:19 INFO ContainerManagementProtocolProxy: Opening proxy : > 192.168.0.105:45454 > 16/03/03 12:44:21 INFO YarnClusterSchedulerBackend: Registered executor > NettyRpcEndpointRef(null) (192.168.0.105:57461) with ID 2 > 16/03/03 12:44:21 INFO BlockManagerMasterEndpoint: Registering block manager > 192.168.0.105:57462 with 511.1 MB RAM, BlockManagerId(2, 192.168.0.105, 57462) > 16/03/03 12:44:23 INFO YarnClusterSchedulerBackend: Registered executor > NettyRpcEndpointRef(null) (192.168.0.105:57467) with ID 1 > 16/03/03 12:44:23 INFO BlockManagerMasterEndpoint: Registering block manager > 192.168.0.105:57468 with 511.1 MB RAM, BlockManagerId(1, 192.168.0.105, 57468) > 16/03/03 12:44:23 INFO YarnClusterSchedulerBackend: SchedulerBackend is ready > for scheduling beginning after reached minRegisteredResourcesRatio: 0.8 > 16/03/03 12:44:23 INFO YarnClusterScheduler: > YarnClusterScheduler.postStartHook done > 16/03/03 12:44:39 ERROR ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM > 16/03/03 12:44:39 INFO SparkContext: Invoking stop() from shutdown hook > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/metrics/json,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/stages/stage/kill,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/api,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/static,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/executors/threadDump/json,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/executors/threadDump,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/executors/json,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/executors,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/environment/json,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/environment,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/storage/rdd/json,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/storage/rdd,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/storage/json,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/storage,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/stages/pool/json,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/stages/pool,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/stages/stage/json,null} > 16/03/03 12:44:39 INFO ContextHandl