[jira] [Commented] (SPARK-15869) HTTP 500 and NPE on streaming batch details page

2016-06-13 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327822#comment-15327822
 ] 

Shixiong Zhu commented on SPARK-15869:
--

Do you have a reproducer?

> HTTP 500 and NPE on streaming batch details page
> 
>
> Key: SPARK-15869
> URL: https://issues.apache.org/jira/browse/SPARK-15869
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, Web UI
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>
> When I'm trying to show details of a streaming batch, I'm getting an NPE.
> Sample link:
> http://127.0.0.1:4040/streaming/batch/?id=146555370
> Error:
> {code}
> HTTP ERROR 500
> Problem accessing /streaming/batch/. Reason:
> Server Error
> Caused by:
> java.lang.NullPointerException
>   at 
> scala.collection.convert.Wrappers$JCollectionWrapper.iterator(Wrappers.scala:59)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at 
> scala.collection.TraversableLike$class.groupBy(TraversableLike.scala:320)
>   at scala.collection.AbstractTraversable.groupBy(Traversable.scala:104)
>   at 
> org.apache.spark.streaming.ui.BatchPage.generateJobTable(BatchPage.scala:273)
>   at org.apache.spark.streaming.ui.BatchPage.render(BatchPage.scala:358)
>   at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:81)
>   at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:81)
>   at org.apache.spark.ui.JettyUtils$$anon$2.doGet(JettyUtils.scala:83)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>   at 
> org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:812)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
>   at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
>   at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>   at org.spark_project.jetty.server.Server.handle(Server.java:499)
>   at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
>   at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
>   at 
> org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15899) file scheme should be used correctly

2016-06-13 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327824#comment-15327824
 ] 

Kazuaki Ishizaki commented on SPARK-15899:
--

When I added the two extra slashes, it works on Linux, but it does not work on 
Windows; an exception is thrown.
(/) Linux: {{file://}} + {{/path/to}}
(x) Windows: {{file://}} + {{c:/paths/to}}

This is because of the difference in the original path format between the platforms. I 
noticed that we have to add three extra slashes (i.e. {{file:///}}) on 
Windows, while two extra slashes are enough on Linux.
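
As a minimal sketch of the prefix mismatch described above (plain string concatenation with illustrative placeholder paths, not Spark code):

{code}
// Hedged illustration of the two cases above.
val linuxPath   = "/path/to"      // absolute path on Linux
val windowsPath = "c:/paths/to"   // absolute path on Windows

println("file://" + linuxPath)    // file:///path/to     -> well-formed, empty authority
println("file://" + windowsPath)  // file://c:/paths/to  -> "c:" lands in the authority (host) slot
println("file:///" + windowsPath) // file:///c:/paths/to -> well-formed again
{code}

So the extra slash on Windows only compensates for the missing leading slash in drive-letter paths.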

> file scheme should be used correctly
> 
>
> Key: SPARK-15899
> URL: https://issues.apache.org/jira/browse/SPARK-15899
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> [RFC 1738|https://www.ietf.org/rfc/rfc1738.txt] defines the file scheme as 
> {{file://host/}} or {{file:///}}; see also 
> [Wikipedia|https://en.wikipedia.org/wiki/File_URI_scheme].
> [Some 
> code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L58]
>  uses a different prefix such as {{file:}}.
> It would be good to prepare a utility method to correctly add the {{file://host}} 
> or {{file://}} prefix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15869) HTTP 500 and NPE on streaming batch details page

2016-06-13 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327822#comment-15327822
 ] 

Shixiong Zhu edited comment on SPARK-15869 at 6/13/16 5:47 PM:
---

Do you have a reproducer? "outputOpId" must not be `null`.


was (Author: zsxwing):
Do you have a reproducer?

> HTTP 500 and NPE on streaming batch details page
> 
>
> Key: SPARK-15869
> URL: https://issues.apache.org/jira/browse/SPARK-15869
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, Web UI
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>
> When I'm trying to show details of a streaming batch, I'm getting an NPE.
> Sample link:
> http://127.0.0.1:4040/streaming/batch/?id=146555370
> Error:
> {code}
> HTTP ERROR 500
> Problem accessing /streaming/batch/. Reason:
> Server Error
> Caused by:
> java.lang.NullPointerException
>   at 
> scala.collection.convert.Wrappers$JCollectionWrapper.iterator(Wrappers.scala:59)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at 
> scala.collection.TraversableLike$class.groupBy(TraversableLike.scala:320)
>   at scala.collection.AbstractTraversable.groupBy(Traversable.scala:104)
>   at 
> org.apache.spark.streaming.ui.BatchPage.generateJobTable(BatchPage.scala:273)
>   at org.apache.spark.streaming.ui.BatchPage.render(BatchPage.scala:358)
>   at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:81)
>   at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:81)
>   at org.apache.spark.ui.JettyUtils$$anon$2.doGet(JettyUtils.scala:83)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>   at 
> org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:812)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
>   at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
>   at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>   at org.spark_project.jetty.server.Server.handle(Server.java:499)
>   at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
>   at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
>   at 
> org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15613) Incorrect days to millis conversion

2016-06-13 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-15613:
--

Assignee: Davies Liu

> Incorrect days to millis conversion 
> 
>
> Key: SPARK-15613
> URL: https://issues.apache.org/jira/browse/SPARK-15613
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
> Environment: java version "1.8.0_91"
>Reporter: Dmitry Bushev
>Assignee: Davies Liu
>Priority: Critical
>
> There is an issue with the {{DateTimeUtils.daysToMillis}} implementation. It 
> affects {{DateTimeUtils.toJavaDate}} and ultimately CatalystTypeConverter, 
> i.e. the conversion of a date stored as {{Int}} days from the epoch in InternalRow 
> to the {{java.sql.Date}} of the Row returned to the user.
>  
> The issue can be reproduced with this test (all the following tests are in my 
> default timezone, Europe/Moscow):
> {code}
> $ sbt -Duser.timezone=Europe/Moscow catalyst/console
> scala> java.util.Calendar.getInstance().getTimeZone
> res0: java.util.TimeZone = 
> sun.util.calendar.ZoneInfo[id="Europe/Moscow",offset=1080,dstSavings=0,useDaylight=false,transitions=79,lastRule=null]
> scala> import org.apache.spark.sql.catalyst.util.DateTimeUtils._
> import org.apache.spark.sql.catalyst.util.DateTimeUtils._
> scala> for (days <- 0 to 2 if millisToDays(daysToMillis(days)) != days) 
> yield days
> res23: scala.collection.immutable.IndexedSeq[Int] = Vector(4108, 4473, 4838, 
> 5204, 5568, 5932, 6296, 6660, 7024, 7388, 8053, 8487, 8851, 9215, 9586, 9950, 
> 10314, 10678, 11042, 11406, 11777, 12141, 12505, 12869, 13233, 13597, 13968, 
> 14332, 14696, 15060)
> {code}
> For example, for the {{4108}}th day of the epoch, the correct date should be 
> {{1981-04-01}}:
> {code}
> scala> DateTimeUtils.toJavaDate(4107)
> res25: java.sql.Date = 1981-03-31
> scala> DateTimeUtils.toJavaDate(4108)
> res26: java.sql.Date = 1981-03-31
> scala> DateTimeUtils.toJavaDate(4109)
> res27: java.sql.Date = 1981-04-02
> {code}
> There was a previous unsuccessful attempt to work around the problem in 
> SPARK-11415. It seems that the issue involves flaws in the Java date implementation, 
> and I don't see how it can be fixed without third-party libraries.
> I was not able to identify the library of choice for Spark. The following 
> implementation uses [JSR-310|http://www.threeten.org/]:
> {code}
> def millisToDays(millisUtc: Long): SQLDate = {
>   val instant = Instant.ofEpochMilli(millisUtc)
>   val zonedDateTime = instant.atZone(ZoneId.systemDefault)
>   zonedDateTime.toLocalDate.toEpochDay.toInt
> }
> def daysToMillis(days: SQLDate): Long = {
>   val localDate = LocalDate.ofEpochDay(days)
>   val zonedDateTime = localDate.atStartOfDay(ZoneId.systemDefault)
>   zonedDateTime.toInstant.toEpochMilli
> }
> {code}
> that produces correct results:
> {code}
> scala> for (days <- 0 to 2 if millisToDays(daysToMillis(days)) != days) 
> yield days
> res37: scala.collection.immutable.IndexedSeq[Int] = Vector()
> scala> new java.sql.Date(daysToMillis(4108))
> res36: java.sql.Date = 1981-04-01
> {code}
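
For context, the failing days above line up with annual daylight-saving transitions; per the tz database, day {{4108}} (1981-04-01) is the date the USSR first switched to DST, so local midnight did not exist in Europe/Moscow. A minimal JSR-310 sketch of that gap (assuming the reporter's timezone; this is not the {{DateTimeUtils}} code):

{code}
import java.time.{LocalDate, ZoneId}

val zone   = ZoneId.of("Europe/Moscow")
val day    = LocalDate.ofEpochDay(4108)   // 1981-04-01
val start  = day.atStartOfDay(zone)       // 1981-04-01T01:00+04:00 -- the 00:00-01:00 hour was skipped
val millis = start.toInstant.toEpochMilli
{code}

Any conversion that assumes every local day begins a whole multiple of 86,400,000 ms after the epoch (plus a fixed offset) is an hour off on such days, which is enough to round back to the previous day and reproduce {{toJavaDate(4108) == 1981-03-31}}.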



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15899) file scheme should be used correctly

2016-06-13 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327825#comment-15327825
 ] 

Kazuaki Ishizaki commented on SPARK-15899:
--

When I added the two extra slashes, it works on Linux, but it does not work on 
Windows; an exception is thrown.
(/) Linux: {{file://}} + {{/path/to}}
(x) Windows: {{file://}} + {{c:/paths/to}}

This is because of the difference in the original path format between the platforms. I 
noticed that we have to add three extra slashes (i.e. {{file:///}}) on 
Windows, while two extra slashes are enough on Linux.

> file scheme should be used correctly
> 
>
> Key: SPARK-15899
> URL: https://issues.apache.org/jira/browse/SPARK-15899
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> [RFC 1738|https://www.ietf.org/rfc/rfc1738.txt] defines the file scheme as 
> {{file://host/}} or {{file:///}}; see also 
> [Wikipedia|https://en.wikipedia.org/wiki/File_URI_scheme].
> [Some 
> code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L58]
>  uses a different prefix such as {{file:}}.
> It would be good to prepare a utility method to correctly add the {{file://host}} 
> or {{file://}} prefix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-15899) file scheme should be used correctly

2016-06-13 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-15899:
-
Comment: was deleted

(was: When I added the two extra slashes, it works on Linux, but it does not 
work on Windows; an exception is thrown.
(/) Linux: {{file://}} + {{/path/to}}
(x) Windows: {{file://}} + {{c:/paths/to}}

This is because of the difference in the original path format between the platforms. I 
noticed that we have to add three extra slashes (i.e. {{file:///}}) on 
Windows, while two extra slashes are enough on Linux.)

> file scheme should be used correctly
> 
>
> Key: SPARK-15899
> URL: https://issues.apache.org/jira/browse/SPARK-15899
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> [RFC 1738|https://www.ietf.org/rfc/rfc1738.txt] defines the file scheme as 
> {{file://host/}} or {{file:///}}; see also 
> [Wikipedia|https://en.wikipedia.org/wiki/File_URI_scheme].
> [Some 
> code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L58]
>  uses a different prefix such as {{file:}}.
> It would be good to prepare a utility method to correctly add the {{file://host}} 
> or {{file://}} prefix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15869) HTTP 500 and NPE on streaming batch details page

2016-06-13 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327822#comment-15327822
 ] 

Shixiong Zhu edited comment on SPARK-15869 at 6/13/16 5:47 PM:
---

Do you have a reproducer? "outputOpId" must not be `null`.


was (Author: zsxwing):
Do you have a reproducer? "outputOpId" must not be `null`.

> HTTP 500 and NPE on streaming batch details page
> 
>
> Key: SPARK-15869
> URL: https://issues.apache.org/jira/browse/SPARK-15869
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, Web UI
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>
> When I'm trying to show details of a streaming batch, I'm getting an NPE.
> Sample link:
> http://127.0.0.1:4040/streaming/batch/?id=146555370
> Error:
> {code}
> HTTP ERROR 500
> Problem accessing /streaming/batch/. Reason:
> Server Error
> Caused by:
> java.lang.NullPointerException
>   at 
> scala.collection.convert.Wrappers$JCollectionWrapper.iterator(Wrappers.scala:59)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at 
> scala.collection.TraversableLike$class.groupBy(TraversableLike.scala:320)
>   at scala.collection.AbstractTraversable.groupBy(Traversable.scala:104)
>   at 
> org.apache.spark.streaming.ui.BatchPage.generateJobTable(BatchPage.scala:273)
>   at org.apache.spark.streaming.ui.BatchPage.render(BatchPage.scala:358)
>   at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:81)
>   at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:81)
>   at org.apache.spark.ui.JettyUtils$$anon$2.doGet(JettyUtils.scala:83)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>   at 
> org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:812)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
>   at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
>   at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>   at org.spark_project.jetty.server.Server.handle(Server.java:499)
>   at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
>   at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
>   at 
> org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15924) SparkR parser bug with backslash in comments

2016-06-13 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327851#comment-15327851
 ] 

Shivaram Venkataraman commented on SPARK-15924:
---

I'm not sure what part of this code snippet is specific to SparkR -- the code 
just looks like ggplot2 / base R functions?

> SparkR parser bug with backslash in comments
> 
>
> Key: SPARK-15924
> URL: https://issues.apache.org/jira/browse/SPARK-15924
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.1
>Reporter: Xuan Wang
>
> When I run an R cell with the following comments:
> {code} 
> #   p <- p + scale_fill_manual(values = set2[groups])
> #   # p <- p + scale_fill_brewer(palette = "Set2") + 
> scale_color_brewer(palette = "Set2")
> #   p <- p + scale_x_date(labels = date_format("%m/%d\n%a"))
> #   p
> {code}
> I get the following error message
> {quote}
>   :16:1: unexpected input
> 15: #   p <- p + scale_x_date(labels = date_format("%m/%d
> 16: %a"))
> ^
> {quote}
> After I remove the backslash in "date_format("%m/%d\n%a"))", it works fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15922) BlockMatrix to IndexedRowMatrix throws an error

2016-06-13 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327884#comment-15327884
 ] 

Dongjoon Hyun commented on SPARK-15922:
---

Hi, [~chaz2505].
This is due to a bug in `toIndexedRowMatrix`. 
I'll make a PR for this.

> BlockMatrix to IndexedRowMatrix throws an error
> ---
>
> Key: SPARK-15922
> URL: https://issues.apache.org/jira/browse/SPARK-15922
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Charlie Evans
>
> {code}
> import org.apache.spark.mllib.linalg.distributed._
> import org.apache.spark.mllib.linalg._
> val rows = IndexedRow(0L, new DenseVector(Array(1,2,3))) :: IndexedRow(1L, 
> new DenseVector(Array(1,2,3))):: IndexedRow(2L, new 
> DenseVector(Array(1,2,3))):: Nil
> val rdd = sc.parallelize(rows)
> val matrix = new IndexedRowMatrix(rdd, 3, 3)
> val bmat = matrix.toBlockMatrix
> val imat = bmat.toIndexedRowMatrix
> imat.rows.collect // this throws an error - Caused by: 
> java.lang.IllegalArgumentException: requirement failed: Vectors must be the 
> same length!
> {code}
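
Until a fix lands, one possible workaround (untested; it reuses {{bmat}} from the reproduction above) is to route the conversion through {{CoordinateMatrix}}, which rebuilds each row vector at the full matrix width:

{code}
// Hedged workaround sketch; bmat is the BlockMatrix from the snippet above.
val imatViaCoordinates = bmat.toCoordinateMatrix.toIndexedRowMatrix
imatViaCoordinates.rows.collect()
{code}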



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15922) BlockMatrix to IndexedRowMatrix throws an error

2016-06-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15922:


Assignee: (was: Apache Spark)

> BlockMatrix to IndexedRowMatrix throws an error
> ---
>
> Key: SPARK-15922
> URL: https://issues.apache.org/jira/browse/SPARK-15922
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Charlie Evans
>
> {code}
> import org.apache.spark.mllib.linalg.distributed._
> import org.apache.spark.mllib.linalg._
> val rows = IndexedRow(0L, new DenseVector(Array(1,2,3))) :: IndexedRow(1L, 
> new DenseVector(Array(1,2,3))):: IndexedRow(2L, new 
> DenseVector(Array(1,2,3))):: Nil
> val rdd = sc.parallelize(rows)
> val matrix = new IndexedRowMatrix(rdd, 3, 3)
> val bmat = matrix.toBlockMatrix
> val imat = bmat.toIndexedRowMatrix
> imat.rows.collect // this throws an error - Caused by: 
> java.lang.IllegalArgumentException: requirement failed: Vectors must be the 
> same length!
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15922) BlockMatrix to IndexedRowMatrix throws an error

2016-06-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15922:


Assignee: Apache Spark

> BlockMatrix to IndexedRowMatrix throws an error
> ---
>
> Key: SPARK-15922
> URL: https://issues.apache.org/jira/browse/SPARK-15922
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Charlie Evans
>Assignee: Apache Spark
>
> {code}
> import org.apache.spark.mllib.linalg.distributed._
> import org.apache.spark.mllib.linalg._
> val rows = IndexedRow(0L, new DenseVector(Array(1,2,3))) :: IndexedRow(1L, 
> new DenseVector(Array(1,2,3))):: IndexedRow(2L, new 
> DenseVector(Array(1,2,3))):: Nil
> val rdd = sc.parallelize(rows)
> val matrix = new IndexedRowMatrix(rdd, 3, 3)
> val bmat = matrix.toBlockMatrix
> val imat = bmat.toIndexedRowMatrix
> imat.rows.collect // this throws an error - Caused by: 
> java.lang.IllegalArgumentException: requirement failed: Vectors must be the 
> same length!
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15922) BlockMatrix to IndexedRowMatrix throws an error

2016-06-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327887#comment-15327887
 ] 

Apache Spark commented on SPARK-15922:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/13643

> BlockMatrix to IndexedRowMatrix throws an error
> ---
>
> Key: SPARK-15922
> URL: https://issues.apache.org/jira/browse/SPARK-15922
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Charlie Evans
>
> {code}
> import org.apache.spark.mllib.linalg.distributed._
> import org.apache.spark.mllib.linalg._
> val rows = IndexedRow(0L, new DenseVector(Array(1,2,3))) :: IndexedRow(1L, 
> new DenseVector(Array(1,2,3))):: IndexedRow(2L, new 
> DenseVector(Array(1,2,3))):: Nil
> val rdd = sc.parallelize(rows)
> val matrix = new IndexedRowMatrix(rdd, 3, 3)
> val bmat = matrix.toBlockMatrix
> val imat = bmat.toIndexedRowMatrix
> imat.rows.collect // this throws an error - Caused by: 
> java.lang.IllegalArgumentException: requirement failed: Vectors must be the 
> same length!
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15925) Replaces registerTempTable with createOrReplaceTempView in SparkR

2016-06-13 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-15925:
--

 Summary: Replaces registerTempTable with createOrReplaceTempView 
in SparkR
 Key: SPARK-15925
 URL: https://issues.apache.org/jira/browse/SPARK-15925
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR, SQL
Affects Versions: 2.0.0
Reporter: Cheng Lian
Assignee: Cheng Lian






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15697) [SPARK REPL] unblock some of the useful repl commands.

2016-06-13 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-15697.
--
   Resolution: Fixed
 Assignee: Prashant Sharma
Fix Version/s: 2.0.0

> [SPARK REPL] unblock some of the useful repl commands.
> --
>
> Key: SPARK-15697
> URL: https://issues.apache.org/jira/browse/SPARK-15697
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.6.1
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>Priority: Trivial
> Fix For: 2.0.0
>
>
> "implicits", "javap", "power", "type", "kind" commands in repl are blocked. 
> However, they work fine in all cases I have tried. It is clear we don't 
> support them as they are part of the scala/scala repl project. What is the 
> harm in unblocking them, given they are useful ?
> In previous versions of spark we disabled these commands because it was 
> difficult to support them without customization and the associated 
> maintenance. Since the code base of scala repl was actually ported and 
> maintained under spark source. Now that is not the situation and one can 
> benefit from these commands in Spark REPL as much as in scala repl.
> Symantics of reset are to be discussed in a separate issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15925) Replaces registerTempTable with createOrReplaceTempView in SparkR

2016-06-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327918#comment-15327918
 ] 

Apache Spark commented on SPARK-15925:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/13644

> Replaces registerTempTable with createOrReplaceTempView in SparkR
> -
>
> Key: SPARK-15925
> URL: https://issues.apache.org/jira/browse/SPARK-15925
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR, SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15925) Replaces registerTempTable with createOrReplaceTempView in SparkR

2016-06-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15925:


Assignee: Cheng Lian  (was: Apache Spark)

> Replaces registerTempTable with createOrReplaceTempView in SparkR
> -
>
> Key: SPARK-15925
> URL: https://issues.apache.org/jira/browse/SPARK-15925
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR, SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-15924) SparkR parser bug with backslash in comments

2016-06-13 Thread Xuan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Wang closed SPARK-15924.
-
Resolution: Not A Problem

Not a problem of open source Spark

> SparkR parser bug with backslash in comments
> 
>
> Key: SPARK-15924
> URL: https://issues.apache.org/jira/browse/SPARK-15924
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.1
>Reporter: Xuan Wang
>
> When I run an R cell with the following comments:
> {code} 
> #   p <- p + scale_fill_manual(values = set2[groups])
> #   # p <- p + scale_fill_brewer(palette = "Set2") + 
> scale_color_brewer(palette = "Set2")
> #   p <- p + scale_x_date(labels = date_format("%m/%d\n%a"))
> #   p
> {code}
> I get the following error message
> {quote}
>   :16:1: unexpected input
> 15: #   p <- p + scale_x_date(labels = date_format("%m/%d
> 16: %a"))
> ^
> {quote}
> After I remove the backslash in "date_format("%m/%d\n%a"))", it works fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15666) Join on two tables generated from a same table throwing query analyzer issue

2016-06-13 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327919#comment-15327919
 ] 

Herman van Hovell commented on SPARK-15666:
---

Looking at the exception {{... in operator !Project [...}}, it seems like one 
of the underlying plans is broken.

> Join on two tables generated from a same table throwing query analyzer issue
> 
>
> Key: SPARK-15666
> URL: https://issues.apache.org/jira/browse/SPARK-15666
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
> Environment: AWS EMR
>Reporter: Manish Kumar
>Priority: Blocker
>
> If two dataframes (named leftdf and rightdf) that are created by performing 
> some operations on a single dataframe are joined, then we get an 
> analyzer error:
> leftdf schema
> {noformat}
> root
>  |-- affinity_monitor_copay: string (nullable = true)
>  |-- affinity_monitor_digital_pull: string (nullable = true)
>  |-- affinity_monitor_digital_push: string (nullable = true)
>  |-- affinity_monitor_direct: string (nullable = true)
>  |-- affinity_monitor_peer: string (nullable = true)
>  |-- affinity_monitor_peer_interaction: string (nullable = true)
>  |-- affinity_monitor_personal_f2f: string (nullable = true)
>  |-- affinity_monitor_personal_remote: string (nullable = true)
>  |-- affinity_monitor_sample: string (nullable = true)
>  |-- affinity_monitor_voucher: string (nullable = true)
>  |-- afltn_id: string (nullable = true)
>  |-- attribute_2_value: string (nullable = true)
>  |-- brand: string (nullable = true)
>  |-- city: string (nullable = true)
>  |-- cycle_time_id: integer (nullable = true)
>  |-- full_name: string (nullable = true)
>  |-- hcp: string (nullable = true)
>  |-- like17_mg17_metric114_aggregated: double (nullable = true)
>  |-- like17_mg17_metric118_aggregated: double (nullable = true)
>  |-- metric_group_sk: integer (nullable = true)
>  |-- metrics: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- hcp: string (nullable = true)
>  |||-- brand: string (nullable = true)
>  |||-- rep: string (nullable = true)
>  |||-- month: string (nullable = true)
>  |||-- metric117: string (nullable = true)
>  |||-- metric114: string (nullable = true)
>  |||-- metric118: string (nullable = true)
>  |||-- specialty_1: string (nullable = true)
>  |||-- full_name: string (nullable = true)
>  |||-- pri_st: string (nullable = true)
>  |||-- city: string (nullable = true)
>  |||-- zip_code: string (nullable = true)
>  |||-- prsn_id: string (nullable = true)
>  |||-- afltn_id: string (nullable = true)
>  |||-- npi_id: string (nullable = true)
>  |||-- affinity_monitor_sample: string (nullable = true)
>  |||-- affinity_monitor_personal_f2f: string (nullable = true)
>  |||-- affinity_monitor_peer: string (nullable = true)
>  |||-- affinity_monitor_copay: string (nullable = true)
>  |||-- affinity_monitor_digital_push: string (nullable = true)
>  |||-- affinity_monitor_voucher: string (nullable = true)
>  |||-- affinity_monitor_direct: string (nullable = true)
>  |||-- affinity_monitor_peer_interaction: string (nullable = true)
>  |||-- affinity_monitor_digital_pull: string (nullable = true)
>  |||-- affinity_monitor_personal_remote: string (nullable = true)
>  |||-- attribute_2_value: string (nullable = true)
>  |||-- metric211: double (nullable = false)
>  |-- mg17_metric117_3: double (nullable = true)
>  |-- mg17_metric117_3_actual_metric: double (nullable = true)
>  |-- mg17_metric117_3_planned_metric: double (nullable = true)
>  |-- mg17_metric117_D_suggestion_id: integer (nullable = true)
>  |-- mg17_metric117_D_suggestion_text: string (nullable = true)
>  |-- mg17_metric117_D_suggestion_text_raw: string (nullable = true)
>  |-- mg17_metric117_exp_score: integer (nullable = true)
>  |-- mg17_metric117_severity_index: double (nullable = true)
>  |-- mg17_metric117_test: integer (nullable = true)
>  |-- mg17_metric211_P_suggestion_id: integer (nullable = true)
>  |-- mg17_metric211_P_suggestion_text: string (nullable = true)
>  |-- mg17_metric211_P_suggestion_text_raw: string (nullable = true)
>  |-- mg17_metric211_aggregated: double (nullable = false)
>  |-- mg17_metric211_deviationfrompeers_p_value: double (nullable = true)
>  |-- mg17_metric211_deviationfromtrend_current_mu: double (nullable = true)
>  |-- mg17_metric211_deviationfromtrend_p_value: double (nullable = true)
>  |-- mg17_metric211_deviationfromtrend_previous_mu: double (nullable = true)
>  |-- mg17_metric211_exp_score: integer

[jira] [Assigned] (SPARK-15925) Replaces registerTempTable with createOrReplaceTempView in SparkR

2016-06-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15925:


Assignee: Apache Spark  (was: Cheng Lian)

> Replaces registerTempTable with createOrReplaceTempView in SparkR
> -
>
> Key: SPARK-15925
> URL: https://issues.apache.org/jira/browse/SPARK-15925
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR, SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String

2016-06-13 Thread Pete Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327992#comment-15327992
 ] 

Pete Robbins commented on SPARK-15822:
--

So this does seem to cause the NPE or SEGV intermittently, i.e. I get some clean 
runs. However, I added some tracing to detect when the UnsafeRow looks corrupt 
(baseObject = null, offset = massive), and I see these in every run, so I suspect 
there is always corruption but that it doesn't always lead to a visible failure. 
The app usually gives the appearance of success, as Spark re-submits the lost 
tasks and restarts failing executors. Here is what I think is the plan 
associated with one of the failing jobs:

== Parsed Logical Plan ==
'Project [unresolvedalias('Origin, None), unresolvedalias('UniqueCarrier, 
None), 'round((('count * 100) / 'total), 2) AS rank#927]
+- Project [Origin#16, UniqueCarrier#8, count#888L, total#851L]
   +- Join Inner, ((Origin#16 = Origin#909) && (UniqueCarrier#8 = 
UniqueCarrier#901))
  :- Aggregate [Origin#16, UniqueCarrier#8], [Origin#16, UniqueCarrier#8, 
count(1) AS count#888L]
  :  +- Filter (NOT (Cancelled#21 = 0) && (CancellationCode#22 = A))
  : +- Filter (Dest#17 = ORD)
  :+- 
Relation[Year#0,Month#1,DayofMonth#2,DayOfWeek#3,DepTime#4,CRSDepTime#5,ArrTime#6,CRSArrTime#7,UniqueCarrier#8,FlightNum#9,TailNum#10,ActualElapsedTime#11,CRSElapsedTime#12,AirTime#13,ArrDelay#14,DepDelay#15,Origin#16,Dest#17,Distance#18,TaxiIn#19,TaxiOut#20,Cancelled#21,CancellationCode#22,Diverted#23,CarrierDelay#24,WeatherDelay#25,NASDelay#26,SecurityDelay#27,LateAircraftDelay#28]
 csv
  +- Project [Origin#909, UniqueCarrier#901, count#846L AS total#851L]
 +- Aggregate [Origin#909, UniqueCarrier#901], [Origin#909, 
UniqueCarrier#901, count(1) AS count#846L]
+- Filter (Dest#910 = ORD)
   +- 
Relation[Year#893,Month#894,DayofMonth#895,DayOfWeek#896,DepTime#897,CRSDepTime#898,ArrTime#899,CRSArrTime#900,UniqueCarrier#901,FlightNum#902,TailNum#903,ActualElapsedTime#904,CRSElapsedTime#905,AirTime#906,ArrDelay#907,DepDelay#908,Origin#909,Dest#910,Distance#911,TaxiIn#912,TaxiOut#913,Cancelled#914,CancellationCode#915,Diverted#916,CarrierDelay#917,WeatherDelay#918,NASDelay#919,SecurityDelay#920,LateAircraftDelay#921]
 csv

== Analyzed Logical Plan ==
Origin: string, UniqueCarrier: string, rank: double
Project [Origin#16, UniqueCarrier#8, round((cast((count#888L * cast(100 as 
bigint)) as double) / cast(total#851L as double)), 2) AS rank#927]
+- Project [Origin#16, UniqueCarrier#8, count#888L, total#851L]
   +- Join Inner, ((Origin#16 = Origin#909) && (UniqueCarrier#8 = 
UniqueCarrier#901))
  :- Aggregate [Origin#16, UniqueCarrier#8], [Origin#16, UniqueCarrier#8, 
count(1) AS count#888L]
  :  +- Filter (NOT (Cancelled#21 = 0) && (CancellationCode#22 = A))
  : +- Filter (Dest#17 = ORD)
  :+- 
Relation[Year#0,Month#1,DayofMonth#2,DayOfWeek#3,DepTime#4,CRSDepTime#5,ArrTime#6,CRSArrTime#7,UniqueCarrier#8,FlightNum#9,TailNum#10,ActualElapsedTime#11,CRSElapsedTime#12,AirTime#13,ArrDelay#14,DepDelay#15,Origin#16,Dest#17,Distance#18,TaxiIn#19,TaxiOut#20,Cancelled#21,CancellationCode#22,Diverted#23,CarrierDelay#24,WeatherDelay#25,NASDelay#26,SecurityDelay#27,LateAircraftDelay#28]
 csv
  +- Project [Origin#909, UniqueCarrier#901, count#846L AS total#851L]
 +- Aggregate [Origin#909, UniqueCarrier#901], [Origin#909, 
UniqueCarrier#901, count(1) AS count#846L]
+- Filter (Dest#910 = ORD)
   +- 
Relation[Year#893,Month#894,DayofMonth#895,DayOfWeek#896,DepTime#897,CRSDepTime#898,ArrTime#899,CRSArrTime#900,UniqueCarrier#901,FlightNum#902,TailNum#903,ActualElapsedTime#904,CRSElapsedTime#905,AirTime#906,ArrDelay#907,DepDelay#908,Origin#909,Dest#910,Distance#911,TaxiIn#912,TaxiOut#913,Cancelled#914,CancellationCode#915,Diverted#916,CarrierDelay#917,WeatherDelay#918,NASDelay#919,SecurityDelay#920,LateAircraftDelay#921]
 csv

== Optimized Logical Plan ==
Project [Origin#16, UniqueCarrier#8, round((cast((count#888L * 100) as double) 
/ cast(total#851L as double)), 2) AS rank#927]
+- Join Inner, ((Origin#16 = Origin#909) && (UniqueCarrier#8 = 
UniqueCarrier#901))
   :- Aggregate [Origin#16, UniqueCarrier#8], [Origin#16, UniqueCarrier#8, 
count(1) AS count#888L]
   :  +- Project [UniqueCarrier#8, Origin#16]
   : +- Filter (isnotnull(UniqueCarrier#8) && isnotnull(Origin#16)) && 
isnotnull(Cancelled#21)) && isnotnull(CancellationCode#22)) && NOT 
(Cancelled#21 = 0)) && (CancellationCode#22 = A))
   :+- InMemoryRelation [Year#0, Month#1, DayofMonth#2, DayOfWeek#3, 
DepTime#4, CRSDepTime#5, ArrTime#6, CRSArrTime#7, UniqueCarrier#8, FlightNum#9, 
TailNum#10, ActualElapsedTime#11, CRSElapsedTime#12, AirTime#13, ArrDelay#14, 
DepDelay#15, Origin#16, Dest#17, Distance#18, TaxiIn#19, TaxiOut#20, 
Cancelled#21, CancellationCode#22, Diverted#23, Carrier

[jira] [Updated] (SPARK-15655) Wrong Result when Fetching Partitioned Tables

2016-06-13 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-15655:
-
Priority: Blocker  (was: Critical)

> Wrong Result when Fetching Partitioned Tables
> -
>
> Key: SPARK-15655
> URL: https://issues.apache.org/jira/browse/SPARK-15655
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Priority: Blocker
>
> When fetching from a partitioned table, the output contains wrong results 
> for the partitioning keys. 
> {noformat}
> CREATE TABLE table_with_partition(c1 string) PARTITIONED BY (p1 string,p2 
> string,p3 string,p4 string,p5 string)
> INSERT OVERWRITE TABLE table_with_partition PARTITION 
> (p1='a',p2='b',p3='c',p4='d',p5='e') SELECT 'blarr'
> SELECT p1, p2, p3, p4, p5, c1 FROM table_with_partition
> {noformat}
> {noformat}
> +---+---+---+---+---+-+
> | p1| p2| p3| p4| p5|   c1|
> +---+---+---+---+---+-+
> |  d|  e|  c|  b|  a|blarr|
> +---+---+---+---+---+-+
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15655) Wrong Result when Fetching Partitioned Tables

2016-06-13 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-15655:
-
Target Version/s: 2.0.0

> Wrong Result when Fetching Partitioned Tables
> -
>
> Key: SPARK-15655
> URL: https://issues.apache.org/jira/browse/SPARK-15655
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Priority: Blocker
>
> When fetching from a partitioned table, the output contains wrong results 
> for the partitioning keys. 
> {noformat}
> CREATE TABLE table_with_partition(c1 string) PARTITIONED BY (p1 string,p2 
> string,p3 string,p4 string,p5 string)
> INSERT OVERWRITE TABLE table_with_partition PARTITION 
> (p1='a',p2='b',p3='c',p4='d',p5='e') SELECT 'blarr'
> SELECT p1, p2, p3, p4, p5, c1 FROM table_with_partition
> {noformat}
> {noformat}
> +---+---+---+---+---+-+
> | p1| p2| p3| p4| p5|   c1|
> +---+---+---+---+---+-+
> |  d|  e|  c|  b|  a|blarr|
> +---+---+---+---+---+-+
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15655) Wrong Result when Fetching Partitioned Tables

2016-06-13 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-15655:
-
Assignee: Xiao Li

> Wrong Result when Fetching Partitioned Tables
> -
>
> Key: SPARK-15655
> URL: https://issues.apache.org/jira/browse/SPARK-15655
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
>
> When fetching from a partitioned table, the output contains wrong results 
> for the partitioning keys. 
> {noformat}
> CREATE TABLE table_with_partition(c1 string) PARTITIONED BY (p1 string,p2 
> string,p3 string,p4 string,p5 string)
> INSERT OVERWRITE TABLE table_with_partition PARTITION 
> (p1='a',p2='b',p3='c',p4='d',p5='e') SELECT 'blarr'
> SELECT p1, p2, p3, p4, p5, c1 FROM table_with_partition
> {noformat}
> {noformat}
> +---+---+---+---+---+-+
> | p1| p2| p3| p4| p5|   c1|
> +---+---+---+---+---+-+
> |  d|  e|  c|  b|  a|blarr|
> +---+---+---+---+---+-+
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15926) Improve readability of DAGScheduler stage creation methods

2016-06-13 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-15926:
--

 Summary: Improve readability of DAGScheduler stage creation methods
 Key: SPARK-15926
 URL: https://issues.apache.org/jira/browse/SPARK-15926
 Project: Spark
  Issue Type: Sub-task
  Components: Scheduler
Affects Versions: 2.0.0
Reporter: Kay Ousterhout
Assignee: Kay Ousterhout
Priority: Minor


The current code to create new stages is very confusing: it's difficult to 
reason about which functions actually create new stages versus just looking up 
existing stages, and there are many similarly-named functions that do very 
different things.  The goal of this JIRA is to clean some of that up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15927) Eliminate redundant code in DAGScheduler's getParentStages and getAncestorShuffleDependencies methods.

2016-06-13 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-15927:
--

 Summary: Eliminate redundant code in DAGScheduler's 
getParentStages and getAncestorShuffleDependencies methods.
 Key: SPARK-15927
 URL: https://issues.apache.org/jira/browse/SPARK-15927
 Project: Spark
  Issue Type: Sub-task
Affects Versions: 2.0.0
Reporter: Kay Ousterhout
Assignee: Kay Ousterhout
Priority: Minor


The getParentStages and getAncestorShuffleDependencies methods have a lot of 
repeated code to traverse the dependency graph.  We should create a function 
that they can both call.
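
For illustration only, a hypothetical sketch (not the eventual PR) of the kind of shared helper the description calls for: a single walk of the narrow-dependency graph that invokes a callback for each ShuffleDependency it reaches, so both methods can reuse the traversal and only differ in what they do per dependency:

{code}
import scala.collection.mutable

import org.apache.spark.ShuffleDependency
import org.apache.spark.rdd.RDD

// Visit every ShuffleDependency reachable from `rdd` through narrow
// dependencies; callers decide what to do with each one (e.g. look up or
// create a stage, or recurse further themselves).
def visitShuffleDependencies(rdd: RDD[_])(visit: ShuffleDependency[_, _, _] => Unit): Unit = {
  val visited = mutable.HashSet[RDD[_]]()
  val waiting = mutable.Stack[RDD[_]](rdd)
  while (waiting.nonEmpty) {
    val current = waiting.pop()
    if (!visited(current)) {
      visited += current
      current.dependencies.foreach {
        case shuffle: ShuffleDependency[_, _, _] => visit(shuffle)
        case narrow => waiting.push(narrow.rdd)
      }
    }
  }
}
{code}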



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15928) Eliminate redundant code in DAGScheduler's getParentStages and getAncestorShuffleDependencies methods.

2016-06-13 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-15928:
--

 Summary: Eliminate redundant code in DAGScheduler's 
getParentStages and getAncestorShuffleDependencies methods.
 Key: SPARK-15928
 URL: https://issues.apache.org/jira/browse/SPARK-15928
 Project: Spark
  Issue Type: Sub-task
Affects Versions: 2.0.0
Reporter: Kay Ousterhout
Assignee: Kay Ousterhout
Priority: Minor


The getParentStages and getAncestorShuffleDependencies methods have a lot of 
repeated code to traverse the dependency graph.  We should create a function 
that they can both call.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-15928) Eliminate redundant code in DAGScheduler's getParentStages and getAncestorShuffleDependencies methods.

2016-06-13 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout closed SPARK-15928.
--
Resolution: Duplicate

> Eliminate redundant code in DAGScheduler's getParentStages and 
> getAncestorShuffleDependencies methods.
> --
>
> Key: SPARK-15928
> URL: https://issues.apache.org/jira/browse/SPARK-15928
> Project: Spark
>  Issue Type: Sub-task
>  Components: Scheduler
>Affects Versions: 2.0.0
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
>
> The getParentStages and getAncestorShuffleDependencies methods have a lot of 
> repeated code to traverse the dependency graph.  We should create a function 
> that they can both call.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15927) Eliminate redundant code in DAGScheduler's getParentStages and getAncestorShuffleDependencies methods.

2016-06-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15927:


Assignee: Kay Ousterhout  (was: Apache Spark)

> Eliminate redundant code in DAGScheduler's getParentStages and 
> getAncestorShuffleDependencies methods.
> --
>
> Key: SPARK-15927
> URL: https://issues.apache.org/jira/browse/SPARK-15927
> Project: Spark
>  Issue Type: Sub-task
>  Components: Scheduler
>Affects Versions: 2.0.0
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
>
> The getParentStages and getAncestorShuffleDependencies methods have a lot of 
> repeated code to traverse the dependency graph.  We should create a function 
> that they can both call.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15927) Eliminate redundant code in DAGScheduler's getParentStages and getAncestorShuffleDependencies methods.

2016-06-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328045#comment-15328045
 ] 

Apache Spark commented on SPARK-15927:
--

User 'kayousterhout' has created a pull request for this issue:
https://github.com/apache/spark/pull/13646

> Eliminate redundant code in DAGScheduler's getParentStages and 
> getAncestorShuffleDependencies methods.
> --
>
> Key: SPARK-15927
> URL: https://issues.apache.org/jira/browse/SPARK-15927
> Project: Spark
>  Issue Type: Sub-task
>  Components: Scheduler
>Affects Versions: 2.0.0
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
>
> The getParentStages and getAncestorShuffleDependencies methods have a lot of 
> repeated code to traverse the dependency graph.  We should create a function 
> that they can both call.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15927) Eliminate redundant code in DAGScheduler's getParentStages and getAncestorShuffleDependencies methods.

2016-06-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15927:


Assignee: Apache Spark  (was: Kay Ousterhout)

> Eliminate redundant code in DAGScheduler's getParentStages and 
> getAncestorShuffleDependencies methods.
> --
>
> Key: SPARK-15927
> URL: https://issues.apache.org/jira/browse/SPARK-15927
> Project: Spark
>  Issue Type: Sub-task
>  Components: Scheduler
>Affects Versions: 2.0.0
>Reporter: Kay Ousterhout
>Assignee: Apache Spark
>Priority: Minor
>
> The getParentStages and getAncestorShuffleDependencies methods have a lot of 
> repeated code to traverse the dependency graph.  We should create a function 
> that they can both call.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15899) file scheme should be used correctly

2016-06-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328056#comment-15328056
 ] 

Sean Owen commented on SPARK-15899:
---

OK, I had assumed that absolute paths on Windows would have to be specified 
like {{/c:/paths/to}}. I know that normal Windows paths are {{c:\paths\to}} of 
course, but this is a somewhat different context.

OK, so what is the value of the system property {{user.dir}} on Windows? Does it 
start with c: and not /c:? If so, then I see that this needs some special 
handling. You're right, we could have a utility method that just adds a {{/}} 
at the start if none is present, but that would silently turn relative paths 
into non-relative ones.

Can we use the {{File}} or {{Paths}} API to really do this right? It should 
give a URI object whose string representation is, I hope, exactly what's desired.
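
A minimal sketch of that idea, using only the standard {{java.io}} / {{java.nio}} APIs (the outputs in the comments are approximate and platform-dependent):

{code}
import java.io.File
import java.nio.file.Paths

// File.toURI supplies the leading slash that a Windows drive-letter path lacks.
new File("c:\\paths\\to").toURI.toString   // roughly "file:/c:/paths/to" on Windows
new File("/path/to").toURI.toString        // roughly "file:/path/to" on Linux

// Paths.get(...).toUri yields the three-slash form directly.
Paths.get("/path/to").toUri.toString       // roughly "file:///path/to" on Linux
{code}

Either form round-trips back to the same local file via {{new File(uri)}}, so the utility method would mainly need to pick one representation and apply it consistently.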

> file scheme should be used correctly
> 
>
> Key: SPARK-15899
> URL: https://issues.apache.org/jira/browse/SPARK-15899
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> [RFC 1738|https://www.ietf.org/rfc/rfc1738.txt] defines the file scheme as 
> {{file://host/}} or {{file:///}}; see also 
> [Wikipedia|https://en.wikipedia.org/wiki/File_URI_scheme].
> [Some 
> code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L58]
>  uses a different prefix such as {{file:}}.
> It would be good to prepare a utility method to correctly add the {{file://host}} 
> or {{file://}} prefix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2016-06-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15784:


Assignee: Apache Spark

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>Assignee: Apache Spark
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2016-06-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15784:


Assignee: (was: Apache Spark)

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2016-06-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328061#comment-15328061
 ] 

Apache Spark commented on SPARK-15784:
--

User 'wangmiao1981' has created a pull request for this issue:
https://github.com/apache/spark/pull/13647

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15923) Spark Application rest api returns "no such app: "

2016-06-13 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328181#comment-15328181
 ] 

Thomas Graves commented on SPARK-15923:
---

Can you give some more details?

Did you have your application history log dir configured 
(spark.history.fs.logDirectory)?   Was the event log for your Spark Pi written 
to that directory?
Was the history server being slow in loading the file (see the log file for the 
history server)?

Does the UI show it going to http://:18080?
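
For reference, a minimal sketch of the usual setup (the directory below is just 
a placeholder): the application writes event logs and the history server reads 
them from the same location.
{noformat}
# spark-defaults.conf on the application side
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///spark-history

# history server side
spark.history.fs.logDirectory    hdfs:///spark-history
{noformat}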

> Spark Application rest api returns "no such app: "
> -
>
> Key: SPARK-15923
> URL: https://issues.apache.org/jira/browse/SPARK-15923
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Yesha Vora
>
> Env : secure cluster
> Scenario:
> * Run SparkPi application in yarn-client or yarn-cluster mode
> * After application finishes, check Spark HS rest api to get details like 
> jobs / executor etc. 
> {code}
> http://:18080/api/v1/applications/application_1465778870517_0001/1/executors{code}
>  
> Rest api return HTTP Code: 404 and prints "HTTP Data: no such app: 
> application_1465778870517_0001"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15676) Disallow Column Names as Partition Columns For Hive Tables

2016-06-13 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-15676:
-
Assignee: Xiao Li

> Disallow Column Names as Partition Columns For Hive Tables
> --
>
> Key: SPARK-15676
> URL: https://issues.apache.org/jira/browse/SPARK-15676
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.0.0
>
>
> Below is a common mistake users might make:
> {noformat}
> hive> CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (data 
> string, part string);
> FAILED: SemanticException [Error 10035]: Column repeated in partitioning 
> columns
> {noformat}
> Different from what Hive returned, currently, we return a confusing error 
> message:
> {noformat}
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:For 
> direct MetaStore DB connections, we don't support retries at the client 
> level.);
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15676) Disallow Column Names as Partition Columns For Hive Tables

2016-06-13 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-15676.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13415
[https://github.com/apache/spark/pull/13415]

> Disallow Column Names as Partition Columns For Hive Tables
> --
>
> Key: SPARK-15676
> URL: https://issues.apache.org/jira/browse/SPARK-15676
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
> Fix For: 2.0.0
>
>
> Below is a common mistake users might make:
> {noformat}
> hive> CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (data 
> string, part string);
> FAILED: SemanticException [Error 10035]: Column repeated in partitioning 
> columns
> {noformat}
> Different from what Hive returned, currently, we return a confusing error 
> message:
> {noformat}
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:For 
> direct MetaStore DB connections, we don't support retries at the client 
> level.);
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String

2016-06-13 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328200#comment-15328200
 ] 

Herman van Hovell commented on SPARK-15822:
---

[~robbinspg] Could you try this without caching?

> segmentation violation in o.a.s.unsafe.types.UTF8String 
> 
>
> Key: SPARK-15822
> URL: https://issues.apache.org/jira/browse/SPARK-15822
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: linux amd64
> openjdk version "1.8.0_91"
> OpenJDK Runtime Environment (build 1.8.0_91-b14)
> OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)
>Reporter: Pete Robbins
>Assignee: Herman van Hovell
>Priority: Blocker
>
> Executors fail with segmentation violation while running application with
> spark.memory.offHeap.enabled true
> spark.memory.offHeap.size 512m
> Also now reproduced with 
> spark.memory.offHeap.enabled false
> {noformat}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f4559b4d4bd, pid=14182, tid=139935319750400
> #
> # JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-b14)
> # Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 
> compressed oops)
> # Problematic frame:
> # J 4816 C2 
> org.apache.spark.unsafe.types.UTF8String.compareTo(Lorg/apache/spark/unsafe/types/UTF8String;)I
>  (64 bytes) @ 0x7f4559b4d4bd [0x7f4559b4d460+0x5d]
> {noformat}
> We initially saw this on IBM java on PowerPC box but is recreatable on linux 
> with OpenJDK. On linux with IBM Java 8 we see a null pointer exception at the 
> same code point:
> {noformat}
> 16/06/08 11:14:58 ERROR Executor: Exception in task 1.0 in stage 5.0 (TID 48)
> java.lang.NullPointerException
>   at 
> org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:831)
>   at org.apache.spark.unsafe.types.UTF8String.compare(UTF8String.java:844)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$doExecute$2$$anon$2.hasNext(WholeStageCodegenExec.scala:377)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
>   at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:664)
>   at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1365)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1362)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.lang.Thread.run(Thread.java:785)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15924) SparkR parser bug with backslash in comments

2016-06-13 Thread Xuan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328205#comment-15328205
 ] 

Xuan Wang commented on SPARK-15924:
---

I then realized that this is not a problem with SparkR, so I closed the issue. 
Thanks!

> SparkR parser bug with backslash in comments
> 
>
> Key: SPARK-15924
> URL: https://issues.apache.org/jira/browse/SPARK-15924
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.1
>Reporter: Xuan Wang
>
> When I run an R cell with the following comments:
> {code} 
> #   p <- p + scale_fill_manual(values = set2[groups])
> #   # p <- p + scale_fill_brewer(palette = "Set2") + 
> scale_color_brewer(palette = "Set2")
> #   p <- p + scale_x_date(labels = date_format("%m/%d\n%a"))
> #   p
> {code}
> I get the following error message
> {quote}
>   :16:1: unexpected input
> 15: #   p <- p + scale_x_date(labels = date_format("%m/%d
> 16: %a"))
> ^
> {quote}
> After I remove the backslash in "date_format("%m/%d\n%a"))", it works fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15530) Partitioning discovery logic HadoopFsRelation should use a higher setting of parallelism

2016-06-13 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-15530.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13444
[https://github.com/apache/spark/pull/13444]

> Partitioning discovery logic HadoopFsRelation should use a higher setting of 
> parallelism
> 
>
> Key: SPARK-15530
> URL: https://issues.apache.org/jira/browse/SPARK-15530
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
> Fix For: 2.0.0
>
>
> At 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala#L418,
>  we launch a spark job to do parallel file listing in order to discover 
> partitions. However, we do not set the number of partitions at here, which 
> means that we are using the default parallelism of the cluster. It is better 
> to set the number of partitions explicitly to generate smaller tasks, which 
> help load balancing. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15530) Partitioning discovery logic HadoopFsRelation should use a higher setting of parallelism

2016-06-13 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-15530:
-
Assignee: Takeshi Yamamuro

> Partitioning discovery logic HadoopFsRelation should use a higher setting of 
> parallelism
> 
>
> Key: SPARK-15530
> URL: https://issues.apache.org/jira/browse/SPARK-15530
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Takeshi Yamamuro
> Fix For: 2.0.0
>
>
> At 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala#L418,
>  we launch a spark job to do parallel file listing in order to discover 
> partitions. However, we do not set the number of partitions at here, which 
> means that we are using the default parallelism of the cluster. It is better 
> to set the number of partitions explicitly to generate smaller tasks, which 
> help load balancing. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15889) Add a unique id to ContinuousQuery

2016-06-13 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-15889.
--
Resolution: Fixed

> Add a unique id to ContinuousQuery
> --
>
> Key: SPARK-15889
> URL: https://issues.apache.org/jira/browse/SPARK-15889
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> ContinuousQueries have names that are unique across all the active ones. 
> However, when queries are rapidly restarted with the same name, it causes race 
> conditions with the listener. A listener event from a stopped query can 
> arrive after the query has been restarted, leading to complexities in 
> monitoring infrastructure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15889) Add a unique id to ContinuousQuery

2016-06-13 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-15889:
-
Fix Version/s: 2.0.0

> Add a unique id to ContinuousQuery
> --
>
> Key: SPARK-15889
> URL: https://issues.apache.org/jira/browse/SPARK-15889
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.0.0
>
>
> ContinuousQueries have names that are unique across all the active ones. 
> However, when queries are rapidly restarted with the same name, it causes race 
> conditions with the listener. A listener event from a stopped query can 
> arrive after the query has been restarted, leading to complexities in 
> monitoring infrastructure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15861) pyspark mapPartitions with none generator functions / functors

2016-06-13 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328245#comment-15328245
 ] 

Bryan Cutler commented on SPARK-15861:
--

[~gbow...@fastmail.co.uk]

{{mapPartitions}} expects a function that takes an iterator as input and then 
outputs an iterable sequence, and your function in the example is actually 
providing this.  I think what is going on here is that your function maps the 
iterator to a numpy array, which internally will be something like 
{noformat}array([[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]){noformat} for the first 
partition; then {{collect}} will iterate over that sequence and return each 
element, which will also be a numpy array, so you get 
{noformat}array([0, 1, 2, 3, 4]), array([5, 6, 7, 8, 9]){noformat} for the 
first 2 elements, and so on.

I believe this is working as it is supposed to, and in general, 
{{mapPartitions}} will not usually give the same result as {{map}} - it will 
fail if the function does not return a valid sequence.  The documentation could 
perhaps be a little clearer in that regard.
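
For illustration, a minimal PySpark sketch of the two styles (it assumes a 
running SparkContext {{sc}}; the generator variant {{to_np_gen}} is made up 
here for contrast):
{code}
import numpy as np

rows = [list(range(i, i + 5)) for i in range(0, 25, 5)]
rdd = sc.parallelize(rows, 2)

def to_np(data):
    # plain return: one 2-D array per partition; collect() then iterates over its
    # rows, so the output looks just like rdd.map(np.array).collect()
    return np.array(list(data))

def to_np_gen(data):
    # generator style: yield the per-partition array as a single element
    yield np.array(list(data))

rdd.mapPartitions(to_np).collect()      # 5 row arrays: array([0, 1, 2, 3, 4]), ...
rdd.mapPartitions(to_np_gen).collect()  # 2 arrays, one 2-D array per partition
{code}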

> pyspark mapPartitions with none generator functions / functors
> --
>
> Key: SPARK-15861
> URL: https://issues.apache.org/jira/browse/SPARK-15861
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
>Reporter: Greg Bowyer
>Priority: Minor
>
> Hi all, it appears that the method `rdd.mapPartitions` does odd things if it 
> is fed a normal subroutine.
> For instance, lets say we have the following
> {code}
> rows = range(25)
> rows = [rows[i:i+5] for i in range(0, len(rows), 5)]
> rdd = sc.parallelize(rows, 2)
> def to_np(data):
> return np.array(list(data))
> rdd.mapPartitions(to_np).collect()
> ...
> [array([0, 1, 2, 3, 4]),
>  array([5, 6, 7, 8, 9]),
>  array([10, 11, 12, 13, 14]),
>  array([15, 16, 17, 18, 19]),
>  array([20, 21, 22, 23, 24])]
> rdd.mapPartitions(to_np, preservePartitioning=True).collect()
> ...
> [array([0, 1, 2, 3, 4]),
>  array([5, 6, 7, 8, 9]),
>  array([10, 11, 12, 13, 14]),
>  array([15, 16, 17, 18, 19]),
>  array([20, 21, 22, 23, 24])]
> {code}
> This basically makes the provided function that did return act like the end 
> user called {code}rdd.map{code}
> I think that maybe a check should be put in to call 
> {code}inspect.isgeneratorfunction{code}
> ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15690) Fast single-node (single-process) in-memory shuffle

2016-06-13 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328255#comment-15328255
 ] 

Saisai Shao commented on SPARK-15690:
-

Hi [~rxin], what's the meaning of "single-process"? Is that referring to 
something similar to local mode?

> Fast single-node (single-process) in-memory shuffle
> ---
>
> Key: SPARK-15690
> URL: https://issues.apache.org/jira/browse/SPARK-15690
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, SQL
>Reporter: Reynold Xin
>
> Spark's current shuffle implementation sorts all intermediate data by their 
> partition id, and then write the data to disk. This is not a big bottleneck 
> because the network throughput on commodity clusters tend to be low. However, 
> an increasing number of Spark users are using the system to process data on a 
> single-node. When in a single node operating against intermediate data that 
> fits in memory, the existing shuffle code path can become a big bottleneck.
> The goal of this ticket is to change Spark so it can use in-memory radix sort 
> to do data shuffling on a single node, and still gracefully fallback to disk 
> if the data size does not fit in memory. Given the number of partitions is 
> usually small (say less than 256), it'd require only a single pass to do the 
> radix sort with pretty decent CPU efficiency.
> Note that there have been many in-memory shuffle attempts in the past. This 
> ticket has a smaller scope (single-process), and aims to actually 
> productionize this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5374) abstract RDD's DAG graph iteration in DAGScheduler

2016-06-13 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout closed SPARK-5374.
-
Resolution: Duplicate

Closing this because it duplicates the more narrowly-scoped JIRAs linked above.

> abstract RDD's DAG graph iteration in DAGScheduler
> --
>
> Key: SPARK-5374
> URL: https://issues.apache.org/jira/browse/SPARK-5374
> Project: Spark
>  Issue Type: Sub-task
>  Components: Scheduler
>Reporter: Wenchen Fan
>
> DAGScheduler has many methods that iterate an RDD's DAG graph, we should 
> abstract the iterate process to reduce code size.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15861) pyspark mapPartitions with none generator functions / functors

2016-06-13 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328245#comment-15328245
 ] 

Bryan Cutler edited comment on SPARK-15861 at 6/13/16 9:05 PM:
---

[~gbow...@fastmail.co.uk]

{{mapPartitions}} expects a function that takes an iterator as input then 
outputs an iterable sequence, and your function in the example is actually 
providing this.  I think what is going on here is your function will map the 
iterator to a numpy array, that internally will be something like  
{noformat}array([[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]){noformat} for the first 
partition, then {{collect}} will iterate over that sequence and return each 
element, which will also be a numpy array, so you get {noformat}array([0, 1, 2, 
3, 4]), array([5, 6, 7, 8, 9])) {noformat} for the first 2 elements and so on..

I believe this is working as it is supposed to, and in general, 
{{mapPartitions}} will not usually give the same result as {{map}} - it will 
fail if the function does not return a valid sequence.  The documentation could 
perhaps be a little clearer in that regard.


was (Author: bryanc):
[~gbow...@fastmail.co.uk]

{{mapPartitions}} expects a function the takes an iterator as input then 
outputs an iterable sequence, and your function in the example is actually 
providing this.  I think what is going on here is your function will map the 
iterator to a numpy array, that internally will be something like  
{noformat}array([[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]){noformat} for the first 
partition, then {{collect}} will iterate over that sequence and return each 
element, which will also be a numpy array, so you get {noformat}array([0, 1, 2, 
3, 4]), array([5, 6, 7, 8, 9])) {noformat} for the first 2 elements and so on..

I believe this is working as it is supposed to, and in general, 
{{mapPartitions}} will not usually give the same result as {{map}} - it will 
fail if the function does not return a valid sequence.  The documentation could 
perhaps be a little clearer in that regard.

> pyspark mapPartitions with none generator functions / functors
> --
>
> Key: SPARK-15861
> URL: https://issues.apache.org/jira/browse/SPARK-15861
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
>Reporter: Greg Bowyer
>Priority: Minor
>
> Hi all, it appears that the method `rdd.mapPartitions` does odd things if it 
> is fed a normal subroutine.
> For instance, lets say we have the following
> {code}
> rows = range(25)
> rows = [rows[i:i+5] for i in range(0, len(rows), 5)]
> rdd = sc.parallelize(rows, 2)
> def to_np(data):
> return np.array(list(data))
> rdd.mapPartitions(to_np).collect()
> ...
> [array([0, 1, 2, 3, 4]),
>  array([5, 6, 7, 8, 9]),
>  array([10, 11, 12, 13, 14]),
>  array([15, 16, 17, 18, 19]),
>  array([20, 21, 22, 23, 24])]
> rdd.mapPartitions(to_np, preservePartitioning=True).collect()
> ...
> [array([0, 1, 2, 3, 4]),
>  array([5, 6, 7, 8, 9]),
>  array([10, 11, 12, 13, 14]),
>  array([15, 16, 17, 18, 19]),
>  array([20, 21, 22, 23, 24])]
> {code}
> This basically makes the provided function that did return act like the end 
> user called {code}rdd.map{code}
> I think that maybe a check should be put in to call 
> {code}inspect.isgeneratorfunction{code}
> ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15690) Fast single-node (single-process) in-memory shuffle

2016-06-13 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328264#comment-15328264
 ] 

Reynold Xin commented on SPARK-15690:
-

Yup. Eventually we can also generalize this to multiple processes (e.g. a cluster).

> Fast single-node (single-process) in-memory shuffle
> ---
>
> Key: SPARK-15690
> URL: https://issues.apache.org/jira/browse/SPARK-15690
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, SQL
>Reporter: Reynold Xin
>
> Spark's current shuffle implementation sorts all intermediate data by their 
> partition id, and then write the data to disk. This is not a big bottleneck 
> because the network throughput on commodity clusters tend to be low. However, 
> an increasing number of Spark users are using the system to process data on a 
> single-node. When in a single node operating against intermediate data that 
> fits in memory, the existing shuffle code path can become a big bottleneck.
> The goal of this ticket is to change Spark so it can use in-memory radix sort 
> to do data shuffling on a single node, and still gracefully fallback to disk 
> if the data size does not fit in memory. Given the number of partitions is 
> usually small (say less than 256), it'd require only a single pass to do the 
> radix sort with pretty decent CPU efficiency.
> Note that there have been many in-memory shuffle attempts in the past. This 
> ticket has a smaller scope (single-process), and aims to actually 
> productionize this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15905) Driver hung while writing to console progress bar

2016-06-13 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328266#comment-15328266
 ] 

Shixiong Zhu commented on SPARK-15905:
--

Do you have a reproducer? What does your code look like?

> Driver hung while writing to console progress bar
> -
>
> Key: SPARK-15905
> URL: https://issues.apache.org/jira/browse/SPARK-15905
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Tejas Patil
>Priority: Minor
>
> This leads to driver being not able to get heartbeats from its executors and 
> job being stuck. After looking at the locking dependency amongst the driver 
> threads per the jstack, this is where the driver seems to be stuck.
> {noformat}
> "refresh progress" #113 daemon prio=5 os_prio=0 tid=0x7f7986cbc800 
> nid=0x7887d runnable [0x7f6d3507a000]
>java.lang.Thread.State: RUNNABLE
> at java.io.FileOutputStream.writeBytes(Native Method)
> at java.io.FileOutputStream.write(FileOutputStream.java:326)
> at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
> - locked <0x7f6eb81dd290> (a java.io.BufferedOutputStream)
> at java.io.PrintStream.write(PrintStream.java:482)
>- locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
> at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
> at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104)
> - locked <0x7f6eb81dd400> (a java.io.OutputStreamWriter)
> at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185)
> at java.io.PrintStream.write(PrintStream.java:527)
> - locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at java.io.PrintStream.print(PrintStream.java:669)
> at 
> org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:99)
> at 
> org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:69)
> - locked <0x7f6ed33b48a0> (a 
> org.apache.spark.ui.ConsoleProgressBar)
> at 
> org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:53)
> at java.util.TimerThread.mainLoop(Timer.java:555)
> at java.util.TimerThread.run(Timer.java:505)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Deleted] (SPARK-15928) Eliminate redundant code in DAGScheduler's getParentStages and getAncestorShuffleDependencies methods.

2016-06-13 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout deleted SPARK-15928:
---


> Eliminate redundant code in DAGScheduler's getParentStages and 
> getAncestorShuffleDependencies methods.
> --
>
> Key: SPARK-15928
> URL: https://issues.apache.org/jira/browse/SPARK-15928
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
>
> The getParentStages and getAncestorShuffleDependencies methods have a lot of 
> repeated code to traverse the dependency graph.  We should create a function 
> that they can both call.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15929) DataFrameSuite path globbing error message tests are not fully portable

2016-06-13 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-15929:
--

 Summary: DataFrameSuite path globbing error message tests are not 
fully portable
 Key: SPARK-15929
 URL: https://issues.apache.org/jira/browse/SPARK-15929
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Josh Rosen
Assignee: Josh Rosen


The DataFrameSuite regression tests for SPARK-13774 fail in my environment 
because they attempt to glob over all of {{/mnt}}, and some of the 
subdirectories in there have restrictive permissions which cause the test to 
fail. I think we should rewrite this test to not depend on existing OS paths.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15929) DataFrameSuite path globbing error message tests are not fully portable

2016-06-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328329#comment-15328329
 ] 

Apache Spark commented on SPARK-15929:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/13649

> DataFrameSuite path globbing error message tests are not fully portable
> ---
>
> Key: SPARK-15929
> URL: https://issues.apache.org/jira/browse/SPARK-15929
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> The DataFrameSuite regression tests for SPARK-13774 fail in my environment 
> because they attempt to glob over all of {{/mnt}}, and some of the 
> subdirectories in there have restrictive permissions which cause the test to 
> fail. I think we should rewrite this test to not depend on existing OS paths.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9623) RandomForestRegressor: provide variance of predictions

2016-06-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9623:
---

Assignee: (was: Apache Spark)

> RandomForestRegressor: provide variance of predictions
> --
>
> Key: SPARK-9623
> URL: https://issues.apache.org/jira/browse/SPARK-9623
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Variance of predicted value, as estimated from training data.
> Analogous to class probabilities for classification.
> See [SPARK-3727] for discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9623) RandomForestRegressor: provide variance of predictions

2016-06-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328343#comment-15328343
 ] 

Apache Spark commented on SPARK-9623:
-

User 'MechCoder' has created a pull request for this issue:
https://github.com/apache/spark/pull/13650

> RandomForestRegressor: provide variance of predictions
> --
>
> Key: SPARK-9623
> URL: https://issues.apache.org/jira/browse/SPARK-9623
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Variance of predicted value, as estimated from training data.
> Analogous to class probabilities for classification.
> See [SPARK-3727] for discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15930) Add Row count property to FPGrowth model

2016-06-13 Thread John Aherne (JIRA)
John Aherne created SPARK-15930:
---

 Summary: Add Row count property to FPGrowth model
 Key: SPARK-15930
 URL: https://issues.apache.org/jira/browse/SPARK-15930
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.6.1
Reporter: John Aherne
Priority: Minor


Add a row count property to MLlib's FPGrowth model. 

When using the model from FPGrowth, a count of the total number of records is 
often necessary. 

It appears that the function already calculates that value when training the 
model, so it would save time not having to do it again outside the model. 
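
For context, a rough Scala sketch of the current workaround (the toy input and 
the extra {{count()}} pass are exactly what a row count property on the model 
would avoid; it assumes a running SparkContext {{sc}}):
{code}
import org.apache.spark.mllib.fpm.FPGrowth

val transactions = sc.parallelize(Seq(
  Array("a", "b", "c"),
  Array("a", "b"),
  Array("b", "c")))

val model = new FPGrowth().setMinSupport(0.5).setNumPartitions(2).run(transactions)

// The record count currently has to be recomputed outside the model,
// even though FPGrowth already scans the input while training.
val numRecords = transactions.count()
{code}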

Sorry if this is the wrong place for this kind of stuff. I am new to Jira, 
Spark, and making suggestions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9623) RandomForestRegressor: provide variance of predictions

2016-06-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9623:
---

Assignee: Apache Spark

> RandomForestRegressor: provide variance of predictions
> --
>
> Key: SPARK-9623
> URL: https://issues.apache.org/jira/browse/SPARK-9623
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>
> Variance of predicted value, as estimated from training data.
> Analogous to class probabilities for classification.
> See [SPARK-3727] for discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15690) Fast single-node (single-process) in-memory shuffle

2016-06-13 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328378#comment-15328378
 ] 

Saisai Shao commented on SPARK-15690:
-

I see. Since everything is in a single process, it looks like the netty layer 
could be bypassed and the memory blocks fetched directly on the reader side. 
That should definitely be faster than the current implementation.

> Fast single-node (single-process) in-memory shuffle
> ---
>
> Key: SPARK-15690
> URL: https://issues.apache.org/jira/browse/SPARK-15690
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, SQL
>Reporter: Reynold Xin
>
> Spark's current shuffle implementation sorts all intermediate data by their 
> partition id, and then write the data to disk. This is not a big bottleneck 
> because the network throughput on commodity clusters tend to be low. However, 
> an increasing number of Spark users are using the system to process data on a 
> single-node. When in a single node operating against intermediate data that 
> fits in memory, the existing shuffle code path can become a big bottleneck.
> The goal of this ticket is to change Spark so it can use in-memory radix sort 
> to do data shuffling on a single node, and still gracefully fallback to disk 
> if the data size does not fit in memory. Given the number of partitions is 
> usually small (say less than 256), it'd require only a single pass to do the 
> radix sort with pretty decent CPU efficiency.
> Note that there have been many in-memory shuffle attempts in the past. This 
> ticket has a smaller scope (single-process), and aims to actually 
> productionize this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15861) pyspark mapPartitions with none generator functions / functors

2016-06-13 Thread Greg Bowyer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328380#comment-15328380
 ] 

Greg Bowyer commented on SPARK-15861:
-

... Hmm, from my end-user testing it does not seem to fail if the map function 
does not return a valid sequence.

> pyspark mapPartitions with none generator functions / functors
> --
>
> Key: SPARK-15861
> URL: https://issues.apache.org/jira/browse/SPARK-15861
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
>Reporter: Greg Bowyer
>Priority: Minor
>
> Hi all, it appears that the method `rdd.mapPartitions` does odd things if it 
> is fed a normal subroutine.
> For instance, lets say we have the following
> {code}
> rows = range(25)
> rows = [rows[i:i+5] for i in range(0, len(rows), 5)]
> rdd = sc.parallelize(rows, 2)
> def to_np(data):
> return np.array(list(data))
> rdd.mapPartitions(to_np).collect()
> ...
> [array([0, 1, 2, 3, 4]),
>  array([5, 6, 7, 8, 9]),
>  array([10, 11, 12, 13, 14]),
>  array([15, 16, 17, 18, 19]),
>  array([20, 21, 22, 23, 24])]
> rdd.mapPartitions(to_np, preservePartitioning=True).collect()
> ...
> [array([0, 1, 2, 3, 4]),
>  array([5, 6, 7, 8, 9]),
>  array([10, 11, 12, 13, 14]),
>  array([15, 16, 17, 18, 19]),
>  array([20, 21, 22, 23, 24])]
> {code}
> This basically makes the provided function that did return act like the end 
> user called {code}rdd.map{code}
> I think that maybe a check should be put in to call 
> {code}inspect.isgeneratorfunction{code}
> ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15753) Move some Analyzer stuff to Analyzer from DataFrameWriter

2016-06-13 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328381#comment-15328381
 ] 

Wenchen Fan commented on SPARK-15753:
-

this is reverted, see discussion 
https://github.com/apache/spark/pull/13496#discussion_r66724862

> Move some Analyzer stuff to Analyzer from DataFrameWriter
> -
>
> Key: SPARK-15753
> URL: https://issues.apache.org/jira/browse/SPARK-15753
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> DataFrameWriter.insertInto includes some Analyzer stuff. We should move it to 
> Analyzer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15887) Bring back the hive-site.xml support for Spark 2.0

2016-06-13 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-15887.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13611
[https://github.com/apache/spark/pull/13611]

> Bring back the hive-site.xml support for Spark 2.0
> --
>
> Key: SPARK-15887
> URL: https://issues.apache.org/jira/browse/SPARK-15887
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>
> Right now, Spark 2.0 does not load hive-site.xml. Based on users' feedback, 
> it seems to make sense to still load this conf file.
> Originally, this file was loaded when we load HiveConf class and all settings 
> can be retrieved after we create a HiveConf instances. Let's avoid of using 
> this way to load hive-site.xml. Instead, since hive-site.xml is a normal 
> hadoop conf file, we can first find its url using the classloader and then 
> use Hadoop Configuration's addResource (or add hive-site.xml as a default 
> resource through Configuration.addDefaultResource) to load confs.
> Please note that hive-site.xml needs to be loaded into the hadoop conf used 
> to create metadataHive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15690) Fast single-node (single-process) in-memory shuffle

2016-06-13 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328390#comment-15328390
 ] 

Reynold Xin commented on SPARK-15690:
-

Yes, there is definitely no reason to go through the network for a single process. 
Technically we can even bypass the entire DAGScheduler, although that might be 
too much work.


> Fast single-node (single-process) in-memory shuffle
> ---
>
> Key: SPARK-15690
> URL: https://issues.apache.org/jira/browse/SPARK-15690
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, SQL
>Reporter: Reynold Xin
>
> Spark's current shuffle implementation sorts all intermediate data by their 
> partition id, and then write the data to disk. This is not a big bottleneck 
> because the network throughput on commodity clusters tend to be low. However, 
> an increasing number of Spark users are using the system to process data on a 
> single-node. When in a single node operating against intermediate data that 
> fits in memory, the existing shuffle code path can become a big bottleneck.
> The goal of this ticket is to change Spark so it can use in-memory radix sort 
> to do data shuffling on a single node, and still gracefully fallback to disk 
> if the data size does not fit in memory. Given the number of partitions is 
> usually small (say less than 256), it'd require only a single pass to do the 
> radix sort with pretty decent CPU efficiency.
> Note that there have been many in-memory shuffle attempts in the past. This 
> ticket has a smaller scope (single-process), and aims to actually 
> productionize this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15931) SparkR tests failing on R 3.3.0

2016-06-13 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-15931:
--

 Summary: SparkR tests failing on R 3.3.0
 Key: SPARK-15931
 URL: https://issues.apache.org/jira/browse/SPARK-15931
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.0.0
Reporter: Cheng Lian


Environment:

# Spark master Git revision: 
[f5d38c39255cc75325c6639561bfec1bc051f788|https://github.com/apache/spark/tree/f5d38c39255cc75325c6639561bfec1bc051f788]
# R version: 3.3.0

To reproduce this, just build Spark with {{-Psparkr}} and run the tests. 
Relevant log lines:
{noformat}
...
Failed -
1. Failure: Check masked functions (@test_context.R#44) 
length(maskedCompletely) not equal to length(namesOfMaskedCompletely).
1/1 mismatches
[1] 3 - 5 == -2


2. Failure: Check masked functions (@test_context.R#45) 
sort(maskedCompletely) not equal to sort(namesOfMaskedCompletely).
Lengths differ: 3 vs 5
...
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15931) SparkR tests failing on R 3.3.0

2016-06-13 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328403#comment-15328403
 ] 

Cheng Lian commented on SPARK-15931:


cc [~mengxr]

> SparkR tests failing on R 3.3.0
> ---
>
> Key: SPARK-15931
> URL: https://issues.apache.org/jira/browse/SPARK-15931
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> Environment:
> # Spark master Git revision: 
> [f5d38c39255cc75325c6639561bfec1bc051f788|https://github.com/apache/spark/tree/f5d38c39255cc75325c6639561bfec1bc051f788]
> # R version: 3.3.0
> To reproduce this, just build Spark with {{-Psparkr}} and run the tests. 
> Relevant log lines:
> {noformat}
> ...
> Failed 
> -
> 1. Failure: Check masked functions (@test_context.R#44) 
> 
> length(maskedCompletely) not equal to length(namesOfMaskedCompletely).
> 1/1 mismatches
> [1] 3 - 5 == -2
> 2. Failure: Check masked functions (@test_context.R#45) 
> 
> sort(maskedCompletely) not equal to sort(namesOfMaskedCompletely).
> Lengths differ: 3 vs 5
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15918) unionAll returns wrong result when two dataframes has schema in different order

2016-06-13 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328405#comment-15328405
 ] 

Dongjoon Hyun commented on SPARK-15918:
---

Hi, [~Prabhu Joseph].
Instead of changing one of the tables, you just need to use an explicit `select`.

If you have `df1(a,b)` and `df2(b,a)`, please do the following.
{code}
df1.union(df2.select("a", "b"))
{code}

IMHO, this is not a problem.

> unionAll returns wrong result when two dataframes has schema in different 
> order
> ---
>
> Key: SPARK-15918
> URL: https://issues.apache.org/jira/browse/SPARK-15918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: CentOS
>Reporter: Prabhu Joseph
>
> On applying unionAll operation between A and B dataframes, they both has same 
> schema but in different order and hence the result has column value mapping 
> changed.
> Repro:
> {code}
> A.show()
> +---++---+--+--+-++---+--+---+---+-+
> |tag|year_day|tm_hour|tm_min|tm_sec|dtype|time|tm_mday|tm_mon|tm_yday|tm_year|value|
> +---++---+--+--+-++---+--+---+---+-+
> +---++---+--+--+-++---+--+---+---+-+
> B.show()
> +-+---+--+---+---+--+--+--+---+---+--++
> |dtype|tag|  
> time|tm_hour|tm_mday|tm_min|tm_mon|tm_sec|tm_yday|tm_year| value|year_day|
> +-+---+--+---+---+--+--+--+---+---+--++
> |F|C_FNHXUT701Z.CNSTLO|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F|C_FNHXUDP713.CNSTHI|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F| C_FNHXUT718.CNSTHI|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F|C_FNHXUT703Z.CNSTLO|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F|C_FNHXUR716A.CNSTLO|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F|C_FNHXUT803Z.CNSTHI|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F| C_FNHXUT728.CNSTHI|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F| C_FNHXUR806.CNSTHI|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> +-+---+--+---+---+--+--+--+---+---+--++
> A = A.unionAll(B)
> A.show()
> +---+---+--+--+--+-++---+--+---+---+-+
> |tag|   year_day|   
> tm_hour|tm_min|tm_sec|dtype|time|tm_mday|tm_mon|tm_yday|tm_year|value|
> +---+---+--+--+--+-++---+--+---+---+-+
> |  F|C_FNHXUT701Z.CNSTLO|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> |  F|C_FNHXUDP713.CNSTHI|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> |  F| C_FNHXUT718.CNSTHI|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> |  F|C_FNHXUT703Z.CNSTLO|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> |  F|C_FNHXUR716A.CNSTLO|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> |  F|C_FNHXUT803Z.CNSTHI|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> |  F| C_FNHXUT728.CNSTHI|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> |  F| C_FNHXUR806.CNSTHI|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> +---+---+--+--+--+-++---+--+---+---+-+
> {code}
> On changing the schema of A according to B and doing unionAll works fine
> {code}
> C = 
> A.select("dtype","tag","time","tm_hour","tm_mday","tm_min","tm_mon","tm_sec","tm_yday","tm_year","value","year_day")
> A = C.unionAll(B)
> A.show()
> +-+---+--+---+---+--+--+--+---+---+--++
> |dtype|tag|  
> time|tm_hour|tm_mday|tm_min|tm_mon|tm_sec|tm_yday|tm_year| value|year_day|
> +-+---+--+---+---+--+--+--+---+---+--++
> |F|C_FNHXUT701Z.CNSTLO|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F|C_FNHXUDP713.CNSTHI|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F| C_FNH

[jira] [Commented] (SPARK-15861) pyspark mapPartitions with none generator functions / functors

2016-06-13 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328417#comment-15328417
 ] 

Bryan Cutler commented on SPARK-15861:
--

{{mapPartitions}} will expect the function to return a sequence - is that what 
you are referring to?

> pyspark mapPartitions with none generator functions / functors
> --
>
> Key: SPARK-15861
> URL: https://issues.apache.org/jira/browse/SPARK-15861
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
>Reporter: Greg Bowyer
>Priority: Minor
>
> Hi all, it appears that the method `rdd.mapPartitions` does odd things if it 
> is fed a normal subroutine.
> For instance, lets say we have the following
> {code}
> rows = range(25)
> rows = [rows[i:i+5] for i in range(0, len(rows), 5)]
> rdd = sc.parallelize(rows, 2)
> def to_np(data):
> return np.array(list(data))
> rdd.mapPartitions(to_np).collect()
> ...
> [array([0, 1, 2, 3, 4]),
>  array([5, 6, 7, 8, 9]),
>  array([10, 11, 12, 13, 14]),
>  array([15, 16, 17, 18, 19]),
>  array([20, 21, 22, 23, 24])]
> rdd.mapPartitions(to_np, preservePartitioning=True).collect()
> ...
> [array([0, 1, 2, 3, 4]),
>  array([5, 6, 7, 8, 9]),
>  array([10, 11, 12, 13, 14]),
>  array([15, 16, 17, 18, 19]),
>  array([20, 21, 22, 23, 24])]
> {code}
> This basically makes the provided function that did return act like the end 
> user called {code}rdd.map{code}
> I think that maybe a check should be put in to call 
> {code}inspect.isgeneratorfunction{code}
> ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15776) Type coercion incorrect

2016-06-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328420#comment-15328420
 ] 

Apache Spark commented on SPARK-15776:
--

User 'clockfly' has created a pull request for this issue:
https://github.com/apache/spark/pull/13651

> Type coercion incorrect
> ---
>
> Key: SPARK-15776
> URL: https://issues.apache.org/jira/browse/SPARK-15776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: Spark based on commit 
> 26c1089c37149061f838129bb53330ded68ff4c9
>Reporter: Weizhong
>Priority: Minor
>
> {code:sql}
> CREATE TABLE cdr (
>   debet_dt  int  ,
>   srv_typ_cdstring   ,
>   b_brnd_cd smallint ,
>   call_dur  int
> )
> ROW FORMAT delimited fields terminated by ','
> STORED AS TEXTFILE;
> {code}
> {code:sql}
> SELECT debet_dt,
>SUM(CASE WHEN srv_typ_cd LIKE '0%' THEN call_dur / 60 ELSE 0 END)
> FROM cdr
> GROUP BY debet_dt
> ORDER BY debet_dt;
> {code}
> {noformat}
> == Analyzed Logical Plan ==
> debet_dt: int, sum(CASE WHEN srv_typ_cd LIKE 0% THEN (call_dur / 60) ELSE 0 
> END): bigint
> Project [debet_dt#16, sum(CASE WHEN srv_typ_cd LIKE 0% THEN (call_dur / 60) 
> ELSE 0 END)#27L]
> +- Sort [debet_dt#16 ASC], true
>+- Aggregate [debet_dt#16], [debet_dt#16, sum(cast(CASE WHEN srv_typ_cd#18 
> LIKE 0% THEN (cast(call_dur#21 as double) / cast(60 as double)) ELSE cast(0 
> as double) END as bigint)) AS sum(CASE WHEN srv_typ_cd LIKE 0% THEN (call_dur 
> / 60) ELSE 0 END)#27L]
>   +- MetastoreRelation default, cdr
> {noformat}
> {code:sql}
> SELECT debet_dt,
>SUM(CASE WHEN b_brnd_cd IN(1) THEN call_dur / 60 ELSE 0 END)
> FROM cdr
> GROUP BY debet_dt
> ORDER BY debet_dt;
> {code}
> {noformat}
> == Analyzed Logical Plan ==
> debet_dt: int, sum(CASE WHEN (CAST(b_brnd_cd AS INT) IN (CAST(1 AS INT))) 
> THEN (CAST(call_dur AS DOUBLE) / CAST(60 AS DOUBLE)) ELSE CAST(0 AS DOUBLE) 
> END): double
> Project [debet_dt#76, sum(CASE WHEN (CAST(b_brnd_cd AS INT) IN (CAST(1 AS 
> INT))) THEN (CAST(call_dur AS DOUBLE) / CAST(60 AS DOUBLE)) ELSE CAST(0 AS 
> DOUBLE) END)#87]
> +- Sort [debet_dt#76 ASC], true
>+- Aggregate [debet_dt#76], [debet_dt#76, sum(CASE WHEN cast(b_brnd_cd#80 
> as int) IN (cast(1 as int)) THEN (cast(call_dur#81 as double) / cast(60 as 
> double)) ELSE cast(0 as double) END) AS sum(CASE WHEN (CAST(b_brnd_cd AS INT) 
> IN (CAST(1 AS INT))) THEN (CAST(call_dur AS DOUBLE) / CAST(60 AS DOUBLE)) 
> ELSE CAST(0 AS DOUBLE) END)#87]
>   +- MetastoreRelation default, cdr
> {noformat}
> The only difference is the WHEN condition, but it will result in a different 
> output column type (one is bigint, one is double). 
> We need to apply "Division" before "FunctionArgumentConversion", like below:
> {code:java}
> val typeCoercionRules =
> PropagateTypes ::
>   InConversion ::
>   WidenSetOperationTypes ::
>   PromoteStrings ::
>   DecimalPrecision ::
>   BooleanEquality ::
>   StringToIntegralCasts ::
>   Division ::
>   FunctionArgumentConversion ::
>   CaseWhenCoercion ::
>   IfCoercion ::
>   PropagateTypes ::
>   ImplicitTypeCasts ::
>   DateTimeOperations ::
>   Nil
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15931) SparkR tests failing on R 3.3.0

2016-06-13 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328432#comment-15328432
 ] 

Shivaram Venkataraman commented on SPARK-15931:
---

cc [~felixcheung] We should print out the names of the methods in expected vs. 
actual, as this has failed before as well.

> SparkR tests failing on R 3.3.0
> ---
>
> Key: SPARK-15931
> URL: https://issues.apache.org/jira/browse/SPARK-15931
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> Environment:
> # Spark master Git revision: 
> [f5d38c39255cc75325c6639561bfec1bc051f788|https://github.com/apache/spark/tree/f5d38c39255cc75325c6639561bfec1bc051f788]
> # R version: 3.3.0
> To reproduce this, just build Spark with {{-Psparkr}} and run the tests. 
> Relevant log lines:
> {noformat}
> ...
> Failed 
> -
> 1. Failure: Check masked functions (@test_context.R#44) 
> 
> length(maskedCompletely) not equal to length(namesOfMaskedCompletely).
> 1/1 mismatches
> [1] 3 - 5 == -2
> 2. Failure: Check masked functions (@test_context.R#45) 
> 
> sort(maskedCompletely) not equal to sort(namesOfMaskedCompletely).
> Lengths differ: 3 vs 5
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15176) Job Scheduling Within Application Suffers from Priority Inversion

2016-06-13 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328466#comment-15328466
 ] 

Kay Ousterhout commented on SPARK-15176:


I thought about this a little more and I think I'm in favor of maxShare instead 
of maxRunningTasks.  The reason is that maxRunningTasks seems brittle to the 
underlying setup -- if someone configures a certain maximum number of tasks, 
and then a few machines die, the maximum may no longer be reasonable (e.g., it 
may become larger than the number of machines in the cluster).  The other 
benefit is symmetry with minShare, as Mark mentioned.

[~njw45] why did you choose maxRunningTasks, as opposed to maxShare?  Are there 
other reasons that maxRunningTasks makes more sense?
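
To make the brittleness concrete, here is a small sketch; {{maxShare}} and {{maxRunningTasks}} are the hypothetical settings discussed in this thread, not existing Spark configuration:

{code}
// Illustrative only: a fractional cap tracks the cluster as it shrinks,
// while a fixed absolute cap has to be re-tuned by hand.
def capFromShare(totalCores: Int, maxShare: Double): Int =
  math.max(1, (totalCores * maxShare).toInt)

val healthyCluster = 400                      // cores before failures
val degradedCluster = 120                     // cores after a few machines die

capFromShare(healthyCluster, 0.5)             // 200 tasks allowed
capFromShare(degradedCluster, 0.5)            // 60 tasks allowed, adapts automatically

val maxRunningTasks = 200                     // fixed absolute cap
math.min(maxRunningTasks, degradedCluster)    // cap now exceeds what the cluster can run
{code}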

> Job Scheduling Within Application Suffers from Priority Inversion
> -
>
> Key: SPARK-15176
> URL: https://issues.apache.org/jira/browse/SPARK-15176
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.6.1
>Reporter: Nick White
>
> Say I have two pools, and N cores in my cluster:
> * I submit a job to one, which has M >> N tasks
> * N of the M tasks are scheduled
> * I submit a job to the second pool - but none of its tasks get scheduled 
> until a task from the other pool finishes!
> This can lead to unbounded denial-of-service for the second pool - regardless 
> of `minShare` or `weight` settings. Ideally Spark would support a pre-emption 
> mechanism, or an upper bound on a pool's resource usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15925) Replaces registerTempTable with createOrReplaceTempView in SparkR

2016-06-13 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-15925.
---
Resolution: Fixed

Issue resolved by pull request 13644
[https://github.com/apache/spark/pull/13644]

> Replaces registerTempTable with createOrReplaceTempView in SparkR
> -
>
> Key: SPARK-15925
> URL: https://issues.apache.org/jira/browse/SPARK-15925
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR, SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3155) Support DecisionTree pruning

2016-06-13 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328487#comment-15328487
 ] 

Manoj Kumar commented on SPARK-3155:


I would like to add support for pruning DecisionTrees as part of my internship.

Some API related questions:

Support for DecisionTree pruning in R is done in this way:

prune(fit, cp=)

A very straightforward extension to start with would be:

model.prune(validationData, errorTol=)

where model is a fitted DecisionTreeRegressionModel, and pruning would stop when the 
improvement in error is not above a certain tolerance. Does that sound like a 
good idea?
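
For concreteness, a hypothetical usage sketch of the proposed API; {{prune}} and {{errorTol}} are the names suggested above and do not exist in Spark ML today:

{code}
// Hypothetical API sketch only -- not part of Spark ML.
import org.apache.spark.ml.regression.DecisionTreeRegressor

// `data` is assumed to be a DataFrame with "features" and "label" columns.
val Array(training, validation) = data.randomSplit(Array(0.7, 0.3), seed = 42)

val model = new DecisionTreeRegressor()
  .setMaxDepth(10)
  .fit(training)

// Proposed: keep collapsing sibling leaves while the validation error does not
// improve by more than errorTol.
val pruned = model.prune(validation, errorTol = 0.01)
{code}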


> Support DecisionTree pruning
> 
>
> Key: SPARK-3155
> URL: https://issues.apache.org/jira/browse/SPARK-3155
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Improvement: accuracy, computation
> Summary: Pruning is a common method for preventing overfitting with decision 
> trees.  A smart implementation can prune the tree during training in order to 
> avoid training parts of the tree which would be pruned eventually anyway.  
> DecisionTree does not currently support pruning.
> Pruning:  A “pruning” of a tree is a subtree with the same root node, but 
> with zero or more branches removed.
> A naive implementation prunes as follows:
> (1) Train a depth K tree using a training set.
> (2) Compute the optimal prediction at each node (including internal nodes) 
> based on the training set.
> (3) Take a held-out validation set, and use the tree to make predictions for 
> each validation example.  This allows one to compute the validation error 
> made at each node in the tree (based on the predictions computed in step (2).)
> (4) For each pair of leaves with the same parent, compare the total error on 
> the validation set made by the leaves' predictions with the error made by the 
> parent's predictions.  Remove the leaves if the parent has lower error.
> A smarter implementation prunes during training, computing the error on the 
> validation set made by each node as it is trained.  Whenever two children 
> increase the validation error, they are pruned, and no more training is 
> required on that branch.
> It is common to use about 1/3 of the data for pruning.  Note that pruning is 
> important when using a tree directly for prediction.  It is less important 
> when combining trees via ensemble methods.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15914) Add deprecated method back to SQLContext for source code backward compatiblity

2016-06-13 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-15914:
---
Summary: Add deprecated method back to SQLContext for source code backward 
compatiblity  (was: Add deprecated method back to SQLContext for backward 
compatiblity)

> Add deprecated method back to SQLContext for source code backward compatiblity
> --
>
> Key: SPARK-15914
> URL: https://issues.apache.org/jira/browse/SPARK-15914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>
> We removed some deprecated method in SQLContext in branch Spark 2.0.
> For example:
> {code}
>   @deprecated("Use read.json(). This will be removed in Spark 2.0.", "1.4.0")
>   def jsonFile(path: String): DataFrame = {
> read.json(path)
>   }
> {code}
> These deprecated methods may be used by existing third-party data sources. We 
> probably want to add them back to retain backward compatibility. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15932) document the contract of encoder serializer expressions

2016-06-13 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-15932:
---

 Summary: document the contract of encoder serializer expressions
 Key: SPARK-15932
 URL: https://issues.apache.org/jira/browse/SPARK-15932
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15914) Add deprecated method back to SQLContext for source code backward compatiblity

2016-06-13 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-15914:
---
Description: 
We removed some deprecated method in SQLContext in branch Spark 2.0.

For example:
{code}
  @deprecated("Use read.json(). This will be removed in Spark 2.0.", "1.4.0")
  def jsonFile(path: String): DataFrame = {
read.json(path)
  }
{code}

These deprecated methods may be used by existing third-party data sources. We 
probably want to add them back to retain source-code-level backward 
compatibility. 

  was:
We removed some deprecated method in SQLContext in branch Spark 2.0.

For example:
{code}
  @deprecated("Use read.json(). This will be removed in Spark 2.0.", "1.4.0")
  def jsonFile(path: String): DataFrame = {
read.json(path)
  }
{code}

These deprecated method may be used by existing third party data source. We 
probably want to add them back to remain backward-compatibiity. 


> Add deprecated method back to SQLContext for source code backward compatiblity
> --
>
> Key: SPARK-15914
> URL: https://issues.apache.org/jira/browse/SPARK-15914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>
> We removed some deprecated method in SQLContext in branch Spark 2.0.
> For example:
> {code}
>   @deprecated("Use read.json(). This will be removed in Spark 2.0.", "1.4.0")
>   def jsonFile(path: String): DataFrame = {
> read.json(path)
>   }
> {code}
> These deprecated methods may be used by existing third-party data sources. We 
> probably want to add them back to retain source-code-level backward 
> compatibility. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15932) document the contract of encoder serializer expressions

2016-06-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328586#comment-15328586
 ] 

Apache Spark commented on SPARK-15932:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/13648

> document the contract of encoder serializer expressions
> ---
>
> Key: SPARK-15932
> URL: https://issues.apache.org/jira/browse/SPARK-15932
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15932) document the contract of encoder serializer expressions

2016-06-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15932:


Assignee: Apache Spark  (was: Wenchen Fan)

> document the contract of encoder serializer expressions
> ---
>
> Key: SPARK-15932
> URL: https://issues.apache.org/jira/browse/SPARK-15932
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15932) document the contract of encoder serializer expressions

2016-06-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15932:


Assignee: Wenchen Fan  (was: Apache Spark)

> document the contract of encoder serializer expressions
> ---
>
> Key: SPARK-15932
> URL: https://issues.apache.org/jira/browse/SPARK-15932
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15905) Driver hung while writing to console progress bar

2016-06-13 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328591#comment-15328591
 ] 

Tejas Patil commented on SPARK-15905:
-

[~zsxwing]: This does not repro consistently, only in one-off cases, and across 
different jobs. I have seen this 3-4 times in the last week. The jobs I was 
running were pure SQL queries with SELECT, JOINs and GROUP BY. Sorry, I cannot 
share the exact query or the data. But I am quite positive that this problem 
has nothing to do with the query being run.

> Driver hung while writing to console progress bar
> -
>
> Key: SPARK-15905
> URL: https://issues.apache.org/jira/browse/SPARK-15905
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Tejas Patil
>Priority: Minor
>
> This leads to driver being not able to get heartbeats from its executors and 
> job being stuck. After looking at the locking dependency amongst the driver 
> threads per the jstack, this is where the driver seems to be stuck.
> {noformat}
> "refresh progress" #113 daemon prio=5 os_prio=0 tid=0x7f7986cbc800 
> nid=0x7887d runnable [0x7f6d3507a000]
>java.lang.Thread.State: RUNNABLE
> at java.io.FileOutputStream.writeBytes(Native Method)
> at java.io.FileOutputStream.write(FileOutputStream.java:326)
> at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
> - locked <0x7f6eb81dd290> (a java.io.BufferedOutputStream)
> at java.io.PrintStream.write(PrintStream.java:482)
>- locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
> at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
> at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104)
> - locked <0x7f6eb81dd400> (a java.io.OutputStreamWriter)
> at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185)
> at java.io.PrintStream.write(PrintStream.java:527)
> - locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at java.io.PrintStream.print(PrintStream.java:669)
> at 
> org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:99)
> at 
> org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:69)
> - locked <0x7f6ed33b48a0> (a 
> org.apache.spark.ui.ConsoleProgressBar)
> at 
> org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:53)
> at java.util.TimerThread.mainLoop(Timer.java:555)
> at java.util.TimerThread.run(Timer.java:505)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-3155) Support DecisionTree pruning

2016-06-13 Thread Manoj Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Kumar updated SPARK-3155:
---
Comment: was deleted

(was: I would like to add support for pruning DecisionTrees as part of my 
internship.

Some API related questions:

Support for DecisionTree pruning in R is done in this way:

prune(fit, cp=)

A very straightforward extension would be to start would be to:

model.prune(validationData, errorTol=)

where model is a fit DecisionTreeRegressionModel would stop pruning when the 
improvement in error is not above a certain tolerance. Does that sound like a 
good idea?
)

> Support DecisionTree pruning
> 
>
> Key: SPARK-3155
> URL: https://issues.apache.org/jira/browse/SPARK-3155
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Improvement: accuracy, computation
> Summary: Pruning is a common method for preventing overfitting with decision 
> trees.  A smart implementation can prune the tree during training in order to 
> avoid training parts of the tree which would be pruned eventually anyway.  
> DecisionTree does not currently support pruning.
> Pruning:  A “pruning” of a tree is a subtree with the same root node, but 
> with zero or more branches removed.
> A naive implementation prunes as follows:
> (1) Train a depth K tree using a training set.
> (2) Compute the optimal prediction at each node (including internal nodes) 
> based on the training set.
> (3) Take a held-out validation set, and use the tree to make predictions for 
> each validation example.  This allows one to compute the validation error 
> made at each node in the tree (based on the predictions computed in step (2).)
> (4) For each pair of leaves with the same parent, compare the total error on 
> the validation set made by the leaves' predictions with the error made by the 
> parent's predictions.  Remove the leaves if the parent has lower error.
> A smarter implementation prunes during training, computing the error on the 
> validation set made by each node as it is trained.  Whenever two children 
> increase the validation error, they are pruned, and no more training is 
> required on that branch.
> It is common to use about 1/3 of the data for pruning.  Note that pruning is 
> important when using a tree directly for prediction.  It is less important 
> when combining trees via ensemble methods.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3155) Support DecisionTree pruning

2016-06-13 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328592#comment-15328592
 ] 

Manoj Kumar commented on SPARK-3155:


I would like to add support for pruning DecisionTrees as part of my internship.

Some API related questions:

Support for DecisionTree pruning in R is done in this way:

prune(fit, cp=)

A very straightforward extension to start with would be:

model.prune(validationData, errorTol=)

where model is a fitted DecisionTreeRegressionModel, and pruning would stop when the 
improvement in error is not above a certain tolerance. Does that sound like a 
good idea?


> Support DecisionTree pruning
> 
>
> Key: SPARK-3155
> URL: https://issues.apache.org/jira/browse/SPARK-3155
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Improvement: accuracy, computation
> Summary: Pruning is a common method for preventing overfitting with decision 
> trees.  A smart implementation can prune the tree during training in order to 
> avoid training parts of the tree which would be pruned eventually anyway.  
> DecisionTree does not currently support pruning.
> Pruning:  A “pruning” of a tree is a subtree with the same root node, but 
> with zero or more branches removed.
> A naive implementation prunes as follows:
> (1) Train a depth K tree using a training set.
> (2) Compute the optimal prediction at each node (including internal nodes) 
> based on the training set.
> (3) Take a held-out validation set, and use the tree to make predictions for 
> each validation example.  This allows one to compute the validation error 
> made at each node in the tree (based on the predictions computed in step (2).)
> (4) For each pair of leaves with the same parent, compare the total error on 
> the validation set made by the leaves' predictions with the error made by the 
> parent's predictions.  Remove the leaves if the parent has lower error.
> A smarter implementation prunes during training, computing the error on the 
> validation set made by each node as it is trained.  Whenever two children 
> increase the validation error, they are pruned, and no more training is 
> required on that branch.
> It is common to use about 1/3 of the data for pruning.  Note that pruning is 
> important when using a tree directly for prediction.  It is less important 
> when combining trees via ensemble methods.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15905) Driver hung while writing to console progress bar

2016-06-13 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328597#comment-15328597
 ] 

Shixiong Zhu commented on SPARK-15905:
--

[~tejasp] Probably some deadlock in Spark. It would be great if you could provide 
the full jstack output.

> Driver hung while writing to console progress bar
> -
>
> Key: SPARK-15905
> URL: https://issues.apache.org/jira/browse/SPARK-15905
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Tejas Patil
>Priority: Minor
>
> This leads to driver being not able to get heartbeats from its executors and 
> job being stuck. After looking at the locking dependency amongst the driver 
> threads per the jstack, this is where the driver seems to be stuck.
> {noformat}
> "refresh progress" #113 daemon prio=5 os_prio=0 tid=0x7f7986cbc800 
> nid=0x7887d runnable [0x7f6d3507a000]
>java.lang.Thread.State: RUNNABLE
> at java.io.FileOutputStream.writeBytes(Native Method)
> at java.io.FileOutputStream.write(FileOutputStream.java:326)
> at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
> - locked <0x7f6eb81dd290> (a java.io.BufferedOutputStream)
> at java.io.PrintStream.write(PrintStream.java:482)
>- locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
> at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
> at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104)
> - locked <0x7f6eb81dd400> (a java.io.OutputStreamWriter)
> at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185)
> at java.io.PrintStream.write(PrintStream.java:527)
> - locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at java.io.PrintStream.print(PrintStream.java:669)
> at 
> org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:99)
> at 
> org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:69)
> - locked <0x7f6ed33b48a0> (a 
> org.apache.spark.ui.ConsoleProgressBar)
> at 
> org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:53)
> at java.util.TimerThread.mainLoop(Timer.java:555)
> at java.util.TimerThread.run(Timer.java:505)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15933) Refactor reader-writer interface for streaming DFs to use DataStreamReader/Writer

2016-06-13 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-15933:
-

 Summary: Refactor reader-writer interface for streaming DFs to use 
DataStreamReader/Writer
 Key: SPARK-15933
 URL: https://issues.apache.org/jira/browse/SPARK-15933
 Project: Spark
  Issue Type: Bug
  Components: SQL, Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das


Currently, DataFrameReader/Writer has methods that are needed for streaming 
as well as non-streaming DFs. This is quite awkward because each such method 
throws a runtime exception for one case or the other. So rather than having half 
the methods throw runtime exceptions, it's better to have a different 
reader/writer API for streams.
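
A rough sketch of what the split could look like; the method and class names below follow the DataStreamReader/Writer naming in this proposal and are illustrative, not final API:

{code}
// Illustrative only: batch and streaming entry points separated, so neither
// interface needs to throw at runtime for the unsupported case.
val batchDF = spark.read.parquet("/data/events")          // DataFrameReader: batch only

val streamDF = spark.readStream                           // DataStreamReader: streams only
  .format("text")
  .load("/data/incoming")

val query = streamDF.writeStream                          // DataStreamWriter: streams only
  .format("parquet")
  .option("checkpointLocation", "/tmp/ckpt")
  .start("/data/out")
{code}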



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15905) Driver hung while writing to console progress bar

2016-06-13 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328608#comment-15328608
 ] 

Tejas Patil commented on SPARK-15905:
-

Another instance but this time not via console progress bar. This job has been 
stuck for 15+ hours.

{noformat}
"dispatcher-event-loop-23" #60 daemon prio=5 os_prio=0 tid=0x7f981e206000 
nid=0x685f8 runnable [0x7f8c0f1ef000]
   java.lang.Thread.State: RUNNABLE
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
- locked <0x7f8d48167058> (a java.io.BufferedOutputStream)
at java.io.PrintStream.write(PrintStream.java:480)
- locked <0x7f8d48167020> (a java.io.PrintStream)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
- locked <0x7f8d48237680> (a java.io.OutputStreamWriter)
at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
at org.apache.log4j.helpers.QuietWriter.flush(QuietWriter.java:59)
at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:324)
at org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
- locked <0x7f8d48235ee0> (a org.apache.log4j.ConsoleAppender)
at 
org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
at org.apache.log4j.Category.callAppenders(Category.java:206)
- locked <0x7f8d481bf1e8> (a org.apache.log4j.spi.RootLogger)
at org.apache.log4j.Category.forcedLog(Category.java:391)
at org.apache.log4j.Category.log(Category.java:856)
at org.slf4j.impl.Log4jLoggerAdapter.warn(Log4jLoggerAdapter.java:400)
at org.apache.spark.Logging$class.logWarning(Logging.scala:70)
at 
org.apache.spark.scheduler.TaskSetManager.logWarning(TaskSetManager.scala:52)
at 
org.apache.spark.scheduler.TaskSetManager.handleFailedTask(TaskSetManager.scala:721)
at 
org.apache.spark.scheduler.TaskSetManager$$anonfun$executorLost$6.apply(TaskSetManager.scala:813)
at 
org.apache.spark.scheduler.TaskSetManager$$anonfun$executorLost$6.apply(TaskSetManager.scala:807)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at 
org.apache.spark.scheduler.TaskSetManager.executorLost(TaskSetManager.scala:807)
at 
org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
at 
org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.spark.scheduler.Pool.executorLost(Pool.scala:87)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.removeExecutor(TaskSchedulerImpl.scala:536)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:474)
- locked <0x7f8d5850e1e0> (a 
org.apache.spark.scheduler.TaskSchedulerImpl)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.removeExecutor(CoarseGrainedSchedulerBackend.scala:263)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$onDisconnected$1.apply(CoarseGrainedSchedulerBackend.scala:202)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$onDisconnected$1.apply(CoarseGrainedSchedulerBackend.scala:202)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.onDisconnected(CoarseGrainedSchedulerBackend.scala:202)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:142)
at org.apache.sp

[jira] [Commented] (SPARK-15933) Refactor reader-writer interface for streaming DFs to use DataStreamReader/Writer

2016-06-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328609#comment-15328609
 ] 

Apache Spark commented on SPARK-15933:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/13653

> Refactor reader-writer interface for streaming DFs to use 
> DataStreamReader/Writer
> -
>
> Key: SPARK-15933
> URL: https://issues.apache.org/jira/browse/SPARK-15933
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> Currently, DataFrameReader/Writer has methods that are needed for 
> streaming as well as non-streaming DFs. This is quite awkward because each such 
> method throws a runtime exception for one case or the other. So rather than 
> having half the methods throw runtime exceptions, it's better to have a 
> different reader/writer API for streams.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15933) Refactor reader-writer interface for streaming DFs to use DataStreamReader/Writer

2016-06-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15933:


Assignee: Apache Spark  (was: Tathagata Das)

> Refactor reader-writer interface for streaming DFs to use 
> DataStreamReader/Writer
> -
>
> Key: SPARK-15933
> URL: https://issues.apache.org/jira/browse/SPARK-15933
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Apache Spark
>
> Currently, DataFrameReader/Writer has methods that are needed for 
> streaming as well as non-streaming DFs. This is quite awkward because each such 
> method throws a runtime exception for one case or the other. So rather than 
> having half the methods throw runtime exceptions, it's better to have a 
> different reader/writer API for streams.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15933) Refactor reader-writer interface for streaming DFs to use DataStreamReader/Writer

2016-06-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15933:


Assignee: Tathagata Das  (was: Apache Spark)

> Refactor reader-writer interface for streaming DFs to use 
> DataStreamReader/Writer
> -
>
> Key: SPARK-15933
> URL: https://issues.apache.org/jira/browse/SPARK-15933
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> Currently, DataFrameReader/Writer has methods that are needed for 
> streaming as well as non-streaming DFs. This is quite awkward because each such 
> method throws a runtime exception for one case or the other. So rather than 
> having half the methods throw runtime exceptions, it's better to have a 
> different reader/writer API for streams.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15905) Driver hung while writing to console progress bar

2016-06-13 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328627#comment-15328627
 ] 

Shixiong Zhu commented on SPARK-15905:
--

Do you have the whole jstack output? I guess some places holds the lock of 
`System.err` but needs the whole output for all threads to find the place.

> Driver hung while writing to console progress bar
> -
>
> Key: SPARK-15905
> URL: https://issues.apache.org/jira/browse/SPARK-15905
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Tejas Patil
>Priority: Minor
>
> This leads to driver being not able to get heartbeats from its executors and 
> job being stuck. After looking at the locking dependency amongst the driver 
> threads per the jstack, this is where the driver seems to be stuck.
> {noformat}
> "refresh progress" #113 daemon prio=5 os_prio=0 tid=0x7f7986cbc800 
> nid=0x7887d runnable [0x7f6d3507a000]
>java.lang.Thread.State: RUNNABLE
> at java.io.FileOutputStream.writeBytes(Native Method)
> at java.io.FileOutputStream.write(FileOutputStream.java:326)
> at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
> - locked <0x7f6eb81dd290> (a java.io.BufferedOutputStream)
> at java.io.PrintStream.write(PrintStream.java:482)
>- locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
> at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
> at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104)
> - locked <0x7f6eb81dd400> (a java.io.OutputStreamWriter)
> at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185)
> at java.io.PrintStream.write(PrintStream.java:527)
> - locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at java.io.PrintStream.print(PrintStream.java:669)
> at 
> org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:99)
> at 
> org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:69)
> - locked <0x7f6ed33b48a0> (a 
> org.apache.spark.ui.ConsoleProgressBar)
> at 
> org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:53)
> at java.util.TimerThread.mainLoop(Timer.java:555)
> at java.util.TimerThread.run(Timer.java:505)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15905) Driver hung while writing to console progress bar

2016-06-13 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328627#comment-15328627
 ] 

Shixiong Zhu edited comment on SPARK-15905 at 6/13/16 11:42 PM:


Do you have the whole jstack output? I guess some place holds the lock on 
`System.err`, but the whole output for all threads is needed to find the place.


was (Author: zsxwing):
Do you have the whole jstack output? I guess some places holds the lock of 
`System.err` but needs the whole output for all threads to find the place.

> Driver hung while writing to console progress bar
> -
>
> Key: SPARK-15905
> URL: https://issues.apache.org/jira/browse/SPARK-15905
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Tejas Patil
>Priority: Minor
>
> This leads to driver being not able to get heartbeats from its executors and 
> job being stuck. After looking at the locking dependency amongst the driver 
> threads per the jstack, this is where the driver seems to be stuck.
> {noformat}
> "refresh progress" #113 daemon prio=5 os_prio=0 tid=0x7f7986cbc800 
> nid=0x7887d runnable [0x7f6d3507a000]
>java.lang.Thread.State: RUNNABLE
> at java.io.FileOutputStream.writeBytes(Native Method)
> at java.io.FileOutputStream.write(FileOutputStream.java:326)
> at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
> - locked <0x7f6eb81dd290> (a java.io.BufferedOutputStream)
> at java.io.PrintStream.write(PrintStream.java:482)
>- locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
> at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
> at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104)
> - locked <0x7f6eb81dd400> (a java.io.OutputStreamWriter)
> at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185)
> at java.io.PrintStream.write(PrintStream.java:527)
> - locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at java.io.PrintStream.print(PrintStream.java:669)
> at 
> org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:99)
> at 
> org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:69)
> - locked <0x7f6ed33b48a0> (a 
> org.apache.spark.ui.ConsoleProgressBar)
> at 
> org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:53)
> at java.util.TimerThread.mainLoop(Timer.java:555)
> at java.util.TimerThread.run(Timer.java:505)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15905) Driver hung while writing to console progress bar

2016-06-13 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328638#comment-15328638
 ] 

Shixiong Zhu commented on SPARK-15905:
--

Oh, the thread state is `RUNNABLE`, so it's not a deadlock. Could you check your 
disk? Maybe a bad disk is causing the hang.

> Driver hung while writing to console progress bar
> -
>
> Key: SPARK-15905
> URL: https://issues.apache.org/jira/browse/SPARK-15905
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Tejas Patil
>Priority: Minor
>
> This leads to driver being not able to get heartbeats from its executors and 
> job being stuck. After looking at the locking dependency amongst the driver 
> threads per the jstack, this is where the driver seems to be stuck.
> {noformat}
> "refresh progress" #113 daemon prio=5 os_prio=0 tid=0x7f7986cbc800 
> nid=0x7887d runnable [0x7f6d3507a000]
>java.lang.Thread.State: RUNNABLE
> at java.io.FileOutputStream.writeBytes(Native Method)
> at java.io.FileOutputStream.write(FileOutputStream.java:326)
> at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
> - locked <0x7f6eb81dd290> (a java.io.BufferedOutputStream)
> at java.io.PrintStream.write(PrintStream.java:482)
>- locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
> at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
> at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104)
> - locked <0x7f6eb81dd400> (a java.io.OutputStreamWriter)
> at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185)
> at java.io.PrintStream.write(PrintStream.java:527)
> - locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at java.io.PrintStream.print(PrintStream.java:669)
> at 
> org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:99)
> at 
> org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:69)
> - locked <0x7f6ed33b48a0> (a 
> org.apache.spark.ui.ConsoleProgressBar)
> at 
> org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:53)
> at java.util.TimerThread.mainLoop(Timer.java:555)
> at java.util.TimerThread.run(Timer.java:505)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15905) Driver hung while writing to console progress bar

2016-06-13 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328640#comment-15328640
 ] 

Shixiong Zhu commented on SPARK-15905:
--

By the way, how did you use Spark? Did you just run it or call it via some 
Process APIs?

> Driver hung while writing to console progress bar
> -
>
> Key: SPARK-15905
> URL: https://issues.apache.org/jira/browse/SPARK-15905
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Tejas Patil
>Priority: Minor
>
> This leads to driver being not able to get heartbeats from its executors and 
> job being stuck. After looking at the locking dependency amongst the driver 
> threads per the jstack, this is where the driver seems to be stuck.
> {noformat}
> "refresh progress" #113 daemon prio=5 os_prio=0 tid=0x7f7986cbc800 
> nid=0x7887d runnable [0x7f6d3507a000]
>java.lang.Thread.State: RUNNABLE
> at java.io.FileOutputStream.writeBytes(Native Method)
> at java.io.FileOutputStream.write(FileOutputStream.java:326)
> at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
> - locked <0x7f6eb81dd290> (a java.io.BufferedOutputStream)
> at java.io.PrintStream.write(PrintStream.java:482)
>- locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
> at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
> at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104)
> - locked <0x7f6eb81dd400> (a java.io.OutputStreamWriter)
> at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185)
> at java.io.PrintStream.write(PrintStream.java:527)
> - locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at java.io.PrintStream.print(PrintStream.java:669)
> at 
> org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:99)
> at 
> org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:69)
> - locked <0x7f6ed33b48a0> (a 
> org.apache.spark.ui.ConsoleProgressBar)
> at 
> org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:53)
> at java.util.TimerThread.mainLoop(Timer.java:555)
> at java.util.TimerThread.run(Timer.java:505)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15905) Driver hung while writing to console progress bar

2016-06-13 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328648#comment-15328648
 ] 

Shixiong Zhu commented on SPARK-15905:
--

The last time I encountered FileOutputStream.writeBytes hanging, it was because I 
had created a Process in Java but didn't consume its input stream and error stream. 
Eventually, the underlying buffer filled up and blocked the Process.
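
For reference, a minimal sketch of that failure mode and the usual fix; the command name is a placeholder:

{code}
// Illustrative only: if stdout/stderr of a child process are never read, the
// OS pipe buffer fills and writes from the child (or to inherited descriptors) block.
// Draining both streams in background threads avoids the hang.
import java.io.{BufferedReader, InputStreamReader}

val proc = new ProcessBuilder("some-command").start()

def drain(in: java.io.InputStream, name: String): Unit = {
  val t = new Thread(new Runnable {
    override def run(): Unit = {
      val reader = new BufferedReader(new InputStreamReader(in))
      var line = reader.readLine()
      while (line != null) { line = reader.readLine() }   // discard, just keep the pipe empty
    }
  }, s"drain-$name")
  t.setDaemon(true)
  t.start()
}

drain(proc.getInputStream, "stdout")
drain(proc.getErrorStream, "stderr")
proc.waitFor()
{code}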

> Driver hung while writing to console progress bar
> -
>
> Key: SPARK-15905
> URL: https://issues.apache.org/jira/browse/SPARK-15905
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Tejas Patil
>Priority: Minor
>
> This leads to driver being not able to get heartbeats from its executors and 
> job being stuck. After looking at the locking dependency amongst the driver 
> threads per the jstack, this is where the driver seems to be stuck.
> {noformat}
> "refresh progress" #113 daemon prio=5 os_prio=0 tid=0x7f7986cbc800 
> nid=0x7887d runnable [0x7f6d3507a000]
>java.lang.Thread.State: RUNNABLE
> at java.io.FileOutputStream.writeBytes(Native Method)
> at java.io.FileOutputStream.write(FileOutputStream.java:326)
> at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
> - locked <0x7f6eb81dd290> (a java.io.BufferedOutputStream)
> at java.io.PrintStream.write(PrintStream.java:482)
>- locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
> at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
> at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104)
> - locked <0x7f6eb81dd400> (a java.io.OutputStreamWriter)
> at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185)
> at java.io.PrintStream.write(PrintStream.java:527)
> - locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at java.io.PrintStream.print(PrintStream.java:669)
> at 
> org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:99)
> at 
> org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:69)
> - locked <0x7f6ed33b48a0> (a 
> org.apache.spark.ui.ConsoleProgressBar)
> at 
> org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:53)
> at java.util.TimerThread.mainLoop(Timer.java:555)
> at java.util.TimerThread.run(Timer.java:505)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15929) DataFrameSuite path globbing error message tests are not fully portable

2016-06-13 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-15929.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13649
[https://github.com/apache/spark/pull/13649]

> DataFrameSuite path globbing error message tests are not fully portable
> ---
>
> Key: SPARK-15929
> URL: https://issues.apache.org/jira/browse/SPARK-15929
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> The DataFrameSuite regression tests for SPARK-13774 fail in my environment 
> because they attempt to glob over all of {{/mnt}} and some of the 
> subdirectories in there have restrictive permissions, which causes the test to 
> fail. I think we should rewrite this test to not depend on existing OS paths.
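
One way to rewrite it, shown as a sketch only: glob inside a freshly created temporary directory so permissions on unrelated system paths cannot interfere. This assumes ScalaTest's {{intercept}} and that the resulting AnalysisException message still mentions the missing path; the exact message text is an assumption:

{code}
// Illustrative only: keep the glob inside a temp directory the test owns.
import java.nio.file.Files
import org.scalatest.Assertions._

val tmpDir = Files.createTempDirectory("spark-13774-glob").toFile
try {
  val badGlob = new java.io.File(tmpDir, "missing/*/*.parquet").getCanonicalPath
  val e = intercept[org.apache.spark.sql.AnalysisException] {
    spark.read.parquet(badGlob)
  }
  // The regression test only needs to see the offending path echoed back.
  assert(e.getMessage.contains(badGlob) || e.getMessage.contains("Path does not exist"))
} finally {
  tmpDir.delete()
}
{code}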



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.10 Consumer API

2016-06-13 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328699#comment-15328699
 ] 

Mark Grover commented on SPARK-12177:
-

Hi Ismael and Cody,
My personal opinion was to hold off because a) The new consumer API was still 
marked as beta, and so I wasn't sure of the compatibility guarantees, which 
Kafka did seem to break a little (as discussed 
[here|http://mail-archives.apache.org/mod_mbox/kafka-dev/201605.mbox/%3CCAKm=r7v5jgg9qxgjioczdph9vej57m46ngy_626kiq-ovdx...@mail.gmail.com%3E])
 b) the real benefit is security - I am personally a little more biased towards 
authentication (Kerberos) than encryption, so I was just waiting for delegation 
tokens to land. 

Now that 0.10.0 is released, there's a good chance delegation tokens will 
land in Kafka 0.11.0, and the new consumer API is marked stable, so I am more open 
to this PR being merged; it's been around for too long anyway. Cody, what do 
you say? Any reason you'd want to wait? If not, we can make a case for this 
going in now.

As far as the logistics of whether this belongs in Apache Bahir or not - today, I 
don't have a strong opinion on where Kafka integration should reside. What I do 
feel strongly about, like Cody said, is that the old consumer API integration 
and the new consumer API integration should reside in the same place. Since the old 
integration is in Spark, that's where the new one makes sense. If a vote on Apache 
Spark results in Kafka integration being taken out, both the new and the old in 
Apache Bahir would make sense.

> Update KafkaDStreams to new Kafka 0.10 Consumer API
> ---
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that is 
> not compatible with the old one. So, I added the new consumer API. I made separate 
> classes in the package org.apache.spark.streaming.kafka.v09 with the changed API. I 
> didn't remove the old classes, for better backward compatibility. Users will not need 
> to change their old Spark applications when they upgrade to the new Spark version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15934) Return binary mode in ThriftServer

2016-06-13 Thread Egor Pahomov (JIRA)
Egor Pahomov created SPARK-15934:


 Summary: Return binary mode in ThriftServer
 Key: SPARK-15934
 URL: https://issues.apache.org/jira/browse/SPARK-15934
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Egor Pahomov


In the spark-2.0.0 preview, binary mode was turned off (SPARK-15095). 
That was a greatly irresponsible step, given that in 1.6.1 binary mode 
was the default, while it is turned off in 2.0.0.

Just to describe the magnitude of harm that not fixing this bug would do in my 
organization:

* Tableau works only through the Thrift Server and only with the binary format. Tableau 
would not work with spark-2.0.0 at all!
* I have a bunch of analysts in my organization with configured SQL 
clients (DataGrip and Squirrel). I would need to go one by one and change the 
connection string for them (DataGrip). Squirrel simply does not work with HTTP - 
some jar hell in my case.
* Let me not mention all the other stuff that connects to our data infrastructure 
through the ThriftServer as a gateway. 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.10 Consumer API

2016-06-13 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328707#comment-15328707
 ] 

Cody Koeninger commented on SPARK-12177:


I don't think waiting for 0.11 makes sense.



> Update KafkaDStreams to new Kafka 0.10 Consumer API
> ---
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that is 
> not compatible with the old one. So, I added the new consumer API. I made separate 
> classes in the package org.apache.spark.streaming.kafka.v09 with the changed API. I 
> didn't remove the old classes, for better backward compatibility. Users will not need 
> to change their old Spark applications when they upgrade to the new Spark version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15690) Fast single-node (single-process) in-memory shuffle

2016-06-13 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328717#comment-15328717
 ] 

Shivaram Venkataraman commented on SPARK-15690:
---

Yeah, I don't think you'll see much improvement from avoiding the DAGScheduler. 
One more thing to try here is to avoid serialization / deserialization unless 
you are going to spill to disk. That'll save a lot of time inside a single node.

> Fast single-node (single-process) in-memory shuffle
> ---
>
> Key: SPARK-15690
> URL: https://issues.apache.org/jira/browse/SPARK-15690
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, SQL
>Reporter: Reynold Xin
>
> Spark's current shuffle implementation sorts all intermediate data by their 
> partition id, and then writes the data to disk. This is not a big bottleneck 
> because the network throughput on commodity clusters tends to be low. However, 
> an increasing number of Spark users are using the system to process data on a 
> single-node. When in a single node operating against intermediate data that 
> fits in memory, the existing shuffle code path can become a big bottleneck.
> The goal of this ticket is to change Spark so it can use in-memory radix sort 
> to do data shuffling on a single node, and still gracefully fallback to disk 
> if the data size does not fit in memory. Given that the number of partitions is 
> usually small (say less than 256), it'd require only a single pass to do the 
> radix sort with pretty decent CPU efficiency.
> Note that there have been many in-memory shuffle attempts in the past. This 
> ticket has a smaller scope (single-process), and aims to actually 
> productionize this code.
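
The core trick, in toy form (a sketch of the idea only, not Spark's shuffle code): because partition ids are small integers, records can be bucketed in a single counting-style pass rather than comparison-sorted:

{code}
// Illustrative only: one O(n) pass groups records by partition id, which is the
// property a radix/counting sort over partition ids exploits.
def bucketByPartition[T](records: Iterator[(Int, T)], numPartitions: Int): Array[Vector[T]] = {
  val buckets = Array.fill(numPartitions)(Vector.newBuilder[T])
  records.foreach { case (pid, rec) => buckets(pid) += rec }
  buckets.map(_.result())
}

// e.g. 6 records spread over 4 partitions
val grouped = bucketByPartition(
  Iterator((2, "a"), (0, "b"), (2, "c"), (1, "d"), (3, "e"), (0, "f")), numPartitions = 4)
{code}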



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15935) Enable test for sql/streaming.py and fix these tests

2016-06-13 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-15935:


 Summary: Enable test for sql/streaming.py and fix these tests
 Key: SPARK-15935
 URL: https://issues.apache.org/jira/browse/SPARK-15935
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


Right now the tests in sql/streaming.py are disabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15910) Schema is not checked when converting DataFrame to Dataset using Kryo encoder

2016-06-13 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-15910:

Assignee: Sean Owen

> Schema is not checked when converting DataFrame to Dataset using Kryo encoder
> -
>
> Key: SPARK-15910
> URL: https://issues.apache.org/jira/browse/SPARK-15910
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>Assignee: Sean Owen
> Fix For: 2.0.0
>
>
> Here is the case to reproduce it:
> {code}
> scala> import org.apache.spark.sql.Encoders._
> scala> import org.apache.spark.sql.Encoders
> scala> import org.apache.spark.sql.Encoder
> scala> case class B(b: Int)
> scala> implicit val encoder = Encoders.kryo[B]
> encoder: org.apache.spark.sql.Encoder[B] = class[value[0]: binary]
> scala> val ds = Seq((1)).toDF("b").as[B].map(identity)
> ds: org.apache.spark.sql.Dataset[B] = [value: binary]
> scala> ds.show()
> 16/06/10 13:46:51 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 45, Column 168: No applicable constructor/method found for actual parameters 
> "int"; candidates are: "public static java.nio.ByteBuffer 
> java.nio.ByteBuffer.wrap(byte[])", "public static java.nio.ByteBuffer 
> java.nio.ByteBuffer.wrap(byte[], int, int)"
> ...
> {code}
> The expected behavior is to report the schema check failure earlier, when 
> creating the Dataset with {code}dataFrame.as[B]{code}.
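
For illustration only, a hedged sketch of the kind of early check the ticket asks for: a Kryo encoder serializes to a single binary column, so {code}df.as[B]{code} could compare the encoder's expected schema with the DataFrame's actual schema (here a single int column "b") and fail fast instead of failing later at codegen time. This is not the fix that was merged; checkEncoderSchema is a hypothetical helper.

{code}
import org.apache.spark.sql.{DataFrame, Encoder}
import org.apache.spark.sql.types.StructType

object SchemaCheck {
  // Fail fast if the DataFrame's schema cannot be bound to the encoder's schema.
  def checkEncoderSchema[T](df: DataFrame, enc: Encoder[T]): Unit = {
    val expected: StructType = enc.schema   // e.g. [value: binary] for a Kryo encoder
    val actual: StructType   = df.schema    // e.g. [b: int] in the reproducer above
    val compatible = expected.fields.length == actual.fields.length &&
      expected.fields.zip(actual.fields).forall { case (e, a) => e.dataType == a.dataType }
    require(compatible,
      s"Cannot convert DataFrame with schema $actual to Dataset expecting schema $expected")
  }
}
{code}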



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15910) Schema is not checked when converting DataFrame to Dataset using Kryo encoder

2016-06-13 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-15910:

Assignee: Sean Zhong  (was: Sean Owen)

> Schema is not checked when converting DataFrame to Dataset using Kryo encoder
> -
>
> Key: SPARK-15910
> URL: https://issues.apache.org/jira/browse/SPARK-15910
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>Assignee: Sean Zhong
> Fix For: 2.0.0
>
>
> Here is the case to reproduce it:
> {code}
> scala> import org.apache.spark.sql.Encoders._
> scala> import org.apache.spark.sql.Encoders
> scala> import org.apache.spark.sql.Encoder
> scala> case class B(b: Int)
> scala> implicit val encoder = Encoders.kryo[B]
> encoder: org.apache.spark.sql.Encoder[B] = class[value[0]: binary]
> scala> val ds = Seq((1)).toDF("b").as[B].map(identity)
> ds: org.apache.spark.sql.Dataset[B] = [value: binary]
> scala> ds.show()
> 16/06/10 13:46:51 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 45, Column 168: No applicable constructor/method found for actual parameters 
> "int"; candidates are: "public static java.nio.ByteBuffer 
> java.nio.ByteBuffer.wrap(byte[])", "public static java.nio.ByteBuffer 
> java.nio.ByteBuffer.wrap(byte[], int, int)"
> ...
> {code}
> The expected behavior is to report the schema check failure earlier, when 
> creating the Dataset with {code}dataFrame.as[B]{code}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15910) Schema is not checked when converting DataFrame to Dataset using Kryo encoder

2016-06-13 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-15910.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13632
[https://github.com/apache/spark/pull/13632]

> Schema is not checked when converting DataFrame to Dataset using Kryo encoder
> -
>
> Key: SPARK-15910
> URL: https://issues.apache.org/jira/browse/SPARK-15910
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
> Fix For: 2.0.0
>
>
> Here is the case to reproduce it:
> {code}
> scala> import org.apache.spark.sql.Encoders._
> scala> import org.apache.spark.sql.Encoders
> scala> import org.apache.spark.sql.Encoder
> scala> case class B(b: Int)
> scala> implicit val encoder = Encoders.kryo[B]
> encoder: org.apache.spark.sql.Encoder[B] = class[value[0]: binary]
> scala> val ds = Seq((1)).toDF("b").as[B].map(identity)
> ds: org.apache.spark.sql.Dataset[B] = [value: binary]
> scala> ds.show()
> 16/06/10 13:46:51 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 45, Column 168: No applicable constructor/method found for actual parameters 
> "int"; candidates are: "public static java.nio.ByteBuffer 
> java.nio.ByteBuffer.wrap(byte[])", "public static java.nio.ByteBuffer 
> java.nio.ByteBuffer.wrap(byte[], int, int)"
> ...
> {code}
> The expected behavior is to report the schema check failure earlier, when 
> creating the Dataset with {code}dataFrame.as[B]{code}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15868) Executors table in Executors tab should sort Executor IDs in numerical order (not alphabetical order)

2016-06-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328723#comment-15328723
 ] 

Apache Spark commented on SPARK-15868:
--

User 'ajbozarth' has created a pull request for this issue:
https://github.com/apache/spark/pull/13654

> Executors table in Executors tab should sort Executor IDs in numerical order 
> (not alphabetical order)
> -
>
> Key: SPARK-15868
> URL: https://issues.apache.org/jira/browse/SPARK-15868
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Jacek Laskowski
>Priority: Minor
> Attachments: spark-webui-executors-sorting-2.png, 
> spark-webui-executors-sorting.png
>
>
> It _appears_ that the Executors table in the Executors tab sorts Executor IDs 
> in alphabetical order while it should sort them numerically. The sorting is 
> done in a more "friendly" way, yet the driver executor appears between 0 and 1?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


