[jira] [Commented] (SPARK-12350) VectorAssembler#transform() initially throws an exception
[ https://issues.apache.org/jira/browse/SPARK-12350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060650#comment-15060650 ] Jakob Odersky commented on SPARK-12350: --- A git-bisect showed that the issue was introduced in 4a46b8859d3314b5b45a67cdc5c81fecb6e9e78c, a commit that fixes SPARK-11563. [~vanzin], any idea what could have gone wrong? > VectorAssembler#transform() initially throws an exception > - > > Key: SPARK-12350 > URL: https://issues.apache.org/jira/browse/SPARK-12350 > Project: Spark > Issue Type: Bug > Components: ML > Environment: sparkShell command from sbt >Reporter: Jakob Odersky > > Calling VectorAssembler.transform() initially throws an exception, subsequent > calls work. > h3. Steps to reproduce > In spark-shell, > 1. Create a dummy dataframe and define an assembler > {code} > import org.apache.spark.ml.feature.VectorAssembler > val df = sc.parallelize(List((1,2), (3,4))).toDF > val assembler = new VectorAssembler().setInputCols(Array("_1", > "_2")).setOutputCol("features") > {code} > 2. 
Run > {code} > assembler.transform(df).show > {code} > Initially the following exception is thrown: > {code} > 15/12/15 16:20:19 ERROR TransportRequestHandler: Error opening stream > /classes/org/apache/spark/sql/catalyst/expressions/Object.class for request > from /9.72.139.102:60610 > java.lang.IllegalArgumentException: requirement failed: File not found: > /classes/org/apache/spark/sql/catalyst/expressions/Object.class > at scala.Predef$.require(Predef.scala:233) > at > org.apache.spark.rpc.netty.NettyStreamManager.openStream(NettyStreamManager.scala:60) > at > org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136) > at > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:106) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) > at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > {code} > Subsequent calls work: > {code} > +---+---+---------+ > | _1| _2| features| > +---+---+---------+ > | 1| 2|[1.0,2.0]| > | 3| 4|[3.0,4.0]| > +---+---+---------+ > {code} > It seems as though there is some internal state that is not initialized. > [~iyounus] originally found this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12374) Improve performance of range API via adding logical/physical operators
Xiao Li created SPARK-12374: --- Summary: Improve performance of range API via adding logical/physical operators Key: SPARK-12374 URL: https://issues.apache.org/jira/browse/SPARK-12374 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.0 Reporter: Xiao Li Priority: Critical Create an actual logical/physical operator for range to match the performance of the RDD range APIs.
[jira] [Assigned] (SPARK-12374) Improve performance of Range APIs via adding logical/physical operators
[ https://issues.apache.org/jira/browse/SPARK-12374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12374: Assignee: Apache Spark
[jira] [Assigned] (SPARK-12374) Improve performance of Range APIs via adding logical/physical operators
[ https://issues.apache.org/jira/browse/SPARK-12374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12374: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-12350) VectorAssembler#transform() initially throws an exception
[ https://issues.apache.org/jira/browse/SPARK-12350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12350: Assignee: Apache Spark
[jira] [Commented] (SPARK-12372) Unary operator "-" fails for MLlib vectors
[ https://issues.apache.org/jira/browse/SPARK-12372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060858#comment-15060858 ] Christos Iraklis Tsatsoulis commented on SPARK-12372: - If this is the case, then a warning/clarification in the documentation wouldn't hurt - Spark users are not supposed to be aware of the internal "ongoing discussions" between Spark developers (BTW, any relevant link would be very welcome - I could not find any mention in the MLlib & Breeze docs, nor in the recent preprint papers on linalg & MLlib). All in all, I suggest you re-open the issue with a different type (it's not a bug, as you say), with the required resolution being a notification in the relevant docs ("don't try this..., because..."). > Unary operator "-" fails for MLlib vectors > -- > > Key: SPARK-12372 > URL: https://issues.apache.org/jira/browse/SPARK-12372 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.5.2 >Reporter: Christos Iraklis Tsatsoulis > > Consider the following snippet in pyspark 1.5.2: > {code:none} > >>> from pyspark.mllib.linalg import Vectors > >>> x = Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0]) > >>> x > DenseVector([0.0, 1.0, 0.0, 7.0, 0.0]) > >>> -x > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > TypeError: func() takes exactly 2 arguments (1 given) > >>> y = Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]) > >>> y > DenseVector([2.0, 0.0, 3.0, 4.0, 5.0]) > >>> x-y > DenseVector([-2.0, 1.0, -3.0, 3.0, -5.0]) > >>> -y+x > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > TypeError: func() takes exactly 2 arguments (1 given) > >>> -1*x > DenseVector([-0.0, -1.0, -0.0, -7.0, -0.0]) > {code} > Clearly, the unary operator {{-}} (minus) for vectors fails, giving errors > for expressions like {{-x}} and {{-y+x}}, despite the fact that {{x-y}} > behaves as expected. 
> The last operation, {{-1*x}}, although mathematically "correct", includes > minus signs for the zero entries, which again is normally not expected.
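The reported TypeError ("func() takes exactly 2 arguments (1 given)") is consistent with the binary operators being generated from two-argument helper functions while no unary `__neg__` is defined. The following is a minimal pure-Python sketch, not pyspark's actual implementation, illustrating how defining `__neg__` would make `-x` and `-y+x` behave like `x-y` already does:

```python
# Hypothetical minimal dense vector, for illustration only (not the
# pyspark.mllib.linalg.DenseVector source). Binary operators work
# elementwise; the key addition is __neg__, which Python calls for the
# unary minus and which takes only `self`.

class DenseVector:
    def __init__(self, values):
        self.values = list(values)

    def __add__(self, other):
        return DenseVector(a + b for a, b in zip(self.values, other.values))

    def __sub__(self, other):
        return DenseVector(a - b for a, b in zip(self.values, other.values))

    def __neg__(self):
        # Unary minus: negate every entry.
        return DenseVector(-v for v in self.values)

    def __repr__(self):
        return "DenseVector(%r)" % (self.values,)


x = DenseVector([0.0, 1.0, 0.0, 7.0, 0.0])
y = DenseVector([2.0, 0.0, 3.0, 4.0, 5.0])
print(x - y)   # works, as in the report
print(-y + x)  # with __neg__ defined, this no longer raises
```

Note that `(-v for v in ...)` still produces `-0.0` for zero entries, matching the `-1*x` behavior the reporter observed; IEEE 754 negation of `0.0` is `-0.0`, and the two compare equal.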
[jira] [Commented] (SPARK-12350) VectorAssembler#transform() initially throws an exception
[ https://issues.apache.org/jira/browse/SPARK-12350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060873#comment-15060873 ] Apache Spark commented on SPARK-12350: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/10337
[jira] [Commented] (SPARK-12350) VectorAssembler#transform() initially throws an exception
[ https://issues.apache.org/jira/browse/SPARK-12350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060732#comment-15060732 ] Jakob Odersky commented on SPARK-12350: --- No functionality is broken, so if the exception can be silenced it would be a possible fix. However, even if there is no loss of functionality, should the exception not be treated as an error?
[jira] [Commented] (SPARK-12350) VectorAssembler#transform() initially throws an exception
[ https://issues.apache.org/jira/browse/SPARK-12350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060819#comment-15060819 ] Jakob Odersky commented on SPARK-12350: --- Ok, thanks!
[jira] [Commented] (SPARK-12331) R^2 for regression through the origin
[ https://issues.apache.org/jira/browse/SPARK-12331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060875#comment-15060875 ] Joseph K. Bradley commented on SPARK-12331: --- +1 for this change based on the description (though I haven't checked the code & references myself). CCing [~dbtsai] It'd be great to add a unit test comparing with R results on the same data. > R^2 for regression through the origin > - > > Key: SPARK-12331 > URL: https://issues.apache.org/jira/browse/SPARK-12331 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Imran Younus >Priority: Minor > > The value of R^2 (coefficient of determination) obtained from > LinearRegressionModel is not consistent with R and statsmodels when > fitIntercept is false, i.e., regression through the origin. In this case, both > R and statsmodels use the definition of R^2 given by eq(4') in the following > review paper: > https://online.stat.psu.edu/~ajw13/stat501/SpecialTopics/Reg_thru_origin.pdf > Here is the definition from this paper: > R^2 = \sum_i \hat{y}_i^2 / \sum_i y_i^2 > The paper also describes why this should be the case. I've double checked > that the value of R^2 from statsmodels and R is consistent with this > definition. On the other hand, scikit-learn doesn't use the above definition. > I would recommend using the above definition in Spark.
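The proposed definition is easy to state concretely. Below is a small pure-Python sketch (illustrative only, not Spark's LinearRegressionModel code) of R^2 through the origin per eq(4'): fit the no-intercept least-squares slope, then take the ratio of the sum of squared fitted values to the sum of squared observed values. The toy data here is invented for the example.

```python
# R^2 for regression through the origin, following eq(4') of the cited
# review: R^2 = sum(yhat_i^2) / sum(y_i^2), where yhat comes from the
# no-intercept least-squares fit (slope = sum(x*y) / sum(x^2)).

def r2_through_origin(x, y):
    slope = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    yhat = [slope * a for a in x]
    return sum(v * v for v in yhat) / sum(v * v for v in y)


# Toy data roughly on the line y = 2x (hypothetical, for illustration).
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.1, 5.9, 8.2]
print(r2_through_origin(x, y))  # close to, but below, 1.0
```

A unit test like the one suggested above could compare this value against R's `lm(y ~ x + 0)` summary on the same data.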
[jira] [Created] (SPARK-12371) Make sure Dataset nullability conforms to its underlying logical plan
Cheng Lian created SPARK-12371: -- Summary: Make sure Dataset nullability conforms to its underlying logical plan Key: SPARK-12371 URL: https://issues.apache.org/jira/browse/SPARK-12371 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.6.0, 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian Currently it's possible to construct a Dataset with different nullability from its underlying logical plan, which should be caught during analysis phase: {code} val rowRDD = sqlContext.sparkContext.parallelize(Seq(Row("hello"), Row(null))) val schema = StructType(Seq(StructField("_1", StringType, nullable = false))) val df = sqlContext.createDataFrame(rowRDD, schema) df.as[Tuple1[String]].collect().foreach(println) // Output: // // (hello) // (null) {code}
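The kind of check the issue asks for can be sketched in a few lines. This is a hypothetical pure-Python validator (not Spark's analyzer): a row containing a null in a column whose schema declares `nullable = false` should be rejected up front rather than silently flowing through, as happens in the snippet above.

```python
# Illustrative sketch: reject rows that violate a non-nullable schema
# field, instead of letting nulls slip through. `schema` is a list of
# (column_name, nullable) pairs, one per column.

def validate(rows, schema):
    for row in rows:
        for value, (name, nullable) in zip(row, schema):
            if value is None and not nullable:
                raise ValueError("null value in non-nullable column %r" % name)
    return rows


schema = [("_1", False)]
validate([("hello",)], schema)      # fine
try:
    validate([("hello",), (None,)], schema)
except ValueError as e:
    print(e)                        # the case Spark currently misses
```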
[jira] [Commented] (SPARK-12345) Mesos cluster mode is broken
[ https://issues.apache.org/jira/browse/SPARK-12345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060475#comment-15060475 ] Apache Spark commented on SPARK-12345: -- User 'tnachen' has created a pull request for this issue: https://github.com/apache/spark/pull/10332 > Mesos cluster mode is broken > > > Key: SPARK-12345 > URL: https://issues.apache.org/jira/browse/SPARK-12345 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.6.0 >Reporter: Andrew Or >Assignee: Apache Spark >Priority: Critical > > The same setup worked in 1.5.2 but is now failing for 1.6.0-RC2. > The driver is confused about where SPARK_HOME is. It resolves > `mesos.executor.uri` or `spark.mesos.executor.home` relative to the > filesystem where the driver runs, which is wrong. > {code} > I1215 15:00:39.411212 28032 exec.cpp:134] Version: 0.25.0 > I1215 15:00:39.413512 28037 exec.cpp:208] Executor registered on slave > 130bdc39-44e7-4256-8c22-602040d337f1-S1 > bin/spark-submit: line 27: > /Users/dragos/workspace/Spark/dev/rc-tests/spark-1.6.0-bin-hadoop2.6/bin/spark-class: > No such file or directory > {code}
[jira] [Commented] (SPARK-12054) Consider nullable in codegen
[ https://issues.apache.org/jira/browse/SPARK-12054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060489#comment-15060489 ] Apache Spark commented on SPARK-12054: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/10333 > Consider nullable in codegen > > > Key: SPARK-12054 > URL: https://issues.apache.org/jira/browse/SPARK-12054 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > Currently, we always check the nullability for results of expressions, we > could skip that if the expression is not nullable.
[jira] [Updated] (SPARK-12361) Should set PYSPARK_DRIVER_PYTHON before python test
[ https://issues.apache.org/jira/browse/SPARK-12361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-12361: --- Assignee: Jeff Zhang > Should set PYSPARK_DRIVER_PYTHON before python test > --- > > Key: SPARK-12361 > URL: https://issues.apache.org/jira/browse/SPARK-12361 > Project: Spark > Issue Type: Improvement > Components: PySpark, Tests >Affects Versions: 1.6.0 >Reporter: Jeff Zhang >Assignee: Jeff Zhang >Priority: Minor > > If PYSPARK_DRIVER_PYTHON is not set, python version mismatch exception may > happen (when I set PYSPARK_DRIVER_PYTHON in .profile). And the weird thing is > that this exception won't cause the unit test fail. The return_code is still > 0 which hide the unit test failure. And if I invoke the test command > directly, I can see the return code is not 0. This is very weird. > * invoke unit test command directly > {code} > export SPARK_TESTING = 1 > export PYSPARK_PYTHON=python2.6 > bin/pyspark pyspark.ml.clustering > {code} > * return code from python unit test > {code} > retcode = subprocess.Popen( > [os.path.join(SPARK_HOME, "bin/pyspark"), test_name], > stderr=per_test_output, stdout=per_test_output, env=env).wait() > {code} > * exception of python version mismatch > {code} > File "/Users/jzhang/github/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 64, in main > ("%d.%d" % sys.version_info[:2], version)) > Exception: Python in worker has different version 2.6 than that in driver > 2.7, PySpark cannot run with different minor versions > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166) > at > org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:207) > at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125) > at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
> at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
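The fix SPARK-12361 describes — pinning the Python interpreter in the child environment before launching the test subprocess, and trusting the child's return code — can be sketched as follows. The stand-in child process and variable names are illustrative, not Spark's actual test runner.

```python
# Sketch: copy the environment and pin PYSPARK_PYTHON / PYSPARK_DRIVER_PYTHON
# to the same interpreter before launching, so a setting leaked in from
# .profile cannot cause a driver/worker version mismatch. Then surface the
# child's return code instead of swallowing it.
import os
import subprocess
import sys

env = dict(os.environ)
env["PYSPARK_PYTHON"] = sys.executable
env["PYSPARK_DRIVER_PYTHON"] = sys.executable  # override any ambient setting

# Stand-in for `bin/pyspark <test_name>`: a child that fails with code 3.
retcode = subprocess.Popen(
    [sys.executable, "-c", "import sys; sys.exit(3)"], env=env).wait()
assert retcode == 3  # a non-zero child exit must be treated as a failure
```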
[jira] [Updated] (SPARK-12057) Prevent failure on corrupt JSON records
[ https://issues.apache.org/jira/browse/SPARK-12057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-12057: - Target Version/s: 1.6.1, 2.0.0 (was: 1.6.0) > Prevent failure on corrupt JSON records > --- > > Key: SPARK-12057 > URL: https://issues.apache.org/jira/browse/SPARK-12057 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Ian Macalinao >Priority: Minor > > Return failed record when a record cannot be parsed. Allows parsing of files > containing corrupt records of any form. Currently a corrupt record throws an > exception, causing the entire job to fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
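The behavior SPARK-12057 asks for — returning a failed record instead of aborting the job — can be sketched for line-delimited JSON with the standard library. The `_corrupt_record` column name follows Spark's convention; the parser itself is a minimal illustration, not Spark's JSON data source.

```python
# Minimal permissive parse: instead of letting one bad line kill the whole
# job, emit a record carrying the raw text in a corrupt-record column.
import json

def parse_permissive(lines, corrupt_col="_corrupt_record"):
    for line in lines:
        try:
            yield json.loads(line)
        except ValueError:          # corrupt record: keep it, don't fail
            yield {corrupt_col: line}

rows = list(parse_permissive(['{"a": 1}', '{"a": broken', '{"a": 2}']))
# rows[1] survives as {"_corrupt_record": '{"a": broken'}; rows 0 and 2 parse normally.
```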
[jira] [Updated] (SPARK-12273) Spark Streaming Web UI does not list Receivers in order
[ https://issues.apache.org/jira/browse/SPARK-12273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-12273: - Assignee: Liwei Lin > Spark Streaming Web UI does not list Receivers in order > --- > > Key: SPARK-12273 > URL: https://issues.apache.org/jira/browse/SPARK-12273 > Project: Spark > Issue Type: Improvement > Components: Streaming, Web UI >Affects Versions: 1.5.2 >Reporter: Liwei Lin >Assignee: Liwei Lin >Priority: Minor > Fix For: 2.0.0 > > Attachments: Spark-12273.png > > > Currently the Streaming web UI does NOT list Receivers in order, while it > seems more convenient for the users if Receivers are listed in order. > !Spark-12273.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12361) Should set PYSPARK_DRIVER_PYTHON before python test
[ https://issues.apache.org/jira/browse/SPARK-12361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-12361: --- Target Version/s: (was: 1.6.1) > Should set PYSPARK_DRIVER_PYTHON before python test > --- > > Key: SPARK-12361 > URL: https://issues.apache.org/jira/browse/SPARK-12361 > Project: Spark > Issue Type: Improvement > Components: PySpark, Tests >Affects Versions: 1.6.0 >Reporter: Jeff Zhang >Priority: Minor > > If PYSPARK_DRIVER_PYTHON is not set, python version mismatch exception may > happen (when I set PYSPARK_DRIVER_PYTHON in .profile). And the weird thing is > that this exception won't cause the unit test fail. The return_code is still > 0 which hide the unit test failure. And if I invoke the test command > directly, I can see the return code is not 0. This is very weird. > * invoke unit test command directly > {code} > export SPARK_TESTING = 1 > export PYSPARK_PYTHON=python2.6 > bin/pyspark pyspark.ml.clustering > {code} > * return code from python unit test > {code} > retcode = subprocess.Popen( > [os.path.join(SPARK_HOME, "bin/pyspark"), test_name], > stderr=per_test_output, stdout=per_test_output, env=env).wait() > {code} > * exception of python version mismatch > {code} > File "/Users/jzhang/github/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 64, in main > ("%d.%d" % sys.version_info[:2], version)) > Exception: Python in worker has different version 2.6 than that in driver > 2.7, PySpark cannot run with different minor versions > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166) > at > org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:207) > at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125) > at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
> at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12373) Type coercion rule of dividing two decimal values may choose an intermediate precision that does not have enough number of digits at the left of decimal point
[ https://issues.apache.org/jira/browse/SPARK-12373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-12373: - Summary: Type coercion rule of dividing two decimal values may choose an intermediate precision that does not have enough number of digits at the left of decimal point (was: Type coercion rule of dividing two decimal values may choose an intermediate precision that does not enough number of digits at the left of decimal point ) > Type coercion rule of dividing two decimal values may choose an intermediate > precision that does not have enough number of digits at the left of decimal > point > --- > > Key: SPARK-12373 > URL: https://issues.apache.org/jira/browse/SPARK-12373 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > Looks like the {{widerDecimalType}} at > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala#L432 > can produce something like {{(38, 38)}} when we have have two operand types > {{Decimal(38, 0)}} and {{Decimal(38, 38)}}. We should take a look at if there > is more reasonable way to handle precision/scale. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
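The widening rule SPARK-12373 describes can be sketched in pure Python. This mirrors the shape of `widerDecimalType` (larger scale plus larger integer-digit count, clamped to the maximum precision); it is an illustrative reconstruction of the rule under discussion, not the exact Spark code.

```python
# Sketch of decimal type widening and how clamping can eat integer digits.
MAX_PRECISION = 38

def wider_decimal(p1, s1, p2, s2):
    scale = max(s1, s2)                 # keep the larger fractional part
    int_digits = max(p1 - s1, p2 - s2)  # keep the larger integer part
    precision = int_digits + scale      # may exceed the representable maximum
    # Naive bound: clamp precision (and scale) to MAX_PRECISION.
    return min(precision, MAX_PRECISION), min(scale, MAX_PRECISION)

# Decimal(38, 0) vs Decimal(38, 38): the unbounded result would be (76, 38),
# but clamping yields (38, 38) -- zero digits left of the decimal point,
# which is the problematic intermediate type the issue points out.
print(wider_decimal(38, 0, 38, 38))
```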
[jira] [Commented] (SPARK-12350) VectorAssembler#transform() initially throws an exception
[ https://issues.apache.org/jira/browse/SPARK-12350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060740#comment-15060740 ] Marcelo Vanzin commented on SPARK-12350: bq. should the exception not be treated as an error? No because the class might exist in other class loaders in the chain, as is the case here. > VectorAssembler#transform() initially throws an exception > - > > Key: SPARK-12350 > URL: https://issues.apache.org/jira/browse/SPARK-12350 > Project: Spark > Issue Type: Bug > Components: ML > Environment: sparkShell command from sbt >Reporter: Jakob Odersky > > Calling VectorAssembler.transform() initially throws an exception, subsequent > calls work. > h3. Steps to reproduce > In spark-shell, > 1. Create a dummy dataframe and define an assembler > {code} > import org.apache.spark.ml.feature.VectorAssembler > val df = sc.parallelize(List((1,2), (3,4))).toDF > val assembler = new VectorAssembler().setInputCols(Array("_1", > "_2")).setOutputCol("features") > {code} > 2. 
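The delegation argument in the comment above — a miss in one class loader is not fatal because another loader in the chain may have the class — can be sketched like this. Loaders are plain dicts here for illustration; the class name is the one from the reported log.

```python
# Chained class lookup: only a miss in *every* loader is a real error.
class ClassNotFound(Exception):
    pass

def load_class(name, loaders):
    for loader in loaders:
        try:
            return loader[name]   # a miss here is expected; try the next loader
        except KeyError:
            continue
    raise ClassNotFound(name)     # exhausted the whole chain: now it's fatal

repl_loader = {}  # REPL-served classes: does not have this one
jvm_loader = {"org.apache.spark.sql.catalyst.expressions.Object": b"bytecode"}

result = load_class("org.apache.spark.sql.catalyst.expressions.Object",
                    [repl_loader, jvm_loader])
```

This is why the per-loader "File not found" message is noise rather than a failure: the request succeeds later in the chain.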
Run > {code} > assembler.transform(df).show > {code} > Initially the following exception is thrown: > {code} > 15/12/15 16:20:19 ERROR TransportRequestHandler: Error opening stream > /classes/org/apache/spark/sql/catalyst/expressions/Object.class for request > from /9.72.139.102:60610 > java.lang.IllegalArgumentException: requirement failed: File not found: > /classes/org/apache/spark/sql/catalyst/expressions/Object.class > at scala.Predef$.require(Predef.scala:233) > at > org.apache.spark.rpc.netty.NettyStreamManager.openStream(NettyStreamManager.scala:60) > at > org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136) > at > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:106) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) > at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > {code} > Subsequent calls work: > {code} > +---+---+-+ > | _1| _2| features| > +---+---+-+ > | 1| 2|[1.0,2.0]| > | 3| 4|[3.0,4.0]| > +---+---+-+ > {code} > It seems as though there is some internal state that is not initialized. > [~iyounus] originally found this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12364) Add ML example for SparkR
[ https://issues.apache.org/jira/browse/SPARK-12364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-12364: -- Assignee: Yanbo Liang (was: Apache Spark) > Add ML example for SparkR > - > > Key: SPARK-12364 > URL: https://issues.apache.org/jira/browse/SPARK-12364 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Reporter: Yanbo Liang >Assignee: Yanbo Liang > Fix For: 1.6.1, 2.0.0 > > > Add ML example for SparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12364) Add ML example for SparkR
[ https://issues.apache.org/jira/browse/SPARK-12364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-12364. --- Resolution: Fixed Fix Version/s: 1.6.1 2.0.0 Issue resolved by pull request 10324 [https://github.com/apache/spark/pull/10324] > Add ML example for SparkR > - > > Key: SPARK-12364 > URL: https://issues.apache.org/jira/browse/SPARK-12364 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Reporter: Yanbo Liang >Assignee: Apache Spark > Fix For: 2.0.0, 1.6.1 > > > Add ML example for SparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12363) PowerIterationClustering test case failed if we deprecated KMeans.setRuns
[ https://issues.apache.org/jira/browse/SPARK-12363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-12363: -- Priority: Minor (was: Major) > PowerIterationClustering test case failed if we deprecated KMeans.setRuns > - > > Key: SPARK-12363 > URL: https://issues.apache.org/jira/browse/SPARK-12363 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: Yanbo Liang >Priority: Minor > > We plan to deprecate `runs` of KMeans; PowerIterationClustering will > leverage KMeans to train the model. > I removed the `setRuns` call in PowerIterationClustering, but one of the test > cases failed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12363) PowerIterationClustering test case failed if we deprecated KMeans.setRuns
[ https://issues.apache.org/jira/browse/SPARK-12363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060862#comment-15060862 ] Joseph K. Bradley commented on SPARK-12363: --- Thanks for identifying this. What do the predictions look like? Does it improve if you increase the number of iterations KMeans runs for when called from PIC? It seems like an intuitively reasonable test, but I could see it failing for bad initial cluster centers or if KMeans needs to run for more iterations. > PowerIterationClustering test case failed if we deprecated KMeans.setRuns > - > > Key: SPARK-12363 > URL: https://issues.apache.org/jira/browse/SPARK-12363 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: Yanbo Liang > > We plan to deprecate `runs` of KMeans; PowerIterationClustering will > leverage KMeans to train the model. > I removed the `setRuns` call in PowerIterationClustering, but one of the test > cases failed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12304) Make Spark Streaming web UI display more friendly Receiver graphs
[ https://issues.apache.org/jira/browse/SPARK-12304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-12304: - Assignee: Liwei Lin > Make Spark Streaming web UI display more friendly Receiver graphs > - > > Key: SPARK-12304 > URL: https://issues.apache.org/jira/browse/SPARK-12304 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2, 1.6.0 >Reporter: Liwei Lin >Assignee: Liwei Lin >Priority: Minor > Fix For: 2.0.0 > > Attachments: after-5.png, before-5.png > > > Currently, the Spark Streaming web UI uses the same maxY when displaying 'Input > Rate Times & Histograms' and 'Per-Receiver Times & Histograms'. > This may lead to somewhat un-friendly graphs: once we have tens of Receivers > or more, every 'Per-Receiver Times' line almost hits the ground. > This issue proposes to calculate a new maxY against the original one, which > is shared among all the 'Per-Receiver Times & Histograms' graphs. > Before: > !before-5.png! > After: > !after-5.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
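The proposal in SPARK-12304 above — deriving the receiver graphs' y-axis maximum from the per-receiver series rather than reusing the aggregate input-rate maximum — can be sketched as follows. The function name and data shapes are illustrative, not the actual UI code.

```python
# Sketch: a shared y-axis maximum computed over the per-receiver series.
# Reusing the aggregate input-rate maximum flattens every receiver line
# when many receivers each contribute a small share of the total rate.
def receiver_axis_max(per_receiver_rates):
    """per_receiver_rates: one list of rate samples per receiver."""
    return max((max(series) for series in per_receiver_rates if series),
               default=0.0)

aggregate = [300.0, 280.0, 310.0]        # total input rate over 30 receivers
receivers = [[9.0, 10.5, 10.0]] * 30     # each receiver at roughly 10 events/s

# Plotting receivers against 310 makes every line hug the x-axis;
# against 10.5 the per-receiver variation is actually visible.
print(receiver_axis_max(receivers), max(aggregate))
```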
[jira] [Commented] (SPARK-12372) Unary operator "-" fails for MLlib vectors
[ https://issues.apache.org/jira/browse/SPARK-12372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060699#comment-15060699 ] Joseph K. Bradley commented on SPARK-12372: --- There simply isn't a unary minus operation defined. There are ongoing discussions about turning MLlib vectors and matrices into a full-fledged local linear algebra library, but currently, you could convert to numpy/scipy and use those libraries from PySpark. > Unary operator "-" fails for MLlib vectors > -- > > Key: SPARK-12372 > URL: https://issues.apache.org/jira/browse/SPARK-12372 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.5.2 >Reporter: Christos Iraklis Tsatsoulis > > Consider the following snippet in pyspark 1.5.2: > {code:none} > >>> from pyspark.mllib.linalg import Vectors > >>> x = Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0]) > >>> x > DenseVector([0.0, 1.0, 0.0, 7.0, 0.0]) > >>> -x > Traceback (most recent call last): > File "", line 1, in > TypeError: func() takes exactly 2 arguments (1 given) > >>> y = Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]) > >>> y > DenseVector([2.0, 0.0, 3.0, 4.0, 5.0]) > >>> x-y > DenseVector([-2.0, 1.0, -3.0, 3.0, -5.0]) > >>> -y+x > Traceback (most recent call last): > File "", line 1, in > TypeError: func() takes exactly 2 arguments (1 given) > >>> -1*x > DenseVector([-0.0, -1.0, -0.0, -7.0, -0.0]) > {code} > Clearly, the unary operator {{-}} (minus) for vectors fails, giving errors > for expressions like {{-x}} and {{-y+x}}, despite the fact that {{x-y}} > behaves as expected. > The last operation, {{-1*x}}, although mathematically "correct", includes > minus signs for the zero entries, which again is normally not expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
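The asymmetry discussed in SPARK-12372 — binary arithmetic works but unary `-` does not — and the workarounds that stay inside the existing operators can be shown with a tiny stand-in class (illustrative; not pyspark's DenseVector itself).

```python
# Toy dense vector with the same operator gaps as the 1.5 DenseVector:
# __sub__ and __rmul__ exist, but __neg__ does not.
import math

class MiniVec:
    def __init__(self, values):
        self.values = [float(v) for v in values]
    def __sub__(self, other):
        return MiniVec(a - b for a, b in zip(self.values, other.values))
    def __rmul__(self, k):
        return MiniVec(k * v for v in self.values)

x = MiniVec([0.0, 1.0, 7.0])

try:
    -x                      # no __neg__: raises, mirroring the reported bug
except TypeError:
    pass

# Workaround 1: subtract from a zero vector; zero entries stay +0.0.
neg1 = MiniVec([0.0] * 3) - x
assert math.copysign(1.0, neg1.values[0]) == 1.0   # +0.0, not -0.0

# Workaround 2: scalar multiply; zero entries come out as IEEE -0.0,
# which is the cosmetic oddity the reporter noticed in `-1*x`.
neg2 = -1 * x
assert math.copysign(1.0, neg2.values[0]) == -1.0  # -0.0
```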
[jira] [Closed] (SPARK-12372) Unary operator "-" fails for MLlib vectors
[ https://issues.apache.org/jira/browse/SPARK-12372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-12372. - Resolution: Not A Problem > Unary operator "-" fails for MLlib vectors > -- > > Key: SPARK-12372 > URL: https://issues.apache.org/jira/browse/SPARK-12372 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.5.2 >Reporter: Christos Iraklis Tsatsoulis > > Consider the following snippet in pyspark 1.5.2: > {code:none} > >>> from pyspark.mllib.linalg import Vectors > >>> x = Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0]) > >>> x > DenseVector([0.0, 1.0, 0.0, 7.0, 0.0]) > >>> -x > Traceback (most recent call last): > File "", line 1, in > TypeError: func() takes exactly 2 arguments (1 given) > >>> y = Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]) > >>> y > DenseVector([2.0, 0.0, 3.0, 4.0, 5.0]) > >>> x-y > DenseVector([-2.0, 1.0, -3.0, 3.0, -5.0]) > >>> -y+x > Traceback (most recent call last): > File "", line 1, in > TypeError: func() takes exactly 2 arguments (1 given) > >>> -1*x > DenseVector([-0.0, -1.0, -0.0, -7.0, -0.0]) > {code} > Clearly, the unary operator {{-}} (minus) for vectors fails, giving errors > for expressions like {{-x}} and {{-y+x}}, despite the fact that {{x-y}} > behaves as expected. > The last operation, {{-1*x}}, although mathematically "correct", includes > minus signs for the zero entries, which again is normally not expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12374) Improve performance of Range APIs via adding logical/physical operators
[ https://issues.apache.org/jira/browse/SPARK-12374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12374: Assignee: Apache Spark > Improve performance of Range APIs via adding logical/physical operators > --- > > Key: SPARK-12374 > URL: https://issues.apache.org/jira/browse/SPARK-12374 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Critical > > Creating an actual logical/physical operator for range for matching the > performance of RDD Range APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12374) Improve performance of Range APIs via adding logical/physical operators
[ https://issues.apache.org/jira/browse/SPARK-12374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12374: Assignee: (was: Apache Spark) > Improve performance of Range APIs via adding logical/physical operators > --- > > Key: SPARK-12374 > URL: https://issues.apache.org/jira/browse/SPARK-12374 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Xiao Li >Priority: Critical > > Creating an actual logical/physical operator for range for matching the > performance of RDD Range APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12350) VectorAssembler#transform() initially throws an exception
[ https://issues.apache.org/jira/browse/SPARK-12350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060773#comment-15060773 ] Jakob Odersky commented on SPARK-12350: --- Ok, but then why throw an exception in the first place? > VectorAssembler#transform() initially throws an exception > - > > Key: SPARK-12350 > URL: https://issues.apache.org/jira/browse/SPARK-12350 > Project: Spark > Issue Type: Bug > Components: ML > Environment: sparkShell command from sbt >Reporter: Jakob Odersky > > Calling VectorAssembler.transform() initially throws an exception, subsequent > calls work. > h3. Steps to reproduce > In spark-shell, > 1. Create a dummy dataframe and define an assembler > {code} > import org.apache.spark.ml.feature.VectorAssembler > val df = sc.parallelize(List((1,2), (3,4))).toDF > val assembler = new VectorAssembler().setInputCols(Array("_1", > "_2")).setOutputCol("features") > {code} > 2. Run > {code} > assembler.transform(df).show > {code} > Initially the following exception is thrown: > {code} > 15/12/15 16:20:19 ERROR TransportRequestHandler: Error opening stream > /classes/org/apache/spark/sql/catalyst/expressions/Object.class for request > from /9.72.139.102:60610 > java.lang.IllegalArgumentException: requirement failed: File not found: > /classes/org/apache/spark/sql/catalyst/expressions/Object.class > at scala.Predef$.require(Predef.scala:233) > at > org.apache.spark.rpc.netty.NettyStreamManager.openStream(NettyStreamManager.scala:60) > at > org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136) > at > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:106) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) > at > 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > {code} > Subsequent calls work: > {code} > +---+---+-+ > 
| _1| _2| features| > +---+---+-+ > | 1| 2|[1.0,2.0]| > | 3| 4|[3.0,4.0]| > +---+---+-+ > {code} > It seems as though there is some internal state that is not initialized. > [~iyounus] originally found this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12364) Add ML example for SparkR
[ https://issues.apache.org/jira/browse/SPARK-12364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-12364: -- Assignee: Yanbo Liang > Add ML example for SparkR > - > > Key: SPARK-12364 > URL: https://issues.apache.org/jira/browse/SPARK-12364 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Reporter: Yanbo Liang >Assignee: Yanbo Liang > > Add ML example for SparkR
[jira] [Resolved] (SPARK-12361) Should set PYSPARK_DRIVER_PYTHON before python test
[ https://issues.apache.org/jira/browse/SPARK-12361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-12361. Resolution: Fixed Fix Version/s: 2.0.0 Fixed by https://github.com/apache/spark/pull/10322 > Should set PYSPARK_DRIVER_PYTHON before python test > --- > > Key: SPARK-12361 > URL: https://issues.apache.org/jira/browse/SPARK-12361 > Project: Spark > Issue Type: Improvement > Components: PySpark, Tests >Affects Versions: 1.6.0 >Reporter: Jeff Zhang >Assignee: Jeff Zhang >Priority: Minor > Fix For: 2.0.0 > > > If PYSPARK_DRIVER_PYTHON is not set, a python version mismatch exception may > happen (when I set PYSPARK_DRIVER_PYTHON in .profile). And the weird thing is > that this exception won't cause the unit test to fail. The return_code is still > 0, which hides the unit test failure. And if I invoke the test command > directly, I can see the return code is not 0. This is very weird. > * invoke unit test command directly > {code} > export SPARK_TESTING=1 > export PYSPARK_PYTHON=python2.6 > bin/pyspark pyspark.ml.clustering > {code} > * return code from python unit test > {code} > retcode = subprocess.Popen( > [os.path.join(SPARK_HOME, "bin/pyspark"), test_name], > stderr=per_test_output, stdout=per_test_output, env=env).wait() > {code} > * exception of python version mismatch > {code} > File "/Users/jzhang/github/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 64, in main > ("%d.%d" % sys.version_info[:2], version)) > Exception: Python in worker has different version 2.6 than that in driver > 2.7, PySpark cannot run with different minor versions > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166) > at > org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:207) > at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125) > at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) >
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code}
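The report's core complaint is that the test harness saw return code 0 even though the worker raised, so the failure was hidden. A small wrapper sketch of the fix direction, checking the child's exit status explicitly and surfacing its output on failure (illustrative only, not the harness Spark actually uses):

```python
import subprocess
import sys

def run_test(cmd):
    """Run a test subprocess and raise if it exits nonzero.

    Illustrative sketch: the JIRA above notes the real harness received
    return code 0 even when the Python worker raised, so a wrapper like
    this only helps once the child propagates its true exit status.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    out, _ = proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError("test failed (exit %d):\n%s"
                           % (proc.returncode, out.decode()))
    return out.decode()

# Example: run_test([sys.executable, "-c", "print('ok')"]) returns "ok\n",
# while a child that calls sys.exit(3) raises RuntimeError.
```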
[jira] [Updated] (SPARK-12373) Type coercion rule of dividing two decimal values may choose an intermediate precision that does not have enough digits to the left of the decimal point
[ https://issues.apache.org/jira/browse/SPARK-12373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-12373: - Summary: Type coercion rule of dividing two decimal values may choose an intermediate precision that does not have enough digits to the left of the decimal point (was: Type coercion rule for dividing two decimal values may choose an intermediate precision that does not have enough digits to the left of the decimal point ) > Type coercion rule of dividing two decimal values may choose an intermediate > precision that does not have enough digits to the left of the decimal point > -- > > Key: SPARK-12373 > URL: https://issues.apache.org/jira/browse/SPARK-12373 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > Looks like the {{widerDecimalType}} at > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala#L432 > can produce something like {{(38, 38)}} when we have two operand types > {{Decimal(38, 0)}} and {{Decimal(38, 38)}}. We should take a look at whether there > is a more reasonable way to handle precision/scale.
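To see how a widerDecimalType-style rule can end up with zero digits left of the decimal point, here is a hedged sketch of such a rule, modeled on the (precision, scale) pairs quoted above rather than on the actual HiveTypeCoercion code:

```python
MAX_PRECISION = 38  # Spark SQL's maximum decimal precision

def wider_decimal_type(p1, s1, p2, s2):
    """Sketch of a widerDecimalType-style rule: keep the larger scale and
    the larger integer-digit range, then cap precision at MAX_PRECISION.

    Assumed logic for illustration. The cap is where the reported problem
    appears: capping precision without also reducing scale can leave zero
    digits for the integer part of the value.
    """
    scale = max(s1, s2)
    integer_digits = max(p1 - s1, p2 - s2)
    precision = integer_digits + scale
    return min(precision, MAX_PRECISION), scale

# Decimal(38, 0) vs Decimal(38, 38): the uncapped result would be (76, 38),
# but capping at 38 yields (38, 38) -- no room left of the decimal point.
```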
[jira] [Assigned] (SPARK-12364) Add ML example for SparkR
[ https://issues.apache.org/jira/browse/SPARK-12364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12364: Assignee: Apache Spark (was: Yanbo Liang) > Add ML example for SparkR > - > > Key: SPARK-12364 > URL: https://issues.apache.org/jira/browse/SPARK-12364 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Reporter: Yanbo Liang >Assignee: Apache Spark > > Add ML example for SparkR
[jira] [Commented] (SPARK-12326) Move GBT implementation from spark.mllib to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-12326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060887#comment-15060887 ] Joseph K. Bradley commented on SPARK-12326: --- The plan sounds good. The critical item is #1 of course since that will let us improve GBTs in spark.ml. For #2, I'd also recommend we take this opportunity to make some of those helper classes private when possible (especially if they are only needed during training) and maybe change the APIs (especially if we can eliminate duplicate data stored in the final model). Can you please make 1 subtask for each of these 4 steps? Thanks! > Move GBT implementation from spark.mllib to spark.ml > > > Key: SPARK-12326 > URL: https://issues.apache.org/jira/browse/SPARK-12326 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Seth Hendrickson > > Several improvements can be made to gradient boosted trees, but are not > possible without moving the GBT implementation to spark.ml (e.g. > rawPrediction column, feature importance). This Jira is for moving the > current GBT implementation to spark.ml, which will have roughly the following > steps: > 1. Copy the implementation to spark.ml and change spark.ml classes to use > that implementation. Current tests will ensure that the implementations learn > exactly the same models. > 2. Move the decision tree helper classes over to spark.ml (e.g. Impurity, > InformationGainStats, ImpurityStats, DTStatsAggregator, etc...). Since > eventually all tree implementations will reside in spark.ml, the helper > classes should as well. > 3. Remove the spark.mllib implementation, and make the spark.mllib APIs > wrappers around the spark.ml implementation. The spark.ml tests will again > ensure that we do not change any behavior. > 4. Move the unit tests to spark.ml, and change the spark.mllib unit tests to > verify model equivalence. > Steps 2, 3, and 4 should be in separate Jiras. 
[jira] [Created] (SPARK-12373) Type coercion rule for dividing two decimal values may choose an intermediate precision that does not have enough digits to the left of the decimal point
Yin Huai created SPARK-12373: Summary: Type coercion rule for dividing two decimal values may choose an intermediate precision that does not have enough digits to the left of the decimal point Key: SPARK-12373 URL: https://issues.apache.org/jira/browse/SPARK-12373 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Looks like the {{widerDecimalType}} at https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala#L432 can produce something like {{(38, 38)}} when we have two operand types {{Decimal(38, 0)}} and {{Decimal(38, 38)}}. We should take a look at whether there is a more reasonable way to handle precision/scale.
[jira] [Resolved] (SPARK-11608) ML 1.6 QA: Programming guide update and migration guide
[ https://issues.apache.org/jira/browse/SPARK-11608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-11608. --- Resolution: Fixed Fix Version/s: 1.6.1 2.0.0 Issue resolved by pull request 10235 [https://github.com/apache/spark/pull/10235] > ML 1.6 QA: Programming guide update and migration guide > --- > > Key: SPARK-11608 > URL: https://issues.apache.org/jira/browse/SPARK-11608 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > Fix For: 2.0.0, 1.6.1 > > > Before the release, we need to update the MLlib Programming Guide. Updates > will include: > * Add migration guide subsection. > ** Use the results of the QA audit JIRAs. > * Check phrasing, especially in main sections (for outdated items such as "In > this release, ...") > * Possibly reorganize parts of the Pipelines guide if needed.
[jira] [Updated] (SPARK-12374) Improve performance of Range APIs via adding logical/physical operators
[ https://issues.apache.org/jira/browse/SPARK-12374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-12374: Summary: Improve performance of Range APIs via adding logical/physical operators (was: Improve performance of range API via adding logical/physical operators) > Improve performance of Range APIs via adding logical/physical operators > --- > > Key: SPARK-12374 > URL: https://issues.apache.org/jira/browse/SPARK-12374 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Xiao Li >Priority: Critical > > Create an actual logical/physical operator for range to match the > performance of the RDD range APIs.
[jira] [Commented] (SPARK-12367) NoSuchElementException during prediction with Random Forest Regressor
[ https://issues.apache.org/jira/browse/SPARK-12367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060720#comment-15060720 ] Joseph K. Bradley commented on SPARK-12367: --- This is likely caused by a feature value 1.0 which did not appear in the training data. That prevents VectorIndexer from knowing about that value, so it does not have a corresponding index when trying to transform the test data. It will be handled by [SPARK-12375]. > NoSuchElementException during prediction with Random Forest Regressor > - > > Key: SPARK-12367 > URL: https://issues.apache.org/jira/browse/SPARK-12367 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.5.2 >Reporter: Eugene Morozov > Attachments: CodeThatGivesANoSuchElementException.java, > complete-stack-trace.log, input.gz > > > I'm consistently getting "java.util.NoSuchElementException: key not found: > 1.0" while trying to do a prediction on a trained model. > I use ml package - Pipeline API. The model is successfully trained, I see > some stats in the output: total, findSplitsBins, findBestSplits, > chooseSplits. I can even serialize it into a file and use afterwards, but the > prediction is broken somehow. > Code, input data and stack trace attached.
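The failure mode described in the comment can be sketched in a few lines: fitting builds a value-to-index map from the training data only, so a plain map lookup at transform time fails on a value never seen during training. This is a hypothetical Python stand-in for one categorical feature, not Spark's implementation:

```python
def fit_category_map(training_values):
    """Mimic what a VectorIndexer-style fit does for one categorical
    feature: map each distinct training value to a dense index.
    Hypothetical stand-in for illustration."""
    return {v: i for i, v in enumerate(sorted(set(training_values)))}

def transform(value, category_map):
    # A plain lookup fails on unseen values -- the Python analogue of the
    # "java.util.NoSuchElementException: key not found: 1.0" reported above.
    return category_map[value]

# Training data never contained 1.0:
cats = fit_category_map([0.0, 2.0, 3.0])
# transform(1.0, cats) would raise KeyError: 1.0 at prediction time.
```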
[jira] [Closed] (SPARK-12367) NoSuchElementException during prediction with Random Forest Regressor
[ https://issues.apache.org/jira/browse/SPARK-12367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-12367. - Resolution: Duplicate > NoSuchElementException during prediction with Random Forest Regressor > - > > Key: SPARK-12367 > URL: https://issues.apache.org/jira/browse/SPARK-12367 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.5.2 >Reporter: Eugene Morozov > Attachments: CodeThatGivesANoSuchElementException.java, > complete-stack-trace.log, input.gz > > > I'm consistently getting "java.util.NoSuchElementException: key not found: > 1.0" while trying to do a prediction on a trained model. > I use ml package - Pipeline API. The model is successfully trained, I see > some stats in the output: total, findSplitsBins, findBestSplits, > chooseSplits. I can even serialize it into a file and use afterwards, but the > prediction is broken somehow. > Code, input data and stack trace attached.
[jira] [Created] (SPARK-12375) VectorIndexer: allow unknown categories
Joseph K. Bradley created SPARK-12375: - Summary: VectorIndexer: allow unknown categories Key: SPARK-12375 URL: https://issues.apache.org/jira/browse/SPARK-12375 Project: Spark Issue Type: Sub-task Components: ML Reporter: Joseph K. Bradley Add option for allowing unknown categories, probably via a parameter like "allowUnknownCategories." If true, then handle unknown categories during transform by assigning them to an extra category index. The API should resemble the API used for StringIndexer.
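The proposed behavior can be sketched as follows. The `allow_unknown` parameter mirrors the suggested "allowUnknownCategories" name; this is an illustration of the idea, not Spark's API:

```python
def transform_with_unknown(value, category_map, allow_unknown=True):
    """Sketch of the proposal above: map unseen categories to one extra
    index (len(category_map)) instead of failing the transform.

    'allow_unknown' is a hypothetical parameter mirroring the suggested
    allowUnknownCategories; when False, the current failing behavior of
    a plain map lookup is preserved.
    """
    if value in category_map:
        return category_map[value]
    if allow_unknown:
        return len(category_map)  # shared "unknown" bucket
    raise KeyError(value)
```

With a training-time map of `{0.0: 0, 2.0: 1, 3.0: 2}`, the unseen value 1.0 would map to index 3 rather than raising, which is exactly the behavior the issue asks for.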
[jira] [Commented] (SPARK-12350) VectorAssembler#transform() initially throws an exception
[ https://issues.apache.org/jira/browse/SPARK-12350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060721#comment-15060721 ] Marcelo Vanzin commented on SPARK-12350: I understand where the exception is coming from, I'm asking whether there's any actual functionality broken by this or is it just about the ugly exception being printed to the terminal. It seems there's not, so it's just about silencing the exception. > VectorAssembler#transform() initially throws an exception > - > > Key: SPARK-12350 > URL: https://issues.apache.org/jira/browse/SPARK-12350 > Project: Spark > Issue Type: Bug > Components: ML > Environment: sparkShell command from sbt >Reporter: Jakob Odersky > > Calling VectorAssembler.transform() initially throws an exception, subsequent > calls work. > h3. Steps to reproduce > In spark-shell, > 1. Create a dummy dataframe and define an assembler > {code} > import org.apache.spark.ml.feature.VectorAssembler > val df = sc.parallelize(List((1,2), (3,4))).toDF > val assembler = new VectorAssembler().setInputCols(Array("_1", > "_2")).setOutputCol("features") > {code} > 2. 
Run > {code} > assembler.transform(df).show > {code} > Initially the following exception is thrown: > {code} > 15/12/15 16:20:19 ERROR TransportRequestHandler: Error opening stream > /classes/org/apache/spark/sql/catalyst/expressions/Object.class for request > from /9.72.139.102:60610 > java.lang.IllegalArgumentException: requirement failed: File not found: > /classes/org/apache/spark/sql/catalyst/expressions/Object.class > at scala.Predef$.require(Predef.scala:233) > at > org.apache.spark.rpc.netty.NettyStreamManager.openStream(NettyStreamManager.scala:60) > at > org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136) > at > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:106) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) > at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > {code} > Subsequent calls work: > {code} > +---+---+-+ > | _1| _2| features| > +---+---+-+ > | 1| 2|[1.0,2.0]| > | 3| 4|[3.0,4.0]| > +---+---+-+ > {code} > It seems as though there is some internal state that is not initialized. > [~iyounus] originally found this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail:
[jira] [Commented] (SPARK-12363) PowerIterationClustering test case failed if we deprecated KMeans.setRuns
[ https://issues.apache.org/jira/browse/SPARK-12363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060865#comment-15060865 ] Joseph K. Bradley commented on SPARK-12363: --- Setting priority to Minor since we'll notice this bug when it becomes a bug. > PowerIterationClustering test case failed if we deprecated KMeans.setRuns > - > > Key: SPARK-12363 > URL: https://issues.apache.org/jira/browse/SPARK-12363 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: Yanbo Liang >Priority: Minor > > We plan to deprecate the `runs` parameter of KMeans; PowerIterationClustering > leverages KMeans to train its model. > I removed the `setRuns` call used in PowerIterationClustering, but one of the test > cases failed.
[jira] [Updated] (SPARK-12273) Spark Streaming Web UI does not list Receivers in order
[ https://issues.apache.org/jira/browse/SPARK-12273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-12273: - Affects Version/s: 1.6.0 > Spark Streaming Web UI does not list Receivers in order > --- > > Key: SPARK-12273 > URL: https://issues.apache.org/jira/browse/SPARK-12273 > Project: Spark > Issue Type: Improvement > Components: Streaming, Web UI >Affects Versions: 1.5.2, 1.6.0 >Reporter: Liwei Lin >Assignee: Liwei Lin >Priority: Minor > Fix For: 2.0.0 > > Attachments: Spark-12273.png > > > Currently the Streaming web UI does NOT list Receivers in order, while it > seems more convenient for the users if Receivers are listed in order. > !Spark-12273.png!
[jira] [Assigned] (SPARK-12057) Prevent failure on corrupt JSON records
[ https://issues.apache.org/jira/browse/SPARK-12057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai reassigned SPARK-12057: Assignee: Yin Huai > Prevent failure on corrupt JSON records > --- > > Key: SPARK-12057 > URL: https://issues.apache.org/jira/browse/SPARK-12057 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Ian Macalinao >Assignee: Yin Huai >Priority: Minor > > Return failed record when a record cannot be parsed. Allows parsing of files > containing corrupt records of any form. Currently a corrupt record throws an > exception, causing the entire job to fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
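The behavior requested here, degrading gracefully instead of failing the whole job on one bad record, can be sketched with a permissive line-by-line parser that routes unparseable input into a designated column. Illustrative only; the column name mirrors Spark's columnNameOfCorruptRecord convention:

```python
import json

def parse_records(lines, corrupt_column="_corrupt_record"):
    """Sketch of permissive JSON parsing: instead of raising on a bad
    record (and killing the job), emit a row carrying the raw text in a
    designated corrupt-record column."""
    for line in lines:
        try:
            yield json.loads(line)
        except ValueError:  # json.JSONDecodeError subclasses ValueError
            yield {corrupt_column: line}
```

A malformed line then becomes `{"_corrupt_record": "not json"}` in the output rather than an exception, so downstream code can filter or inspect corrupt rows explicitly.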
[jira] [Commented] (SPARK-12350) VectorAssembler#transform() initially throws an exception
[ https://issues.apache.org/jira/browse/SPARK-12350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060668#comment-15060668 ] Marcelo Vanzin commented on SPARK-12350: So, if I understand correctly, the issue is just the scary log message, not because there's anything wrong with the functionality? > VectorAssembler#transform() initially throws an exception > - > > Key: SPARK-12350 > URL: https://issues.apache.org/jira/browse/SPARK-12350 > Project: Spark > Issue Type: Bug > Components: ML > Environment: sparkShell command from sbt >Reporter: Jakob Odersky > > Calling VectorAssembler.transform() initially throws an exception, subsequent > calls work. > h3. Steps to reproduce > In spark-shell, > 1. Create a dummy dataframe and define an assembler > {code} > import org.apache.spark.ml.feature.VectorAssembler > val df = sc.parallelize(List((1,2), (3,4))).toDF > val assembler = new VectorAssembler().setInputCols(Array("_1", > "_2")).setOutputCol("features") > {code} > 2. 
Run > {code} > assembler.transform(df).show > {code} > Initially the following exception is thrown: > {code} > 15/12/15 16:20:19 ERROR TransportRequestHandler: Error opening stream > /classes/org/apache/spark/sql/catalyst/expressions/Object.class for request > from /9.72.139.102:60610 > java.lang.IllegalArgumentException: requirement failed: File not found: > /classes/org/apache/spark/sql/catalyst/expressions/Object.class > at scala.Predef$.require(Predef.scala:233) > at > org.apache.spark.rpc.netty.NettyStreamManager.openStream(NettyStreamManager.scala:60) > at > org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136) > at > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:106) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) > at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > {code} > Subsequent calls work: > {code} > +---+---+-+ > | _1| _2| features| > +---+---+-+ > | 1| 2|[1.0,2.0]| > | 3| 4|[3.0,4.0]| > +---+---+-+ > {code} > It seems as though there is some internal state that is not initialized. > [~iyounus] originally found this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12374) Improve performance of Range APIs via adding logical/physical operators
[ https://issues.apache.org/jira/browse/SPARK-12374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-12374: Issue Type: Improvement (was: Bug) > Improve performance of Range APIs via adding logical/physical operators > --- > > Key: SPARK-12374 > URL: https://issues.apache.org/jira/browse/SPARK-12374 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Xiao Li >Priority: Critical > > Create an actual logical/physical operator for range to match the > performance of the RDD range APIs.
[jira] [Commented] (SPARK-12350) VectorAssembler#transform() initially throws an exception
[ https://issues.apache.org/jira/browse/SPARK-12350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060705#comment-15060705 ] Jakob Odersky commented on SPARK-12350: --- The end result seems to work, the console is however spammed with error messages. I think it is due to a `require` that fails in {{core/src/main/scala/org/apache/spark/rpc/netty/NettyStreamManager.scala}}, line 60. See my comment on https://github.com/apache/spark/commit/4a46b8859d3314b5b45a67cdc5c81fecb6e9e78c#commitcomment-15024736 > VectorAssembler#transform() initially throws an exception > - > > Key: SPARK-12350 > URL: https://issues.apache.org/jira/browse/SPARK-12350 > Project: Spark > Issue Type: Bug > Components: ML > Environment: sparkShell command from sbt >Reporter: Jakob Odersky > > Calling VectorAssembler.transform() initially throws an exception, subsequent > calls work. > h3. Steps to reproduce > In spark-shell, > 1. Create a dummy dataframe and define an assembler > {code} > import org.apache.spark.ml.feature.VectorAssembler > val df = sc.parallelize(List((1,2), (3,4))).toDF > val assembler = new VectorAssembler().setInputCols(Array("_1", > "_2")).setOutputCol("features") > {code} > 2. 
Run > {code} > assembler.transform(df).show > {code} > Initially the following exception is thrown: > {code} > 15/12/15 16:20:19 ERROR TransportRequestHandler: Error opening stream > /classes/org/apache/spark/sql/catalyst/expressions/Object.class for request > from /9.72.139.102:60610 > java.lang.IllegalArgumentException: requirement failed: File not found: > /classes/org/apache/spark/sql/catalyst/expressions/Object.class > at scala.Predef$.require(Predef.scala:233) > at > org.apache.spark.rpc.netty.NettyStreamManager.openStream(NettyStreamManager.scala:60) > at > org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136) > at > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:106) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) > at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > {code} > Subsequent calls work: > {code} > +---+---+-+ > | _1| _2| features| > +---+---+-+ > | 1| 2|[1.0,2.0]| > | 3| 4|[3.0,4.0]| > +---+---+-+ > {code} > It seems as though there is some internal state that is not initialized. > [~iyounus] originally found this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail:
[jira] [Assigned] (SPARK-12374) Improve performance of Range APIs via adding logical/physical operators
[ https://issues.apache.org/jira/browse/SPARK-12374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12374: Assignee: Apache Spark > Improve performance of Range APIs via adding logical/physical operators > --- > > Key: SPARK-12374 > URL: https://issues.apache.org/jira/browse/SPARK-12374 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Critical > > Creating an actual logical/physical operator for range for matching the > performance of RDD Range APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12374) Improve performance of Range APIs via adding logical/physical operators
[ https://issues.apache.org/jira/browse/SPARK-12374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12374: Assignee: (was: Apache Spark) > Improve performance of Range APIs via adding logical/physical operators > --- > > Key: SPARK-12374 > URL: https://issues.apache.org/jira/browse/SPARK-12374 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Xiao Li >Priority: Critical > > Creating an actual logical/physical operator for range for matching the > performance of RDD Range APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12376) Spark Streaming Java8APISuite fails in assertOrderInvariantEquals method
Evan Chen created SPARK-12376: - Summary: Spark Streaming Java8APISuite fails in assertOrderInvariantEquals method Key: SPARK-12376 URL: https://issues.apache.org/jira/browse/SPARK-12376 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.6.0 Environment: Oracle Java 64-bit (build 1.8.0_66-b17) Reporter: Evan Chen Priority: Minor org.apache.spark.streaming.Java8APISuite.java is failing because it tries to sort an immutable list in the assertOrderInvariantEquals method. Here are the errors: Tests run: 27, Failures: 0, Errors: 4, Skipped: 0, Time elapsed: 5.948 sec <<< FAILURE! - in org.apache.spark.streaming.Java8APISuite testMap(org.apache.spark.streaming.Java8APISuite) Time elapsed: 0.217 sec <<< ERROR! java.lang.UnsupportedOperationException: null at java.util.AbstractList.set(AbstractList.java:132) at java.util.AbstractList$ListItr.set(AbstractList.java:426) at java.util.List.sort(List.java:482) at java.util.Collections.sort(Collections.java:141) at org.apache.spark.streaming.Java8APISuite.lambda$assertOrderInvariantEquals$1(Java8APISuite.java:444) testFlatMap(org.apache.spark.streaming.Java8APISuite) Time elapsed: 0.203 sec <<< ERROR! java.lang.UnsupportedOperationException: null at java.util.AbstractList.set(AbstractList.java:132) at java.util.AbstractList$ListItr.set(AbstractList.java:426) at java.util.List.sort(List.java:482) at java.util.Collections.sort(Collections.java:141) at org.apache.spark.streaming.Java8APISuite.lambda$assertOrderInvariantEquals$1(Java8APISuite.java:444) testFilter(org.apache.spark.streaming.Java8APISuite) Time elapsed: 0.209 sec <<< ERROR! 
java.lang.UnsupportedOperationException: null at java.util.AbstractList.set(AbstractList.java:132) at java.util.AbstractList$ListItr.set(AbstractList.java:426) at java.util.List.sort(List.java:482) at java.util.Collections.sort(Collections.java:141) at org.apache.spark.streaming.Java8APISuite.lambda$assertOrderInvariantEquals$1(Java8APISuite.java:444) testTransform(org.apache.spark.streaming.Java8APISuite) Time elapsed: 0.215 sec <<< ERROR! java.lang.UnsupportedOperationException: null at java.util.AbstractList.set(AbstractList.java:132) at java.util.AbstractList$ListItr.set(AbstractList.java:426) at java.util.List.sort(List.java:482) at java.util.Collections.sort(Collections.java:141) at org.apache.spark.streaming.Java8APISuite.lambda$assertOrderInvariantEquals$1(Java8APISuite.java:444) Results : Tests in error: Java8APISuite.testFilter:81->assertOrderInvariantEquals:444->lambda$assertOrderInvariantEquals$1:444 » UnsupportedOperation Java8APISuite.testFlatMap:360->assertOrderInvariantEquals:444->lambda$assertOrderInvariantEquals$1:444 » UnsupportedOperation Java8APISuite.testMap:63->assertOrderInvariantEquals:444->lambda$assertOrderInvariantEquals$1:444 » UnsupportedOperation Java8APISuite.testTransform:168->assertOrderInvariantEquals:444->lambda$assertOrderInvariantEquals$1:444 » UnsupportedOperation -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
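The root cause of the errors above is general: `Collections.sort` sorts in place, so it fails on any fixed-size or immutable list (for example, one backed by `Arrays.asList`), and the usual fix is to sort a copy instead. A minimal sketch of the same failure mode and fix, written in Python for brevity (this is not code from the test suite itself):

```python
# Sketch: an in-place sort on an immutable sequence fails, while building a
# sorted copy always works and leaves the original data untouched.
data = (3, 1, 2)  # immutable, like the fixed-size lists in the failing suite

try:
    data.sort()  # tuples expose no in-place sort; this raises AttributeError
    mutated = True
except AttributeError:
    mutated = False

assert not mutated
assert sorted(data) == [1, 2, 3]  # sorted() returns a new list
assert data == (3, 1, 2)          # the original sequence is unchanged
```

The analogous Java fix is to copy the elements into a fresh `ArrayList` before calling `Collections.sort`.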
[jira] [Commented] (SPARK-12350) VectorAssembler#transform() initially throws an exception
[ https://issues.apache.org/jira/browse/SPARK-12350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060788#comment-15060788 ] Marcelo Vanzin commented on SPARK-12350: Well, that's what the fix will be. > VectorAssembler#transform() initially throws an exception > - > > Key: SPARK-12350 > URL: https://issues.apache.org/jira/browse/SPARK-12350 > Project: Spark > Issue Type: Bug > Components: ML > Environment: sparkShell command from sbt >Reporter: Jakob Odersky > > Calling VectorAssembler.transform() initially throws an exception, subsequent > calls work. > h3. Steps to reproduce > In spark-shell, > 1. Create a dummy dataframe and define an assembler > {code} > import org.apache.spark.ml.feature.VectorAssembler > val df = sc.parallelize(List((1,2), (3,4))).toDF > val assembler = new VectorAssembler().setInputCols(Array("_1", > "_2")).setOutputCol("features") > {code} > 2. Run > {code} > assembler.transform(df).show > {code} > Initially the following exception is thrown: > {code} > 15/12/15 16:20:19 ERROR TransportRequestHandler: Error opening stream > /classes/org/apache/spark/sql/catalyst/expressions/Object.class for request > from /9.72.139.102:60610 > java.lang.IllegalArgumentException: requirement failed: File not found: > /classes/org/apache/spark/sql/catalyst/expressions/Object.class > at scala.Predef$.require(Predef.scala:233) > at > org.apache.spark.rpc.netty.NettyStreamManager.openStream(NettyStreamManager.scala:60) > at > org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136) > at > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:106) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) > at > 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > {code} > Subsequent calls work: > {code} > +---+---+-+ > 
| _1| _2| features| > +---+---+-+ > | 1| 2|[1.0,2.0]| > | 3| 4|[3.0,4.0]| > +---+---+-+ > {code} > It seems as though there is some internal state that is not initialized. > [~iyounus] originally found this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12377) Wrong implementation for Row.__call__ in pyspark
Irakli Machabeli created SPARK-12377: Summary: Wrong implementation for Row.__call__ in pyspark Key: SPARK-12377 URL: https://issues.apache.org/jira/browse/SPARK-12377 Project: Spark Issue Type: Bug Components: PySpark, SQL Reporter: Irakli Machabeli

Current code:
{code}
def __call__(self, *args):
    """create new Row object"""
    return _create_row(self, args)
{code}
has to be:
{code}
def __call__(self, *args):
    """create new Row object"""
    return _create_row(self.__fields__, args)
{code}

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
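To see why the fix matters, here is a hypothetical minimal sketch — these are NOT the real pyspark `Row` and `_create_row` implementations — showing that `__call__` must hand the declared *field names* to the row constructor, not the `Row` object itself:

```python
# Stand-in for pyspark's helper: pairs each field name with a value.
# Illustration only; the real _create_row builds a Row subclass.
def _create_row(fields, values):
    return dict(zip(fields, values))

class Row:
    """Toy Row: constructed with field names, called with values."""
    def __init__(self, *fields):
        self.__fields__ = list(fields)

    def __call__(self, *args):
        # The fix: zip the incoming values against the field names.
        # Passing `self` here would pair values with the Row object.
        return _create_row(self.__fields__, args)

person = Row("name", "age")
assert person("Alice", 3) == {"name": "Alice", "age": 3}
```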
[jira] [Updated] (SPARK-12345) Mesos cluster mode is broken when SPARK_HOME is set
[ https://issues.apache.org/jira/browse/SPARK-12345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-12345: -- Summary: Mesos cluster mode is broken when SPARK_HOME is set (was: Mesos cluster mode is broken) > Mesos cluster mode is broken when SPARK_HOME is set > --- > > Key: SPARK-12345 > URL: https://issues.apache.org/jira/browse/SPARK-12345 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.6.0 >Reporter: Andrew Or >Assignee: Apache Spark >Priority: Critical > > The same setup worked in 1.5.2 but is now failing for 1.6.0-RC2. > The driver is confused about where SPARK_HOME is. It resolves > `mesos.executor.uri` or `spark.mesos.executor.home` relative to the > filesystem where the driver runs, which is wrong. > {code} > I1215 15:00:39.411212 28032 exec.cpp:134] Version: 0.25.0 > I1215 15:00:39.413512 28037 exec.cpp:208] Executor registered on slave > 130bdc39-44e7-4256-8c22-602040d337f1-S1 > bin/spark-submit: line 27: > /Users/dragos/workspace/Spark/dev/rc-tests/spark-1.6.0-bin-hadoop2.6/bin/spark-class: > No such file or directory > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
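A hedged sketch of the failure mode described above — the helper names are illustrative, not Spark's actual code: the bug amounts to anchoring a path meant for the remote Mesos agent to the driver's local filesystem.

```python
import os

# Hypothetical helpers, for illustration only.
def resolve_executor_home_buggy(executor_home, driver_cwd):
    # Wrong: joins the agent-side path onto the driver's own working
    # directory, yielding a path that only exists (if at all) locally.
    return os.path.join(driver_cwd, executor_home.lstrip("/"))

def resolve_executor_home_fixed(executor_home):
    # Right: treat the configured value as opaque; only the Mesos agent
    # that actually launches the executor should resolve it.
    return executor_home

assert resolve_executor_home_buggy("/opt/spark", "/home/driver") == "/home/driver/opt/spark"
assert resolve_executor_home_fixed("/opt/spark") == "/opt/spark"
```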
[jira] [Updated] (SPARK-12345) Mesos cluster mode is broken
[ https://issues.apache.org/jira/browse/SPARK-12345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-12345: -- Summary: Mesos cluster mode is broken (was: Mesos cluster mode is broken when SPARK_HOME is set) > Mesos cluster mode is broken > > > Key: SPARK-12345 > URL: https://issues.apache.org/jira/browse/SPARK-12345 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.6.0 >Reporter: Andrew Or >Assignee: Apache Spark >Priority: Critical > > The same setup worked in 1.5.2 but is now failing for 1.6.0-RC2. > The driver is confused about where SPARK_HOME is. It resolves > `mesos.executor.uri` or `spark.mesos.executor.home` relative to the > filesystem where the driver runs, which is wrong. > {code} > I1215 15:00:39.411212 28032 exec.cpp:134] Version: 0.25.0 > I1215 15:00:39.413512 28037 exec.cpp:208] Executor registered on slave > 130bdc39-44e7-4256-8c22-602040d337f1-S1 > bin/spark-submit: line 27: > /Users/dragos/workspace/Spark/dev/rc-tests/spark-1.6.0-bin-hadoop2.6/bin/spark-class: > No such file or directory > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12289) Support UnsafeRow in TakeOrderedAndProject/Limit
[ https://issues.apache.org/jira/browse/SPARK-12289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12289: Assignee: Apache Spark > Support UnsafeRow in TakeOrderedAndProject/Limit > > > Key: SPARK-12289 > URL: https://issues.apache.org/jira/browse/SPARK-12289 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12289) Support UnsafeRow in TakeOrderedAndProject/Limit
[ https://issues.apache.org/jira/browse/SPARK-12289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060287#comment-15060287 ] Apache Spark commented on SPARK-12289: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/10330 > Support UnsafeRow in TakeOrderedAndProject/Limit > > > Key: SPARK-12289 > URL: https://issues.apache.org/jira/browse/SPARK-12289 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12318) Save mode in SparkR should be error by default
[ https://issues.apache.org/jira/browse/SPARK-12318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-12318: -- Assignee: Jeff Zhang > Save mode in SparkR should be error by default > -- > > Key: SPARK-12318 > URL: https://issues.apache.org/jira/browse/SPARK-12318 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.2 >Reporter: Jeff Zhang >Assignee: Jeff Zhang >Priority: Minor > Fix For: 2.0.0 > > > The save mode in SparkR should be consistent with that of scala api -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12345) Mesos cluster mode is broken
[ https://issues.apache.org/jira/browse/SPARK-12345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-12345. --- Resolution: Fixed Assignee: Luc Bourlier (was: Apache Spark) > Mesos cluster mode is broken > > > Key: SPARK-12345 > URL: https://issues.apache.org/jira/browse/SPARK-12345 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.6.0 >Reporter: Andrew Or >Assignee: Luc Bourlier >Priority: Critical > Fix For: 1.6.0 > > > The same setup worked in 1.5.2 but is now failing for 1.6.0-RC2. > The driver is confused about where SPARK_HOME is. It resolves > `mesos.executor.uri` or `spark.mesos.executor.home` relative to the > filesystem where the driver runs, which is wrong. > {code} > I1215 15:00:39.411212 28032 exec.cpp:134] Version: 0.25.0 > I1215 15:00:39.413512 28037 exec.cpp:208] Executor registered on slave > 130bdc39-44e7-4256-8c22-602040d337f1-S1 > bin/spark-submit: line 27: > /Users/dragos/workspace/Spark/dev/rc-tests/spark-1.6.0-bin-hadoop2.6/bin/spark-class: > No such file or directory > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12371) Make sure Dataset nullability conforms to its underlying logical plan
[ https://issues.apache.org/jira/browse/SPARK-12371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12371: Assignee: Apache Spark (was: Cheng Lian) > Make sure Dataset nullability conforms to its underlying logical plan > - > > Key: SPARK-12371 > URL: https://issues.apache.org/jira/browse/SPARK-12371 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Cheng Lian >Assignee: Apache Spark > > Currently it's possible to construct a Dataset with different nullability > from its underlying logical plan, which should be caught during analysis > phase: > {code} > val rowRDD = sqlContext.sparkContext.parallelize(Seq(Row("hello"), Row(null))) > val schema = StructType(Seq(StructField("_1", StringType, nullable = false))) > val df = sqlContext.createDataFrame(rowRDD, schema) > df.as[Tuple1[String]].collect().foreach(println) > // Output: > // > // (hello) > // (null) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12324) The documentation sidebar does not collapse properly
[ https://issues.apache.org/jira/browse/SPARK-12324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-12324. --- Resolution: Fixed Fix Version/s: 1.6.1 2.0.0 Issue resolved by pull request 10297 [https://github.com/apache/spark/pull/10297] > The documentation sidebar does not collapse properly > > > Key: SPARK-12324 > URL: https://issues.apache.org/jira/browse/SPARK-12324 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 1.5.2 >Reporter: Timothy Hunter >Assignee: Timothy Hunter >Priority: Minor > Fix For: 2.0.0, 1.6.1 > > Attachments: Screen Shot 2015-12-14 at 12.29.57 PM.png > > > When the browser's window is reduced horizontally, the sidebar slides under > the main content and does not collapse: > - hide the sidebar when the browser's width is not large enough > - add a button to show and hide the sidebar -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12345) Mesos cluster mode is broken
[ https://issues.apache.org/jira/browse/SPARK-12345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-12345: -- Fix Version/s: 1.6.0 > Mesos cluster mode is broken > > > Key: SPARK-12345 > URL: https://issues.apache.org/jira/browse/SPARK-12345 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.6.0 >Reporter: Andrew Or >Assignee: Apache Spark >Priority: Critical > Fix For: 1.6.0 > > > The same setup worked in 1.5.2 but is now failing for 1.6.0-RC2. > The driver is confused about where SPARK_HOME is. It resolves > `mesos.executor.uri` or `spark.mesos.executor.home` relative to the > filesystem where the driver runs, which is wrong. > {code} > I1215 15:00:39.411212 28032 exec.cpp:134] Version: 0.25.0 > I1215 15:00:39.413512 28037 exec.cpp:208] Executor registered on slave > 130bdc39-44e7-4256-8c22-602040d337f1-S1 > bin/spark-submit: line 27: > /Users/dragos/workspace/Spark/dev/rc-tests/spark-1.6.0-bin-hadoop2.6/bin/spark-class: > No such file or directory > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6518) Add example code and user guide for bisecting k-means
[ https://issues.apache.org/jira/browse/SPARK-6518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-6518. -- Resolution: Fixed Fix Version/s: 1.6.1 2.0.0 Issue resolved by pull request 9952 [https://github.com/apache/spark/pull/9952] > Add example code and user guide for bisecting k-means > - > > Key: SPARK-6518 > URL: https://issues.apache.org/jira/browse/SPARK-6518 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib >Reporter: Yu Ishikawa >Assignee: Yu Ishikawa > Fix For: 2.0.0, 1.6.1 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12309) Use sqlContext from MLlibTestSparkContext for spark.ml test suites
[ https://issues.apache.org/jira/browse/SPARK-12309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-12309. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10279 [https://github.com/apache/spark/pull/10279] > Use sqlContext from MLlibTestSparkContext for spark.ml test suites > -- > > Key: SPARK-12309 > URL: https://issues.apache.org/jira/browse/SPARK-12309 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang > Fix For: 2.0.0 > > > Use sqlContext from MLlibTestSparkContext rather than creating new one for > spark.ml test cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10951) Support private S3 repositories using spark-submit via --repositories flag
[ https://issues.apache.org/jira/browse/SPARK-10951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060451#comment-15060451 ] Jerry Lam commented on SPARK-10951: --- Any chance to have this feature in 1.6? :) > Support private S3 repositories using spark-submit via --repositories flag > -- > > Key: SPARK-10951 > URL: https://issues.apache.org/jira/browse/SPARK-10951 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 1.5.1 >Reporter: Jerry Lam > > Currently spark-submit allows users to specify remote repositories using > --repositories as a means to use --packages to handle jar dependencies. > However, the remote repositories do not include private S3 repositories > which require AWS credentials. It would be great to include an S3 resolver to > handle private S3 repositories. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12215) User guide section for KMeans in spark.ml
[ https://issues.apache.org/jira/browse/SPARK-12215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-12215. --- Resolution: Fixed Fix Version/s: 1.6.1 2.0.0 Issue resolved by pull request 10244 [https://github.com/apache/spark/pull/10244] > User guide section for KMeans in spark.ml > - > > Key: SPARK-12215 > URL: https://issues.apache.org/jira/browse/SPARK-12215 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Reporter: Joseph K. Bradley >Assignee: Yu Ishikawa > Fix For: 2.0.0, 1.6.1 > > > [~yuu.ishik...@gmail.com] Will you have time to add a user guide section for > this? Thanks in advance! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12360) Support using 64-bit long type in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060381#comment-15060381 ] Shivaram Venkataraman commented on SPARK-12360: --- The lack of 64 bit numbers is a limitation in R, but I'd like to understand the use-cases where this comes up before trying a complex fix. My understanding is that long values from JSON / HDFS / Parquet etc. will be read correctly because they go through the Scala layers and the problem only comes up when somebody does a collect / UDF ? If so I think the problem may not be that important as R users probably wouldn't expect long types to work on the R shell. Also it might lead to another solution where we don't add a dependency on bit64, but we check if bit64 is available and if so we avoid the truncation to double etc. > Support using 64-bit long type in SparkR > > > Key: SPARK-12360 > URL: https://issues.apache.org/jira/browse/SPARK-12360 > Project: Spark > Issue Type: New Feature > Components: SparkR >Affects Versions: 1.5.2 >Reporter: Sun Rui > > R has no support for 64-bit integers. While in Scala/Java API, some methods > have one or more arguments of long type. Currently we support only passing an > integer cast from a numeric to Scala/Java side for parameters of long type of > such methods. This may have problem covering large data sets. > Storing a 64-bit integer in a double obviously does not work as some 64-bit > integers can not be exactly represented in double format, so x and x+1 can't > be distinguished. > There is a bit64 package > (https://cran.r-project.org/web/packages/bit64/index.html) in CRAN which > supports vectors of 64-bit integers. We can investigate if it can be used for > this purpose. > two questions are: > 1. Is the license acceptable? > 2. This will have SparkR depends on a non-base third-party package, which > may complicate the deployment. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
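The precision problem described above can be demonstrated in a few lines (Python floats are IEEE 754 doubles, the same representation R's numeric type uses): consecutive integers above 2^53 are no longer distinguishable.

```python
# A double has a 53-bit significand, so integers above 2**53 collide.
big = 2 ** 53
assert float(big) == float(big + 1)   # big and big+1 are the same double
assert float(big - 1) != float(big)   # below 2**53, integers are still exact
assert int(float(big + 1)) == big     # round-tripping silently loses the +1
```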
[jira] [Resolved] (SPARK-12318) Save mode in SparkR should be error by default
[ https://issues.apache.org/jira/browse/SPARK-12318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-12318. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10290 [https://github.com/apache/spark/pull/10290] > Save mode in SparkR should be error by default > -- > > Key: SPARK-12318 > URL: https://issues.apache.org/jira/browse/SPARK-12318 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.2 >Reporter: Jeff Zhang >Priority: Minor > Fix For: 2.0.0 > > > The save mode in SparkR should be consistent with that of scala api -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8745) Remove GenerateProjection
[ https://issues.apache.org/jira/browse/SPARK-8745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8745. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10316 [https://github.com/apache/spark/pull/10316] > Remove GenerateProjection > - > > Key: SPARK-8745 > URL: https://issues.apache.org/jira/browse/SPARK-8745 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Davies Liu > Fix For: 2.0.0 > > > Based on discussion offline with [~marmbrus], we should remove > GenerateProjection. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12310) Add write.json and write.parquet for SparkR
[ https://issues.apache.org/jira/browse/SPARK-12310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-12310. --- Resolution: Fixed Assignee: Yanbo Liang (was: Apache Spark) Fix Version/s: 2.0.0 1.6.1 Resolved by https://github.com/apache/spark/pull/10281 > Add write.json and write.parquet for SparkR > --- > > Key: SPARK-12310 > URL: https://issues.apache.org/jira/browse/SPARK-12310 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Yanbo Liang >Assignee: Yanbo Liang > Fix For: 1.6.1, 2.0.0 > > > Add write.json and write.parquet for SparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12350) VectorAssembler#transform() initially throws an exception
[ https://issues.apache.org/jira/browse/SPARK-12350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060500#comment-15060500 ] Jakob Odersky edited comment on SPARK-12350 at 12/16/15 6:55 PM: - You're right, somewhere in the huge stack trace output I also see the dataframe displayed as a table. The error only occurs in the latest upstream.

was (Author: jodersky): You're right, somewhere in the huge stack trace output I also see the dataframe displayed as a table.

> VectorAssembler#transform() initially throws an exception > - > > Key: SPARK-12350 > URL: https://issues.apache.org/jira/browse/SPARK-12350 > Project: Spark > Issue Type: Bug > Components: ML > Environment: sparkShell command from sbt >Reporter: Jakob Odersky > > Calling VectorAssembler.transform() initially throws an exception, subsequent > calls work. > h3. Steps to reproduce > In spark-shell, > 1. Create a dummy dataframe and define an assembler > {code} > import org.apache.spark.ml.feature.VectorAssembler > val df = sc.parallelize(List((1,2), (3,4))).toDF > val assembler = new VectorAssembler().setInputCols(Array("_1", > "_2")).setOutputCol("features") > {code} > 2. 
Run > {code} > assembler.transform(df).show > {code} > Initially the following exception is thrown: > {code} > 15/12/15 16:20:19 ERROR TransportRequestHandler: Error opening stream > /classes/org/apache/spark/sql/catalyst/expressions/Object.class for request > from /9.72.139.102:60610 > java.lang.IllegalArgumentException: requirement failed: File not found: > /classes/org/apache/spark/sql/catalyst/expressions/Object.class > at scala.Predef$.require(Predef.scala:233) > at > org.apache.spark.rpc.netty.NettyStreamManager.openStream(NettyStreamManager.scala:60) > at > org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136) > at > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:106) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) > at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > {code} > Subsequent calls work: > {code} > +---+---+-+ > | _1| _2| features| > +---+---+-+ > | 1| 2|[1.0,2.0]| > | 3| 4|[3.0,4.0]| > +---+---+-+ > {code} > It seems as though there is some internal state that is not initialized. > [~iyounus] originally found this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail:
[jira] [Updated] (SPARK-12345) Mesos cluster mode is broken
[ https://issues.apache.org/jira/browse/SPARK-12345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-12345: -- Target Version/s: 1.6.0 (was: 1.6.1) > Mesos cluster mode is broken > > > Key: SPARK-12345 > URL: https://issues.apache.org/jira/browse/SPARK-12345 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.6.0 >Reporter: Andrew Or >Assignee: Apache Spark >Priority: Critical > Fix For: 1.6.0 > > > The same setup worked in 1.5.2 but is now failing for 1.6.0-RC2. > The driver is confused about where SPARK_HOME is. It resolves > `mesos.executor.uri` or `spark.mesos.executor.home` relative to the > filesystem where the driver runs, which is wrong. > {code} > I1215 15:00:39.411212 28032 exec.cpp:134] Version: 0.25.0 > I1215 15:00:39.413512 28037 exec.cpp:208] Executor registered on slave > 130bdc39-44e7-4256-8c22-602040d337f1-S1 > bin/spark-submit: line 27: > /Users/dragos/workspace/Spark/dev/rc-tests/spark-1.6.0-bin-hadoop2.6/bin/spark-class: > No such file or directory > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12350) VectorAssembler#transform() initially throws an exception
[ https://issues.apache.org/jira/browse/SPARK-12350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060500#comment-15060500 ] Jakob Odersky commented on SPARK-12350: --- You're right, somewhere in the huge stack trace output I also see the dataframe displayed as a table > VectorAssembler#transform() initially throws an exception > - > > Key: SPARK-12350 > URL: https://issues.apache.org/jira/browse/SPARK-12350 > Project: Spark > Issue Type: Bug > Components: ML > Environment: sparkShell command from sbt >Reporter: Jakob Odersky > > Calling VectorAssembler.transform() initially throws an exception, subsequent > calls work. > h3. Steps to reproduce > In spark-shell, > 1. Create a dummy dataframe and define an assembler > {code} > import org.apache.spark.ml.feature.VectorAssembler > val df = sc.parallelize(List((1,2), (3,4))).toDF > val assembler = new VectorAssembler().setInputCols(Array("_1", > "_2")).setOutputCol("features") > {code} > 2. Run > {code} > assembler.transform(df).show > {code} > Initially the following exception is thrown: > {code} > 15/12/15 16:20:19 ERROR TransportRequestHandler: Error opening stream > /classes/org/apache/spark/sql/catalyst/expressions/Object.class for request > from /9.72.139.102:60610 > java.lang.IllegalArgumentException: requirement failed: File not found: > /classes/org/apache/spark/sql/catalyst/expressions/Object.class > at scala.Predef$.require(Predef.scala:233) > at > org.apache.spark.rpc.netty.NettyStreamManager.openStream(NettyStreamManager.scala:60) > at > org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136) > at > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:106) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) > at > 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > {code} > Subsequent calls work: > {code} > +---+---+-+ > 
| _1| _2| features| > +---+---+-+ > | 1| 2|[1.0,2.0]| > | 3| 4|[3.0,4.0]| > +---+---+-+ > {code} > It seems as though there is some internal state that is not initialized. > [~iyounus] originally found this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12324) The documentation sidebar does not collapse properly
[ https://issues.apache.org/jira/browse/SPARK-12324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-12324: -- Target Version/s: 1.6.1, 2.0.0 > The documentation sidebar does not collapse properly > > > Key: SPARK-12324 > URL: https://issues.apache.org/jira/browse/SPARK-12324 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 1.5.2 >Reporter: Timothy Hunter >Assignee: Timothy Hunter >Priority: Minor > Attachments: Screen Shot 2015-12-14 at 12.29.57 PM.png > > > When the browser's window is reduced horizontally, the sidebar slides under > the main content and does not collapse: > - hide the sidebar when the browser's width is not large enough > - add a button to show and hide the sidebar -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12372) Unary operator "-" fails for MLlib vectors
Christos Iraklis Tsatsoulis created SPARK-12372: --- Summary: Unary operator "-" fails for MLlib vectors Key: SPARK-12372 URL: https://issues.apache.org/jira/browse/SPARK-12372 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.5.2 Reporter: Christos Iraklis Tsatsoulis Consider the following snippet in pyspark 1.5.2: {code:none} >>> from pyspark.mllib.linalg import Vectors >>> x = Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0]) >>> x DenseVector([0.0, 1.0, 0.0, 7.0, 0.0]) >>> -x Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: func() takes exactly 2 arguments (1 given) >>> y = Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]) >>> y DenseVector([2.0, 0.0, 3.0, 4.0, 5.0]) >>> x-y DenseVector([-2.0, 1.0, -3.0, 3.0, -5.0]) >>> -y+x Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: func() takes exactly 2 arguments (1 given) >>> -1*x DenseVector([-0.0, -1.0, -0.0, -7.0, -0.0]) {code} Clearly, the unary operator {{-}} (minus) for vectors fails, giving errors for expressions like {{-x}} and {{-y+x}}, despite the fact that {{x-y}} behaves as expected. The last operation, {{-1*x}}, although mathematically "correct", includes minus signs for the zero entries, which again is normally not expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
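The failure pattern above is that Python dispatches the unary `-x` to a `__neg__` special method, which binary operators such as `__sub__` do not cover. A minimal sketch of the missing piece, using a hypothetical stand-in class (not the actual pyspark.mllib implementation):

```python
class MiniDenseVector:
    """Hypothetical stand-in for a dense vector; not the pyspark class."""

    def __init__(self, values):
        self.values = [float(v) for v in values]

    def __sub__(self, other):
        # Binary minus: this corresponds to the working x - y in the report.
        return MiniDenseVector(a - b for a, b in zip(self.values, other.values))

    def __neg__(self):
        # Unary minus: Python calls __neg__ for -x; without it, -x fails.
        # Note: negating 0.0 yields -0.0, matching the -1*x observation above.
        return MiniDenseVector(-v for v in self.values)

    def __repr__(self):
        return "MiniDenseVector(%r)" % (self.values,)

x = MiniDenseVector([0.0, 1.0, 0.0, 7.0, 0.0])
y = MiniDenseVector([2.0, 0.0, 3.0, 4.0, 5.0])
print(-x)       # unary minus now works
print(x - y)    # binary minus, as before
```

Once `__neg__` exists, compound expressions like `-y + x` also work, provided `__add__` is defined analogously.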
[jira] [Resolved] (SPARK-9694) Add random seed Param to Scala CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-9694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-9694. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 9108 [https://github.com/apache/spark/pull/9108] > Add random seed Param to Scala CrossValidator > - > > Key: SPARK-9694 > URL: https://issues.apache.org/jira/browse/SPARK-9694 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang >Priority: Minor > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12376) Spark Streaming Java8APISuite fails in assertOrderInvariantEquals method
[ https://issues.apache.org/jira/browse/SPARK-12376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12376: Assignee: (was: Apache Spark) > Spark Streaming Java8APISuite fails in assertOrderInvariantEquals method > > > Key: SPARK-12376 > URL: https://issues.apache.org/jira/browse/SPARK-12376 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 1.6.0 > Environment: Oracle Java 64-bit (build 1.8.0_66-b17) >Reporter: Evan Chen >Priority: Minor > > org.apache.spark.streaming.Java8APISuite.java is failing due to trying to > sort immutable list in assertOrderInvariantEquals method. > Here are the errors: > Tests run: 27, Failures: 0, Errors: 4, Skipped: 0, Time elapsed: 5.948 sec > <<< FAILURE! - in org.apache.spark.streaming.Java8APISuite > testMap(org.apache.spark.streaming.Java8APISuite) Time elapsed: 0.217 sec > <<< ERROR! > java.lang.UnsupportedOperationException: null > at java.util.AbstractList.set(AbstractList.java:132) > at java.util.AbstractList$ListItr.set(AbstractList.java:426) > at java.util.List.sort(List.java:482) > at java.util.Collections.sort(Collections.java:141) > at > org.apache.spark.streaming.Java8APISuite.lambda$assertOrderInvariantEquals$1(Java8APISuite.java:444) > testFlatMap(org.apache.spark.streaming.Java8APISuite) Time elapsed: 0.203 > sec <<< ERROR! > java.lang.UnsupportedOperationException: null > at java.util.AbstractList.set(AbstractList.java:132) > at java.util.AbstractList$ListItr.set(AbstractList.java:426) > at java.util.List.sort(List.java:482) > at java.util.Collections.sort(Collections.java:141) > at > org.apache.spark.streaming.Java8APISuite.lambda$assertOrderInvariantEquals$1(Java8APISuite.java:444) > testFilter(org.apache.spark.streaming.Java8APISuite) Time elapsed: 0.209 sec > <<< ERROR! 
> java.lang.UnsupportedOperationException: null > at java.util.AbstractList.set(AbstractList.java:132) > at java.util.AbstractList$ListItr.set(AbstractList.java:426) > at java.util.List.sort(List.java:482) > at java.util.Collections.sort(Collections.java:141) > at > org.apache.spark.streaming.Java8APISuite.lambda$assertOrderInvariantEquals$1(Java8APISuite.java:444) > testTransform(org.apache.spark.streaming.Java8APISuite) Time elapsed: 0.215 > sec <<< ERROR! > java.lang.UnsupportedOperationException: null > at java.util.AbstractList.set(AbstractList.java:132) > at java.util.AbstractList$ListItr.set(AbstractList.java:426) > at java.util.List.sort(List.java:482) > at java.util.Collections.sort(Collections.java:141) > at > org.apache.spark.streaming.Java8APISuite.lambda$assertOrderInvariantEquals$1(Java8APISuite.java:444) > Results : > Tests in error: > > Java8APISuite.testFilter:81->assertOrderInvariantEquals:444->lambda$assertOrderInvariantEquals$1:444 > » UnsupportedOperation > > Java8APISuite.testFlatMap:360->assertOrderInvariantEquals:444->lambda$assertOrderInvariantEquals$1:444 > » UnsupportedOperation > > Java8APISuite.testMap:63->assertOrderInvariantEquals:444->lambda$assertOrderInvariantEquals$1:444 > » UnsupportedOperation > > Java8APISuite.testTransform:168->assertOrderInvariantEquals:444->lambda$assertOrderInvariantEquals$1:444 > » UnsupportedOperation -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12376) Spark Streaming Java8APISuite fails in assertOrderInvariantEquals method
[ https://issues.apache.org/jira/browse/SPARK-12376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12376: Assignee: Apache Spark > Spark Streaming Java8APISuite fails in assertOrderInvariantEquals method > > > Key: SPARK-12376 > URL: https://issues.apache.org/jira/browse/SPARK-12376 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 1.6.0 > Environment: Oracle Java 64-bit (build 1.8.0_66-b17) >Reporter: Evan Chen >Assignee: Apache Spark >Priority: Minor > > org.apache.spark.streaming.Java8APISuite.java is failing due to trying to > sort immutable list in assertOrderInvariantEquals method. > Here are the errors: > Tests run: 27, Failures: 0, Errors: 4, Skipped: 0, Time elapsed: 5.948 sec > <<< FAILURE! - in org.apache.spark.streaming.Java8APISuite > testMap(org.apache.spark.streaming.Java8APISuite) Time elapsed: 0.217 sec > <<< ERROR! > java.lang.UnsupportedOperationException: null > at java.util.AbstractList.set(AbstractList.java:132) > at java.util.AbstractList$ListItr.set(AbstractList.java:426) > at java.util.List.sort(List.java:482) > at java.util.Collections.sort(Collections.java:141) > at > org.apache.spark.streaming.Java8APISuite.lambda$assertOrderInvariantEquals$1(Java8APISuite.java:444) > testFlatMap(org.apache.spark.streaming.Java8APISuite) Time elapsed: 0.203 > sec <<< ERROR! > java.lang.UnsupportedOperationException: null > at java.util.AbstractList.set(AbstractList.java:132) > at java.util.AbstractList$ListItr.set(AbstractList.java:426) > at java.util.List.sort(List.java:482) > at java.util.Collections.sort(Collections.java:141) > at > org.apache.spark.streaming.Java8APISuite.lambda$assertOrderInvariantEquals$1(Java8APISuite.java:444) > testFilter(org.apache.spark.streaming.Java8APISuite) Time elapsed: 0.209 sec > <<< ERROR! 
> java.lang.UnsupportedOperationException: null > at java.util.AbstractList.set(AbstractList.java:132) > at java.util.AbstractList$ListItr.set(AbstractList.java:426) > at java.util.List.sort(List.java:482) > at java.util.Collections.sort(Collections.java:141) > at > org.apache.spark.streaming.Java8APISuite.lambda$assertOrderInvariantEquals$1(Java8APISuite.java:444) > testTransform(org.apache.spark.streaming.Java8APISuite) Time elapsed: 0.215 > sec <<< ERROR! > java.lang.UnsupportedOperationException: null > at java.util.AbstractList.set(AbstractList.java:132) > at java.util.AbstractList$ListItr.set(AbstractList.java:426) > at java.util.List.sort(List.java:482) > at java.util.Collections.sort(Collections.java:141) > at > org.apache.spark.streaming.Java8APISuite.lambda$assertOrderInvariantEquals$1(Java8APISuite.java:444) > Results : > Tests in error: > > Java8APISuite.testFilter:81->assertOrderInvariantEquals:444->lambda$assertOrderInvariantEquals$1:444 > » UnsupportedOperation > > Java8APISuite.testFlatMap:360->assertOrderInvariantEquals:444->lambda$assertOrderInvariantEquals$1:444 > » UnsupportedOperation > > Java8APISuite.testMap:63->assertOrderInvariantEquals:444->lambda$assertOrderInvariantEquals$1:444 > » UnsupportedOperation > > Java8APISuite.testTransform:168->assertOrderInvariantEquals:444->lambda$assertOrderInvariantEquals$1:444 > » UnsupportedOperation -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
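The root cause described in the report is an in-place sort applied to an immutable list. The fix pattern, sorting a copy rather than mutating the input, is language-agnostic; a hedged Python illustration of the same idea (the suite itself is Java, so this is only an analogy):

```python
def assert_order_invariant_equals(expected, actual):
    # sorted() builds and sorts a *new* list, so it is safe even when the
    # inputs are immutable; an in-place .sort() on an immutable sequence is
    # the Python analogue of the UnsupportedOperationException above.
    assert sorted(expected) == sorted(actual), (expected, actual)

expected = (1, 3, 2)   # a tuple: immutable, like the list in the failing tests
actual = [2, 1, 3]
assert_order_invariant_equals(expected, actual)
print("order-invariant comparison passed")
```

In the Java suite, the equivalent fix would be copying into a fresh `ArrayList` before calling `Collections.sort`.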
[jira] [Updated] (SPARK-11834) Ignore thresholds in LogisticRegression and update documentation
[ https://issues.apache.org/jira/browse/SPARK-11834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-11834: -- Target Version/s: 1.6.1, 2.0.0 (was: 1.6.0) > Ignore thresholds in LogisticRegression and update documentation > > > Key: SPARK-11834 > URL: https://issues.apache.org/jira/browse/SPARK-11834 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Affects Versions: 1.6.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Minor > > ml.LogisticRegression does not support multiclass yet. So we should ignore > `thresholds` and update the documentation. In the next release, we can do > SPARK-11543. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10931) PySpark ML Models should contain Param values
[ https://issues.apache.org/jira/browse/SPARK-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15061051#comment-15061051 ] Joseph K. Bradley commented on SPARK-10931: --- I'd very strongly prefer not to modify every model. I believe we can save a lot of code by using a generic, shared implementation. Check out {{getattr}} here: [https://docs.python.org/2/library/functions.html] In the wrapper.py file in spark.ml, there are some abstractions defined. I'm hoping one of those can be modified to provide access to Params. > PySpark ML Models should contain Param values > - > > Key: SPARK-10931 > URL: https://issues.apache.org/jira/browse/SPARK-10931 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Joseph K. Bradley > > PySpark spark.ml Models are generally wrappers around Java objects and do not > even contain Param values. This JIRA is for copying the Param values from > the Estimator to the model. > This can likely be solved by modifying Estimator.fit to copy Param values, > but should also include proper unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
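The generic approach suggested here, a single shared attribute lookup instead of per-model code, could be sketched as follows. The class names are hypothetical and not the actual wrapper.py abstractions; this only illustrates the `getattr` mechanism being discussed:

```python
class EstimatorSketch:
    """Hypothetical Estimator holding Param values as a plain dict."""

    def __init__(self, **params):
        self._params = dict(params)

    def fit(self, dataset=None):
        # Copy Param values from the Estimator into the Model,
        # as the JIRA description proposes for Estimator.fit.
        return ModelSketch(self._params)


class ModelSketch:
    """Hypothetical Model exposing copied Params generically."""

    def __init__(self, param_map):
        self._param_map = dict(param_map)

    def __getattr__(self, name):
        # Invoked only when normal attribute lookup fails; delegating to
        # the copied Param map means no per-model boilerplate is needed.
        try:
            return self._param_map[name]
        except KeyError:
            raise AttributeError(name)


model = EstimatorSketch(maxIter=10, regParam=0.01).fit()
print(model.maxIter)   # 10, exposed via the shared __getattr__
```

A real implementation would live in the shared wrapper layer so every Java-backed model inherits the behavior.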
[jira] [Commented] (SPARK-12272) Gradient boosted trees: too slow at first finding the best splits
[ https://issues.apache.org/jira/browse/SPARK-12272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15061064#comment-15061064 ] Joseph K. Bradley commented on SPARK-12272: --- First comment: I'd check the number of partitions and the Spark UI to make sure workers are doing equal amounts of work. Second comment: MLlib follows the PLANET implementation, so it will have trouble with that many features. There is ongoing work to overcome that issue: [SPARK-3717]; I hope to push that work into Spark within a couple of months. Third comment: My understanding of xgboost is that it trains each tree on a single worker, using a subset of the data (only the data on that 1 worker). This differs from other implementations, which train each tree on all of the data. This means xgboost does not have to communicate much data, but also means its trees cannot be as accurate individually; it's a trade-off. There is a JIRA for exploring xgboost on Spark: [SPARK-8547] I hope these 2 linked JIRAs will address your needs! > Gradient boosted trees: too slow at the first finding best siplts > - > > Key: SPARK-12272 > URL: https://issues.apache.org/jira/browse/SPARK-12272 > Project: Spark > Issue Type: Request > Components: MLlib >Affects Versions: 1.5.2 >Reporter: Wenmin Wu > Attachments: training-log1.png, training-log2.pnd.png, > training-log3.png > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-12272) Gradient boosted trees: too slow at first finding the best splits
[ https://issues.apache.org/jira/browse/SPARK-12272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-12272. - Resolution: Duplicate > Gradient boosted trees: too slow at the first finding best siplts > - > > Key: SPARK-12272 > URL: https://issues.apache.org/jira/browse/SPARK-12272 > Project: Spark > Issue Type: Request > Components: MLlib >Affects Versions: 1.5.2 >Reporter: Wenmin Wu > Attachments: training-log1.png, training-log2.pnd.png, > training-log3.png > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10931) PySpark ML Models should contain Param values
[ https://issues.apache.org/jira/browse/SPARK-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15061090#comment-15061090 ] Evan Chen commented on SPARK-10931: --- Hey Joseph, If using the getattr method, are you suggesting fetching the parameter straight from the Model java object or from the Estimator and copying it into the Model itself? Thanks > PySpark ML Models should contain Param values > - > > Key: SPARK-10931 > URL: https://issues.apache.org/jira/browse/SPARK-10931 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Joseph K. Bradley > > PySpark spark.ml Models are generally wrappers around Java objects and do not > even contain Param values. This JIRA is for copying the Param values from > the Estimator to the model. > This can likely be solved by modifying Estimator.fit to copy Param values, > but should also include proper unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12386) Setting "spark.executor.port" leads to NPE in SparkEnv
[ https://issues.apache.org/jira/browse/SPARK-12386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1506#comment-1506 ] Shixiong Zhu commented on SPARK-12386: -- If the user doesn't depend on the assumption that `spark.executor.port` is the port of Akka actor system in executor side, they can just remove the config. Even in 1.5, the assumption is unreliable because multiple executors may run in the same host. > Setting "spark.executor.port" leads to NPE in SparkEnv > -- > > Key: SPARK-12386 > URL: https://issues.apache.org/jira/browse/SPARK-12386 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Critical > > From the list: > {quote} > when we set spark.executor.port in 1.6, we get thrown a NPE in > SparkEnv$.create(SparkEnv.scala:259). > {quote} > Fix is simple; probably should make it to 1.6.0 since it will affect anyone > using that config options, but I'll leave that to the release manager's > discretion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12380) MLLib should use existing SQLContext instead create new one
[ https://issues.apache.org/jira/browse/SPARK-12380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-12380. Resolution: Fixed Fix Version/s: 1.6.1 2.0.0 Issue resolved by pull request 10338 [https://github.com/apache/spark/pull/10338] > MLLib should use existing SQLContext instead create new one > --- > > Key: SPARK-12380 > URL: https://issues.apache.org/jira/browse/SPARK-12380 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0, 1.6.1 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12164) [SQL] Display the binary/encoded values
[ https://issues.apache.org/jira/browse/SPARK-12164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-12164. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10215 [https://github.com/apache/spark/pull/10215] > [SQL] Display the binary/encoded values > --- > > Key: SPARK-12164 > URL: https://issues.apache.org/jira/browse/SPARK-12164 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.0.0 > > > So far, we are using comma-separated decimal format to output the encoded > contents. This way is rare when the data is in binary. This could be a common > issue when we use Dataset API. > For example, > {code} > implicit val kryoEncoder = Encoders.kryo[KryoClassData] > val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), > KryoClassData("c", 3)).toDS() > ds.show(20, false); > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12164) [SQL] Display the binary/encoded values
[ https://issues.apache.org/jira/browse/SPARK-12164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-12164: - Assignee: Xiao Li > [SQL] Display the binary/encoded values > --- > > Key: SPARK-12164 > URL: https://issues.apache.org/jira/browse/SPARK-12164 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.0.0 > > > So far, we are using comma-separated decimal format to output the encoded > contents. This way is rare when the data is in binary. This could be a common > issue when we use Dataset API. > For example, > {code} > implicit val kryoEncoder = Encoders.kryo[KryoClassData] > val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), > KryoClassData("c", 3)).toDS() > ds.show(20, false); > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12350) VectorAssembler#transform() initially throws an exception
[ https://issues.apache.org/jira/browse/SPARK-12350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jakob Odersky updated SPARK-12350: -- Component/s: (was: ML) Spark Shell Spark Core > VectorAssembler#transform() initially throws an exception > - > > Key: SPARK-12350 > URL: https://issues.apache.org/jira/browse/SPARK-12350 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Shell > Environment: sparkShell command from sbt >Reporter: Jakob Odersky >Assignee: Apache Spark > > Calling VectorAssembler.transform() initially throws an exception, subsequent > calls work. > h3. Steps to reproduce > In spark-shell, > 1. Create a dummy dataframe and define an assembler > {code} > import org.apache.spark.ml.feature.VectorAssembler > val df = sc.parallelize(List((1,2), (3,4))).toDF > val assembler = new VectorAssembler().setInputCols(Array("_1", > "_2")).setOutputCol("features") > {code} > 2. Run > {code} > assembler.transform(df).show > {code} > Initially the following exception is thrown: > {code} > 15/12/15 16:20:19 ERROR TransportRequestHandler: Error opening stream > /classes/org/apache/spark/sql/catalyst/expressions/Object.class for request > from /9.72.139.102:60610 > java.lang.IllegalArgumentException: requirement failed: File not found: > /classes/org/apache/spark/sql/catalyst/expressions/Object.class > at scala.Predef$.require(Predef.scala:233) > at > org.apache.spark.rpc.netty.NettyStreamManager.openStream(NettyStreamManager.scala:60) > at > org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136) > at > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:106) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) > at > 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > {code} > Subsequent calls work: > {code} > +---+---+-+ > 
| _1| _2| features| > +---+---+-+ > | 1| 2|[1.0,2.0]| > | 3| 4|[3.0,4.0]| > +---+---+-+ > {code} > It seems as though there is some internal state that is not initialized. > [~iyounus] originally found this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12381) Move decision tree helper classes from spark.mllib to spark.ml
Seth Hendrickson created SPARK-12381: Summary: Move decision tree helper classes from spark.mllib to spark.ml Key: SPARK-12381 URL: https://issues.apache.org/jira/browse/SPARK-12381 Project: Spark Issue Type: Sub-task Components: ML, MLlib Reporter: Seth Hendrickson The helper classes for decision trees and decision tree ensembles (e.g. Impurity, InformationGainStats, ImpurityStats, DTStatsAggregator, etc...) currently reside in spark.mllib, but as the algorithm implementations are moved to spark.ml, so should these helper classes. We should take this opportunity to make some of those helper classes private when possible (especially if they are only needed during training) and maybe change the APIs (especially if we can eliminate duplicate data stored in the final model). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12382) Remove spark.mllib GBT implementation and wrap spark.ml
Seth Hendrickson created SPARK-12382: Summary: Remove spark.mllib GBT implementation and wrap spark.ml Key: SPARK-12382 URL: https://issues.apache.org/jira/browse/SPARK-12382 Project: Spark Issue Type: Sub-task Reporter: Seth Hendrickson After the GBT implementation is moved to spark.ml, we should remove the implementation from spark.mllib. The MLlib GBTs will then just call the implementation in spark.ml. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9690) Add random seed Param to PySpark CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-9690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9690: - Assignee: Martin Menestret > Add random seed Param to PySpark CrossValidator > --- > > Key: SPARK-9690 > URL: https://issues.apache.org/jira/browse/SPARK-9690 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 1.4.1 >Reporter: Martin Menestret >Assignee: Martin Menestret >Priority: Minor > Original Estimate: 1h > Remaining Estimate: 1h > > The fold assignment in the ML CrossValidator depends on a rand whose seed is hard-coded to 0, which causes sql.functions rand to call sc._jvm.functions.rand() with no > seed. > In order to be able to unit test a cross-validation, it would be useful to be able to set this seed so that the output of the cross-validation (with a > featureSubsetStrategy set to "all") is always the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
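The reproducibility concern can be illustrated outside Spark: with a fixed seed, random fold assignment is deterministic across runs, which is what makes a cross-validation unit-testable. A pure-Python sketch of the idea (the helper `assign_folds` is hypothetical, not the PySpark CrossValidator API):

```python
import random

def assign_folds(n_rows, num_folds, seed):
    """Assign each row to a fold using a seeded RNG, so the split is
    reproducible whenever the same seed is supplied."""
    rng = random.Random(seed)
    return [rng.randrange(num_folds) for _ in range(n_rows)]

# Same seed -> identical fold assignment on every run.
a = assign_folds(10, 3, seed=42)
b = assign_folds(10, 3, seed=42)
assert a == b
```

With a hard-coded seed of 0 (the situation the issue describes), callers have no way to vary or pin the split themselves, which is why exposing the seed as a Param helps.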
[jira] [Commented] (SPARK-12380) MLLib should use existing SQLContext instead create new one
[ https://issues.apache.org/jira/browse/SPARK-12380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060990#comment-15060990 ] Apache Spark commented on SPARK-12380: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/10338 > MLLib should use existing SQLContext instead create new one > --- > > Key: SPARK-12380 > URL: https://issues.apache.org/jira/browse/SPARK-12380 > Project: Spark > Issue Type: Bug >Reporter: Davies Liu >Assignee: Davies Liu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12331) R^2 for regression through the origin
[ https://issues.apache.org/jira/browse/SPARK-12331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15061039#comment-15061039 ] DB Tsai commented on SPARK-12331: - +1 PR is welcome. Thanks. > R^2 for regression through the origin > - > > Key: SPARK-12331 > URL: https://issues.apache.org/jira/browse/SPARK-12331 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Imran Younus >Priority: Minor > > The value of R^2 (coefficient of determination) obtained from > LinearRegressionModel is not consistent with R and statsmodels when the > fitIntercept is false, i.e., regression through the origin. In this case, both > R and statsmodels use the definition of R^2 given by eq(4') in the following > review paper: > https://online.stat.psu.edu/~ajw13/stat501/SpecialTopics/Reg_thru_origin.pdf > Here is the definition from this paper: > R^2 = \sum \hat{y}_i^2 / \sum y_i^2 > The paper also describes why this should be the case. I've double checked > that the value of R^2 from statsmodels and R are consistent with this > definition. On the other hand, scikit-learn doesn't use the above definition. > I would recommend using the above definition in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
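The two definitions being contrasted can be written out in a few lines of pure Python: the through-the-origin form above is sum(y_hat_i^2) / sum(y_i^2), while the conventional form is 1 - SS_res / SS_tot. This is an illustrative sketch of the formulas only, not Spark's implementation:

```python
def r2_through_origin(y, y_hat):
    """R^2 for regression through the origin: sum(y_hat_i^2) / sum(y_i^2)."""
    return sum(yh * yh for yh in y_hat) / sum(yi * yi for yi in y)

def r2_with_intercept(y, y_hat):
    """Conventional R^2: 1 - SS_res / SS_tot, with SS_tot centered on the mean."""
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Sanity check: for a perfect fit, both definitions give 1.0.
y = [1.0, 2.0, 3.0]
assert r2_through_origin(y, y) == 1.0
assert r2_with_intercept(y, y) == 1.0
```

The two values diverge for imperfect no-intercept fits, which is exactly the inconsistency between LinearRegressionModel and R/statsmodels that the issue reports.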
[jira] [Resolved] (SPARK-12320) throw exception if the number of fields does not line up for Tuple encoder
[ https://issues.apache.org/jira/browse/SPARK-12320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-12320. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10293 [https://github.com/apache/spark/pull/10293] > throw exception if the number of fields does not line up for Tuple encoder > -- > > Key: SPARK-12320 > URL: https://issues.apache.org/jira/browse/SPARK-12320 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12320) throw exception if the number of fields does not line up for Tuple encoder
[ https://issues.apache.org/jira/browse/SPARK-12320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-12320: - Assignee: Wenchen Fan > throw exception if the number of fields does not line up for Tuple encoder > -- > > Key: SPARK-12320 > URL: https://issues.apache.org/jira/browse/SPARK-12320 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12380) MLLib should use existing SQLContext instead create new one
Davies Liu created SPARK-12380: -- Summary: MLLib should use existing SQLContext instead create new one Key: SPARK-12380 URL: https://issues.apache.org/jira/browse/SPARK-12380 Project: Spark Issue Type: Bug Reporter: Davies Liu Assignee: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
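The fix requested here amounts to a get-or-create pattern: reuse an already active context rather than constructing a fresh one per call site. A generic pure-Python sketch of that pattern (the `Context` class is a hypothetical stand-in, not Spark's code; Spark itself exposes a similar `SQLContext.getOrCreate` entry point):

```python
class Context:
    """Hypothetical stand-in for a heavyweight, should-be-singleton context."""
    _active = None

    @classmethod
    def get_or_create(cls):
        # Reuse the active instance if one exists; construct it only once.
        if cls._active is None:
            cls._active = cls()
        return cls._active

# Repeated callers share one instance instead of each creating their own.
assert Context.get_or_create() is Context.get_or_create()
```

Creating a second context can clobber session state (registered UDFs, temp tables, config), which is why library code like MLlib should prefer the shared instance.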
[jira] [Updated] (SPARK-12326) Move GBT implementation from spark.mllib to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-12326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seth Hendrickson updated SPARK-12326: - Description: Several improvements can be made to gradient boosted trees, but are not possible without moving the GBT implementation to spark.ml (e.g. rawPrediction column, feature importance). This Jira is for moving the current GBT implementation to spark.ml, which will have roughly the following steps: 1. Copy the implementation to spark.ml and change spark.ml classes to use that implementation. Current tests will ensure that the implementations learn exactly the same models. 2. Move the decision tree helper classes over to spark.ml (e.g. Impurity, InformationGainStats, ImpurityStats, DTStatsAggregator, etc...). Since eventually all tree implementations will reside in spark.ml, the helper classes should as well. 3. Remove the spark.mllib implementation, and make the spark.mllib APIs wrappers around the spark.ml implementation. The spark.ml tests will again ensure that we do not change any behavior. 4. Move the unit tests to spark.ml, and change the spark.mllib unit tests to verify model equivalence. was: Several improvements can be made to gradient boosted trees, but are not possible without moving the GBT implementation to spark.ml (e.g. rawPrediction column, feature importance). This Jira is for moving the current GBT implementation to spark.ml, which will have roughly the following steps: 1. Copy the implementation to spark.ml and change spark.ml classes to use that implementation. Current tests will ensure that the implementations learn exactly the same models. 2. Move the decision tree helper classes over to spark.ml (e.g. Impurity, InformationGainStats, ImpurityStats, DTStatsAggregator, etc...). Since eventually all tree implementations will reside in spark.ml, the helper classes should as well. 3. 
Remove the spark.mllib implementation, and make the spark.mllib APIs wrappers around the spark.ml implementation. The spark.ml tests will again ensure that we do not change any behavior. 4. Move the unit tests to spark.ml, and change the spark.mllib unit tests to verify model equivalence. Steps 2, 3, and 4 should be in separate Jiras. > Move GBT implementation from spark.mllib to spark.ml > > > Key: SPARK-12326 > URL: https://issues.apache.org/jira/browse/SPARK-12326 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Seth Hendrickson > > Several improvements can be made to gradient boosted trees, but are not > possible without moving the GBT implementation to spark.ml (e.g. > rawPrediction column, feature importance). This Jira is for moving the > current GBT implementation to spark.ml, which will have roughly the following > steps: > 1. Copy the implementation to spark.ml and change spark.ml classes to use > that implementation. Current tests will ensure that the implementations learn > exactly the same models. > 2. Move the decision tree helper classes over to spark.ml (e.g. Impurity, > InformationGainStats, ImpurityStats, DTStatsAggregator, etc...). Since > eventually all tree implementations will reside in spark.ml, the helper > classes should as well. > 3. Remove the spark.mllib implementation, and make the spark.mllib APIs > wrappers around the spark.ml implementation. The spark.ml tests will again > ensure that we do not change any behavior. > 4. Move the unit tests to spark.ml, and change the spark.mllib unit tests to > verify model equivalence. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12383) Move unit tests for GBT from spark.mllib to spark.ml
Seth Hendrickson created SPARK-12383: Summary: Move unit tests for GBT from spark.mllib to spark.ml Key: SPARK-12383 URL: https://issues.apache.org/jira/browse/SPARK-12383 Project: Spark Issue Type: Sub-task Reporter: Seth Hendrickson After the GBT implementation is moved from MLlib to ML, we should move the unit tests to ML as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12380) MLLib should use existing SQLContext instead create new one
[ https://issues.apache.org/jira/browse/SPARK-12380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12380: Assignee: Apache Spark (was: Davies Liu) > MLLib should use existing SQLContext instead create new one > --- > > Key: SPARK-12380 > URL: https://issues.apache.org/jira/browse/SPARK-12380 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Reporter: Davies Liu >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12380) MLLib should use existing SQLContext instead create new one
[ https://issues.apache.org/jira/browse/SPARK-12380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12380: Assignee: Davies Liu (was: Apache Spark) > MLLib should use existing SQLContext instead create new one > --- > > Key: SPARK-12380 > URL: https://issues.apache.org/jira/browse/SPARK-12380 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Reporter: Davies Liu >Assignee: Davies Liu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org