[jira] [Commented] (SPARK-6824) Fill the docs for DataFrame API in SparkR
[ https://issues.apache.org/jira/browse/SPARK-6824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532139#comment-14532139 ] Apache Spark commented on SPARK-6824: - User 'hqzizania' has created a pull request for this issue: https://github.com/apache/spark/pull/5969 > Fill the docs for DataFrame API in SparkR > - > > Key: SPARK-6824 > URL: https://issues.apache.org/jira/browse/SPARK-6824 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman >Assignee: Qian Huang >Priority: Blocker > > Some of the DataFrame functions in SparkR do not have complete roxygen docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6824) Fill the docs for DataFrame API in SparkR
[ https://issues.apache.org/jira/browse/SPARK-6824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6824: --- Assignee: Qian Huang (was: Apache Spark) > Fill the docs for DataFrame API in SparkR > - > > Key: SPARK-6824 > URL: https://issues.apache.org/jira/browse/SPARK-6824 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman >Assignee: Qian Huang >Priority: Blocker > > Some of the DataFrame functions in SparkR do not have complete roxygen docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6824) Fill the docs for DataFrame API in SparkR
[ https://issues.apache.org/jira/browse/SPARK-6824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6824: --- Assignee: Apache Spark (was: Qian Huang) > Fill the docs for DataFrame API in SparkR > - > > Key: SPARK-6824 > URL: https://issues.apache.org/jira/browse/SPARK-6824 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman >Assignee: Apache Spark >Priority: Blocker > > Some of the DataFrame functions in SparkR do not have complete roxygen docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7436) Cannot implement nor use custom StandaloneRecoveryModeFactory implementations
Jacek Lewandowski created SPARK-7436: Summary: Cannot implement nor use custom StandaloneRecoveryModeFactory implementations Key: SPARK-7436 URL: https://issues.apache.org/jira/browse/SPARK-7436 Project: Spark Issue Type: Bug Affects Versions: 1.3.1 Reporter: Jacek Lewandowski At least, this code fragment is buggy ({{Master.scala}}): {code} case "CUSTOM" => val clazz = Class.forName(conf.get("spark.deploy.recoveryMode.factory")) val factory = clazz.getConstructor(conf.getClass, Serialization.getClass) .newInstance(conf, SerializationExtension(context.system)) .asInstanceOf[StandaloneRecoveryModeFactory] (factory.createPersistenceEngine(), factory.createLeaderElectionAgent(this)) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
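For context, one plausible reading of the bug (an assumption, not confirmed by the ticket): {{Serialization.getClass}} evaluates to the class of the Akka {{Serialization}} companion object ({{Serialization$}}), not {{classOf[Serialization]}}, so the reflective constructor lookup cannot match a user factory whose constructor takes {{(SparkConf, Serialization)}}. A minimal sketch of a lookup that would match such a constructor (illustrative only, not the project's actual fix):
{code}
import akka.serialization.Serialization
import org.apache.spark.SparkConf
import org.apache.spark.deploy.master.StandaloneRecoveryModeFactory

// Look the constructor up by the declared parameter types rather than by the
// runtime class of the Serialization companion object.
def loadRecoveryModeFactory(conf: SparkConf, serialization: Serialization): StandaloneRecoveryModeFactory = {
  val clazz = Class.forName(conf.get("spark.deploy.recoveryMode.factory"))
  clazz.getConstructor(classOf[SparkConf], classOf[Serialization])
    .newInstance(conf, serialization)
    .asInstanceOf[StandaloneRecoveryModeFactory]
}
{code}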
[jira] [Created] (SPARK-7437) Fold "literal in (item1, item2, ..., literal, ...)" into false directly if not in.
Zhongshuai Pei created SPARK-7437: - Summary: Fold "literal in (item1, item2, ..., literal, ...)" into false directly if not in. Key: SPARK-7437 URL: https://issues.apache.org/jira/browse/SPARK-7437 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Zhongshuai Pei -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
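For illustration only (this sketch is not attached to the ticket), the proposed folding could be expressed as a Catalyst optimizer rule along these lines, assuming the existing {{In}} and {{Literal}} expressions:
{code}
import org.apache.spark.sql.catalyst.expressions.{In, Literal}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// When the tested value and every list element are literals, the IN predicate
// can be evaluated once at optimization time instead of once per row.
object FoldLiteralIn extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case In(value: Literal, list) if list.forall(_.isInstanceOf[Literal]) =>
      Literal(list.contains(value))
  }
}
{code}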
[jira] [Assigned] (SPARK-7431) PySpark CrossValidatorModel needs to call parent init
[ https://issues.apache.org/jira/browse/SPARK-7431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-7431: Assignee: Joseph K. Bradley > PySpark CrossValidatorModel needs to call parent init > - > > Key: SPARK-7431 > URL: https://issues.apache.org/jira/browse/SPARK-7431 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 1.4.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > > Try running the CrossValidator doc test in the pyspark shell. Then type > cvModel to print the model. It will fail in {{Identifiable.__repr__}} since > there is no uid defined! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7431) PySpark CrossValidatorModel needs to call parent init
[ https://issues.apache.org/jira/browse/SPARK-7431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7431: --- Assignee: Apache Spark > PySpark CrossValidatorModel needs to call parent init > - > > Key: SPARK-7431 > URL: https://issues.apache.org/jira/browse/SPARK-7431 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 1.4.0 >Reporter: Joseph K. Bradley >Assignee: Apache Spark > > Try running the CrossValidator doc test in the pyspark shell. Then type > cvModel to print the model. It will fail in {{Identifiable.__repr__}} since > there is no uid defined! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7431) PySpark CrossValidatorModel needs to call parent init
[ https://issues.apache.org/jira/browse/SPARK-7431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532114#comment-14532114 ] Apache Spark commented on SPARK-7431: - User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/5968 > PySpark CrossValidatorModel needs to call parent init > - > > Key: SPARK-7431 > URL: https://issues.apache.org/jira/browse/SPARK-7431 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 1.4.0 >Reporter: Joseph K. Bradley > > Try running the CrossValidator doc test in the pyspark shell. Then type > cvModel to print the model. It will fail in {{Identifiable.__repr__}} since > there is no uid defined! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7431) PySpark CrossValidatorModel needs to call parent init
[ https://issues.apache.org/jira/browse/SPARK-7431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7431: --- Assignee: (was: Apache Spark) > PySpark CrossValidatorModel needs to call parent init > - > > Key: SPARK-7431 > URL: https://issues.apache.org/jira/browse/SPARK-7431 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 1.4.0 >Reporter: Joseph K. Bradley > > Try running the CrossValidator doc test in the pyspark shell. Then type > cvModel to print the model. It will fail in {{Identifiable.__repr__}} since > there is no uid defined! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7431) PySpark CrossValidatorModel needs to call parent init
[ https://issues.apache.org/jira/browse/SPARK-7431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7431: - Priority: Major (was: Critical) > PySpark CrossValidatorModel needs to call parent init > - > > Key: SPARK-7431 > URL: https://issues.apache.org/jira/browse/SPARK-7431 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 1.4.0 >Reporter: Joseph K. Bradley > > Try running the CrossValidator doc test in the pyspark shell. Then type > cvModel to print the model. It will fail in {{Identifiable.__repr__}} since > there is no uid defined! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7431) PySpark CrossValidatorModel needs to call parent init
[ https://issues.apache.org/jira/browse/SPARK-7431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7431: - Summary: PySpark CrossValidatorModel needs to call parent init (was: cvModel does not have uid in Python doc test) > PySpark CrossValidatorModel needs to call parent init > - > > Key: SPARK-7431 > URL: https://issues.apache.org/jira/browse/SPARK-7431 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 1.4.0 >Reporter: Joseph K. Bradley >Priority: Critical > > Try running the CrossValidator doc test in the pyspark shell. Then type > cvModel to print the model. It will fail in {{Identifiable.__repr__}} since > there is no uid defined! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7183) Memory leak in netty shuffle with spark standalone cluster
[ https://issues.apache.org/jira/browse/SPARK-7183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532108#comment-14532108 ] Jack Hu commented on SPARK-7183: Hi, [~sowen] Do we plan to add this to 1.3+? If there is any plan to release more minor release for 1.3+ like 1.3.2. > Memory leak in netty shuffle with spark standalone cluster > -- > > Key: SPARK-7183 > URL: https://issues.apache.org/jira/browse/SPARK-7183 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.3.0 >Reporter: Jack Hu >Assignee: Liang-Chi Hsieh > Labels: memory-leak, netty, shuffle > Fix For: 1.4.0 > > > There is slow leak in netty shuffle with spark cluster in > {{TransportRequestHandler.streamIds}} > In spark cluster, there are some reusable netty connections between two block > managers to get/send blocks between worker/drivers. These connections are > handled by the {{org.apache.spark.network.server.TransportRequestHandler}} in > server side. This handler keep tracking all the streamids negotiate by RPC > when shuffle data need transform in these two block managers and the streamid > is keeping increasing, and never get a chance to be deleted exception this > connection is dropped (seems never happen in normal running). > Here are some detail logs of this {{TransportRequestHandler}} (Note: we add > a log a print the total size of {{TransportRequestHandler.streamIds}}, the > log is "Current set size is N of > org.apache.spark.network.server.TransportRequestHandler@ADDRESS", this set > size is keeping increasing in our test) > {quote} > 15/04/22 21:00:16 DEBUG TransportServer: Shuffle server started on port :46288 > 15/04/22 21:00:16 INFO NettyBlockTransferService: Server created on 46288 > 15/04/22 21:00:31 INFO TransportRequestHandler: Created > TransportRequestHandler > org.apache.spark.network.server.TransportRequestHandler@29a4f3e7 > 15/04/22 21:00:32 TRACE MessageDecoder: Received message RpcRequest: > RpcRequest\{requestId=6655045571437304938, message=[B@59778678\} > 15/04/22 21:00:32 TRACE NettyBlockRpcServer: Received request: > OpenBlocks\{appId=app-20150422210016-, execId=, > blockIds=[broadcast_1_piece0]} > 15/04/22 21:00:32 TRACE NettyBlockRpcServer: Registered streamId > 1387459488000 with 1 buffers > 15/04/22 21:00:33 TRACE TransportRequestHandler: Sent result > RpcResponse\{requestId=6655045571437304938, response=[B@d2840b\} to client > /10.111.7.150:33802 > 15/04/22 21:00:33 TRACE MessageDecoder: Received message ChunkFetchRequest: > ChunkFetchRequest\{streamChunkId=StreamChunkId\{streamId=1387459488000, > chunkIndex=0}} > 15/04/22 21:00:33 TRACE TransportRequestHandler: Received req from > /10.111.7.150:33802 to fetch block StreamChunkId\{streamId=1387459488000, > chunkIndex=0\} > 15/04/22 21:00:33 INFO TransportRequestHandler: Current set size is 1 of > org.apache.spark.network.server.TransportRequestHandler@29a4f3e7 > 15/04/22 21:00:33 TRACE OneForOneStreamManager: Removing stream id > 1387459488000 > 15/04/22 21:00:33 TRACE TransportRequestHandler: Sent result > ChunkFetchSuccess\{streamChunkId=StreamChunkId\{streamId=1387459488000, > chunkIndex=0}, buffer=NioManagedBuffer\{buf=java.nio.HeapByteBuffer[pos=0 > lim=3839 cap=3839]}} to client /10.111.7.150:33802 > 15/04/22 21:00:34 TRACE MessageDecoder: Received message RpcRequest: > RpcRequest\{requestId=6660601528868866371, message=[B@42bed1b8\} > 15/04/22 21:00:34 TRACE NettyBlockRpcServer: Received request: > OpenBlocks\{appId=app-20150422210016-, execId=, > 
blockIds=[broadcast_3_piece0]} > 15/04/22 21:00:34 TRACE NettyBlockRpcServer: Registered streamId > 1387459488001 with 1 buffers > 15/04/22 21:00:34 TRACE TransportRequestHandler: Sent result > RpcResponse\{requestId=6660601528868866371, response=[B@7fa3fb60\} to client > /10.111.7.150:33802 > 15/04/22 21:00:34 TRACE MessageDecoder: Received message ChunkFetchRequest: > ChunkFetchRequest\{streamChunkId=StreamChunkId\{streamId=1387459488001, > chunkIndex=0}} > 15/04/22 21:00:34 TRACE TransportRequestHandler: Received req from > /10.111.7.150:33802 to fetch block StreamChunkId\{streamId=1387459488001, > chunkIndex=0\} > 15/04/22 21:00:34 INFO TransportRequestHandler: Current set size is 2 of > org.apache.spark.network.server.TransportRequestHandler@29a4f3e7 > 15/04/22 21:00:34 TRACE OneForOneStreamManager: Removing stream id > 1387459488001 > 15/04/22 21:00:34 TRACE TransportRequestHandler: Sent result > ChunkFetchSuccess\{streamChunkId=StreamChunkId\{streamId=1387459488001, > chunkIndex=0}, buffer=NioManagedBuffer\{buf=java.nio.HeapByteBuffer[pos=0 > lim=4277 cap=4277]}} to client /10.111.7.150:33802 > 15/04/22 21:00:34 TRACE MessageDecoder: Received message RpcRequest: > RpcReq
[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4
[ https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532106#comment-14532106 ] Reynold Xin commented on SPARK-7230: We should hide them for now. As a matter of fact, I think those shouldn't even exist in the Scala/Python version of DataFrames, but those are hard to remove now. > Make RDD API private in SparkR for Spark 1.4 > > > Key: SPARK-7230 > URL: https://issues.apache.org/jira/browse/SPARK-7230 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.4.0 >Reporter: Shivaram Venkataraman >Assignee: Shivaram Venkataraman >Priority: Critical > Fix For: 1.4.0 > > > This ticket proposes making the RDD API in SparkR private for the 1.4 > release. The motivation for doing so are discussed in a larger design > document aimed at a more top-down design of the SparkR APIs. A first cut that > discusses motivation and proposed changes can be found at http://goo.gl/GLHKZI > The main points in that document that relate to this ticket are: > - The RDD API requires knowledge of the distributed system and is pretty low > level. This is not very suitable for a number of R users who are used to more > high-level packages that work out of the box. > - The RDD implementation in SparkR is not fully robust right now: we are > missing features like spilling for aggregation, handling partitions which > don't fit in memory etc. There are further limitations like lack of hashCode > for non-native types etc. which might affect user experience. > The only change we will make for now is to not export the RDD functions as > public methods in the SparkR package and I will create another ticket for > discussing more details public API for 1.5 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4
[ https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532101#comment-14532101 ] Sun Rui commented on SPARK-7230: One question: there are still some basic RDD API methods provided on DataFrame, like map()/flatMap()/mapPartitions() and foreach(). What's our policy on these methods? Will we also make them private for 1.4, or will we support them long term? > Make RDD API private in SparkR for Spark 1.4 > > > Key: SPARK-7230 > URL: https://issues.apache.org/jira/browse/SPARK-7230 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.4.0 >Reporter: Shivaram Venkataraman >Assignee: Shivaram Venkataraman >Priority: Critical > Fix For: 1.4.0 > > > This ticket proposes making the RDD API in SparkR private for the 1.4 > release. The motivation for doing so are discussed in a larger design > document aimed at a more top-down design of the SparkR APIs. A first cut that > discusses motivation and proposed changes can be found at http://goo.gl/GLHKZI > The main points in that document that relate to this ticket are: > - The RDD API requires knowledge of the distributed system and is pretty low > level. This is not very suitable for a number of R users who are used to more > high-level packages that work out of the box. > - The RDD implementation in SparkR is not fully robust right now: we are > missing features like spilling for aggregation, handling partitions which > don't fit in memory etc. There are further limitations like lack of hashCode > for non-native types etc. which might affect user experience. > The only change we will make for now is to not export the RDD functions as > public methods in the SparkR package and I will create another ticket for > discussing more details public API for 1.5 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7262) Binary LogisticRegression with L1/L2 (elastic net) using OWLQN in new ML package
[ https://issues.apache.org/jira/browse/SPARK-7262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7262: --- Assignee: (was: Apache Spark) > Binary LogisticRegression with L1/L2 (elastic net) using OWLQN in new ML > package > > > Key: SPARK-7262 > URL: https://issues.apache.org/jira/browse/SPARK-7262 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: DB Tsai > > 1) Handle scaling and addBias internally. > 2) L1/L2 elasticnet using OWLQN optimizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7262) Binary LogisticRegression with L1/L2 (elastic net) using OWLQN in new ML package
[ https://issues.apache.org/jira/browse/SPARK-7262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532088#comment-14532088 ] Apache Spark commented on SPARK-7262: - User 'dbtsai' has created a pull request for this issue: https://github.com/apache/spark/pull/5967 > Binary LogisticRegression with L1/L2 (elastic net) using OWLQN in new ML > package > > > Key: SPARK-7262 > URL: https://issues.apache.org/jira/browse/SPARK-7262 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: DB Tsai > > 1) Handle scaling and addBias internally. > 2) L1/L2 elasticnet using OWLQN optimizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7262) Binary LogisticRegression with L1/L2 (elastic net) using OWLQN in new ML package
[ https://issues.apache.org/jira/browse/SPARK-7262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7262: --- Assignee: Apache Spark > Binary LogisticRegression with L1/L2 (elastic net) using OWLQN in new ML > package > > > Key: SPARK-7262 > URL: https://issues.apache.org/jira/browse/SPARK-7262 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: DB Tsai >Assignee: Apache Spark > > 1) Handle scaling and addBias internally. > 2) L1/L2 elasticnet using OWLQN optimizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
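For reference (standard parameterization, not spelled out in the ticket): the elastic net regularizer combines L1 and L2 as \( R(w) = \lambda \left( \alpha \|w\|_1 + \tfrac{1-\alpha}{2} \|w\|_2^2 \right) \) with \( \alpha \in [0, 1] \), so \( \alpha = 1 \) gives the pure L1 (lasso) penalty and \( \alpha = 0 \) the pure L2 (ridge) penalty; OWLQN is the quasi-Newton method used because it can handle the non-smooth L1 term.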
[jira] [Created] (SPARK-7435) Make DataFrame.show() consistent with that of Scala and pySpark
Sun Rui created SPARK-7435: -- Summary: Make DataFrame.show() consistent with that of Scala and pySpark Key: SPARK-7435 URL: https://issues.apache.org/jira/browse/SPARK-7435 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 1.4.0 Reporter: Sun Rui Priority: Blocker Currently in SparkR, DataFrame has two methods, show() and showDF(). show() prints the DataFrame column names and types, while showDF() prints the first numRows rows of a DataFrame. In Scala and pySpark, show() is used to print rows of a DataFrame. We'd better keep the API consistent unless there is some important reason not to, so I propose to swap the names (show() and showDF()) in SparkR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
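For comparison, a minimal sketch of the Scala-side behavior the proposal wants SparkR to match (assumes an existing {{sqlContext}}; not part of the ticket):
{code}
// show() in Scala/pySpark prints rows; printSchema() prints names and types.
val df = sqlContext.createDataFrame(Seq(("Alice", 29), ("Bob", 31))).toDF("name", "age")
df.show()         // first rows (20 by default) -- what SparkR's showDF() does today
df.printSchema()  // column names and types -- roughly what SparkR's show() does today
{code}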
[jira] [Resolved] (SPARK-5938) Generate row from json efficiently
[ https://issues.apache.org/jira/browse/SPARK-5938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-5938. - Resolution: Fixed Fix Version/s: 1.4.0 It has been resolved by https://github.com/apache/spark/commit/2d6612cc8b98f767d73c4d15e4065bf3d6c12ea7. > Generate row from json efficiently > -- > > Key: SPARK-5938 > URL: https://issues.apache.org/jira/browse/SPARK-5938 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Nathan Howell >Priority: Minor > Fix For: 1.4.0 > > > Generate row from json efficiently in JsonRDD object. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5443) jsonRDD with schema should ignore sub-objects that are omitted in schema
[ https://issues.apache.org/jira/browse/SPARK-5443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-5443. - Resolution: Fixed Fix Version/s: 1.4.0 It has been resolved by https://github.com/apache/spark/commit/2d6612cc8b98f767d73c4d15e4065bf3d6c12ea7. > jsonRDD with schema should ignore sub-objects that are omitted in schema > > > Key: SPARK-5443 > URL: https://issues.apache.org/jira/browse/SPARK-5443 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.2.0 >Reporter: Derrick Burns >Assignee: Nathan Howell > Fix For: 1.4.0 > > Original Estimate: 168h > Remaining Estimate: 168h > > Reading the code for jsonRDD, it appears that all fields of a JSON object are > read into a ROW independent of the provided schema. I would expect it to be > more efficient to only store in the ROW those fields that are explicitly > included in the schema. > For example, assume that I only wish to extract the "id" field of a tweet. > If I provided a schema that simply had one field within a map named "id", > then the row object would only store that field within a map. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6812) filter() on DataFrame does not work as expected
[ https://issues.apache.org/jira/browse/SPARK-6812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532070#comment-14532070 ] Sun Rui commented on SPARK-6812: [~shivaram], Yes, I agree. It seems there are still two methods, sampleDF() and saveDF(); can we change them back to sample() and save()? > filter() on DataFrame does not work as expected > --- > > Key: SPARK-6812 > URL: https://issues.apache.org/jira/browse/SPARK-6812 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Davies Liu >Assignee: Sun Rui >Priority: Blocker > Fix For: 1.4.0 > > > {code} > > filter(df, df$age > 21) > Error in filter(df, df$age > 21) : > no method for coercing this S4 class to a vector > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7434) DSL for Pipeline assembly
Joseph K. Bradley created SPARK-7434: Summary: DSL for Pipeline assembly Key: SPARK-7434 URL: https://issues.apache.org/jira/browse/SPARK-7434 Project: Spark Issue Type: Sub-task Components: ML Reporter: Joseph K. Bradley This will require a design doc to figure out the DSL and figure out how to avoid conflicts in parameters for input and output columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
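For context, pipeline assembly with the current API versus one hypothetical fluent form (the fluent line is purely illustrative; no such operator exists):
{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Current style: stages are wired explicitly through input/output column params.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// One hypothetical DSL shape this ticket might explore (illustrative only):
// val pipeline = tokenizer andThen hashingTF andThen lr
{code}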
[jira] [Commented] (SPARK-5874) How to improve the current ML pipeline API?
[ https://issues.apache.org/jira/browse/SPARK-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532067#comment-14532067 ] Joseph K. Bradley commented on SPARK-5874: -- [~eronwright] I think that's been mentioned somewhere (a design doc), but I agree this will be *very* helpful. I'll add a JIRA for it. > How to improve the current ML pipeline API? > --- > > Key: SPARK-5874 > URL: https://issues.apache.org/jira/browse/SPARK-5874 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Critical > > I created this JIRA to collect feedbacks about the ML pipeline API we > introduced in Spark 1.2. The target is to graduate this set of APIs in 1.4 > with confidence, which requires valuable input from the community. I'll > create sub-tasks for each major issue. > Design doc (WIP): > https://docs.google.com/a/databricks.com/document/d/1plFBPJY_PriPTuMiFYLSm7fQgD1FieP4wt3oMVKMGcc/edit# -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7262) Binary LogisticRegression with L1/L2 (elastic net) using OWLQN in new ML package
[ https://issues.apache.org/jira/browse/SPARK-7262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-7262: --- Description: 1) Handle scaling and addBias internally. 2) L1/L2 elasticnet using OWLQN optimizer. was: 1) Handle scaling and addBias internally. 2) L1/L2 elasticnet using OWLQN optimizer. 3) Initial weights should be computed from prior probabilities. 4) Ideally supports multinomial version in this PR. It will depend if ML api support multi-class classification. > Binary LogisticRegression with L1/L2 (elastic net) using OWLQN in new ML > package > > > Key: SPARK-7262 > URL: https://issues.apache.org/jira/browse/SPARK-7262 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: DB Tsai > > 1) Handle scaling and addBias internally. > 2) L1/L2 elasticnet using OWLQN optimizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7262) Binary LogisticRegression with L1/L2 (elastic net) using OWLQN in new ML package
[ https://issues.apache.org/jira/browse/SPARK-7262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-7262: --- Summary: Binary LogisticRegression with L1/L2 (elastic net) using OWLQN in new ML package (was: LogisticRegression with L1/L2 (elastic net) using OWLQN in new ML package) > Binary LogisticRegression with L1/L2 (elastic net) using OWLQN in new ML > package > > > Key: SPARK-7262 > URL: https://issues.apache.org/jira/browse/SPARK-7262 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: DB Tsai > > 1) Handle scaling and addBias internally. > 2) L1/L2 elasticnet using OWLQN optimizer. > 3) Initial weights should be computed from prior probabilities. > 4) Ideally supports multinomial version in this PR. It will depend if ML api > support multi-class classification. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-7399) Master fails on 2.11 with compilation error
[ https://issues.apache.org/jira/browse/SPARK-7399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tijo Thomas updated SPARK-7399: --- Comment: was deleted (was: Raised a pull request https://github.com/apache/spark/pull/5966) > Master fails on 2.11 with compilation error > --- > > Key: SPARK-7399 > URL: https://issues.apache.org/jira/browse/SPARK-7399 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Iulian Dragos > > The current code in master (and 1.4 branch) fails on 2.11 with the following > compilation error: > {code} > [error] /home/ubuntu/workspace/Apache Spark (master) on > 2.11/core/src/main/scala/org/apache/spark/rdd/RDDOperationScope.scala:78: in > object RDDOperationScope, multiple overloaded alternatives of method > withScope define default arguments. > [error] private[spark] object RDDOperationScope { > [error] ^ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7399) Master fails on 2.11 with compilation error
[ https://issues.apache.org/jira/browse/SPARK-7399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532062#comment-14532062 ] Tijo Thomas commented on SPARK-7399: Raised a pull request https://github.com/apache/spark/pull/5966 > Master fails on 2.11 with compilation error > --- > > Key: SPARK-7399 > URL: https://issues.apache.org/jira/browse/SPARK-7399 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Iulian Dragos > > The current code in master (and 1.4 branch) fails on 2.11 with the following > compilation error: > {code} > [error] /home/ubuntu/workspace/Apache Spark (master) on > 2.11/core/src/main/scala/org/apache/spark/rdd/RDDOperationScope.scala:78: in > object RDDOperationScope, multiple overloaded alternatives of method > withScope define default arguments. > [error] private[spark] object RDDOperationScope { > [error] ^ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
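For reference, a generic illustration of the Scala restriction behind that message, not Spark's actual {{RDDOperationScope}} code (which compiler versions flag it can differ, which is presumably why only the 2.11 build breaks):
{code}
object Overloads {
  // A single overload may declare default arguments:
  def withScope(name: String, allowNesting: Boolean = false): Unit = ()
  // Adding a second overload that also declares a default triggers
  // "multiple overloaded alternatives of method withScope define default arguments":
  // def withScope(name: Option[String], allowNesting: Boolean = false): Unit = ()
}
{code}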
[jira] [Assigned] (SPARK-7399) Master fails on 2.11 with compilation error
[ https://issues.apache.org/jira/browse/SPARK-7399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7399: --- Assignee: (was: Apache Spark) > Master fails on 2.11 with compilation error > --- > > Key: SPARK-7399 > URL: https://issues.apache.org/jira/browse/SPARK-7399 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Iulian Dragos > > The current code in master (and 1.4 branch) fails on 2.11 with the following > compilation error: > {code} > [error] /home/ubuntu/workspace/Apache Spark (master) on > 2.11/core/src/main/scala/org/apache/spark/rdd/RDDOperationScope.scala:78: in > object RDDOperationScope, multiple overloaded alternatives of method > withScope define default arguments. > [error] private[spark] object RDDOperationScope { > [error] ^ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7399) Master fails on 2.11 with compilation error
[ https://issues.apache.org/jira/browse/SPARK-7399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532061#comment-14532061 ] Apache Spark commented on SPARK-7399: - User 'tijoparacka' has created a pull request for this issue: https://github.com/apache/spark/pull/5966 > Master fails on 2.11 with compilation error > --- > > Key: SPARK-7399 > URL: https://issues.apache.org/jira/browse/SPARK-7399 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Iulian Dragos > > The current code in master (and 1.4 branch) fails on 2.11 with the following > compilation error: > {code} > [error] /home/ubuntu/workspace/Apache Spark (master) on > 2.11/core/src/main/scala/org/apache/spark/rdd/RDDOperationScope.scala:78: in > object RDDOperationScope, multiple overloaded alternatives of method > withScope define default arguments. > [error] private[spark] object RDDOperationScope { > [error] ^ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6812) filter() on DataFrame does not work as expected
[ https://issues.apache.org/jira/browse/SPARK-6812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6812: - Fix Version/s: 1.4.0 > filter() on DataFrame does not work as expected > --- > > Key: SPARK-6812 > URL: https://issues.apache.org/jira/browse/SPARK-6812 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Davies Liu >Assignee: Sun Rui >Priority: Blocker > Fix For: 1.4.0 > > > {code} > > filter(df, df$age > 21) > Error in filter(df, df$age > 21) : > no method for coercing this S4 class to a vector > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7399) Master fails on 2.11 with compilation error
[ https://issues.apache.org/jira/browse/SPARK-7399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7399: --- Assignee: Apache Spark > Master fails on 2.11 with compilation error > --- > > Key: SPARK-7399 > URL: https://issues.apache.org/jira/browse/SPARK-7399 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Iulian Dragos >Assignee: Apache Spark > > The current code in master (and 1.4 branch) fails on 2.11 with the following > compilation error: > {code} > [error] /home/ubuntu/workspace/Apache Spark (master) on > 2.11/core/src/main/scala/org/apache/spark/rdd/RDDOperationScope.scala:78: in > object RDDOperationScope, multiple overloaded alternatives of method > withScope define default arguments. > [error] private[spark] object RDDOperationScope { > [error] ^ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6812) filter() on DataFrame does not work as expected
[ https://issues.apache.org/jira/browse/SPARK-6812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-6812. -- Resolution: Fixed > filter() on DataFrame does not work as expected > --- > > Key: SPARK-6812 > URL: https://issues.apache.org/jira/browse/SPARK-6812 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Davies Liu >Assignee: Sun Rui >Priority: Blocker > > {code} > > filter(df, df$age > 21) > Error in filter(df, df$age > 21) : > no method for coercing this S4 class to a vector > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6812) filter() on DataFrame does not work as expected
[ https://issues.apache.org/jira/browse/SPARK-6812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532057#comment-14532057 ] Shivaram Venkataraman commented on SPARK-6812: -- Fixed by https://github.com/apache/spark/pull/5938 > filter() on DataFrame does not work as expected > --- > > Key: SPARK-6812 > URL: https://issues.apache.org/jira/browse/SPARK-6812 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Davies Liu >Assignee: Sun Rui >Priority: Blocker > > {code} > > filter(df, df$age > 21) > Error in filter(df, df$age > 21) : > no method for coercing this S4 class to a vector > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5874) How to improve the current ML pipeline API?
[ https://issues.apache.org/jira/browse/SPARK-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532054#comment-14532054 ] Eron Wright commented on SPARK-5874: - I suggest providing a fluent syntax or dsl for pipeline assembly. > How to improve the current ML pipeline API? > --- > > Key: SPARK-5874 > URL: https://issues.apache.org/jira/browse/SPARK-5874 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Critical > > I created this JIRA to collect feedbacks about the ML pipeline API we > introduced in Spark 1.2. The target is to graduate this set of APIs in 1.4 > with confidence, which requires valuable input from the community. I'll > create sub-tasks for each major issue. > Design doc (WIP): > https://docs.google.com/a/databricks.com/document/d/1plFBPJY_PriPTuMiFYLSm7fQgD1FieP4wt3oMVKMGcc/edit# -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7433) How to pass parameters to the Spark SQL backend and set their values to environment variables through the Simba ODBC driver
vincent zhao created SPARK-7433: --- Summary: How to pass parameters to the Spark SQL backend and set their values to environment variables through the Simba ODBC driver Key: SPARK-7433 URL: https://issues.apache.org/jira/browse/SPARK-7433 Project: Spark Issue Type: Question Components: Java API Affects Versions: 1.3.0 Reporter: vincent zhao -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7393) How to improve Spark SQL performance?
[ https://issues.apache.org/jira/browse/SPARK-7393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532044#comment-14532044 ] Liang Lee commented on SPARK-7393: -- Dear Dennis, Thank you very much for your kind help. The data is loaded from HDFS, which stores its data on Samsung 840 Pro SSDs. We use the following method to run the query and get the results above: val ds = sqlContext.parquetFile(databasepath + item + ".parquet") ds.registerTempTable(item) sqlContext.cacheTable(item) var rs= sqlContext.sql("SELECT * FROM DBA WHERE CHROM=? AND POS=? ") var rst= rs.collect() The schema of the file is like: |-- CHROM: string (nullable = true) |-- POS: string (nullable = true) |-- ID: string (nullable = true) |-- REF: string (nullable = true) |-- ALT: string (nullable = true) |-- QUAL: string (nullable = true) |-- FILTER: string (nullable = true) |-- INFO: string (nullable = true) Also, I'm trying your suggestion, but how do I write the correct query? The statement selection = df.where("CHROM=16") returns an error: :22: error: type mismatch; found : String("CHROM=\'16\'") required: org.apache.spark.sql.Column How should the expression be written? > How to improve Spark SQL performance? > - > > Key: SPARK-7393 > URL: https://issues.apache.org/jira/browse/SPARK-7393 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang Lee > > We want to use Spark SQL in our project, but we found that Spark SQL > performance is not as good as we expected. The details are as follows: > 1. We save data as Parquet files on HDFS. > 2. We just select one or several rows from the Parquet file using Spark SQL. > 3. When the total record number is 61 million, it needs about 3 seconds to > get the result, which is unacceptably long for our scenario. > 4. When the total record number is 2 million, it needs about 93 ms to get the > result, which is still a little long for us. > 5. The query statement is like: SELECT * FROM DBA WHERE COLA=? AND COLB=? > And the table is not complex: it has fewer than 10 columns and the content for > each column is less than 100 bytes. > 6. Does anyone know how to improve the performance or have some other ideas? > 7. Can Spark SQL support microsecond-level response? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
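For reference, a minimal sketch of the Column-based form of that predicate (illustrative, not from the thread), using the {{ds}} DataFrame loaded in the snippet above:
{code}
import org.apache.spark.sql.functions.col

// where() expects a Column, so build the predicate as a Column expression:
val byColumn = ds.where(col("CHROM") === "16")
val byRef    = ds.where(ds("CHROM") === "16")
// filter() also accepts a SQL-style string predicate:
val byString = ds.filter("CHROM = '16'")
{code}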
[jira] [Commented] (SPARK-7432) Flaky test in PySpark CrossValidator doc test
[ https://issues.apache.org/jira/browse/SPARK-7432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532017#comment-14532017 ] Joseph K. Bradley commented on SPARK-7432: -- It happened again: [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32067/console] > Flaky test in PySpark CrossValidator doc test > - > > Key: SPARK-7432 > URL: https://issues.apache.org/jira/browse/SPARK-7432 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 1.4.0 >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng >Priority: Critical > > There was a test failure in the doc test in Python CrossValidator: > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32058/consoleFull] > Here's the full doc test: > {code} > >>> from pyspark.ml.classification import LogisticRegression > >>> from pyspark.ml.evaluation import BinaryClassificationEvaluator > >>> from pyspark.mllib.linalg import Vectors > >>> dataset = sqlContext.createDataFrame( > ... [(Vectors.dense([0.0, 1.0]), 0.0), > ... (Vectors.dense([1.0, 2.0]), 1.0), > ... (Vectors.dense([0.55, 3.0]), 0.0), > ... (Vectors.dense([0.45, 4.0]), 1.0), > ... (Vectors.dense([0.51, 5.0]), 1.0)] * 10, > ... ["features", "label"]) > >>> lr = LogisticRegression() > >>> grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1, 5]).build() > >>> evaluator = BinaryClassificationEvaluator() > >>> cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, > evaluator=evaluator) > >>> cvModel = cv.fit(dataset) > >>> expected = lr.fit(dataset, {lr.maxIter: 5}).transform(dataset) > >>> cvModel.transform(dataset).collect() == expected.collect() > True > {code} > Here's the failure message: > {code} > Running test: pyspark/ml/tuning.py ... > ** > File "pyspark/ml/tuning.py", line 108, in __main__.CrossValidator > Failed example: > cvModel.transform(dataset).collect() == expected.collect() > Expected: > True > Got: > False > ** >1 of 11 in __main__.CrossValidator > ***Test Failed*** 1 failures. > Had test failures; see logs. > [error] Got a return code of 255 on line 240 of the run-tests script. > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5213) Pluggable SQL Parser Support
[ https://issues.apache.org/jira/browse/SPARK-5213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531988#comment-14531988 ] Apache Spark commented on SPARK-5213: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/5965 > Pluggable SQL Parser Support > > > Key: SPARK-5213 > URL: https://issues.apache.org/jira/browse/SPARK-5213 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Cheng Hao >Assignee: Cheng Hao > Fix For: 1.4.0 > > > Currently, the SQL Parser dialect is hard code in SQLContext, which is not > easy to extend, we need the features like: > bin/spark-sql --driver-class-path customizedSQL92.jar > -- switch to "hiveql" dialect >spark-sql>SET spark.sql.dialect=hiveql; >spark-sql>SELECT * FROM src LIMIT 1; > -- switch to "sql" dialect >spark-sql>SET spark.sql.dialect=sql; >spark-sql>SELECT * FROM src LIMIT 1; > -- register the new SQL dialect >spark-sql> SET spark.sql.dialect.sql99=com.xxx.xxx.SQL99Dialect; >spark-sql> SET spark.sql.dialect=sql99; >spark-sql> SELECT * FROM src LIMIT 1; > -- register the non-exist SQL dialect >spark-sql> SET spark.sql.dialect.sql92=NotExistedClass; >spark-sql> SET spark.sql.dialect=sql92; >spark-sql> SELECT * FROM src LIMIT 1; > -- Exception will be thrown and switch to dialect "sql" (for SQLContext) or > "hiveql" (for HiveContext) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7308) Should there be multiple concurrent attempts for one stage?
[ https://issues.apache.org/jira/browse/SPARK-7308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-7308: Description: Currently, when there is a fetch failure, you can end up with multiple concurrent attempts for the same stage. Is this intended? At best, it leads to some very confusing behavior, and it makes it hard for the user to make sense of what is going on. At worst, I think this is cause of some very strange errors we've seen errors we've seen from users, where stages start executing before all the dependent stages have completed. This can happen in the following scenario: there is a fetch failure in attempt 0, so the stage is retried. attempt 1 starts. But, tasks from attempt 0 are still running -- some of them can also hit fetch failures after attempt 1 starts. That will cause additional stage attempts to get fired up. There is an attempt to handle this already https://github.com/apache/spark/blob/16860327286bc08b4e2283d51b4c8fe024ba5006/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1105 but that only checks whether the **stage** is running. It really should check whether that **attempt** is still running, but there isn't enough info to do that. I'll also post some info on how to reproduce this. was: Currently, when there is a fetch failure, you can end up with multiple concurrent attempts for the same stage. Is this intended? At best, it leads to some very confusing behavior, and it makes it hard for the user to make sense of what is going on. At worst, I think this is cause of some very strange errors we've seen errors we've seen from users, where stages start executing before all the dependent stages have completed. This can happen in the following scenario: there is a fetch failure in attempt 0, so the stage is retried. attempt 1 starts. But, tasks from attempt 0 are still running -- some of them can also hit fetch failures after attempt 1 starts. That will cause additional stage attempts to get fired up. There is an attempt to handle this already https://github.com/apache/spark/blob/16860327286bc08b4e2283d51b4c8fe024ba5006/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1105 but that only checks whether the **stage** is running. It really should check whether that **attempt** is still running, but there isn't enough info to do that. Given the release timeline, I'm going to submit a PR to just fail fast as soon as we detect there are multiple concurrent attempts. Would like some feedback from others on whether or not this is a good thing to do. (The crazy thing is, when I reproduce this, spark seems to actually do the right thing despite the multiple attempts at the same stage, but I feel like that is probably dumb luck from what I've been testing.) I'll also post some info on how to reproduce this. Finally, if there really shouldn't be multiple concurrent attempts, then we can open another ticket for the proper fix (as opposed to just failiing fast) after the 1.4 release. > Should there be multiple concurrent attempts for one stage? > --- > > Key: SPARK-7308 > URL: https://issues.apache.org/jira/browse/SPARK-7308 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.1 >Reporter: Imran Rashid >Assignee: Imran Rashid > > Currently, when there is a fetch failure, you can end up with multiple > concurrent attempts for the same stage. Is this intended? 
At best, it leads > to some very confusing behavior, and it makes it hard for the user to make > sense of what is going on. At worst, I think this is cause of some very > strange errors we've seen errors we've seen from users, where stages start > executing before all the dependent stages have completed. > This can happen in the following scenario: there is a fetch failure in > attempt 0, so the stage is retried. attempt 1 starts. But, tasks from > attempt 0 are still running -- some of them can also hit fetch failures after > attempt 1 starts. That will cause additional stage attempts to get fired up. > There is an attempt to handle this already > https://github.com/apache/spark/blob/16860327286bc08b4e2283d51b4c8fe024ba5006/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1105 > but that only checks whether the **stage** is running. It really should > check whether that **attempt** is still running, but there isn't enough info > to do that. > I'll also post some info on how to reproduce this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7308) Should there be multiple concurrent attempts for one stage?
[ https://issues.apache.org/jira/browse/SPARK-7308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531969#comment-14531969 ] Apache Spark commented on SPARK-7308: - User 'squito' has created a pull request for this issue: https://github.com/apache/spark/pull/5964 > Should there be multiple concurrent attempts for one stage? > --- > > Key: SPARK-7308 > URL: https://issues.apache.org/jira/browse/SPARK-7308 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.1 >Reporter: Imran Rashid >Assignee: Imran Rashid > > Currently, when there is a fetch failure, you can end up with multiple > concurrent attempts for the same stage. Is this intended? At best, it leads > to some very confusing behavior, and it makes it hard for the user to make > sense of what is going on. At worst, I think this is cause of some very > strange errors we've seen errors we've seen from users, where stages start > executing before all the dependent stages have completed. > This can happen in the following scenario: there is a fetch failure in > attempt 0, so the stage is retried. attempt 1 starts. But, tasks from > attempt 0 are still running -- some of them can also hit fetch failures after > attempt 1 starts. That will cause additional stage attempts to get fired up. > There is an attempt to handle this already > https://github.com/apache/spark/blob/16860327286bc08b4e2283d51b4c8fe024ba5006/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1105 > but that only checks whether the **stage** is running. It really should > check whether that **attempt** is still running, but there isn't enough info > to do that. > Given the release timeline, I'm going to submit a PR to just fail fast as > soon as we detect there are multiple concurrent attempts. Would like some > feedback from others on whether or not this is a good thing to do. (The > crazy thing is, when I reproduce this, spark seems to actually do the right > thing despite the multiple attempts at the same stage, but I feel like that > is probably dumb luck from what I've been testing.) > I'll also post some info on how to reproduce this. Finally, if there really > shouldn't be multiple concurrent attempts, then we can open another ticket > for the proper fix (as opposed to just failiing fast) after the 1.4 release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7411) CTAS parser is incomplete
[ https://issues.apache.org/jira/browse/SPARK-7411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7411: --- Assignee: Apache Spark (was: Cheng Hao) > CTAS parser is incomplete > - > > Key: SPARK-7411 > URL: https://issues.apache.org/jira/browse/SPARK-7411 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Michael Armbrust >Assignee: Apache Spark >Priority: Blocker > > The change to use an isolated classloader removed the use of the Semantic > Analyzer for parsing CTAS queries. We should fix this before the release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7411) CTAS parser is incomplete
[ https://issues.apache.org/jira/browse/SPARK-7411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531967#comment-14531967 ] Apache Spark commented on SPARK-7411: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/5963 > CTAS parser is incomplete > - > > Key: SPARK-7411 > URL: https://issues.apache.org/jira/browse/SPARK-7411 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Michael Armbrust >Assignee: Cheng Hao >Priority: Blocker > > The change to use an isolated classloader removed the use of the Semantic > Analyzer for parsing CTAS queries. We should fix this before the release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7411) CTAS parser is incomplete
[ https://issues.apache.org/jira/browse/SPARK-7411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7411: --- Assignee: Cheng Hao (was: Apache Spark) > CTAS parser is incomplete > - > > Key: SPARK-7411 > URL: https://issues.apache.org/jira/browse/SPARK-7411 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Michael Armbrust >Assignee: Cheng Hao >Priority: Blocker > > The change to use an isolated classloader removed the use of the Semantic > Analyzer for parsing CTAS queries. We should fix this before the release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7275) Make LogicalRelation public
[ https://issues.apache.org/jira/browse/SPARK-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526799#comment-14526799 ] Glenn Weidner edited comment on SPARK-7275 at 5/7/15 4:18 AM: -- [~smolav] Can you provide example of where being private makes it more difficult "to work with full logical plans from third party packages"? Thank you. was (Author: gweidner): Santiago M. Mola - can you provide example of where being private makes it more difficult "to work with full logical plans from third party packages"? Thank you. > Make LogicalRelation public > --- > > Key: SPARK-7275 > URL: https://issues.apache.org/jira/browse/SPARK-7275 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Santiago M. Mola >Priority: Minor > > It seems LogicalRelation is the only part of the LogicalPlan that is not > public. This makes it harder to work with full logical plans from third party > packages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531945#comment-14531945 ] meiyoula commented on SPARK-1867: - I have resolved my problem. Actually, the primary cause is ClassNotFoundException. When I add the dependency jars into executor classpath, everything is ok. > Spark Documentation Error causes java.lang.IllegalStateException: unread > block data > --- > > Key: SPARK-1867 > URL: https://issues.apache.org/jira/browse/SPARK-1867 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: sam > > I've employed two System Administrators on a contract basis (for quite a bit > of money), and both contractors have independently hit the following > exception. What we are doing is: > 1. Installing Spark 0.9.1 according to the documentation on the website, > along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs. > 2. Building a fat jar with a Spark app with sbt then trying to run it on the > cluster > I've also included code snippets, and sbt deps at the bottom. > When I've Googled this, there seems to be two somewhat vague responses: > a) Mismatching spark versions on nodes/user code > b) Need to add more jars to the SparkConf > Now I know that (b) is not the problem having successfully run the same code > on other clusters while only including one jar (it's a fat jar). > But I have no idea how to check for (a) - it appears Spark doesn't have any > version checks or anything - it would be nice if it checked versions and > threw a "mismatching version exception: you have user code using version X > and node Y has version Z". > I would be very grateful for advice on this. > The exception: > Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task > 0.0:1 failed 32 times (most recent failure: Exception failure: > java.lang.IllegalStateException: unread block data) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) > at akka.actor.ActorCell.invoke(ActorCell.scala:456) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) > at akka.dispatch.Mailbox.run(Mailbox.scala:219) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to > java.lang.IllegalStateException: unread block data [duplicate 59] > My code snippet: > val conf = new SparkConf() >.setMaster(clusterMaster) >.setAppName(appName) >.setSparkHome(sparkHome) >.setJars(SparkContext.jarOfClass(this.getClass)) > println("count = " + new SparkContext(conf).textFile(someHdfsPath).count()) > My SBT dependencies: > // relevant > "org.apache.spark" % "spark-core_2.10" % "0.9.1", > "org.apache.hadoop" % "hadoop-client" % "2.3.0-mr1-cdh5.0.0", > // standard, probably unrelated > "com.github.seratch" %% "awscala" % "[0.2,)", > "org.scalacheck" %% "scalacheck" % "1.10.1" % "test", > "org.specs2" %% "specs2" % "1.14" % "test", > "org.scala-lang" % "scala-reflect" % "2.10.3", > "org.scalaz" %% "scalaz-core" % "7.0.5", > "net.minidev" % "json-smart" % "1.2" -- This message was sent by Atlassian JIRA (v6.3.4#6332) -
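As a point of reference for the resolution described in the comment above, here is a minimal, hedged sketch of shipping the application jar and its dependency jars to the executors so deserialization on the workers does not fail with ClassNotFoundException. The master URL and jar paths are placeholders; this is not code from the reporter's project.

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Ship the fat jar plus any extra dependency jars to every executor.
val conf = new SparkConf()
  .setMaster("spark://master-host:7077")   // placeholder master URL
  .setAppName("unread-block-data-repro")
  .setJars(Seq("/path/to/app-assembly.jar", "/path/to/extra-dependency.jar"))

val sc = new SparkContext(conf)
println("count = " + sc.textFile("hdfs:///some/path").count())
{code}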
[jira] [Commented] (SPARK-7335) Submitting a query to Thrift Server occurs error: java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-7335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531944#comment-14531944 ] meiyoula commented on SPARK-7335: - I have resolved my problem. Actually, the primary cause is ClassNotFoundException. When I add the dependency jars into executor classpath, everything is ok. > Submitting a query to Thrift Server occurs error: > java.lang.IllegalStateException: unread block data > > > Key: SPARK-7335 > URL: https://issues.apache.org/jira/browse/SPARK-7335 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: meiyoula >Priority: Critical > > java.lang.IllegalStateException: unread block data > at > java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) > at > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:163) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7335) Submitting a query to Thrift Server occurs error: java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-7335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] meiyoula resolved SPARK-7335. - Resolution: Not A Problem > Submitting a query to Thrift Server occurs error: > java.lang.IllegalStateException: unread block data > > > Key: SPARK-7335 > URL: https://issues.apache.org/jira/browse/SPARK-7335 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: meiyoula >Priority: Critical > > java.lang.IllegalStateException: unread block data > at > java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) > at > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:163) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7432) Flaky test in PySpark CrossValidator doc test
[ https://issues.apache.org/jira/browse/SPARK-7432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7432: --- Assignee: Apache Spark (was: Xiangrui Meng) > Flaky test in PySpark CrossValidator doc test > - > > Key: SPARK-7432 > URL: https://issues.apache.org/jira/browse/SPARK-7432 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 1.4.0 >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Critical > > There was a test failure in the doc test in Python CrossValidator: > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32058/consoleFull] > Here's the full doc test: > {code} > >>> from pyspark.ml.classification import LogisticRegression > >>> from pyspark.ml.evaluation import BinaryClassificationEvaluator > >>> from pyspark.mllib.linalg import Vectors > >>> dataset = sqlContext.createDataFrame( > ... [(Vectors.dense([0.0, 1.0]), 0.0), > ... (Vectors.dense([1.0, 2.0]), 1.0), > ... (Vectors.dense([0.55, 3.0]), 0.0), > ... (Vectors.dense([0.45, 4.0]), 1.0), > ... (Vectors.dense([0.51, 5.0]), 1.0)] * 10, > ... ["features", "label"]) > >>> lr = LogisticRegression() > >>> grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1, 5]).build() > >>> evaluator = BinaryClassificationEvaluator() > >>> cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, > evaluator=evaluator) > >>> cvModel = cv.fit(dataset) > >>> expected = lr.fit(dataset, {lr.maxIter: 5}).transform(dataset) > >>> cvModel.transform(dataset).collect() == expected.collect() > True > {code} > Here's the failure message: > {code} > Running test: pyspark/ml/tuning.py ... > ** > File "pyspark/ml/tuning.py", line 108, in __main__.CrossValidator > Failed example: > cvModel.transform(dataset).collect() == expected.collect() > Expected: > True > Got: > False > ** >1 of 11 in __main__.CrossValidator > ***Test Failed*** 1 failures. > Had test failures; see logs. > [error] Got a return code of 255 on line 240 of the run-tests script. > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7432) Flaky test in PySpark CrossValidator doc test
[ https://issues.apache.org/jira/browse/SPARK-7432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7432: --- Assignee: Xiangrui Meng (was: Apache Spark) > Flaky test in PySpark CrossValidator doc test > - > > Key: SPARK-7432 > URL: https://issues.apache.org/jira/browse/SPARK-7432 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 1.4.0 >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng >Priority: Critical > > There was a test failure in the doc test in Python CrossValidator: > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32058/consoleFull] > Here's the full doc test: > {code} > >>> from pyspark.ml.classification import LogisticRegression > >>> from pyspark.ml.evaluation import BinaryClassificationEvaluator > >>> from pyspark.mllib.linalg import Vectors > >>> dataset = sqlContext.createDataFrame( > ... [(Vectors.dense([0.0, 1.0]), 0.0), > ... (Vectors.dense([1.0, 2.0]), 1.0), > ... (Vectors.dense([0.55, 3.0]), 0.0), > ... (Vectors.dense([0.45, 4.0]), 1.0), > ... (Vectors.dense([0.51, 5.0]), 1.0)] * 10, > ... ["features", "label"]) > >>> lr = LogisticRegression() > >>> grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1, 5]).build() > >>> evaluator = BinaryClassificationEvaluator() > >>> cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, > evaluator=evaluator) > >>> cvModel = cv.fit(dataset) > >>> expected = lr.fit(dataset, {lr.maxIter: 5}).transform(dataset) > >>> cvModel.transform(dataset).collect() == expected.collect() > True > {code} > Here's the failure message: > {code} > Running test: pyspark/ml/tuning.py ... > ** > File "pyspark/ml/tuning.py", line 108, in __main__.CrossValidator > Failed example: > cvModel.transform(dataset).collect() == expected.collect() > Expected: > True > Got: > False > ** >1 of 11 in __main__.CrossValidator > ***Test Failed*** 1 failures. > Had test failures; see logs. > [error] Got a return code of 255 on line 240 of the run-tests script. > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7432) Flaky test in PySpark CrossValidator doc test
[ https://issues.apache.org/jira/browse/SPARK-7432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531934#comment-14531934 ] Apache Spark commented on SPARK-7432: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/5962 > Flaky test in PySpark CrossValidator doc test > - > > Key: SPARK-7432 > URL: https://issues.apache.org/jira/browse/SPARK-7432 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 1.4.0 >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng >Priority: Critical > > There was a test failure in the doc test in Python CrossValidator: > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32058/consoleFull] > Here's the full doc test: > {code} > >>> from pyspark.ml.classification import LogisticRegression > >>> from pyspark.ml.evaluation import BinaryClassificationEvaluator > >>> from pyspark.mllib.linalg import Vectors > >>> dataset = sqlContext.createDataFrame( > ... [(Vectors.dense([0.0, 1.0]), 0.0), > ... (Vectors.dense([1.0, 2.0]), 1.0), > ... (Vectors.dense([0.55, 3.0]), 0.0), > ... (Vectors.dense([0.45, 4.0]), 1.0), > ... (Vectors.dense([0.51, 5.0]), 1.0)] * 10, > ... ["features", "label"]) > >>> lr = LogisticRegression() > >>> grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1, 5]).build() > >>> evaluator = BinaryClassificationEvaluator() > >>> cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, > evaluator=evaluator) > >>> cvModel = cv.fit(dataset) > >>> expected = lr.fit(dataset, {lr.maxIter: 5}).transform(dataset) > >>> cvModel.transform(dataset).collect() == expected.collect() > True > {code} > Here's the failure message: > {code} > Running test: pyspark/ml/tuning.py ... > ** > File "pyspark/ml/tuning.py", line 108, in __main__.CrossValidator > Failed example: > cvModel.transform(dataset).collect() == expected.collect() > Expected: > True > Got: > False > ** >1 of 11 in __main__.CrossValidator > ***Test Failed*** 1 failures. > Had test failures; see logs. > [error] Got a return code of 255 on line 240 of the run-tests script. > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-7008) An implementation of Factorization Machine (LibFM)
[ https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li reopened SPARK-7008: This jira should not be closed.. > An implementation of Factorization Machine (LibFM) > -- > > Key: SPARK-7008 > URL: https://issues.apache.org/jira/browse/SPARK-7008 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.3.0, 1.3.1, 1.3.2 >Reporter: zhengruifeng > Labels: features, patch > Attachments: FM_CR.xlsx, FM_convergence_rate.xlsx, QQ20150421-1.png, > QQ20150421-2.png > > > An implementation of Factorization Machines based on Scala and Spark MLlib. > FM is a kind of machine learning algorithm for multi-linear regression, and > is widely used for recommendation. > FM works well in recent years' recommendation competitions. > Ref: > http://libfm.org/ > http://doi.acm.org/10.1145/2168752.2168771 > http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
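For context on the model the description references, the standard second-order Factorization Machine prediction from the cited Rendle (2010) paper is the following (this is the textbook formula, not an excerpt from the attached implementation):

{code}
\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j
{code}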
[jira] [Commented] (SPARK-7431) cvModel does not have uid in Python doc test
[ https://issues.apache.org/jira/browse/SPARK-7431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531929#comment-14531929 ] Joseph K. Bradley commented on SPARK-7431: -- I'm working on this > cvModel does not have uid in Python doc test > > > Key: SPARK-7431 > URL: https://issues.apache.org/jira/browse/SPARK-7431 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 1.4.0 >Reporter: Joseph K. Bradley >Priority: Critical > > Try running the CrossValidator doc test in the pyspark shell. Then type > cvModel to print the model. It will fail in {{Identifiable.__repr__}} since > there is no uid defined! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7432) Flaky test in PySpark CrossValidator doc test
[ https://issues.apache.org/jira/browse/SPARK-7432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7432: - Assignee: Xiangrui Meng > Flaky test in PySpark CrossValidator doc test > - > > Key: SPARK-7432 > URL: https://issues.apache.org/jira/browse/SPARK-7432 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 1.4.0 >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng >Priority: Critical > > There was a test failure in the doc test in Python CrossValidator: > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32058/consoleFull] > Here's the full doc test: > {code} > >>> from pyspark.ml.classification import LogisticRegression > >>> from pyspark.ml.evaluation import BinaryClassificationEvaluator > >>> from pyspark.mllib.linalg import Vectors > >>> dataset = sqlContext.createDataFrame( > ... [(Vectors.dense([0.0, 1.0]), 0.0), > ... (Vectors.dense([1.0, 2.0]), 1.0), > ... (Vectors.dense([0.55, 3.0]), 0.0), > ... (Vectors.dense([0.45, 4.0]), 1.0), > ... (Vectors.dense([0.51, 5.0]), 1.0)] * 10, > ... ["features", "label"]) > >>> lr = LogisticRegression() > >>> grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1, 5]).build() > >>> evaluator = BinaryClassificationEvaluator() > >>> cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, > evaluator=evaluator) > >>> cvModel = cv.fit(dataset) > >>> expected = lr.fit(dataset, {lr.maxIter: 5}).transform(dataset) > >>> cvModel.transform(dataset).collect() == expected.collect() > True > {code} > Here's the failure message: > {code} > Running test: pyspark/ml/tuning.py ... > ** > File "pyspark/ml/tuning.py", line 108, in __main__.CrossValidator > Failed example: > cvModel.transform(dataset).collect() == expected.collect() > Expected: > True > Got: > False > ** >1 of 11 in __main__.CrossValidator > ***Test Failed*** 1 failures. > Had test failures; see logs. > [error] Got a return code of 255 on line 240 of the run-tests script. > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7432) Flaky test in PySpark CrossValidator doc test
Joseph K. Bradley created SPARK-7432: Summary: Flaky test in PySpark CrossValidator doc test Key: SPARK-7432 URL: https://issues.apache.org/jira/browse/SPARK-7432 Project: Spark Issue Type: Bug Components: ML, PySpark Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Priority: Critical There was a test failure in the doc test in Python CrossValidator: [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32058/consoleFull] Here's the full doc test: {code} >>> from pyspark.ml.classification import LogisticRegression >>> from pyspark.ml.evaluation import BinaryClassificationEvaluator >>> from pyspark.mllib.linalg import Vectors >>> dataset = sqlContext.createDataFrame( ... [(Vectors.dense([0.0, 1.0]), 0.0), ... (Vectors.dense([1.0, 2.0]), 1.0), ... (Vectors.dense([0.55, 3.0]), 0.0), ... (Vectors.dense([0.45, 4.0]), 1.0), ... (Vectors.dense([0.51, 5.0]), 1.0)] * 10, ... ["features", "label"]) >>> lr = LogisticRegression() >>> grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1, 5]).build() >>> evaluator = BinaryClassificationEvaluator() >>> cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator) >>> cvModel = cv.fit(dataset) >>> expected = lr.fit(dataset, {lr.maxIter: 5}).transform(dataset) >>> cvModel.transform(dataset).collect() == expected.collect() True {code} Here's the failure message: {code} Running test: pyspark/ml/tuning.py ... ** File "pyspark/ml/tuning.py", line 108, in __main__.CrossValidator Failed example: cvModel.transform(dataset).collect() == expected.collect() Expected: True Got: False ** 1 of 11 in __main__.CrossValidator ***Test Failed*** 1 failures. Had test failures; see logs. [error] Got a return code of 255 on line 240 of the run-tests script. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7431) cvModel does not have uid in Python doc test
Joseph K. Bradley created SPARK-7431: Summary: cvModel does not have uid in Python doc test Key: SPARK-7431 URL: https://issues.apache.org/jira/browse/SPARK-7431 Project: Spark Issue Type: Bug Components: ML, PySpark Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Priority: Critical Try running the CrossValidator doc test in the pyspark shell. Then type cvModel to print the model. It will fail in {{Identifiable.__repr__}} since there is no uid defined! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-7008) An implementation of Factorization Machine (LibFM)
[ https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng closed SPARK-7008. --- Resolution: Fixed > An implementation of Factorization Machine (LibFM) > -- > > Key: SPARK-7008 > URL: https://issues.apache.org/jira/browse/SPARK-7008 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.3.0, 1.3.1, 1.3.2 >Reporter: zhengruifeng > Labels: features, patch > Attachments: FM_CR.xlsx, FM_convergence_rate.xlsx, QQ20150421-1.png, > QQ20150421-2.png > > > An implementation of Factorization Machines based on Scala and Spark MLlib. > FM is a kind of machine learning algorithm for multi-linear regression, and > is widely used for recommendation. > FM works well in recent years' recommendation competitions. > Ref: > http://libfm.org/ > http://doi.acm.org/10.1145/2168752.2168771 > http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7430) General improvements to streaming tests to increase debuggability
[ https://issues.apache.org/jira/browse/SPARK-7430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531906#comment-14531906 ] Apache Spark commented on SPARK-7430: - User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/5961 > General improvements to streaming tests to increase debuggability > - > > Key: SPARK-7430 > URL: https://issues.apache.org/jira/browse/SPARK-7430 > Project: Spark > Issue Type: Test > Components: Streaming, Tests >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7430) General improvements to streaming tests to increase debuggability
[ https://issues.apache.org/jira/browse/SPARK-7430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7430: - Priority: Critical (was: Major) > General improvements to streaming tests to increase debuggability > - > > Key: SPARK-7430 > URL: https://issues.apache.org/jira/browse/SPARK-7430 > Project: Spark > Issue Type: Test > Components: Streaming, Tests >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7430) General improvements to streaming tests to increase debuggability
Tathagata Das created SPARK-7430: Summary: General improvements to streaming tests to increase debuggability Key: SPARK-7430 URL: https://issues.apache.org/jira/browse/SPARK-7430 Project: Spark Issue Type: Test Components: Streaming, Tests Reporter: Tathagata Das Assignee: Tathagata Das -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7407) Use uid and param name to identify a parameter instead of the param object
[ https://issues.apache.org/jira/browse/SPARK-7407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531898#comment-14531898 ] Joseph K. Bradley commented on SPARK-7407: -- I hope we can make this change without changing the user-facing API. That seems very doable for Scala, where ParamMap is a class. It sounds harder for Python. Should we make it a class there too? > Use uid and param name to identify a parameter instead of the param object > -- > > Key: SPARK-7407 > URL: https://issues.apache.org/jira/browse/SPARK-7407 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > Transferring parameter values from one to another have been the pain point in > the ML pipeline implementation. Because we use the param object as the key in > the param map, we have to correctly copy them when making a copy of the > transformer, estimator, and models. This becomes complicated when > meta-algorithms are involved. For example, in cross validation: > {code} > val cv = new CrossValidator() > .setEstimator(lr) > .setEstimatorParamMaps(epm) > {code} > When we make a copy of `cv` with extra params that contain estimator params, > {code} > cv.copy(ParamMap(cv.numFolds -> 3, lr.maxIter -> 10)) > {code} > we need to make a copy of the `lr` object as well and map `epm` to use the > new param keys from the old `lr`. This is quite error-prone, especially if > the estimator itself is another meta-algorithm. > Using uid + param name as the key in param maps and using the same uid in > copy (and between estimator/model pairs) would simplify the implementations. > We don't need to change the keys since the copied instance has the same id as > the original instance. And it is easier to find models from a fitted pipeline. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
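To make the proposal above concrete, here is a hedged sketch (illustrative only, not the implementation that was merged) of keying parameter values by the pair (uid, param name) instead of by Param object identity:

{code}
// Illustrative only: a param map keyed by the owning instance's uid and the param name.
case class ParamKey(uid: String, name: String)

class KeyedParamMap {
  private val values = scala.collection.mutable.Map.empty[ParamKey, Any]

  def put(uid: String, name: String, value: Any): this.type = {
    values(ParamKey(uid, name)) = value
    this
  }

  def get(uid: String, name: String): Option[Any] = values.get(ParamKey(uid, name))
}

// Because a copy of an estimator keeps the original uid, values written against the
// original instance still resolve for the copy, so no key rewriting is needed.
{code}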
[jira] [Assigned] (SPARK-7429) Cleanups: Params.setDefault varargs, CrossValidatorModel transformSchema
[ https://issues.apache.org/jira/browse/SPARK-7429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7429: --- Assignee: Joseph K. Bradley (was: Apache Spark) > Cleanups: Params.setDefault varargs, CrossValidatorModel transformSchema > > > Key: SPARK-7429 > URL: https://issues.apache.org/jira/browse/SPARK-7429 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > Params.setDefault taking a set of ParamPairs should be annotated with > varargs. I thought it would not work before, but it apparently does. > CrossValidator.transform should call transformSchema since the underlying > Model might be a PipelineModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7429) Cleanups: Params.setDefault varargs, CrossValidatorModel transformSchema
[ https://issues.apache.org/jira/browse/SPARK-7429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531893#comment-14531893 ] Apache Spark commented on SPARK-7429: - User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/5960 > Cleanups: Params.setDefault varargs, CrossValidatorModel transformSchema > > > Key: SPARK-7429 > URL: https://issues.apache.org/jira/browse/SPARK-7429 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > Params.setDefault taking a set of ParamPairs should be annotated with > varargs. I thought it would not work before, but it apparently does. > CrossValidator.transform should call transformSchema since the underlying > Model might be a PipelineModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7429) Cleanups: Params.setDefault varargs, CrossValidatorModel transformSchema
[ https://issues.apache.org/jira/browse/SPARK-7429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7429: --- Assignee: Apache Spark (was: Joseph K. Bradley) > Cleanups: Params.setDefault varargs, CrossValidatorModel transformSchema > > > Key: SPARK-7429 > URL: https://issues.apache.org/jira/browse/SPARK-7429 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Minor > > Params.setDefault taking a set of ParamPairs should be annotated with > varargs. I thought it would not work before, but it apparently does. > CrossValidator.transform should call transformSchema since the underlying > Model might be a PipelineModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7429) Cleanups: Params.setDefault varargs, CrossValidatorModel transformSchema
Joseph K. Bradley created SPARK-7429: Summary: Cleanups: Params.setDefault varargs, CrossValidatorModel transformSchema Key: SPARK-7429 URL: https://issues.apache.org/jira/browse/SPARK-7429 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor Params.setDefault taking a set of ParamPairs should be annotated with varargs. I thought it would not work before, but it apparently does. CrossValidator.transform should call transformSchema since the underlying Model might be a PipelineModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
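A minimal sketch of the varargs point above follows; the class and member names are placeholders, not the actual Spark Params trait.

{code}
import scala.annotation.varargs

class ExampleParams {
  private val defaults = scala.collection.mutable.Map.empty[String, Any]

  // @varargs generates a Java-friendly overload that accepts an array,
  // so the repeated-parameter setter is callable from Java as well.
  @varargs
  def setDefault(pairs: (String, Any)*): this.type = {
    pairs.foreach { case (name, value) => defaults(name) = value }
    this
  }
}
{code}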
[jira] [Updated] (SPARK-7411) CTAS parser is incomplete
[ https://issues.apache.org/jira/browse/SPARK-7411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-7411: Assignee: Cheng Hao > CTAS parser is incomplete > - > > Key: SPARK-7411 > URL: https://issues.apache.org/jira/browse/SPARK-7411 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Michael Armbrust >Assignee: Cheng Hao >Priority: Blocker > > The change to use an isolated classloader removed the use of the Semantic > Analyzer for parsing CTAS queries. We should fix this before the release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7428) DataFrame.join() could create a new df with duplicate column name
yan tianxing created SPARK-7428: --- Summary: DataFrame.join() could create a new df with duplicate column name Key: SPARK-7428 URL: https://issues.apache.org/jira/browse/SPARK-7428 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Environment: spark-1.3.0-bin-hadoop2.4 Reporter: yan tianxing >val df = sc.parallelize(Array(1,2,3)).toDF("x") >val df2 = sc.parallelize(Array(1,4,5)).toDF("x") >val df3 = df.join(df2,df("x")===df2("x"),"inner") >df3.show x x 1 1 > df3.select("x") org.apache.spark.sql.AnalysisException: Ambiguous references to x: (x#1,List()),(x#3,List()); at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:211) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:109) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:267) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:260) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:103) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:117) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:116) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:121) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7.applyOrElse(Analyzer.scala:260) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7.applyOrElse(Analyzer.scala:197) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250) at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:197) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:196) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59) at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111) at scala.collection.immutable.List.foldLeft(List.scala:84) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$ano
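A common workaround, shown here as a hedged sketch rather than anything from the ticket, is to rename one side before the join so the resulting DataFrame has no ambiguous column name. It assumes a spark-shell style session with an existing SparkContext `sc` and SQLContext `sqlContext`, as in the report above.

{code}
import sqlContext.implicits._   // assumes an existing SQLContext named sqlContext

val df  = sc.parallelize(Array(1, 2, 3)).toDF("x")
val df2 = sc.parallelize(Array(1, 4, 5)).toDF("x").withColumnRenamed("x", "y")

val df3 = df.join(df2, df("x") === df2("y"), "inner")
df3.select("x").show()   // unambiguous: only one column is named "x"
{code}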
[jira] [Updated] (SPARK-6943) Graphically show the RDD DAG on the UI
[ https://issues.apache.org/jira/browse/SPARK-6943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6943: - Attachment: new-stage-page-5-6-15.png new-job-page-5-6-15.png > Graphically show the RDD DAG on the UI > -- > > Key: SPARK-6943 > URL: https://issues.apache.org/jira/browse/SPARK-6943 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Reporter: Patrick Wendell >Assignee: Andrew Or > Fix For: 1.4.0 > > Attachments: DAGvisualizationintheSparkWebUI.pdf, job-page.png, > new-job-page-5-6-15.png, new-stage-page-5-6-15.png, stage-page.png > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7217) Add configuration to control the default behavior of StreamingContext.stop() implicitly calling SparkContext.stop()
[ https://issues.apache.org/jira/browse/SPARK-7217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7217: - Priority: Blocker (was: Major) Target Version/s: 1.4.0 > Add configuration to control the default behavior of StreamingContext.stop() > implicitly calling SparkContext.stop() > --- > > Key: SPARK-7217 > URL: https://issues.apache.org/jira/browse/SPARK-7217 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.3.1 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Blocker > > In environments like notebooks, the SparkContext is managed by the underlying > infrastructure and it is expected that the SparkContext will not be stopped. > However, StreamingContext.stop() calls SparkContext.stop() as a non-intuitive > side-effect. This JIRA is to add a configuration in SparkConf that sets the > default StreamingContext stop behavior. It should be such that the existing > behavior does not change for existing users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
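For context, StreamingContext already exposes a per-call flag for this; the ticket is about making the default configurable. Below is a hedged sketch of the existing per-call behavior; the new configuration key itself is not shown because its final name is decided in the ticket and its pull request.

{code}
import org.apache.spark.streaming.{Seconds, StreamingContext}

// assumes an existing SparkContext `sc` managed by the notebook infrastructure
val ssc = new StreamingContext(sc, Seconds(1))
// ... define and start the streaming job ...
ssc.stop(stopSparkContext = false)   // stop streaming but keep the shared SparkContext alive
{code}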
[jira] [Updated] (SPARK-6656) Allow the application name to be passed in versus pulling from SparkContext.getAppName()
[ https://issues.apache.org/jira/browse/SPARK-6656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-6656: - Assignee: Chris Fregly > Allow the application name to be passed in versus pulling from > SparkContext.getAppName() > - > > Key: SPARK-6656 > URL: https://issues.apache.org/jira/browse/SPARK-6656 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.1.0 >Reporter: Chris Fregly >Assignee: Chris Fregly > > this is useful for the scenario where Kinesis Spark Streaming is being > invoked from the Spark Shell. in this case, the application name in the > SparkContext is pre-set to "Spark Shell". > this isn't a common or recommended use case, but it's best to make this > configurable outside of SparkContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7427) Make sharedParams match in Scala, Python
[ https://issues.apache.org/jira/browse/SPARK-7427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7427: - Description: The documentation for shared Params differs a little between Scala, Python. The Python docs should be modified to match the Scala ones. This will require modifying the sharedParamsCodeGen files. (was: The documentation for shared Params differs a little between Scala, Python. The Python docs should be modified to match the Scala ones.) > Make sharedParams match in Scala, Python > > > Key: SPARK-7427 > URL: https://issues.apache.org/jira/browse/SPARK-7427 > Project: Spark > Issue Type: Documentation > Components: ML, PySpark >Reporter: Joseph K. Bradley >Priority: Trivial > Labels: starter > > The documentation for shared Params differs a little between Scala, Python. > The Python docs should be modified to match the Scala ones. This will > require modifying the sharedParamsCodeGen files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7391) DAG visualization: open viz on stage page if from job page
[ https://issues.apache.org/jira/browse/SPARK-7391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7391: --- Assignee: Apache Spark (was: Andrew Or) > DAG visualization: open viz on stage page if from job page > -- > > Key: SPARK-7391 > URL: https://issues.apache.org/jira/browse/SPARK-7391 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.4.0 >Reporter: Andrew Or >Assignee: Apache Spark >Priority: Minor > > Right now we can click from the job page to the stage page. But as soon as > you get to the stage page, you will have to open the viz manually again. This > is annoying for users (like me) who expect that clicking from the job page > would expand the stage DAG. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7427) Make sharedParams match in Scala, Python
[ https://issues.apache.org/jira/browse/SPARK-7427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7427: - Issue Type: Documentation (was: Improvement) > Make sharedParams match in Scala, Python > > > Key: SPARK-7427 > URL: https://issues.apache.org/jira/browse/SPARK-7427 > Project: Spark > Issue Type: Documentation > Components: ML, PySpark >Reporter: Joseph K. Bradley >Priority: Trivial > Labels: starter > > The documentation for shared Params differs a little between Scala, Python. > The Python docs should be modified to match the Scala ones. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7391) DAG visualization: open viz on stage page if from job page
[ https://issues.apache.org/jira/browse/SPARK-7391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531876#comment-14531876 ] Apache Spark commented on SPARK-7391: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/5958 > DAG visualization: open viz on stage page if from job page > -- > > Key: SPARK-7391 > URL: https://issues.apache.org/jira/browse/SPARK-7391 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.4.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > > Right now we can click from the job page to the stage page. But as soon as > you get to the stage page, you will have to open the viz manually again. This > is annoying for users (like me) who expect that clicking from the job page > would expand the stage DAG. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7427) Make sharedParams match in Scala, Python
[ https://issues.apache.org/jira/browse/SPARK-7427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7427: - Labels: starter (was: ) > Make sharedParams match in Scala, Python > > > Key: SPARK-7427 > URL: https://issues.apache.org/jira/browse/SPARK-7427 > Project: Spark > Issue Type: Documentation > Components: ML, PySpark >Reporter: Joseph K. Bradley >Priority: Trivial > Labels: starter > > The documentation for shared Params differs a little between Scala, Python. > The Python docs should be modified to match the Scala ones. This will > require modifying the sharedParamsCodeGen files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7391) DAG visualization: open viz on stage page if from job page
[ https://issues.apache.org/jira/browse/SPARK-7391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7391: --- Assignee: Andrew Or (was: Apache Spark) > DAG visualization: open viz on stage page if from job page > -- > > Key: SPARK-7391 > URL: https://issues.apache.org/jira/browse/SPARK-7391 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.4.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > > Right now we can click from the job page to the stage page. But as soon as > you get to the stage page, you will have to open the viz manually again. This > is annoying for users (like me) who expect that clicking from the job page > would expand the stage DAG. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7427) Make sharedParams match in Scala, Python
Joseph K. Bradley created SPARK-7427: Summary: Make sharedParams match in Scala, Python Key: SPARK-7427 URL: https://issues.apache.org/jira/browse/SPARK-7427 Project: Spark Issue Type: Improvement Components: ML, PySpark Reporter: Joseph K. Bradley Priority: Trivial The documentation for shared Params differs a little between Scala, Python. The Python docs should be modified to match the Scala ones. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7424) spark.ml classification, regression abstractions should add metadata to output column
[ https://issues.apache.org/jira/browse/SPARK-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531875#comment-14531875 ] Joseph K. Bradley commented on SPARK-7424: -- I started a little work on this. It will involve modifying PredictorParams.validateAndTransformSchema to copy metadata from the labelCol to the outputCol, if available. It should not of course copy the column name. The PredictionModel will need to store the labelCol attribute, if available. This may require modifying subclasses. It may also require specializing validateAndTransformSchema for Predictor and PredictionModel (making 2 versions). > spark.ml classification, regression abstractions should add metadata to > output column > - > > Key: SPARK-7424 > URL: https://issues.apache.org/jira/browse/SPARK-7424 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > Update ClassificationModel, ProbabilisticClassificationModel prediction to > include numClasses in output column metadata. > Update RegressionModel to specify output column metadata as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
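A hedged sketch of the schema change described in the comment above follows; the helper name is illustrative and this is not the Spark implementation.

{code}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Append the prediction column, carrying over the label column's metadata
// (e.g. the number of classes) but not its name.
def addPredictionCol(schema: StructType, labelCol: String, outputCol: String): StructType = {
  val labelMetadata = schema(labelCol).metadata
  StructType(schema.fields :+ StructField(outputCol, DoubleType, nullable = false, labelMetadata))
}
{code}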
[jira] [Updated] (SPARK-7424) spark.ml classification, regression abstractions should add metadata to output column
[ https://issues.apache.org/jira/browse/SPARK-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7424: - Assignee: (was: Joseph K. Bradley) > spark.ml classification, regression abstractions should add metadata to > output column > - > > Key: SPARK-7424 > URL: https://issues.apache.org/jira/browse/SPARK-7424 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > Update ClassificationModel, ProbabilisticClassificationModel prediction to > include numClasses in output column metadata. > Update RegressionModel to specify output column metadata as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7284) Update streaming documentation for Spark 1.4.0 release
[ https://issues.apache.org/jira/browse/SPARK-7284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7284: - Description: Things to update (continuously updated list) - Python API for Kafka Direct - Pointers to the new Streaming UI - Update Kafka version to 0.8.2.1 - Add ref to RDD.foreachPartitionWithIndex (if merged) was: Things to update (continuously updated list) - Python API for Kafka Direct - Pointers to the new Streaming UI - Update Kafka version to 0.8.2.1 > Update streaming documentation for Spark 1.4.0 release > -- > > Key: SPARK-7284 > URL: https://issues.apache.org/jira/browse/SPARK-7284 > Project: Spark > Issue Type: Improvement > Components: Documentation, Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Blocker > > Things to update (continuously updated list) > - Python API for Kafka Direct > - Pointers to the new Streaming UI > - Update Kafka version to 0.8.2.1 > - Add ref to RDD.foreachPartitionWithIndex (if merged) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7347) DAG visualization: add tooltips to RDDs on job page
[ https://issues.apache.org/jira/browse/SPARK-7347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531866#comment-14531866 ] Apache Spark commented on SPARK-7347: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/5957 > DAG visualization: add tooltips to RDDs on job page > --- > > Key: SPARK-7347 > URL: https://issues.apache.org/jira/browse/SPARK-7347 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.4.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > Attachments: tooltip.png > > > Currently it's just a bunch of dots and it's not super clear what they > represent. Once we add some tooltips it will be very clear. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7347) DAG visualization: add tooltips to RDDs on job page
[ https://issues.apache.org/jira/browse/SPARK-7347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-7347: - Summary: DAG visualization: add tooltips to RDDs on job page (was: DAG visualization: add hover to RDDs on job page) > DAG visualization: add tooltips to RDDs on job page > --- > > Key: SPARK-7347 > URL: https://issues.apache.org/jira/browse/SPARK-7347 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.4.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > Attachments: tooltip.png > > > Currently it's just a bunch of dots and it's not super clear what they > represent. Once we add some tooltips it will be very clear. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7347) DAG visualization: add tooltips to RDDs on job page
[ https://issues.apache.org/jira/browse/SPARK-7347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-7347: - Attachment: tooltip.png > DAG visualization: add tooltips to RDDs on job page > --- > > Key: SPARK-7347 > URL: https://issues.apache.org/jira/browse/SPARK-7347 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.4.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > Attachments: tooltip.png > > > Currently it's just a bunch of dots and it's not super clear what they > represent. Once we add some tooltips it will be very clear. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7347) DAG visualization: add hover to RDDs on job page
[ https://issues.apache.org/jira/browse/SPARK-7347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-7347: - Attachment: (was: job-page-hover.png) > DAG visualization: add hover to RDDs on job page > > > Key: SPARK-7347 > URL: https://issues.apache.org/jira/browse/SPARK-7347 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.4.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > > Currently it's just a bunch of dots and it's not super clear what they > represent. Once we add some tooltips it will be very clear. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7426) spark.ml AttributeFactory.fromStructField should allow other NumericTypes
Joseph K. Bradley created SPARK-7426: Summary: spark.ml AttributeFactory.fromStructField should allow other NumericTypes Key: SPARK-7426 URL: https://issues.apache.org/jira/browse/SPARK-7426 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Priority: Minor It currently only supports DoubleType, but it should support others, at least for fromStructField (importing into ML attribute format, rather than exporting). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
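A hedged sketch of the relaxed type check the ticket asks for is shown below; the names are illustrative and the real AttributeFactory code differs.

{code}
import org.apache.spark.sql.types.{NumericType, StructField}

// Accept any numeric column when importing into ML attribute format,
// instead of requiring DoubleType only.
def requireNumeric(field: StructField): Unit = field.dataType match {
  case _: NumericType => // IntegerType, LongType, FloatType, DoubleType, ... are all fine
  case other => throw new IllegalArgumentException(
    s"Column ${field.name} must be numeric, but was of type $other.")
}
{code}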
[jira] [Created] (SPARK-7425) spark.ml Predictor should support other numeric types for label
Joseph K. Bradley created SPARK-7425: Summary: spark.ml Predictor should support other numeric types for label Key: SPARK-7425 URL: https://issues.apache.org/jira/browse/SPARK-7425 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Priority: Minor Currently, the Predictor abstraction expects the input labelCol type to be DoubleType, but we should support other numeric types. This will involve updating the PredictorParams.validateAndTransformSchema method. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
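On the user-facing side, accepting other numeric label types amounts to casting the label column to DoubleType before fitting. The helper below is a hedged sketch with an illustrative name, not part of Spark.

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.DoubleType

// Cast the label column to DoubleType while leaving all other columns untouched.
def castLabelToDouble(df: DataFrame, labelCol: String): DataFrame = {
  val cols = df.columns.map {
    case c if c == labelCol => df(c).cast(DoubleType).as(c)
    case c                  => df(c)
  }
  df.select(cols: _*)
}
{code}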
[jira] [Created] (SPARK-7424) spark.ml classification, regression abstractions should add metadata to output column
Joseph K. Bradley created SPARK-7424: Summary: spark.ml classification, regression abstractions should add metadata to output column Key: SPARK-7424 URL: https://issues.apache.org/jira/browse/SPARK-7424 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor Update ClassificationModel, ProbabilisticClassificationModel prediction to include numClasses in output column metadata. Update RegressionModel to specify output column metadata as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7422) Add argmax to Vector, SparseVector
[ https://issues.apache.org/jira/browse/SPARK-7422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7422: - Labels: starter (was: ) > Add argmax to Vector, SparseVector > -- > > Key: SPARK-7422 > URL: https://issues.apache.org/jira/browse/SPARK-7422 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Minor > Labels: starter > > DenseVector has an argmax method which is currently private to Spark. It > would be nice to add that method to Vector and SparseVector. Adding it to > SparseVector would require being careful about handling the inactive elements > correctly and efficiently. > We should make argmax public and add unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
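Editor's note: to illustrate the care needed around inactive elements, here is a self-contained sketch (not the private DenseVector implementation referenced above) of argmax over a sparse representation with sorted indices. The key case: when every stored value is negative and some entries are implicit zeros, the maximum of the full vector is 0 at one of the inactive positions.
{code}
// Standalone sketch of argmax for a sparse vector (size, sorted indices, values).
def sparseArgmax(size: Int, indices: Array[Int], values: Array[Double]): Int = {
  if (size == 0) {
    -1                               // empty vector: no valid index
  } else if (values.isEmpty) {
    0                                // all entries are implicit zeros
  } else {
    var maxIdx = indices(0)
    var maxVal = values(0)
    var i = 1
    while (i < values.length) {      // single pass over the active entries
      if (values(i) > maxVal) { maxVal = values(i); maxIdx = indices(i) }
      i += 1
    }
    if (maxVal < 0.0 && values.length < size) {
      // Every active value is negative and inactive (zero) entries exist,
      // so the answer is the first index missing from `indices`.
      var j = 0
      var k = 0
      while (j < size) {
        if (k < indices.length && indices(k) == j) k += 1 else return j
        j += 1
      }
    }
    maxIdx
  }
}
{code}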
[jira] [Updated] (SPARK-7396) Update Producer in Kafka example to use new API of Kafka 0.8.2
[ https://issues.apache.org/jira/browse/SPARK-7396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7396: - Fix Version/s: 1.4.0 > Update Producer in Kafka example to use new API of Kafka 0.8.2 > -- > > Key: SPARK-7396 > URL: https://issues.apache.org/jira/browse/SPARK-7396 > Project: Spark > Issue Type: Improvement > Components: Examples, Streaming >Affects Versions: 1.4.0 >Reporter: Saisai Shao >Assignee: Saisai Shao > Fix For: 1.4.0 > > > Due to upgrade of Kafka, current KafkaWordCountProducer will throw below > exception, we need to update the code accordingly. > {code} > Exception in thread "main" kafka.common.FailedToSendMessageException: Failed > to send messages after 3 tries. > at > kafka.producer.async.DefaultEventHandler.handle(DefaultEventHandler.scala:90) > at kafka.producer.Producer.send(Producer.scala:77) > at > org.apache.spark.examples.streaming.KafkaWordCountProducer$.main(KafkaWordCount.scala:96) > at > org.apache.spark.examples.streaming.KafkaWordCountProducer.main(KafkaWordCount.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:623) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
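Editor's note: for reference, the new producer API shipped with Kafka 0.8.2 (org.apache.kafka.clients.producer) looks roughly like the sketch below; the broker address, topic, and message are placeholders, and this is not the patched KafkaWordCountProducer itself.
{code}
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Minimal sketch of the new-style producer; configuration values are placeholders.
val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String]("wordcount", "some words to count"))
producer.close()
{code}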
[jira] [Created] (SPARK-7423) spark.ml Classifier predict should not convert vectors to dense format
Joseph K. Bradley created SPARK-7423: Summary: spark.ml Classifier predict should not convert vectors to dense format Key: SPARK-7423 URL: https://issues.apache.org/jira/browse/SPARK-7423 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Priority: Minor spark.ml.classification.ClassificationModel and ProbabilisticClassificationModel both use DenseVector.argmax to implement prediction (computing the prediction from the rawPrediction or probability Vectors). It would be best to implement argmax for Vector and SparseVector and use Vector.argmax, rather than converting Vectors to dense format. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
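Editor's note: the shape of the change is small once SPARK-7422 lands; a sketch only, with the method name hypothetical and assuming a public Vector.argmax becomes available:
{code}
import org.apache.spark.mllib.linalg.Vector

// Before (sketch): densifying allocates a full array even for a sparse raw prediction.
//   rawPrediction.toDense.argmax
// After (sketch): operate on the Vector directly, assuming the public argmax from SPARK-7422.
def predictFromRaw(rawPrediction: Vector): Double = rawPrediction.argmax.toDouble
{code}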
[jira] [Updated] (SPARK-7396) Update Producer in Kafka example to use new API of Kafka 0.8.2
[ https://issues.apache.org/jira/browse/SPARK-7396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7396: - Issue Type: Improvement (was: Bug) > Update Producer in Kafka example to use new API of Kafka 0.8.2 > -- > > Key: SPARK-7396 > URL: https://issues.apache.org/jira/browse/SPARK-7396 > Project: Spark > Issue Type: Improvement > Components: Examples, Streaming >Affects Versions: 1.4.0 >Reporter: Saisai Shao >Assignee: Saisai Shao > Fix For: 1.4.0 > > > Due to upgrade of Kafka, current KafkaWordCountProducer will throw below > exception, we need to update the code accordingly. > {code} > Exception in thread "main" kafka.common.FailedToSendMessageException: Failed > to send messages after 3 tries. > at > kafka.producer.async.DefaultEventHandler.handle(DefaultEventHandler.scala:90) > at kafka.producer.Producer.send(Producer.scala:77) > at > org.apache.spark.examples.streaming.KafkaWordCountProducer$.main(KafkaWordCount.scala:96) > at > org.apache.spark.examples.streaming.KafkaWordCountProducer.main(KafkaWordCount.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:623) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7405) Fix the bug that ReceiverInputDStream doesn't report InputInfo
[ https://issues.apache.org/jira/browse/SPARK-7405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-7405. -- Resolution: Fixed Fix Version/s: 1.4.0 > Fix the bug that ReceiverInputDStream doesn't report InputInfo > -- > > Key: SPARK-7405 > URL: https://issues.apache.org/jira/browse/SPARK-7405 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 1.4.0 > > > The bug is because SPARK-7139 unintentionally removed some code from SPARK-7112 > here: > https://github.com/apache/spark/commit/1854ac326a9cc6014817d8df30ed0458eee5d7d1#diff-5c8651dd78abd20439b8eb938175075dL72 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7422) Add argmax to Vector, SparseVector
Joseph K. Bradley created SPARK-7422: Summary: Add argmax to Vector, SparseVector Key: SPARK-7422 URL: https://issues.apache.org/jira/browse/SPARK-7422 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Joseph K. Bradley Priority: Minor DenseVector has an argmax method which is currently private to Spark. It would be nice to add that method to Vector and SparseVector. Adding it to SparseVector would require being careful about handling the inactive elements correctly and efficiently. We should make argmax public and add unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7421) Online LDA cleanups
[ https://issues.apache.org/jira/browse/SPARK-7421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7421: --- Assignee: Apache Spark (was: Joseph K. Bradley) > Online LDA cleanups > --- > > Key: SPARK-7421 > URL: https://issues.apache.org/jira/browse/SPARK-7421 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Minor > > Planned changes, primarily to allow us more flexibility in the future: > * Rename "tau_0" to "tau0" > * Mark LDAOptimizer trait sealed and DeveloperApi. > * Mark LDAOptimizer subclasses as final. > * Mark setOptimizer (the one taking an LDAOptimizer) and getOptimizer as > DeveloperApi since we may need to change them in the future -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7421) Online LDA cleanups
[ https://issues.apache.org/jira/browse/SPARK-7421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531819#comment-14531819 ] Apache Spark commented on SPARK-7421: - User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/5956 > Online LDA cleanups > --- > > Key: SPARK-7421 > URL: https://issues.apache.org/jira/browse/SPARK-7421 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > Planned changes, primarily to allow us more flexibility in the future: > * Rename "tau_0" to "tau0" > * Mark LDAOptimizer trait sealed and DeveloperApi. > * Mark LDAOptimizer subclasses as final. > * Mark setOptimizer (the one taking an LDAOptimizer) and getOptimizer as > DeveloperApi since we may need to change them in the future -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7421) Online LDA cleanups
[ https://issues.apache.org/jira/browse/SPARK-7421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7421: --- Assignee: Joseph K. Bradley (was: Apache Spark) > Online LDA cleanups > --- > > Key: SPARK-7421 > URL: https://issues.apache.org/jira/browse/SPARK-7421 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > Planned changes, primarily to allow us more flexibility in the future: > * Rename "tau_0" to "tau0" > * Mark LDAOptimizer trait sealed and DeveloperApi. > * Mark LDAOptimizer subclasses as final. > * Mark setOptimizer (the one taking an LDAOptimizer) and getOptimizer as > DeveloperApi since we may need to change them in the future -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-7397) Add missing input information report back to ReceiverInputDStream due to SPARK-7139
[ https://issues.apache.org/jira/browse/SPARK-7397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das closed SPARK-7397. Resolution: Duplicate > Add missing input information report back to ReceiverInputDStream due to > SPARK-7139 > --- > > Key: SPARK-7397 > URL: https://issues.apache.org/jira/browse/SPARK-7397 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.4.0 >Reporter: Saisai Shao > > The input information report is missing due to the refactoring of > ReceiverInputDStream in SPARK-7139. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7396) Update Producer in Kafka example to use new API of Kafka 0.8.2
[ https://issues.apache.org/jira/browse/SPARK-7396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7396: - Assignee: Saisai Shao > Update Producer in Kafka example to use new API of Kafka 0.8.2 > -- > > Key: SPARK-7396 > URL: https://issues.apache.org/jira/browse/SPARK-7396 > Project: Spark > Issue Type: Bug > Components: Examples, Streaming >Affects Versions: 1.4.0 >Reporter: Saisai Shao >Assignee: Saisai Shao > > Due to upgrade of Kafka, current KafkaWordCountProducer will throw below > exception, we need to update the code accordingly. > {code} > Exception in thread "main" kafka.common.FailedToSendMessageException: Failed > to send messages after 3 tries. > at > kafka.producer.async.DefaultEventHandler.handle(DefaultEventHandler.scala:90) > at kafka.producer.Producer.send(Producer.scala:77) > at > org.apache.spark.examples.streaming.KafkaWordCountProducer$.main(KafkaWordCount.scala:96) > at > org.apache.spark.examples.streaming.KafkaWordCountProducer.main(KafkaWordCount.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:623) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7405) Fix the bug that ReceiverInputDStream doesn't report InputInfo
[ https://issues.apache.org/jira/browse/SPARK-7405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7405: - Assignee: Shixiong Zhu > Fix the bug that ReceiverInputDStream doesn't report InputInfo > -- > > Key: SPARK-7405 > URL: https://issues.apache.org/jira/browse/SPARK-7405 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > The bug is because SPARK-7139 unintentionally removed some code from SPARK-7112 > here: > https://github.com/apache/spark/commit/1854ac326a9cc6014817d8df30ed0458eee5d7d1#diff-5c8651dd78abd20439b8eb938175075dL72 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7421) Online LDA cleanups
Joseph K. Bradley created SPARK-7421: Summary: Online LDA cleanups Key: SPARK-7421 URL: https://issues.apache.org/jira/browse/SPARK-7421 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor Planned changes, primarily to allow us more flexibility in the future: * Rename "tau_0" to "tau0" * Mark LDAOptimizer trait sealed and DeveloperApi. * Mark LDAOptimizer subclasses as final. * Mark setOptimizer (the one taking an LDAOptimizer) and getOptimizer as DeveloperApi since we may need to change them in the future -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
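Editor's note: a sketch of what the visibility part of these changes could look like; the class and setter names follow the existing MLlib optimizer names, but this is illustrative, not the actual diff.
{code}
import org.apache.spark.annotation.DeveloperApi

// Sketch: the optimizer trait sealed and marked DeveloperApi, implementations final.
@DeveloperApi
sealed trait LDAOptimizer

@DeveloperApi
final class OnlineLDAOptimizer extends LDAOptimizer {
  private var tau0: Double = 1024  // renamed from "tau_0"
  def setTau0(tau0: Double): this.type = { this.tau0 = tau0; this }
  def getTau0: Double = tau0
}
{code}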
[jira] [Closed] (SPARK-7377) DAG visualization: JS error when there is only 1 RDD
[ https://issues.apache.org/jira/browse/SPARK-7377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-7377. Resolution: Fixed Fix Version/s: 1.4.0 > DAG visualization: JS error when there is only 1 RDD > > > Key: SPARK-7377 > URL: https://issues.apache.org/jira/browse/SPARK-7377 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.4.0 >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 1.4.0 > > Attachments: viz-bug.png > > > See screenshot. There is a simple fix. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org