[jira] [Closed] (SPARK-7220) Check whether moving shared params is a compatible change
[ https://issues.apache.org/jira/browse/SPARK-7220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng closed SPARK-7220. Resolution: Done Fix Version/s: 1.4.0 > Check whether moving shared params is a compatible change > - > > Key: SPARK-7220 > URL: https://issues.apache.org/jira/browse/SPARK-7220 > Project: Spark > Issue Type: Task > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Critical > Fix For: 1.4.0 > > > Shared params are private, and their usage is treated as an implementation > detail. But we need to make sure moving params from shared to a concrete > class is a compatible change. Otherwise, we shouldn't use shared params. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7220) Check whether moving shared params is a compatible change
[ https://issues.apache.org/jira/browse/SPARK-7220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518747#comment-14518747 ] Xiangrui Meng commented on SPARK-7220: -- I compiled an example app that calls LinearRegression with elasticNetParam, then I moved the methods under HasElasticNetParam to LinearRegressionParams. Without re-compiling, the app jar works with the new Spark assembly jar. So we can treat shared params as implementation details and don't need to worry about where the methods are declared. > Check whether moving shared params is a compatible change > - > > Key: SPARK-7220 > URL: https://issues.apache.org/jira/browse/SPARK-7220 > Project: Spark > Issue Type: Task > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Critical > Fix For: 1.4.0 > > > Shared params are private, and their usage is treated as an implementation > detail. But we need to make sure moving params from shared to a concrete > class is a compatible change. Otherwise, we shouldn't use shared params. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
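The reason the experiment works is worth spelling out: scalac emits a forwarder method on the implementing class for every concrete trait method, so a client call site compiles to an invokevirtual on the concrete class whether the method lives in the trait or in the class. A minimal self-contained sketch (simplified stand-ins, not the real org.apache.spark.ml.param.shared code):

{code:scala}
// Sketch only: simplified stand-ins for the real shared-param traits.
trait HasElasticNetParam {
  def getElasticNetParam: Double = 0.0
}

class LinearRegressionParams extends HasElasticNetParam

object CompatCheck extends App {
  // This call site compiles to an invokevirtual on LinearRegressionParams,
  // not on the trait, so a later release may declare getElasticNetParam
  // directly on the class and old client jars still link.
  println(new LinearRegressionParams().getElasticNetParam)
}
{code}

This also explains the caveat: the move is only safe because the shared traits are private[ml], so no client bytecode can reference the trait type directly.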
[jira] [Updated] (SPARK-7221) Expose the current processed file name of FileInputDStream to the users
[ https://issues.apache.org/jira/browse/SPARK-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-7221: --- Issue Type: Wish (was: New Feature) > Expose the current processed file name of FileInputDStream to the users > --- > > Key: SPARK-7221 > URL: https://issues.apache.org/jira/browse/SPARK-7221 > Project: Spark > Issue Type: Wish > Components: Streaming >Reporter: Saisai Shao >Priority: Minor > > This is a feature requested on the Spark user list > (http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-textFileStream-fileStream-Get-file-name-tt22692.html). > Currently there is no API to get the processed file name from > FileInputDStream; it would be useful to expose this to users. > The major problem is how to expose it to users in an elegant way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7221) Expose the current processed file name of FileInputDStream to the users
Saisai Shao created SPARK-7221: -- Summary: Expose the current processed file name of FileInputDStream to the users Key: SPARK-7221 URL: https://issues.apache.org/jira/browse/SPARK-7221 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Saisai Shao Priority: Minor This is a feature requested on the Spark user list (http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-textFileStream-fileStream-Get-file-name-tt22692.html). Currently there is no API to get the processed file name from FileInputDStream; it would be useful to expose this to users. The major problem is how to expose it to users in an elegant way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
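Until such an API exists, one workaround is to drop down to the underlying HadoopRDD and read the file path from the InputSplit; for streaming, the same function can be applied inside transform(). A sketch for the batch case; the asInstanceOf casts are the fragile part and assume the RDD's concrete type:

{code:scala}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileSplit, TextInputFormat}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.HadoopRDD

// Pair every line with the name of the file it came from.
def linesWithFileNames(sc: SparkContext, path: String) = {
  val raw = sc.hadoopFile[LongWritable, Text, TextInputFormat](path)
  raw.asInstanceOf[HadoopRDD[LongWritable, Text]]
    .mapPartitionsWithInputSplit((split, iter) => {
      val file = split.asInstanceOf[FileSplit].getPath.toString
      iter.map { case (_, line) => (file, line.toString) }
    }, preservesPartitioning = true)
}
{code}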
[jira] [Resolved] (SPARK-6756) Add compress() to Vector
[ https://issues.apache.org/jira/browse/SPARK-6756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-6756. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5756 [https://github.com/apache/spark/pull/5756] > Add compress() to Vector > > > Key: SPARK-6756 > URL: https://issues.apache.org/jira/browse/SPARK-6756 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Fix For: 1.4.0 > > > Add compress() to Vector that automatically converts the underlying vector to > dense or sparse based on the number of non-zeros. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
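The trade-off behind compress() is straightforward: a dense vector costs roughly 8 bytes per element, while a sparse one costs roughly 12 bytes per stored non-zero (a 4-byte index plus an 8-byte value). A hedged sketch of the selection rule; the exact constants in the merged patch may differ:

{code:scala}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Sketch: pick the representation with the smaller footprint.
// Dense needs ~8 * size bytes; sparse needs ~12 * nnz bytes.
def compress(v: Vector): Vector = {
  val values = v.toArray
  val nnz = values.count(_ != 0.0)
  if (1.5 * (nnz + 1.0) < v.size) {
    Vectors.sparse(v.size, values.zipWithIndex.collect {
      case (value, i) if value != 0.0 => (i, value)
    })
  } else {
    Vectors.dense(values)
  }
}
{code}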
[jira] [Created] (SPARK-7220) Check whether moving shared params is a compatible change
Xiangrui Meng created SPARK-7220: Summary: Check whether moving shared params is a compatible change Key: SPARK-7220 URL: https://issues.apache.org/jira/browse/SPARK-7220 Project: Spark Issue Type: Task Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical Shared params are private, and their usage is treated as an implementation detail. But we need to make sure moving params from shared to a concrete class is a compatible change. Otherwise, we shouldn't use shared params. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated
[ https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518721#comment-14518721 ] Zhang, Liye commented on SPARK-7189: Hi [~vanzin], I think using a timestamp is not precise enough; it is very similar to using the modification time. There will always be situations where several operations finish within a very short time (say less than 1 millisecond, or even shorter), so neither the timestamp nor the modification time can be trusted. The target is to detect status changes of the files, including content changes (write operations) and permission changes (rename operations). `Inotify` can detect the change, but it's not available in HDFS before version 2.7. One way to detect the change is to set a flag after each operation and reset it after reloading the file, but that would make the code really ugly, so it's a bad option. > History server will always reload the same file even when no log file is > updated > > > Key: SPARK-7189 > URL: https://issues.apache.org/jira/browse/SPARK-7189 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Zhang, Liye >Priority: Minor > > The history server checks every log file by its modification time. It reloads > a file if the file's modification time is later than or equal to the latest > modification time it remembers, so it periodically reloads the file(s) with > the latest modification time even when nothing has changed. This is > unnecessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
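One cheap middle ground, short of inotify or per-operation flags, is to key the reload decision on a per-file fingerprint such as the (modification time, length) pair, so an unchanged file is never re-read; note this still cannot see a pure rename. A sketch under that assumption (names here are illustrative, not the actual history provider fields):

{code:scala}
import org.apache.hadoop.fs.FileStatus

// Sketch: remember a fingerprint per file and reload only when it changes,
// instead of comparing against one global "latest modification time".
case class LogFingerprint(modTime: Long, length: Long)

def needsReload(seen: Map[String, LogFingerprint], status: FileStatus): Boolean = {
  val fp = LogFingerprint(status.getModificationTime, status.getLen)
  !seen.get(status.getPath.toString).exists(_ == fp)
}
{code}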
[jira] [Resolved] (SPARK-7208) Add Matrix, SparseMatrix to __all__ list in linalg.py
[ https://issues.apache.org/jira/browse/SPARK-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-7208. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5759 [https://github.com/apache/spark/pull/5759] > Add Matrix, SparseMatrix to __all__ list in linalg.py > - > > Key: SPARK-7208 > URL: https://issues.apache.org/jira/browse/SPARK-7208 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.4.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Trivial > Fix For: 1.4.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7202) Add SparseMatrixPickler to SerDe
[ https://issues.apache.org/jira/browse/SPARK-7202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar updated SPARK-7202: --- Priority: Major (was: Minor) > Add SparseMatrixPickler to SerDe > > > Key: SPARK-7202 > URL: https://issues.apache.org/jira/browse/SPARK-7202 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Manoj Kumar > > We need a SparseMatrixPickler similar to the existing DenseMatrixPickler. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7219) HashingTF should output ML attributes
Xiangrui Meng created SPARK-7219: Summary: HashingTF should output ML attributes Key: SPARK-7219 URL: https://issues.apache.org/jira/browse/SPARK-7219 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng HashingTF knows the output feature dimension, which should be in the output ML attributes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7219) HashingTF should output ML attributes
[ https://issues.apache.org/jira/browse/SPARK-7219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7219: - Priority: Trivial (was: Major) > HashingTF should output ML attributes > - > > Key: SPARK-7219 > URL: https://issues.apache.org/jira/browse/SPARK-7219 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Trivial > > HashingTF knows the output feature dimension, which should be in the output > ML attributes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7194) Vectors factors method for sparse vectors should accept the output of zipWithIndex
[ https://issues.apache.org/jira/browse/SPARK-7194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7194: --- Assignee: Apache Spark > Vectors factors method for sparse vectors should accept the output of > zipWithIndex > -- > > Key: SPARK-7194 > URL: https://issues.apache.org/jira/browse/SPARK-7194 > Project: Spark > Issue Type: Improvement >Reporter: Juliet Hougland >Assignee: Apache Spark > > Let's say we have an RDD of Array[Double] where zero values are explicitly > recorded, i.e. (0.0, 0.0, 3.2, 0.0...). If we want to transform this into an RDD > of sparse vectors, we currently have to: > arr_doubles.map{ array => >val indexElem: Seq[(Int, Double)] = array.zipWithIndex.filter(tuple => > tuple._1 != 0.0).map(tuple => (tuple._2, tuple._1)) > Vectors.sparse(array.length, indexElem) > } > Notice that there is a map step at the end to switch the order of the index > and the element value after .zipWithIndex. There should be a factory method > on the Vectors class that allows you to avoid this flipping of tuple elements > when using zipWithIndex. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
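The requested factory is a thin wrapper over the existing Vectors.sparse(Int, Seq[(Int, Double)]). A sketch of the shape it could take; the name sparseFromZipWithIndex is hypothetical:

{code:scala}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical helper: accept the (value, index) pairs that zipWithIndex
// produces, drop explicit zeros, and flip the tuple order internally.
def sparseFromZipWithIndex(size: Int, zipped: Seq[(Double, Int)]): Vector = {
  val nonZeros = zipped.collect { case (value, i) if value != 0.0 => (i, value) }
  Vectors.sparse(size, nonZeros)
}

// Usage, mirroring the snippet in the description:
// val arr = Array(0.0, 0.0, 3.2, 0.0)
// val sv = sparseFromZipWithIndex(arr.length, arr.zipWithIndex)
{code}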
[jira] [Commented] (SPARK-7194) Vectors factors method for sparse vectors should accept the output of zipWithIndex
[ https://issues.apache.org/jira/browse/SPARK-7194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518623#comment-14518623 ] Apache Spark commented on SPARK-7194: - User 'kaka1992' has created a pull request for this issue: https://github.com/apache/spark/pull/5766 > Vectors factors method for sparse vectors should accept the output of > zipWithIndex > -- > > Key: SPARK-7194 > URL: https://issues.apache.org/jira/browse/SPARK-7194 > Project: Spark > Issue Type: Improvement >Reporter: Juliet Hougland > > Let's say we have an RDD of Array[Double] where zero values are explicitly > recorded, i.e. (0.0, 0.0, 3.2, 0.0...). If we want to transform this into an RDD > of sparse vectors, we currently have to: > arr_doubles.map{ array => >val indexElem: Seq[(Int, Double)] = array.zipWithIndex.filter(tuple => > tuple._1 != 0.0).map(tuple => (tuple._2, tuple._1)) > Vectors.sparse(array.length, indexElem) > } > Notice that there is a map step at the end to switch the order of the index > and the element value after .zipWithIndex. There should be a factory method > on the Vectors class that allows you to avoid this flipping of tuple elements > when using zipWithIndex. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7194) Vectors factors method for sparse vectors should accept the output of zipWithIndex
[ https://issues.apache.org/jira/browse/SPARK-7194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7194: --- Assignee: (was: Apache Spark) > Vectors factors method for sparse vectors should accept the output of > zipWithIndex > -- > > Key: SPARK-7194 > URL: https://issues.apache.org/jira/browse/SPARK-7194 > Project: Spark > Issue Type: Improvement >Reporter: Juliet Hougland > > Let's say we have an RDD of Array[Double] where zero values are explicitly > recorded, i.e. (0.0, 0.0, 3.2, 0.0...). If we want to transform this into an RDD > of sparse vectors, we currently have to: > arr_doubles.map{ array => >val indexElem: Seq[(Int, Double)] = array.zipWithIndex.filter(tuple => > tuple._1 != 0.0).map(tuple => (tuple._2, tuple._1)) > Vectors.sparse(array.length, indexElem) > } > Notice that there is a map step at the end to switch the order of the index > and the element value after .zipWithIndex. There should be a factory method > on the Vectors class that allows you to avoid this flipping of tuple elements > when using zipWithIndex. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518747#comment-14518747 ] Guoqiang Li commented on SPARK-5556: [spark-summit.pptx|https://issues.apache.org/jira/secure/attachment/12729035/spark-summit.pptx] introduces the relevant algorithm. > Latent Dirichlet Allocation (LDA) using Gibbs sampler > -- > > Key: SPARK-5556 > URL: https://issues.apache.org/jira/browse/SPARK-5556 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Guoqiang Li >Assignee: Pedro Rodriguez > Attachments: LDA_test.xlsx, spark-summit.pptx > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518618#comment-14518618 ] Guoqiang Li commented on SPARK-5556: LDA_Gibbs combines the advantages of the AliasLDA, FastLDA, and SparseLDA algorithms. The corresponding code is https://github.com/witgo/spark/tree/lda_Gibbs or https://github.com/witgo/zen/blob/master/ml/src/main/scala/com/github/cloudml/zen/ml/clustering/LDA.scala#L553. Yes, LightLDA converges faster, but it takes up more memory. > Latent Dirichlet Allocation (LDA) using Gibbs sampler > -- > > Key: SPARK-5556 > URL: https://issues.apache.org/jira/browse/SPARK-5556 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Guoqiang Li >Assignee: Pedro Rodriguez > Attachments: LDA_test.xlsx, spark-summit.pptx > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-5556: --- Attachment: spark-summit.pptx > Latent Dirichlet Allocation (LDA) using Gibbs sampler > -- > > Key: SPARK-5556 > URL: https://issues.apache.org/jira/browse/SPARK-5556 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Guoqiang Li >Assignee: Pedro Rodriguez > Attachments: LDA_test.xlsx, spark-summit.pptx > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7193) "Spark on Mesos" may need more tests for spark 1.3.1 release
[ https://issues.apache.org/jira/browse/SPARK-7193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518610#comment-14518610 ] Littlestar edited comment on SPARK-7193 at 4/29/15 2:40 AM: I think the official document is missing some notes about "Spark on Mesos". It worked well for me with the following: extract spark-1.3.1-bin-hadoop2.4.tgz, modify conf/spark-env.sh, repack into a new spark-1.3.1-bin-hadoop2.4.tgz, and then put it on HDFS. spark-env.sh sets JAVA_HOME, HADOOP_CONF_DIR, HADOOP_HOME was (Author: cnstar9988): I think official document missing some notes about "Spark on Mesos" I worked well with following: extract spark-1.3.1-bin-hadoop2.4.tgz, and modify conf\spark-env.sh and repack with new spark-1.3.1-bin-hadoop2.4.tgz, and then put to hdfs spark-env.sh set JAVA_HOME, HADOO_CONF_DIR, HADOO_HOME > "Spark on Mesos" may need more tests for spark 1.3.1 release > > > Key: SPARK-7193 > URL: https://issues.apache.org/jira/browse/SPARK-7193 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.3.1 >Reporter: Littlestar > > "Spark on Mesos" may need more tests for the Spark 1.3.1 release > http://spark.apache.org/docs/latest/running-on-mesos.html > I tested Mesos 0.21.1/0.22.0/0.22.1 RC4. > It works well with "./bin/spark-shell --master mesos://host:5050". > Any task that needs more than one node throws the following exceptions. > {noformat} > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 10 in stage 0.0 failed 4 times, most recent failure: > Lost task 10.3 in stage 0.0 (TID 127, hpblade05): > java.lang.IllegalStateException: unread block data > at > java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2393) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1378) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1963) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1887) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1346) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:368) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68) > at > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) > at java.lang.Thread.run(Thread.java:679) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) > at 
scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > 15/04/28 15:33:18 ERROR scheduler.LiveListenerBus: Listener > EventLoggingListener threw an exception > java.lang.reflect.InvocationTargetException > at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144) > at > org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144) > at scala.Option.foreach(O
[jira] [Resolved] (SPARK-7193) "Spark on Mesos" may need more tests for spark 1.3.1 release
[ https://issues.apache.org/jira/browse/SPARK-7193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Littlestar resolved SPARK-7193. --- Resolution: Invalid I think the official document is missing some notes about "Spark on Mesos". It worked well for me with the following: extract spark-1.3.1-bin-hadoop2.4.tgz, modify conf/spark-env.sh, repack into a new spark-1.3.1-bin-hadoop2.4.tgz, and then put it on HDFS. spark-env.sh sets JAVA_HOME, HADOOP_CONF_DIR, HADOOP_HOME > "Spark on Mesos" may need more tests for spark 1.3.1 release > > > Key: SPARK-7193 > URL: https://issues.apache.org/jira/browse/SPARK-7193 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.3.1 >Reporter: Littlestar > > "Spark on Mesos" may need more tests for the Spark 1.3.1 release > http://spark.apache.org/docs/latest/running-on-mesos.html > I tested Mesos 0.21.1/0.22.0/0.22.1 RC4. > It works well with "./bin/spark-shell --master mesos://host:5050". > Any task that needs more than one node throws the following exceptions. > {noformat} > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 10 in stage 0.0 failed 4 times, most recent failure: > Lost task 10.3 in stage 0.0 (TID 127, hpblade05): > java.lang.IllegalStateException: unread block data > at > java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2393) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1378) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1963) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1887) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1346) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:368) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68) > at > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) > at java.lang.Thread.run(Thread.java:679) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354) > at 
org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > 15/04/28 15:33:18 ERROR scheduler.LiveListenerBus: Listener > EventLoggingListener threw an exception > java.lang.reflect.InvocationTargetException > at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144) > at > org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:144) > at > org.apache.spark.scheduler.EventLoggingListener.onStageCompleted(EventLoggingListener.scala:165) > at > org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:32) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerB
[jira] [Resolved] (SPARK-7138) Add method to BlockGenerator to add multiple records to BlockGenerator with single callback
[ https://issues.apache.org/jira/browse/SPARK-7138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-7138. -- Resolution: Fixed Fix Version/s: 1.4.0 > Add method to BlockGenerator to add multiple records to BlockGenerator with > single callback > --- > > Key: SPARK-7138 > URL: https://issues.apache.org/jira/browse/SPARK-7138 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Minor > Fix For: 1.4.0 > > > This is for receivers that receive data in small batches (like Kinesis) and > want to add them with the callback function called only once. > This is for internal use only, for an improvement to the Kinesis receiver that > we are planning to do. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518602#comment-14518602 ] Guoqiang Li commented on SPARK-5556: I put the latest LDA code in [Zen|https://github.com/witgo/zen/tree/master/ml/src/main/scala/com/github/cloudml/zen/ml/clustering]. The test results are [here|https://issues.apache.org/jira/secure/attachment/12729030/LDA_test.xlsx] (72 cores, 216 GB RAM, 6 servers, Gigabit Ethernet). > Latent Dirichlet Allocation (LDA) using Gibbs sampler > -- > > Key: SPARK-5556 > URL: https://issues.apache.org/jira/browse/SPARK-5556 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Guoqiang Li >Assignee: Pedro Rodriguez > Attachments: LDA_test.xlsx > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518601#comment-14518601 ] Pedro Rodriguez commented on SPARK-5556: [~gq] is the LDAGibbs line what I implemented or something else? In any case, the optimization on sampling shouldn't change the results, so it looks like LightLDA converges to a better perplexity. Do you have any performance graphs? > Latent Dirichlet Allocation (LDA) using Gibbs sampler > -- > > Key: SPARK-5556 > URL: https://issues.apache.org/jira/browse/SPARK-5556 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Guoqiang Li >Assignee: Pedro Rodriguez > Attachments: LDA_test.xlsx > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-5556: --- Attachment: LDA_test.xlsx > Latent Dirichlet Allocation (LDA) using Gibbs sampler > -- > > Key: SPARK-5556 > URL: https://issues.apache.org/jira/browse/SPARK-5556 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Guoqiang Li >Assignee: Pedro Rodriguez > Attachments: LDA_test.xlsx > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7218) Create a real iterator with open/close for Spark SQL
Reynold Xin created SPARK-7218: -- Summary: Create a real iterator with open/close for Spark SQL Key: SPARK-7218 URL: https://issues.apache.org/jira/browse/SPARK-7218 Project: Spark Issue Type: New Feature Components: SQL Reporter: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7169) Allow to specify metrics configuration more flexibly
[ https://issues.apache.org/jira/browse/SPARK-7169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518508#comment-14518508 ] Saisai Shao commented on SPARK-7169: Hi [~jlewandowski], regarding your second problem, I think you don't have to copy the metrics configuration file manually to every machine one by one; you could use spark-submit --files path/to/your/metrics_properties to ship your configuration to each executor/container. And for the first problem, is it a big problem that all the configuration files need to be in the same directory? Lots of Spark as well as Hadoop conf files have such a requirement. But you can configure the driver and executors with different parameters in the conf file, since MetricsSystem supports such features. Yes, I think the current metrics configuration may not be so easy to use; any improvement is greatly appreciated :). > Allow to specify metrics configuration more flexibly > > > Key: SPARK-7169 > URL: https://issues.apache.org/jira/browse/SPARK-7169 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.2.2, 1.3.1 >Reporter: Jacek Lewandowski >Priority: Minor > > Metrics are configured in the {{metrics.properties}} file. The path to this file is > specified in {{SparkConf}} under the key {{spark.metrics.conf}}. The property is > read when {{MetricsSystem}} is created, that is, during {{SparkEnv}} > initialisation. > h5.Problem > When the user runs his application he has no way to provide the metrics > configuration for executors. Although one can specify the path to the metrics > configuration file, (1) the path is common to all the nodes and the client > machine, so there is an implicit assumption that all the machines have the same file > in the same location, and (2) the user actually needs to copy the file > manually to the worker nodes because the file is read before the user files > are populated to the executor local directories. All of this makes it very > difficult to play with the metrics configuration. > h5. Proposed solution > I think that the easiest and most consistent solution would be to move > the configuration from a separate file directly into {{SparkConf}}. We may > prefix all the settings from the metrics configuration with, say, > {{spark.metrics.props}}. For backward compatibility, these properties > would still be loaded from the specified file as they are now. Such a solution doesn't > change the API, so maybe it could even be included in a patch release of Spark > 1.2 and Spark 1.3. > Appreciate any feedback. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
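To make the proposal concrete, here is a sketch of how prefixed settings could be lifted out of {{SparkConf}} into the java.util.Properties object that MetricsSystem consumes; the {{spark.metrics.props.}} prefix is the one proposed above, not an existing Spark key:

{code:scala}
import java.util.Properties
import org.apache.spark.SparkConf

// Sketch: strip the proposed prefix and hand the rest to the metrics
// system; loading from spark.metrics.conf would remain the fallback.
def metricsProperties(conf: SparkConf): Properties = {
  val prefix = "spark.metrics.props."  // proposed key prefix, not yet real
  val props = new Properties()
  conf.getAll.foreach { case (k, v) =>
    if (k.startsWith(prefix)) props.setProperty(k.stripPrefix(prefix), v)
  }
  props
}
{code}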
[jira] [Created] (SPARK-7217) Add configuration to disable stopping of SparkContext when StreamingContext.stop() is called
Tathagata Das created SPARK-7217: Summary: Add configuration to disable stopping of SparkContext when StreamingContext.stop() is called Key: SPARK-7217 URL: https://issues.apache.org/jira/browse/SPARK-7217 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.3.1 Reporter: Tathagata Das Assignee: Tathagata Das In environments like notebooks, the SparkContext is managed by the underlying infrastructure and it is expected that the SparkContext will not be stopped. However, StreamingContext.stop() calls SparkContext.stop() as a non-intuitive side-effect. This JIRA is to add a configuration in SparkConf that sets the default StreamingContext stop behavior. It should be such that the existing behavior does not change for existing users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
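For context, the per-call opt-out already exists; this JIRA only adds a conf-driven default so that managed environments do not depend on every caller remembering it. A sketch; the configuration key name is an assumption until the patch lands:

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("notebook")
// Proposed default (key name is an assumption pending the patch):
// conf.set("spark.streaming.stopSparkContextByDefault", "false")

val ssc = new StreamingContext(conf, Seconds(1))
// Today, every caller must remember to opt out explicitly:
ssc.stop(stopSparkContext = false)
{code}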
[jira] [Resolved] (SPARK-6965) StringIndexer should convert input to Strings
[ https://issues.apache.org/jira/browse/SPARK-6965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-6965. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5753 [https://github.com/apache/spark/pull/5753] > StringIndexer should convert input to Strings > - > > Key: SPARK-6965 > URL: https://issues.apache.org/jira/browse/SPARK-6965 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.0 >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng >Priority: Minor > Fix For: 1.4.0 > > > StringIndexer should convert non-String input types to String. That way, it > can handle any basic types such as Int, Double, etc. > It can convert any input type to strings first and store the string labels > (instead of an arbitrary type). That will simplify model export/import. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
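The conversion itself is a one-liner at the DataFrame level, which is presumably close to what the transformer now does internally before indexing (a sketch, not the merged implementation; df and the column names are placeholders):

{code:scala}
import org.apache.spark.sql.types.StringType

// Sketch: cast any basic input type (Int, Double, ...) to String first,
// then index the resulting string labels as before.
val casted = df.withColumn("labelStr", df("label").cast(StringType))
{code}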
[jira] [Created] (SPARK-7216) Show driver details in Mesos cluster UI
Timothy Chen created SPARK-7216: --- Summary: Show driver details in Mesos cluster UI Key: SPARK-7216 URL: https://issues.apache.org/jira/browse/SPARK-7216 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Show driver details in Mesos cluster UI -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7216) Show driver details in Mesos cluster UI
[ https://issues.apache.org/jira/browse/SPARK-7216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518447#comment-14518447 ] Apache Spark commented on SPARK-7216: - User 'tnachen' has created a pull request for this issue: https://github.com/apache/spark/pull/5763 > Show driver details in Mesos cluster UI > --- > > Key: SPARK-7216 > URL: https://issues.apache.org/jira/browse/SPARK-7216 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Timothy Chen > > Show driver details in Mesos cluster UI -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7216) Show driver details in Mesos cluster UI
[ https://issues.apache.org/jira/browse/SPARK-7216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7216: --- Assignee: Apache Spark > Show driver details in Mesos cluster UI > --- > > Key: SPARK-7216 > URL: https://issues.apache.org/jira/browse/SPARK-7216 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Timothy Chen >Assignee: Apache Spark > > Show driver details in Mesos cluster UI -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7216) Show driver details in Mesos cluster UI
[ https://issues.apache.org/jira/browse/SPARK-7216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7216: --- Assignee: (was: Apache Spark) > Show driver details in Mesos cluster UI > --- > > Key: SPARK-7216 > URL: https://issues.apache.org/jira/browse/SPARK-7216 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Timothy Chen > > Show driver details in Mesos cluster UI -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6627) Clean up of shuffle code and interfaces
[ https://issues.apache.org/jira/browse/SPARK-6627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518402#comment-14518402 ] Apache Spark commented on SPARK-6627: - User 'kayousterhout' has created a pull request for this issue: https://github.com/apache/spark/pull/5764 > Clean up of shuffle code and interfaces > --- > > Key: SPARK-6627 > URL: https://issues.apache.org/jira/browse/SPARK-6627 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Reporter: Patrick Wendell >Assignee: Patrick Wendell >Priority: Critical > Fix For: 1.4.0 > > > The shuffle code in Spark is somewhat messy and could use some interface > clean-up, especially with some larger changes outstanding. This is a catch-all > for what may be some small improvements in a few different PRs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518400#comment-14518400 ] Joseph K. Bradley commented on SPARK-5556: -- That plan sounds good. I haven't yet been able to look into LightLDA, but it would be good to understand whether it's (a) a modification which could be added to Gibbs later on or (b) an algorithm which belongs on its own as a separate implementation. > Latent Dirichlet Allocation (LDA) using Gibbs sampler > -- > > Key: SPARK-5556 > URL: https://issues.apache.org/jira/browse/SPARK-5556 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Guoqiang Li >Assignee: Pedro Rodriguez > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7215) Make repartition and coalesce a part of the query plan
[ https://issues.apache.org/jira/browse/SPARK-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7215: --- Assignee: Apache Spark > Make repartition and coalesce a part of the query plan > -- > > Key: SPARK-7215 > URL: https://issues.apache.org/jira/browse/SPARK-7215 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Burak Yavuz >Assignee: Apache Spark >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518378#comment-14518378 ] Pedro Rodriguez commented on SPARK-5556: I will start working on it again then. It would be great for that research project to result in Gibbs being added. The refactoring ended up roadblocking that quite a bit. I think [~gq] was working on something called LightLDA. I don't know the specifics of the algorithm, but the sampler theoretically scales O(1) with the number of topics. In the testing I did, my implementation looks like it is O(1) or very near it in practice. To get Gibbs merged in (or as a candidate implementation), how does this look: 1. Refactor code to fit the PR that you just merged 2. Use the testing harness you used for the EM LDA to test under the same conditions. This should be fairly easy since you already did all the work to get things pipelining correctly. 3. If it scales well, then merge or consider other applications 4. Code review somewhere in there. > Latent Dirichlet Allocation (LDA) using Gibbs sampler > -- > > Key: SPARK-5556 > URL: https://issues.apache.org/jira/browse/SPARK-5556 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Guoqiang Li >Assignee: Pedro Rodriguez > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7215) Make repartition and coalesce a part of the query plan
[ https://issues.apache.org/jira/browse/SPARK-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518379#comment-14518379 ] Apache Spark commented on SPARK-7215: - User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/5762 > Make repartition and coalesce a part of the query plan > -- > > Key: SPARK-7215 > URL: https://issues.apache.org/jira/browse/SPARK-7215 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Burak Yavuz >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7215) Make repartition and coalesce a part of the query plan
[ https://issues.apache.org/jira/browse/SPARK-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7215: --- Assignee: (was: Apache Spark) > Make repartition and coalesce a part of the query plan > -- > > Key: SPARK-7215 > URL: https://issues.apache.org/jira/browse/SPARK-7215 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Burak Yavuz >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7215) Make repartition and coalesce a part of the query plan
Burak Yavuz created SPARK-7215: -- Summary: Make repartition and coalesce a part of the query plan Key: SPARK-7215 URL: https://issues.apache.org/jira/browse/SPARK-7215 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Burak Yavuz Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5182) Partitioning support for tables created by the data source API
[ https://issues.apache.org/jira/browse/SPARK-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518305#comment-14518305 ] Apache Spark commented on SPARK-5182: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/5526 > Partitioning support for tables created by the data source API > -- > > Key: SPARK-5182 > URL: https://issues.apache.org/jira/browse/SPARK-5182 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Cheng Lian >Priority: Blocker > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518286#comment-14518286 ] Sandy Ryza commented on SPARK-3655: --- My opinion is that a secondary sort operator in core Spark would definitely be useful. > Support sorting of values in addition to keys (i.e. secondary sort) > --- > > Key: SPARK-3655 > URL: https://issues.apache.org/jira/browse/SPARK-3655 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.1.0, 1.2.0 >Reporter: koert kuipers >Assignee: Koert Kuipers > > Now that Spark has a sort-based shuffle, can we expect a secondary sort soon? > There are some use cases where getting a sorted iterator of values per key is > helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
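For reference, the pattern most users reach for today builds on repartitionAndSortWithinPartitions: move the secondary field into the key, partition on the primary part only, and let the shuffle sort the composite key. A self-contained sketch:

{code:scala}
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// Partition on the primary key only; the shuffle then sorts (key, ts) pairs.
class PrimaryKeyPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = {
    val (primary, _) = key.asInstanceOf[(String, Long)]
    math.abs(primary.hashCode) % numPartitions
  }
}

object SecondarySortDemo extends App {
  val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("demo"))
  val events = sc.parallelize(Seq(("a", 3L), ("a", 1L), ("b", 2L), ("a", 2L)))
  val sorted = events
    .map { case (k, ts) => ((k, ts), ()) }  // lift the value into the key
    .repartitionAndSortWithinPartitions(new PrimaryKeyPartitioner(2))
    .map { case ((k, ts), _) => (k, ts) }   // values now arrive sorted per key
  sorted.foreachPartition(it => println(it.toList))
  sc.stop()
}
{code}

A dedicated operator would mostly hide the key-lifting and the custom partitioner behind one call.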
[jira] [Created] (SPARK-7214) Unrolling never evicts blocks when MemoryStore is nearly full
Charles Reiss created SPARK-7214: Summary: Unrolling never evicts blocks when MemoryStore is nearly full Key: SPARK-7214 URL: https://issues.apache.org/jira/browse/SPARK-7214 Project: Spark Issue Type: Bug Components: Block Manager Reporter: Charles Reiss Priority: Minor When less than spark.storage.unrollMemoryThreshold (default 1MB) is left in the MemoryStore, new blocks that are computed with unrollSafely (e.g. any cached RDD split) will always fail to unroll, even if old blocks could be dropped to accommodate them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7156) Add randomSplit method to DataFrame
[ https://issues.apache.org/jira/browse/SPARK-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518248#comment-14518248 ] Apache Spark commented on SPARK-7156: - User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/5761 > Add randomSplit method to DataFrame > --- > > Key: SPARK-7156 > URL: https://issues.apache.org/jira/browse/SPARK-7156 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Joseph K. Bradley >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
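The new method mirrors the existing RDD.randomSplit, so usage would presumably look like the commented line below (the DataFrame signature is an assumption until the PR merges); the RDD form it mirrors works today:

{code:scala}
// Presumed DataFrame form, mirroring the RDD API (assumption until merged):
// val Array(train, test) = df.randomSplit(Array(0.8, 0.2), seed = 42L)

// The RDD method it mirrors exists today (sc is an existing SparkContext):
val rdd = sc.parallelize(1 to 100)
val Array(trainRdd, testRdd) = rdd.randomSplit(Array(0.8, 0.2), seed = 42L)
{code}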
[jira] [Commented] (SPARK-7213) Exception while copying Hadoop config files due to permission issues
[ https://issues.apache.org/jira/browse/SPARK-7213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518201#comment-14518201 ] Apache Spark commented on SPARK-7213: - User 'nishkamravi2' has created a pull request for this issue: https://github.com/apache/spark/pull/5760 > Exception while copying Hadoop config files due to permission issues > > > Key: SPARK-7213 > URL: https://issues.apache.org/jira/browse/SPARK-7213 > Project: Spark > Issue Type: Bug >Reporter: Nishkam Ravi > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7213) Exception while copying Hadoop config files due to permission issues
[ https://issues.apache.org/jira/browse/SPARK-7213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7213: --- Assignee: Apache Spark > Exception while copying Hadoop config files due to permission issues > > > Key: SPARK-7213 > URL: https://issues.apache.org/jira/browse/SPARK-7213 > Project: Spark > Issue Type: Bug >Reporter: Nishkam Ravi >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7213) Exception while copying Hadoop config files due to permission issues
[ https://issues.apache.org/jira/browse/SPARK-7213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7213: --- Assignee: (was: Apache Spark) > Exception while copying Hadoop config files due to permission issues > > > Key: SPARK-7213 > URL: https://issues.apache.org/jira/browse/SPARK-7213 > Project: Spark > Issue Type: Bug >Reporter: Nishkam Ravi > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7213) Exception while copying Hadoop config files due to permission issues
[ https://issues.apache.org/jira/browse/SPARK-7213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518197#comment-14518197 ] Nishkam Ravi commented on SPARK-7213: - PR: https://github.com/apache/spark/pull/5760/ > Exception while copying Hadoop config files due to permission issues > > > Key: SPARK-7213 > URL: https://issues.apache.org/jira/browse/SPARK-7213 > Project: Spark > Issue Type: Bug >Reporter: Nishkam Ravi > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7213) Exception while copying Hadoop config files due to permission issues
[ https://issues.apache.org/jira/browse/SPARK-7213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518188#comment-14518188 ] Nishkam Ravi commented on SPARK-7213: - Exception in thread "main" java.io.FileNotFoundException: /etc/hadoop/conf/container-executor.cfg (Permission denied) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.<init>(FileInputStream.java:146) at com.google.common.io.Files$FileByteSource.openStream(Files.java:126) at com.google.common.io.Files$FileByteSource.openStream(Files.java:116) at com.google.common.io.ByteSource.copyTo(ByteSource.java:233) at com.google.common.io.Files.copy(Files.java:423) at org.apache.spark.deploy.yarn.Client$$anonfun$createConfArchive$2.apply(Client.scala:374) at org.apache.spark.deploy.yarn.Client$$anonfun$createConfArchive$2.apply(Client.scala:372) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at org.apache.spark.deploy.yarn.Client.createConfArchive(Client.scala:372) at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:288) at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:466) at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:106) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:58) at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141) at org.apache.spark.SparkContext.<init>(SparkContext.scala:470) at org.apache.spark.SparkContext.<init>(SparkContext.scala:155) at org.apache.spark.SparkContext.<init>(SparkContext.scala:192) at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:95) at spark.benchmarks.JavaWordCount.main(JavaWordCount.java:41) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:619) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Exception while copying Hadoop config files due to permission issues > > > Key: SPARK-7213 > URL: https://issues.apache.org/jira/browse/SPARK-7213 > Project: Spark > Issue Type: Bug >Reporter: Nishkam Ravi > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7213) Exception while copying Hadoop config files due to permission issues
Nishkam Ravi created SPARK-7213: --- Summary: Exception while copying Hadoop config files due to permission issues Key: SPARK-7213 URL: https://issues.apache.org/jira/browse/SPARK-7213 Project: Spark Issue Type: Bug Reporter: Nishkam Ravi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518141#comment-14518141 ] Joseph K. Bradley commented on SPARK-5556: -- Great! I'm not aware of blockers. As for other active implementations, the only ones I know of are those referenced by [~gq] above. Please do ping him on your work and see if there are ideas which can be merged. We can help with the coordination and discussions as well. Thanks! > Latent Dirichlet Allocation (LDA) using Gibbs sampler > -- > > Key: SPARK-5556 > URL: https://issues.apache.org/jira/browse/SPARK-5556 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Guoqiang Li >Assignee: Pedro Rodriguez > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518133#comment-14518133 ] Pedro Rodriguez commented on SPARK-5556: With the refactoring done, I can start getting the core code running on that interface. Does it seem likely that, if that is completed, Gibbs will get merged for 1.5? Are there any foreseeable blockers, or potential alternative implementations being considered? > Latent Dirichlet Allocation (LDA) using Gibbs sampler > -- > > Key: SPARK-5556 > URL: https://issues.apache.org/jira/browse/SPARK-5556 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Guoqiang Li >Assignee: Pedro Rodriguez > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7212) Frequent pattern mining for sequential item sets
Joseph K. Bradley created SPARK-7212: Summary: Frequent pattern mining for sequential item sets Key: SPARK-7212 URL: https://issues.apache.org/jira/browse/SPARK-7212 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Joseph K. Bradley Currently, FPGrowth handles unordered item sets. It would be great to be able to handle sequences of items, in which the order matters. This JIRA is for discussing modifications to FPGrowth and/or new algorithms for handling sequences. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7211) Improvements for FPGrowth
Joseph K. Bradley created SPARK-7211: Summary: Improvements for FPGrowth Key: SPARK-7211 URL: https://issues.apache.org/jira/browse/SPARK-7211 Project: Spark Issue Type: Umbrella Components: MLlib Reporter: Joseph K. Bradley This is an umbrella JIRA for listing explorations and planned improvements to FPGrowth and other possible algorithms for frequent pattern mining (a.k.a., frequent itemsets, association rules). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7210) Test matrix decompositions for speed vs. numerical stability for Gaussians
Joseph K. Bradley created SPARK-7210: Summary: Test matrix decompositions for speed vs. numerical stability for Gaussians Key: SPARK-7210 URL: https://issues.apache.org/jira/browse/SPARK-7210 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Priority: Minor We currently use SVD for inverting the Gaussian's covariance matrix and computing the determinant. SVD is numerically stable but slow. We could experiment with Cholesky, etc., to find a better default, or better options for certain settings. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
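For context, a sketch in Breeze (which MLlib uses internally) of the trade-off under test; this illustrates the idea, not the benchmark code. Cholesky yields the log-determinant of a symmetric positive-definite covariance at roughly a third of the cost of a full SVD, but it fails outright on singular matrices, where SVD still provides a pseudo-inverse:
{code}
import breeze.linalg.{DenseMatrix, cholesky, diag, sum}
import breeze.numerics.log

// Cholesky factorizes sigma = L * L.t for SPD sigma, so
// log|sigma| = 2 * sum(log(diag(L))); cheap, but throws on a
// singular covariance where SVD would still give a pseudo-inverse.
def logDetViaCholesky(sigma: DenseMatrix[Double]): Double = {
  val l = cholesky(sigma)
  2.0 * sum(log(diag(l)))
}
{code}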
[jira] [Updated] (SPARK-7209) Adding new Manning book "Spark in Action" to the official Spark Webpage
[ https://issues.apache.org/jira/browse/SPARK-7209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aleksandar Dragosavljevic updated SPARK-7209: - Attachment: Spark in Action.jpg Book cover > Adding new Manning book "Spark in Action" to the official Spark Webpage > --- > > Key: SPARK-7209 > URL: https://issues.apache.org/jira/browse/SPARK-7209 > Project: Spark > Issue Type: Task > Components: Documentation >Reporter: Aleksandar Dragosavljevic >Priority: Minor > Attachments: Spark in Action.jpg > > Original Estimate: 1h > Remaining Estimate: 1h > > Manning Publications is developing a book Spark in Action written by Marko > Bonaci and Petar Zecevic (http://www.manning.com/bonaci) and it would be > great if the book could be added to the list of books at the official Spark > Webpage (https://spark.apache.org/documentation.html). > This book teaches readers to use Spark for stream and batch data processing. > It starts with an introduction to the Spark architecture and ecosystem > followed by a taste of Spark's command line interface. Readers then discover > the most fundamental concepts and abstractions of Spark, particularly > Resilient Distributed Datasets (RDDs) and the basic data transformations that > RDDs provide. The first part of the book also introduces you to writing Spark > applications using the core APIs. Next, you learn about different Spark > components: how to work with structured data using Spark SQL, how to process > near-real-time data with Spark Streaming, how to apply machine learning > algorithms with Spark MLlib, how to apply graph algorithms on graph-shaped > data using Spark GraphX, and you get a clear introduction to Spark clustering. > The book is already available to the public as part of our Manning Early > Access Program (MEAP), where we deliver chapters to the public as soon as they > are written. We believe it will offer significant support to the Spark users > and the community. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6943) Graphically show RDD's included in a stage
[ https://issues.apache.org/jira/browse/SPARK-6943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518084#comment-14518084 ] Andrew Or commented on SPARK-6943: -- Yeah ideally we will have the job graph that magnifies into the stage graph. I'll see what I can do. > Graphically show RDD's included in a stage > -- > > Key: SPARK-6943 > URL: https://issues.apache.org/jira/browse/SPARK-6943 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Reporter: Patrick Wendell >Assignee: Andrew Or > Attachments: DAGvisualizationintheSparkWebUI.pdf, with-closures.png, > with-stack-trace.png > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7209) Adding new Manning book "Spark in Action" to the official Spark Webpage
Aleksandar Dragosavljevic created SPARK-7209: Summary: Adding new Manning book "Spark in Action" to the official Spark Webpage Key: SPARK-7209 URL: https://issues.apache.org/jira/browse/SPARK-7209 Project: Spark Issue Type: Task Components: Documentation Reporter: Aleksandar Dragosavljevic Priority: Minor Manning Publications is developing a book Spark in Action written by Marko Bonaci and Petar Zecevic (http://www.manning.com/bonaci) and it would be great if the book could be added to the list of books at the official Spark Webpage (https://spark.apache.org/documentation.html). This book teaches readers to use Spark for stream and batch data processing. It starts with an introduction to the Spark architecture and ecosystem followed by a taste of Spark's command line interface. Readers then discover the most fundamental concepts and abstractions of Spark, particularly Resilient Distributed Datasets (RDDs) and the basic data transformations that RDDs provide. The first part of the book also introduces you to writing Spark applications using the core APIs. Next, you learn about different Spark components: how to work with structured data using Spark SQL, how to process near-real-time data with Spark Streaming, how to apply machine learning algorithms with Spark MLlib, how to apply graph algorithms on graph-shaped data using Spark GraphX, and you get a clear introduction to Spark clustering. The book is already available to the public as part of our Manning Early Access Program (MEAP), where we deliver chapters to the public as soon as they are written. We believe it will offer significant support to the Spark users and the community. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7208) Add Matrix, SparseMatrix to __all__ list in linalg.py
[ https://issues.apache.org/jira/browse/SPARK-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7208: --- Assignee: Apache Spark (was: Joseph K. Bradley) > Add Matrix, SparseMatrix to __all__ list in linalg.py > - > > Key: SPARK-7208 > URL: https://issues.apache.org/jira/browse/SPARK-7208 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.4.0 >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Trivial > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7208) Add Matrix, SparseMatrix to __all__ list in linalg.py
[ https://issues.apache.org/jira/browse/SPARK-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7208: --- Assignee: Joseph K. Bradley (was: Apache Spark) > Add Matrix, SparseMatrix to __all__ list in linalg.py > - > > Key: SPARK-7208 > URL: https://issues.apache.org/jira/browse/SPARK-7208 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.4.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Trivial > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7208) Add Matrix, SparseMatrix to __all__ list in linalg.py
[ https://issues.apache.org/jira/browse/SPARK-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518075#comment-14518075 ] Apache Spark commented on SPARK-7208: - User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/5759 > Add Matrix, SparseMatrix to __all__ list in linalg.py > - > > Key: SPARK-7208 > URL: https://issues.apache.org/jira/browse/SPARK-7208 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.4.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Trivial > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7208) Add Matrix, SparseMatrix to __all__ list in linalg.py
[ https://issues.apache.org/jira/browse/SPARK-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7208: - Summary: Add Matrix, SparseMatrix to __all__ list in linalg.py (was: Add SparseMatrix to __all__ list in linalg.py) > Add Matrix, SparseMatrix to __all__ list in linalg.py > - > > Key: SPARK-7208 > URL: https://issues.apache.org/jira/browse/SPARK-7208 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.4.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Trivial > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7208) Add SparseMatrix to __all__ list in linalg.py
Joseph K. Bradley created SPARK-7208: Summary: Add SparseMatrix to __all__ list in linalg.py Key: SPARK-7208 URL: https://issues.apache.org/jira/browse/SPARK-7208 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Trivial -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7207) Add new spark.ml subpackages to SparkBuild
[ https://issues.apache.org/jira/browse/SPARK-7207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518069#comment-14518069 ] Apache Spark commented on SPARK-7207: - User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/5758 > Add new spark.ml subpackages to SparkBuild > -- > > Key: SPARK-7207 > URL: https://issues.apache.org/jira/browse/SPARK-7207 > Project: Spark > Issue Type: Bug > Components: Build, ML >Affects Versions: 1.4.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > Add to project/SparkBuild.scala list of subpackages for spark.ml: > * ml.recommendation > * ml.regression -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7207) Add new spark.ml subpackages to SparkBuild
[ https://issues.apache.org/jira/browse/SPARK-7207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7207: --- Assignee: Joseph K. Bradley (was: Apache Spark) > Add new spark.ml subpackages to SparkBuild > -- > > Key: SPARK-7207 > URL: https://issues.apache.org/jira/browse/SPARK-7207 > Project: Spark > Issue Type: Bug > Components: Build, ML >Affects Versions: 1.4.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > Add to project/SparkBuild.scala list of subpackages for spark.ml: > * ml.recommendation > * ml.regression -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7207) Add new spark.ml subpackages to SparkBuild
[ https://issues.apache.org/jira/browse/SPARK-7207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7207: --- Assignee: Apache Spark (was: Joseph K. Bradley) > Add new spark.ml subpackages to SparkBuild > -- > > Key: SPARK-7207 > URL: https://issues.apache.org/jira/browse/SPARK-7207 > Project: Spark > Issue Type: Bug > Components: Build, ML >Affects Versions: 1.4.0 >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Minor > > Add to project/SparkBuild.scala list of subpackages for spark.ml: > * ml.recommendation > * ml.regression -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7201) Move identifiable to ml.util
[ https://issues.apache.org/jira/browse/SPARK-7201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-7201. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5749 [https://github.com/apache/spark/pull/5749] > Move identifiable to ml.util > > > Key: SPARK-7201 > URL: https://issues.apache.org/jira/browse/SPARK-7201 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Fix For: 1.4.0 > > > It shouldn't live under spark.ml package. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7207) Add new spark.ml subpackages to SparkBuild
Joseph K. Bradley created SPARK-7207: Summary: Add new spark.ml subpackages to SparkBuild Key: SPARK-7207 URL: https://issues.apache.org/jira/browse/SPARK-7207 Project: Spark Issue Type: Bug Components: Build, ML Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor Add to project/SparkBuild.scala list of subpackages for spark.ml: * ml.recommendation * ml.regression -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
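The change is presumably a one-line addition to an explicit package list; the sketch below is hypothetical (the real structure of project/SparkBuild.scala may differ):
{code}
// Hypothetical illustration only; SparkBuild.scala's actual structure may differ.
// The point is that spark.ml subpackages are enumerated explicitly in the build,
// so newly added packages must be registered or they are silently left out.
val mlSubpackages = Seq(
  "org.apache.spark.ml",
  "org.apache.spark.ml.classification",
  "org.apache.spark.ml.feature",
  "org.apache.spark.ml.recommendation", // added by this JIRA
  "org.apache.spark.ml.regression"      // added by this JIRA
)
{code}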
[jira] [Assigned] (SPARK-7204) Call sites in UI are not accurate for DataFrame operations
[ https://issues.apache.org/jira/browse/SPARK-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7204: --- Assignee: Apache Spark (was: Patrick Wendell) > Call sites in UI are not accurate for DataFrame operations > -- > > Key: SPARK-7204 > URL: https://issues.apache.org/jira/browse/SPARK-7204 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1 >Reporter: Patrick Wendell >Assignee: Apache Spark >Priority: Critical > > Spark core computes callsites by climbing up the stack until we reach the > stack frame at the boundary of user code and spark code. The way we compute > whether a given frame is internal (Spark) or user code does not work > correctly with the new dataframe API. > Once the scope work goes in, we'll have a nicer way to express units of > operator scope, but until then there is a simple fix where we just make sure > the SQL internal classes are also skipped as we climb up the stack. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7204) Call sites in UI are not accurate for DataFrame operations
[ https://issues.apache.org/jira/browse/SPARK-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7204: --- Assignee: Patrick Wendell (was: Apache Spark) > Call sites in UI are not accurate for DataFrame operations > -- > > Key: SPARK-7204 > URL: https://issues.apache.org/jira/browse/SPARK-7204 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1 >Reporter: Patrick Wendell >Assignee: Patrick Wendell >Priority: Critical > > Spark core computes callsites by climbing up the stack until we reach the > stack frame at the boundary of user code and spark code. The way we compute > whether a given frame is internal (Spark) or user code does not work > correctly with the new dataframe API. > Once the scope work goes in, we'll have a nicer way to express units of > operator scope, but until then there is a simple fix where we just make sure > the SQL internal classes are also skipped as we climb up the stack. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7204) Call sites in UI are not accurate for DataFrame operations
[ https://issues.apache.org/jira/browse/SPARK-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518047#comment-14518047 ] Apache Spark commented on SPARK-7204: - User 'pwendell' has created a pull request for this issue: https://github.com/apache/spark/pull/5757 > Call sites in UI are not accurate for DataFrame operations > -- > > Key: SPARK-7204 > URL: https://issues.apache.org/jira/browse/SPARK-7204 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1 >Reporter: Patrick Wendell >Assignee: Patrick Wendell >Priority: Critical > > Spark core computes callsites by climbing up the stack until we reach the > stack frame at the boundary of user code and spark code. The way we compute > whether a given frame is internal (Spark) or user code does not work > correctly with the new dataframe API. > Once the scope work goes in, we'll have a nicer way to express units of > operator scope, but until then there is a simple fix where we just make sure > the SQL internal classes are also skipped as we climb up the stack. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-5014) GaussianMixture (GMM) improvements
[ https://issues.apache.org/jira/browse/SPARK-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5014: - Comment: was deleted (was: No need for umbrella JIRA) > GaussianMixture (GMM) improvements > -- > > Key: SPARK-5014 > URL: https://issues.apache.org/jira/browse/SPARK-5014 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley > > This is an umbrella JIRA for improvements to Gaussian Mixture Models (GMMs). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7206) Gaussian Mixture Model (GMM) improvements
Joseph K. Bradley created SPARK-7206: Summary: Gaussian Mixture Model (GMM) improvements Key: SPARK-7206 URL: https://issues.apache.org/jira/browse/SPARK-7206 Project: Spark Issue Type: Umbrella Components: MLlib Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley This is an umbrella JIRA for listing improvements for GMMs: * planned improvements * optional/experimental work * tests for verifying scalability -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6756) Add compress() to Vector
[ https://issues.apache.org/jira/browse/SPARK-6756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518008#comment-14518008 ] Apache Spark commented on SPARK-6756: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/5756 > Add compress() to Vector > > > Key: SPARK-6756 > URL: https://issues.apache.org/jira/browse/SPARK-6756 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > Add compress() to Vector that automatically converts the underlying vector to > dense or sparse based on the number of non-zeros. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6756) Add compress() to Vector
[ https://issues.apache.org/jira/browse/SPARK-6756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6756: --- Assignee: Xiangrui Meng (was: Apache Spark) > Add compress() to Vector > > > Key: SPARK-6756 > URL: https://issues.apache.org/jira/browse/SPARK-6756 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > Add compress() to Vector that automatically converts the underlying vector to > dense or sparse based on the number of non-zeros. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6756) Add compress() to Vector
[ https://issues.apache.org/jira/browse/SPARK-6756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6756: --- Assignee: Apache Spark (was: Xiangrui Meng) > Add compress() to Vector > > > Key: SPARK-6756 > URL: https://issues.apache.org/jira/browse/SPARK-6756 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Apache Spark > > Add compress() to Vector that automatically converts the underlying vector to > dense or sparse based on the number of non-zeros. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
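A minimal sketch of what such a compress() could look like, assuming a simple storage-cost heuristic; the 1.5x factor (sparse stores an index alongside each value) is an assumption, not necessarily the rule the PR adopts:
{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hedged sketch, not the PR's implementation: pick the cheaper encoding
// from the non-zero count. Sparse keeps an (index, value) pair per entry,
// hence the 1.5x weighting on nnz.
def compress(v: Vector): Vector = {
  val values = v.toArray
  val nnz = values.count(_ != 0.0)
  if (1.5 * (nnz + 1.0) < v.size) {
    Vectors.sparse(v.size, values.zipWithIndex.collect {
      case (value, i) if value != 0.0 => (i, value)
    }.toSeq)
  } else {
    Vectors.dense(values)
  }
}
{code}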
[jira] [Updated] (SPARK-5338) Support cluster mode with Mesos
[ https://issues.apache.org/jira/browse/SPARK-5338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5338: - Affects Version/s: 1.0.0 > Support cluster mode with Mesos > --- > > Key: SPARK-5338 > URL: https://issues.apache.org/jira/browse/SPARK-5338 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 1.0.0 >Reporter: Timothy Chen > Fix For: 1.4.0 > > > Currently, when using Spark with Mesos, the only supported deployment mode is > client mode. > It would also be useful to have a cluster-mode deployment that can be shared > and long-running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5338) Support cluster mode with Mesos
[ https://issues.apache.org/jira/browse/SPARK-5338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-5338. Resolution: Fixed Fix Version/s: 1.4.0 Assignee: Timothy Chen Target Version/s: 1.4.0 > Support cluster mode with Mesos > --- > > Key: SPARK-5338 > URL: https://issues.apache.org/jira/browse/SPARK-5338 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 1.0.0 >Reporter: Timothy Chen >Assignee: Timothy Chen > Fix For: 1.4.0 > > > Currently, when using Spark with Mesos, the only supported deployment mode is > client mode. > It would also be useful to have a cluster-mode deployment that can be shared > and long-running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7205) Support local ivy cache in --packages
[ https://issues.apache.org/jira/browse/SPARK-7205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517988#comment-14517988 ] Apache Spark commented on SPARK-7205: - User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/5755 > Support local ivy cache in --packages > - > > Key: SPARK-7205 > URL: https://issues.apache.org/jira/browse/SPARK-7205 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Reporter: Burak Yavuz >Priority: Critical > Fix For: 1.4.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7205) Support local ivy cache in --packages
[ https://issues.apache.org/jira/browse/SPARK-7205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7205: --- Assignee: (was: Apache Spark) > Support local ivy cache in --packages > - > > Key: SPARK-7205 > URL: https://issues.apache.org/jira/browse/SPARK-7205 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Reporter: Burak Yavuz >Priority: Critical > Fix For: 1.4.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7205) Support local ivy cache in --packages
[ https://issues.apache.org/jira/browse/SPARK-7205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7205: --- Assignee: Apache Spark > Support local ivy cache in --packages > - > > Key: SPARK-7205 > URL: https://issues.apache.org/jira/browse/SPARK-7205 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Reporter: Burak Yavuz >Assignee: Apache Spark >Priority: Critical > Fix For: 1.4.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7204) Call sites in UI are not accurate for DataFrame operations
Patrick Wendell created SPARK-7204: -- Summary: Call sites in UI are not accurate for DataFrame operations Key: SPARK-7204 URL: https://issues.apache.org/jira/browse/SPARK-7204 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Critical Spark core computes callsites by climbing up the stack until we reach the stack frame at the boundary of user code and spark code. The way we compute whether a given frame is internal (Spark) or user code does not work correctly with the new dataframe API. Once the scope work goes in, we'll have a nicer way to express units of operator scope, but until then there is a simple fix where we just make sure the SQL internal classes are also skipped as we climb up the stack. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
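A hedged sketch of the "simple fix" described above, with illustrative names rather than Spark's actual implementation:
{code}
// Illustrative sketch only: the first stack frame whose class does not look
// Spark-internal is reported as the user call site. The fix described above
// amounts to adding the SQL packages to the "internal" prefixes.
def firstUserFrame(stack: Array[StackTraceElement]): Option[StackTraceElement] = {
  val internalPrefixes = Seq(
    "org.apache.spark.sql.", // the addition this JIRA proposes
    "org.apache.spark.rdd.",
    "org.apache.spark.util.",
    "scala.")
  stack.find(f => !internalPrefixes.exists(p => f.getClassName.startsWith(p)))
}
{code}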
[jira] [Created] (SPARK-7205) Support local ivy cache in --packages
Burak Yavuz created SPARK-7205: -- Summary: Support local ivy cache in --packages Key: SPARK-7205 URL: https://issues.apache.org/jira/browse/SPARK-7205 Project: Spark Issue Type: Bug Components: Spark Submit Reporter: Burak Yavuz Priority: Critical Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517946#comment-14517946 ] koert kuipers edited comment on SPARK-3655 at 4/28/15 8:19 PM: --- since the last pullreq for this ticket i created spark-sorted (based on suggestions from imran), a small library for spark that supports the target features of this ticket, but without the burden of having to be fully compatible with the current spark api conventions (with regards to ordering being implicit). i also got a chance to catch up with sandy at spark summit east and we exchanged some emails afterward about this jira ticket and possible design choices. so based on those experiences i think there are better alternatives than the current pullreq (https://github.com/apache/spark/pull/3632), and i will close it. the pullreq does bring secondary sort to spark, but only in memory, which is a very limited feature (since if the values can be stored in memory then sorting after the shuffle isn't really that hard, just wasteful). instead of the current pullreq i see 2 alternatives: 1) a new pullreq that introduces the mapStream api, which is very similar to the reduce operation as we know it in hadoop: a sorted streaming reduce. Its signature would be something like this on RDD[(K, V)]: {noformat} def mapStreamByKey[W](partitioner: Partitioner, f: Iterator[V] => Iterator[W])(implicit o1: Ordering[K], o2: Ordering[V]): RDD[(K, W)] {noformat} (note that the implicits would not actually be on the method as shown here, but on a class conversion, similar to how PairRDDFunctions works.) 2) don't do anything. the functionality this jira targets is already available in the small spark-sorted library which is available on spark-packages, and that's good enough. was (Author: koert): since the last pullreq for this ticket i created spark-sorted (based on suggestions from imran), a small library for spark that supports the target features of this ticket, but without the burden of having to be fully compatible with the current spark api conventions (with regards to ordering being implicit). i also got a chance to catch up with sandy at spark summit east and we exchanged some emails afterward about this jira ticket and possible design choices. so based on those experiences i think there are better alternatives than the current pullreq (https://github.com/apache/spark/pull/3632), and i will close it. the pullreq does bring secondary sort to spark, but only in memory, which is a very limited feature (since if the values can be stored in memory then sorting after the shuffle isn't really that hard, just wasteful). instead of the current pullreq i see 2 alternatives: 1) a new pullreq that introduces the mapStream api, which is very similar to the reduce operation as we know it in hadoop: an sorted streaming reduce. Its signature would be something like this on RDD[(K, V)]: {noformat} def mapStreamByKey[W](partitioner: Partitioner, f: Iterator[V] => Iterator[W])(implicit o1: Ordering[K], o2: Ordering[V]): RDD[(K, W)] {noformat} (note that the implicits would not actually be on the method as shown here, but on a class conversion, similar to how PairRDDFunctions works. 2) don't to anything. the functionality this jira targets is already available in the small smart-sorted library which is available on spark-packages, and that's good enough. > Support sorting of values in addition to keys (i.e. secondary sort) > --- > > Key: SPARK-3655 > URL: https://issues.apache.org/jira/browse/SPARK-3655 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.1.0, 1.2.0 >Reporter: koert kuipers >Assignee: Koert Kuipers > > Now that spark has a sort based shuffle, can we expect a secondary sort soon? > There are some use cases where getting a sorted iterator of values per key is > helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517946#comment-14517946 ] koert kuipers edited comment on SPARK-3655 at 4/28/15 8:18 PM: --- since the last pullreq for this ticket i created spark-sorted (based on suggestions from imran), a small library for spark that supports the target features of this ticket, but without the burden of having to be fully compatible with the current spark api conventions (with regards to ordering being implicit). i also got a chance to catch up with sandy at spark summit east and we exchanged some emails afterward about this jira ticket and possible design choices. so based on those experiences i think there are better alternatives than the current pullreq (https://github.com/apache/spark/pull/3632), and i will close it. the pullreq does bring secondary sort to spark, but only in memory, which is a very limited feature (since if the values can be stored in memory then sorting after the shuffle isn't really that hard, just wasteful). instead of the current pullreq i see 2 alternatives: 1) a new pullreq that introduces the mapStream api, which is very similar to the reduce operation as we know it in hadoop: a sorted streaming reduce. Its signature would be something like this on RDD[(K, V)]: {noformat} def mapStreamByKey[W](partitioner: Partitioner, f: Iterator[V] => Iterator[W])(implicit o1: Ordering[K], o2: Ordering[V]): RDD[(K, W)] {noformat} (note that the implicits would not actually be on the method as shown here, but on a class conversion, similar to how PairRDDFunctions works.) 2) don't do anything. the functionality this jira targets is already available in the small spark-sorted library which is available on spark-packages, and that's good enough. was (Author: koert): since the last pullreq for this ticket i created spark-sorted (based on suggestions from imran), a small library for spark that supports the target features of this ticket, but without the burden of having to be fully compatible with the current spark api conventions (with regards to ordering being implicit). i also got a chance to catch up with sandy at spark summit east and we exchanged some emails afterward about this jira ticket and possible design choices. so based on those experiences i think there are better alternatives than the current pullreq (https://github.com/apache/spark/pull/3632), and i will close it. the pullreq does bring secondary sort to spark, but only in memory, which is a very limited feature (since if the values can be stored in memory then sorting after the shuffle isn't really that hard, just wasteful). instead of the current pullreq i see 2 alternatives: 1) a new pullreq that introduces the mapStream api, which is very similar to the reduce operation as we know it in hadoop: an sorted streaming reduce. Its signature would be something like this on RDD[(K, V)]: def mapStreamByKey[W](partitioner: Partitioner, f: Iterator[V] => Iterator[W])(implicit o1: Ordering[K], o2: Ordering[V]): RDD[(K, W)] (note that the implicits would not actually be on the method as shown here, but on a class conversion, similar to how PairRDDFunctions works. 2) don't to anything. the functionality this jira targets is already available in the small smart-sorted library which is available on spark-packages, and that's good enough. > Support sorting of values in addition to keys (i.e. secondary sort) > --- > > Key: SPARK-3655 > URL: https://issues.apache.org/jira/browse/SPARK-3655 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.1.0, 1.2.0 >Reporter: koert kuipers >Assignee: Koert Kuipers > > Now that spark has a sort based shuffle, can we expect a secondary sort soon? > There are some use cases where getting a sorted iterator of values per key is > helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517946#comment-14517946 ] koert kuipers commented on SPARK-3655: -- since the last pullreq for this ticket i created spark-sorted (based on suggestions from imran), a small library for spark that supports the target features of this ticket, but without the burden of having to be fully compatible with the current spark api conventions (with regards to ordering being implicit). i also got a chance to catch up with sandy at spark summit east and we exchanged some emails afterward about this jira ticket and possible design choices. so based on those experiences i think there are better alternatives than the current pullreq (https://github.com/apache/spark/pull/3632), and i will close it. the pullreq does bring secondary sort to spark, but only in memory, which is a very limited feature (since if the values can be stored in memory then sorting after the shuffle isn't really that hard, just wasteful). instead of the current pullreq i see 2 alternatives: 1) a new pullreq that introduces the mapStream api, which is very similar to the reduce operation as we know it in hadoop: a sorted streaming reduce. Its signature would be something like this on RDD[(K, V)]: def mapStreamByKey[W](partitioner: Partitioner, f: Iterator[V] => Iterator[W])(implicit o1: Ordering[K], o2: Ordering[V]): RDD[(K, W)] (note that the implicits would not actually be on the method as shown here, but on a class conversion, similar to how PairRDDFunctions works.) 2) don't do anything. the functionality this jira targets is already available in the small spark-sorted library which is available on spark-packages, and that's good enough. > Support sorting of values in addition to keys (i.e. secondary sort) > --- > > Key: SPARK-3655 > URL: https://issues.apache.org/jira/browse/SPARK-3655 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.1.0, 1.2.0 >Reporter: koert kuipers >Assignee: Koert Kuipers > > Now that spark has a sort based shuffle, can we expect a secondary sort soon? > There are some use cases where getting a sorted iterator of values per key is > helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
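For reference, the sorted-stream-of-values-per-key effect can already be sketched on the existing API via a composite key and repartitionAndSortWithinPartitions (available since 1.2); this illustrates the idea behind spark-sorted, not the proposed mapStreamByKey itself:
{code}
import org.apache.spark.{HashPartitioner, Partitioner}
import org.apache.spark.rdd.RDD

// Partition the composite (key, value) pair by the real key only, so the
// shuffle's sort (which sees the whole composite key) delivers each key's
// values in order without buffering them in memory.
class KeyOnlyPartitioner(partitions: Int) extends Partitioner {
  private val hash = new HashPartitioner(partitions)
  def numPartitions: Int = partitions
  def getPartition(key: Any): Int = key match {
    case (k, _) => hash.getPartition(k)
    case other  => hash.getPartition(other)
  }
}

def sortedValuesPerKey(rdd: RDD[(String, Int)], partitions: Int): RDD[(String, Int)] =
  rdd.map { case (k, v) => ((k, v), ()) }
    .repartitionAndSortWithinPartitions(new KeyOnlyPartitioner(partitions))
    .map { case ((k, v), _) => (k, v) }
{code}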
[jira] [Comment Edited] (SPARK-7178) Improve DataFrame documentation and code samples
[ https://issues.apache.org/jira/browse/SPARK-7178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517858#comment-14517858 ] Chris Fregly edited comment on SPARK-7178 at 4/28/15 8:07 PM: -- added these to the forums. AND and OR: https://forums.databricks.com/questions/758/how-do-i-use-and-and-or-within-my-dataframe-operat.html Nested Map Columns in DataFrames: https://forums.databricks.com/questions/764/how-do-i-create-a-dataframe-with-nested-map-column.html Casting columns of DataFrames: https://forums.databricks.com/questions/767/how-do-i-cast-within-a-dataframe.html was (Author: cfregly): added this to the forums to address the AND and OR: https://forums.databricks.com/questions/758/how-do-i-use-and-and-or-within-my-dataframe-operat.html > Improve DataFrame documentation and code samples > > > Key: SPARK-7178 > URL: https://issues.apache.org/jira/browse/SPARK-7178 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.1 >Reporter: Chris Fregly > Labels: dataframe > > AND and OR are not straightforward when using the new DataFrame API. > the current convention - accepted by Pandas users - is to use the bitwise & > and | instead of AND and OR. when using these, however, you need to wrap > each expression in parentheses to prevent the bitwise operator from > binding too tightly. > also, working with StructTypes is a bit confusing. the following link: > https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema > (Python tab) implies that you can work with tuples directly when creating a > DataFrame. > however, the following code errors out unless we explicitly use Rows: > {code} > from pyspark.sql import Row > from pyspark.sql.types import * > # The schema is encoded in a string. > schemaString = "a" > fields = [StructField(field_name, MapType(StringType(),IntegerType())) for > field_name in schemaString.split()] > schema = StructType(fields) > df = sqlContext.createDataFrame([Row(a={'b': 1})], schema) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7202) Add SparseMatrixPickler to SerDe
[ https://issues.apache.org/jira/browse/SPARK-7202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517905#comment-14517905 ] Joseph K. Bradley edited comment on SPARK-7202 at 4/28/15 8:01 PM: --- [~MechCoder] I just made an umbrella JIRA for Python local linear algebra. Please ping me if you find/make other JIRAs which should go there. Thanks! was (Author: josephkb): @MechCoder I just made an umbrella JIRA for Python local linear algebra. Please ping me if you find/make other JIRAs which should go there. Thanks! > Add SparseMatrixPickler to SerDe > > > Key: SPARK-7202 > URL: https://issues.apache.org/jira/browse/SPARK-7202 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Manoj Kumar >Priority: Minor > > We need a SparseMatrixPickler similar to DenseMatrixPickler. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7202) Add SparseMatrixPickler to SerDe
[ https://issues.apache.org/jira/browse/SPARK-7202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517905#comment-14517905 ] Joseph K. Bradley commented on SPARK-7202: -- @MechCoder I just made an umbrella JIRA for Python local linear algebra. Please ping me if you find/make other JIRAs which should go there. Thanks! > Add SparseMatrixPickler to SerDe > > > Key: SPARK-7202 > URL: https://issues.apache.org/jira/browse/SPARK-7202 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Manoj Kumar >Priority: Minor > > We need a SparseMatrixPickler similar to DenseMatrixPickler. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7202) Add SparseMatrixPickler to SerDe
[ https://issues.apache.org/jira/browse/SPARK-7202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7202: - Issue Type: Sub-task (was: New Feature) Parent: SPARK-7203 > Add SparseMatrixPickler to SerDe > > > Key: SPARK-7202 > URL: https://issues.apache.org/jira/browse/SPARK-7202 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Manoj Kumar >Priority: Minor > > We need a SparseMatrixPickler similar to DenseMatrixPickler. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7203) Python API for local linear algebra
Joseph K. Bradley created SPARK-7203: Summary: Python API for local linear algebra Key: SPARK-7203 URL: https://issues.apache.org/jira/browse/SPARK-7203 Project: Spark Issue Type: Umbrella Components: MLlib, PySpark Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Critical This is an umbrella JIRA for the Python API for local linear algebra, including: * Vector, Matrix, and their subclasses * helper methods and utilities * interactions with numpy, scipy -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7161) Provide REST api to download event logs from History Server
[ https://issues.apache.org/jira/browse/SPARK-7161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kostas Sakellis updated SPARK-7161: --- Component/s: (was: Streaming) Spark Core > Provide REST api to download event logs from History Server > --- > > Key: SPARK-7161 > URL: https://issues.apache.org/jira/browse/SPARK-7161 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.1 >Reporter: Hari Shreedharan >Priority: Minor > > The idea is to tar up the logs and return the tar.gz file using a REST api. > This can be used for debugging even after the app is done. > I am planning to take a look at this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4414) SparkContext.wholeTextFiles Doesn't work with S3 Buckets
[ https://issues.apache.org/jira/browse/SPARK-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517886#comment-14517886 ] Tristan Nixon commented on SPARK-4414: -- Thanks, [~petedmarsh], I was having this same issue. It worked fine on my OS X laptop but not on an ec2 linux instance I set up with the spark-ec2 script. My local version was built with Hadoop 2.4, but the default for systems configured from the script is Hadoop 1. It seems that this problem comes down to the S3 drivers in the different versions of Hadoop. I destroyed and then re-launched my ec2 cluster using the --hadoop-major-version=2 option, and the resulting version works! Perhaps support for Hadoop 1 should be deprecated? At least, it probably should no longer be the default version used in the spark-ec2 scripts. > SparkContext.wholeTextFiles Doesn't work with S3 Buckets > > > Key: SPARK-4414 > URL: https://issues.apache.org/jira/browse/SPARK-4414 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0, 1.2.0 >Reporter: Pedro Rodriguez >Priority: Critical > > SparkContext.wholeTextFiles does not read files which SparkContext.textFile > can read. Below are general steps to reproduce; my specific case follows > that on a git repo. > Steps to reproduce. > 1. Create Amazon S3 bucket, make public with multiple files > 2. Attempt to read bucket with > sc.wholeTextFiles("s3n://mybucket/myfile.txt") > 3. Spark returns the following error, even if the file exists. > Exception in thread "main" java.io.FileNotFoundException: File does not > exist: /myfile.txt > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517) > at > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:489) > 4. Change the call to > sc.textFile("s3n://mybucket/myfile.txt") > and there is no error message; the application should run fine. > There is a question on StackOverflow as well on this: > http://stackoverflow.com/questions/26258458/sparkcontext-wholetextfiles-java-io-filenotfoundexception-file-does-not-exist > This is a link to the repo/lines of code. The uncommented call doesn't work, the > commented call works as expected: > https://github.com/EntilZha/nips-lda-spark/blob/45f5ad1e2646609ef9d295a0954fbefe84111d8a/src/main/scala/NipsLda.scala#L13-L19 > It would be easy to use textFile with a multifile argument, but this should > work correctly for s3 bucket files as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6994) Allow to fetch field values by name in sql.Row
[ https://issues.apache.org/jira/browse/SPARK-6994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517882#comment-14517882 ] Shuai Zheng edited comment on SPARK-6994 at 4/28/15 7:48 PM: - I created one more pull request: https://github.com/apache/spark/pull/5754 which adds a few helper methods to access field values by name with a typed return. Basically it creates: getBoolean(fieldName: String) getByte(fieldName: String) getShort(fieldName: String) getInt(fieldName: String) getLong(fieldName: String) getFloat(fieldName: String) getDouble(fieldName: String) getString(fieldName: String) getDecimal(fieldName: String) This is a trial change to make Java developers' lives easier (like me...*_*), since Java callers don't benefit from the generic getAs[T]. Because this is not really a fix, I think I should not re-open the ticket; just updating here. was (Author: szheng79): I create one more pull request: https://github.com/apache/spark/pull/5754 add few helper method to access field value by name also return type. Basically create: getBoolean(fieldName: String) getByte(fieldName: String) getShort(fieldName: String) getInt(fieldName: String) getLong(fieldName: String) getFloat(fieldName: String) getDouble(fieldName: String) getString(fieldName: String) getDecimal(fieldName: String) This is a trial change, to make java developers life easier (like me...*_*), as we won't benefit from generic feature on getAs[T] > Allow to fetch field values by name in sql.Row > -- > > Key: SPARK-6994 > URL: https://issues.apache.org/jira/browse/SPARK-6994 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.0 >Reporter: vidmantas zemleris >Assignee: vidmantas zemleris >Priority: Minor > Labels: dataframe, row > Fix For: 1.4.0 > > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > It looked weird that up to now there was no way in Spark's Scala API to > access fields of `DataFrame/sql.Row` by name, only by their index. > This tries to solve this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6994) Allow to fetch field values by name in sql.Row
[ https://issues.apache.org/jira/browse/SPARK-6994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517882#comment-14517882 ] Shuai Zheng commented on SPARK-6994: I created one more pull request: https://github.com/apache/spark/pull/5754 which adds a few helper methods to access field values by name with a typed return. Basically it creates: getBoolean(fieldName: String) getByte(fieldName: String) getShort(fieldName: String) getInt(fieldName: String) getLong(fieldName: String) getFloat(fieldName: String) getDouble(fieldName: String) getString(fieldName: String) getDecimal(fieldName: String) This is a trial change to make Java developers' lives easier (like me...*_*), since Java callers don't benefit from the generic getAs[T] > Allow to fetch field values by name in sql.Row > -- > > Key: SPARK-6994 > URL: https://issues.apache.org/jira/browse/SPARK-6994 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.0 >Reporter: vidmantas zemleris >Assignee: vidmantas zemleris >Priority: Minor > Labels: dataframe, row > Fix For: 1.4.0 > > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > It looked weird that up to now there was no way in Spark's Scala API to > access fields of `DataFrame/sql.Row` by name, only by their index. > This tries to solve this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6994) Allow to fetch field values by name in sql.Row
[ https://issues.apache.org/jira/browse/SPARK-6994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517881#comment-14517881 ] Apache Spark commented on SPARK-6994: - User 'szheng79' has created a pull request for this issue: https://github.com/apache/spark/pull/5754 > Allow to fetch field values by name in sql.Row > -- > > Key: SPARK-6994 > URL: https://issues.apache.org/jira/browse/SPARK-6994 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.0 >Reporter: vidmantas zemleris >Assignee: vidmantas zemleris >Priority: Minor > Labels: dataframe, row > Fix For: 1.4.0 > > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > It looked weird that up to now there was no way in Spark's Scala API to > access fields of `DataFrame/sql.Row` by name, only by their index. > This tries to solve this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
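A minimal sketch of the two call styles; the typed helpers are the PR's proposal and may not match the final API:
{code}
import org.apache.spark.sql.Row

// SPARK-6994 itself adds name-based access via the generic getAs[T]:
def ageOf(row: Row): Int = row.getAs[Int]("age")

// The follow-up PR proposes typed, name-based helpers so Java callers
// need no type argument (hypothetical until merged):
// def ageOf(row: Row): Int = row.getInt("age")
{code}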
[jira] [Commented] (SPARK-6197) handle json parse exception for eventlog file not finished writing
[ https://issues.apache.org/jira/browse/SPARK-6197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517866#comment-14517866 ] Andrew Or commented on SPARK-6197: -- https://github.com/apache/spark/pull/5736 > handle json parse exception for eventlog file not finished writing > --- > > Key: SPARK-6197 > URL: https://issues.apache.org/jira/browse/SPARK-6197 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.3.0 >Reporter: Zhang, Liye >Assignee: Zhang, Liye >Priority: Minor > Fix For: 1.3.1, 1.4.0 > > > This is a follow-up JIRA to > [SPARK-6107|https://issues.apache.org/jira/browse/SPARK-6107]. In > [SPARK-6107|https://issues.apache.org/jira/browse/SPARK-6107], the web UI can > display event log files with the suffix *.inprogress*. However, the event log > file may not be finished writing in some abnormal cases (e.g. Ctrl+C), in > which case the file may be truncated in the last line, leaving that line in > invalid JSON format, which causes a JSON parse exception when reading the file. > For this case, we can just ignore the content of the last line, since the > history shown on the web for abnormal cases is only a reference for the user: > it demonstrates the past status of the app before it terminated abnormally > (we cannot guarantee that the history shows exactly the last moment when the > app encountered the abnormal situation). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6197) handle json parse exception for eventlog file not finished writing
[ https://issues.apache.org/jira/browse/SPARK-6197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6197: - Fix Version/s: 1.3.1 > handle json parse exception for eventlog file not finished writing > --- > > Key: SPARK-6197 > URL: https://issues.apache.org/jira/browse/SPARK-6197 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.3.0 >Reporter: Zhang, Liye >Assignee: Zhang, Liye >Priority: Minor > Fix For: 1.3.1, 1.4.0 > > > This is a follow-up JIRA to > [SPARK-6107|https://issues.apache.org/jira/browse/SPARK-6107]. In > [SPARK-6107|https://issues.apache.org/jira/browse/SPARK-6107], the web UI can > display event log files with the suffix *.inprogress*. However, the event log > file may not be finished writing in some abnormal cases (e.g. Ctrl+C), in > which case the file may be truncated in the last line, leaving that line in > invalid JSON format, which causes a JSON parse exception when reading the file. > For this case, we can just ignore the content of the last line, since the > history shown on the web for abnormal cases is only a reference for the user: > it demonstrates the past status of the app before it terminated abnormally > (we cannot guarantee that the history shows exactly the last moment when the > app encountered the abnormal situation). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
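A sketch of the tolerant replay described above, assuming a line-oriented parse function (illustrative, not the merged code):
{code}
import scala.io.Source
import scala.util.{Failure, Success, Try}

// Illustrative only: parse every line, but swallow a parse failure on the
// final line of an .inprogress log, since a half-written last line is
// expected after e.g. Ctrl+C.
def replay(path: String, parseLine: String => Unit): Unit = {
  val lines = Source.fromFile(path).getLines().toArray
  lines.zipWithIndex.foreach { case (line, i) =>
    Try(parseLine(line)) match {
      case Failure(_) if i == lines.length - 1 => () // ignore truncated last line
      case Failure(e) => throw e
      case Success(_) => ()
    }
  }
}
{code}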
[jira] [Commented] (SPARK-7178) Improve DataFrame documentation and code samples
[ https://issues.apache.org/jira/browse/SPARK-7178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517858#comment-14517858 ] Chris Fregly commented on SPARK-7178: - Added this to the forums to address the AND and OR: https://forums.databricks.com/questions/758/how-do-i-use-and-and-or-within-my-dataframe-operat.html > Improve DataFrame documentation and code samples > > > Key: SPARK-7178 > URL: https://issues.apache.org/jira/browse/SPARK-7178 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.1 >Reporter: Chris Fregly > Labels: dataframe > > AND and OR are not straightforward when using the new DataFrame API. > The current convention, familiar to Pandas users, is to use the bitwise & and | instead of AND and OR. When using these, however, you need to wrap each expression in parentheses to keep the bitwise operator from binding too tightly. > Working with StructTypes is also a bit confusing. The following link: > https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema > (Python tab) implies that you can work with tuples directly when creating a DataFrame. > However, the following code errors out unless we explicitly use Rows: > {code} > from pyspark.sql import Row > from pyspark.sql.types import * > # The schema is encoded in a string; each name becomes a map-typed field. > schemaString = "a" > fields = [StructField(field_name, MapType(StringType(), IntegerType())) for > field_name in schemaString.split()] > schema = StructType(fields) > # Fails with a plain tuple such as ({'b': 1},); an explicit Row is required. > df = sqlContext.createDataFrame([Row(a={'b': 1})], schema) > {code}
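To make the parenthesization point concrete, a minimal sketch using the standard pyspark.sql column operators; the DataFrame `people` and its `age`/`name` columns are assumed for illustration:
{code}
from pyspark.sql.functions import col

# & and | bind tighter than comparison operators in Python, so each
# comparison must be wrapped in parentheses:
adults_named_alice = people.filter((col("age") >= 18) & (col("name") == "Alice"))

# Without the parentheses, col("age") >= 18 & col("name") == "Alice"
# parses as col("age") >= (18 & col("name")) == "Alice" (a chained
# comparison) and fails instead of building the intended AND filter.
{code}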
[jira] [Updated] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7
[ https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5389: - Component/s: Windows PySpark > spark-shell.cmd does not run from DOS Windows 7 > --- > > Key: SPARK-5389 > URL: https://issues.apache.org/jira/browse/SPARK-5389 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Shell, Windows >Affects Versions: 1.2.0 > Environment: Windows 7 >Reporter: Yana Kadiyska > Attachments: SparkShell_Win7.JPG, spark_bug.png > > > spark-shell.cmd crashes in the DOS prompt on Windows 7 but works fine under PowerShell. > spark-shell.cmd works fine for me in v1.1, so this is new in Spark 1.2. > Marking as trivial since calling spark-shell2.cmd also works fine. > Attaching a screenshot since the error isn't very useful: > {code} > spark-1.2.0-bin-cdh4>bin\spark-shell.cmd > else was unexpected at this time. > {code}
[jira] [Updated] (SPARK-7202) Add SparseMatrixPickler to SerDe
[ https://issues.apache.org/jira/browse/SPARK-7202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar updated SPARK-7202: - Priority: Minor (was: Major) > Add SparseMatrixPickler to SerDe > > > Key: SPARK-7202 > URL: https://issues.apache.org/jira/browse/SPARK-7202 > Project: Spark > Issue Type: New Feature > Components: MLlib, PySpark >Reporter: Manoj Kumar >Priority: Minor > > We need a SparseMatrixPickler similar to the existing DenseMatrixPickler.
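For illustration, what such pickling support enables on the Python side; a minimal local sketch with pyspark.mllib.linalg, where plain pickle stands in for the Pyrolite-based JVM transfer that SerDe handles (an assumption for illustration, not the actual SerDe path):
{code}
import pickle
from pyspark.mllib.linalg import SparseMatrix

# 2x2 sparse matrix in CSC layout: colPtrs=[0, 1, 2], rowIndices=[0, 1].
m = SparseMatrix(2, 2, [0, 1, 2], [0, 1], [3.0, 4.0])

# Round-trip through pickle; a SparseMatrixPickler would let the JVM
# side produce and consume the same byte representation.
m2 = pickle.loads(pickle.dumps(m))
assert m.toArray().tolist() == m2.toArray().tolist()
{code}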
[jira] [Commented] (SPARK-7195) Can't start spark shell or pyspark in Windows 7
[ https://issues.apache.org/jira/browse/SPARK-7195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517826#comment-14517826 ] Mark Smiley commented on SPARK-7195: Sean, I added my bug as a comment to the old bug and attached my file there. Can you add PySpark as one of the components involved? I don't have permission to do that. Thanks, Mark > Can't start spark shell or pyspark in Windows 7 > --- > > Key: SPARK-7195 > URL: https://issues.apache.org/jira/browse/SPARK-7195 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Shell >Affects Versions: 1.3.1 > Environment: Windows 7, Java 8 (1.8.0_31) or Java 7 (1.7.0_79), Scala 2.11.6, Python 2.7 >Reporter: Mark Smiley > Attachments: spark_bug.png > > > cd\spark\bin dir > spark-shell > yields the following error: > find: 'version': No such file or directory > else was unexpected at this time > Same error with > spark-shell2.cmd > The PySpark shell starts, but with errors, and doesn't work properly once started > (e.g., it can't find sc). I can send a screenshot of the errors on request. > Using the Spark 1.3.1 for Hadoop 2.6 binary. > Note: Hadoop is not installed on the machine. > Scala works by itself, Python works by itself, > and Java works fine (I use it all the time). > Based on another comment, I tried Java 7 (1.7.0_79), but it made no difference > (same error). > JAVA_HOME = C:\jdk1.8.0\bin > C:\jdk1.8.0\bin\;C:\Program Files > (x86)\scala\bin;C:\Python27;c:\Rtools\bin;c:\Rtools\gcc-4.6.3\bin;C:\Oracle\product64\12.1.0\client_1\bin;C:\Oracle\product\12.1.0\client_1\bin;C:\ProgramData\Oracle\Java\javapath;C:\Program > Files (x86)\Intel\iCLS Client\;C:\Program Files\Intel\iCLS > Client\;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program > Files\Intel\Intel(R) Management Engine Components\DAL;C:\Program > Files\Intel\Intel(R) Management Engine Components\IPT;C:\Program Files > (x86)\Intel\Intel(R) Management Engine Components\DAL;C:\Program Files > (x86)\Intel\Intel(R) Management Engine Components\IPT;C:\Program > Files\Dell\Dell Data Protection\Access\Advanced\Wave\Gemalto\Access > Client\v5\;C:\Program Files (x86)\NTRU Cryptosystems\NTRU TCG Software > Stack\bin\;C:\Program Files\NTRU Cryptosystems\NTRU TCG Software > Stack\bin\;C:\Program Files (x86)\Intel\OpenCL SDK\2.0\bin\x86;C:\Program > Files (x86)\Intel\OpenCL SDK\2.0\bin\x64;C:\Program Files\MiKTeX > 2.9\miktex\bin\x64\;C:\Program Files > (x86)\ActivIdentity\ActivClient\;C:\Program Files\ActivIdentity\ActivClient\