[jira] [Closed] (SPARK-7220) Check whether moving shared params is a compatible change

2015-04-28 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-7220.

   Resolution: Done
Fix Version/s: 1.4.0

> Check whether moving shared params is a compatible change
> -
>
> Key: SPARK-7220
> URL: https://issues.apache.org/jira/browse/SPARK-7220
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
> Fix For: 1.4.0
>
>
> Shared params are private, and their usage is treated as an implementation 
> detail. But we need to make sure that moving params from shared to a 
> concrete class is a compatible change. Otherwise, we shouldn't use shared 
> params.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7220) Check whether moving shared params is a compatible change

2015-04-28 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518747#comment-14518747
 ] 

Xiangrui Meng commented on SPARK-7220:
--

I compiled an example app that calls LinearRegression with elasticNetParam, 
then I moved the methods under HasElasticNetParam to LinearRegressionParams. 
Without recompiling, the app jar works with the new Spark assembly jar. So we 
can treat shared params as implementation details and don't need to worry 
about where the methods are declared.
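For readers unfamiliar with why this is binary compatible, here is a 
self-contained sketch of the experiment with toy stand-ins (HasElasticNetParam 
and LinearRegressionParams are the real Spark names; everything else below is 
illustrative, not Spark code):

{noformat}
// Version 1: the method lives in a shared trait (analogous to HasElasticNetParam).
trait HasElasticNet { def getElasticNetParam: Double = 0.0 }
class LinearRegression extends HasElasticNet

// Version 2: the same method moved into the concrete class
// (analogous to LinearRegressionParams):
//   trait HasElasticNet                              // member removed
//   class LinearRegression { def getElasticNetParam: Double = 0.0 }

// A client compiled against version 1:
object App {
  def main(args: Array[String]): Unit =
    println(new LinearRegression().getElasticNetParam)
}
// Running App against version 2 without recompiling still works: the call site
// is an invokevirtual on LinearRegression, which exposes the method either way.
{noformat}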

> Check whether moving shared params is a compatible change
> -
>
> Key: SPARK-7220
> URL: https://issues.apache.org/jira/browse/SPARK-7220
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
> Fix For: 1.4.0
>
>
> Shared params are private, and their usage is treated as an implementation 
> detail. But we need to make sure that moving params from shared to a 
> concrete class is a compatible change. Otherwise, we shouldn't use shared 
> params.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7221) Expose the current processed file name of FileInputDStream to the users

2015-04-28 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-7221:
---
Issue Type: Wish  (was: New Feature)

> Expose the current processed file name of FileInputDStream to the users
> ---
>
> Key: SPARK-7221
> URL: https://issues.apache.org/jira/browse/SPARK-7221
> Project: Spark
>  Issue Type: Wish
>  Components: Streaming
>Reporter: Saisai Shao
>Priority: Minor
>
> This is a feature requested on the Spark user list 
> (http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-textFileStream-fileStream-Get-file-name-tt22692.html).
>  Currently there is no API to get the processed file name for 
> FileInputDStream; it would be useful to expose this to users. 
> The main problem is how to expose it to users in an elegant way.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7221) Expose the current processed file name of FileInputDStream to the users

2015-04-28 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-7221:
--

 Summary: Expose the current processed file name of 
FileInputDStream to the users
 Key: SPARK-7221
 URL: https://issues.apache.org/jira/browse/SPARK-7221
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Saisai Shao
Priority: Minor


This is a feature requested on the Spark user list 
(http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-textFileStream-fileStream-Get-file-name-tt22692.html).
Currently there is no API to get the processed file name for FileInputDStream; 
it would be useful to expose this to users. 

The main problem is how to expose it to users in an elegant way.
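Until such an API exists, one hedged workaround (circulating on the user list; 
it relies on the internal detail that each fileStream batch is a union of one 
NewHadoopRDD per new file, so it may break across Spark versions) is to recover 
the file name from each partition's input split:

{noformat}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.{FileSplit, TextInputFormat}
import org.apache.spark.rdd.{NewHadoopRDD, UnionRDD}
import org.apache.spark.streaming.StreamingContext

def linesWithFileNames(ssc: StreamingContext, dir: String) =
  ssc.fileStream[LongWritable, Text, TextInputFormat](dir).transform { rdd =>
    // Each batch RDD is a union of one NewHadoopRDD per newly detected file.
    val perFile = rdd.dependencies.map(_.rdd).collect {
      case hadoopRdd: NewHadoopRDD[LongWritable @unchecked, Text @unchecked] =>
        hadoopRdd.mapPartitionsWithInputSplit { (split, iter) =>
          val file = split.asInstanceOf[FileSplit].getPath.toString
          iter.map { case (_, line) => (file, line.toString) }
        }
    }
    new UnionRDD(rdd.context, perFile)
  }
{noformat}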



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6756) Add compress() to Vector

2015-04-28 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-6756.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5756
[https://github.com/apache/spark/pull/5756]

> Add compress() to Vector
> 
>
> Key: SPARK-6756
> URL: https://issues.apache.org/jira/browse/SPARK-6756
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 1.4.0
>
>
> Add compress() to Vector that automatically converts the underlying vector 
> to dense or sparse based on the number of non-zeros.
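As a rough illustration of the intended behavior, here is a hedged sketch of 
such a heuristic (the byte counts are approximations, not MLlib's exact rule):

{noformat}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

def compress(v: Vector): Vector = {
  val values = v.toArray
  val nnz = values.count(_ != 0.0)
  val denseBytes  = 8.0 * v.size   // one Double per element
  val sparseBytes = 12.0 * nnz     // one Int index + one Double value per non-zero
  if (sparseBytes < denseBytes) {
    val (indices, nonZeros) = values.zipWithIndex
      .collect { case (x, i) if x != 0.0 => (i, x) }.unzip
    Vectors.sparse(v.size, indices, nonZeros)
  } else {
    Vectors.dense(values)
  }
}
{noformat}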



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7220) Check whether moving shared params is a compatible change

2015-04-28 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-7220:


 Summary: Check whether moving shared params is a compatible change
 Key: SPARK-7220
 URL: https://issues.apache.org/jira/browse/SPARK-7220
 Project: Spark
  Issue Type: Task
  Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical


Shared params are private, and their usage is treated as an implementation 
detail. But we need to make sure that moving params from shared to a concrete 
class is a compatible change. Otherwise, we shouldn't use shared params.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated

2015-04-28 Thread Zhang, Liye (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518721#comment-14518721
 ] 

Zhang, Liye commented on SPARK-7189:


Hi [~vanzin], I think using a timestamp is not precise enough. This method is 
very similar to using the modification time. There will always be situations 
where several operations finish within a very short time (say, less than 1 
millisecond). So neither the timestamp nor the modification time can be 
trusted. 

The goal is to detect status changes to the files, including content changes 
(write operations) and permission changes (rename operations). `Inotify` can 
detect such changes, but it is not available in HDFS before version 2.7. One 
way to detect a change is to set a flag after each operation and reset it 
after reloading the file, but that would make the code really ugly; it's a bad 
option. 
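To make the trade-off concrete, here is an illustrative sketch of the check 
described in the issue (schematic, not Spark's actual code):

{noformat}
// The history server's reload condition, schematically:
def shouldReload(fileMtime: Long, newestSeen: Long): Boolean =
  fileMtime >= newestSeen
// '>=' means the file(s) carrying the newest timestamp are reloaded on every
// poll even when unchanged; switching to '>' would instead miss files modified
// within the same timestamp granularity, which is the imprecision noted above.
{noformat}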

> History server will always reload the same file even when no log file is 
> updated
> 
>
> Key: SPARK-7189
> URL: https://issues.apache.org/jira/browse/SPARK-7189
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Zhang, Liye
>Priority: Minor
>
> The history server checks every log file by its modification time. It 
> reloads a file if the file's modification time is later than or equal to the 
> latest modification time it remembers. So it periodically reloads the 
> file(s) with the latest modification time even if nothing has changed. This 
> is unnecessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7208) Add Matrix, SparseMatrix to __all__ list in linalg.py

2015-04-28 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-7208.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5759
[https://github.com/apache/spark/pull/5759]

> Add Matrix, SparseMatrix to __all__ list in linalg.py
> -
>
> Key: SPARK-7208
> URL: https://issues.apache.org/jira/browse/SPARK-7208
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.4.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Trivial
> Fix For: 1.4.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7202) Add SparseMatrixPickler to SerDe

2015-04-28 Thread Manoj Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Kumar updated SPARK-7202:
---
Priority: Major  (was: Minor)

> Add SparseMatrixPickler to SerDe
> 
>
> Key: SPARK-7202
> URL: https://issues.apache.org/jira/browse/SPARK-7202
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Manoj Kumar
>
> We need a SparseMatrixPickler similar to the existing DenseMatrixPickler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7219) HashingTF should output ML attributes

2015-04-28 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-7219:


 Summary: HashingTF should output ML attributes
 Key: SPARK-7219
 URL: https://issues.apache.org/jira/browse/SPARK-7219
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


HashingTF knows the output feature dimension, which should be in the output ML 
attributes.
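A hedged sketch of the idea, using the ml.attribute API (the exact patch may 
differ; the column name and dimension here are illustrative):

{noformat}
import org.apache.spark.ml.attribute.AttributeGroup

val numFeatures = 1 << 18  // HashingTF's configured feature dimension
// Attach the known dimension to the output column as an ML attribute group:
val outputMetadata = new AttributeGroup("features", numFeatures).toMetadata()
// The transformer would then emit its output column with this metadata
// attached, so downstream stages can read the vector size without scanning
// the data.
{noformat}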



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7219) HashingTF should output ML attributes

2015-04-28 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7219:
-
Priority: Trivial  (was: Major)

> HashingTF should output ML attributes
> -
>
> Key: SPARK-7219
> URL: https://issues.apache.org/jira/browse/SPARK-7219
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Trivial
>
> HashingTF knows the output feature dimension, which should be in the output 
> ML attributes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7194) Vectors factors method for sparse vectors should accept the output of zipWithIndex

2015-04-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7194:
---

Assignee: Apache Spark

> Vectors factors method for sparse vectors should accept the output of 
> zipWithIndex
> --
>
> Key: SPARK-7194
> URL: https://issues.apache.org/jira/browse/SPARK-7194
> Project: Spark
>  Issue Type: Improvement
>Reporter: Juliet Hougland
>Assignee: Apache Spark
>
> Let's say we have an RDD of Array[Double] where zero values are explicitly 
> recorded, i.e. (0.0, 0.0, 3.2, 0.0, ...). If we want to transform this into 
> an RDD of sparse vectors, we currently have to:
> arr_doubles.map { array =>
>   val indexElem: Seq[(Int, Double)] = array.zipWithIndex.filter(tuple => 
> tuple._1 != 0.0).map(tuple => (tuple._2, tuple._1))
>   Vectors.sparse(array.length, indexElem)
> }
> Notice that there is a map step at the end to switch the order of the index 
> and the element value after .zipWithIndex. There should be a factory method 
> on the Vectors class that allows you to avoid this flipping of tuple 
> elements when using zipWithIndex.
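A hedged sketch of what such a factory method could look like (the name is 
illustrative, not an actual MLlib API):

{noformat}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Accept (value, index) pairs exactly as produced by zipWithIndex,
// dropping zeros and flipping the tuples internally:
def sparseFromZipWithIndex(size: Int, zipped: Seq[(Double, Int)]): Vector =
  Vectors.sparse(size, zipped.collect { case (v, i) if v != 0.0 => (i, v) })

// Usage: the trailing .map to swap tuple elements disappears.
// val vec = sparseFromZipWithIndex(array.length, array.zipWithIndex)
{noformat}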



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7194) Vectors factors method for sparse vectors should accept the output of zipWithIndex

2015-04-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518623#comment-14518623
 ] 

Apache Spark commented on SPARK-7194:
-

User 'kaka1992' has created a pull request for this issue:
https://github.com/apache/spark/pull/5766

> Vectors factors method for sparse vectors should accept the output of 
> zipWithIndex
> --
>
> Key: SPARK-7194
> URL: https://issues.apache.org/jira/browse/SPARK-7194
> Project: Spark
>  Issue Type: Improvement
>Reporter: Juliet Hougland
>
> Let's say we have an RDD of Array[Double] where zero values are explicitly 
> recorded, i.e. (0.0, 0.0, 3.2, 0.0, ...). If we want to transform this into 
> an RDD of sparse vectors, we currently have to:
> arr_doubles.map { array =>
>   val indexElem: Seq[(Int, Double)] = array.zipWithIndex.filter(tuple => 
> tuple._1 != 0.0).map(tuple => (tuple._2, tuple._1))
>   Vectors.sparse(array.length, indexElem)
> }
> Notice that there is a map step at the end to switch the order of the index 
> and the element value after .zipWithIndex. There should be a factory method 
> on the Vectors class that allows you to avoid this flipping of tuple 
> elements when using zipWithIndex.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7194) Vectors factors method for sparse vectors should accept the output of zipWithIndex

2015-04-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7194:
---

Assignee: (was: Apache Spark)

> Vectors factors method for sparse vectors should accept the output of 
> zipWithIndex
> --
>
> Key: SPARK-7194
> URL: https://issues.apache.org/jira/browse/SPARK-7194
> Project: Spark
>  Issue Type: Improvement
>Reporter: Juliet Hougland
>
> Let's say we have an RDD of Array[Double] where zero values are explicitly 
> recorded, i.e. (0.0, 0.0, 3.2, 0.0, ...). If we want to transform this into 
> an RDD of sparse vectors, we currently have to:
> arr_doubles.map { array =>
>   val indexElem: Seq[(Int, Double)] = array.zipWithIndex.filter(tuple => 
> tuple._1 != 0.0).map(tuple => (tuple._2, tuple._1))
>   Vectors.sparse(array.length, indexElem)
> }
> Notice that there is a map step at the end to switch the order of the index 
> and the element value after .zipWithIndex. There should be a factory method 
> on the Vectors class that allows you to avoid this flipping of tuple 
> elements when using zipWithIndex.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler

2015-04-28 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518621#comment-14518621
 ] 

Guoqiang Li commented on SPARK-5556:


[spark-summit.pptx|https://issues.apache.org/jira/secure/attachment/12729035/spark-summit.pptx]
 presents the relevant algorithm.

> Latent Dirichlet Allocation (LDA) using Gibbs sampler 
> --
>
> Key: SPARK-5556
> URL: https://issues.apache.org/jira/browse/SPARK-5556
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Pedro Rodriguez
> Attachments: LDA_test.xlsx, spark-summit.pptx
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler

2015-04-28 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518618#comment-14518618
 ] 

Guoqiang Li commented on SPARK-5556:


LDA_Gibbs combines the advantages of the AliasLDA, FastLDA, and SparseLDA 
algorithms. The corresponding code is https://github.com/witgo/spark/tree/lda_Gibbs or 
https://github.com/witgo/zen/blob/master/ml/src/main/scala/com/github/cloudml/zen/ml/clustering/LDA.scala#L553.

Yes, LightLDA converges faster, but it uses more memory.




> Latent Dirichlet Allocation (LDA) using Gibbs sampler 
> --
>
> Key: SPARK-5556
> URL: https://issues.apache.org/jira/browse/SPARK-5556
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Pedro Rodriguez
> Attachments: LDA_test.xlsx, spark-summit.pptx
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler

2015-04-28 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-5556:
---
Attachment: spark-summit.pptx

> Latent Dirichlet Allocation (LDA) using Gibbs sampler 
> --
>
> Key: SPARK-5556
> URL: https://issues.apache.org/jira/browse/SPARK-5556
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Pedro Rodriguez
> Attachments: LDA_test.xlsx, spark-summit.pptx
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7193) "Spark on Mesos" may need more tests for spark 1.3.1 release

2015-04-28 Thread Littlestar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518610#comment-14518610
 ] 

Littlestar edited comment on SPARK-7193 at 4/29/15 2:40 AM:


I think the official documentation is missing some notes about "Spark on Mesos".

It worked well for me with the following:

extract spark-1.3.1-bin-hadoop2.4.tgz, modify conf/spark-env.sh, repack it as a 
new spark-1.3.1-bin-hadoop2.4.tgz, and then put it on HDFS

In spark-env.sh, set JAVA_HOME, HADOOP_CONF_DIR, HADOOP_HOME







was (Author: cnstar9988):
I think the official documentation is missing some notes about "Spark on Mesos".

It worked well for me with the following:

extract spark-1.3.1-bin-hadoop2.4.tgz, modify conf/spark-env.sh, repack it as a 
new spark-1.3.1-bin-hadoop2.4.tgz, and then put it on HDFS

In spark-env.sh, set JAVA_HOME, HADOO_CONF_DIR, HADOO_HOME






> "Spark on Mesos" may need more tests for spark 1.3.1 release
> 
>
> Key: SPARK-7193
> URL: https://issues.apache.org/jira/browse/SPARK-7193
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.3.1
>Reporter: Littlestar
>
> "Spark on Mesos" may need more tests for spark 1.3.1 release
> http://spark.apache.org/docs/latest/running-on-mesos.html
> I tested mesos 0.21.1/0.22.0/0.22.1 RC4.
> It just works well with "./bin/spark-shell --master mesos://host:5050".
> Any task that needs more than one node throws the following exceptions.
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 10 in stage 0.0 failed 4 times, most recent failure: 
> Lost task 10.3 in stage 0.0 (TID 127, hpblade05): 
> java.lang.IllegalStateException: unread block data
>   at 
> java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2393)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1378)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1963)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1887)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1346)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:368)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>   at java.lang.Thread.run(Thread.java:679)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> 15/04/28 15:33:18 ERROR scheduler.LiveListenerBus: Listener 
> EventLoggingListener threw an exception
> java.lang.reflect.InvocationTargetException
>   at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144)
>   at 
> org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144)
>   at scala.Option.foreach(Option.scala:236)

[jira] [Resolved] (SPARK-7193) "Spark on Mesos" may need more tests for spark 1.3.1 release

2015-04-28 Thread Littlestar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Littlestar resolved SPARK-7193.
---
Resolution: Invalid

I think the official documentation is missing some notes about "Spark on Mesos".

It worked well for me with the following:

extract spark-1.3.1-bin-hadoop2.4.tgz, modify conf/spark-env.sh, repack it as a 
new spark-1.3.1-bin-hadoop2.4.tgz, and then put it on HDFS

In spark-env.sh, set JAVA_HOME, HADOOP_CONF_DIR, HADOOP_HOME
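For concreteness, the relevant lines in conf/spark-env.sh inside the repacked 
tarball would look something like this (the paths are illustrative, not from 
the original report):

{noformat}
# conf/spark-env.sh inside the repacked spark-1.3.1-bin-hadoop2.4.tgz
export JAVA_HOME=/usr/java/default
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HADOOP_HOME=/usr/lib/hadoop
{noformat}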






> "Spark on Mesos" may need more tests for spark 1.3.1 release
> 
>
> Key: SPARK-7193
> URL: https://issues.apache.org/jira/browse/SPARK-7193
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.3.1
>Reporter: Littlestar
>
> "Spark on Mesos" may need more tests for spark 1.3.1 release
> http://spark.apache.org/docs/latest/running-on-mesos.html
> I tested mesos 0.21.1/0.22.0/0.22.1 RC4.
> It just works well with "./bin/spark-shell --master mesos://host:5050".
> Any task that needs more than one node throws the following exceptions.
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 10 in stage 0.0 failed 4 times, most recent failure: 
> Lost task 10.3 in stage 0.0 (TID 127, hpblade05): 
> java.lang.IllegalStateException: unread block data
>   at 
> java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2393)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1378)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1963)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1887)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1346)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:368)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>   at java.lang.Thread.run(Thread.java:679)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> 15/04/28 15:33:18 ERROR scheduler.LiveListenerBus: Listener 
> EventLoggingListener threw an exception
> java.lang.reflect.InvocationTargetException
>   at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144)
>   at 
> org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:144)
>   at 
> org.apache.spark.scheduler.EventLoggingListener.onStageCompleted(EventLoggingListener.scala:165)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:32)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerB

[jira] [Resolved] (SPARK-7138) Add method to BlockGenerator to add multiple records to BlockGenerator with single callback

2015-04-28 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-7138.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

> Add method to BlockGenerator to add multiple records to BlockGenerator with 
> single callback
> ---
>
> Key: SPARK-7138
> URL: https://issues.apache.org/jira/browse/SPARK-7138
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Minor
> Fix For: 1.4.0
>
>
> This is to ensure that receivers that receive data in small batches (like 
> Kinesis) can add those records together while having the callback function 
> called only once.
> This is for internal use only, for an improvement to the Kinesis receiver 
> that we are planning to do.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler

2015-04-28 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518602#comment-14518602
 ] 

Guoqiang Li commented on SPARK-5556:


I put the latest LDA code in 
[Zen|https://github.com/witgo/zen/tree/master/ml/src/main/scala/com/github/cloudml/zen/ml/clustering].
The test results are 
[here|https://issues.apache.org/jira/secure/attachment/12729030/LDA_test.xlsx] 
(72 cores, 216 GB RAM, 6 servers, Gigabit Ethernet).

> Latent Dirichlet Allocation (LDA) using Gibbs sampler 
> --
>
> Key: SPARK-5556
> URL: https://issues.apache.org/jira/browse/SPARK-5556
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Pedro Rodriguez
> Attachments: LDA_test.xlsx
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler

2015-04-28 Thread Pedro Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518601#comment-14518601
 ] 

Pedro Rodriguez commented on SPARK-5556:


[~gq], is the LDAGibbs line what I implemented, or something else? In any 
case, the optimization on sampling shouldn't change the results, so it looks 
like LightLDA converges to a better perplexity.

Do you have any performance graphs?

> Latent Dirichlet Allocation (LDA) using Gibbs sampler 
> --
>
> Key: SPARK-5556
> URL: https://issues.apache.org/jira/browse/SPARK-5556
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Pedro Rodriguez
> Attachments: LDA_test.xlsx
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler

2015-04-28 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-5556:
---
Attachment: LDA_test.xlsx

> Latent Dirichlet Allocation (LDA) using Gibbs sampler 
> --
>
> Key: SPARK-5556
> URL: https://issues.apache.org/jira/browse/SPARK-5556
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Pedro Rodriguez
> Attachments: LDA_test.xlsx
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7218) Create a real iterator with open/close for Spark SQL

2015-04-28 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-7218:
--

 Summary: Create a real iterator with open/close for Spark SQL
 Key: SPARK-7218
 URL: https://issues.apache.org/jira/browse/SPARK-7218
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7169) Allow to specify metrics configuration more flexibly

2015-04-28 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518508#comment-14518508
 ] 

Saisai Shao commented on SPARK-7169:


Hi [~jlewandowski], regarding your second problem, I think you don't have to 
copy the metrics configuration file manually to every machine one by one; you 
could use spark-submit --files path/to/your/metrics.properties to transfer 
your configuration to each executor/container.

And for the first problem, is it a big problem that all the configuration 
files need to be in the same directory? Lots of Spark as well as Hadoop conf 
files have such a requirement. But you can configure the driver and executors 
with different parameters in the conf file, since MetricsSystem supports such 
features.

Yes, I think the current metrics configuration may not be so easy to use; any 
improvement is greatly appreciated :).

> Allow to specify metrics configuration more flexibly
> 
>
> Key: SPARK-7169
> URL: https://issues.apache.org/jira/browse/SPARK-7169
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.2, 1.3.1
>Reporter: Jacek Lewandowski
>Priority: Minor
>
> Metrics are configured in the {{metrics.properties}} file. The path to this 
> file is specified in {{SparkConf}} under the key {{spark.metrics.conf}}. The 
> property is read when {{MetricsSystem}} is created, that is, during 
> {{SparkEnv}} initialisation. 
> h5. Problem
> When users run their application, they have no way to provide the metrics 
> configuration for executors. Although one can specify the path to the 
> metrics configuration file, (1) the path is common to all the nodes and the 
> client machine, so there is an implicit assumption that all the machines 
> have the same file in the same location, and (2) the user actually needs to 
> copy the file manually to the worker nodes because the file is read before 
> the user files are populated to the executor local directories. All of this 
> makes it very difficult to play with the metrics configuration.
> h5. Proposed solution
> I think that the easiest and most consistent solution would be to move the 
> configuration from the separate file directly into {{SparkConf}}. We could 
> prefix all the settings from the metrics configuration with, say, 
> {{spark.metrics.props}}. For backward compatibility, these properties would 
> still be loaded from the specified file as it works now. Such a solution 
> doesn't change the API, so maybe it could even be included in a patch 
> release of Spark 1.2 and Spark 1.3.
> Appreciate any feedback.
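To make the proposal concrete, here is a hedged sketch of the same sink 
setting in both forms (the {{spark.metrics.props}} prefix is the one suggested 
in the description, not an existing key):

{noformat}
# Today: metrics.properties, which must be present on every node
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
*.sink.csv.period=10

# Proposed: the same settings carried in SparkConf under a prefix
spark.metrics.props.*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
spark.metrics.props.*.sink.csv.period=10
{noformat}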



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7217) Add configuration to disable stopping of SparkContext when StreamingContext.stop()

2015-04-28 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-7217:


 Summary: Add configuration to disable stopping of SparkContext 
when StreamingContext.stop()
 Key: SPARK-7217
 URL: https://issues.apache.org/jira/browse/SPARK-7217
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.3.1
Reporter: Tathagata Das
Assignee: Tathagata Das


In environments like notebooks, the SparkContext is managed by the underlying 
infrastructure and it is expected that the SparkContext will not be stopped. 
However, StreamingContext.stop() calls SparkContext.stop() as a non-intuitive 
side effect. This JIRA is to add a configuration to SparkConf that sets the 
default StreamingContext stop behavior. It should be such that the existing 
behavior does not change for existing users.
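For reference, an explicit per-call override already exists; this JIRA is 
about the default. A hedged sketch (the configuration key name is 
illustrative, since the description only asks for "a configuration in 
SparkConf"; ssc is an existing StreamingContext):

{noformat}
import org.apache.spark.SparkConf

// Explicit override, already in the API: stop streaming, keep the SparkContext.
ssc.stop(stopSparkContext = false)

// Sketch of the proposed default-changing configuration (illustrative key name):
val conf = new SparkConf()
  .set("spark.streaming.stopSparkContextByDefault", "false")
// With this set, a bare ssc.stop() would leave the SparkContext running.
{noformat}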



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6965) StringIndexer should convert input to Strings

2015-04-28 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-6965.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5753
[https://github.com/apache/spark/pull/5753]

> StringIndexer should convert input to Strings
> -
>
> Key: SPARK-6965
> URL: https://issues.apache.org/jira/browse/SPARK-6965
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Joseph K. Bradley
>Assignee: Xiangrui Meng
>Priority: Minor
> Fix For: 1.4.0
>
>
> StringIndexer should convert non-String input types to String.  That way, it 
> can handle any basic types such as Int, Double, etc.
> It can convert any input type to strings first and store the string labels 
> (instead of an arbitrary type).  That will simplify model export/import.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7216) Show driver details in Mesos cluster UI

2015-04-28 Thread Timothy Chen (JIRA)
Timothy Chen created SPARK-7216:
---

 Summary: Show driver details in Mesos cluster UI
 Key: SPARK-7216
 URL: https://issues.apache.org/jira/browse/SPARK-7216
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Timothy Chen


Show driver details in Mesos cluster UI



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7216) Show driver details in Mesos cluster UI

2015-04-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518447#comment-14518447
 ] 

Apache Spark commented on SPARK-7216:
-

User 'tnachen' has created a pull request for this issue:
https://github.com/apache/spark/pull/5763

> Show driver details in Mesos cluster UI
> ---
>
> Key: SPARK-7216
> URL: https://issues.apache.org/jira/browse/SPARK-7216
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
>
> Show driver details in Mesos cluster UI



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7216) Show driver details in Mesos cluster UI

2015-04-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7216:
---

Assignee: Apache Spark

> Show driver details in Mesos cluster UI
> ---
>
> Key: SPARK-7216
> URL: https://issues.apache.org/jira/browse/SPARK-7216
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
>Assignee: Apache Spark
>
> Show driver details in Mesos cluster UI



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7216) Show driver details in Mesos cluster UI

2015-04-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7216:
---

Assignee: (was: Apache Spark)

> Show driver details in Mesos cluster UI
> ---
>
> Key: SPARK-7216
> URL: https://issues.apache.org/jira/browse/SPARK-7216
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
>
> Show driver details in Mesos cluster UI



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6627) Clean up of shuffle code and interfaces

2015-04-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518402#comment-14518402
 ] 

Apache Spark commented on SPARK-6627:
-

User 'kayousterhout' has created a pull request for this issue:
https://github.com/apache/spark/pull/5764

> Clean up of shuffle code and interfaces
> ---
>
> Key: SPARK-6627
> URL: https://issues.apache.org/jira/browse/SPARK-6627
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>Priority: Critical
> Fix For: 1.4.0
>
>
> The shuffle code in Spark is somewhat messy and could use some interface 
> clean-up, especially with some larger changes outstanding. This is a 
> catch-all for what may be some small improvements in a few different PRs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler

2015-04-28 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518400#comment-14518400
 ] 

Joseph K. Bradley commented on SPARK-5556:
--

That plan sounds good.  I haven't yet been able to look into LightLDA, but it 
would be good to understand whether it's (a) a modification which could be 
added to Gibbs later on, or (b) something which belongs as a separate 
algorithm.

> Latent Dirichlet Allocation (LDA) using Gibbs sampler 
> --
>
> Key: SPARK-5556
> URL: https://issues.apache.org/jira/browse/SPARK-5556
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Pedro Rodriguez
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7215) Make repartition and coalesce a part of the query plan

2015-04-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7215:
---

Assignee: Apache Spark

> Make repartition and coalesce a part of the query plan
> --
>
> Key: SPARK-7215
> URL: https://issues.apache.org/jira/browse/SPARK-7215
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler

2015-04-28 Thread Pedro Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518378#comment-14518378
 ] 

Pedro Rodriguez commented on SPARK-5556:


I will start working on it again then. It would be great for that research 
project to result in Gibbs being added. The refactoring ended up roadblocking 
that quite a bit.

I think [~gq] was working on something called LightLDA. I don't know the 
specifics of the algorithm, but the sampler theoretically scales O(1) with the 
number of topics. In the testing I did, my implementation also looks O(1), or 
very near it, in practice.

To get Gibbs merged in (or considered as a candidate implementation), how does 
this look:
1. Refactor the code to fit the PR that you just merged.
2. Use the testing harness you used for the EM LDA to test under the same 
conditions. This should be fairly easy since you already did all the work to 
get things pipelining correctly.
3. If it scales well, merge it or consider other applications.
4. Code review somewhere in there.

> Latent Dirichlet Allocation (LDA) using Gibbs sampler 
> --
>
> Key: SPARK-5556
> URL: https://issues.apache.org/jira/browse/SPARK-5556
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Pedro Rodriguez
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7215) Make repartition and coalesce a part of the query plan

2015-04-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518379#comment-14518379
 ] 

Apache Spark commented on SPARK-7215:
-

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/5762

> Make repartition and coalesce a part of the query plan
> --
>
> Key: SPARK-7215
> URL: https://issues.apache.org/jira/browse/SPARK-7215
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Burak Yavuz
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7215) Make repartition and coalesce a part of the query plan

2015-04-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7215:
---

Assignee: (was: Apache Spark)

> Make repartition and coalesce a part of the query plan
> --
>
> Key: SPARK-7215
> URL: https://issues.apache.org/jira/browse/SPARK-7215
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Burak Yavuz
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7215) Make repartition and coalesce a part of the query plan

2015-04-28 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-7215:
--

 Summary: Make repartition and coalesce a part of the query plan
 Key: SPARK-7215
 URL: https://issues.apache.org/jira/browse/SPARK-7215
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Burak Yavuz
Priority: Critical






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5182) Partitioning support for tables created by the data source API

2015-04-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518305#comment-14518305
 ] 

Apache Spark commented on SPARK-5182:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/5526

> Partitioning support for tables created by the data source API
> --
>
> Key: SPARK-5182
> URL: https://issues.apache.org/jira/browse/SPARK-5182
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Cheng Lian
>Priority: Blocker
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)

2015-04-28 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518286#comment-14518286
 ] 

Sandy Ryza commented on SPARK-3655:
---

My opinion is that a secondary sort operator in core Spark would definitely be 
useful.

> Support sorting of values in addition to keys (i.e. secondary sort)
> ---
>
> Key: SPARK-3655
> URL: https://issues.apache.org/jira/browse/SPARK-3655
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.2.0
>Reporter: koert kuipers
>Assignee: Koert Kuipers
>
> Now that Spark has a sort-based shuffle, can we expect a secondary sort 
> soon? There are some use cases where getting a sorted iterator of values per 
> key is helpful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7214) Unrolling never evicts blocks when MemoryStore is nearly full

2015-04-28 Thread Charles Reiss (JIRA)
Charles Reiss created SPARK-7214:


 Summary: Unrolling never evicts blocks when MemoryStore is nearly 
full
 Key: SPARK-7214
 URL: https://issues.apache.org/jira/browse/SPARK-7214
 Project: Spark
  Issue Type: Bug
  Components: Block Manager
Reporter: Charles Reiss
Priority: Minor


When less than spark.storage.unrollMemoryThreshold (default 1 MB) is left in 
the MemoryStore, new blocks that are computed with unrollSafely (e.g. any 
cached RDD split) will always fail to unroll, even if old blocks could be 
dropped to accommodate them.
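An illustrative sketch of the failure mode (schematic, not Spark's actual 
code):

{noformat}
// unrollSafely first reserves an initial chunk of unroll memory. If free space
// in the MemoryStore is below the threshold, that reservation fails outright,
// without first trying to drop old blocks to make room:
val unrollMemoryThreshold = 1L << 20  // spark.storage.unrollMemoryThreshold, 1 MB default
def canStartUnrolling(freeMemory: Long): Boolean =
  freeMemory >= unrollMemoryThreshold  // no eviction is attempted when this fails
{noformat}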



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7156) Add randomSplit method to DataFrame

2015-04-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518248#comment-14518248
 ] 

Apache Spark commented on SPARK-7156:
-

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/5761

> Add randomSplit method to DataFrame
> ---
>
> Key: SPARK-7156
> URL: https://issues.apache.org/jira/browse/SPARK-7156
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Joseph K. Bradley
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7213) Exception while copying Hadoop config files due to permission issues

2015-04-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518201#comment-14518201
 ] 

Apache Spark commented on SPARK-7213:
-

User 'nishkamravi2' has created a pull request for this issue:
https://github.com/apache/spark/pull/5760

> Exception while copying Hadoop config files due to permission issues
> 
>
> Key: SPARK-7213
> URL: https://issues.apache.org/jira/browse/SPARK-7213
> Project: Spark
>  Issue Type: Bug
>Reporter: Nishkam Ravi
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7213) Exception while copying Hadoop config files due to permission issues

2015-04-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7213:
---

Assignee: Apache Spark

> Exception while copying Hadoop config files due to permission issues
> 
>
> Key: SPARK-7213
> URL: https://issues.apache.org/jira/browse/SPARK-7213
> Project: Spark
>  Issue Type: Bug
>Reporter: Nishkam Ravi
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7213) Exception while copying Hadoop config files due to permission issues

2015-04-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7213:
---

Assignee: (was: Apache Spark)

> Exception while copying Hadoop config files due to permission issues
> 
>
> Key: SPARK-7213
> URL: https://issues.apache.org/jira/browse/SPARK-7213
> Project: Spark
>  Issue Type: Bug
>Reporter: Nishkam Ravi
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7213) Exception while copying Hadoop config files due to permission issues

2015-04-28 Thread Nishkam Ravi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518197#comment-14518197
 ] 

Nishkam Ravi commented on SPARK-7213:
-

PR: https://github.com/apache/spark/pull/5760/

> Exception while copying Hadoop config files due to permission issues
> 
>
> Key: SPARK-7213
> URL: https://issues.apache.org/jira/browse/SPARK-7213
> Project: Spark
>  Issue Type: Bug
>Reporter: Nishkam Ravi
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7213) Exception while copying Hadoop config files due to permission issues

2015-04-28 Thread Nishkam Ravi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518188#comment-14518188
 ] 

Nishkam Ravi commented on SPARK-7213:
-

Exception in thread "main" java.io.FileNotFoundException: 
/etc/hadoop/conf/container-executor.cfg (Permission denied)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at com.google.common.io.Files$FileByteSource.openStream(Files.java:126)
at com.google.common.io.Files$FileByteSource.openStream(Files.java:116)
at com.google.common.io.ByteSource.copyTo(ByteSource.java:233)
at com.google.common.io.Files.copy(Files.java:423)
at 
org.apache.spark.deploy.yarn.Client$$anonfun$createConfArchive$2.apply(Client.scala:374)
at 
org.apache.spark.deploy.yarn.Client$$anonfun$createConfArchive$2.apply(Client.scala:372)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at 
org.apache.spark.deploy.yarn.Client.createConfArchive(Client.scala:372)
at 
org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:288)
at 
org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:466)
at 
org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:106)
at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:58)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141)
at org.apache.spark.SparkContext.(SparkContext.scala:470)
at org.apache.spark.SparkContext.(SparkContext.scala:155)
at org.apache.spark.SparkContext.(SparkContext.scala:192)
at 
org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:95)
at spark.benchmarks.JavaWordCount.main(JavaWordCount.java:41)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:619)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
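
The trace above shows Client.createConfArchive failing outright when a single file under /etc/hadoop/conf is unreadable. As a hedged illustration only (not necessarily the approach taken in the PR linked above), the copy loop could skip files the submitting user cannot read:

{code}
import java.io.File

// Illustrative sketch: skip Hadoop config files the submitting user cannot
// read instead of failing the whole submission. The actual fix may differ;
// see the pull request referenced in this thread.
def readableConfFiles(confDir: File): Seq[File] = {
  val all = Option(confDir.listFiles()).getOrElse(Array.empty[File])
  all.toSeq.filter(f => f.isFile && f.canRead)
}
{code}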


> Exception while copying Hadoop config files due to permission issues
> 
>
> Key: SPARK-7213
> URL: https://issues.apache.org/jira/browse/SPARK-7213
> Project: Spark
>  Issue Type: Bug
>Reporter: Nishkam Ravi
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7213) Exception while copying Hadoop config files due to permission issues

2015-04-28 Thread Nishkam Ravi (JIRA)
Nishkam Ravi created SPARK-7213:
---

 Summary: Exception while copying Hadoop config files due to 
permission issues
 Key: SPARK-7213
 URL: https://issues.apache.org/jira/browse/SPARK-7213
 Project: Spark
  Issue Type: Bug
Reporter: Nishkam Ravi






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler

2015-04-28 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518141#comment-14518141
 ] 

Joseph K. Bradley commented on SPARK-5556:
--

Great!  I'm not aware of blockers.  As far as other active implementations go, the 
only ones I know of are those referenced by [~gq] above.  Please do ping him on 
your work and see if there are ideas which can be merged.  We can help with the 
coordination and discussions as well.  Thanks!

> Latent Dirichlet Allocation (LDA) using Gibbs sampler 
> --
>
> Key: SPARK-5556
> URL: https://issues.apache.org/jira/browse/SPARK-5556
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Pedro Rodriguez
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler

2015-04-28 Thread Pedro Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518133#comment-14518133
 ] 

Pedro Rodriguez commented on SPARK-5556:


With the refactoring done, I can get to work on getting the core code 
running on that interface.

Does it seem likely that, if that is completed, Gibbs will get merged for 1.5? 
Are there any foreseeable blockers or alternative implementations being 
considered?

> Latent Dirichlet Allocation (LDA) using Gibbs sampler 
> --
>
> Key: SPARK-5556
> URL: https://issues.apache.org/jira/browse/SPARK-5556
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Pedro Rodriguez
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7212) Frequent pattern mining for sequential item sets

2015-04-28 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-7212:


 Summary: Frequent pattern mining for sequential item sets
 Key: SPARK-7212
 URL: https://issues.apache.org/jira/browse/SPARK-7212
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Joseph K. Bradley


Currently, FPGrowth handles unordered item sets.  It would be great to be able 
to handle sequences of items, in which the order matters.  This JIRA is for 
discussing modifications to FPGrowth and/or new algorithms for handling 
sequences.
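
To make the distinction concrete, here is a hedged sketch against the existing MLlib FPGrowth API (the data is illustrative only). The two transactions below are identical as unordered item sets, so FPGrowth mines the same patterns from both; a sequence miner (PrefixSpan-style) would treat <a b c> and <c b a> as different patterns:

{code}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

def unorderedExample(sc: SparkContext): Unit = {
  // Same items, different order: indistinguishable to FPGrowth today.
  val transactions: RDD[Array[String]] = sc.parallelize(Seq(
    Array("a", "b", "c"),
    Array("c", "b", "a")))
  val model = new FPGrowth().setMinSupport(0.5).run(transactions)
  model.freqItemsets.collect().foreach { fi =>
    println(fi.items.mkString("[", ",", "]") + " -> " + fi.freq)
  }
}
{code}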



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7211) Improvements for FPGrowth

2015-04-28 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-7211:


 Summary: Improvements for FPGrowth
 Key: SPARK-7211
 URL: https://issues.apache.org/jira/browse/SPARK-7211
 Project: Spark
  Issue Type: Umbrella
  Components: MLlib
Reporter: Joseph K. Bradley


This is an umbrella JIRA for listing explorations and planned improvements to 
FPGrowth and other possible algorithms for frequent pattern mining (a.k.a., 
frequent itemsets, association rules).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7210) Test matrix decompositions for speed vs. numerical stability for Gaussians

2015-04-28 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-7210:


 Summary: Test matrix decompositions for speed vs. numerical 
stability for Gaussians
 Key: SPARK-7210
 URL: https://issues.apache.org/jira/browse/SPARK-7210
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley
Priority: Minor


We currently use SVD for inverting the Gaussian's covariance matrix and 
computing the determinant.  SVD is numerically stable but slow.  We could 
experiment with Cholesky, etc. to figure out a better option, or a better 
option for certain settings.
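
For reference, a hedged Breeze sketch of the Cholesky candidate (illustrative, not current MLlib code): for a symmetric positive-definite covariance sigma = L * L.t, the log-determinant falls out of the factor's diagonal and solves reduce to triangular substitutions, which is typically much cheaper than a full SVD, at the cost of failing on (near-)singular matrices:

{code}
import breeze.linalg.{DenseMatrix, cholesky, diag, sum}
import breeze.numerics.log

// Candidate approach, sketch only: use the Cholesky factor of a symmetric
// positive-definite covariance for the log-determinant.
// log|sigma| = 2 * sum_i log(L_ii)  where  sigma = L * L.t
def logDetViaCholesky(sigma: DenseMatrix[Double]): Double = {
  val l = cholesky(sigma) // throws if sigma is not positive definite
  2.0 * sum(log(diag(l)))
}
{code}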



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7209) Adding new Manning book "Spark in Action" to the official Spark Webpage

2015-04-28 Thread Aleksandar Dragosavljevic (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksandar Dragosavljevic updated SPARK-7209:
-
Attachment: Spark in Action.jpg

Book cover

> Adding new Manning book "Spark in Action" to the official Spark Webpage
> ---
>
> Key: SPARK-7209
> URL: https://issues.apache.org/jira/browse/SPARK-7209
> Project: Spark
>  Issue Type: Task
>  Components: Documentation
>Reporter: Aleksandar Dragosavljevic
>Priority: Minor
> Attachments: Spark in Action.jpg
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Manning Publications is developing a book Spark in Action written by Marko 
> Bonaci and Petar Zecevic (http://www.manning.com/bonaci) and it would be 
> great if the book could be added to the list of books at the official Spark 
> Webpage (https://spark.apache.org/documentation.html).
> This book teaches readers to use Spark for stream and batch data processing. 
> It starts with an introduction to the Spark architecture and ecosystem 
> followed by a taste of Spark's command line interface. Readers then discover 
> the most fundamental concepts and abstractions of Spark, particularly 
> Resilient Distributed Datasets (RDDs) and the basic data transformations that 
> RDDs provide. The first part of the book also introduces you to writing Spark 
> applications using the core APIs. Next, you learn about different Spark 
> components: how to work with structured data using Spark SQL, how to process 
> near-real time data with Spark Streaming, how to apply machine learning 
> algorithms with Spark MLlib, how to apply graph algorithms on graph-shaped 
> data using Spark GraphX, and a clear introduction to Spark clustering.
> The book is already available to the public as a part of our Manning Early 
> Access Program (MEAP) where we deliver chapters to the public as soon as they 
> are written. We believe it will offer significant support to the Spark users 
> and the community.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6943) Graphically show RDD's included in a stage

2015-04-28 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518084#comment-14518084
 ] 

Andrew Or commented on SPARK-6943:
--

Yeah ideally we will have the job graph that magnifies into the stage graph. 
I'll see what I can do.

> Graphically show RDD's included in a stage
> --
>
> Key: SPARK-6943
> URL: https://issues.apache.org/jira/browse/SPARK-6943
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Reporter: Patrick Wendell
>Assignee: Andrew Or
> Attachments: DAGvisualizationintheSparkWebUI.pdf, with-closures.png, 
> with-stack-trace.png
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7209) Adding new Manning book "Spark in Action" to the official Spark Webpage

2015-04-28 Thread Aleksandar Dragosavljevic (JIRA)
Aleksandar Dragosavljevic created SPARK-7209:


 Summary: Adding new Manning book "Spark in Action" to the official 
Spark Webpage
 Key: SPARK-7209
 URL: https://issues.apache.org/jira/browse/SPARK-7209
 Project: Spark
  Issue Type: Task
  Components: Documentation
Reporter: Aleksandar Dragosavljevic
Priority: Minor


Manning Publications is developing a book Spark in Action written by Marko 
Bonaci and Petar Zecevic (http://www.manning.com/bonaci) and it would be great 
if the book could be added to the list of books at the official Spark Webpage 
(https://spark.apache.org/documentation.html).

This book teaches readers to use Spark for stream and batch data processing. It 
starts with an introduction to the Spark architecture and ecosystem followed by 
a taste of Spark's command line interface. Readers then discover the most 
fundamental concepts and abstractions of Spark, particularly Resilient 
Distributed Datasets (RDDs) and the basic data transformations that RDDs 
provide. The first part of the book also introduces you to writing Spark 
applications using the core APIs. Next, you learn about different Spark 
components: how to work with structured data using Spark SQL, how to process 
near-real time data with Spark Streaming, how to apply machine learning 
algorithms with Spark MLlib, how to apply graph algorithms on graph-shaped data 
using Spark GraphX, and a clear introduction to Spark clustering.

The book is already available to the public as a part of our Manning Early 
Access Program (MEAP) where we deliver chapters to the public as soon as they 
are written. We believe it will offer significant support to the Spark users 
and the community.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7208) Add Matrix, SparseMatrix to __all__ list in linalg.py

2015-04-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7208:
---

Assignee: Apache Spark  (was: Joseph K. Bradley)

> Add Matrix, SparseMatrix to __all__ list in linalg.py
> -
>
> Key: SPARK-7208
> URL: https://issues.apache.org/jira/browse/SPARK-7208
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.4.0
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7208) Add Matrix, SparseMatrix to __all__ list in linalg.py

2015-04-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7208:
---

Assignee: Joseph K. Bradley  (was: Apache Spark)

> Add Matrix, SparseMatrix to __all__ list in linalg.py
> -
>
> Key: SPARK-7208
> URL: https://issues.apache.org/jira/browse/SPARK-7208
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.4.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7208) Add Matrix, SparseMatrix to __all__ list in linalg.py

2015-04-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518075#comment-14518075
 ] 

Apache Spark commented on SPARK-7208:
-

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/5759

> Add Matrix, SparseMatrix to __all__ list in linalg.py
> -
>
> Key: SPARK-7208
> URL: https://issues.apache.org/jira/browse/SPARK-7208
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.4.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7208) Add Matrix, SparseMatrix to __all__ list in linalg.py

2015-04-28 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7208:
-
Summary: Add Matrix, SparseMatrix to __all__ list in linalg.py  (was: Add 
SparseMatrix to __all__ list in linalg.py)

> Add Matrix, SparseMatrix to __all__ list in linalg.py
> -
>
> Key: SPARK-7208
> URL: https://issues.apache.org/jira/browse/SPARK-7208
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.4.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7208) Add SparseMatrix to __all__ list in linalg.py

2015-04-28 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-7208:


 Summary: Add SparseMatrix to __all__ list in linalg.py
 Key: SPARK-7208
 URL: https://issues.apache.org/jira/browse/SPARK-7208
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Trivial






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7207) Add new spark.ml subpackages to SparkBuild

2015-04-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518069#comment-14518069
 ] 

Apache Spark commented on SPARK-7207:
-

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/5758

> Add new spark.ml subpackages to SparkBuild
> --
>
> Key: SPARK-7207
> URL: https://issues.apache.org/jira/browse/SPARK-7207
> Project: Spark
>  Issue Type: Bug
>  Components: Build, ML
>Affects Versions: 1.4.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> Add to project/SparkBuild.scala list of subpackages for spark.ml:
> * ml.recommendation
> * ml.regression



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7207) Add new spark.ml subpackages to SparkBuild

2015-04-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7207:
---

Assignee: Joseph K. Bradley  (was: Apache Spark)

> Add new spark.ml subpackages to SparkBuild
> --
>
> Key: SPARK-7207
> URL: https://issues.apache.org/jira/browse/SPARK-7207
> Project: Spark
>  Issue Type: Bug
>  Components: Build, ML
>Affects Versions: 1.4.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> Add to project/SparkBuild.scala list of subpackages for spark.ml:
> * ml.recommendation
> * ml.regression



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7207) Add new spark.ml subpackages to SparkBuild

2015-04-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7207:
---

Assignee: Apache Spark  (was: Joseph K. Bradley)

> Add new spark.ml subpackages to SparkBuild
> --
>
> Key: SPARK-7207
> URL: https://issues.apache.org/jira/browse/SPARK-7207
> Project: Spark
>  Issue Type: Bug
>  Components: Build, ML
>Affects Versions: 1.4.0
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>
> Add to project/SparkBuild.scala list of subpackages for spark.ml:
> * ml.recommendation
> * ml.regression



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7201) Move identifiable to ml.util

2015-04-28 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-7201.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5749
[https://github.com/apache/spark/pull/5749]

> Move identifiable to ml.util
> 
>
> Key: SPARK-7201
> URL: https://issues.apache.org/jira/browse/SPARK-7201
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 1.4.0
>
>
> It shouldn't live under spark.ml package.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7207) Add new spark.ml subpackages to SparkBuild

2015-04-28 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-7207:


 Summary: Add new spark.ml subpackages to SparkBuild
 Key: SPARK-7207
 URL: https://issues.apache.org/jira/browse/SPARK-7207
 Project: Spark
  Issue Type: Bug
  Components: Build, ML
Affects Versions: 1.4.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Minor


Add to project/SparkBuild.scala list of subpackages for spark.ml:
* ml.recommendation
* ml.regression
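
Purely as an illustration of the shape of the change (the actual structure of the package list in project/SparkBuild.scala may differ; only the two new names come from this ticket):

{code}
// Hypothetical sketch: whatever sequence in project/SparkBuild.scala
// enumerates spark.ml subpackages gains two new entries.
val mlSubpackages: Seq[String] = Seq(
  "org.apache.spark.ml",
  "org.apache.spark.ml.classification",
  "org.apache.spark.ml.recommendation", // new in this ticket
  "org.apache.spark.ml.regression"      // new in this ticket
)
{code}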



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7204) Call sites in UI are not accurate for DataFrame operations

2015-04-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7204:
---

Assignee: Apache Spark  (was: Patrick Wendell)

> Call sites in UI are not accurate for DataFrame operations
> --
>
> Key: SPARK-7204
> URL: https://issues.apache.org/jira/browse/SPARK-7204
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Patrick Wendell
>Assignee: Apache Spark
>Priority: Critical
>
> Spark core computes callsites by climbing up the stack until we reach the 
> stack frame at the boundary of user code and spark code. The way we compute 
> whether a given frame is internal (Spark) or user code does not work 
> correctly with the new dataframe API.
> Once the scope work goes in, we'll have a nicer way to express units of 
> operator scope, but until then there is a simple fix where we just make sure 
> the SQL internal classes are also skipped as we climb up the stack.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7204) Call sites in UI are not accurate for DataFrame operations

2015-04-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7204:
---

Assignee: Patrick Wendell  (was: Apache Spark)

> Call sites in UI are not accurate for DataFrame operations
> --
>
> Key: SPARK-7204
> URL: https://issues.apache.org/jira/browse/SPARK-7204
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>Priority: Critical
>
> Spark core computes callsites by climbing up the stack until we reach the 
> stack frame at the boundary of user code and spark code. The way we compute 
> whether a given frame is internal (Spark) or user code does not work 
> correctly with the new dataframe API.
> Once the scope work goes in, we'll have a nicer way to express units of 
> operator scope, but until then there is a simple fix where we just make sure 
> the SQL internal classes are also skipped as we climb up the stack.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7204) Call sites in UI are not accurate for DataFrame operations

2015-04-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518047#comment-14518047
 ] 

Apache Spark commented on SPARK-7204:
-

User 'pwendell' has created a pull request for this issue:
https://github.com/apache/spark/pull/5757

> Call sites in UI are not accurate for DataFrame operations
> --
>
> Key: SPARK-7204
> URL: https://issues.apache.org/jira/browse/SPARK-7204
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>Priority: Critical
>
> Spark core computes callsites by climbing up the stack until we reach the 
> stack frame at the boundary of user code and spark code. The way we compute 
> whether a given frame is internal (Spark) or user code does not work 
> correctly with the new dataframe API.
> Once the scope work goes in, we'll have a nicer way to express units of 
> operator scope, but until then there is a simple fix where we just make sure 
> the SQL internal classes are also skipped as we climb up the stack.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-5014) GaussianMixture (GMM) improvements

2015-04-28 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5014:
-
Comment: was deleted

(was: No need for umbrella JIRA)

> GaussianMixture (GMM) improvements
> --
>
> Key: SPARK-5014
> URL: https://issues.apache.org/jira/browse/SPARK-5014
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>
> This is an umbrella JIRA for improvements to Gaussian Mixture Models (GMMs).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7206) Gaussian Mixture Model (GMM) improvements

2015-04-28 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-7206:


 Summary: Gaussian Mixture Model (GMM) improvements
 Key: SPARK-7206
 URL: https://issues.apache.org/jira/browse/SPARK-7206
 Project: Spark
  Issue Type: Umbrella
  Components: MLlib
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley


This is an umbrella JIRA for listing improvements for GMMs:
* planned improvements
* optional/experimental work
* tests for verifying scalability



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6756) Add compress() to Vector

2015-04-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518008#comment-14518008
 ] 

Apache Spark commented on SPARK-6756:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/5756

> Add compress() to Vector
> 
>
> Key: SPARK-6756
> URL: https://issues.apache.org/jira/browse/SPARK-6756
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> Add compress() to Vector that automatically converts the underlying vector to 
> dense or sparse based on the number of non-zeros.
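
A hedged sketch of the proposed behavior follows; the 1.5x cutover factor and the cost model are assumptions for illustration, not necessarily what the PR merges:

{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Sketch only: choose the cheaper representation from the non-zero count.
// A sparse entry stores an index plus a value, hence the ~1.5x factor
// (an assumption here) before sparse wins over dense.
def compress(v: Vector): Vector = {
  val values = v.toArray
  val nnz = values.count(_ != 0.0)
  if (1.5 * (nnz + 1.0) < values.length) {
    val nonZeros = for (i <- values.indices if values(i) != 0.0)
      yield (i, values(i))
    Vectors.sparse(values.length, nonZeros)
  } else {
    Vectors.dense(values)
  }
}
{code}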



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6756) Add compress() to Vector

2015-04-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6756:
---

Assignee: Xiangrui Meng  (was: Apache Spark)

> Add compress() to Vector
> 
>
> Key: SPARK-6756
> URL: https://issues.apache.org/jira/browse/SPARK-6756
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> Add compress() to Vector that automatically converts the underlying vector to 
> dense or sparse based on the number of non-zeros.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6756) Add compress() to Vector

2015-04-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6756:
---

Assignee: Apache Spark  (was: Xiangrui Meng)

> Add compress() to Vector
> 
>
> Key: SPARK-6756
> URL: https://issues.apache.org/jira/browse/SPARK-6756
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>
> Add compress() to Vector that automatically converts the underlying vector to 
> dense or sparse based on the number of non-zeros.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5338) Support cluster mode with Mesos

2015-04-28 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5338:
-
Affects Version/s: 1.0.0

> Support cluster mode with Mesos
> ---
>
> Key: SPARK-5338
> URL: https://issues.apache.org/jira/browse/SPARK-5338
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 1.0.0
>Reporter: Timothy Chen
> Fix For: 1.4.0
>
>
> Currently using Spark with Mesos, the only supported deployment is client 
> mode.
> It is also useful to have a cluster mode deployment that can be shared and 
> long running.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5338) Support cluster mode with Mesos

2015-04-28 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-5338.

  Resolution: Fixed
   Fix Version/s: 1.4.0
Assignee: Timothy Chen
Target Version/s: 1.4.0

> Support cluster mode with Mesos
> ---
>
> Key: SPARK-5338
> URL: https://issues.apache.org/jira/browse/SPARK-5338
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 1.0.0
>Reporter: Timothy Chen
>Assignee: Timothy Chen
> Fix For: 1.4.0
>
>
> Currently using Spark with Mesos, the only supported deployment is client 
> mode.
> It is also useful to have a cluster mode deployment that can be shared and 
> long running.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7205) Support local ivy cache in --packages

2015-04-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517988#comment-14517988
 ] 

Apache Spark commented on SPARK-7205:
-

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/5755

> Support local ivy cache in --packages
> -
>
> Key: SPARK-7205
> URL: https://issues.apache.org/jira/browse/SPARK-7205
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Reporter: Burak Yavuz
>Priority: Critical
> Fix For: 1.4.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7205) Support local ivy cache in --packages

2015-04-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7205:
---

Assignee: (was: Apache Spark)

> Support local ivy cache in --packages
> -
>
> Key: SPARK-7205
> URL: https://issues.apache.org/jira/browse/SPARK-7205
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Reporter: Burak Yavuz
>Priority: Critical
> Fix For: 1.4.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7205) Support local ivy cache in --packages

2015-04-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7205:
---

Assignee: Apache Spark

> Support local ivy cache in --packages
> -
>
> Key: SPARK-7205
> URL: https://issues.apache.org/jira/browse/SPARK-7205
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>Priority: Critical
> Fix For: 1.4.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7204) Call sites in UI are not accurate for DataFrame operations

2015-04-28 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-7204:
--

 Summary: Call sites in UI are not accurate for DataFrame operations
 Key: SPARK-7204
 URL: https://issues.apache.org/jira/browse/SPARK-7204
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Critical


Spark core computes callsites by climbing up the stack until we reach the stack 
frame at the boundary of user code and spark code. The way we compute whether a 
given frame is internal (Spark) or user code does not work correctly with the 
new dataframe API.

Once the scope work goes in, we'll have a nicer way to express units of 
operator scope, but until then there is a simple fix where we just make sure 
the SQL internal classes are also skipped as we climb up the stack.
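
A hedged sketch of the interim fix described above; the package prefixes are illustrative, and the real check lives in Spark's call-site utility (org.apache.spark.util.Utils), which may match differently:

{code}
// Illustrative frame filter: when climbing the stack for a call site,
// treat SQL internals as Spark code too, so the reported call site is
// the first genuine user frame. The prefix list is an assumption.
def isSparkInternal(frame: StackTraceElement): Boolean = {
  val c = frame.getClassName
  c.startsWith("org.apache.spark.rdd.") ||
  c.startsWith("org.apache.spark.sql.") || // the addition this fix makes
  c.startsWith("scala.")
}

def firstUserFrame(): Option[StackTraceElement] =
  Thread.currentThread.getStackTrace.drop(1).find(f => !isSparkInternal(f))
{code}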



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7205) Support local ivy cache in --packages

2015-04-28 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-7205:
--

 Summary: Support local ivy cache in --packages
 Key: SPARK-7205
 URL: https://issues.apache.org/jira/browse/SPARK-7205
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Reporter: Burak Yavuz
Priority: Critical
 Fix For: 1.4.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)

2015-04-28 Thread koert kuipers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517946#comment-14517946
 ] 

koert kuipers edited comment on SPARK-3655 at 4/28/15 8:19 PM:
---

since the last pullreq for this ticket i created spark-sorted (based on 
suggestions from imran), a small library for spark that supports the target 
features of this ticket, but without the burden of having to be fully 
compatible with the current spark api conventions (with regards to ordering 
being implicit).
i also got a chance to catch up with sandy at spark summit east and we 
exchanged some emails afterward about this jira ticket and possible design 
choices.

so based on those experiences i think there are better alternatives than the 
current pullreq (https://github.com/apache/spark/pull/3632), and i will close 
it. the pullreq does bring secondary sort to spark, but only in memory, which 
is a very limited feature (since if the values can be stored in memory then 
sorting after the shuffle isn't really that hard, just wasteful).

instead of the current pullreq i see 2 alternatives:
1) a new pullreq that introduces the mapStream api, which is very similar to 
the reduce operation as we know it in hadoop: a sorted streaming reduce. Its 
signature would be something like this on RDD[(K, V)]:
{noformat}
  def mapStreamByKey[W](partitioner: Partitioner, f: Iterator[V] => 
Iterator[W])(implicit o1: Ordering[K], o2: Ordering[V]): RDD[(K, W)]
{noformat}
(note that the implicits would not actually be on the method as shown here, but 
on a class conversion, similar to how PairRDDFunctions works.)

2) don't do anything. the functionality this jira targets is already available 
in the small spark-sorted library which is available on spark-packages, and 
that's good enough.



was (Author: koert):
since the last pullreq for this ticket i created spark-sorted (based on 
suggestions from imran), a small library for spark that supports the target 
features of this ticket, but without the burden of having to be fully 
compatible with the current spark api conventions (with regards to ordering 
being implicit).
i also got a chance to catch up with sandy at spark summit east and we 
exchanged some emails afterward about this jira ticket and possible design 
choices.

so based on those experiences i think there are better alternatives than the 
current pullreq (https://github.com/apache/spark/pull/3632), and i will close 
it. the pullreq does bring secondary sort to spark, but only in memory, which 
is a very limited feature (since if the values can be stored in memory then 
sorting after the shuffle isn't really that hard, just wasteful).

instead of the current pullreq i see 2 alternatives:
1) a new pullreq that introduces the mapStream api, which is very similar to 
the reduce operation as we know it in hadoop: an sorted streaming reduce. Its 
signature would be something like this on RDD[(K, V)]:
{noformat}
  def mapStreamByKey[W](partitioner: Partitioner, f: Iterator[V] => 
Iterator[W])(implicit o1: Ordering[K], o2: Ordering[V]): RDD[(K, W)]
{noformat}
(note that the implicits would not actually be on the method as shown here, but 
on a class conversion, similar to how PairRDDFunctions works.

2) don't to anything. the functionality this jira targets is already available 
in the small smart-sorted library which is available on spark-packages, and 
that's good enough.


> Support sorting of values in addition to keys (i.e. secondary sort)
> ---
>
> Key: SPARK-3655
> URL: https://issues.apache.org/jira/browse/SPARK-3655
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.2.0
>Reporter: koert kuipers
>Assignee: Koert Kuipers
>
> Now that spark has a sort based shuffle, can we expect a secondary sort soon? 
> There are some use cases where getting a sorted iterator of values per key is 
> helpful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)

2015-04-28 Thread koert kuipers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517946#comment-14517946
 ] 

koert kuipers edited comment on SPARK-3655 at 4/28/15 8:18 PM:
---

since the last pullreq for this ticket i created spark-sorted (based on 
suggestions from imran), a small library for spark that supports the target 
features of this ticket, but without the burden of having to be fully 
compatible with the current spark api conventions (with regards to ordering 
being implicit).
i also got a chance to catch up with sandy at spark summit east and we 
exchanged some emails afterward about this jira ticket and possible design 
choices.

so based on those experiences i think there are better alternatives than the 
current pullreq (https://github.com/apache/spark/pull/3632), and i will close 
it. the pullreq does bring secondary sort to spark, but only in memory, which 
is a very limited feature (since if the values can be stored in memory then 
sorting after the shuffle isn't really that hard, just wasteful).

instead of the current pullreq i see 2 alternatives:
1) a new pullreq that introduces the mapStream api, which is very similar to 
the reduce operation as we know it in hadoop: an sorted streaming reduce. Its 
signature would be something like this on RDD[(K, V)]:
{noformat}
  def mapStreamByKey[W](partitioner: Partitioner, f: Iterator[V] => 
Iterator[W])(implicit o1: Ordering[K], o2: Ordering[V]): RDD[(K, W)]
{noformat}
(note that the implicits would not actually be on the method as shown here, but 
on a class conversion, similar to how PairRDDFunctions works.

2) don't to anything. the functionality this jira targets is already available 
in the small smart-sorted library which is available on spark-packages, and 
that's good enough.



was (Author: koert):
since the last pullreq for this ticket i created spark-sorted (based on 
suggestions from imran), a small library for spark that supports the target 
features of this ticket, but without the burden of having to be fully 
compatible with the current spark api conventions (with regards to ordering 
being implicit).
i also got a chance to catch up with sandy at spark summit east and we 
exchanged some emails afterward about this jira ticket and possible design 
choices.

so based on those experiences i think there are better alternatives than the 
current pullreq (https://github.com/apache/spark/pull/3632), and i will close 
it. the pullreq does bring secondary sort to spark, but only in memory, which 
is a very limited feature (since if the values can be stored in memory then 
sorting after the shuffle isn't really that hard, just wasteful).

instead of the current pullreq i see 2 alternatives:
1) a new pullreq that introduces the mapStream api, which is very similar to 
the reduce operation as we know it in hadoop: an sorted streaming reduce. Its 
signature would be something like this on RDD[(K, V)]:
  def mapStreamByKey[W](partitioner: Partitioner, f: Iterator[V] => 
Iterator[W])(implicit o1: Ordering[K], o2: Ordering[V]): RDD[(K, W)]
(note that the implicits would not actually be on the method as shown here, but 
on a class conversion, similar to how PairRDDFunctions works.

2) don't to anything. the functionality this jira targets is already available 
in the small smart-sorted library which is available on spark-packages, and 
that's good enough.


> Support sorting of values in addition to keys (i.e. secondary sort)
> ---
>
> Key: SPARK-3655
> URL: https://issues.apache.org/jira/browse/SPARK-3655
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.2.0
>Reporter: koert kuipers
>Assignee: Koert Kuipers
>
> Now that spark has a sort based shuffle, can we expect a secondary sort soon? 
> There are some use cases where getting a sorted iterator of values per key is 
> helpful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)

2015-04-28 Thread koert kuipers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517946#comment-14517946
 ] 

koert kuipers commented on SPARK-3655:
--

since the last pullreq for this ticket i created spark-sorted (based on 
suggestions from imran), a small library for spark that supports the target 
features of this ticket, but without the burden of having to be fully 
compatible with the current spark api conventions (with regards to ordering 
being implicit).
i also got a chance to catch up with sandy at spark summit east and we 
exchanged some emails afterward about this jira ticket and possible design 
choices.

so based on those experiences i think there are better alternatives than the 
current pullreq (https://github.com/apache/spark/pull/3632), and i will close 
it. the pullreq does bring secondary sort to spark, but only in memory, which 
is a very limited feature (since if the values can be stored in memory then 
sorting after the shuffle isn't really that hard, just wasteful).

instead of the current pullreq i see 2 alternatives:
1) a new pullreq that introduces the mapStream api, which is very similar to 
the reduce operation as we know it in hadoop: an sorted streaming reduce. Its 
signature would be something like this on RDD[(K, V)]:
  def mapStreamByKey[W](partitioner: Partitioner, f: Iterator[V] => 
Iterator[W])(implicit o1: Ordering[K], o2: Ordering[V]): RDD[(K, W)]
(note that the implicits would not actually be on the method as shown here, but 
on a class conversion, similar to how PairRDDFunctions works.

2) don't to anything. the functionality this jira targets is already available 
in the small smart-sorted library which is available on spark-packages, and 
that's good enough.
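
For concreteness, a hedged usage sketch of alternative 1. The enrichment below only mirrors the proposed signature (the method does not exist in Spark, so its body is stubbed); the usage shows the intended streaming, per-key, value-ordered reduce:

{code}
import org.apache.spark.{HashPartitioner, Partitioner}
import org.apache.spark.rdd.RDD

// Proposed API mirrored as an enrichment; body intentionally unimplemented.
implicit class StreamByKey[K, V](rdd: RDD[(K, V)]) {
  def mapStreamByKey[W](partitioner: Partitioner, f: Iterator[V] => Iterator[W])(
      implicit o1: Ordering[K], o2: Ordering[V]): RDD[(K, W)] = ???
}

// Hypothetical usage: per key, walk values in ascending order and emit a
// running maximum, without ever materializing the value list in memory.
def runningMax(pairs: RDD[(String, Int)]): RDD[(String, Int)] =
  pairs.mapStreamByKey(new HashPartitioner(8),
    (it: Iterator[Int]) => it.scanLeft(Int.MinValue)(math.max).drop(1))
{code}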


> Support sorting of values in addition to keys (i.e. secondary sort)
> ---
>
> Key: SPARK-3655
> URL: https://issues.apache.org/jira/browse/SPARK-3655
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.2.0
>Reporter: koert kuipers
>Assignee: Koert Kuipers
>
> Now that spark has a sort based shuffle, can we expect a secondary sort soon? 
> There are some use cases where getting a sorted iterator of values per key is 
> helpful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7178) Improve DataFrame documentation and code samples

2015-04-28 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517858#comment-14517858
 ] 

Chris Fregly edited comment on SPARK-7178 at 4/28/15 8:07 PM:
--

added these to the forums

AND and OR:  
https://forums.databricks.com/questions/758/how-do-i-use-and-and-or-within-my-dataframe-operat.html

Nested Map Columns in DataFrames:
https://forums.databricks.com/questions/764/how-do-i-create-a-dataframe-with-nested-map-column.html

Casting columns of DataFrames:
https://forums.databricks.com/questions/767/how-do-i-cast-within-a-dataframe.html


was (Author: cfregly):
added this to the forums to address the AND and OR:  
https://forums.databricks.com/questions/758/how-do-i-use-and-and-or-within-my-dataframe-operat.html

> Improve DataFrame documentation and code samples
> 
>
> Key: SPARK-7178
> URL: https://issues.apache.org/jira/browse/SPARK-7178
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Chris Fregly
>  Labels: dataframe
>
> AND and OR are not straightforward when using the new DataFrame API.
> the current convention - accepted by Pandas users - is to use the bitwise & 
> and | instead of AND and OR.  when using these, however, you need to wrap 
> each expression in parentheses to prevent the bitwise operator from 
> taking precedence.
> also, working with StructTypes is a bit confusing.  the following link:  
> https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
>  (Python tab) implies that you can work with tuples directly when creating a 
> DataFrame.
> however, the following code errors out unless we explicitly use Row's:
> {code}
> from pyspark.sql import Row
> from pyspark.sql.types import *
> # The schema is encoded in a string.
> schemaString = "a"
> fields = [StructField(field_name, MapType(StringType(),IntegerType())) for 
> field_name in schemaString.split()]
> schema = StructType(fields)
> df = sqlContext.createDataFrame([Row(a={'b': 1})], schema)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7202) Add SparseMatrixPickler to SerDe

2015-04-28 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517905#comment-14517905
 ] 

Joseph K. Bradley edited comment on SPARK-7202 at 4/28/15 8:01 PM:
---

[~MechCoder]   I just made an umbrella JIRA for Python local linear algebra.  
Please ping me if you find/make other JIRAs which should go there.  Thanks!


was (Author: josephkb):
@MechCoder   I just made an umbrella JIRA for Python local linear algebra.  
Please ping me if you find/make other JIRAs which should go there.  Thanks!

> Add SparseMatrixPickler to SerDe
> 
>
> Key: SPARK-7202
> URL: https://issues.apache.org/jira/browse/SPARK-7202
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Manoj Kumar
>Priority: Minor
>
> We need a SparseMatrixPickler similar to the existing DenseMatrixPickler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7202) Add SparseMatrixPickler to SerDe

2015-04-28 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517905#comment-14517905
 ] 

Joseph K. Bradley commented on SPARK-7202:
--

@MechCoder   I just made an umbrella JIRA for Python local linear algebra.  
Please ping me if you find/make other JIRAs which should go there.  Thanks!

> Add SparseMatrixPickler to SerDe
> 
>
> Key: SPARK-7202
> URL: https://issues.apache.org/jira/browse/SPARK-7202
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Manoj Kumar
>Priority: Minor
>
> We need a SparseMatrixPickler similar to the existing DenseMatrixPickler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7202) Add SparseMatrixPickler to SerDe

2015-04-28 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7202:
-
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-7203

> Add SparseMatrixPickler to SerDe
> 
>
> Key: SPARK-7202
> URL: https://issues.apache.org/jira/browse/SPARK-7202
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Manoj Kumar
>Priority: Minor
>
> We need a SparseMatrixPickler similar to the existing DenseMatrixPickler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7203) Python API for local linear algebra

2015-04-28 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-7203:


 Summary: Python API for local linear algebra
 Key: SPARK-7203
 URL: https://issues.apache.org/jira/browse/SPARK-7203
 Project: Spark
  Issue Type: Umbrella
  Components: MLlib, PySpark
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Critical


This is an umbrella JIRA for the Python API for local linear algebra, including:
* Vector, Matrix, and their subclasses
* helper methods and utilities
* interactions with numpy, scipy



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7161) Provide REST api to download event logs from History Server

2015-04-28 Thread Kostas Sakellis (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kostas Sakellis updated SPARK-7161:
---
Component/s: (was: Streaming)
 Spark Core

> Provide REST api to download event logs from History Server
> ---
>
> Key: SPARK-7161
> URL: https://issues.apache.org/jira/browse/SPARK-7161
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.3.1
>Reporter: Hari Shreedharan
>Priority: Minor
>
> The idea is to tar up the logs and return the tar.gz file using a REST api. 
> This can be used for debugging even after the app is done.
> I am planning to take a look at this.
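
A minimal sketch of the packaging step, assuming Apache commons-compress is on the classpath; authentication, the REST wiring, and streaming straight to the HTTP response are all omitted:

{code}
import java.io.{BufferedOutputStream, File, FileInputStream, FileOutputStream}
import java.util.zip.GZIPOutputStream
import org.apache.commons.compress.archivers.tar.{TarArchiveEntry, TarArchiveOutputStream}
import org.apache.commons.compress.utils.IOUtils

// Pack one application's event log directory into a tar.gz that a REST
// endpoint could then return. Sketch only; a real endpoint would write to
// the response stream rather than a temporary file.
def tarEventLogs(logDir: File, out: File): Unit = {
  val tar = new TarArchiveOutputStream(
    new GZIPOutputStream(new BufferedOutputStream(new FileOutputStream(out))))
  try {
    for (f <- Option(logDir.listFiles()).getOrElse(Array.empty[File]) if f.isFile) {
      tar.putArchiveEntry(new TarArchiveEntry(f, f.getName))
      val in = new FileInputStream(f)
      try IOUtils.copy(in, tar) finally in.close()
      tar.closeArchiveEntry()
    }
  } finally tar.close()
}
{code}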



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4414) SparkContext.wholeTextFiles Doesn't work with S3 Buckets

2015-04-28 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517886#comment-14517886
 ] 

Tristan Nixon commented on SPARK-4414:
--

Thanks, [~petedmarsh], I was having this same issue. It worked fine on my OS X 
laptop but not on an EC2 Linux instance I set up with the spark-ec2 script. My 
local version was built with Hadoop 2.4, but the default for systems configured 
from the script is Hadoop 1. It seems that this problem comes down to the S3 
drivers in the different versions of Hadoop.

I destroyed and then re-launched my ec2 cluster using the 
--hadoop-major-version=2 option, and the resulting version works!

Perhaps support for Hadoop 1 should be deprecated? At least, it probably should 
no longer be the default version used in the spark-ec2 scripts.

> SparkContext.wholeTextFiles Doesn't work with S3 Buckets
> 
>
> Key: SPARK-4414
> URL: https://issues.apache.org/jira/browse/SPARK-4414
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Pedro Rodriguez
>Priority: Critical
>
> SparkContext.wholeTextFiles does not read files which SparkContext.textFile 
> can read. Below are general steps to reproduce, my specific case is following 
> that on a git repo.
> Steps to reproduce.
> 1. Create Amazon S3 bucket, make public with multiple files
> 2. Attempt to read bucket with
> sc.wholeTextFiles("s3n://mybucket/myfile.txt")
> 3. Spark returns the following error, even if the file exists.
> Exception in thread "main" java.io.FileNotFoundException: File does not 
> exist: /myfile.txt
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517)
>   at 
> org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.(CombineFileInputFormat.java:489)
> 4. Change the call to
> sc.textFile("s3n://mybucket/myfile.txt")
> and there is no error message, the application should run fine.
> There is a question on StackOverflow as well on this:
> http://stackoverflow.com/questions/26258458/sparkcontext-wholetextfiles-java-io-filenotfoundexception-file-does-not-exist
> This is link to repo/lines of code. The uncommented call doesn't work, the 
> commented call works as expected:
> https://github.com/EntilZha/nips-lda-spark/blob/45f5ad1e2646609ef9d295a0954fbefe84111d8a/src/main/scala/NipsLda.scala#L13-L19
> It would be easy to use textFile with a multifile argument, but this should 
> work correctly for s3 bucket files as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6994) Allow to fetch field values by name in sql.Row

2015-04-28 Thread Shuai Zheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517882#comment-14517882
 ] 

Shuai Zheng edited comment on SPARK-6994 at 4/28/15 7:48 PM:
-

I created one more pull request:
https://github.com/apache/spark/pull/5754
It adds a few helper methods that access a field value by name with a typed 
return:
getBoolean(fieldName: String)
getByte(fieldName: String)
getShort(fieldName: String)
getInt(fieldName: String)
getLong(fieldName: String)
getFloat(fieldName: String)
getDouble(fieldName: String)
getString(fieldName: String)
getDecimal(fieldName: String)

This is a trial change to make Java developers' lives easier (like mine...*_*), 
since Java callers don't benefit from the generic getAs[T].

Because this is not really a fix, I think I should not re-open the ticket; 
just updating here.


was (Author: szheng79):
I created another pull request:
https://github.com/apache/spark/pull/5754
It adds a few helper methods to access field values by name with typed return 
values. Basically it creates:
getBoolean(fieldName: String)
getByte(fieldName: String)
getShort(fieldName: String)
getInt(fieldName: String)
getLong(fieldName: String)
getFloat(fieldName: String)
getDouble(fieldName: String)
getString(fieldName: String)
getDecimal(fieldName: String)

This is a trial change to make Java developers' lives easier (like me...*_*), 
since Java callers won't benefit from the generics in getAs[T].
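
A minimal sketch of what such typed by-name accessors amount to (a 
hypothetical wrapper, not the code in PR 5754; it assumes the Row carries its 
schema):

{code}
import org.apache.spark.sql.Row

object RowByNameSyntax {
  implicit class RowByName(row: Row) {
    // Resolve a field name to its ordinal via the row's schema.
    private def indexOf(fieldName: String): Int = {
      val i = row.schema.fields.indexWhere(_.name == fieldName)
      require(i >= 0, s"No field named '$fieldName'")
      i
    }
    def getInt(fieldName: String): Int = row.getInt(indexOf(fieldName))
    def getLong(fieldName: String): Long = row.getLong(indexOf(fieldName))
    def getDouble(fieldName: String): Double = row.getDouble(indexOf(fieldName))
    def getString(fieldName: String): String = row.getString(indexOf(fieldName))
    // ...and likewise for the other typed getters listed above.
  }
}
{code}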

> Allow to fetch field values by name in sql.Row
> --
>
> Key: SPARK-6994
> URL: https://issues.apache.org/jira/browse/SPARK-6994
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: vidmantas zemleris
>Assignee: vidmantas zemleris
>Priority: Minor
>  Labels: dataframe, row
> Fix For: 1.4.0
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It looked weird that up to now there was no way in Spark's Scala API to 
> access fields of `DataFrame/sql.Row` by name, only by their index.
> This issue tries to solve that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6994) Allow to fetch field values by name in sql.Row

2015-04-28 Thread Shuai Zheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517882#comment-14517882
 ] 

Shuai Zheng commented on SPARK-6994:


I created another pull request:
https://github.com/apache/spark/pull/5754
It adds a few helper methods to access field values by name with typed return 
values. Basically it creates:
getBoolean(fieldName: String)
getByte(fieldName: String)
getShort(fieldName: String)
getInt(fieldName: String)
getLong(fieldName: String)
getFloat(fieldName: String)
getDouble(fieldName: String)
getString(fieldName: String)
getDecimal(fieldName: String)

This is a trial change to make Java developers' lives easier (like me...*_*), 
since Java callers won't benefit from the generics in getAs[T].

> Allow to fetch field values by name in sql.Row
> --
>
> Key: SPARK-6994
> URL: https://issues.apache.org/jira/browse/SPARK-6994
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: vidmantas zemleris
>Assignee: vidmantas zemleris
>Priority: Minor
>  Labels: dataframe, row
> Fix For: 1.4.0
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It looked weird that up to now there was no way in Spark's Scala API to 
> access fields of `DataFrame/sql.Row` by name, only by their index.
> This issue tries to solve that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6994) Allow to fetch field values by name in sql.Row

2015-04-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517881#comment-14517881
 ] 

Apache Spark commented on SPARK-6994:
-

User 'szheng79' has created a pull request for this issue:
https://github.com/apache/spark/pull/5754

> Allow to fetch field values by name in sql.Row
> --
>
> Key: SPARK-6994
> URL: https://issues.apache.org/jira/browse/SPARK-6994
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: vidmantas zemleris
>Assignee: vidmantas zemleris
>Priority: Minor
>  Labels: dataframe, row
> Fix For: 1.4.0
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It looked weird that up to now there was no way in Spark's Scala API to 
> access fields of `DataFrame/sql.Row` by name, only by their index.
> This issue tries to solve that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6197) handle json parse exception for eventlog file not finished writing

2015-04-28 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517866#comment-14517866
 ] 

Andrew Or commented on SPARK-6197:
--

https://github.com/apache/spark/pull/5736

> handle json parse exception for eventlog file not finished writing 
> ---
>
> Key: SPARK-6197
> URL: https://issues.apache.org/jira/browse/SPARK-6197
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.3.0
>Reporter: Zhang, Liye
>Assignee: Zhang, Liye
>Priority: Minor
> Fix For: 1.3.1, 1.4.0
>
>
> This is a follow-up JIRA for 
> [SPARK-6107|https://issues.apache.org/jira/browse/SPARK-6107]. In 
> [SPARK-6107|https://issues.apache.org/jira/browse/SPARK-6107], the web UI can 
> display event log files with the suffix *.inprogress*. However, the event log 
> file may not be fully written in some abnormal cases (e.g. Ctrl+C), in which 
> case the file may be truncated at the last line, leaving that line in invalid 
> JSON format, which causes a JSON parse exception when the file is read.
> For this case, we can simply ignore the content of the last line, since the 
> history shown on the web for abnormal cases is only a reference for the user: 
> it demonstrates the state of the app before it terminated abnormally (we 
> cannot guarantee the history shows exactly the last moment when the app 
> encountered the abnormal situation).
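
A minimal sketch of the "ignore the truncated last line" idea described above 
(a hypothetical standalone reader, not the actual patch; it assumes one JSON 
event per line, which is how the event logs are written):

{code}
import scala.io.Source
import scala.util.{Failure, Success, Try}
import org.json4s.JValue
import org.json4s.jackson.JsonMethods.parse

object TolerantEventLogReader {
  // Parse every line as JSON, tolerating a truncated final line only.
  def readEvents(path: String): Seq[JValue] = {
    val lines = Source.fromFile(path).getLines().toVector
    lines.zipWithIndex.flatMap { case (line, i) =>
      Try(parse(line)) match {
        case Success(json) => Some(json)
        case Failure(_) if i == lines.size - 1 => None // last line may be cut off
        case Failure(e) => throw e // corruption elsewhere is still an error
      }
    }
  }
}
{code}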



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6197) handle json parse exception for eventlog file not finished writing

2015-04-28 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6197:
-
Fix Version/s: 1.3.1

> handle json parse exception for eventlog file not finished writing 
> ---
>
> Key: SPARK-6197
> URL: https://issues.apache.org/jira/browse/SPARK-6197
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.3.0
>Reporter: Zhang, Liye
>Assignee: Zhang, Liye
>Priority: Minor
> Fix For: 1.3.1, 1.4.0
>
>
> This is a follow-up JIRA for 
> [SPARK-6107|https://issues.apache.org/jira/browse/SPARK-6107]. In 
> [SPARK-6107|https://issues.apache.org/jira/browse/SPARK-6107], the web UI can 
> display event log files with the suffix *.inprogress*. However, the event log 
> file may not be fully written in some abnormal cases (e.g. Ctrl+C), in which 
> case the file may be truncated at the last line, leaving that line in invalid 
> JSON format, which causes a JSON parse exception when the file is read.
> For this case, we can simply ignore the content of the last line, since the 
> history shown on the web for abnormal cases is only a reference for the user: 
> it demonstrates the state of the app before it terminated abnormally (we 
> cannot guarantee the history shows exactly the last moment when the app 
> encountered the abnormal situation).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7178) Improve DataFrame documentation and code samples

2015-04-28 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517858#comment-14517858
 ] 

Chris Fregly commented on SPARK-7178:
-

Added a forum post to address the AND and OR question: 
https://forums.databricks.com/questions/758/how-do-i-use-and-and-or-within-my-dataframe-operat.html
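
A minimal sketch of the parenthesization pitfall described in the issue below 
(hypothetical data; in Python, & binds tighter than the comparisons):

{code}
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local", "and-or-demo")
df = SQLContext(sc).createDataFrame([(1, 5), (3, 2)], ["a", "b"])

# Works: each comparison is parenthesized before applying &.
df.filter((df.a > 1) & (df.b < 4)).show()

# Fails: 1 & df.b is evaluated first, so the comparisons go wrong.
# df.filter(df.a > 1 & df.b < 4)
{code}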

> Improve DataFrame documentation and code samples
> 
>
> Key: SPARK-7178
> URL: https://issues.apache.org/jira/browse/SPARK-7178
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Chris Fregly
>  Labels: dataframe
>
> AND and OR are not straightforward when using the new DataFrame API.
> The current convention - accepted by Pandas users - is to use the bitwise & 
> and | instead of AND and OR. When using these, however, you need to wrap 
> each comparison in parentheses to keep the bitwise operator from binding 
> too tightly.
> Also, working with StructTypes is a bit confusing. The following link: 
> https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
> (Python tab) implies that you can work with tuples directly when creating a 
> DataFrame.
> However, the following code errors out unless we explicitly use Rows:
> {code}
> from pyspark.sql import Row
> from pyspark.sql.types import *
> # The schema is encoded in a string.
> schemaString = "a"
> fields = [StructField(field_name, MapType(StringType(),IntegerType())) for 
> field_name in schemaString.split()]
> schema = StructType(fields)
> df = sqlContext.createDataFrame([Row(a={'b': 1})], schema)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7

2015-04-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5389:
-
Component/s: Windows
 PySpark

> spark-shell.cmd does not run from DOS Windows 7
> ---
>
> Key: SPARK-5389
> URL: https://issues.apache.org/jira/browse/SPARK-5389
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Shell, Windows
>Affects Versions: 1.2.0
> Environment: Windows 7
>Reporter: Yana Kadiyska
> Attachments: SparkShell_Win7.JPG, spark_bug.png
>
>
> spark-shell.cmd crashes in a DOS prompt on Windows 7 but works fine under 
> PowerShell. It works fine for me in v1.1, so this is new in Spark 1.2.
> Marking as trivial since calling spark-shell2.cmd also works fine.
> Attaching a screenshot since the error isn't very useful:
> {code}
> spark-1.2.0-bin-cdh4>bin\spark-shell.cmd
> else was unexpected at this time.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7202) Add SparseMatrixPickler to SerDe

2015-04-28 Thread Manoj Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Kumar updated SPARK-7202:
---
Priority: Minor  (was: Major)

> Add SparseMatrixPickler to SerDe
> 
>
> Key: SPARK-7202
> URL: https://issues.apache.org/jira/browse/SPARK-7202
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Reporter: Manoj Kumar
>Priority: Minor
>
> We need a SparseMatrixPickler similar to the existing DenseMatrixPickler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7195) Can't start spark shell or pyspark in Windows 7

2015-04-28 Thread Mark Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517826#comment-14517826
 ] 

Mark Smiley commented on SPARK-7195:


Sean,

I added my bug as a comment to the old bug and attached my file there.

Can you add PySpark as one of the components involved?  I don't have permission 
to do that.

Thanks,
Mark



> Can't start spark shell or pyspark in Windows 7
> ---
>
> Key: SPARK-7195
> URL: https://issues.apache.org/jira/browse/SPARK-7195
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Shell
>Affects Versions: 1.3.1
> Environment: Windows 7, Java 8 (1.8.0_31) or Java 7 (1.7.0_79), Scala 
> 2.11.6, Python 2.7
>Reporter: Mark Smiley
> Attachments: spark_bug.png
>
>
> cd\spark\bin dir
> spark-shell
> yields the following error:
> find: 'version': No such file or directory
> else was unexpected at this time
> Same error with 
> spark-shell2.cmd
> The PySpark shell starts, but with errors, and doesn't work properly once 
> started (e.g., it can't find sc). I can send a screenshot of the errors on request.
> Using Spark 1.3.1 for Hadoop 2.6 binary
> Note: Hadoop not installed on machine.
> Scala works by itself, Python works by itself
> Java works fine (I use it all the time)
> Based on another comment, tried Java 7 (1.7.0_79), but it made no difference 
> (same error).
> JAVA_HOME = C:\jdk1.8.0\bin
> C:\jdk1.8.0\bin\;C:\Program Files 
> (x86)\scala\bin;C:\Python27;c:\Rtools\bin;c:\Rtools\gcc-4.6.3\bin;C:\Oracle\product64\12.1.0\client_1\bin;C:\Oracle\product\12.1.0\client_1\bin;C:\ProgramData\Oracle\Java\javapath;C:\Program
>  Files (x86)\Intel\iCLS Client\;C:\Program Files\Intel\iCLS 
> Client\;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program
>  Files\Intel\Intel(R) Management Engine Components\DAL;C:\Program 
> Files\Intel\Intel(R) Management Engine Components\IPT;C:\Program Files 
> (x86)\Intel\Intel(R) Management Engine Components\DAL;C:\Program Files 
> (x86)\Intel\Intel(R) Management Engine Components\IPT;C:\Program 
> Files\Dell\Dell Data Protection\Access\Advanced\Wave\Gemalto\Access 
> Client\v5\;C:\Program Files (x86)\NTRU Cryptosystems\NTRU TCG Software 
> Stack\bin\;C:\Program Files\NTRU Cryptosystems\NTRU TCG Software 
> Stack\bin\;C:\Program Files (x86)\Intel\OpenCL SDK\2.0\bin\x86;C:\Program 
> Files (x86)\Intel\OpenCL SDK\2.0\bin\x64;C:\Program Files\MiKTeX 
> 2.9\miktex\bin\x64\;C:\Program Files 
> (x86)\ActivIdentity\ActivClient\;C:\Program Files\ActivIdentity\ActivClient\



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


