date:20140710


 [ 
https://issues.apache.org/jira/browse/SPARK-2409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-2409.


   Resolution: Fixed
Fix Version/s: 1.1.0

 Make SQLConf thread safe
 

 Key: SPARK-2409
 URL: https://issues.apache.org/jira/browse/SPARK-2409
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.1.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-2360) CSV import to SchemaRDDs


 [ 
https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2360:
---

Issue Type: New Feature  (was: Bug)

 CSV import to SchemaRDDs
 

 Key: SPARK-2360
 URL: https://issues.apache.org/jira/browse/SPARK-2360
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Michael Armbrust
Assignee: Hossein Falaki
Priority: Minor

 I think the first step it to design the interface that we want to present to 
 users.  Mostly this is defining options when importing.  Off the top of my 
 head:
 - What is the separator?
 - Provide column names or infer them from the first row.
 - how to handle multiple files with possibly different schemas
 - do we have a method to let users specify the datatypes of the columns or 
 are they just strings?
 - what types of quoting / escaping do we want to support?



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Resolved] (SPARK-2115) Stage kill link is too close to stage details link


 [ 
https://issues.apache.org/jira/browse/SPARK-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-2115.


   Resolution: Fixed
Fix Version/s: 1.1.0
 Assignee: Masayoshi TSUZUKI

 Stage kill link is too close to stage details link
 --

 Key: SPARK-2115
 URL: https://issues.apache.org/jira/browse/SPARK-2115
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.0.0
Reporter: Nicholas Chammas
Assignee: Masayoshi TSUZUKI
Priority: Trivial
 Fix For: 1.1.0

 Attachments: Accident-prone kill link.png, Safe and user-friendly 
 kill link.png


 1.0.0 introduces a new and very helpful `kill` link for each active stage in 
 the Web UI.
 The problem is that this link is awfully close to the link one would click on 
 to see details for the stage. Without any confirmation dialog for the kill, I 
 think it is likely that users will accidentally kill stages when all they 
 really wanted was to see more information about them.
 I suggest moving the kill link to the right of the stage detail link, and 
 having it flush right within the cell. This puts a safe amount of space 
 between the two links while keeping the kill link in a consistent location.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-2418) Custom checkpointing with an external function as parameter

2014-07-10 Thread JIRA


[ 
https://issues.apache.org/jira/browse/SPARK-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057361#comment-14057361
 ] 

András Barják commented on SPARK-2418:
--

A pull request for a simple implementation using a function as parameter for 
checkpoint():
https://github.com/apache/spark/pull/1345

 Custom checkpointing with an external function as parameter
 ---

 Key: SPARK-2418
 URL: https://issues.apache.org/jira/browse/SPARK-2418
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: András Barják

 If a job consists of many shuffle heavy transformations the current 
 resilience model might be unsatisfactory. In our current use-case we need a 
 persistent checkpoint that we can use to save our RDDs on disk in a custom 
 location and load it back even if the driver dies. (Possible other use cases: 
 store the checkpointed data in various formats: SequenceFile, csv, Parquet 
 file, MySQL etc.)
 After talking to [~pwendell] at the Spark Summit 2014 we concluded that a 
 checkpoint where one can customize the saving and RDD reloading behavior can 
 be a good solution. I am open to further suggestions if you have better ideas 
 about how to make checkpointing more flexible.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Created] (SPARK-2428) Add except and intersect methods to SchemaRDD.

2014-07-10 Thread Takuya Ueshin (JIRA)

Takuya Ueshin created SPARK-2428:


 Summary: Add except and intersect methods to SchemaRDD.
 Key: SPARK-2428
 URL: https://issues.apache.org/jira/browse/SPARK-2428
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Takuya Ueshin






--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Created] (SPARK-2429) Hierarchical Implementation of KMeans

2014-07-10 Thread RJ Nowling (JIRA)

RJ Nowling created SPARK-2429:
-

 Summary: Hierarchical Implementation of KMeans
 Key: SPARK-2429
 URL: https://issues.apache.org/jira/browse/SPARK-2429
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: RJ Nowling
Priority: Minor


Hierarchical clustering algorithms are widely used and would make a nice 
addition to MLlib.  Clustering algorithms are useful for determining 
relationships between clusters as well as offering faster assignment. 
Discussion on the dev list suggested the following possible approaches:

* Top down, recursive application of KMeans
* Reuse DecisionTree implementation with different objective function
* Hierarchical SVD

It was also suggested that support for distance metrics other than Euclidean 
such as negative dot or cosine are necessary.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Created] (SPARK-2430) Standarized Clustering Algorithm API and Framework

2014-07-10 Thread RJ Nowling (JIRA)

RJ Nowling created SPARK-2430:
-

 Summary: Standarized Clustering Algorithm API and Framework
 Key: SPARK-2430
 URL: https://issues.apache.org/jira/browse/SPARK-2430
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: RJ Nowling
Priority: Minor


Recently, there has been a chorus of voices on the mailing lists about adding 
new clustering algorithms to MLlib.  To support these additions, we should 
develop a common framework and API to reduce code duplication and keep the APIs 
consistent.

At the same time, we can also expand the current API to incorporate requested 
features such as arbitrary distance metrics or pre-computed distance matrices.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-911) Support map pruning on sorted (K, V) RDD's

2014-07-10 Thread Aaron (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057508#comment-14057508
 ] 

Aaron commented on SPARK-911:
-

I was just wondering about this when answering 
http://stackoverflow.com/questions/24677180/how-do-i-select-a-range-of-elements-in-spark-rdd,
 I noticed that RangePartitioner stores an array of upper bounds, but I was 
unsure from the docs if that meant that the each partition had no overlap in 
range. If they don't this seems like it would be a pretty easy feature to 
implement

 Support map pruning on sorted (K, V) RDD's
 --

 Key: SPARK-911
 URL: https://issues.apache.org/jira/browse/SPARK-911
 Project: Spark
  Issue Type: Bug
Reporter: Patrick Wendell

 If someone has sorted a (K, V) rdd, we should offer them a way to filter a 
 range of the partitions that employs map pruning. This would be simple using 
 a small range index within the rdd itself. A good example is I sort my 
 dataset by time and then I want to serve queries that are restricted to a 
 certain time range.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-1807) Modify SPARK_EXECUTOR_URI to allow for script execution in Mesos.

2014-07-10 Thread jay vyas (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057511#comment-14057511
 ] 

jay vyas commented on SPARK-1807:
-

Question: Are you saying the {{SPARK_EXECUTOR_URI}} should be a URI that points 
to a script?  Or that we can replace the URI with a simple path to an 
executable script.  

Proposed Alternative: Take  this extension to its logical extreme, and actually 
deprecate the SPARK_EXECUTOR_URI parameter, and instead have a parameter 
{{SPARK_EXECUTOR_CLASS}}, which is invoked with any number of args, one of 
which could be SPARK_EXECUTOR_URI.  

Something like this: 
{noformat}
val c = Class.forName(conf.get(SPARK_EXECUTOR_CLASS))
val m = c.getMethod(startSpark,classOf[SparkExecutor])
m.invoke(null,conf.get(SPARK_EXECUTOR_URI))
{noformat}

The advantage of using a class, rather than a shell script, for this - is that 
we are gauranteed off the bat that the class is providing a contract - to start 
spark - rather than just a one off script which could do any number of things.  

Also it means that various classes for this can be maintained and tested over 
time as first class spark environment providers.



 Modify SPARK_EXECUTOR_URI to allow for script execution in Mesos.
 -

 Key: SPARK-1807
 URL: https://issues.apache.org/jira/browse/SPARK-1807
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Affects Versions: 0.9.0
Reporter: Timothy St. Clair

 Modify Mesos Scheduler integration to allow SPARK_EXECUTOR_URI to be an 
 executable script.  This allows admins to launch spark in any fashion they 
 desire, vs. just tarball fetching + implied context.   



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib

2014-07-10 Thread RJ Nowling (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057513#comment-14057513
 ] 

RJ Nowling commented on SPARK-2308:
---

That sounds like a good idea for a test.  I'll report back.

 Add KMeans MiniBatch clustering algorithm to MLlib
 --

 Key: SPARK-2308
 URL: https://issues.apache.org/jira/browse/SPARK-2308
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: RJ Nowling
Priority: Minor

 Mini-batch is a version of KMeans that uses a randomly-sampled subset of the 
 data points in each iteration instead of the full set of data points, 
 improving performance (and in some cases, accuracy).  The mini-batch version 
 is compatible with the KMeans|| initialization algorithm currently 
 implemented in MLlib.
 I suggest adding KMeans Mini-batch as an alternative.
 I'd like this to be assigned to me.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-2390) Files in .sparkStaging on HDFS cannot be deleted and wastes the space of HDFS


 [ 
https://issues.apache.org/jira/browse/SPARK-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-2390:
--

Summary: Files in .sparkStaging on HDFS cannot be deleted and wastes the 
space of HDFS  (was: Files in staging directory cannot be deleted and wastes 
the space of HDFS)

 Files in .sparkStaging on HDFS cannot be deleted and wastes the space of HDFS
 -

 Key: SPARK-2390
 URL: https://issues.apache.org/jira/browse/SPARK-2390
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 1.0.1
Reporter: Kousuke Saruta

 When running jobs with YARN Cluster mode and using HistoryServer, the files 
 in the Staging Directory cannot be deleted.
 HistoryServer uses directory where event log is written, and the directory is 
 represented as a instance of o.a.h.f.FileSystem created by using 
 FileSystem.get.
 {code:title=FileLogger.scala}
 private val fileSystem = Utils.getHadoopFileSystem(new URI(logDir))
 {code}
 {code:title=utils.getHadoopFileSystem}
 def getHadoopFileSystem(path: URI): FileSystem = {
   FileSystem.get(path, SparkHadoopUtil.get.newConfiguration())
 }
 {code}
 On the other hand, ApplicationMaster has a instance named fs, which also 
 created by using FileSystem.get.
 {code:title=ApplicationMaster}
 private val fs = FileSystem.get(yarnConf)
 {code}
 FileSystem.get returns cached same instance when URI passed to the method 
 represents same file system and the method is called by same user.
 Because of the behavior, when the directory for event log is on HDFS, fs of 
 ApplicationMaster and fileSystem of FileLogger is same instance.
 When shutting down ApplicationMaster, fileSystem.close is called in 
 FileLogger#stop, which is invoked by SparkContext#stop indirectly.
 {code:title=FileLogger.stop}
 def stop() {
   hadoopDataStream.foreach(_.close())
   writer.foreach(_.close())
   fileSystem.close()
 }
 {code}
 And  ApplicationMaster#cleanupStagingDir also called by JVM shutdown hook. In 
 this method, fs.delete(stagingDirPath) is invoked. 
 Because fs.delete in ApplicationMaster is called after fileSystem.close in 
 FileLogger, fs.delete fails and results not deleting files in the staging 
 directory.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-1807) Modify SPARK_EXECUTOR_URI to allow for script execution in Mesos.

2014-07-10 Thread Timothy St. Clair (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057592#comment-14057592
 ] 

Timothy St. Clair commented on SPARK-1807:
--

I'm saying that the URI should *only* have an implied context of a tarball if 
it is a tarball, otherwise if it's a script then it should be able to detect 
and executed. 

I think simple extension detection should work and be backwards compatible. 

 Modify SPARK_EXECUTOR_URI to allow for script execution in Mesos.
 -

 Key: SPARK-1807
 URL: https://issues.apache.org/jira/browse/SPARK-1807
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Affects Versions: 0.9.0
Reporter: Timothy St. Clair

 Modify Mesos Scheduler integration to allow SPARK_EXECUTOR_URI to be an 
 executable script.  This allows admins to launch spark in any fashion they 
 desire, vs. just tarball fetching + implied context.   



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Comment Edited] (SPARK-1807) Modify SPARK_EXECUTOR_URI to allow for script execution in Mesos.

2014-07-10 Thread Timothy St. Clair (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057592#comment-14057592
 ] 

Timothy St. Clair edited comment on SPARK-1807 at 7/10/14 3:40 PM:
---

I'm saying that the URI should *only* have an implied context of a tarball if 
it is a tarball, otherwise if it's a script then it should be able to detect 
and execut. 

I think simple extension detection should work and be backwards compatible. 


was (Author: tstclair):
I'm saying that the URI should *only* have an implied context of a tarball if 
it is a tarball, otherwise if it's a script then it should be able to detect 
and executed. 

I think simple extension detection should work and be backwards compatible. 

 Modify SPARK_EXECUTOR_URI to allow for script execution in Mesos.
 -

 Key: SPARK-1807
 URL: https://issues.apache.org/jira/browse/SPARK-1807
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Affects Versions: 0.9.0
Reporter: Timothy St. Clair

 Modify Mesos Scheduler integration to allow SPARK_EXECUTOR_URI to be an 
 executable script.  This allows admins to launch spark in any fashion they 
 desire, vs. just tarball fetching + implied context.   



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Comment Edited] (SPARK-1807) Modify SPARK_EXECUTOR_URI to allow for script execution in Mesos.

2014-07-10 Thread Timothy St. Clair (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057592#comment-14057592
 ] 

Timothy St. Clair edited comment on SPARK-1807 at 7/10/14 3:40 PM:
---

I'm saying that the URI should *only* have an implied context of a tarball if 
it is a tarball, otherwise if it's a script then it should be able to detect 
and execute. 

I think simple extension detection should work and be backwards compatible. 


was (Author: tstclair):
I'm saying that the URI should *only* have an implied context of a tarball if 
it is a tarball, otherwise if it's a script then it should be able to detect 
and execut. 

I think simple extension detection should work and be backwards compatible. 

 Modify SPARK_EXECUTOR_URI to allow for script execution in Mesos.
 -

 Key: SPARK-1807
 URL: https://issues.apache.org/jira/browse/SPARK-1807
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Affects Versions: 0.9.0
Reporter: Timothy St. Clair

 Modify Mesos Scheduler integration to allow SPARK_EXECUTOR_URI to be an 
 executable script.  This allows admins to launch spark in any fashion they 
 desire, vs. just tarball fetching + implied context.   



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-2431) Refine StringComparison and related codes.

2014-07-10 Thread Takuya Ueshin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057640#comment-14057640
 ] 

Takuya Ueshin commented on SPARK-2431:
--

PRed: https://github.com/apache/spark/pull/1357

 Refine StringComparison and related codes.
 --

 Key: SPARK-2431
 URL: https://issues.apache.org/jira/browse/SPARK-2431
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Takuya Ueshin

 Refine {{StringComparison}} and related codes as follows:
 - {{StringComparison}} could be similar to {{StringRegexExpression}} or 
 {{CaseConversionExpression}}.
 - Nullability of {{StringRegexExpression}} could depend on children's 
 nullabilities.
 - Add a case that the like condition includes no wildcard to 
 {{LikeSimplification}}.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-2432) Apriori algorithm for frequent itemset mining

2014-07-10 Thread Denis LUKOVNIKOV (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057647#comment-14057647
 ] 

Denis LUKOVNIKOV commented on SPARK-2432:
-

Can I have this assigned to me please?
I already have the implementation (and the test) in Scala, just need some time 
to clean it up, write some comments, add documentation and make a pull request.

 Apriori algorithm for frequent itemset mining
 -

 Key: SPARK-2432
 URL: https://issues.apache.org/jira/browse/SPARK-2432
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Denis LUKOVNIKOV

 A parallel implementation of the apriori algorithm.
 Apriori is a well-known and simple algorithm that finds frequent itemsets and 
 lends itself perfectly for a parallel implementation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Created] (SPARK-2433) The MLlib implementation for Naive Bayes in Spark 0.9.1 is having a implementation bug.

Rahul K Bhojwani created SPARK-2433:
---

 Summary: The MLlib implementation for Naive Bayes in Spark 0.9.1 
is having a implementation bug.
 Key: SPARK-2433
 URL: https://issues.apache.org/jira/browse/SPARK-2433
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 0.9.1
 Environment: Any 
Reporter: Rahul K Bhojwani


Don't have much experience with reporting errors. This is first time. If 
something is not clear please feel free to contact me (details given below)

In the pyspark mllib library. 
Path : \spark-0.9.1\python\pyspark\mllib\classification.py

Class: NaiveBayesModel

Method:  self.predict

Earlier Implementation:
def predict(self, x):
Return the most likely class for a data vector x
return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x)))


New Implementation:
No:1
def predict(self, x):
Return the most likely class for a data vector x
return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x)))

No:2
def predict(self, x):
Return the most likely class for a data vector x
return numpy.argmax(self.pi + dot(x,self.theta.T))

Explanation:
No:1 is correct according to me. Don't know about No:2.

Error one:
The matrix self.theta is of dimension [n_classes , n_features]. 
while the matrix x is of dimension [1 , n_features].

Taking the dot will not work as its [1, n_feature ] x [n_classes,n_features].
It will always give error:  ValueError: matrices are not aligned
In the commented example given in the classification.py, n_classes = n_features 
= 2. That's why no error.

Both Implementation no.1 and Implementation no. 2 takes care of it.

Error 2:
As basic implementation of naive bayes is: P(class_n | sample) = 
count_feature_1 * P(feature_1 | class_n ) * count_feature_n * 
P(feature_n|class_n) * P(class_n)/(THE CONSTANT P(SAMPLE)

and taking the class with max value.
That's what implementation 1 is doing.

In Implementation 2: 
Its basically class with max value :
( exp(count_feature_1) * P(feature_1 | class_n ) * exp(count_feature_n) * 
P(feature_n|class_n) * P(class_n))

Don't know if it gives the exact result.

Thanks
Rahul Bhojwani
rahulbhojwani2...@gmail.com



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-2433) The MLlib implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.


 [ 
https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rahul K Bhojwani updated SPARK-2433:


Summary: The MLlib implementation for Naive Bayes in Spark 0.9.1 is having 
an implementation bug.  (was: The MLlib implementation for Naive Bayes in Spark 
0.9.1 is having a implementation bug.)

 The MLlib implementation for Naive Bayes in Spark 0.9.1 is having an 
 implementation bug.
 

 Key: SPARK-2433
 URL: https://issues.apache.org/jira/browse/SPARK-2433
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 0.9.1
 Environment: Any 
Reporter: Rahul K Bhojwani
  Labels: easyfix, test
   Original Estimate: 1h
  Remaining Estimate: 1h

 Don't have much experience with reporting errors. This is first time. If 
 something is not clear please feel free to contact me (details given below)
 In the pyspark mllib library. 
 Path : \spark-0.9.1\python\pyspark\mllib\classification.py
 Class: NaiveBayesModel
 Method:  self.predict
 Earlier Implementation:
 def predict(self, x):
 Return the most likely class for a data vector x
 return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x)))
 
 New Implementation:
 No:1
 def predict(self, x):
 Return the most likely class for a data vector x
 return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x)))
 No:2
 def predict(self, x):
 Return the most likely class for a data vector x
 return numpy.argmax(self.pi + dot(x,self.theta.T))
 Explanation:
 No:1 is correct according to me. Don't know about No:2.
 Error one:
 The matrix self.theta is of dimension [n_classes , n_features]. 
 while the matrix x is of dimension [1 , n_features].
 Taking the dot will not work as its [1, n_feature ] x [n_classes,n_features].
 It will always give error:  ValueError: matrices are not aligned
 In the commented example given in the classification.py, n_classes = 
 n_features = 2. That's why no error.
 Both Implementation no.1 and Implementation no. 2 takes care of it.
 Error 2:
 As basic implementation of naive bayes is: P(class_n | sample) = 
 count_feature_1 * P(feature_1 | class_n ) * count_feature_n * 
 P(feature_n|class_n) * P(class_n)/(THE CONSTANT P(SAMPLE)
 and taking the class with max value.
 That's what implementation 1 is doing.
 In Implementation 2: 
 Its basically class with max value :
 ( exp(count_feature_1) * P(feature_1 | class_n ) * exp(count_feature_n) * 
 P(feature_n|class_n) * P(class_n))
 Don't know if it gives the exact result.
 Thanks
 Rahul Bhojwani
 rahulbhojwani2...@gmail.com



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-2434) Generate runtime warnings for naive implementations

2014-07-10 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2434:
-

Summary: Generate runtime warnings for naive implementations  (was: Mark 
MLlib examples)

 Generate runtime warnings for naive implementations
 ---

 Key: SPARK-2434
 URL: https://issues.apache.org/jira/browse/SPARK-2434
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Xiangrui Meng

 There are some example code under src/main/scala/org/apache/spark/examples:
 * LocalALS
 * LocalFileLR
 * LocalKMeans
 * LocalLP
 * SparkALS
 * SparkHdfsLR
 * SparkKMeans
 * SparkLR
 They provide naive implementations of some machine learning algorithms that 
 are already covered in MLlib. It is okay to keep them because the 
 implementation is straightforward and easy to read. However, we should 
 generate warning messages at runtime and point users to MLlib's 
 implementation, in case users use them in practice.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-2407) Implement SQL SUBSTR() directly in Catalyst

2014-07-10 Thread William Benton (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057725#comment-14057725
 ] 

William Benton commented on SPARK-2407:
---

Here's the PR: https://github.com/apache/spark/pull/1359


 Implement SQL SUBSTR() directly in Catalyst
 ---

 Key: SPARK-2407
 URL: https://issues.apache.org/jira/browse/SPARK-2407
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: William Benton
Assignee: William Benton

 Currently SQL SUBSTR/SUBSTRING() is delegated to Hive.  It would be nice to 
 implement this directly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Created] (SPARK-2435) Add shutdown hook to bin/pyspark

Andrew Or created SPARK-2435:


 Summary: Add shutdown hook to bin/pyspark
 Key: SPARK-2435
 URL: https://issues.apache.org/jira/browse/SPARK-2435
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
Reporter: Andrew Or
 Fix For: 1.1.0


We currently never stop the SparkContext cleanly in bin/pyspark unless the user 
explicitly runs sc.stop(). This behavior is not consistent with 
bin/spark-shell, in which case Ctrl+D stops the SparkContext before quitting 
the shell.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-2433) In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.

2014-07-10 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057746#comment-14057746
 ] 

Sean Owen commented on SPARK-2433:
--

Your earlier implementation is identical to new implementation 1. This does 
not appear to be the code in master, and I think it's only useful to propose 
fixes to the current version of code.

 In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an 
 implementation bug.
 

 Key: SPARK-2433
 URL: https://issues.apache.org/jira/browse/SPARK-2433
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 0.9.1
 Environment: Any 
Reporter: Rahul K Bhojwani
  Labels: easyfix, test
   Original Estimate: 1h
  Remaining Estimate: 1h

 Don't have much experience with reporting errors. This is first time. If 
 something is not clear please feel free to contact me (details given below)
 In the pyspark mllib library. 
 Path : \spark-0.9.1\python\pyspark\mllib\classification.py
 Class: NaiveBayesModel
 Method:  self.predict
 Earlier Implementation:
 def predict(self, x):
 Return the most likely class for a data vector x
 return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x)))
 
 New Implementation:
 No:1
 def predict(self, x):
 Return the most likely class for a data vector x
 return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x)))
 No:2
 def predict(self, x):
 Return the most likely class for a data vector x
 return numpy.argmax(self.pi + dot(x,self.theta.T))
 Explanation:
 No:1 is correct according to me. Don't know about No:2.
 Error one:
 The matrix self.theta is of dimension [n_classes , n_features]. 
 while the matrix x is of dimension [1 , n_features].
 Taking the dot will not work as its [1, n_feature ] x [n_classes,n_features].
 It will always give error:  ValueError: matrices are not aligned
 In the commented example given in the classification.py, n_classes = 
 n_features = 2. That's why no error.
 Both Implementation no.1 and Implementation no. 2 takes care of it.
 Error 2:
 As basic implementation of naive bayes is: P(class_n | sample) = 
 count_feature_1 * P(feature_1 | class_n ) * count_feature_n * 
 P(feature_n|class_n) * P(class_n)/(THE CONSTANT P(SAMPLE)
 and taking the class with max value.
 That's what implementation 1 is doing.
 In Implementation 2: 
 Its basically class with max value :
 ( exp(count_feature_1) * P(feature_1 | class_n ) * exp(count_feature_n) * 
 P(feature_n|class_n) * P(class_n))
 Don't know if it gives the exact result.
 Thanks
 Rahul Bhojwani
 rahulbhojwani2...@gmail.com



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Resolved] (SPARK-1776) Have Spark's SBT build read dependencies from Maven


 [ 
https://issues.apache.org/jira/browse/SPARK-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1776.


Resolution: Fixed

Issue resolved by pull request 772
[https://github.com/apache/spark/pull/772]

 Have Spark's SBT build read dependencies from Maven
 ---

 Key: SPARK-1776
 URL: https://issues.apache.org/jira/browse/SPARK-1776
 Project: Spark
  Issue Type: New Feature
  Components: Build
Reporter: Patrick Wendell
Assignee: Prashant Sharma
 Fix For: 1.1.0


 We've wanted to consolidate Spark's build for a while see 
 [here|http://mail-archives.apache.org/mod_mbox/spark-dev/201307.mbox/%3c39343fa4-3cf4-4349-99e7-2b20e1aed...@gmail.com%3E]
  and 
 [here|http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Necessity-of-Maven-and-SBT-Build-in-Spark-td2315.html].
 I'd like to propose using the sbt-pom-reader plug-in to allow us to keep our 
 sbt build (for ease of development) while also holding onto our Maven build 
 which almost all downstream packagers use.
 I've prototyped this a bit locally and I think it's do-able, but will require 
 making some contributions to the sbt-pom-reader plugin. Josh Suereth who 
 maintains both sbt and the plug-in has agreed to help merge any patches we 
 need for this.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-1981) Add AWS Kinesis streaming support

2014-07-10 Thread Jonathan Kelly (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057763#comment-14057763
 ] 

Jonathan Kelly commented on SPARK-1981:
---

I work for Amazon, but unfortunately I don't know the correct person/team 
internally to contact regarding the licensing.  :)  I'll look into that, but 
I've been busy the past few days.

 Add AWS Kinesis streaming support
 -

 Key: SPARK-1981
 URL: https://issues.apache.org/jira/browse/SPARK-1981
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Chris Fregly
Assignee: Chris Fregly

 Add AWS Kinesis support to Spark Streaming.
 Initial discussion occured here:  https://github.com/apache/spark/pull/223
 I discussed this with Parviz from AWS recently and we agreed that I would 
 take this over.
 Look for a new PR that takes into account all the feedback from the earlier 
 PR including spark-1.0-compliant implementation, AWS-license-aware build 
 support, tests, comments, and style guide compliance.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-2433) In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.


[ 
https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057801#comment-14057801
 ] 

Bertrand Dechoux commented on SPARK-2433:
-

A Jira ticket is the first step, the second would have been to provide a diff 
patch or a github pull request. And you can also write a test to prove your 
point and make sure that the fix will stay longer.

I will second Sean :
1) work with last version (1.0)
2) you report is not clear, that's why diff patch or pull request are welcomed

 In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an 
 implementation bug.
 

 Key: SPARK-2433
 URL: https://issues.apache.org/jira/browse/SPARK-2433
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 0.9.1
 Environment: Any 
Reporter: Rahul K Bhojwani
  Labels: easyfix, test
   Original Estimate: 1h
  Remaining Estimate: 1h

 Don't have much experience with reporting errors. This is first time. If 
 something is not clear please feel free to contact me (details given below)
 In the pyspark mllib library. 
 Path : \spark-0.9.1\python\pyspark\mllib\classification.py
 Class: NaiveBayesModel
 Method:  self.predict
 Earlier Implementation:
 def predict(self, x):
 Return the most likely class for a data vector x
 return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x)))
 
 New Implementation:
 No:1
 def predict(self, x):
 Return the most likely class for a data vector x
 return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x)))
 No:2
 def predict(self, x):
 Return the most likely class for a data vector x
 return numpy.argmax(self.pi + dot(x,self.theta.T))
 Explanation:
 No:1 is correct according to me. Don't know about No:2.
 Error one:
 The matrix self.theta is of dimension [n_classes , n_features]. 
 while the matrix x is of dimension [1 , n_features].
 Taking the dot will not work as its [1, n_feature ] x [n_classes,n_features].
 It will always give error:  ValueError: matrices are not aligned
 In the commented example given in the classification.py, n_classes = 
 n_features = 2. That's why no error.
 Both Implementation no.1 and Implementation no. 2 takes care of it.
 Error 2:
 As basic implementation of naive bayes is: P(class_n | sample) = 
 count_feature_1 * P(feature_1 | class_n ) * count_feature_n * 
 P(feature_n|class_n) * P(class_n)/(THE CONSTANT P(SAMPLE)
 and taking the class with max value.
 That's what implementation 1 is doing.
 In Implementation 2: 
 Its basically class with max value :
 ( exp(count_feature_1) * P(feature_1 | class_n ) * exp(count_feature_n) * 
 P(feature_n|class_n) * P(class_n))
 Don't know if it gives the exact result.
 Thanks
 Rahul Bhojwani
 rahulbhojwani2...@gmail.com



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Comment Edited] (SPARK-2433) In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.


[ 
https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057801#comment-14057801
 ] 

Bertrand Dechoux edited comment on SPARK-2433 at 7/10/14 6:47 PM:
--

A Jira ticket is the first step, the second would have been to provide a diff 
patch or a github pull request. And you can also write a test to prove your 
point and make sure that the fix will stay longer.

I will second Sean :
1) work with last version (1.0)
2) you report is not clear, that's why diff patch or pull request are welcomed

And there is a transpose() in the current implementation so I believe that the 
bug is actually already fixed.


was (Author: bdechoux):
A Jira ticket is the first step, the second would have been to provide a diff 
patch or a github pull request. And you can also write a test to prove your 
point and make sure that the fix will stay longer.

I will second Sean :
1) work with last version (1.0)
2) you report is not clear, that's why diff patch or pull request are welcomed

 In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an 
 implementation bug.
 

 Key: SPARK-2433
 URL: https://issues.apache.org/jira/browse/SPARK-2433
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 0.9.1
 Environment: Any 
Reporter: Rahul K Bhojwani
  Labels: easyfix, test
   Original Estimate: 1h
  Remaining Estimate: 1h

 Don't have much experience with reporting errors. This is first time. If 
 something is not clear please feel free to contact me (details given below)
 In the pyspark mllib library. 
 Path : \spark-0.9.1\python\pyspark\mllib\classification.py
 Class: NaiveBayesModel
 Method:  self.predict
 Earlier Implementation:
 def predict(self, x):
 Return the most likely class for a data vector x
 return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x)))
 
 New Implementation:
 No:1
 def predict(self, x):
 Return the most likely class for a data vector x
 return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x)))
 No:2
 def predict(self, x):
 Return the most likely class for a data vector x
 return numpy.argmax(self.pi + dot(x,self.theta.T))
 Explanation:
 No:1 is correct according to me. Don't know about No:2.
 Error one:
 The matrix self.theta is of dimension [n_classes , n_features]. 
 while the matrix x is of dimension [1 , n_features].
 Taking the dot will not work as its [1, n_feature ] x [n_classes,n_features].
 It will always give error:  ValueError: matrices are not aligned
 In the commented example given in the classification.py, n_classes = 
 n_features = 2. That's why no error.
 Both Implementation no.1 and Implementation no. 2 takes care of it.
 Error 2:
 As basic implementation of naive bayes is: P(class_n | sample) = 
 count_feature_1 * P(feature_1 | class_n ) * count_feature_n * 
 P(feature_n|class_n) * P(class_n)/(THE CONSTANT P(SAMPLE)
 and taking the class with max value.
 That's what implementation 1 is doing.
 In Implementation 2: 
 Its basically class with max value :
 ( exp(count_feature_1) * P(feature_1 | class_n ) * exp(count_feature_n) * 
 P(feature_n|class_n) * P(class_n))
 Don't know if it gives the exact result.
 Thanks
 Rahul Bhojwani
 rahulbhojwani2...@gmail.com



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Comment Edited] (SPARK-2433) In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.


[ 
https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057801#comment-14057801
 ] 

Bertrand Dechoux edited comment on SPARK-2433 at 7/10/14 6:50 PM:
--

A Jira ticket is the first step, the second would have been to provide a diff 
patch or a github pull request. And you can also write a test to prove your 
point and make sure that the fix will stay longer.

I will second Sean :
1) work with last version (1.0)
2) you report is not clear, that's why diff patch or pull request are welcomed

And there is a transpose() in the current implementation so I believe that the 
bug is actually already fixed.

see 
https://github.com/apache/spark/commit/4f2f093c5b65b74869068d5690a4d2b0e0b5f759


was (Author: bdechoux):
A Jira ticket is the first step, the second would have been to provide a diff 
patch or a github pull request. And you can also write a test to prove your 
point and make sure that the fix will stay longer.

I will second Sean :
1) work with last version (1.0)
2) you report is not clear, that's why diff patch or pull request are welcomed

And there is a transpose() in the current implementation so I believe that the 
bug is actually already fixed.

 In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an 
 implementation bug.
 

 Key: SPARK-2433
 URL: https://issues.apache.org/jira/browse/SPARK-2433
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 0.9.1
 Environment: Any 
Reporter: Rahul K Bhojwani
  Labels: easyfix, test
   Original Estimate: 1h
  Remaining Estimate: 1h

 Don't have much experience with reporting errors. This is first time. If 
 something is not clear please feel free to contact me (details given below)
 In the pyspark mllib library. 
 Path : \spark-0.9.1\python\pyspark\mllib\classification.py
 Class: NaiveBayesModel
 Method:  self.predict
 Earlier Implementation:
 def predict(self, x):
 Return the most likely class for a data vector x
 return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x)))
 
 New Implementation:
 No:1
 def predict(self, x):
 Return the most likely class for a data vector x
 return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x)))
 No:2
 def predict(self, x):
 Return the most likely class for a data vector x
 return numpy.argmax(self.pi + dot(x,self.theta.T))
 Explanation:
 No:1 is correct according to me. Don't know about No:2.
 Error one:
 The matrix self.theta is of dimension [n_classes , n_features]. 
 while the matrix x is of dimension [1 , n_features].
 Taking the dot will not work as its [1, n_feature ] x [n_classes,n_features].
 It will always give error:  ValueError: matrices are not aligned
 In the commented example given in the classification.py, n_classes = 
 n_features = 2. That's why no error.
 Both Implementation no.1 and Implementation no. 2 takes care of it.
 Error 2:
 As basic implementation of naive bayes is: P(class_n | sample) = 
 count_feature_1 * P(feature_1 | class_n ) * count_feature_n * 
 P(feature_n|class_n) * P(class_n)/(THE CONSTANT P(SAMPLE)
 and taking the class with max value.
 That's what implementation 1 is doing.
 In Implementation 2: 
 Its basically class with max value :
 ( exp(count_feature_1) * P(feature_1 | class_n ) * exp(count_feature_n) * 
 P(feature_n|class_n) * P(class_n))
 Don't know if it gives the exact result.
 Thanks
 Rahul Bhojwani
 rahulbhojwani2...@gmail.com



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Comment Edited] (SPARK-2433) In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.


[ 
https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057801#comment-14057801
 ] 

Bertrand Dechoux edited comment on SPARK-2433 at 7/10/14 6:52 PM:
--

A Jira ticket is the first step, the second would have been to provide a diff 
patch or a github pull request. And you can also write a test to prove your 
point and make sure that the fix will stay longer.

I will second Sean :
1) work with last version (1.0)
2) you report is not clear, that's why diff patch or pull request are welcomed

And there is a transpose() in the current implementation,  the bug is actually 
already fixed, see https://github.com/apache/spark/pull/463



was (Author: bdechoux):
A Jira ticket is the first step, the second would have been to provide a diff 
patch or a github pull request. And you can also write a test to prove your 
point and make sure that the fix will stay longer.

I will second Sean :
1) work with last version (1.0)
2) you report is not clear, that's why diff patch or pull request are welcomed

And there is a transpose() in the current implementation so I believe that the 
bug is actually already fixed.

see 
https://github.com/apache/spark/commit/4f2f093c5b65b74869068d5690a4d2b0e0b5f759

 In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an 
 implementation bug.
 

 Key: SPARK-2433
 URL: https://issues.apache.org/jira/browse/SPARK-2433
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 0.9.1
 Environment: Any 
Reporter: Rahul K Bhojwani
  Labels: easyfix, test
   Original Estimate: 1h
  Remaining Estimate: 1h

 Don't have much experience with reporting errors. This is first time. If 
 something is not clear please feel free to contact me (details given below)
 In the pyspark mllib library. 
 Path : \spark-0.9.1\python\pyspark\mllib\classification.py
 Class: NaiveBayesModel
 Method:  self.predict
 Earlier Implementation:
 def predict(self, x):
 Return the most likely class for a data vector x
 return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x)))
 
 New Implementation:
 No:1
 def predict(self, x):
 Return the most likely class for a data vector x
 return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x)))
 No:2
 def predict(self, x):
 Return the most likely class for a data vector x
 return numpy.argmax(self.pi + dot(x,self.theta.T))
 Explanation:
 No:1 is correct according to me. Don't know about No:2.
 Error one:
 The matrix self.theta is of dimension [n_classes , n_features]. 
 while the matrix x is of dimension [1 , n_features].
 Taking the dot will not work as its [1, n_feature ] x [n_classes,n_features].
 It will always give error:  ValueError: matrices are not aligned
 In the commented example given in the classification.py, n_classes = 
 n_features = 2. That's why no error.
 Both Implementation no.1 and Implementation no. 2 takes care of it.
 Error 2:
 As basic implementation of naive bayes is: P(class_n | sample) = 
 count_feature_1 * P(feature_1 | class_n ) * count_feature_n * 
 P(feature_n|class_n) * P(class_n)/(THE CONSTANT P(SAMPLE)
 and taking the class with max value.
 That's what implementation 1 is doing.
 In Implementation 2: 
 Its basically class with max value :
 ( exp(count_feature_1) * P(feature_1 | class_n ) * exp(count_feature_n) * 
 P(feature_n|class_n) * P(class_n))
 Don't know if it gives the exact result.
 Thanks
 Rahul Bhojwani
 rahulbhojwani2...@gmail.com



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Created] (SPARK-2436) Apply size-based optimization to planning BroadcastNestedLoopJoin

2014-07-10 Thread Zongheng Yang (JIRA)

Zongheng Yang created SPARK-2436:


 Summary: Apply size-based optimization to planning 
BroadcastNestedLoopJoin
 Key: SPARK-2436
 URL: https://issues.apache.org/jira/browse/SPARK-2436
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Zongheng Yang


We should estimate the sizes of both the left side  right side, and make the 
smaller one the broadcast relation.

As part of this task, BroadcastNestedLoopJoin needs to be slightly refactored 
to take a BuildSide into account.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Comment Edited] (SPARK-2433) In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.


[ 
https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057801#comment-14057801
 ] 

Bertrand Dechoux edited comment on SPARK-2433 at 7/10/14 6:57 PM:
--

A Jira ticket is the first step, the second would have been to provide a diff 
patch or a github pull request. And you can also write a test to prove your 
point and make sure that the fix will stay longer.

I will second Sean :
1) work with last version (1.0)
2) you report is not clear, that's why diff patch or pull request are welcomed

And there is a transpose() in the current implementation,  the bug is actually 
already fixed, see https://github.com/apache/spark/pull/463

You might want to read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark for a 
next time.


was (Author: bdechoux):
A Jira ticket is the first step, the second would have been to provide a diff 
patch or a github pull request. And you can also write a test to prove your 
point and make sure that the fix will stay longer.

I will second Sean :
1) work with last version (1.0)
2) you report is not clear, that's why diff patch or pull request are welcomed

And there is a transpose() in the current implementation,  the bug is actually 
already fixed, see https://github.com/apache/spark/pull/463


 In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an 
 implementation bug.
 

 Key: SPARK-2433
 URL: https://issues.apache.org/jira/browse/SPARK-2433
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 0.9.1
 Environment: Any 
Reporter: Rahul K Bhojwani
  Labels: easyfix, test
   Original Estimate: 1h
  Remaining Estimate: 1h

 Don't have much experience with reporting errors. This is first time. If 
 something is not clear please feel free to contact me (details given below)
 In the pyspark mllib library. 
 Path : \spark-0.9.1\python\pyspark\mllib\classification.py
 Class: NaiveBayesModel
 Method:  self.predict
 Earlier Implementation:
 def predict(self, x):
 Return the most likely class for a data vector x
 return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x)))
 
 New Implementation:
 No:1
 def predict(self, x):
 Return the most likely class for a data vector x
 return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x)))
 No:2
 def predict(self, x):
 Return the most likely class for a data vector x
 return numpy.argmax(self.pi + dot(x,self.theta.T))
 Explanation:
 No:1 is correct according to me. Don't know about No:2.
 Error one:
 The matrix self.theta is of dimension [n_classes , n_features]. 
 while the matrix x is of dimension [1 , n_features].
 Taking the dot will not work as its [1, n_feature ] x [n_classes,n_features].
 It will always give error:  ValueError: matrices are not aligned
 In the commented example given in the classification.py, n_classes = 
 n_features = 2. That's why no error.
 Both Implementation no.1 and Implementation no. 2 takes care of it.
 Error 2:
 As basic implementation of naive bayes is: P(class_n | sample) = 
 count_feature_1 * P(feature_1 | class_n ) * count_feature_n * 
 P(feature_n|class_n) * P(class_n)/(THE CONSTANT P(SAMPLE)
 and taking the class with max value.
 That's what implementation 1 is doing.
 In Implementation 2: 
 Its basically class with max value :
 ( exp(count_feature_1) * P(feature_1 | class_n ) * exp(count_feature_n) * 
 P(feature_n|class_n) * P(class_n))
 Don't know if it gives the exact result.
 Thanks
 Rahul Bhojwani
 rahulbhojwani2...@gmail.com



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-2433) In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.


[ 
https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057815#comment-14057815
 ] 

Rahul K Bhojwani commented on SPARK-2433:
-

Okay fine. I will take care of the proper procedure from next time. Looks like 
its been already taken care.
 Thank you

 In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an 
 implementation bug.
 

 Key: SPARK-2433
 URL: https://issues.apache.org/jira/browse/SPARK-2433
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 0.9.1
 Environment: Any 
Reporter: Rahul K Bhojwani
  Labels: easyfix, test
   Original Estimate: 1h
  Remaining Estimate: 1h

 Don't have much experience with reporting errors. This is first time. If 
 something is not clear please feel free to contact me (details given below)
 In the pyspark mllib library. 
 Path : \spark-0.9.1\python\pyspark\mllib\classification.py
 Class: NaiveBayesModel
 Method:  self.predict
 Earlier Implementation:
 def predict(self, x):
 Return the most likely class for a data vector x
 return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x)))
 
 New Implementation:
 No:1
 def predict(self, x):
 Return the most likely class for a data vector x
 return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x)))
 No:2
 def predict(self, x):
 Return the most likely class for a data vector x
 return numpy.argmax(self.pi + dot(x,self.theta.T))
 Explanation:
 No:1 is correct according to me. Don't know about No:2.
 Error one:
 The matrix self.theta is of dimension [n_classes , n_features]. 
 while the matrix x is of dimension [1 , n_features].
 Taking the dot will not work as its [1, n_feature ] x [n_classes,n_features].
 It will always give error:  ValueError: matrices are not aligned
 In the commented example given in the classification.py, n_classes = 
 n_features = 2. That's why no error.
 Both Implementation no.1 and Implementation no. 2 takes care of it.
 Error 2:
 As basic implementation of naive bayes is: P(class_n | sample) = 
 count_feature_1 * P(feature_1 | class_n ) * count_feature_n * 
 P(feature_n|class_n) * P(class_n)/(THE CONSTANT P(SAMPLE)
 and taking the class with max value.
 That's what implementation 1 is doing.
 In Implementation 2: 
 Its basically class with max value :
 ( exp(count_feature_1) * P(feature_1 | class_n ) * exp(count_feature_n) * 
 P(feature_n|class_n) * P(class_n))
 Don't know if it gives the exact result.
 Thanks
 Rahul Bhojwani
 rahulbhojwani2...@gmail.com



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-2433) In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.


[ 
https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057814#comment-14057814
 ] 

Rahul K Bhojwani commented on SPARK-2433:
-

Sorry my mistake.
By earlier implementation I mean how its already implemented.
and that is:
def predict(self, x):
Return the most likely class for a data vector x
return numpy.argmax(self.pi + dot(x,self.theta))








-- 
Rahul K Bhojwani
3rd Year B.Tech
Computer Science and Engineering
National Institute of Technology, Karnataka


 In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an 
 implementation bug.
 

 Key: SPARK-2433
 URL: https://issues.apache.org/jira/browse/SPARK-2433
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 0.9.1
 Environment: Any 
Reporter: Rahul K Bhojwani
  Labels: easyfix, test
   Original Estimate: 1h
  Remaining Estimate: 1h

 Don't have much experience with reporting errors. This is first time. If 
 something is not clear please feel free to contact me (details given below)
 In the pyspark mllib library. 
 Path : \spark-0.9.1\python\pyspark\mllib\classification.py
 Class: NaiveBayesModel
 Method:  self.predict
 Earlier Implementation:
 def predict(self, x):
 Return the most likely class for a data vector x
 return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x)))
 
 New Implementation:
 No:1
 def predict(self, x):
 Return the most likely class for a data vector x
 return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x)))
 No:2
 def predict(self, x):
 Return the most likely class for a data vector x
 return numpy.argmax(self.pi + dot(x,self.theta.T))
 Explanation:
 No:1 is correct according to me. Don't know about No:2.
 Error one:
 The matrix self.theta is of dimension [n_classes , n_features]. 
 while the matrix x is of dimension [1 , n_features].
 Taking the dot will not work as its [1, n_feature ] x [n_classes,n_features].
 It will always give error:  ValueError: matrices are not aligned
 In the commented example given in the classification.py, n_classes = 
 n_features = 2. That's why no error.
 Both Implementation no.1 and Implementation no. 2 takes care of it.
 Error 2:
 As basic implementation of naive bayes is: P(class_n | sample) = 
 count_feature_1 * P(feature_1 | class_n ) * count_feature_n * 
 P(feature_n|class_n) * P(class_n)/(THE CONSTANT P(SAMPLE)
 and taking the class with max value.
 That's what implementation 1 is doing.
 In Implementation 2: 
 Its basically class with max value :
 ( exp(count_feature_1) * P(feature_1 | class_n ) * exp(count_feature_n) * 
 P(feature_n|class_n) * P(class_n))
 Don't know if it gives the exact result.
 Thanks
 Rahul Bhojwani
 rahulbhojwani2...@gmail.com



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-2345) ForEachDStream should have an option of running the foreachfunc on Spark


[ 
https://issues.apache.org/jira/browse/SPARK-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057824#comment-14057824
 ] 

Tathagata Das commented on SPARK-2345:
--

I must be missing something. If you want to run a Spark job inside the 
DStream.foreach function, you can just call any RDD action (count, collect, 
first, saveAs***File, etc.) inside that foreach function. context.runJob does 
not need to be called *explicitly*. All of the exisinting RDD actions 
(including the most generic rdd.foreachPartition) should be mostly sufficient 
for all requirements.

Maybe adding a code example of what you intend to do will help us disambiguate 
this?

 ForEachDStream should have an option of running the foreachfunc on Spark
 

 Key: SPARK-2345
 URL: https://issues.apache.org/jira/browse/SPARK-2345
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Hari Shreedharan

 Today the Job generated simply calls the foreachfunc, but does not run it on 
 spark itself using the sparkContext.runJob method.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-1981) Add AWS Kinesis streaming support

2014-07-10 Thread Chris Fregly (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057847#comment-14057847
 ] 

Chris Fregly commented on SPARK-1981:
-

hey guys-

i'm in the final phases of cleanup.  i refactored quite a bit of the original 
code to make things more testable - and easier to understand.  

oh, and i did, indeed, choose the optional-module route.  we'll address the 
additional complexity through documentation.  that's what i'm working on right 
now, actually.

hoping to submit the PR by tomorrow or this weekend at the very latest.  

the goal is to get this in to the 1.1 release which has a timeline outlined 
here:  https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage 

thanks!

-chris

 Add AWS Kinesis streaming support
 -

 Key: SPARK-1981
 URL: https://issues.apache.org/jira/browse/SPARK-1981
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Chris Fregly
Assignee: Chris Fregly

 Add AWS Kinesis support to Spark Streaming.
 Initial discussion occured here:  https://github.com/apache/spark/pull/223
 I discussed this with Parviz from AWS recently and we agreed that I would 
 take this over.
 Look for a new PR that takes into account all the feedback from the earlier 
 PR including spark-1.0-compliant implementation, AWS-license-aware build 
 support, tests, comments, and style guide compliance.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-1458) Expose sc.version in PySpark


[ 
https://issues.apache.org/jira/browse/SPARK-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057857#comment-14057857
 ] 

Patrick Wendell commented on SPARK-1458:


[~nchammas] I updated the JIRA title to reflect the scope. We should just add 
this in PySpark, should be an easy fix!

 Expose sc.version in PySpark
 

 Key: SPARK-1458
 URL: https://issues.apache.org/jira/browse/SPARK-1458
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, Spark Core
Affects Versions: 0.9.0
Reporter: Nicholas Chammas
Priority: Minor

 As discussed 
 [here|http://apache-spark-user-list.1001560.n3.nabble.com/programmatic-way-to-tell-Spark-version-td1929.html],
  I think it would be nice if there was a way to programmatically determine 
 what version of Spark you are running. 
 The potential use cases are not that important, but they include:
 # Branching your code based on what version of Spark is running.
 # Checking your version without having to quit and restart the Spark shell.
 Right now in PySpark, I believe the only way to determine your version is by 
 firing up the Spark shell and looking at the startup banner.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-1458) Add programmatic way to determine Spark version


 [ 
https://issues.apache.org/jira/browse/SPARK-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1458:
---

Fix Version/s: (was: 1.0.0)

 Add programmatic way to determine Spark version
 ---

 Key: SPARK-1458
 URL: https://issues.apache.org/jira/browse/SPARK-1458
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, Spark Core
Affects Versions: 0.9.0
Reporter: Nicholas Chammas
Priority: Minor

 As discussed 
 [here|http://apache-spark-user-list.1001560.n3.nabble.com/programmatic-way-to-tell-Spark-version-td1929.html],
  I think it would be nice if there was a way to programmatically determine 
 what version of Spark you are running. 
 The potential use cases are not that important, but they include:
 # Branching your code based on what version of Spark is running.
 # Checking your version without having to quit and restart the Spark shell.
 Right now in PySpark, I believe the only way to determine your version is by 
 firing up the Spark shell and looking at the startup banner.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-1458) Add programmatic way to determine Spark version


 [ 
https://issues.apache.org/jira/browse/SPARK-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1458:
---

Target Version/s: 1.1.0

 Add programmatic way to determine Spark version
 ---

 Key: SPARK-1458
 URL: https://issues.apache.org/jira/browse/SPARK-1458
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, Spark Core
Affects Versions: 0.9.0
Reporter: Nicholas Chammas
Priority: Minor

 As discussed 
 [here|http://apache-spark-user-list.1001560.n3.nabble.com/programmatic-way-to-tell-Spark-version-td1929.html],
  I think it would be nice if there was a way to programmatically determine 
 what version of Spark you are running. 
 The potential use cases are not that important, but they include:
 # Branching your code based on what version of Spark is running.
 # Checking your version without having to quit and restart the Spark shell.
 Right now in PySpark, I believe the only way to determine your version is by 
 firing up the Spark shell and looking at the startup banner.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-1981) Add AWS Kinesis streaming support

2014-07-10 Thread JIRA


[ 
https://issues.apache.org/jira/browse/SPARK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057872#comment-14057872
 ] 

Urban Škudnik commented on SPARK-1981:
--

Excellent, looking forward already! :)

I'm about to start evaluating Kinesis integration with our spark systems so if 
you need someone to try out the documentation and test the integration, let me 
know.

 Add AWS Kinesis streaming support
 -

 Key: SPARK-1981
 URL: https://issues.apache.org/jira/browse/SPARK-1981
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Chris Fregly
Assignee: Chris Fregly

 Add AWS Kinesis support to Spark Streaming.
 Initial discussion occured here:  https://github.com/apache/spark/pull/223
 I discussed this with Parviz from AWS recently and we agreed that I would 
 take this over.
 Look for a new PR that takes into account all the feedback from the earlier 
 PR including spark-1.0-compliant implementation, AWS-license-aware build 
 support, tests, comments, and style guide compliance.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-1642) Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-2083


[ 
https://issues.apache.org/jira/browse/SPARK-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057894#comment-14057894
 ] 

Tathagata Das commented on SPARK-1642:
--

1478 is merged. Do you have this ready?

 Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-2083
 ---

 Key: SPARK-1642
 URL: https://issues.apache.org/jira/browse/SPARK-1642
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Ted Malaska
Assignee: Ted Malaska
Priority: Minor
 Fix For: 1.1.0


 This will add support for SSL encryption between Flume AvroSink and Spark 
 Streaming.
 It is based on FLUME-2083



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-1478) Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-1915


[ 
https://issues.apache.org/jira/browse/SPARK-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057897#comment-14057897
 ] 

Tathagata Das commented on SPARK-1478:
--

After much hoops, this was the PR that merged this feature.

https://github.com/apache/spark/pull/1347

 Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-1915
 ---

 Key: SPARK-1478
 URL: https://issues.apache.org/jira/browse/SPARK-1478
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Ted Malaska
Assignee: Ted Malaska
Priority: Minor
 Fix For: 1.1.0


 Flume-1915 added support for compression over the wire from avro sink to avro 
 source.  I would like to add this functionality to the FlumeReceiver.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-2419) Misc updates to streaming programming guide


 [ 
https://issues.apache.org/jira/browse/SPARK-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-2419:
-

Description: 
This JIRA collects together a number of small issues that should be added to 
the streaming programming guide

- Receivers consume an executor slot
- Ordering and parallelism of the output operations
- Receiver's should be serializable
- Add more information on how socketStream: input stream = iterator function.

  was:
This JIRA collects together a number of small issues that should be added to 
the streaming programming guide

- Receivers consume an executor slot
- Ordering and parallelism of the output operations
- Receiver's should be serializable


 Misc updates to streaming programming guide
 ---

 Key: SPARK-2419
 URL: https://issues.apache.org/jira/browse/SPARK-2419
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das

 This JIRA collects together a number of small issues that should be added to 
 the streaming programming guide
 - Receivers consume an executor slot
 - Ordering and parallelism of the output operations
 - Receiver's should be serializable
 - Add more information on how socketStream: input stream = iterator function.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Created] (SPARK-2437) Rename MAVEN_PROFILES to SBT_MAVEN_PROFILES and add SBT_MAVEN_PROPERTIES

Patrick Wendell created SPARK-2437:
--

 Summary: Rename MAVEN_PROFILES to SBT_MAVEN_PROFILES and add 
SBT_MAVEN_PROPERTIES
 Key: SPARK-2437
 URL: https://issues.apache.org/jira/browse/SPARK-2437
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Patrick Wendell
Assignee: Prashant Sharma


Right now calling it MAVEN_PROFILES is confusing because it's actually only 
used by sbt and _not_ maven!

Also, we should have a similar SBT_MAVEN_PROPERTIES. In some cases people will 
want to set these automatically if they are e.g. building Hadoop for a specific 
minor version.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Closed] (SPARK-1785) Streaming requires receivers to be serializable


 [ 
https://issues.apache.org/jira/browse/SPARK-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das closed SPARK-1785.


Resolution: Fixed

 Streaming requires receivers to be serializable
 ---

 Key: SPARK-1785
 URL: https://issues.apache.org/jira/browse/SPARK-1785
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.9.0
Reporter: Hari Shreedharan

 When the ReceiverTracker starts the receivers it creates a temporary RDD to  
 send the receivers over to the workers. Then they are started on the workers  
 using a the startReceivers method.
 Looks like this means that the receivers have to really be serializable. In 
 case of the Flume receiver, the Avro IPC components are not serializable 
 causing an error that looks like this:
 {code}
 Exception in thread Thread-46 org.apache.spark.SparkException: Job aborted 
 due to stage failure: Task not serializable: 
 java.io.NotSerializableException: 
 org.apache.avro.ipc.specific.SpecificResponder
   - field (class org.apache.spark.streaming.flume.FlumeReceiver, name: 
 responder, type: class org.apache.avro.ipc.specific.SpecificResponder)
   - object (class org.apache.spark.streaming.flume.FlumeReceiver, 
 org.apache.spark.streaming.flume.FlumeReceiver@5e6bbb36)
   - element of array (index: 0)
   - array (class [Lorg.apache.spark.streaming.receiver.Receiver;, size: 
 1)
   - field (class scala.collection.mutable.WrappedArray$ofRef, name: 
 array, type: class [Ljava.lang.Object;)
   - object (class scala.collection.mutable.WrappedArray$ofRef, 
 WrappedArray(org.apache.spark.streaming.flume.FlumeReceiver@5e6bbb36))
   - field (class org.apache.spark.rdd.ParallelCollectionPartition, 
 name: values, type: interface scala.collection.Seq)
   - custom writeObject data (class 
 org.apache.spark.rdd.ParallelCollectionPartition)
   - object (class org.apache.spark.rdd.ParallelCollectionPartition, 
 org.apache.spark.rdd.ParallelCollectionPartition@691)
   - writeExternal data
   - root object (class org.apache.spark.scheduler.ResultTask, 
 ResultTask(0, 0))
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:770)
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:713)
   at 
 org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:697)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1176)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 {code}
 A way out of this is to simply send the class name (or .class) to the workers 
 in the tempRDD and have the workers instantiate and start the receiver.
 My analysis maybe wrong. but if it makes sense, I will submit a PR to fix 
 this.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-2157) Can't write tight firewall rules for Spark


 [ 
https://issues.apache.org/jira/browse/SPARK-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2157:
---

Assignee: Andrew Ash

 Can't write tight firewall rules for Spark
 --

 Key: SPARK-2157
 URL: https://issues.apache.org/jira/browse/SPARK-2157
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Spark Core
Affects Versions: 1.0.0
Reporter: Andrew Ash
Assignee: Andrew Ash
Priority: Critical

 In order to run Spark in places with strict firewall rules, you need to be 
 able to specify every port that's used between all parts of the stack.
 Per the [network activity section of the 
 docs|http://spark.apache.org/docs/latest/spark-standalone.html#configuring-ports-for-network-security]
  most of the ports are configurable, but there are a few ports that aren't 
 configurable.
 We need to make every port configurable to a particular port, so that we can 
 run Spark in highly locked-down environments.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-2157) Can't write tight firewall rules for Spark


 [ 
https://issues.apache.org/jira/browse/SPARK-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2157:
---

Target Version/s: 1.1.0

 Can't write tight firewall rules for Spark
 --

 Key: SPARK-2157
 URL: https://issues.apache.org/jira/browse/SPARK-2157
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Spark Core
Affects Versions: 1.0.0
Reporter: Andrew Ash
Priority: Critical

 In order to run Spark in places with strict firewall rules, you need to be 
 able to specify every port that's used between all parts of the stack.
 Per the [network activity section of the 
 docs|http://spark.apache.org/docs/latest/spark-standalone.html#configuring-ports-for-network-security]
  most of the ports are configurable, but there are a few ports that aren't 
 configurable.
 We need to make every port configurable to a particular port, so that we can 
 run Spark in highly locked-down environments.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Created] (SPARK-2438) Streaming + MLLib

2014-07-10 Thread Jeremy Freeman (JIRA)

Jeremy Freeman created SPARK-2438:
-

 Summary: Streaming + MLLib
 Key: SPARK-2438
 URL: https://issues.apache.org/jira/browse/SPARK-2438
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Jeremy Freeman


This is a ticket to track progress on developing streaming analyses in MLLib.

Many streaming applications benefit from or require fitting models online, 
where the parameters of a model (e.g. regression, clustering) are updated 
continually as new data arrive. This can be accomplished by incorporating MLLib 
algorithms into model-updating operations over DStreams. In some cases this can 
be achieved using existing updaters (e.g. those based on SGD), but in other 
cases will require custom update rules (e.g. for KMeans). The goal is to have 
streaming versions of many common algorithms, in particular regression, 
classification, clustering, and possibly dimensionality reduction.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-2345) ForEachDStream should have an option of running the foreachfunc on Spark

2014-07-10 Thread Hari Shreedharan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057954#comment-14057954
 ] 

Hari Shreedharan commented on SPARK-2345:
-

If I have to write the data out as say Avro or any other format, there is no 
easy way to do this. saveAsHadoop currently does only Sequence Files AFAIK. 
Adding support for a a dstream that can run the function on executors without 
the implementor worrying about it would enable this. 

 ForEachDStream should have an option of running the foreachfunc on Spark
 

 Key: SPARK-2345
 URL: https://issues.apache.org/jira/browse/SPARK-2345
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Hari Shreedharan

 Today the Job generated simply calls the foreachfunc, but does not run it on 
 spark itself using the sparkContext.runJob method.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Created] (SPARK-2439) Actually close FileSystem in Master

Andrew Or created SPARK-2439:


 Summary: Actually close FileSystem in Master
 Key: SPARK-2439
 URL: https://issues.apache.org/jira/browse/SPARK-2439
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Or
Priority: Minor
 Fix For: 1.1.0


This was intended, but never actually materialized in code...



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-2411) Standalone Master - direct users to turn on event logs


 [ 
https://issues.apache.org/jira/browse/SPARK-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-2411:
-

Attachment: (was: Screen Shot 2014-07-08 at 4.23.51 PM.png)

 Standalone Master - direct users to turn on event logs
 --

 Key: SPARK-2411
 URL: https://issues.apache.org/jira/browse/SPARK-2411
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Or

 Right now if the user attempts to click on a finished application's UI, it 
 simply refreshes. This is simply because the event logs are not there, in 
 which case we set the href=.
 We could provide more information by pointing them to configure 
 spark.eventLog.enabled if they click on the empty link.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-2411) Standalone Master - direct users to turn on event logs


 [ 
https://issues.apache.org/jira/browse/SPARK-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-2411:
-

Attachment: Master event logs.png

 Standalone Master - direct users to turn on event logs
 --

 Key: SPARK-2411
 URL: https://issues.apache.org/jira/browse/SPARK-2411
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Or
 Attachments: Master event logs.png


 Right now if the user attempts to click on a finished application's UI, it 
 simply refreshes. This is simply because the event logs are not there, in 
 which case we set the href=.
 We could provide more information by pointing them to configure 
 spark.eventLog.enabled if they click on the empty link.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Created] (SPARK-2440) Enable HistoryServer to display lots of Application History

Kousuke Saruta created SPARK-2440:
-

 Summary: Enable HistoryServer to display lots of Application 
History
 Key: SPARK-2440
 URL: https://issues.apache.org/jira/browse/SPARK-2440
 Project: Spark
  Issue Type: Improvement
Reporter: Kousuke Saruta
 Fix For: 1.0.0


In current implementation of HistoryServer, it can display 250 records by 
default.
Sometimes we'd like to see over 250 records and configure to be able to list 
more records, but current implementation lists all the records just in one 
page. This is not useful.

And to make matters worse, initial launch of HistoryServer is very slowly. 





--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-944) Give example of writing to HBase from Spark Streaming


 [ 
https://issues.apache.org/jira/browse/SPARK-944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-944:


Assignee: Ted Malaska  (was: Tathagata Das)

 Give example of writing to HBase from Spark Streaming
 -

 Key: SPARK-944
 URL: https://issues.apache.org/jira/browse/SPARK-944
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Patrick Wendell
Assignee: Ted Malaska
 Attachments: MetricAggregatorHBase.scala






--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Issue Comment Deleted] (SPARK-1667) Jobs never finish successfully once bucket file missing occurred


 [ 
https://issues.apache.org/jira/browse/SPARK-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-1667:
--

Comment: was deleted

(was: Now I'm trying to address this issue.)

 Jobs never finish successfully once bucket file missing occurred
 

 Key: SPARK-1667
 URL: https://issues.apache.org/jira/browse/SPARK-1667
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 1.0.0
Reporter: Kousuke Saruta

 If jobs execute shuffle, bucket files are created in a temporary directory 
 (named like spark-local-*).
 When the bucket files are missing cased by disk failure or any reasons, jobs 
 cannot execute shuffle which has same shuffle id for the bucket files.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-1667) Jobs never finish successfully once bucket file missing occurred


 [ 
https://issues.apache.org/jira/browse/SPARK-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-1667:
--

Description: 
If jobs execute shuffle, bucket files are created in a temporary directory 
(named like spark-local-*).
When the bucket files are missing cased by disk failure or any reasons, jobs 
cannot execute shuffle which has same shuffle id for the bucket files.

  was:
I met a case that re-fetch wouldn't occur although that should occur.
When intermediate data (phisical file of intermediate data on local file 
system) which is used for shuffle is lost from a Executor, 
FileNotFoundException was thrown and refetch wouldn't occur.


 Jobs never finish successfully once bucket file missing occurred
 

 Key: SPARK-1667
 URL: https://issues.apache.org/jira/browse/SPARK-1667
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 1.0.0
Reporter: Kousuke Saruta

 If jobs execute shuffle, bucket files are created in a temporary directory 
 (named like spark-local-*).
 When the bucket files are missing cased by disk failure or any reasons, jobs 
 cannot execute shuffle which has same shuffle id for the bucket files.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Resolved] (SPARK-2427) Fix some of the Scala examples


 [ 
https://issues.apache.org/jira/browse/SPARK-2427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-2427.


   Resolution: Fixed
Fix Version/s: 1.0.2
   1.1.0

 Fix some of the Scala examples
 --

 Key: SPARK-2427
 URL: https://issues.apache.org/jira/browse/SPARK-2427
 Project: Spark
  Issue Type: Bug
  Components: Examples
Affects Versions: 1.0.0
Reporter: Constantin Ahlmann
 Fix For: 1.1.0, 1.0.2


 The Scala examples HBaseTest and HdfsTest don't use the correct indexes for 
 the command line arguments. This due to to the fix of JIRA 1565, where these 
 examples were not correctly adapted to the new usage of the submit script.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-1341) Ability to control the data rate in Spark Streaming


[ 
https://issues.apache.org/jira/browse/SPARK-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058091#comment-14058091
 ] 

Tathagata Das commented on SPARK-1341:
--

Thanks Isaac. This has been merged, as is going to be very useful!

 Ability to control the data rate in Spark Streaming 
 

 Key: SPARK-1341
 URL: https://issues.apache.org/jira/browse/SPARK-1341
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Tathagata Das
 Fix For: 1.1.0, 1.0.2


 To be able to control the rate at which data is received through Spark 
 Streaming's receivers.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Comment Edited] (SPARK-1341) Ability to control the data rate in Spark Streaming


[ 
https://issues.apache.org/jira/browse/SPARK-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058091#comment-14058091
 ] 

Tathagata Das edited comment on SPARK-1341 at 7/10/14 11:05 PM:


Thanks Isaac. This has been merged, and is going to be very useful!


was (Author: tdas):
Thanks Isaac. This has been merged, as is going to be very useful!

 Ability to control the data rate in Spark Streaming 
 

 Key: SPARK-1341
 URL: https://issues.apache.org/jira/browse/SPARK-1341
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Tathagata Das
 Fix For: 1.1.0, 1.0.2


 To be able to control the rate at which data is received through Spark 
 Streaming's receivers.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Created] (SPARK-2442) Add a Hadoop Writable serializer

2014-07-10 Thread Hari Shreedharan (JIRA)

Hari Shreedharan created SPARK-2442:
---

 Summary: Add a Hadoop Writable serializer
 Key: SPARK-2442
 URL: https://issues.apache.org/jira/browse/SPARK-2442
 Project: Spark
  Issue Type: Bug
Reporter: Hari Shreedharan


Using data read from hadoop files in shuffles can cause exceptions with the 
following stacktrace:
{code}
java.io.NotSerializableException: org.apache.hadoop.io.BytesWritable
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1181)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1541)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1506)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1429)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1175)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
at 
org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:179)
at 
org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:161)
at 
org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:158)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:679)
{code}

This though seems to go away if Kyro serializer is used. I am wondering if 
adding a Hadoop-writables friendly serializer makes sense as it is likely to 
perform better than Kyro without registration, since Writables don't implement 
Serializable - so the serialization might not be the most efficient.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-1853) Show Streaming application code context (file, line number) in Spark Stages UI

2014-07-10 Thread Mubarak Seyed (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058151#comment-14058151
 ] 

Mubarak Seyed commented on SPARK-1853:
--

[~tdas]
Seems like some bug in {code}Utils#getCallSite{code} Testing a fix. Will update 
you shortly. Thanks

 Show Streaming application code context (file, line number) in Spark Stages UI
 --

 Key: SPARK-1853
 URL: https://issues.apache.org/jira/browse/SPARK-1853
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.0.0
Reporter: Tathagata Das
Assignee: Mubarak Seyed
 Fix For: 1.1.0

 Attachments: Screen Shot 2014-07-03 at 2.54.05 PM.png


 Right now, the code context (file, and line number) shown for streaming jobs 
 in stages UI is meaningless as it refers to internal DStream:random line 
 rather than user application file.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-1853) Show Streaming application code context (file, line number) in Spark Stages UI


[ 
https://issues.apache.org/jira/browse/SPARK-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058159#comment-14058159
 ] 

Tathagata Das commented on SPARK-1853:
--

I dont think it is a bug in that. This is an artifact of the Spark Streaming's 
execution model -- all the DStreams are initially defined in one thread, and 
then RDDs and jobs are created and executed them in another thread. So when the 
thread creating RDDs calls Utils#getCallSite (whic looks at the stack trace to 
figure out the user code that in the call stack), it cannot find any reference 
to the user program as it is not called from the user program. The correct 
solution will probably require each DStream's to store its own call site info 
(where they were defined), and set it explicitly on every RDD that gets created 
by it. 



 Show Streaming application code context (file, line number) in Spark Stages UI
 --

 Key: SPARK-1853
 URL: https://issues.apache.org/jira/browse/SPARK-1853
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.0.0
Reporter: Tathagata Das
Assignee: Mubarak Seyed
 Fix For: 1.1.0

 Attachments: Screen Shot 2014-07-03 at 2.54.05 PM.png


 Right now, the code context (file, and line number) shown for streaming jobs 
 in stages UI is meaningless as it refers to internal DStream:random line 
 rather than user application file.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Comment Edited] (SPARK-2345) ForEachDStream should have an option of running the foreachfunc on Spark

[
https://issues.apache.org/jira/browse/SPARK-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057824#comment-14057824
]

Tathagata Das edited comment on SPARK-2345 at 7/11/14 12:39 AM:

I must be missing something. If you want to run a Spark job inside the
DStream.foreach function, you can just call any RDD action (count, collect,
first, saveAsFile, etc.) inside that foreach function. context.runJob does
not need to be called *explicitly*. All of the exisinting RDD actions
(including the most generic rdd.foreachPartition) should be mostly sufficient
for all requirements.

Maybe adding a code example of what you intend to do will help us disambiguate
this?

was (Author: tdas):
I must be missing something. If you want to run a Spark job inside the
DStream.foreach function, you can just call any RDD action (count, collect,
first, saveAs***File, etc.) inside that foreach function. context.runJob does
not need to be called *explicitly*. All of the exisinting RDD actions
(including the most generic rdd.foreachPartition) should be mostly sufficient
for all requirements.

Maybe adding a code example of what you intend to do will help us disambiguate
this?

ForEachDStream should have an option of running the foreachfunc on Spark

Key: SPARK-2345
URL: https://issues.apache.org/jira/browse/SPARK-2345
Project: Spark
Issue Type: Bug
Components: Streaming
Reporter: Hari Shreedharan

Today the Job generated simply calls the foreachfunc, but does not run it on
spark itself using the sparkContext.runJob method.

--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Comment Edited] (SPARK-2345) ForEachDStream should have an option of running the foreachfunc on Spark


[ 
https://issues.apache.org/jira/browse/SPARK-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058165#comment-14058165
 ] 

Tathagata Das edited comment on SPARK-2345 at 7/11/14 12:44 AM:


Then what I suggested earlier makes sense, doesnt it? 

   DStream.foreachRDDPartition[T](function: Iterator[T] = Unit) 

This will effectively rdd.foreachPartition(function) on every RDD generated by 
the DStream.

RDD.saveAsHadoopFile / RDD.newAPIHadoopFile should support any arbitrary 
OutputFormat that works with HDFS API.


was (Author: tdas):
Then what I suggested earlier makes sense, doesnt it? 

  DStream.foreachRDDPartition[T](function: Iterator[T] = Unit) 

This will effectively rdd.foreachPartition(function) on every RDD generated by 
the DStream.

RDD.saveAsHadoopFile / RDD.newAPIHadoopFile should support any arbitrary 
OutputFormat that works with HDFS API.

 ForEachDStream should have an option of running the foreachfunc on Spark
 

 Key: SPARK-2345
 URL: https://issues.apache.org/jira/browse/SPARK-2345
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Hari Shreedharan

 Today the Job generated simply calls the foreachfunc, but does not run it on 
 spark itself using the sparkContext.runJob method.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-2345) ForEachDStream should have an option of running the foreachfunc on Spark


[ 
https://issues.apache.org/jira/browse/SPARK-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058165#comment-14058165
 ] 

Tathagata Das commented on SPARK-2345:
--

Then what I suggested earlier makes sense, doesnt it? 

  DStream.foreachRDDPartition[T](function: Iterator[T] = Unit) 

This will effectively rdd.foreachPartition(function) on every RDD generated by 
the DStream.

RDD.saveAsHadoopFile / RDD.newAPIHadoopFile should support any arbitrary 
OutputFormat that works with HDFS API.

 ForEachDStream should have an option of running the foreachfunc on Spark
 

 Key: SPARK-2345
 URL: https://issues.apache.org/jira/browse/SPARK-2345
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Hari Shreedharan

 Today the Job generated simply calls the foreachfunc, but does not run it on 
 spark itself using the sparkContext.runJob method.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Comment Edited] (SPARK-2345) ForEachDStream should have an option of running the foreachfunc on Spark


[ 
https://issues.apache.org/jira/browse/SPARK-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058165#comment-14058165
 ] 

Tathagata Das edited comment on SPARK-2345 at 7/11/14 12:45 AM:


Then what I suggested earlier makes sense, doesnt it? 

   DStream.foreachRDDPartition[T](function: Iterator[T] = Unit) 

This will effectively execute rdd.foreachPartition(function) on every RDD 
generated by the DStream.

RDD.saveAsHadoopFile / RDD.newAPIHadoopFile should support any arbitrary 
OutputFormat that works with HDFS API.


was (Author: tdas):
Then what I suggested earlier makes sense, doesnt it? 

   DStream.foreachRDDPartition[T](function: Iterator[T] = Unit) 

This will effectively rdd.foreachPartition(function) on every RDD generated by 
the DStream.

RDD.saveAsHadoopFile / RDD.newAPIHadoopFile should support any arbitrary 
OutputFormat that works with HDFS API.

 ForEachDStream should have an option of running the foreachfunc on Spark
 

 Key: SPARK-2345
 URL: https://issues.apache.org/jira/browse/SPARK-2345
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Hari Shreedharan

 Today the Job generated simply calls the foreachfunc, but does not run it on 
 spark itself using the sparkContext.runJob method.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-2443) Reading from Partitioned Tables is Slow

2014-07-10 Thread Michael Armbrust (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2443:


Description: 
Here are some numbers, all queries return ~20million:

{code}
SELECT COUNT(*) FROM non partitioned table
5.496467726 s

SELECT COUNT(*) FROM partitioned table stored in parquet
50.26947 s

SELECT COUNT(*) FROM same table as previous but loaded with parquetFile 
instead of through hive
2s
{code}

  was:
Here are some numbers, all queries return ~20million:

SELECT COUNT(*) FROM non partitioned table
5.496467726 s

SELECT COUNT(*) FROM partitioned table stored in parquet
50.26947 s

SELECT COUNT(*) FROM same table as previous but loaded with parquetFile 
instead of through hive
2s


 Reading from Partitioned Tables is Slow
 ---

 Key: SPARK-2443
 URL: https://issues.apache.org/jira/browse/SPARK-2443
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Zongheng Yang

 Here are some numbers, all queries return ~20million:
 {code}
 SELECT COUNT(*) FROM non partitioned table
 5.496467726 s
 SELECT COUNT(*) FROM partitioned table stored in parquet
 50.26947 s
 SELECT COUNT(*) FROM same table as previous but loaded with parquetFile 
 instead of through hive
 2s
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-2298) Show stage attempt in UI

2014-07-10 Thread Masayoshi TSUZUKI (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058177#comment-14058177
 ] 

Masayoshi TSUZUKI commented on SPARK-2298:
--

Thank you for your reply. Yes, I'm interested in it!

 Show stage attempt in UI
 

 Key: SPARK-2298
 URL: https://issues.apache.org/jira/browse/SPARK-2298
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Reporter: Reynold Xin
Assignee: Andrew Or
 Attachments: Screen Shot 2014-06-25 at 4.54.46 PM.png


 We should add a column to the web ui to show stage attempt id. Then tasks 
 should be grouped by (stageId, stageAttempt) tuple.
 When a stage is resubmitted (e.g. due to fetch failures), we should get a 
 different entry in the web ui and tasks for the resubmission go there.
 See the attached screenshot for the confusing status quo. We currently show 
 the same stage entry twice, and then tasks appear in both. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-944) Give example of writing to HBase from Spark Streaming

2014-07-10 Thread Ted Malaska (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058184#comment-14058184
 ] 

Ted Malaska commented on SPARK-944:
---

Thank you for the assign.

I will start working on this tomorrow.  We will review our approach to HBase 
and Spark and I will write the patch over the weekend.

Thanks again.

 Give example of writing to HBase from Spark Streaming
 -

 Key: SPARK-944
 URL: https://issues.apache.org/jira/browse/SPARK-944
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Patrick Wendell
Assignee: Ted Malaska
 Attachments: MetricAggregatorHBase.scala






--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-1853) Show Streaming application code context (file, line number) in Spark Stages UI

2014-07-10 Thread Mubarak Seyed (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058195#comment-14058195
 ] 

Mubarak Seyed commented on SPARK-1853:
--

I just added _streaming.dstream_ in regex and it filters the DStream code 
context but still shows other calls such as _apply at Option.scala, apply at 
List.scala_

{code}
private val SPARK_CLASS_REGEX = 
^org\.apache\.spark(\.api\.java)?(\.util)?(\.rdd)?(\.streaming\.dstream)?\.[A-Z].r
{code}

You are correct as we can't just keep adding regex for internal code. I will 
start work on the solution as you advised. Thanks

 Show Streaming application code context (file, line number) in Spark Stages UI
 --

 Key: SPARK-1853
 URL: https://issues.apache.org/jira/browse/SPARK-1853
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.0.0
Reporter: Tathagata Das
Assignee: Mubarak Seyed
 Fix For: 1.1.0

 Attachments: Screen Shot 2014-07-03 at 2.54.05 PM.png


 Right now, the code context (file, and line number) shown for streaming jobs 
 in stages UI is meaningless as it refers to internal DStream:random line 
 rather than user application file.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Created] (SPARK-2444) Make spark.yarn.executor.memoryOverhead a first class citizen

2014-07-10 Thread Nishkam Ravi (JIRA)

Nishkam Ravi created SPARK-2444:
---

 Summary: Make spark.yarn.executor.memoryOverhead a first class 
citizen
 Key: SPARK-2444
 URL: https://issues.apache.org/jira/browse/SPARK-2444
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.0.0
Reporter: Nishkam Ravi


Higher value of spark.yarn.executor.memoryOverhead is critical to running Spark 
applications on Yarn (https://issues.apache.org/jira/browse/SPARK-2398) at 
least for 1.0. It would be great to have this parameter highlighted in the 
docs/usage. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Resolved] (SPARK-2415) RowWriteSupport should handle empty ArrayType correctly.

2014-07-10 Thread Michael Armbrust (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2415.
-

   Resolution: Fixed
Fix Version/s: 1.0.2
   1.1.0

 RowWriteSupport should handle empty ArrayType correctly.
 

 Key: SPARK-2415
 URL: https://issues.apache.org/jira/browse/SPARK-2415
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takuya Ueshin
Assignee: Takuya Ueshin
 Fix For: 1.1.0, 1.0.2


 {{RowWriteSupport}} doesn't write empty {{ArrayType}} value, so the read 
 value becomes {{null}}.
 It should write empty {{ArrayType}} value as it is.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Resolved] (SPARK-2428) Add except and intersect methods to SchemaRDD.

2014-07-10 Thread Michael Armbrust (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2428.
-

   Resolution: Fixed
Fix Version/s: 1.1.0
 Assignee: Takuya Ueshin

 Add except and intersect methods to SchemaRDD.
 --

 Key: SPARK-2428
 URL: https://issues.apache.org/jira/browse/SPARK-2428
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Takuya Ueshin
Assignee: Takuya Ueshin
 Fix For: 1.1.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Comment Edited] (SPARK-2398) Trouble running Spark 1.0 on Yarn

2014-07-10 Thread Guoqiang Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058298#comment-14058298
 ] 

Guoqiang Li edited comment on SPARK-2398 at 7/11/14 3:16 AM:
-

Q1. {{-Xmx}} is only the heap space,  {{native library}} and 
{{sun.misc.Unsafe}} can easily allocate memory outside Java heap.
reference  
http://stackoverflow.com/questions/6527131/java-using-more-memory-than-the-allocated-memory.
Q2. This is not a bug. We can disable this check by setting 
{{yarn.nodemanager.pmem-check-enabled}} to {{false}} .  Its default value is  
{{true}}
 in 
[yarn-default.xml|http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml]


was (Author: gq):
Q1. {{-Xmx}} is only the heap space,  {{native library}} and 
{{sun.misc.Unsafe}} can easily allocate memory outside Java heap.
reference  
http://stackoverflow.com/questions/6527131/java-using-more-memory-than-the-allocated-memory.
Q2. This is not a bug. We can disable this check by setting 
{{yarn.nodemanager.pmem-check-enabled}} to false.  Its default value is  
{{true}}
 in 
[yarn-default.xml|http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml]

 Trouble running Spark 1.0 on Yarn 
 --

 Key: SPARK-2398
 URL: https://issues.apache.org/jira/browse/SPARK-2398
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Nishkam Ravi

 Trouble running workloads in Spark-on-YARN cluster mode for Spark 1.0. 
 For example: SparkPageRank when run in standalone mode goes through without 
 any errors (tested for up to 30GB input dataset on a 6-node cluster).  Also 
 runs fine for a 1GB dataset in yarn cluster mode. Starts to choke (in yarn 
 cluster mode) as the input data size is increased. Confirmed for 16GB input 
 dataset.
 The same workload runs fine with Spark 0.9 in both standalone and yarn 
 cluster mode (for up to 30 GB input dataset on a 6-node cluster).
 Commandline used:
 (/opt/cloudera/parcels/CDH/lib/spark/bin/spark-submit --master yarn 
 --deploy-mode cluster --properties-file pagerank.conf  --driver-memory 30g 
 --driver-cores 16 --num-executors 5 --class 
 org.apache.spark.examples.SparkPageRank 
 /opt/cloudera/parcels/CDH/lib/spark/examples/lib/spark-examples_2.10-1.0.0-cdh5.1.0-SNAPSHOT.jar
  pagerank_in $NUM_ITER)
 pagerank.conf:
 spark.masterspark://c1704.halxg.cloudera.com:7077
 spark.home  /opt/cloudera/parcels/CDH/lib/spark
 spark.executor.memory   32g
 spark.default.parallelism   118
 spark.cores.max 96
 spark.storage.memoryFraction0.6
 spark.shuffle.memoryFraction0.3
 spark.shuffle.compress  true
 spark.shuffle.spill.compresstrue
 spark.broadcast.compresstrue
 spark.rdd.compress  false
 spark.io.compression.codec  org.apache.spark.io.LZFCompressionCodec
 spark.io.compression.snappy.block.size  32768
 spark.reducer.maxMbInFlight 48
 spark.local.dir  /var/lib/jenkins/workspace/tmp
 spark.driver.memory 30g
 spark.executor.cores16
 spark.locality.wait 6000
 spark.executor.instances5
 UI shows ExecutorLostFailure. Yarn logs contain numerous exceptions:
 14/07/07 17:59:49 WARN network.SendingConnection: Error writing in connection 
 to ConnectionManagerId(a1016.halxg.cloudera.com,54105)
 java.nio.channels.AsynchronousCloseException
 at 
 java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:205)
 at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:496)
 at 
 org.apache.spark.network.SendingConnection.write(Connection.scala:361)
 at 
 org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:142)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 
 java.io.IOException: Filesystem closed
 at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
 at 
 org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619)
 at java.io.FilterInputStream.close(FilterInputStream.java:181)
 at org.apache.hadoop.util.LineReader.close(LineReader.java:150)
 at 
 org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:244)
 at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226)
 at 
 org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
 at 
 org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197)
 at 
 org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
 at

[jira] [Commented] (SPARK-2398) Trouble running Spark 1.0 on Yarn

2014-07-10 Thread Guoqiang Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058298#comment-14058298
 ] 

Guoqiang Li commented on SPARK-2398:


Q1. {{-Xmx}} is only the heap space,  {{native library}} and 
{{sun.misc.Unsafe}} can easily allocate memory outside Java heap.
reference  
http://stackoverflow.com/questions/6527131/java-using-more-memory-than-the-allocated-memory.
Q2. This is not a bug. We can disable this check by setting 
{{yarn.nodemanager.pmem-check-enabled}} to false.  Its default value is  
{{true}}
 in 
[yarn-default.xml|http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml]

 Trouble running Spark 1.0 on Yarn 
 --

 Key: SPARK-2398
 URL: https://issues.apache.org/jira/browse/SPARK-2398
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Nishkam Ravi

 Trouble running workloads in Spark-on-YARN cluster mode for Spark 1.0. 
 For example: SparkPageRank when run in standalone mode goes through without 
 any errors (tested for up to 30GB input dataset on a 6-node cluster).  Also 
 runs fine for a 1GB dataset in yarn cluster mode. Starts to choke (in yarn 
 cluster mode) as the input data size is increased. Confirmed for 16GB input 
 dataset.
 The same workload runs fine with Spark 0.9 in both standalone and yarn 
 cluster mode (for up to 30 GB input dataset on a 6-node cluster).
 Commandline used:
 (/opt/cloudera/parcels/CDH/lib/spark/bin/spark-submit --master yarn 
 --deploy-mode cluster --properties-file pagerank.conf  --driver-memory 30g 
 --driver-cores 16 --num-executors 5 --class 
 org.apache.spark.examples.SparkPageRank 
 /opt/cloudera/parcels/CDH/lib/spark/examples/lib/spark-examples_2.10-1.0.0-cdh5.1.0-SNAPSHOT.jar
  pagerank_in $NUM_ITER)
 pagerank.conf:
 spark.masterspark://c1704.halxg.cloudera.com:7077
 spark.home  /opt/cloudera/parcels/CDH/lib/spark
 spark.executor.memory   32g
 spark.default.parallelism   118
 spark.cores.max 96
 spark.storage.memoryFraction0.6
 spark.shuffle.memoryFraction0.3
 spark.shuffle.compress  true
 spark.shuffle.spill.compresstrue
 spark.broadcast.compresstrue
 spark.rdd.compress  false
 spark.io.compression.codec  org.apache.spark.io.LZFCompressionCodec
 spark.io.compression.snappy.block.size  32768
 spark.reducer.maxMbInFlight 48
 spark.local.dir  /var/lib/jenkins/workspace/tmp
 spark.driver.memory 30g
 spark.executor.cores16
 spark.locality.wait 6000
 spark.executor.instances5
 UI shows ExecutorLostFailure. Yarn logs contain numerous exceptions:
 14/07/07 17:59:49 WARN network.SendingConnection: Error writing in connection 
 to ConnectionManagerId(a1016.halxg.cloudera.com,54105)
 java.nio.channels.AsynchronousCloseException
 at 
 java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:205)
 at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:496)
 at 
 org.apache.spark.network.SendingConnection.write(Connection.scala:361)
 at 
 org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:142)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 
 java.io.IOException: Filesystem closed
 at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
 at 
 org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619)
 at java.io.FilterInputStream.close(FilterInputStream.java:181)
 at org.apache.hadoop.util.LineReader.close(LineReader.java:150)
 at 
 org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:244)
 at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226)
 at 
 org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
 at 
 org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197)
 at 
 org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
 at 
 org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 
 org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97)
 at

[jira] [Comment Edited] (SPARK-2398) Trouble running Spark 1.0 on Yarn

2014-07-10 Thread Guoqiang Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058298#comment-14058298
 ] 

Guoqiang Li edited comment on SPARK-2398 at 7/11/14 3:17 AM:
-

Q1. {{-Xmx}} is only the heap space,  {{native library}} and 
{{sun.misc.Unsafe}} can easily allocate memory outside Java heap.
reference  
http://stackoverflow.com/questions/6527131/java-using-more-memory-than-the-allocated-memory.
Q2. This is not a bug. We can disable this check by setting 
{{yarn.nodemanager.pmem-check-enabled}} to {{false}} .  Its default value is  
{{true}} in 
[yarn-default.xml|http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml]


was (Author: gq):
Q1. {{-Xmx}} is only the heap space,  {{native library}} and 
{{sun.misc.Unsafe}} can easily allocate memory outside Java heap.
reference  
http://stackoverflow.com/questions/6527131/java-using-more-memory-than-the-allocated-memory.
Q2. This is not a bug. We can disable this check by setting 
{{yarn.nodemanager.pmem-check-enabled}} to {{false}} .  Its default value is  
{{true}}
 in 
[yarn-default.xml|http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml]

 Trouble running Spark 1.0 on Yarn 
 --

 Key: SPARK-2398
 URL: https://issues.apache.org/jira/browse/SPARK-2398
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Nishkam Ravi

 Trouble running workloads in Spark-on-YARN cluster mode for Spark 1.0. 
 For example: SparkPageRank when run in standalone mode goes through without 
 any errors (tested for up to 30GB input dataset on a 6-node cluster).  Also 
 runs fine for a 1GB dataset in yarn cluster mode. Starts to choke (in yarn 
 cluster mode) as the input data size is increased. Confirmed for 16GB input 
 dataset.
 The same workload runs fine with Spark 0.9 in both standalone and yarn 
 cluster mode (for up to 30 GB input dataset on a 6-node cluster).
 Commandline used:
 (/opt/cloudera/parcels/CDH/lib/spark/bin/spark-submit --master yarn 
 --deploy-mode cluster --properties-file pagerank.conf  --driver-memory 30g 
 --driver-cores 16 --num-executors 5 --class 
 org.apache.spark.examples.SparkPageRank 
 /opt/cloudera/parcels/CDH/lib/spark/examples/lib/spark-examples_2.10-1.0.0-cdh5.1.0-SNAPSHOT.jar
  pagerank_in $NUM_ITER)
 pagerank.conf:
 spark.masterspark://c1704.halxg.cloudera.com:7077
 spark.home  /opt/cloudera/parcels/CDH/lib/spark
 spark.executor.memory   32g
 spark.default.parallelism   118
 spark.cores.max 96
 spark.storage.memoryFraction0.6
 spark.shuffle.memoryFraction0.3
 spark.shuffle.compress  true
 spark.shuffle.spill.compresstrue
 spark.broadcast.compresstrue
 spark.rdd.compress  false
 spark.io.compression.codec  org.apache.spark.io.LZFCompressionCodec
 spark.io.compression.snappy.block.size  32768
 spark.reducer.maxMbInFlight 48
 spark.local.dir  /var/lib/jenkins/workspace/tmp
 spark.driver.memory 30g
 spark.executor.cores16
 spark.locality.wait 6000
 spark.executor.instances5
 UI shows ExecutorLostFailure. Yarn logs contain numerous exceptions:
 14/07/07 17:59:49 WARN network.SendingConnection: Error writing in connection 
 to ConnectionManagerId(a1016.halxg.cloudera.com,54105)
 java.nio.channels.AsynchronousCloseException
 at 
 java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:205)
 at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:496)
 at 
 org.apache.spark.network.SendingConnection.write(Connection.scala:361)
 at 
 org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:142)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 
 java.io.IOException: Filesystem closed
 at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
 at 
 org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619)
 at java.io.FilterInputStream.close(FilterInputStream.java:181)
 at org.apache.hadoop.util.LineReader.close(LineReader.java:150)
 at 
 org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:244)
 at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226)
 at 
 org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
 at 
 org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197)
 at 
 org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
 at

[jira] [Comment Edited] (SPARK-911) Support map pruning on sorted (K, V) RDD's

2014-07-10 Thread Aaron (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058322#comment-14058322
 ] 

Aaron edited comment on SPARK-911 at 7/11/14 3:50 AM:
--

I decided to take a look at this myself (disclaimer haven't worked on sparks 
source code before). A couple things I noticed were:  
1. since compute is called per split for all RDD's, the best I could figure out 
using that approach is to just return an empty iterator, not sure if that 
creates a weird partition situation
2. you could more efficiently check the partitions that could contain correct 
numbers, but with only an iterator you have to linearly search


was (Author: aaronjosephs):
I decided to take a look at this myself (disclaimer haven't worked on sparks 
source code before). A couple things I noticed were:  
1. since compute is called per split for all RDD's, the best I could figure out 
using that approach is to just return an empty iterator
2. you could more efficiently check the partitions that could contain correct 
numbers, but with only an iterator you have to linearly search

 Support map pruning on sorted (K, V) RDD's
 --

 Key: SPARK-911
 URL: https://issues.apache.org/jira/browse/SPARK-911
 Project: Spark
  Issue Type: Bug
Reporter: Patrick Wendell

 If someone has sorted a (K, V) rdd, we should offer them a way to filter a 
 range of the partitions that employs map pruning. This would be simple using 
 a small range index within the rdd itself. A good example is I sort my 
 dataset by time and then I want to serve queries that are restricted to a 
 certain time range.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-911) Support map pruning on sorted (K, V) RDD's

2014-07-10 Thread Aaron (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058322#comment-14058322
 ] 

Aaron commented on SPARK-911:
-

I decided to take a look at this myself (disclaimer haven't worked on sparks 
source code before). A couple things I noticed were:  
1. since compute is called per split for all RDD's, the best I could figure out 
using that approach is to just return an empty iterator
2. you could more efficiently check the partitions that could contain correct 
numbers, but with only an iterator you have to linearly search

 Support map pruning on sorted (K, V) RDD's
 --

 Key: SPARK-911
 URL: https://issues.apache.org/jira/browse/SPARK-911
 Project: Spark
  Issue Type: Bug
Reporter: Patrick Wendell

 If someone has sorted a (K, V) rdd, we should offer them a way to filter a 
 range of the partitions that employs map pruning. This would be simple using 
 a small range index within the rdd itself. A good example is I sort my 
 dataset by time and then I want to serve queries that are restricted to a 
 certain time range.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Resolved] (SPARK-2358) Add an option to include native BLAS/LAPACK loader in the build