[jira] [Created] (SPARK-2427) Fix some of the Scala examples
Constantin Ahlmann created SPARK-2427: - Summary: Fix some of the Scala examples Key: SPARK-2427 URL: https://issues.apache.org/jira/browse/SPARK-2427 Project: Spark Issue Type: Bug Components: Examples Affects Versions: 1.0.0 Reporter: Constantin Ahlmann The Scala examples HBaseTest and HdfsTest don't use the correct indexes for the command line arguments. This is due to the fix for SPARK-1565, where these examples were not correctly adapted to the new usage of the submit script. -- This message was sent by Atlassian JIRA (v6.2#6252)
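A minimal sketch of the kind of change involved, assuming the example previously read its input path from args(1) because args(0) used to hold the master URL before the submit-script change (this is illustrative, not the actual patch):
{code}
import org.apache.spark.{SparkConf, SparkContext}

object HdfsTest {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HdfsTest"))
    // With spark-submit, the input path is the first user-supplied argument again,
    // so the example should read args(0) rather than args(1).
    val file = sc.textFile(args(0))
    println("Lines: " + file.count())
    sc.stop()
  }
}
{code}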
[jira] [Resolved] (SPARK-2409) Make SQLConf thread safe
[ https://issues.apache.org/jira/browse/SPARK-2409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2409. Resolution: Fixed Fix Version/s: 1.1.0 Make SQLConf thread safe Key: SPARK-2409 URL: https://issues.apache.org/jira/browse/SPARK-2409 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2360) CSV import to SchemaRDDs
[ https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2360: --- Issue Type: New Feature (was: Bug) CSV import to SchemaRDDs Key: SPARK-2360 URL: https://issues.apache.org/jira/browse/SPARK-2360 Project: Spark Issue Type: New Feature Components: SQL Reporter: Michael Armbrust Assignee: Hossein Falaki Priority: Minor I think the first step is to design the interface that we want to present to users. Mostly this is defining options when importing. Off the top of my head: - What is the separator? - Provide column names or infer them from the first row. - How to handle multiple files with possibly different schemas? - Do we have a method to let users specify the datatypes of the columns, or are they just strings? - What types of quoting / escaping do we want to support? -- This message was sent by Atlassian JIRA (v6.2#6252)
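One way to make these design questions concrete is an options container; the names below are purely illustrative and not a proposed API:
{code}
// Hypothetical sketch of an import interface covering the questions above.
case class CsvOptions(
    separator: Char = ',',        // what is the separator?
    useHeader: Boolean = false,   // infer column names from the first row?
    quote: Char = '"',            // quoting character
    escape: Char = '\\')          // escaping character

// Illustrative call site (the method name is made up):
//   sqlContext.csvFile("hdfs://nn/data/*.csv", CsvOptions(useHeader = true))
{code}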
[jira] [Resolved] (SPARK-2115) Stage kill link is too close to stage details link
[ https://issues.apache.org/jira/browse/SPARK-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2115. Resolution: Fixed Fix Version/s: 1.1.0 Assignee: Masayoshi TSUZUKI Stage kill link is too close to stage details link -- Key: SPARK-2115 URL: https://issues.apache.org/jira/browse/SPARK-2115 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.0.0 Reporter: Nicholas Chammas Assignee: Masayoshi TSUZUKI Priority: Trivial Fix For: 1.1.0 Attachments: Accident-prone kill link.png, Safe and user-friendly kill link.png 1.0.0 introduces a new and very helpful `kill` link for each active stage in the Web UI. The problem is that this link is awfully close to the link one would click on to see details for the stage. Without any confirmation dialog for the kill, I think it is likely that users will accidentally kill stages when all they really wanted was to see more information about them. I suggest moving the kill link to the right of the stage detail link, and having it flush right within the cell. This puts a safe amount of space between the two links while keeping the kill link in a consistent location. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2418) Custom checkpointing with an external function as parameter
[ https://issues.apache.org/jira/browse/SPARK-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057361#comment-14057361 ] András Barják commented on SPARK-2418: -- A pull request for a simple implementation using a function as parameter for checkpoint(): https://github.com/apache/spark/pull/1345 Custom checkpointing with an external function as parameter --- Key: SPARK-2418 URL: https://issues.apache.org/jira/browse/SPARK-2418 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: András Barják If a job consists of many shuffle heavy transformations the current resilience model might be unsatisfactory. In our current use-case we need a persistent checkpoint that we can use to save our RDDs on disk in a custom location and load it back even if the driver dies. (Possible other use cases: store the checkpointed data in various formats: SequenceFile, csv, Parquet file, MySQL etc.) After talking to [~pwendell] at the Spark Summit 2014 we concluded that a checkpoint where one can customize the saving and RDD reloading behavior can be a good solution. I am open to further suggestions if you have better ideas about how to make checkpointing more flexible. -- This message was sent by Atlassian JIRA (v6.2#6252)
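A rough sketch of the idea under discussion, assuming the user supplies both the save and the reload behaviour; the helper below is illustrative and is not the API proposed in the pull request:
{code}
import scala.reflect.ClassTag
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Illustrative helper: persist an RDD to a caller-chosen location/format and
// reload it later, even from a different driver.
object CustomCheckpoint {
  def save[T](rdd: RDD[T], path: String): Unit =
    rdd.saveAsObjectFile(path)     // could equally be SequenceFile, csv, Parquet, MySQL, ...

  def restore[T: ClassTag](sc: SparkContext, path: String): RDD[T] =
    sc.objectFile[T](path)         // survives a driver restart
}
{code}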
[jira] [Created] (SPARK-2428) Add except and intersect methods to SchemaRDD.
Takuya Ueshin created SPARK-2428: Summary: Add except and intersect methods to SchemaRDD. Key: SPARK-2428 URL: https://issues.apache.org/jira/browse/SPARK-2428 Project: Spark Issue Type: Improvement Components: SQL Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2429) Hierarchical Implementation of KMeans
RJ Nowling created SPARK-2429: - Summary: Hierarchical Implementation of KMeans Key: SPARK-2429 URL: https://issues.apache.org/jira/browse/SPARK-2429 Project: Spark Issue Type: New Feature Components: MLlib Reporter: RJ Nowling Priority: Minor Hierarchical clustering algorithms are widely used and would make a nice addition to MLlib. Hierarchical clustering is useful for determining relationships between clusters as well as offering faster assignment. Discussion on the dev list suggested the following possible approaches: * Top down, recursive application of KMeans * Reuse DecisionTree implementation with different objective function * Hierarchical SVD It was also suggested that support for distance metrics other than Euclidean, such as negative dot product or cosine, is necessary. -- This message was sent by Atlassian JIRA (v6.2#6252)
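As an illustration of the first approach (top-down, recursive application of KMeans), a very rough sketch on top of the existing MLlib KMeans; this is not a proposed implementation:
{code}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Bisect the data with k=2 and recurse on each half until the target depth,
// yielding up to 2^depth leaf clusters. Uses Euclidean distance, as MLlib KMeans does.
def bisectingKMeans(data: RDD[Vector], depth: Int): Seq[RDD[Vector]] = {
  if (depth == 0 || data.count() < 2) {
    Seq(data)
  } else {
    val model = KMeans.train(data, 2, 20)
    (0 until 2).flatMap { c =>
      bisectingKMeans(data.filter(v => model.predict(v) == c), depth - 1)
    }
  }
}
{code}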
[jira] [Created] (SPARK-2430) Standardized Clustering Algorithm API and Framework
RJ Nowling created SPARK-2430: - Summary: Standardized Clustering Algorithm API and Framework Key: SPARK-2430 URL: https://issues.apache.org/jira/browse/SPARK-2430 Project: Spark Issue Type: New Feature Components: MLlib Reporter: RJ Nowling Priority: Minor Recently, there has been a chorus of voices on the mailing lists about adding new clustering algorithms to MLlib. To support these additions, we should develop a common framework and API to reduce code duplication and keep the APIs consistent. At the same time, we can also expand the current API to incorporate requested features such as arbitrary distance metrics or pre-computed distance matrices. -- This message was sent by Atlassian JIRA (v6.2#6252)
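For discussion purposes only, one possible shape for such a framework; all names here are illustrative, not a proposed API:
{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical common interface: algorithms share a run/predict contract and can
// accept a pluggable distance metric instead of hard-coding Euclidean distance.
trait ClusteringModel extends Serializable {
  def predict(point: Vector): Int
}

trait ClusteringAlgorithm[M <: ClusteringModel] {
  def distance: (Vector, Vector) => Double   // e.g. Euclidean, cosine, negative dot product
  def run(data: RDD[Vector]): M
}
{code}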
[jira] [Commented] (SPARK-911) Support map pruning on sorted (K, V) RDD's
[ https://issues.apache.org/jira/browse/SPARK-911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057508#comment-14057508 ] Aaron commented on SPARK-911: - I was just wondering about this when answering http://stackoverflow.com/questions/24677180/how-do-i-select-a-range-of-elements-in-spark-rdd. I noticed that RangePartitioner stores an array of upper bounds, but I was unsure from the docs if that meant that each partition had no overlap in range. If they don't, this seems like it would be a pretty easy feature to implement. Support map pruning on sorted (K, V) RDD's -- Key: SPARK-911 URL: https://issues.apache.org/jira/browse/SPARK-911 Project: Spark Issue Type: Bug Reporter: Patrick Wendell If someone has sorted a (K, V) rdd, we should offer them a way to filter a range of the partitions that employs map pruning. This would be simple using a small range index within the rdd itself. A good example is: I sort my dataset by time and then I want to serve queries that are restricted to a certain time range. -- This message was sent by Atlassian JIRA (v6.2#6252)
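A sketch of how this could look on top of RangePartitioner; the method name is made up and this is not an existing RDD API:
{code}
import org.apache.spark.RangePartitioner
import org.apache.spark.rdd.{PartitionPruningRDD, RDD}

// If the RDD was sorted with a RangePartitioner, partitions hold disjoint key
// ranges, so only the partitions overlapping [lower, upper] need to be scanned.
def filterByRange(rdd: RDD[(Long, String)], lower: Long, upper: Long): RDD[(Long, String)] = {
  rdd.partitioner match {
    case Some(rp: RangePartitioner[_, _]) =>
      val first = rp.getPartition(lower)
      val last = rp.getPartition(upper)
      PartitionPruningRDD.create(rdd, i => i >= first && i <= last)
        .filter { case (k, _) => k >= lower && k <= upper }
    case _ =>
      rdd.filter { case (k, _) => k >= lower && k <= upper }   // no pruning possible
  }
}
{code}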
[jira] [Commented] (SPARK-1807) Modify SPARK_EXECUTOR_URI to allow for script execution in Mesos.
[ https://issues.apache.org/jira/browse/SPARK-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057511#comment-14057511 ] jay vyas commented on SPARK-1807: - Question: Are you saying the {{SPARK_EXECUTOR_URI}} should be a URI that points to a script? Or that we can replace the URI with a simple path to an executable script? Proposed Alternative: Take this extension to its logical extreme and actually deprecate the SPARK_EXECUTOR_URI parameter, and instead have a parameter {{SPARK_EXECUTOR_CLASS}}, which is invoked with any number of args, one of which could be SPARK_EXECUTOR_URI. Something like this:
{noformat}
val c = Class.forName(conf.get("SPARK_EXECUTOR_CLASS"))
val m = c.getMethod("startSpark", classOf[SparkExecutor])
m.invoke(null, conf.get("SPARK_EXECUTOR_URI"))
{noformat}
The advantage of using a class, rather than a shell script, for this is that we are guaranteed off the bat that the class provides a contract - to start Spark - rather than just a one-off script which could do any number of things. Also, it means that various classes for this can be maintained and tested over time as first-class Spark environment providers. Modify SPARK_EXECUTOR_URI to allow for script execution in Mesos. - Key: SPARK-1807 URL: https://issues.apache.org/jira/browse/SPARK-1807 Project: Spark Issue Type: Improvement Components: Mesos Affects Versions: 0.9.0 Reporter: Timothy St. Clair Modify Mesos Scheduler integration to allow SPARK_EXECUTOR_URI to be an executable script. This allows admins to launch spark in any fashion they desire, vs. just tarball fetching + implied context. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057513#comment-14057513 ] RJ Nowling commented on SPARK-2308: --- That sounds like a good idea for a test. I'll report back. Add KMeans MiniBatch clustering algorithm to MLlib -- Key: SPARK-2308 URL: https://issues.apache.org/jira/browse/SPARK-2308 Project: Spark Issue Type: New Feature Components: MLlib Reporter: RJ Nowling Priority: Minor Mini-batch is a version of KMeans that uses a randomly-sampled subset of the data points in each iteration instead of the full set of data points, improving performance (and in some cases, accuracy). The mini-batch version is compatible with the KMeans|| initialization algorithm currently implemented in MLlib. I suggest adding KMeans Mini-batch as an alternative. I'd like this to be assigned to me. -- This message was sent by Atlassian JIRA (v6.2#6252)
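For reference, a very small sketch of the mini-batch idea described above (this is not the proposed MLlib implementation): each iteration updates the centers from a random sample of the points rather than the full dataset.
{code}
import scala.util.Random
import org.apache.spark.SparkContext._   // pair-RDD functions (pre-1.3 import style)
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

def squaredDistance(a: Vector, b: Vector): Double =
  a.toArray.zip(b.toArray).map { case (x, y) => (x - y) * (x - y) }.sum

// Refine the centers using only a sampled fraction of the data per iteration.
def miniBatchKMeans(data: RDD[Vector], k: Int, iterations: Int, fraction: Double): Array[Vector] = {
  val centers = data.takeSample(withReplacement = false, num = k, seed = Random.nextLong())
  for (_ <- 1 to iterations) {
    val batch = data.sample(withReplacement = false, fraction = fraction, seed = Random.nextLong())
    val updates = batch
      .map(p => (centers.indices.minBy(i => squaredDistance(centers(i), p)), (p.toArray, 1L)))
      .reduceByKey((a, b) => (a._1.zip(b._1).map(t => t._1 + t._2), a._2 + b._2))
      .collectAsMap()
    updates.foreach { case (i, (sum, n)) => centers(i) = Vectors.dense(sum.map(_ / n)) }
  }
  centers
}
{code}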
[jira] [Updated] (SPARK-2390) Files in .sparkStaging on HDFS cannot be deleted and wastes the space of HDFS
[ https://issues.apache.org/jira/browse/SPARK-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-2390: -- Summary: Files in .sparkStaging on HDFS cannot be deleted and wastes the space of HDFS (was: Files in staging directory cannot be deleted and wastes the space of HDFS) Files in .sparkStaging on HDFS cannot be deleted and wastes the space of HDFS - Key: SPARK-2390 URL: https://issues.apache.org/jira/browse/SPARK-2390 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 1.0.1 Reporter: Kousuke Saruta When running jobs with YARN Cluster mode and using HistoryServer, the files in the staging directory cannot be deleted. HistoryServer uses the directory where the event log is written, and the directory is represented as an instance of o.a.h.f.FileSystem created by using FileSystem.get.
{code:title=FileLogger.scala}
private val fileSystem = Utils.getHadoopFileSystem(new URI(logDir))
{code}
{code:title=Utils.getHadoopFileSystem}
def getHadoopFileSystem(path: URI): FileSystem = {
  FileSystem.get(path, SparkHadoopUtil.get.newConfiguration())
}
{code}
On the other hand, ApplicationMaster has an instance named fs, which is also created using FileSystem.get.
{code:title=ApplicationMaster}
private val fs = FileSystem.get(yarnConf)
{code}
FileSystem.get returns the same cached instance when the URI passed to the method represents the same file system and the method is called by the same user. Because of this behavior, when the directory for the event log is on HDFS, fs of ApplicationMaster and fileSystem of FileLogger are the same instance. When shutting down ApplicationMaster, fileSystem.close is called in FileLogger#stop, which is invoked by SparkContext#stop indirectly.
{code:title=FileLogger.stop}
def stop() {
  hadoopDataStream.foreach(_.close())
  writer.foreach(_.close())
  fileSystem.close()
}
{code}
ApplicationMaster#cleanupStagingDir is also called by a JVM shutdown hook. In this method, fs.delete(stagingDirPath) is invoked. Because fs.delete in ApplicationMaster is called after fileSystem.close in FileLogger, fs.delete fails, and the files in the staging directory are not deleted. -- This message was sent by Atlassian JIRA (v6.2#6252)
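One direction a fix could take, sketched here only as an illustration (the actual patch may differ): give FileLogger a FileSystem object that is not shared through the Hadoop FileSystem cache, so closing it cannot affect the cached instance that ApplicationMaster later uses in cleanupStagingDir.
{code}
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

// FileSystem.newInstance bypasses the FileSystem cache, so close() on this object
// does not invalidate the cached instance returned elsewhere by FileSystem.get.
def getUncachedFileSystem(path: URI, conf: Configuration): FileSystem =
  FileSystem.newInstance(path, conf)
{code}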
[jira] [Commented] (SPARK-1807) Modify SPARK_EXECUTOR_URI to allow for script execution in Mesos.
[ https://issues.apache.org/jira/browse/SPARK-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057592#comment-14057592 ] Timothy St. Clair commented on SPARK-1807: -- I'm saying that the URI should *only* have an implied context of a tarball if it is a tarball, otherwise if it's a script then it should be able to detect and executed. I think simple extension detection should work and be backwards compatible. Modify SPARK_EXECUTOR_URI to allow for script execution in Mesos. - Key: SPARK-1807 URL: https://issues.apache.org/jira/browse/SPARK-1807 Project: Spark Issue Type: Improvement Components: Mesos Affects Versions: 0.9.0 Reporter: Timothy St. Clair Modify Mesos Scheduler integration to allow SPARK_EXECUTOR_URI to be an executable script. This allows admins to launch spark in any fashion they desire, vs. just tarball fetching + implied context. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-1807) Modify SPARK_EXECUTOR_URI to allow for script execution in Mesos.
[ https://issues.apache.org/jira/browse/SPARK-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057592#comment-14057592 ] Timothy St. Clair edited comment on SPARK-1807 at 7/10/14 3:40 PM: --- I'm saying that the URI should *only* have an implied context of a tarball if it is a tarball, otherwise if it's a script then it should be able to detect and execut. I think simple extension detection should work and be backwards compatible. was (Author: tstclair): I'm saying that the URI should *only* have an implied context of a tarball if it is a tarball, otherwise if it's a script then it should be able to detect and executed. I think simple extension detection should work and be backwards compatible. Modify SPARK_EXECUTOR_URI to allow for script execution in Mesos. - Key: SPARK-1807 URL: https://issues.apache.org/jira/browse/SPARK-1807 Project: Spark Issue Type: Improvement Components: Mesos Affects Versions: 0.9.0 Reporter: Timothy St. Clair Modify Mesos Scheduler integration to allow SPARK_EXECUTOR_URI to be an executable script. This allows admins to launch spark in any fashion they desire, vs. just tarball fetching + implied context. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-1807) Modify SPARK_EXECUTOR_URI to allow for script execution in Mesos.
[ https://issues.apache.org/jira/browse/SPARK-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057592#comment-14057592 ] Timothy St. Clair edited comment on SPARK-1807 at 7/10/14 3:40 PM: --- I'm saying that the URI should *only* have an implied context of a tarball if it is a tarball, otherwise if it's a script then it should be able to detect and execute. I think simple extension detection should work and be backwards compatible. was (Author: tstclair): I'm saying that the URI should *only* have an implied context of a tarball if it is a tarball, otherwise if it's a script then it should be able to detect and execut. I think simple extension detection should work and be backwards compatible. Modify SPARK_EXECUTOR_URI to allow for script execution in Mesos. - Key: SPARK-1807 URL: https://issues.apache.org/jira/browse/SPARK-1807 Project: Spark Issue Type: Improvement Components: Mesos Affects Versions: 0.9.0 Reporter: Timothy St. Clair Modify Mesos Scheduler integration to allow SPARK_EXECUTOR_URI to be an executable script. This allows admins to launch spark in any fashion they desire, vs. just tarball fetching + implied context. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2431) Refine StringComparison and related codes.
[ https://issues.apache.org/jira/browse/SPARK-2431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057640#comment-14057640 ] Takuya Ueshin commented on SPARK-2431: -- PRed: https://github.com/apache/spark/pull/1357 Refine StringComparison and related codes. -- Key: SPARK-2431 URL: https://issues.apache.org/jira/browse/SPARK-2431 Project: Spark Issue Type: Improvement Components: SQL Reporter: Takuya Ueshin Refine {{StringComparison}} and related codes as follows: - {{StringComparison}} could be similar to {{StringRegexExpression}} or {{CaseConversionExpression}}. - Nullability of {{StringRegexExpression}} could depend on children's nullabilities. - Add a case that the like condition includes no wildcard to {{LikeSimplification}}. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2432) Apriori algorithm for frequent itemset mining
[ https://issues.apache.org/jira/browse/SPARK-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057647#comment-14057647 ] Denis LUKOVNIKOV commented on SPARK-2432: - Can I have this assigned to me please? I already have the implementation (and the test) in Scala, just need some time to clean it up, write some comments, add documentation and make a pull request. Apriori algorithm for frequent itemset mining - Key: SPARK-2432 URL: https://issues.apache.org/jira/browse/SPARK-2432 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Denis LUKOVNIKOV A parallel implementation of the apriori algorithm. Apriori is a well-known and simple algorithm that finds frequent itemsets and lends itself perfectly for a parallel implementation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2433) The MLlib implementation for Naive Bayes in Spark 0.9.1 is having a implementation bug.
Rahul K Bhojwani created SPARK-2433: --- Summary: The MLlib implementation for Naive Bayes in Spark 0.9.1 is having a implementation bug. Key: SPARK-2433 URL: https://issues.apache.org/jira/browse/SPARK-2433 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 0.9.1 Environment: Any Reporter: Rahul K Bhojwani Don't have much experience with reporting errors. This is first time. If something is not clear please feel free to contact me (details given below) In the pyspark mllib library. Path : \spark-0.9.1\python\pyspark\mllib\classification.py Class: NaiveBayesModel Method: self.predict Earlier Implementation:
def predict(self, x):
    """Return the most likely class for a data vector x"""
    return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta), x)))
New Implementation: No:1
def predict(self, x):
    """Return the most likely class for a data vector x"""
    return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta), x)))
No:2
def predict(self, x):
    """Return the most likely class for a data vector x"""
    return numpy.argmax(self.pi + dot(x, self.theta.T))
Explanation: No:1 is correct according to me. Don't know about No:2. Error one: The matrix self.theta is of dimension [n_classes , n_features]. while the matrix x is of dimension [1 , n_features]. Taking the dot will not work as its [1, n_feature ] x [n_classes,n_features]. It will always give error: ValueError: matrices are not aligned In the commented example given in the classification.py, n_classes = n_features = 2. That's why no error. Both Implementation no.1 and Implementation no. 2 takes care of it. Error 2: As basic implementation of naive bayes is: P(class_n | sample) = count_feature_1 * P(feature_1 | class_n ) * count_feature_n * P(feature_n|class_n) * P(class_n)/(THE CONSTANT P(SAMPLE) and taking the class with max value. That's what implementation 1 is doing. In Implementation 2: Its basically class with max value : ( exp(count_feature_1) * P(feature_1 | class_n ) * exp(count_feature_n) * P(feature_n|class_n) * P(class_n)) Don't know if it gives the exact result. Thanks Rahul Bhojwani rahulbhojwani2...@gmail.com -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2433) The MLlib implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.
[ https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rahul K Bhojwani updated SPARK-2433: Summary: The MLlib implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug. (was: The MLlib implementation for Naive Bayes in Spark 0.9.1 is having a implementation bug.) The MLlib implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug. Key: SPARK-2433 URL: https://issues.apache.org/jira/browse/SPARK-2433 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 0.9.1 Environment: Any Reporter: Rahul K Bhojwani Labels: easyfix, test Original Estimate: 1h Remaining Estimate: 1h Don't have much experience with reporting errors. This is first time. If something is not clear please feel free to contact me (details given below) In the pyspark mllib library. Path : \spark-0.9.1\python\pyspark\mllib\classification.py Class: NaiveBayesModel Method: self.predict Earlier Implementation: def predict(self, x): Return the most likely class for a data vector x return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x))) New Implementation: No:1 def predict(self, x): Return the most likely class for a data vector x return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x))) No:2 def predict(self, x): Return the most likely class for a data vector x return numpy.argmax(self.pi + dot(x,self.theta.T)) Explanation: No:1 is correct according to me. Don't know about No:2. Error one: The matrix self.theta is of dimension [n_classes , n_features]. while the matrix x is of dimension [1 , n_features]. Taking the dot will not work as its [1, n_feature ] x [n_classes,n_features]. It will always give error: ValueError: matrices are not aligned In the commented example given in the classification.py, n_classes = n_features = 2. That's why no error. Both Implementation no.1 and Implementation no. 2 takes care of it. Error 2: As basic implementation of naive bayes is: P(class_n | sample) = count_feature_1 * P(feature_1 | class_n ) * count_feature_n * P(feature_n|class_n) * P(class_n)/(THE CONSTANT P(SAMPLE) and taking the class with max value. That's what implementation 1 is doing. In Implementation 2: Its basically class with max value : ( exp(count_feature_1) * P(feature_1 | class_n ) * exp(count_feature_n) * P(feature_n|class_n) * P(class_n)) Don't know if it gives the exact result. Thanks Rahul Bhojwani rahulbhojwani2...@gmail.com -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2434) Generate runtime warnings for naive implementations
[ https://issues.apache.org/jira/browse/SPARK-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2434: - Summary: Generate runtime warnings for naive implementations (was: Mark MLlib examples) Generate runtime warnings for naive implementations --- Key: SPARK-2434 URL: https://issues.apache.org/jira/browse/SPARK-2434 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.0 Reporter: Xiangrui Meng There is some example code under src/main/scala/org/apache/spark/examples: * LocalALS * LocalFileLR * LocalKMeans * LocalLP * SparkALS * SparkHdfsLR * SparkKMeans * SparkLR These provide naive implementations of some machine learning algorithms that are already covered in MLlib. It is okay to keep them because the implementations are straightforward and easy to read. However, we should generate warning messages at runtime and point users to MLlib's implementation, in case users use them in practice. -- This message was sent by Atlassian JIRA (v6.2#6252)
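A minimal sketch of the kind of warning such an example could print at startup; the wording and placement are illustrative only:
{code}
// Point users at the MLlib implementation before running the naive example.
def showWarning(): Unit = {
  System.err.println(
    """WARN: This is a naive implementation of KMeans and is given as an example only.
      |For production use, please refer to org.apache.spark.mllib.clustering.KMeans.""".stripMargin)
}
{code}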
[jira] [Commented] (SPARK-2407) Implement SQL SUBSTR() directly in Catalyst
[ https://issues.apache.org/jira/browse/SPARK-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057725#comment-14057725 ] William Benton commented on SPARK-2407: --- Here's the PR: https://github.com/apache/spark/pull/1359 Implement SQL SUBSTR() directly in Catalyst --- Key: SPARK-2407 URL: https://issues.apache.org/jira/browse/SPARK-2407 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: William Benton Assignee: William Benton Currently SQL SUBSTR/SUBSTRING() is delegated to Hive. It would be nice to implement this directly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2435) Add shutdown hook to bin/pyspark
Andrew Or created SPARK-2435: Summary: Add shutdown hook to bin/pyspark Key: SPARK-2435 URL: https://issues.apache.org/jira/browse/SPARK-2435 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Reporter: Andrew Or Fix For: 1.1.0 We currently never stop the SparkContext cleanly in bin/pyspark unless the user explicitly runs sc.stop(). This behavior is not consistent with bin/spark-shell, in which case Ctrl+D stops the SparkContext before quitting the shell. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2433) In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.
[ https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057746#comment-14057746 ] Sean Owen commented on SPARK-2433: -- Your earlier implementation is identical to new implementation 1. This does not appear to be the code in master, and I think it's only useful to propose fixes to the current version of code. In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug. Key: SPARK-2433 URL: https://issues.apache.org/jira/browse/SPARK-2433 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 0.9.1 Environment: Any Reporter: Rahul K Bhojwani Labels: easyfix, test Original Estimate: 1h Remaining Estimate: 1h Don't have much experience with reporting errors. This is first time. If something is not clear please feel free to contact me (details given below) In the pyspark mllib library. Path : \spark-0.9.1\python\pyspark\mllib\classification.py Class: NaiveBayesModel Method: self.predict Earlier Implementation: def predict(self, x): Return the most likely class for a data vector x return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x))) New Implementation: No:1 def predict(self, x): Return the most likely class for a data vector x return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x))) No:2 def predict(self, x): Return the most likely class for a data vector x return numpy.argmax(self.pi + dot(x,self.theta.T)) Explanation: No:1 is correct according to me. Don't know about No:2. Error one: The matrix self.theta is of dimension [n_classes , n_features]. while the matrix x is of dimension [1 , n_features]. Taking the dot will not work as its [1, n_feature ] x [n_classes,n_features]. It will always give error: ValueError: matrices are not aligned In the commented example given in the classification.py, n_classes = n_features = 2. That's why no error. Both Implementation no.1 and Implementation no. 2 takes care of it. Error 2: As basic implementation of naive bayes is: P(class_n | sample) = count_feature_1 * P(feature_1 | class_n ) * count_feature_n * P(feature_n|class_n) * P(class_n)/(THE CONSTANT P(SAMPLE) and taking the class with max value. That's what implementation 1 is doing. In Implementation 2: Its basically class with max value : ( exp(count_feature_1) * P(feature_1 | class_n ) * exp(count_feature_n) * P(feature_n|class_n) * P(class_n)) Don't know if it gives the exact result. Thanks Rahul Bhojwani rahulbhojwani2...@gmail.com -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1776) Have Spark's SBT build read dependencies from Maven
[ https://issues.apache.org/jira/browse/SPARK-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1776. Resolution: Fixed Issue resolved by pull request 772 [https://github.com/apache/spark/pull/772] Have Spark's SBT build read dependencies from Maven --- Key: SPARK-1776 URL: https://issues.apache.org/jira/browse/SPARK-1776 Project: Spark Issue Type: New Feature Components: Build Reporter: Patrick Wendell Assignee: Prashant Sharma Fix For: 1.1.0 We've wanted to consolidate Spark's build for a while see [here|http://mail-archives.apache.org/mod_mbox/spark-dev/201307.mbox/%3c39343fa4-3cf4-4349-99e7-2b20e1aed...@gmail.com%3E] and [here|http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Necessity-of-Maven-and-SBT-Build-in-Spark-td2315.html]. I'd like to propose using the sbt-pom-reader plug-in to allow us to keep our sbt build (for ease of development) while also holding onto our Maven build which almost all downstream packagers use. I've prototyped this a bit locally and I think it's do-able, but will require making some contributions to the sbt-pom-reader plugin. Josh Suereth who maintains both sbt and the plug-in has agreed to help merge any patches we need for this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1981) Add AWS Kinesis streaming support
[ https://issues.apache.org/jira/browse/SPARK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057763#comment-14057763 ] Jonathan Kelly commented on SPARK-1981: --- I work for Amazon, but unfortunately I don't know the correct person/team internally to contact regarding the licensing. :) I'll look into that, but I've been busy the past few days. Add AWS Kinesis streaming support - Key: SPARK-1981 URL: https://issues.apache.org/jira/browse/SPARK-1981 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Chris Fregly Assignee: Chris Fregly Add AWS Kinesis support to Spark Streaming. Initial discussion occured here: https://github.com/apache/spark/pull/223 I discussed this with Parviz from AWS recently and we agreed that I would take this over. Look for a new PR that takes into account all the feedback from the earlier PR including spark-1.0-compliant implementation, AWS-license-aware build support, tests, comments, and style guide compliance. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2433) In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.
[ https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057801#comment-14057801 ] Bertrand Dechoux commented on SPARK-2433: - A Jira ticket is the first step, the second would have been to provide a diff patch or a github pull request. And you can also write a test to prove your point and make sure that the fix will stay longer. I will second Sean : 1) work with last version (1.0) 2) you report is not clear, that's why diff patch or pull request are welcomed In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug. Key: SPARK-2433 URL: https://issues.apache.org/jira/browse/SPARK-2433 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 0.9.1 Environment: Any Reporter: Rahul K Bhojwani Labels: easyfix, test Original Estimate: 1h Remaining Estimate: 1h Don't have much experience with reporting errors. This is first time. If something is not clear please feel free to contact me (details given below) In the pyspark mllib library. Path : \spark-0.9.1\python\pyspark\mllib\classification.py Class: NaiveBayesModel Method: self.predict Earlier Implementation: def predict(self, x): Return the most likely class for a data vector x return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x))) New Implementation: No:1 def predict(self, x): Return the most likely class for a data vector x return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x))) No:2 def predict(self, x): Return the most likely class for a data vector x return numpy.argmax(self.pi + dot(x,self.theta.T)) Explanation: No:1 is correct according to me. Don't know about No:2. Error one: The matrix self.theta is of dimension [n_classes , n_features]. while the matrix x is of dimension [1 , n_features]. Taking the dot will not work as its [1, n_feature ] x [n_classes,n_features]. It will always give error: ValueError: matrices are not aligned In the commented example given in the classification.py, n_classes = n_features = 2. That's why no error. Both Implementation no.1 and Implementation no. 2 takes care of it. Error 2: As basic implementation of naive bayes is: P(class_n | sample) = count_feature_1 * P(feature_1 | class_n ) * count_feature_n * P(feature_n|class_n) * P(class_n)/(THE CONSTANT P(SAMPLE) and taking the class with max value. That's what implementation 1 is doing. In Implementation 2: Its basically class with max value : ( exp(count_feature_1) * P(feature_1 | class_n ) * exp(count_feature_n) * P(feature_n|class_n) * P(class_n)) Don't know if it gives the exact result. Thanks Rahul Bhojwani rahulbhojwani2...@gmail.com -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2433) In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.
[ https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057801#comment-14057801 ] Bertrand Dechoux edited comment on SPARK-2433 at 7/10/14 6:47 PM: -- A Jira ticket is the first step, the second would have been to provide a diff patch or a github pull request. And you can also write a test to prove your point and make sure that the fix will stay longer. I will second Sean : 1) work with last version (1.0) 2) you report is not clear, that's why diff patch or pull request are welcomed And there is a transpose() in the current implementation so I believe that the bug is actually already fixed. was (Author: bdechoux): A Jira ticket is the first step, the second would have been to provide a diff patch or a github pull request. And you can also write a test to prove your point and make sure that the fix will stay longer. I will second Sean : 1) work with last version (1.0) 2) you report is not clear, that's why diff patch or pull request are welcomed In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug. Key: SPARK-2433 URL: https://issues.apache.org/jira/browse/SPARK-2433 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 0.9.1 Environment: Any Reporter: Rahul K Bhojwani Labels: easyfix, test Original Estimate: 1h Remaining Estimate: 1h Don't have much experience with reporting errors. This is first time. If something is not clear please feel free to contact me (details given below) In the pyspark mllib library. Path : \spark-0.9.1\python\pyspark\mllib\classification.py Class: NaiveBayesModel Method: self.predict Earlier Implementation: def predict(self, x): Return the most likely class for a data vector x return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x))) New Implementation: No:1 def predict(self, x): Return the most likely class for a data vector x return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x))) No:2 def predict(self, x): Return the most likely class for a data vector x return numpy.argmax(self.pi + dot(x,self.theta.T)) Explanation: No:1 is correct according to me. Don't know about No:2. Error one: The matrix self.theta is of dimension [n_classes , n_features]. while the matrix x is of dimension [1 , n_features]. Taking the dot will not work as its [1, n_feature ] x [n_classes,n_features]. It will always give error: ValueError: matrices are not aligned In the commented example given in the classification.py, n_classes = n_features = 2. That's why no error. Both Implementation no.1 and Implementation no. 2 takes care of it. Error 2: As basic implementation of naive bayes is: P(class_n | sample) = count_feature_1 * P(feature_1 | class_n ) * count_feature_n * P(feature_n|class_n) * P(class_n)/(THE CONSTANT P(SAMPLE) and taking the class with max value. That's what implementation 1 is doing. In Implementation 2: Its basically class with max value : ( exp(count_feature_1) * P(feature_1 | class_n ) * exp(count_feature_n) * P(feature_n|class_n) * P(class_n)) Don't know if it gives the exact result. Thanks Rahul Bhojwani rahulbhojwani2...@gmail.com -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2433) In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.
[ https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057801#comment-14057801 ] Bertrand Dechoux edited comment on SPARK-2433 at 7/10/14 6:50 PM: -- A Jira ticket is the first step, the second would have been to provide a diff patch or a github pull request. And you can also write a test to prove your point and make sure that the fix will stay longer. I will second Sean : 1) work with last version (1.0) 2) you report is not clear, that's why diff patch or pull request are welcomed And there is a transpose() in the current implementation so I believe that the bug is actually already fixed. see https://github.com/apache/spark/commit/4f2f093c5b65b74869068d5690a4d2b0e0b5f759 was (Author: bdechoux): A Jira ticket is the first step, the second would have been to provide a diff patch or a github pull request. And you can also write a test to prove your point and make sure that the fix will stay longer. I will second Sean : 1) work with last version (1.0) 2) you report is not clear, that's why diff patch or pull request are welcomed And there is a transpose() in the current implementation so I believe that the bug is actually already fixed. In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug. Key: SPARK-2433 URL: https://issues.apache.org/jira/browse/SPARK-2433 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 0.9.1 Environment: Any Reporter: Rahul K Bhojwani Labels: easyfix, test Original Estimate: 1h Remaining Estimate: 1h Don't have much experience with reporting errors. This is first time. If something is not clear please feel free to contact me (details given below) In the pyspark mllib library. Path : \spark-0.9.1\python\pyspark\mllib\classification.py Class: NaiveBayesModel Method: self.predict Earlier Implementation: def predict(self, x): Return the most likely class for a data vector x return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x))) New Implementation: No:1 def predict(self, x): Return the most likely class for a data vector x return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x))) No:2 def predict(self, x): Return the most likely class for a data vector x return numpy.argmax(self.pi + dot(x,self.theta.T)) Explanation: No:1 is correct according to me. Don't know about No:2. Error one: The matrix self.theta is of dimension [n_classes , n_features]. while the matrix x is of dimension [1 , n_features]. Taking the dot will not work as its [1, n_feature ] x [n_classes,n_features]. It will always give error: ValueError: matrices are not aligned In the commented example given in the classification.py, n_classes = n_features = 2. That's why no error. Both Implementation no.1 and Implementation no. 2 takes care of it. Error 2: As basic implementation of naive bayes is: P(class_n | sample) = count_feature_1 * P(feature_1 | class_n ) * count_feature_n * P(feature_n|class_n) * P(class_n)/(THE CONSTANT P(SAMPLE) and taking the class with max value. That's what implementation 1 is doing. In Implementation 2: Its basically class with max value : ( exp(count_feature_1) * P(feature_1 | class_n ) * exp(count_feature_n) * P(feature_n|class_n) * P(class_n)) Don't know if it gives the exact result. Thanks Rahul Bhojwani rahulbhojwani2...@gmail.com -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2433) In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.
[ https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057801#comment-14057801 ] Bertrand Dechoux edited comment on SPARK-2433 at 7/10/14 6:52 PM: -- A Jira ticket is the first step, the second would have been to provide a diff patch or a github pull request. And you can also write a test to prove your point and make sure that the fix will stay longer. I will second Sean : 1) work with last version (1.0) 2) you report is not clear, that's why diff patch or pull request are welcomed And there is a transpose() in the current implementation, the bug is actually already fixed, see https://github.com/apache/spark/pull/463 was (Author: bdechoux): A Jira ticket is the first step, the second would have been to provide a diff patch or a github pull request. And you can also write a test to prove your point and make sure that the fix will stay longer. I will second Sean : 1) work with last version (1.0) 2) you report is not clear, that's why diff patch or pull request are welcomed And there is a transpose() in the current implementation so I believe that the bug is actually already fixed. see https://github.com/apache/spark/commit/4f2f093c5b65b74869068d5690a4d2b0e0b5f759 In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug. Key: SPARK-2433 URL: https://issues.apache.org/jira/browse/SPARK-2433 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 0.9.1 Environment: Any Reporter: Rahul K Bhojwani Labels: easyfix, test Original Estimate: 1h Remaining Estimate: 1h Don't have much experience with reporting errors. This is first time. If something is not clear please feel free to contact me (details given below) In the pyspark mllib library. Path : \spark-0.9.1\python\pyspark\mllib\classification.py Class: NaiveBayesModel Method: self.predict Earlier Implementation: def predict(self, x): Return the most likely class for a data vector x return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x))) New Implementation: No:1 def predict(self, x): Return the most likely class for a data vector x return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x))) No:2 def predict(self, x): Return the most likely class for a data vector x return numpy.argmax(self.pi + dot(x,self.theta.T)) Explanation: No:1 is correct according to me. Don't know about No:2. Error one: The matrix self.theta is of dimension [n_classes , n_features]. while the matrix x is of dimension [1 , n_features]. Taking the dot will not work as its [1, n_feature ] x [n_classes,n_features]. It will always give error: ValueError: matrices are not aligned In the commented example given in the classification.py, n_classes = n_features = 2. That's why no error. Both Implementation no.1 and Implementation no. 2 takes care of it. Error 2: As basic implementation of naive bayes is: P(class_n | sample) = count_feature_1 * P(feature_1 | class_n ) * count_feature_n * P(feature_n|class_n) * P(class_n)/(THE CONSTANT P(SAMPLE) and taking the class with max value. That's what implementation 1 is doing. In Implementation 2: Its basically class with max value : ( exp(count_feature_1) * P(feature_1 | class_n ) * exp(count_feature_n) * P(feature_n|class_n) * P(class_n)) Don't know if it gives the exact result. Thanks Rahul Bhojwani rahulbhojwani2...@gmail.com -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2436) Apply size-based optimization to planning BroadcastNestedLoopJoin
Zongheng Yang created SPARK-2436: Summary: Apply size-based optimization to planning BroadcastNestedLoopJoin Key: SPARK-2436 URL: https://issues.apache.org/jira/browse/SPARK-2436 Project: Spark Issue Type: New Feature Components: SQL Reporter: Zongheng Yang We should estimate the sizes of both the left side and the right side, and make the smaller one the broadcast relation. As part of this task, BroadcastNestedLoopJoin needs to be slightly refactored to take a BuildSide into account. -- This message was sent by Atlassian JIRA (v6.2#6252)
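Sketching the planning decision described above (BuildSide/BuildLeft/BuildRight follow the naming in the description; the surrounding code is hypothetical, not the actual planner change):
{code}
// Pick the smaller estimated side as the broadcast (build) relation.
sealed trait BuildSide
case object BuildLeft extends BuildSide
case object BuildRight extends BuildSide

def chooseBuildSide(leftSizeInBytes: BigInt, rightSizeInBytes: BigInt): BuildSide =
  if (rightSizeInBytes <= leftSizeInBytes) BuildRight else BuildLeft
{code}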
[jira] [Comment Edited] (SPARK-2433) In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.
[ https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057801#comment-14057801 ] Bertrand Dechoux edited comment on SPARK-2433 at 7/10/14 6:57 PM: -- A Jira ticket is the first step, the second would have been to provide a diff patch or a github pull request. And you can also write a test to prove your point and make sure that the fix will stay longer. I will second Sean : 1) work with last version (1.0) 2) you report is not clear, that's why diff patch or pull request are welcomed And there is a transpose() in the current implementation, the bug is actually already fixed, see https://github.com/apache/spark/pull/463 You might want to read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark for a next time. was (Author: bdechoux): A Jira ticket is the first step, the second would have been to provide a diff patch or a github pull request. And you can also write a test to prove your point and make sure that the fix will stay longer. I will second Sean : 1) work with last version (1.0) 2) you report is not clear, that's why diff patch or pull request are welcomed And there is a transpose() in the current implementation, the bug is actually already fixed, see https://github.com/apache/spark/pull/463 In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug. Key: SPARK-2433 URL: https://issues.apache.org/jira/browse/SPARK-2433 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 0.9.1 Environment: Any Reporter: Rahul K Bhojwani Labels: easyfix, test Original Estimate: 1h Remaining Estimate: 1h Don't have much experience with reporting errors. This is first time. If something is not clear please feel free to contact me (details given below) In the pyspark mllib library. Path : \spark-0.9.1\python\pyspark\mllib\classification.py Class: NaiveBayesModel Method: self.predict Earlier Implementation: def predict(self, x): Return the most likely class for a data vector x return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x))) New Implementation: No:1 def predict(self, x): Return the most likely class for a data vector x return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x))) No:2 def predict(self, x): Return the most likely class for a data vector x return numpy.argmax(self.pi + dot(x,self.theta.T)) Explanation: No:1 is correct according to me. Don't know about No:2. Error one: The matrix self.theta is of dimension [n_classes , n_features]. while the matrix x is of dimension [1 , n_features]. Taking the dot will not work as its [1, n_feature ] x [n_classes,n_features]. It will always give error: ValueError: matrices are not aligned In the commented example given in the classification.py, n_classes = n_features = 2. That's why no error. Both Implementation no.1 and Implementation no. 2 takes care of it. Error 2: As basic implementation of naive bayes is: P(class_n | sample) = count_feature_1 * P(feature_1 | class_n ) * count_feature_n * P(feature_n|class_n) * P(class_n)/(THE CONSTANT P(SAMPLE) and taking the class with max value. That's what implementation 1 is doing. In Implementation 2: Its basically class with max value : ( exp(count_feature_1) * P(feature_1 | class_n ) * exp(count_feature_n) * P(feature_n|class_n) * P(class_n)) Don't know if it gives the exact result. Thanks Rahul Bhojwani rahulbhojwani2...@gmail.com -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2433) In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.
[ https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057815#comment-14057815 ] Rahul K Bhojwani commented on SPARK-2433: - Okay fine. I will take care of the proper procedure from next time. Looks like its been already taken care. Thank you In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug. Key: SPARK-2433 URL: https://issues.apache.org/jira/browse/SPARK-2433 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 0.9.1 Environment: Any Reporter: Rahul K Bhojwani Labels: easyfix, test Original Estimate: 1h Remaining Estimate: 1h Don't have much experience with reporting errors. This is first time. If something is not clear please feel free to contact me (details given below) In the pyspark mllib library. Path : \spark-0.9.1\python\pyspark\mllib\classification.py Class: NaiveBayesModel Method: self.predict Earlier Implementation: def predict(self, x): Return the most likely class for a data vector x return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x))) New Implementation: No:1 def predict(self, x): Return the most likely class for a data vector x return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x))) No:2 def predict(self, x): Return the most likely class for a data vector x return numpy.argmax(self.pi + dot(x,self.theta.T)) Explanation: No:1 is correct according to me. Don't know about No:2. Error one: The matrix self.theta is of dimension [n_classes , n_features]. while the matrix x is of dimension [1 , n_features]. Taking the dot will not work as its [1, n_feature ] x [n_classes,n_features]. It will always give error: ValueError: matrices are not aligned In the commented example given in the classification.py, n_classes = n_features = 2. That's why no error. Both Implementation no.1 and Implementation no. 2 takes care of it. Error 2: As basic implementation of naive bayes is: P(class_n | sample) = count_feature_1 * P(feature_1 | class_n ) * count_feature_n * P(feature_n|class_n) * P(class_n)/(THE CONSTANT P(SAMPLE) and taking the class with max value. That's what implementation 1 is doing. In Implementation 2: Its basically class with max value : ( exp(count_feature_1) * P(feature_1 | class_n ) * exp(count_feature_n) * P(feature_n|class_n) * P(class_n)) Don't know if it gives the exact result. Thanks Rahul Bhojwani rahulbhojwani2...@gmail.com -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2433) In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.
[ https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057814#comment-14057814 ] Rahul K Bhojwani commented on SPARK-2433: - Sorry my mistake. By earlier implementation I mean how its already implemented. and that is: def predict(self, x): Return the most likely class for a data vector x return numpy.argmax(self.pi + dot(x,self.theta)) -- Rahul K Bhojwani 3rd Year B.Tech Computer Science and Engineering National Institute of Technology, Karnataka In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug. Key: SPARK-2433 URL: https://issues.apache.org/jira/browse/SPARK-2433 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 0.9.1 Environment: Any Reporter: Rahul K Bhojwani Labels: easyfix, test Original Estimate: 1h Remaining Estimate: 1h Don't have much experience with reporting errors. This is first time. If something is not clear please feel free to contact me (details given below) In the pyspark mllib library. Path : \spark-0.9.1\python\pyspark\mllib\classification.py Class: NaiveBayesModel Method: self.predict Earlier Implementation: def predict(self, x): Return the most likely class for a data vector x return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x))) New Implementation: No:1 def predict(self, x): Return the most likely class for a data vector x return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x))) No:2 def predict(self, x): Return the most likely class for a data vector x return numpy.argmax(self.pi + dot(x,self.theta.T)) Explanation: No:1 is correct according to me. Don't know about No:2. Error one: The matrix self.theta is of dimension [n_classes , n_features]. while the matrix x is of dimension [1 , n_features]. Taking the dot will not work as its [1, n_feature ] x [n_classes,n_features]. It will always give error: ValueError: matrices are not aligned In the commented example given in the classification.py, n_classes = n_features = 2. That's why no error. Both Implementation no.1 and Implementation no. 2 takes care of it. Error 2: As basic implementation of naive bayes is: P(class_n | sample) = count_feature_1 * P(feature_1 | class_n ) * count_feature_n * P(feature_n|class_n) * P(class_n)/(THE CONSTANT P(SAMPLE) and taking the class with max value. That's what implementation 1 is doing. In Implementation 2: Its basically class with max value : ( exp(count_feature_1) * P(feature_1 | class_n ) * exp(count_feature_n) * P(feature_n|class_n) * P(class_n)) Don't know if it gives the exact result. Thanks Rahul Bhojwani rahulbhojwani2...@gmail.com -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2345) ForEachDStream should have an option of running the foreachfunc on Spark
[ https://issues.apache.org/jira/browse/SPARK-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057824#comment-14057824 ] Tathagata Das commented on SPARK-2345: -- I must be missing something. If you want to run a Spark job inside the DStream.foreach function, you can just call any RDD action (count, collect, first, saveAs***File, etc.) inside that foreach function. context.runJob does not need to be called *explicitly*. All of the existing RDD actions (including the most generic rdd.foreachPartition) should be mostly sufficient for all requirements. Maybe adding a code example of what you intend to do will help us disambiguate this? ForEachDStream should have an option of running the foreachfunc on Spark Key: SPARK-2345 URL: https://issues.apache.org/jira/browse/SPARK-2345 Project: Spark Issue Type: Bug Components: Streaming Reporter: Hari Shreedharan Today the Job generated simply calls the foreachfunc, but does not run it on Spark itself using the sparkContext.runJob method. -- This message was sent by Atlassian JIRA (v6.2#6252)
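For example, the following already runs as a distributed Spark job without any explicit runJob call, because foreachPartition is an RDD action; the sink here is a placeholder:
{code}
import org.apache.spark.streaming.dstream.DStream

// Each partition is processed by an executor-side task for every batch.
def saveToExternalSink(stream: DStream[String]): Unit = {
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      partition.foreach(record => println(record))   // replace println with a real writer
    }
  }
}
{code}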
[jira] [Commented] (SPARK-1981) Add AWS Kinesis streaming support
[ https://issues.apache.org/jira/browse/SPARK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057847#comment-14057847 ] Chris Fregly commented on SPARK-1981: - hey guys- i'm in the final phases of cleanup. i refactored quite a bit of the original code to make things more testable - and easier to understand. oh, and i did, indeed, choose the optional-module route. we'll address the additional complexity through documentation. that's what i'm working on right now, actually. hoping to submit the PR by tomorrow or this weekend at the very latest. the goal is to get this in to the 1.1 release which has a timeline outlined here: https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage thanks! -chris Add AWS Kinesis streaming support - Key: SPARK-1981 URL: https://issues.apache.org/jira/browse/SPARK-1981 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Chris Fregly Assignee: Chris Fregly Add AWS Kinesis support to Spark Streaming. Initial discussion occured here: https://github.com/apache/spark/pull/223 I discussed this with Parviz from AWS recently and we agreed that I would take this over. Look for a new PR that takes into account all the feedback from the earlier PR including spark-1.0-compliant implementation, AWS-license-aware build support, tests, comments, and style guide compliance. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1458) Expose sc.version in PySpark
[ https://issues.apache.org/jira/browse/SPARK-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057857#comment-14057857 ] Patrick Wendell commented on SPARK-1458: [~nchammas] I updated the JIRA title to reflect the scope. We should just add this in PySpark, should be an easy fix! Expose sc.version in PySpark Key: SPARK-1458 URL: https://issues.apache.org/jira/browse/SPARK-1458 Project: Spark Issue Type: New Feature Components: PySpark, Spark Core Affects Versions: 0.9.0 Reporter: Nicholas Chammas Priority: Minor As discussed [here|http://apache-spark-user-list.1001560.n3.nabble.com/programmatic-way-to-tell-Spark-version-td1929.html], I think it would be nice if there was a way to programmatically determine what version of Spark you are running. The potential use cases are not that important, but they include: # Branching your code based on what version of Spark is running. # Checking your version without having to quit and restart the Spark shell. Right now in PySpark, I believe the only way to determine your version is by firing up the Spark shell and looking at the startup banner. -- This message was sent by Atlassian JIRA (v6.2#6252)
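For reference, the retitled scope implies the Scala side already offers this as a one-liner on the SparkContext; the sketch below shows the "branch on version" use case from the description, with the object name and log strings as illustrative assumptions. PySpark would gain the equivalent property.
{code}
import org.apache.spark.SparkContext

object VersionCheck {
  def logVersion(sc: SparkContext): Unit = {
    val v = sc.version // the version of Spark this application is running on
    if (v.startsWith("0.9")) {
      println("older Spark detected, taking the compatibility code path")
    } else {
      println(s"running Spark $v")
    }
  }
}
{code}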
[jira] [Updated] (SPARK-1458) Add programmatic way to determine Spark version
[ https://issues.apache.org/jira/browse/SPARK-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1458: --- Fix Version/s: (was: 1.0.0) Add programmatic way to determine Spark version --- Key: SPARK-1458 URL: https://issues.apache.org/jira/browse/SPARK-1458 Project: Spark Issue Type: New Feature Components: PySpark, Spark Core Affects Versions: 0.9.0 Reporter: Nicholas Chammas Priority: Minor As discussed [here|http://apache-spark-user-list.1001560.n3.nabble.com/programmatic-way-to-tell-Spark-version-td1929.html], I think it would be nice if there was a way to programmatically determine what version of Spark you are running. The potential use cases are not that important, but they include: # Branching your code based on what version of Spark is running. # Checking your version without having to quit and restart the Spark shell. Right now in PySpark, I believe the only way to determine your version is by firing up the Spark shell and looking at the startup banner. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1458) Add programmatic way to determine Spark version
[ https://issues.apache.org/jira/browse/SPARK-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1458: --- Target Version/s: 1.1.0 Add programmatic way to determine Spark version --- Key: SPARK-1458 URL: https://issues.apache.org/jira/browse/SPARK-1458 Project: Spark Issue Type: New Feature Components: PySpark, Spark Core Affects Versions: 0.9.0 Reporter: Nicholas Chammas Priority: Minor As discussed [here|http://apache-spark-user-list.1001560.n3.nabble.com/programmatic-way-to-tell-Spark-version-td1929.html], I think it would be nice if there was a way to programmatically determine what version of Spark you are running. The potential use cases are not that important, but they include: # Branching your code based on what version of Spark is running. # Checking your version without having to quit and restart the Spark shell. Right now in PySpark, I believe the only way to determine your version is by firing up the Spark shell and looking at the startup banner. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1981) Add AWS Kinesis streaming support
[ https://issues.apache.org/jira/browse/SPARK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057872#comment-14057872 ] Urban Škudnik commented on SPARK-1981: -- Excellent, looking forward already! :) I'm about to start evaluating Kinesis integration with our spark systems so if you need someone to try out the documentation and test the integration, let me know. Add AWS Kinesis streaming support - Key: SPARK-1981 URL: https://issues.apache.org/jira/browse/SPARK-1981 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Chris Fregly Assignee: Chris Fregly Add AWS Kinesis support to Spark Streaming. Initial discussion occurred here: https://github.com/apache/spark/pull/223 I discussed this with Parviz from AWS recently and we agreed that I would take this over. Look for a new PR that takes into account all the feedback from the earlier PR including spark-1.0-compliant implementation, AWS-license-aware build support, tests, comments, and style guide compliance. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1642) Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-2083
[ https://issues.apache.org/jira/browse/SPARK-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057894#comment-14057894 ] Tathagata Das commented on SPARK-1642: -- 1478 is merged. Do you have this ready? Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-2083 --- Key: SPARK-1642 URL: https://issues.apache.org/jira/browse/SPARK-1642 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Ted Malaska Assignee: Ted Malaska Priority: Minor Fix For: 1.1.0 This will add support for SSL encryption between Flume AvroSink and Spark Streaming. It is based on FLUME-2083 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1478) Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-1915
[ https://issues.apache.org/jira/browse/SPARK-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057897#comment-14057897 ] Tathagata Das commented on SPARK-1478: -- After much hoops, this was the PR that merged this feature. https://github.com/apache/spark/pull/1347 Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-1915 --- Key: SPARK-1478 URL: https://issues.apache.org/jira/browse/SPARK-1478 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Ted Malaska Assignee: Ted Malaska Priority: Minor Fix For: 1.1.0 Flume-1915 added support for compression over the wire from avro sink to avro source. I would like to add this functionality to the FlumeReceiver. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2419) Misc updates to streaming programming guide
[ https://issues.apache.org/jira/browse/SPARK-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2419: - Description: This JIRA collects together a number of small issues that should be added to the streaming programming guide - Receivers consume an executor slot - Ordering and parallelism of the output operations - Receivers should be serializable - Add more information on how socketStream: input stream => iterator function. was: This JIRA collects together a number of small issues that should be added to the streaming programming guide - Receivers consume an executor slot - Ordering and parallelism of the output operations - Receivers should be serializable Misc updates to streaming programming guide --- Key: SPARK-2419 URL: https://issues.apache.org/jira/browse/SPARK-2419 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das This JIRA collects together a number of small issues that should be added to the streaming programming guide - Receivers consume an executor slot - Ordering and parallelism of the output operations - Receivers should be serializable - Add more information on how socketStream: input stream => iterator function. -- This message was sent by Atlassian JIRA (v6.2#6252)
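To make the socketStream point above concrete, a small illustrative sketch of the converter shape (an input stream => iterator function) the guide could explain; the host, port, and line-based converter are assumptions for the example.
{code}
import java.io.{BufferedReader, InputStream, InputStreamReader}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.ReceiverInputDStream

object SocketStreamExample {
  // The user-supplied converter turns the raw socket InputStream into an
  // iterator of records; here, one record per line of UTF-8 text.
  def lines(in: InputStream): Iterator[String] = {
    val reader = new BufferedReader(new InputStreamReader(in, "UTF-8"))
    Iterator.continually(reader.readLine()).takeWhile(_ != null)
  }

  def custom(ssc: StreamingContext): ReceiverInputDStream[String] =
    ssc.socketStream("localhost", 9999, lines _, StorageLevel.MEMORY_AND_DISK_SER_2)
}
{code}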
[jira] [Created] (SPARK-2437) Rename MAVEN_PROFILES to SBT_MAVEN_PROFILES and add SBT_MAVEN_PROPERTIES
Patrick Wendell created SPARK-2437: -- Summary: Rename MAVEN_PROFILES to SBT_MAVEN_PROFILES and add SBT_MAVEN_PROPERTIES Key: SPARK-2437 URL: https://issues.apache.org/jira/browse/SPARK-2437 Project: Spark Issue Type: Improvement Components: Build Reporter: Patrick Wendell Assignee: Prashant Sharma Right now calling it MAVEN_PROFILES is confusing because it's actually only used by sbt and _not_ maven! Also, we should have a similar SBT_MAVEN_PROPERTIES. In some cases people will want to set these automatically if they are e.g. building Hadoop for a specific minor version. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Closed] (SPARK-1785) Streaming requires receivers to be serializable
[ https://issues.apache.org/jira/browse/SPARK-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das closed SPARK-1785. Resolution: Fixed Streaming requires receivers to be serializable --- Key: SPARK-1785 URL: https://issues.apache.org/jira/browse/SPARK-1785 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 0.9.0 Reporter: Hari Shreedharan When the ReceiverTracker starts the receivers it creates a temporary RDD to send the receivers over to the workers. Then they are started on the workers using a the startReceivers method. Looks like this means that the receivers have to really be serializable. In case of the Flume receiver, the Avro IPC components are not serializable causing an error that looks like this: {code} Exception in thread Thread-46 org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.avro.ipc.specific.SpecificResponder - field (class org.apache.spark.streaming.flume.FlumeReceiver, name: responder, type: class org.apache.avro.ipc.specific.SpecificResponder) - object (class org.apache.spark.streaming.flume.FlumeReceiver, org.apache.spark.streaming.flume.FlumeReceiver@5e6bbb36) - element of array (index: 0) - array (class [Lorg.apache.spark.streaming.receiver.Receiver;, size: 1) - field (class scala.collection.mutable.WrappedArray$ofRef, name: array, type: class [Ljava.lang.Object;) - object (class scala.collection.mutable.WrappedArray$ofRef, WrappedArray(org.apache.spark.streaming.flume.FlumeReceiver@5e6bbb36)) - field (class org.apache.spark.rdd.ParallelCollectionPartition, name: values, type: interface scala.collection.Seq) - custom writeObject data (class org.apache.spark.rdd.ParallelCollectionPartition) - object (class org.apache.spark.rdd.ParallelCollectionPartition, org.apache.spark.rdd.ParallelCollectionPartition@691) - writeExternal data - root object (class org.apache.spark.scheduler.ResultTask, ResultTask(0, 0)) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:770) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:713) at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:697) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1176) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} A way out of this is to simply send the class name (or .class) to the workers in the tempRDD and have the workers instantiate and start the receiver. My analysis may be wrong, but if it makes sense, I will submit a PR to fix this. -- This message was sent by Atlassian JIRA (v6.2#6252)
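Aside from the class-name approach proposed above, a common way around this particular exception is to keep the non-serializable object out of the serialized receiver state and construct it on the worker in onStart(); a minimal sketch, with the class and field names as illustrative assumptions (the plain Object stands in for the Avro SpecificResponder).
{code}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class SketchReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER) {

  // Marked @transient and only assigned in onStart(), so it is never part of
  // the state that gets serialized when the receiver is shipped to a worker.
  @transient private var responder: AnyRef = _

  override def onStart(): Unit = {
    responder = new Object // stands in for the non-serializable responder
    // start the server here and call store(...) as records arrive
  }

  override def onStop(): Unit = {
    responder = null
  }
}
{code}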
[jira] [Updated] (SPARK-2157) Can't write tight firewall rules for Spark
[ https://issues.apache.org/jira/browse/SPARK-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2157: --- Assignee: Andrew Ash Can't write tight firewall rules for Spark -- Key: SPARK-2157 URL: https://issues.apache.org/jira/browse/SPARK-2157 Project: Spark Issue Type: Bug Components: Deploy, Spark Core Affects Versions: 1.0.0 Reporter: Andrew Ash Assignee: Andrew Ash Priority: Critical In order to run Spark in places with strict firewall rules, you need to be able to specify every port that's used between all parts of the stack. Per the [network activity section of the docs|http://spark.apache.org/docs/latest/spark-standalone.html#configuring-ports-for-network-security] most of the ports are configurable, but there are a few ports that aren't configurable. We need to make every port configurable to a particular port, so that we can run Spark in highly locked-down environments. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2157) Can't write tight firewall rules for Spark
[ https://issues.apache.org/jira/browse/SPARK-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2157: --- Target Version/s: 1.1.0 Can't write tight firewall rules for Spark -- Key: SPARK-2157 URL: https://issues.apache.org/jira/browse/SPARK-2157 Project: Spark Issue Type: Bug Components: Deploy, Spark Core Affects Versions: 1.0.0 Reporter: Andrew Ash Priority: Critical In order to run Spark in places with strict firewall rules, you need to be able to specify every port that's used between all parts of the stack. Per the [network activity section of the docs|http://spark.apache.org/docs/latest/spark-standalone.html#configuring-ports-for-network-security] most of the ports are configurable, but there are a few ports that aren't configurable. We need to make every port configurable to a particular port, so that we can run Spark in highly locked-down environments. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2438) Streaming + MLLib
Jeremy Freeman created SPARK-2438: - Summary: Streaming + MLLib Key: SPARK-2438 URL: https://issues.apache.org/jira/browse/SPARK-2438 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Jeremy Freeman This is a ticket to track progress on developing streaming analyses in MLLib. Many streaming applications benefit from or require fitting models online, where the parameters of a model (e.g. regression, clustering) are updated continually as new data arrive. This can be accomplished by incorporating MLLib algorithms into model-updating operations over DStreams. In some cases this can be achieved using existing updaters (e.g. those based on SGD), but in other cases will require custom update rules (e.g. for KMeans). The goal is to have streaming versions of many common algorithms, in particular regression, classification, clustering, and possibly dimensionality reduction. -- This message was sent by Atlassian JIRA (v6.2#6252)
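As a rough illustration of the per-batch model updating described above, the sketch below re-fits an SGD-based regression on each micro-batch, warm-starting from the previous weights via the existing batch API; this is not the eventual MLlib streaming API, and the feature count, iteration count, and step size are illustrative assumptions.
{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.streaming.dstream.DStream

object StreamingRegressionSketch {
  def trainOn(batches: DStream[LabeledPoint], numFeatures: Int): Unit = {
    // Carry the latest weights across micro-batches, starting from zero.
    var weights = Vectors.dense(Array.fill(numFeatures)(0.0))
    batches.foreachRDD { rdd =>
      if (rdd.count() > 0) {
        // Re-run SGD on the new batch, initialized from the previous weights.
        val model = LinearRegressionWithSGD.train(rdd, 10, 0.1, 1.0, weights)
        weights = model.weights
      }
    }
  }
}
{code}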
[jira] [Commented] (SPARK-2345) ForEachDStream should have an option of running the foreachfunc on Spark
[ https://issues.apache.org/jira/browse/SPARK-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057954#comment-14057954 ] Hari Shreedharan commented on SPARK-2345: - If I have to write the data out as, say, Avro or any other format, there is no easy way to do this. saveAsHadoop currently does only Sequence Files AFAIK. Adding support for a DStream that can run the function on executors without the implementor worrying about it would enable this. ForEachDStream should have an option of running the foreachfunc on Spark Key: SPARK-2345 URL: https://issues.apache.org/jira/browse/SPARK-2345 Project: Spark Issue Type: Bug Components: Streaming Reporter: Hari Shreedharan Today the Job generated simply calls the foreachfunc, but does not run it on spark itself using the sparkContext.runJob method. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2439) Actually close FileSystem in Master
Andrew Or created SPARK-2439: Summary: Actually close FileSystem in Master Key: SPARK-2439 URL: https://issues.apache.org/jira/browse/SPARK-2439 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Priority: Minor Fix For: 1.1.0 This was intended, but never actually materialized in code... -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2411) Standalone Master - direct users to turn on event logs
[ https://issues.apache.org/jira/browse/SPARK-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2411: - Attachment: (was: Screen Shot 2014-07-08 at 4.23.51 PM.png) Standalone Master - direct users to turn on event logs -- Key: SPARK-2411 URL: https://issues.apache.org/jira/browse/SPARK-2411 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Right now if the user attempts to click on a finished application's UI, it simply refreshes. This is simply because the event logs are not there, in which case we set the href to an empty string. We could provide more information by pointing them to configure spark.eventLog.enabled if they click on the empty link. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2411) Standalone Master - direct users to turn on event logs
[ https://issues.apache.org/jira/browse/SPARK-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2411: - Attachment: Master event logs.png Standalone Master - direct users to turn on event logs -- Key: SPARK-2411 URL: https://issues.apache.org/jira/browse/SPARK-2411 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Attachments: Master event logs.png Right now if the user attempts to click on a finished application's UI, it simply refreshes. This is simply because the event logs are not there, in which case we set the href to an empty string. We could provide more information by pointing them to configure spark.eventLog.enabled if they click on the empty link. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2440) Enable HistoryServer to display lots of Application History
Kousuke Saruta created SPARK-2440: - Summary: Enable HistoryServer to display lots of Application History Key: SPARK-2440 URL: https://issues.apache.org/jira/browse/SPARK-2440 Project: Spark Issue Type: Improvement Reporter: Kousuke Saruta Fix For: 1.0.0 In the current implementation of HistoryServer, it can display 250 records by default. Sometimes we'd like to see more than 250 records and configure it to list more, but the current implementation lists all the records on just one page. This is not useful. And to make matters worse, the initial launch of HistoryServer is very slow. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-944) Give example of writing to HBase from Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-944: Assignee: Ted Malaska (was: Tathagata Das) Give example of writing to HBase from Spark Streaming - Key: SPARK-944 URL: https://issues.apache.org/jira/browse/SPARK-944 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Patrick Wendell Assignee: Ted Malaska Attachments: MetricAggregatorHBase.scala -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Issue Comment Deleted] (SPARK-1667) Jobs never finish successfully once bucket file missing occurred
[ https://issues.apache.org/jira/browse/SPARK-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-1667: -- Comment: was deleted (was: Now I'm trying to address this issue.) Jobs never finish successfully once bucket file missing occurred Key: SPARK-1667 URL: https://issues.apache.org/jira/browse/SPARK-1667 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.0.0 Reporter: Kousuke Saruta If jobs execute a shuffle, bucket files are created in a temporary directory (named like spark-local-*). When the bucket files are missing, caused by disk failure or other reasons, jobs cannot execute a shuffle which has the same shuffle id as the bucket files. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1667) Jobs never finish successfully once bucket file missing occurred
[ https://issues.apache.org/jira/browse/SPARK-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-1667: -- Description: If jobs execute a shuffle, bucket files are created in a temporary directory (named like spark-local-*). When the bucket files are missing, caused by disk failure or other reasons, jobs cannot execute a shuffle which has the same shuffle id as the bucket files. was: I met a case that re-fetch wouldn't occur although it should occur. When intermediate data (the physical file of intermediate data on the local file system) which is used for shuffle is lost from an Executor, FileNotFoundException was thrown and the re-fetch wouldn't occur. Jobs never finish successfully once bucket file missing occurred Key: SPARK-1667 URL: https://issues.apache.org/jira/browse/SPARK-1667 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.0.0 Reporter: Kousuke Saruta If jobs execute a shuffle, bucket files are created in a temporary directory (named like spark-local-*). When the bucket files are missing, caused by disk failure or other reasons, jobs cannot execute a shuffle which has the same shuffle id as the bucket files. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2427) Fix some of the Scala examples
[ https://issues.apache.org/jira/browse/SPARK-2427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2427. Resolution: Fixed Fix Version/s: 1.0.2 1.1.0 Fix some of the Scala examples -- Key: SPARK-2427 URL: https://issues.apache.org/jira/browse/SPARK-2427 Project: Spark Issue Type: Bug Components: Examples Affects Versions: 1.0.0 Reporter: Constantin Ahlmann Fix For: 1.1.0, 1.0.2 The Scala examples HBaseTest and HdfsTest don't use the correct indexes for the command line arguments. This is due to the fix of JIRA 1565, where these examples were not correctly adapted to the new usage of the submit script. -- This message was sent by Atlassian JIRA (v6.2#6252)
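The kind of change involved is small: with spark-submit, an example no longer receives the master URL as its first argument, so the input path should be read from args(0) instead of args(1). A hedged sketch along the lines of HdfsTest follows; the object name and the printed count are illustrative, not the actual example source.
{code}
import org.apache.spark.{SparkConf, SparkContext}

object HdfsTestSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HdfsTestSketch"))
    // args(0) is now the input path; before the submit-script change the
    // examples took the master URL first and the path as args(1).
    val file = sc.textFile(args(0))
    println(s"lines: ${file.count()}")
    sc.stop()
  }
}
{code}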
[jira] [Commented] (SPARK-1341) Ability to control the data rate in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058091#comment-14058091 ] Tathagata Das commented on SPARK-1341: -- Thanks Isaac. This has been merged, as is going to be very useful! Ability to control the data rate in Spark Streaming Key: SPARK-1341 URL: https://issues.apache.org/jira/browse/SPARK-1341 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Tathagata Das Fix For: 1.1.0, 1.0.2 To be able to control the rate at which data is received through Spark Streaming's receivers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-1341) Ability to control the data rate in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058091#comment-14058091 ] Tathagata Das edited comment on SPARK-1341 at 7/10/14 11:05 PM: Thanks Isaac. This has been merged, and is going to be very useful! was (Author: tdas): Thanks Isaac. This has been merged, as is going to be very useful! Ability to control the data rate in Spark Streaming Key: SPARK-1341 URL: https://issues.apache.org/jira/browse/SPARK-1341 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Tathagata Das Fix For: 1.1.0, 1.0.2 To be able to control the rate at which data is received through Spark Streaming's receivers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2442) Add a Hadoop Writable serializer
Hari Shreedharan created SPARK-2442: --- Summary: Add a Hadoop Writable serializer Key: SPARK-2442 URL: https://issues.apache.org/jira/browse/SPARK-2442 Project: Spark Issue Type: Bug Reporter: Hari Shreedharan Using data read from Hadoop files in shuffles can cause exceptions with the following stacktrace:
{code}
java.io.NotSerializableException: org.apache.hadoop.io.BytesWritable
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1181)
 at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1541)
 at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1506)
 at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1429)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1175)
 at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
 at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
 at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:179)
 at org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:161)
 at org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:158)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
 at org.apache.spark.scheduler.Task.run(Task.scala:51)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:679)
{code}
This, though, seems to go away if the Kryo serializer is used. I am wondering if adding a Hadoop-Writables-friendly serializer makes sense, as it is likely to perform better than Kryo without registration, since Writables don't implement Serializable - so the serialization might not be the most efficient. -- This message was sent by Atlassian JIRA (v6.2#6252)
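Until such a serializer exists, a rough sketch of the Kryo route mentioned above (the registrator class and app names are illustrative assumptions): switch the serializer to Kryo and register the Writable classes so they are handled without java.io.Serializable.
{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.hadoop.io.{BytesWritable, Text}
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Registers the Writable classes so Kryo serializes their fields directly,
// sidestepping the NotSerializableException thrown by Java serialization.
class WritableRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[BytesWritable])
    kryo.register(classOf[Text])
  }
}

object WritableShuffleConf {
  def kryoConf(): SparkConf = new SparkConf()
    .setAppName("writable-shuffle")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator", classOf[WritableRegistrator].getName)
}
{code}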
[jira] [Commented] (SPARK-1853) Show Streaming application code context (file, line number) in Spark Stages UI
[ https://issues.apache.org/jira/browse/SPARK-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058151#comment-14058151 ] Mubarak Seyed commented on SPARK-1853: -- [~tdas] Seems like some bug in {code}Utils#getCallSite{code} Testing a fix. Will update you shortly. Thanks Show Streaming application code context (file, line number) in Spark Stages UI -- Key: SPARK-1853 URL: https://issues.apache.org/jira/browse/SPARK-1853 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.0 Reporter: Tathagata Das Assignee: Mubarak Seyed Fix For: 1.1.0 Attachments: Screen Shot 2014-07-03 at 2.54.05 PM.png Right now, the code context (file, and line number) shown for streaming jobs in stages UI is meaningless as it refers to internal DStream:random line rather than user application file. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1853) Show Streaming application code context (file, line number) in Spark Stages UI
[ https://issues.apache.org/jira/browse/SPARK-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058159#comment-14058159 ] Tathagata Das commented on SPARK-1853: -- I don't think it is a bug in that. This is an artifact of Spark Streaming's execution model -- all the DStreams are initially defined in one thread, and then RDDs and jobs are created and executed in another thread. So when the thread creating RDDs calls Utils#getCallSite (which looks at the stack trace to figure out the user code that is in the call stack), it cannot find any reference to the user program, as it is not called from the user program. The correct solution will probably require each DStream to store its own call site info (where it was defined), and set it explicitly on every RDD that gets created by it. Show Streaming application code context (file, line number) in Spark Stages UI -- Key: SPARK-1853 URL: https://issues.apache.org/jira/browse/SPARK-1853 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.0 Reporter: Tathagata Das Assignee: Mubarak Seyed Fix For: 1.1.0 Attachments: Screen Shot 2014-07-03 at 2.54.05 PM.png Right now, the code context (file, and line number) shown for streaming jobs in stages UI is meaningless as it refers to internal DStream:random line rather than user application file. -- This message was sent by Atlassian JIRA (v6.2#6252)
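To make that suggestion concrete, a very rough sketch of the direction described above; the class, field, and method names are made up for illustration, and this is not the actual patch.
{code}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Capture the user-code call site on the thread that defines the stream, then
// re-apply it on the job-generation thread around RDD creation, so the stages
// UI shows the user's file and line instead of DStream internals.
class CallSiteAwareSource(sc: SparkContext) {
  private val creationSite: String =
    Thread.currentThread.getStackTrace.drop(3).headOption
      .map(_.toString).getOrElse("unknown")

  def rddForBatch(batch: Seq[Int]): RDD[Int] = {
    sc.setCallSite(creationSite)
    try sc.parallelize(batch)
    finally sc.clearCallSite()
  }
}
{code}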
[jira] [Comment Edited] (SPARK-2345) ForEachDStream should have an option of running the foreachfunc on Spark
[ https://issues.apache.org/jira/browse/SPARK-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057824#comment-14057824 ] Tathagata Das edited comment on SPARK-2345 at 7/11/14 12:39 AM: I must be missing something. If you want to run a Spark job inside the DStream.foreach function, you can just call any RDD action (count, collect, first, saveAsFile, etc.) inside that foreach function. context.runJob does not need to be called *explicitly*. All of the existing RDD actions (including the most generic rdd.foreachPartition) should be mostly sufficient for all requirements. Maybe adding a code example of what you intend to do will help us disambiguate this? was (Author: tdas): I must be missing something. If you want to run a Spark job inside the DStream.foreach function, you can just call any RDD action (count, collect, first, saveAs***File, etc.) inside that foreach function. context.runJob does not need to be called *explicitly*. All of the existing RDD actions (including the most generic rdd.foreachPartition) should be mostly sufficient for all requirements. Maybe adding a code example of what you intend to do will help us disambiguate this? ForEachDStream should have an option of running the foreachfunc on Spark Key: SPARK-2345 URL: https://issues.apache.org/jira/browse/SPARK-2345 Project: Spark Issue Type: Bug Components: Streaming Reporter: Hari Shreedharan Today the Job generated simply calls the foreachfunc, but does not run it on spark itself using the sparkContext.runJob method. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2345) ForEachDStream should have an option of running the foreachfunc on Spark
[ https://issues.apache.org/jira/browse/SPARK-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058165#comment-14058165 ] Tathagata Das edited comment on SPARK-2345 at 7/11/14 12:44 AM: Then what I suggested earlier makes sense, doesn't it? DStream.foreachRDDPartition[T](function: Iterator[T] => Unit) This will effectively rdd.foreachPartition(function) on every RDD generated by the DStream. RDD.saveAsHadoopFile / RDD.newAPIHadoopFile should support any arbitrary OutputFormat that works with HDFS API. was (Author: tdas): Then what I suggested earlier makes sense, doesn't it? DStream.foreachRDDPartition[T](function: Iterator[T] => Unit) This will effectively rdd.foreachPartition(function) on every RDD generated by the DStream. RDD.saveAsHadoopFile / RDD.newAPIHadoopFile should support any arbitrary OutputFormat that works with HDFS API. ForEachDStream should have an option of running the foreachfunc on Spark Key: SPARK-2345 URL: https://issues.apache.org/jira/browse/SPARK-2345 Project: Spark Issue Type: Bug Components: Streaming Reporter: Hari Shreedharan Today the Job generated simply calls the foreachfunc, but does not run it on spark itself using the sparkContext.runJob method. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2345) ForEachDStream should have an option of running the foreachfunc on Spark
[ https://issues.apache.org/jira/browse/SPARK-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058165#comment-14058165 ] Tathagata Das commented on SPARK-2345: -- Then what I suggested earlier makes sense, doesn't it? DStream.foreachRDDPartition[T](function: Iterator[T] => Unit) This will effectively rdd.foreachPartition(function) on every RDD generated by the DStream. RDD.saveAsHadoopFile / RDD.newAPIHadoopFile should support any arbitrary OutputFormat that works with HDFS API. ForEachDStream should have an option of running the foreachfunc on Spark Key: SPARK-2345 URL: https://issues.apache.org/jira/browse/SPARK-2345 Project: Spark Issue Type: Bug Components: Streaming Reporter: Hari Shreedharan Today the Job generated simply calls the foreachfunc, but does not run it on spark itself using the sparkContext.runJob method. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2345) ForEachDStream should have an option of running the foreachfunc on Spark
[ https://issues.apache.org/jira/browse/SPARK-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058165#comment-14058165 ] Tathagata Das edited comment on SPARK-2345 at 7/11/14 12:45 AM: Then what I suggested earlier makes sense, doesn't it? DStream.foreachRDDPartition[T](function: Iterator[T] => Unit) This will effectively execute rdd.foreachPartition(function) on every RDD generated by the DStream. RDD.saveAsHadoopFile / RDD.newAPIHadoopFile should support any arbitrary OutputFormat that works with HDFS API. was (Author: tdas): Then what I suggested earlier makes sense, doesn't it? DStream.foreachRDDPartition[T](function: Iterator[T] => Unit) This will effectively rdd.foreachPartition(function) on every RDD generated by the DStream. RDD.saveAsHadoopFile / RDD.newAPIHadoopFile should support any arbitrary OutputFormat that works with HDFS API. ForEachDStream should have an option of running the foreachfunc on Spark Key: SPARK-2345 URL: https://issues.apache.org/jira/browse/SPARK-2345 Project: Spark Issue Type: Bug Components: Streaming Reporter: Hari Shreedharan Today the Job generated simply calls the foreachfunc, but does not run it on spark itself using the sparkContext.runJob method. -- This message was sent by Atlassian JIRA (v6.2#6252)
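One possible shape for the suggestion above, sketched as a user-side extension method rather than a change to DStream itself; the method name follows the comment, everything else is illustrative.
{code}
import org.apache.spark.streaming.dstream.DStream

object DStreamExtensions {
  implicit class RichDStream[T](stream: DStream[T]) {
    // Runs the supplied function on the executors, once per partition of every
    // RDD the DStream generates, without the caller touching runJob directly.
    def foreachRDDPartition(function: Iterator[T] => Unit): Unit =
      stream.foreachRDD(rdd => rdd.foreachPartition(function))
  }
}

// Usage, e.g. to open one Avro/Parquet/etc. writer per partition:
//   import DStreamExtensions._
//   myStream.foreachRDDPartition { records => records.foreach(r => ()) }
{code}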
[jira] [Updated] (SPARK-2443) Reading from Partitioned Tables is Slow
[ https://issues.apache.org/jira/browse/SPARK-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2443: Description: Here are some numbers, all queries return ~20million: {code} SELECT COUNT(*) FROM non partitioned table 5.496467726 s SELECT COUNT(*) FROM partitioned table stored in parquet 50.26947 s SELECT COUNT(*) FROM same table as previous but loaded with parquetFile instead of through hive 2s {code} was: Here are some numbers, all queries return ~20million: SELECT COUNT(*) FROM non partitioned table 5.496467726 s SELECT COUNT(*) FROM partitioned table stored in parquet 50.26947 s SELECT COUNT(*) FROM same table as previous but loaded with parquetFile instead of through hive 2s Reading from Partitioned Tables is Slow --- Key: SPARK-2443 URL: https://issues.apache.org/jira/browse/SPARK-2443 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Zongheng Yang Here are some numbers, all queries return ~20million: {code} SELECT COUNT(*) FROM non partitioned table 5.496467726 s SELECT COUNT(*) FROM partitioned table stored in parquet 50.26947 s SELECT COUNT(*) FROM same table as previous but loaded with parquetFile instead of through hive 2s {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2298) Show stage attempt in UI
[ https://issues.apache.org/jira/browse/SPARK-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058177#comment-14058177 ] Masayoshi TSUZUKI commented on SPARK-2298: -- Thank you for your reply. Yes, I'm interested in it! Show stage attempt in UI Key: SPARK-2298 URL: https://issues.apache.org/jira/browse/SPARK-2298 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Reynold Xin Assignee: Andrew Or Attachments: Screen Shot 2014-06-25 at 4.54.46 PM.png We should add a column to the web ui to show stage attempt id. Then tasks should be grouped by (stageId, stageAttempt) tuple. When a stage is resubmitted (e.g. due to fetch failures), we should get a different entry in the web ui and tasks for the resubmission go there. See the attached screenshot for the confusing status quo. We currently show the same stage entry twice, and then tasks appear in both. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-944) Give example of writing to HBase from Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058184#comment-14058184 ] Ted Malaska commented on SPARK-944: --- Thank you for the assign. I will start working on this tomorrow. We will review our approach to HBase and Spark and I will write the patch over the weekend. Thanks again. Give example of writing to HBase from Spark Streaming - Key: SPARK-944 URL: https://issues.apache.org/jira/browse/SPARK-944 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Patrick Wendell Assignee: Ted Malaska Attachments: MetricAggregatorHBase.scala -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1853) Show Streaming application code context (file, line number) in Spark Stages UI
[ https://issues.apache.org/jira/browse/SPARK-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058195#comment-14058195 ] Mubarak Seyed commented on SPARK-1853: -- I just added _streaming.dstream_ in the regex and it filters the DStream code context but still shows other calls such as _apply at Option.scala, apply at List.scala_ {code} private val SPARK_CLASS_REGEX = """^org\.apache\.spark(\.api\.java)?(\.util)?(\.rdd)?(\.streaming\.dstream)?\.[A-Z]""".r {code} You are correct that we can't just keep adding regexes for internal code. I will start work on the solution as you advised. Thanks Show Streaming application code context (file, line number) in Spark Stages UI -- Key: SPARK-1853 URL: https://issues.apache.org/jira/browse/SPARK-1853 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.0 Reporter: Tathagata Das Assignee: Mubarak Seyed Fix For: 1.1.0 Attachments: Screen Shot 2014-07-03 at 2.54.05 PM.png Right now, the code context (file, and line number) shown for streaming jobs in stages UI is meaningless as it refers to internal DStream:random line rather than user application file. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2444) Make spark.yarn.executor.memoryOverhead a first class citizen
Nishkam Ravi created SPARK-2444: --- Summary: Make spark.yarn.executor.memoryOverhead a first class citizen Key: SPARK-2444 URL: https://issues.apache.org/jira/browse/SPARK-2444 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.0.0 Reporter: Nishkam Ravi Higher value of spark.yarn.executor.memoryOverhead is critical to running Spark applications on Yarn (https://issues.apache.org/jira/browse/SPARK-2398) at least for 1.0. It would be great to have this parameter highlighted in the docs/usage. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2415) RowWriteSupport should handle empty ArrayType correctly.
[ https://issues.apache.org/jira/browse/SPARK-2415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2415. - Resolution: Fixed Fix Version/s: 1.0.2 1.1.0 RowWriteSupport should handle empty ArrayType correctly. Key: SPARK-2415 URL: https://issues.apache.org/jira/browse/SPARK-2415 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin Assignee: Takuya Ueshin Fix For: 1.1.0, 1.0.2 {{RowWriteSupport}} doesn't write empty {{ArrayType}} value, so the read value becomes {{null}}. It should write empty {{ArrayType}} value as it is. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2428) Add except and intersect methods to SchemaRDD.
[ https://issues.apache.org/jira/browse/SPARK-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2428. - Resolution: Fixed Fix Version/s: 1.1.0 Assignee: Takuya Ueshin Add except and intersect methods to SchemaRDD. -- Key: SPARK-2428 URL: https://issues.apache.org/jira/browse/SPARK-2428 Project: Spark Issue Type: Improvement Components: SQL Reporter: Takuya Ueshin Assignee: Takuya Ueshin Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2398) Trouble running Spark 1.0 on Yarn
[ https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058298#comment-14058298 ] Guoqiang Li edited comment on SPARK-2398 at 7/11/14 3:16 AM: - Q1. {{-Xmx}} is only the heap space, {{native library}} and {{sun.misc.Unsafe}} can easily allocate memory outside Java heap. reference http://stackoverflow.com/questions/6527131/java-using-more-memory-than-the-allocated-memory. Q2. This is not a bug. We can disable this check by setting {{yarn.nodemanager.pmem-check-enabled}} to {{false}} . Its default value is {{true}} in [yarn-default.xml|http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml] was (Author: gq): Q1. {{-Xmx}} is only the heap space, {{native library}} and {{sun.misc.Unsafe}} can easily allocate memory outside Java heap. reference http://stackoverflow.com/questions/6527131/java-using-more-memory-than-the-allocated-memory. Q2. This is not a bug. We can disable this check by setting {{yarn.nodemanager.pmem-check-enabled}} to false. Its default value is {{true}} in [yarn-default.xml|http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml] Trouble running Spark 1.0 on Yarn -- Key: SPARK-2398 URL: https://issues.apache.org/jira/browse/SPARK-2398 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Nishkam Ravi Trouble running workloads in Spark-on-YARN cluster mode for Spark 1.0. For example: SparkPageRank when run in standalone mode goes through without any errors (tested for up to 30GB input dataset on a 6-node cluster). Also runs fine for a 1GB dataset in yarn cluster mode. Starts to choke (in yarn cluster mode) as the input data size is increased. Confirmed for 16GB input dataset. The same workload runs fine with Spark 0.9 in both standalone and yarn cluster mode (for up to 30 GB input dataset on a 6-node cluster). Commandline used: (/opt/cloudera/parcels/CDH/lib/spark/bin/spark-submit --master yarn --deploy-mode cluster --properties-file pagerank.conf --driver-memory 30g --driver-cores 16 --num-executors 5 --class org.apache.spark.examples.SparkPageRank /opt/cloudera/parcels/CDH/lib/spark/examples/lib/spark-examples_2.10-1.0.0-cdh5.1.0-SNAPSHOT.jar pagerank_in $NUM_ITER) pagerank.conf: spark.masterspark://c1704.halxg.cloudera.com:7077 spark.home /opt/cloudera/parcels/CDH/lib/spark spark.executor.memory 32g spark.default.parallelism 118 spark.cores.max 96 spark.storage.memoryFraction0.6 spark.shuffle.memoryFraction0.3 spark.shuffle.compress true spark.shuffle.spill.compresstrue spark.broadcast.compresstrue spark.rdd.compress false spark.io.compression.codec org.apache.spark.io.LZFCompressionCodec spark.io.compression.snappy.block.size 32768 spark.reducer.maxMbInFlight 48 spark.local.dir /var/lib/jenkins/workspace/tmp spark.driver.memory 30g spark.executor.cores16 spark.locality.wait 6000 spark.executor.instances5 UI shows ExecutorLostFailure. 
Yarn logs contain numerous exceptions: 14/07/07 17:59:49 WARN network.SendingConnection: Error writing in connection to ConnectionManagerId(a1016.halxg.cloudera.com,54105) java.nio.channels.AsynchronousCloseException at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:205) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:496) at org.apache.spark.network.SendingConnection.write(Connection.scala:361) at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:142) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703) at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619) at java.io.FilterInputStream.close(FilterInputStream.java:181) at org.apache.hadoop.util.LineReader.close(LineReader.java:150) at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:244) at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226) at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63) at org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197) at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63) at
[jira] [Commented] (SPARK-2398) Trouble running Spark 1.0 on Yarn
[ https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058298#comment-14058298 ] Guoqiang Li commented on SPARK-2398: Q1. {{-Xmx}} is only the heap space, {{native library}} and {{sun.misc.Unsafe}} can easily allocate memory outside Java heap. reference http://stackoverflow.com/questions/6527131/java-using-more-memory-than-the-allocated-memory. Q2. This is not a bug. We can disable this check by setting {{yarn.nodemanager.pmem-check-enabled}} to false. Its default value is {{true}} in [yarn-default.xml|http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml] Trouble running Spark 1.0 on Yarn -- Key: SPARK-2398 URL: https://issues.apache.org/jira/browse/SPARK-2398 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Nishkam Ravi Trouble running workloads in Spark-on-YARN cluster mode for Spark 1.0. For example: SparkPageRank when run in standalone mode goes through without any errors (tested for up to 30GB input dataset on a 6-node cluster). Also runs fine for a 1GB dataset in yarn cluster mode. Starts to choke (in yarn cluster mode) as the input data size is increased. Confirmed for 16GB input dataset. The same workload runs fine with Spark 0.9 in both standalone and yarn cluster mode (for up to 30 GB input dataset on a 6-node cluster). Commandline used: (/opt/cloudera/parcels/CDH/lib/spark/bin/spark-submit --master yarn --deploy-mode cluster --properties-file pagerank.conf --driver-memory 30g --driver-cores 16 --num-executors 5 --class org.apache.spark.examples.SparkPageRank /opt/cloudera/parcels/CDH/lib/spark/examples/lib/spark-examples_2.10-1.0.0-cdh5.1.0-SNAPSHOT.jar pagerank_in $NUM_ITER) pagerank.conf: spark.masterspark://c1704.halxg.cloudera.com:7077 spark.home /opt/cloudera/parcels/CDH/lib/spark spark.executor.memory 32g spark.default.parallelism 118 spark.cores.max 96 spark.storage.memoryFraction0.6 spark.shuffle.memoryFraction0.3 spark.shuffle.compress true spark.shuffle.spill.compresstrue spark.broadcast.compresstrue spark.rdd.compress false spark.io.compression.codec org.apache.spark.io.LZFCompressionCodec spark.io.compression.snappy.block.size 32768 spark.reducer.maxMbInFlight 48 spark.local.dir /var/lib/jenkins/workspace/tmp spark.driver.memory 30g spark.executor.cores16 spark.locality.wait 6000 spark.executor.instances5 UI shows ExecutorLostFailure. 
Yarn logs contain numerous exceptions: 14/07/07 17:59:49 WARN network.SendingConnection: Error writing in connection to ConnectionManagerId(a1016.halxg.cloudera.com,54105) java.nio.channels.AsynchronousCloseException at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:205) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:496) at org.apache.spark.network.SendingConnection.write(Connection.scala:361) at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:142) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703) at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619) at java.io.FilterInputStream.close(FilterInputStream.java:181) at org.apache.hadoop.util.LineReader.close(LineReader.java:150) at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:244) at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226) at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63) at org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197) at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63) at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97) at
[jira] [Comment Edited] (SPARK-2398) Trouble running Spark 1.0 on Yarn
[ https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058298#comment-14058298 ] Guoqiang Li edited comment on SPARK-2398 at 7/11/14 3:17 AM: - Q1. {{-Xmx}} is only the heap space, {{native library}} and {{sun.misc.Unsafe}} can easily allocate memory outside Java heap. reference http://stackoverflow.com/questions/6527131/java-using-more-memory-than-the-allocated-memory. Q2. This is not a bug. We can disable this check by setting {{yarn.nodemanager.pmem-check-enabled}} to {{false}} . Its default value is {{true}} in [yarn-default.xml|http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml] was (Author: gq): Q1. {{-Xmx}} is only the heap space, {{native library}} and {{sun.misc.Unsafe}} can easily allocate memory outside Java heap. reference http://stackoverflow.com/questions/6527131/java-using-more-memory-than-the-allocated-memory. Q2. This is not a bug. We can disable this check by setting {{yarn.nodemanager.pmem-check-enabled}} to {{false}} . Its default value is {{true}} in [yarn-default.xml|http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml] Trouble running Spark 1.0 on Yarn -- Key: SPARK-2398 URL: https://issues.apache.org/jira/browse/SPARK-2398 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Nishkam Ravi Trouble running workloads in Spark-on-YARN cluster mode for Spark 1.0. For example: SparkPageRank when run in standalone mode goes through without any errors (tested for up to 30GB input dataset on a 6-node cluster). Also runs fine for a 1GB dataset in yarn cluster mode. Starts to choke (in yarn cluster mode) as the input data size is increased. Confirmed for 16GB input dataset. The same workload runs fine with Spark 0.9 in both standalone and yarn cluster mode (for up to 30 GB input dataset on a 6-node cluster). Commandline used: (/opt/cloudera/parcels/CDH/lib/spark/bin/spark-submit --master yarn --deploy-mode cluster --properties-file pagerank.conf --driver-memory 30g --driver-cores 16 --num-executors 5 --class org.apache.spark.examples.SparkPageRank /opt/cloudera/parcels/CDH/lib/spark/examples/lib/spark-examples_2.10-1.0.0-cdh5.1.0-SNAPSHOT.jar pagerank_in $NUM_ITER) pagerank.conf: spark.masterspark://c1704.halxg.cloudera.com:7077 spark.home /opt/cloudera/parcels/CDH/lib/spark spark.executor.memory 32g spark.default.parallelism 118 spark.cores.max 96 spark.storage.memoryFraction0.6 spark.shuffle.memoryFraction0.3 spark.shuffle.compress true spark.shuffle.spill.compresstrue spark.broadcast.compresstrue spark.rdd.compress false spark.io.compression.codec org.apache.spark.io.LZFCompressionCodec spark.io.compression.snappy.block.size 32768 spark.reducer.maxMbInFlight 48 spark.local.dir /var/lib/jenkins/workspace/tmp spark.driver.memory 30g spark.executor.cores16 spark.locality.wait 6000 spark.executor.instances5 UI shows ExecutorLostFailure. 
Yarn logs contain numerous exceptions: 14/07/07 17:59:49 WARN network.SendingConnection: Error writing in connection to ConnectionManagerId(a1016.halxg.cloudera.com,54105) java.nio.channels.AsynchronousCloseException at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:205) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:496) at org.apache.spark.network.SendingConnection.write(Connection.scala:361) at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:142) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703) at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619) at java.io.FilterInputStream.close(FilterInputStream.java:181) at org.apache.hadoop.util.LineReader.close(LineReader.java:150) at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:244) at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226) at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63) at org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197) at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63) at
[jira] [Comment Edited] (SPARK-911) Support map pruning on sorted (K, V) RDD's
[ https://issues.apache.org/jira/browse/SPARK-911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058322#comment-14058322 ] Aaron edited comment on SPARK-911 at 7/11/14 3:50 AM: -- I decided to take a look at this myself (disclaimer: I haven't worked on Spark's source code before). A couple of things I noticed were: 1. since compute is called per split for all RDD's, the best I could figure out using that approach is to just return an empty iterator, not sure if that creates a weird partition situation 2. you could more efficiently check the partitions that could contain correct numbers, but with only an iterator you have to linearly search was (Author: aaronjosephs): I decided to take a look at this myself (disclaimer: I haven't worked on Spark's source code before). A couple of things I noticed were: 1. since compute is called per split for all RDD's, the best I could figure out using that approach is to just return an empty iterator 2. you could more efficiently check the partitions that could contain correct numbers, but with only an iterator you have to linearly search Support map pruning on sorted (K, V) RDD's -- Key: SPARK-911 URL: https://issues.apache.org/jira/browse/SPARK-911 Project: Spark Issue Type: Bug Reporter: Patrick Wendell If someone has sorted a (K, V) rdd, we should offer them a way to filter a range of the partitions that employs map pruning. This would be simple using a small range index within the rdd itself. A good example is I sort my dataset by time and then I want to serve queries that are restricted to a certain time range. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-911) Support map pruning on sorted (K, V) RDD's
[ https://issues.apache.org/jira/browse/SPARK-911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058322#comment-14058322 ] Aaron commented on SPARK-911: - I decided to take a look at this myself (disclaimer: I haven't worked on Spark's source code before). A couple of things I noticed were: 1. since compute is called per split for all RDD's, the best I could figure out using that approach is to just return an empty iterator 2. you could more efficiently check the partitions that could contain correct numbers, but with only an iterator you have to linearly search Support map pruning on sorted (K, V) RDD's -- Key: SPARK-911 URL: https://issues.apache.org/jira/browse/SPARK-911 Project: Spark Issue Type: Bug Reporter: Patrick Wendell If someone has sorted a (K, V) rdd, we should offer them a way to filter a range of the partitions that employs map pruning. This would be simple using a small range index within the rdd itself. A good example is I sort my dataset by time and then I want to serve queries that are restricted to a certain time range. -- This message was sent by Atlassian JIRA (v6.2#6252)
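A hedged sketch of the range-index idea from the description, using the existing PartitionPruningRDD to skip partitions rather than returning empty iterators from compute; the key type, the (min, max) index, and the function names are illustrative assumptions.
{code}
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.{PartitionPruningRDD, RDD}

object SortedRangeFilter {
  // Assumes `sorted` came from sortByKey, so each partition holds a contiguous
  // key range; build a per-partition (min, max) index and drop partitions
  // whose range cannot overlap [lo, hi].
  def filterByKeyRange(sorted: RDD[(Long, String)], lo: Long, hi: Long): RDD[(Long, String)] = {
    val bounds = sorted.mapPartitionsWithIndex { (idx, iter) =>
      if (iter.hasNext) {
        val keys = iter.map(_._1)
        var min = keys.next()
        var max = min
        keys.foreach { k => if (k < min) min = k; if (k > max) max = k }
        Iterator((idx, (min, max)))
      } else {
        Iterator.empty
      }
    }.collectAsMap()

    val pruned = PartitionPruningRDD.create(sorted, (idx: Int) =>
      bounds.get(idx).exists { case (min, max) => max >= lo && min <= hi })
    // Still filter within the surviving partitions for exact range semantics.
    pruned.filter { case (k, _) => k >= lo && k <= hi }
  }
}
{code}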
[jira] [Resolved] (SPARK-2358) Add an option to include native BLAS/LAPACK loader in the build
[ https://issues.apache.org/jira/browse/SPARK-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2358. Resolution: Fixed Fix Version/s: 1.1.0 Add an option to include native BLAS/LAPACK loader in the build --- Key: SPARK-2358 URL: https://issues.apache.org/jira/browse/SPARK-2358 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.1.0 It would be easy for users to include the netlib-java jniloader in the spark jar, which is LGPL-licensed. We can follow the same approach as ganglia support in Spark, which is enabled by turning on SPARK_GANGLIA_LGPL at build time. We can use SPARK_NETLIB_LGPL flag for this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2150) Provide direct link to finished application UI in yarn resource manager UI
[ https://issues.apache.org/jira/browse/SPARK-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058397#comment-14058397 ] Masayoshi TSUZUKI commented on SPARK-2150: -- Is someone working on this ticket? The pull request seems not to have been processed. Provide direct link to finished application UI in yarn resource manager UI -- Key: SPARK-2150 URL: https://issues.apache.org/jira/browse/SPARK-2150 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.0.0 Reporter: Rahul Singhal Priority: Minor Currently the link that is provided as the tracking URL for a finished application in the YARN resource manager UI is that of the Spark history server home page. We should provide a direct link to the application UI so that the user does not have to figure out the correspondence between the YARN application ID and the link on the Spark history server home page. -- This message was sent by Atlassian JIRA (v6.2#6252)