[jira] [Updated] (SPARK-7880) Silent failure if assembly jar is corrupted
[ https://issues.apache.org/jira/browse/SPARK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-7880:
    Target Version/s: 1.3.2, 1.4.2, 1.5.0  (was: 1.3.2, 1.4.1, 1.5.0)

> Silent failure if assembly jar is corrupted
>
> Key: SPARK-7880
> URL: https://issues.apache.org/jira/browse/SPARK-7880
> Project: Spark
> Issue Type: Bug
> Components: Spark Submit
> Affects Versions: 1.3.0
> Reporter: Andrew Or
>
> If you try to run `bin/spark-submit` with a corrupted jar, you get no output
> and your application does not run. We should have an informative message that
> indicates the failure to open the jar instead of silently swallowing it.
> This is caused by this line:
> https://github.com/apache/spark/blob/61664732b25b35f94be35a42cde651cbfd0e02b7/bin/spark-class#L75
[jira] [Updated] (SPARK-5905) Improve RowMatrix user guide and doc.
[ https://issues.apache.org/jira/browse/SPARK-5905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-5905:
    Target Version/s: 1.5.0  (was: 1.4.1, 1.5.0)

> Improve RowMatrix user guide and doc.
>
> Key: SPARK-5905
> URL: https://issues.apache.org/jira/browse/SPARK-5905
> Project: Spark
> Issue Type: Improvement
> Components: Documentation, MLlib
> Affects Versions: 1.3.0
> Reporter: Xiangrui Meng
> Priority: Minor
>
> From mbofb's comment in PR https://github.com/apache/spark/pull/4680:
> {code}
> The description of RowMatrix.computeSVD and mllib-dimensionality-reduction.html
> should be more precise/explicit regarding the m x n matrix. From the current
> description I would conclude that n refers to the rows. According to
> http://math.stackexchange.com/questions/191711/how-many-rows-and-columns-are-in-an-m-x-n-matrix
> this way of describing a matrix is only used in particular domains. As a reader
> interested in applying SVD, I would prefer the more common m x n convention of
> rows x columns (e.g. http://en.wikipedia.org/wiki/Matrix_%28mathematics%29 ),
> which is also used in http://en.wikipedia.org/wiki/Latent_semantic_analysis
> (and within the ARPACK manual:
> "
> N Integer. (INPUT) - Dimension of the eigenproblem.
> NEV Integer. (INPUT) - Number of eigenvalues of OP to be computed. 0 < NEV < N.
> NCV Integer. (INPUT) - Number of columns of the matrix V (less than or equal to N).
> "
> ).
>
> Description of RowMatrix.computeSVD and mllib-dimensionality-reduction.html:
> "We assume n is smaller than m." Is this just a recommendation or a hard
> requirement? The condition does not seem to be checked and does not cause an
> IllegalArgumentException; the processing finishes even though the vectors have
> a higher dimension than the number of vectors.
>
> Description of RowMatrix.computePrincipalComponents, or RowMatrix in general:
> I got an exception:
> java.lang.IllegalArgumentException: Argument with more than 65535 cols: 7949273
>   at org.apache.spark.mllib.linalg.distributed.RowMatrix.checkNumColumns(RowMatrix.scala:131)
>   at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:318)
>   at org.apache.spark.mllib.linalg.distributed.RowMatrix.computePrincipalComponents(RowMatrix.scala:373)
> It would be nice to document this 65535-column restriction (if it still applies in 1.3).
> {code}
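For reference, a minimal Scala sketch of the row/column convention the doc should spell out (assuming an existing SparkContext `sc`; the data values are made up). Each RDD element is one row of the matrix, so m is the number of vectors and n is their dimension:

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// 4 vectors of dimension 3 form a 4 x 3 (m x n, rows x columns) matrix.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 0.0),
  Vectors.dense(0.0, 2.0, 0.0),
  Vectors.dense(0.0, 0.0, 3.0),
  Vectors.dense(1.0, 1.0, 1.0)))
val mat = new RowMatrix(rows)   // m = mat.numRows() = 4, n = mat.numCols() = 3

// computeSVD(k, ...) requires k <= n; the guide's "n is smaller than m"
// assumption refers to the number of columns, not the number of rows.
val svd = mat.computeSVD(2, computeU = true)
println(s"U: ${svd.U.numRows()} x ${svd.U.numCols()}, " +
  s"s: ${svd.s.size}, V: ${svd.V.numRows} x ${svd.V.numCols}")
{code}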
[jira] [Updated] (SPARK-6174) Improve doc: Python ALS, MatrixFactorizationModel
[ https://issues.apache.org/jira/browse/SPARK-6174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-6174:
    Target Version/s: 1.5.0  (was: 1.4.1, 1.5.0)

> Improve doc: Python ALS, MatrixFactorizationModel
>
> Key: SPARK-6174
> URL: https://issues.apache.org/jira/browse/SPARK-6174
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation, MLlib, PySpark
> Affects Versions: 1.3.0
> Reporter: Joseph K. Bradley
> Priority: Minor
>
> The Python docs for recommendation have almost no content except an example.
> Add class, method & attribute descriptions.
[jira] [Updated] (SPARK-6129) Add a section in user guide for model evaluation
[ https://issues.apache.org/jira/browse/SPARK-6129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-6129:
    Target Version/s: 1.5.0  (was: 1.4.1, 1.5.0)

> Add a section in user guide for model evaluation
>
> Key: SPARK-6129
> URL: https://issues.apache.org/jira/browse/SPARK-6129
> Project: Spark
> Issue Type: New Feature
> Components: Documentation, MLlib
> Reporter: Xiangrui Meng
>
> We now have evaluation metrics for binary, multiclass, ranking, and multilabel
> in MLlib. It would be nice to have a section in the user guide to summarize them.
[jira] [Updated] (SPARK-6266) PySpark SparseVector missing doc for size, indices, values
[ https://issues.apache.org/jira/browse/SPARK-6266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-6266:
    Target Version/s: 1.5.0  (was: 1.4.1, 1.5.0)

> PySpark SparseVector missing doc for size, indices, values
>
> Key: SPARK-6266
> URL: https://issues.apache.org/jira/browse/SPARK-6266
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation, MLlib, PySpark
> Affects Versions: 1.3.0
> Reporter: Joseph K. Bradley
> Priority: Minor
>
> Need to add doc for the size, indices, and values attributes.
[jira] [Updated] (SPARK-8016) YARN cluster / client modes have different app names for python
[ https://issues.apache.org/jira/browse/SPARK-8016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-8016:
    Target Version/s: 1.5.0  (was: 1.4.1, 1.5.0)

> YARN cluster / client modes have different app names for python
>
> Key: SPARK-8016
> URL: https://issues.apache.org/jira/browse/SPARK-8016
> Project: Spark
> Issue Type: Bug
> Components: PySpark, YARN
> Affects Versions: 1.4.0
> Reporter: Andrew Or
> Priority: Minor
> Attachments: python.png
>
> See screenshot.
[jira] [Updated] (SPARK-8050) Make Savable and Loader Java-friendly.
[ https://issues.apache.org/jira/browse/SPARK-8050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-8050:
    Target Version/s: 1.5.0  (was: 1.4.1, 1.5.0)

> Make Savable and Loader Java-friendly.
>
> Key: SPARK-8050
> URL: https://issues.apache.org/jira/browse/SPARK-8050
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 1.3.0, 1.4.0
> Reporter: Xiangrui Meng
> Assignee: Xiangrui Meng
> Priority: Minor
>
> Should overload save/load to accept JavaSparkContext.
[jira] [Commented] (SPARK-8807) Add between operator in SparkR
[ https://issues.apache.org/jira/browse/SPARK-8807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14614688#comment-14614688 ]

Venkata Vineel commented on SPARK-8807:

[~yu_ishikawa] Can you please add more details on this? I would like to work on this issue. Please consider.

> Add between operator in SparkR
>
> Key: SPARK-8807
> URL: https://issues.apache.org/jira/browse/SPARK-8807
> Project: Spark
> Issue Type: New Feature
> Components: SparkR
> Reporter: Yu Ishikawa
>
> Add a between operator in SparkR:
> ```
> df$age between c(1, 2)
> ```
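For context, the Scala DataFrame API already exposes {{Column.between}} (since 1.4), which the requested SparkR operator could mirror. A minimal Scala sketch, assuming a DataFrame `df` with an "age" column:

{code}
import org.apache.spark.sql.functions.col

// Scala analogue of the requested SparkR operator: keep rows with 1 <= age <= 2.
val filtered = df.filter(col("age").between(1, 2))
{code}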
[jira] [Updated] (SPARK-8400) ml.ALS doesn't handle -1 block size
[ https://issues.apache.org/jira/browse/SPARK-8400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-8400:
    Target Version/s: 1.3.2, 1.4.2, 1.5.0  (was: 1.3.2, 1.4.1, 1.5.0)

> ml.ALS doesn't handle -1 block size
>
> Key: SPARK-8400
> URL: https://issues.apache.org/jira/browse/SPARK-8400
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 1.3.1
> Reporter: Xiangrui Meng
>
> Under spark.mllib, if the number of blocks is set to -1, we set the block size
> automatically based on the input partition size. However, this behavior is not
> preserved in the spark.ml API. If a user sets -1 in Spark 1.3, it will not
> work, but no error message is shown.
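A sketch of the discrepancy, assuming an RDD[Rating] named `ratings` for spark.mllib and a DataFrame `df` with user/item/rating columns for spark.ml (variable names are made up for illustration):

{code}
import org.apache.spark.mllib.recommendation.{ALS => MLlibALS, Rating}
import org.apache.spark.ml.recommendation.ALS

// spark.mllib: blocks = -1 means "auto-configure from the input partitioning".
val mllibModel = MLlibALS.train(ratings, /* rank */ 10, /* iterations */ 10,
  /* lambda */ 0.01, /* blocks */ -1)

// spark.ml: the same -1 is passed straight through, so it is neither
// auto-configured nor rejected with a clear error.
val estimator = new ALS()
  .setNumBlocks(-1)   // should be translated like spark.mllib, or fail loudly
  .setRank(10)
  .setMaxIter(10)
val mlModel = estimator.fit(df)
{code}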
[jira] [Updated] (SPARK-8390) Update DirectKafkaWordCount examples to show how offset ranges can be used
[ https://issues.apache.org/jira/browse/SPARK-8390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-8390:
    Target Version/s: 1.5.0  (was: 1.4.1, 1.5.0)

> Update DirectKafkaWordCount examples to show how offset ranges can be used
>
> Key: SPARK-8390
> URL: https://issues.apache.org/jira/browse/SPARK-8390
> Project: Spark
> Issue Type: Improvement
> Components: Streaming
> Affects Versions: 1.4.0
> Reporter: Tathagata Das
> Assignee: Cody Koeninger
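A minimal Scala sketch of the kind of addition the example could show (assuming a StreamingContext `ssc`; the broker and topic values are placeholders):

{code}
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val brokers = "broker1:9092"   // placeholder
val topic = "wordcount"        // placeholder

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, Map("metadata.broker.list" -> brokers), Set(topic))

stream.foreachRDD { rdd =>
  // Each RDD produced by the direct stream carries the Kafka offset ranges it read.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { o =>
    println(s"${o.topic} partition ${o.partition}: ${o.fromOffset} -> ${o.untilOffset}")
  }
}
{code}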
[jira] [Updated] (SPARK-8593) History Server doesn't show complete application when one attempt inprogress
[ https://issues.apache.org/jira/browse/SPARK-8593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-8593:
    Target Version/s:   (was: 1.4.1)

> History Server doesn't show complete application when one attempt inprogress
>
> Key: SPARK-8593
> URL: https://issues.apache.org/jira/browse/SPARK-8593
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Affects Versions: 1.4.0
> Reporter: Thomas Graves
>
> The Spark history server doesn't show an application if the first attempt of
> the application is still in progress.
> Here are the files in HDFS:
> -rwxrwx---   3 tgraves hdfs        234 2015-06-24 15:49 sparkhistory/application_1433751980223_18926_1.inprogress
> -rwxrwx---   3 tgraves hdfs    9609450 2015-06-24 15:51 sparkhistory/application_1433751980223_18926_2
> The UI shows them if I set showIncomplete=true.
> Removing the inprogress file allows it to show up when showIncomplete is false.
> It should be smart enough to at least show the second, successful attempt.
[jira] [Updated] (SPARK-8414) Ensure ClosureCleaner actually triggers clean ups
[ https://issues.apache.org/jira/browse/SPARK-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-8414:
    Target Version/s: 1.5.0  (was: 1.4.1, 1.5.0)

> Ensure ClosureCleaner actually triggers clean ups
>
> Key: SPARK-8414
> URL: https://issues.apache.org/jira/browse/SPARK-8414
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.4.0
> Reporter: Andrew Or
> Assignee: Andrew Or
>
> Right now it cleans up old references only through natural GCs, which may not
> occur if the driver has infinite RAM. We should do a periodic GC to make sure
> that we actually do clean things up. Something like once per 30 minutes seems
> relatively inexpensive.
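A minimal sketch of the idea (the interval and the thread handling here are assumptions for illustration, not the actual Spark implementation): schedule a periodic GC on the driver so that weakly-referenced cleanup tasks run even when the heap never fills up.

{code}
import java.util.concurrent.{Executors, TimeUnit}

// Trigger a full GC every 30 minutes so reference-queue-based cleanup fires
// even on a driver with a very large, mostly idle heap.
val gcExecutor = Executors.newSingleThreadScheduledExecutor()
gcExecutor.scheduleAtFixedRate(new Runnable {
  override def run(): Unit = System.gc()
}, 30, 30, TimeUnit.MINUTES)
{code}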
[jira] [Updated] (SPARK-8828) Revert the change of SPARK-5680
[ https://issues.apache.org/jira/browse/SPARK-8828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-8828:
    Component/s: SQL

> Revert the change of SPARK-5680
>
> Key: SPARK-8828
> URL: https://issues.apache.org/jira/browse/SPARK-8828
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.3.1, 1.4.0
> Reporter: Yin Huai
> Priority: Critical
>
> SPARK-5680 introduced a bug in the sum function. After this change, when all
> input values are null, it returns 0.0 instead of null, which is wrong.
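An illustration of the regression, assuming an existing SparkContext `sc` and SQLContext `sqlContext` (the table and column names are made up):

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

val schema = StructType(Seq(StructField("x", DoubleType, nullable = true)))
val allNulls = sqlContext.createDataFrame(
  sc.parallelize(Seq(Row(null), Row(null))), schema)
allNulls.registerTempTable("t")

// Standard SQL semantics: SUM over only-NULL input is NULL,
// not 0.0 as returned after SPARK-5680.
sqlContext.sql("SELECT SUM(x) FROM t").show()
{code}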
[jira] [Updated] (SPARK-8747) fix EqualNullSafe for binary type
[ https://issues.apache.org/jira/browse/SPARK-8747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-8747:
    Assignee: Wenchen Fan

> fix EqualNullSafe for binary type
>
> Key: SPARK-8747
> URL: https://issues.apache.org/jira/browse/SPARK-8747
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Wenchen Fan
> Assignee: Wenchen Fan
> Priority: Minor
> Fix For: 1.5.0
[jira] [Updated] (SPARK-7401) Dot product and squared_distances should be vectorized in Vectors
[ https://issues.apache.org/jira/browse/SPARK-7401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-7401:
    Assignee: Manoj Kumar

> Dot product and squared_distances should be vectorized in Vectors
>
> Key: SPARK-7401
> URL: https://issues.apache.org/jira/browse/SPARK-7401
> Project: Spark
> Issue Type: Sub-task
> Components: MLlib, PySpark
> Reporter: Manoj Kumar
> Assignee: Manoj Kumar
> Fix For: 1.5.0
[jira] [Commented] (SPARK-5133) Feature Importance for Decision Tree (Ensembles)
[ https://issues.apache.org/jira/browse/SPARK-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14614682#comment-14614682 ]

Peter Prettenhofer commented on SPARK-5133:

[~yalamart] I'm already working on it; I haven't published a PR yet.

> Feature Importance for Decision Tree (Ensembles)
>
> Key: SPARK-5133
> URL: https://issues.apache.org/jira/browse/SPARK-5133
> Project: Spark
> Issue Type: New Feature
> Components: ML, MLlib
> Reporter: Peter Prettenhofer
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> Add feature importance to the decision tree model and tree ensemble models.
> If people are interested in this feature I could implement it given a mentor
> (API decisions, etc.). Please find a description of the feature below:
> Decision trees intrinsically perform feature selection by selecting appropriate
> split points. This information can be used to assess the relative importance of
> a feature.
> Relative feature importance gives valuable insight into a decision tree or tree
> ensemble and can even be used for feature selection.
> More information on feature importance (via decrease in impurity) can be found
> in ESLII (10.13.1) or here [1].
> R's randomForest package uses a different technique for assessing variable
> importance that is based on permutation tests.
> All necessary information to create relative importance scores should be
> available in the tree representation (class Node; split, impurity gain,
> (weighted) number of samples?).
> [1] http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation
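A simplified Scala sketch of the impurity-decrease idea against the spark.mllib tree model (a design sketch only: it sums the information gain per split feature without weighting by the number of samples reaching each node, which, as the description notes, is not currently stored on Node):

{code}
import scala.collection.mutable
import org.apache.spark.mllib.tree.model.{DecisionTreeModel, Node}

// Traverse the tree, accumulate each split's information gain by feature index,
// then normalize so the importances sum to 1.
def featureImportances(model: DecisionTreeModel): Map[Int, Double] = {
  val importances = mutable.Map.empty[Int, Double].withDefaultValue(0.0)
  def visit(node: Node): Unit = {
    if (!node.isLeaf) {
      for (split <- node.split; stats <- node.stats) {
        importances(split.feature) += stats.gain
      }
      node.leftNode.foreach(visit)
      node.rightNode.foreach(visit)
    }
  }
  visit(model.topNode)
  val total = importances.values.sum
  if (total > 0) importances.mapValues(_ / total).toMap else importances.toMap
}
{code}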
[jira] [Commented] (SPARK-8743) Deregister Codahale metrics for streaming when StreamingContext is closed
[ https://issues.apache.org/jira/browse/SPARK-8743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14614681#comment-14614681 ]

Tathagata Das commented on SPARK-8743:

[~neelesh77] Any ETA on this?

> Deregister Codahale metrics for streaming when StreamingContext is closed
>
> Key: SPARK-8743
> URL: https://issues.apache.org/jira/browse/SPARK-8743
> Project: Spark
> Issue Type: Sub-task
> Components: Streaming
> Affects Versions: 1.4.1
> Reporter: Tathagata Das
> Assignee: Neelesh Srinivas Salian
> Labels: starter
>
> Currently, when the StreamingContext is closed, the registered metrics are not
> deregistered. If another streaming context is started, it throws a warning
> saying that the metrics are already registered.
> The solution is to deregister the metrics when the StreamingContext is stopped.
[jira] [Updated] (SPARK-8788) Java unit test for PCA transformer
[ https://issues.apache.org/jira/browse/SPARK-8788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-8788:
    Affects Version/s:   (was: 1.5.0)
    Priority: Minor  (was: Major)

[~yanboliang] please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark and set JIRA fields more carefully. This can't affect version 1.5, which doesn't exist, and is not Major.

> Java unit test for PCA transformer
>
> Key: SPARK-8788
> URL: https://issues.apache.org/jira/browse/SPARK-8788
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: Yanbo Liang
> Priority: Minor
>
> Add Java unit test for PCA transformer.
[jira] [Assigned] (SPARK-8833) Kafka Direct API support offset in zookeeper
[ https://issues.apache.org/jira/browse/SPARK-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8833:
    Assignee:   (was: Apache Spark)

> Kafka Direct API support offset in zookeeper
>
> Key: SPARK-8833
> URL: https://issues.apache.org/jira/browse/SPARK-8833
> Project: Spark
> Issue Type: Bug
> Components: Streaming
> Affects Versions: 1.4.0
> Reporter: guowei
>
> The Kafka Direct API only supports consuming a topic from the latest or
> earliest offset, but users usually need to resume from the last consumed
> offset when restarting a streaming app.
[jira] [Assigned] (SPARK-8833) Kafka Direct API support offset in zookeeper
[ https://issues.apache.org/jira/browse/SPARK-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8833:
    Assignee: Apache Spark

> Kafka Direct API support offset in zookeeper
>
> Key: SPARK-8833
> URL: https://issues.apache.org/jira/browse/SPARK-8833
> Project: Spark
> Issue Type: Bug
> Components: Streaming
> Affects Versions: 1.4.0
> Reporter: guowei
> Assignee: Apache Spark
>
> The Kafka Direct API only supports consuming a topic from the latest or
> earliest offset, but users usually need to resume from the last consumed
> offset when restarting a streaming app.
[jira] [Commented] (SPARK-8833) Kafka Direct API support offset in zookeeper
[ https://issues.apache.org/jira/browse/SPARK-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14614677#comment-14614677 ]

Apache Spark commented on SPARK-8833:

User 'guowei2' has created a pull request for this issue:
https://github.com/apache/spark/pull/7235

> Kafka Direct API support offset in zookeeper
>
> Key: SPARK-8833
> URL: https://issues.apache.org/jira/browse/SPARK-8833
> Project: Spark
> Issue Type: Bug
> Components: Streaming
> Affects Versions: 1.4.0
> Reporter: guowei
>
> The Kafka Direct API only supports consuming a topic from the latest or
> earliest offset, but users usually need to resume from the last consumed
> offset when restarting a streaming app.
[jira] [Commented] (SPARK-8646) PySpark does not run on YARN
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14614673#comment-14614673 ]

Sean Owen commented on SPARK-8646:

[~j_houg] is the resolution here just that pandas has to be installed if pandas is used?

> PySpark does not run on YARN
>
> Key: SPARK-8646
> URL: https://issues.apache.org/jira/browse/SPARK-8646
> Project: Spark
> Issue Type: Bug
> Components: PySpark, YARN
> Affects Versions: 1.4.0
> Environment: SPARK_HOME=local/path/to/spark1.4install/dir
> also with
> SPARK_HOME=local/path/to/spark1.4install/dir
> PYTHONPATH=$SPARK_HOME/python/lib
> Spark apps are submitted with the command:
> $SPARK_HOME/bin/spark-submit outofstock/data_transform.py hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client
> data_transform contains a main method, and the rest of the args are parsed in my own code.
> Reporter: Juliet Hougland
> Attachments: pi-test.log, spark1.4-SPARK_HOME-set-PYTHONPATH-set.log,
> spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, spark1.4-SPARK_HOME-set.log
>
> Running pyspark jobs results in a "no module named pyspark" error when run in
> yarn-client mode in Spark 1.4.
> [I believe this JIRA represents the change that introduced this error.| https://issues.apache.org/jira/browse/SPARK-6869 ]
> This does not represent a binary compatible change to Spark. Scripts that
> worked on previous Spark versions (i.e. commands that use spark-submit) should
> continue to work without modification between minor versions.
[jira] [Commented] (SPARK-8833) Kafka Direct API support offset in zookeeper
[ https://issues.apache.org/jira/browse/SPARK-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14614670#comment-14614670 ]

Sean Owen commented on SPARK-8833:

No, you actually pass the offsets you want to begin consuming at. Are you looking at {{createDirectStream}}? It's {{fromOffsets}}.

> Kafka Direct API support offset in zookeeper
>
> Key: SPARK-8833
> URL: https://issues.apache.org/jira/browse/SPARK-8833
> Project: Spark
> Issue Type: Bug
> Components: Streaming
> Affects Versions: 1.4.0
> Reporter: guowei
>
> The Kafka Direct API only supports consuming a topic from the latest or
> earliest offset, but users usually need to resume from the last consumed
> offset when restarting a streaming app.
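A minimal Scala sketch of the {{fromOffsets}} variant Sean refers to (assuming a StreamingContext `ssc`; the broker, topic, and stored offset values are placeholders, and persisting the offsets, e.g. in ZooKeeper or a database, is left to the application):

{code}
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
// Offsets previously saved by the application: topic "events", partition 0, offset 12345.
val fromOffsets = Map(TopicAndPartition("events", 0) -> 12345L)

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder,
    (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))
{code}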
[jira] [Created] (SPARK-8833) Kafka Direct API support offset in zookeeper
guowei created SPARK-8833:

    Summary: Kafka Direct API support offset in zookeeper
    Key: SPARK-8833
    URL: https://issues.apache.org/jira/browse/SPARK-8833
    Project: Spark
    Issue Type: Bug
    Components: Streaming
    Affects Versions: 1.4.0
    Reporter: guowei

The Kafka Direct API only supports consuming a topic from the latest or earliest
offset, but users usually need to resume from the last consumed offset when
restarting a streaming app.
[jira] [Commented] (SPARK-6981) [SQL] SparkPlanner and QueryExecution should be factored out from SQLContext
[ https://issues.apache.org/jira/browse/SPARK-6981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14614645#comment-14614645 ]

Santiago M. Mola commented on SPARK-6981:

Any progress on this?

> [SQL] SparkPlanner and QueryExecution should be factored out from SQLContext
>
> Key: SPARK-6981
> URL: https://issues.apache.org/jira/browse/SPARK-6981
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.3.0, 1.4.0
> Reporter: Edoardo Vacchi
> Priority: Minor
>
> In order to simplify extensibility with new strategies from third parties, it
> would be better to factor SparkPlanner and QueryExecution out into their own
> classes. Dependent types add additional, unnecessary complexity; besides,
> HiveContext would benefit from this change as well.
[jira] [Created] (SPARK-8832) insertInto() throws error in sparkR
Amar Gondaliya created SPARK-8832:

    Summary: insertInto() throws error in sparkR
    Key: SPARK-8832
    URL: https://issues.apache.org/jira/browse/SPARK-8832
    Project: Spark
    Issue Type: Bug
    Components: R
    Affects Versions: 1.4.0
    Reporter: Amar Gondaliya

insertInto() is not working; it throws an AssertionError when trying to insert
records from one DataFrame into another.

df1 <- generated from another DataFrame after applying a group-by aggregation (column names: "item", "frequency")
registerTempTable(df1, "df")
df2 <- generated from another DataFrame after applying a group-by aggregation (column names: "item", "frequency")
insertInto(df2, "df", overwrite = T)

This throws an AssertionError.
[jira] [Comment Edited] (SPARK-8818) In should not take Any not Column
[ https://issues.apache.org/jira/browse/SPARK-8818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14614635#comment-14614635 ]

Yu Ishikawa edited comment on SPARK-8818 at 7/6/15 7:08 AM:

[~marmbrus] Is what you want to do like this? https://issues.apache.org/jira/browse/SPARK-8348

was (Author: yuu.ishik...@gmail.com):
[~marmbrus] What you want to do is like that? https://issues.apache.org/jira/browse/SPARK-8348

> In should not take Any not Column
>
> Key: SPARK-8818
> URL: https://issues.apache.org/jira/browse/SPARK-8818
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Michael Armbrust
>
> This is pretty verbose having to write {{lit(...)}}:
> {code}
> .where('timestamp in (lit(1435897619640L), lit(1435924856812L)))
> {code}
> I think in most cases people using {{in}} will be listing static values, not columns.
[jira] [Commented] (SPARK-8818) In should not take Any not Column
[ https://issues.apache.org/jira/browse/SPARK-8818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14614635#comment-14614635 ]

Yu Ishikawa commented on SPARK-8818:

[~marmbrus] What you want to do is like that? https://issues.apache.org/jira/browse/SPARK-8348

> In should not take Any not Column
>
> Key: SPARK-8818
> URL: https://issues.apache.org/jira/browse/SPARK-8818
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Michael Armbrust
>
> This is pretty verbose having to write {{lit(...)}}:
> {code}
> .where('timestamp in (lit(1435897619640L), lit(1435924856812L)))
> {code}
> I think in most cases people using {{in}} will be listing static values, not columns.
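A Scala sketch contrasting the current form with the requested shape (assuming a DataFrame `df` with a "timestamp" column; the second call is the proposed overload, not an existing method):

{code}
import org.apache.spark.sql.functions.{col, lit}

// Current form: every literal must be wrapped in lit(...).
df.where(col("timestamp").in(lit(1435897619640L), lit(1435924856812L)))

// Proposed form (does not exist yet): an overload taking Any would wrap
// plain values in lit() internally.
// df.where(col("timestamp").in(1435897619640L, 1435924856812L))
{code}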
[jira] [Commented] (SPARK-8540) KMeans-based outlier detection
[ https://issues.apache.org/jira/browse/SPARK-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14614630#comment-14614630 ]

Venkata Vineel commented on SPARK-8540:

[~josephkb] Can I please work on this (if you can mentor me with the design, etc.)?

> KMeans-based outlier detection
>
> Key: SPARK-8540
> URL: https://issues.apache.org/jira/browse/SPARK-8540
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Reporter: Joseph K. Bradley
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> Proposal for K-Means-based outlier detection:
> * Cluster data using K-Means
> * Provide prediction/filtering functionality which returns outliers/anomalies
> ** This can take some threshold parameter which specifies either (a) how far
>    off a point needs to be to be considered an outlier or (b) how many outliers
>    should be returned.
> Note this will require a bit of API design, which should probably be posted
> and discussed on this JIRA before implementation.
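A minimal Scala sketch of the proposal under option (a), a distance threshold (the threshold value and API shape are assumptions; assumes an RDD[Vector] named `data`):

{code}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Step 1: cluster the data.
val model = KMeans.train(data, /* k */ 5, /* maxIterations */ 20)

// Step 2: score each point by its distance to the nearest cluster center.
def distanceToNearestCenter(point: Vector): Double = {
  val center = model.clusterCenters(model.predict(point))
  math.sqrt(Vectors.sqdist(point, center))
}

// Step 3: flag points farther than a chosen threshold as outliers.
val threshold = 3.0
val outliers: RDD[Vector] = data.filter(distanceToNearestCenter(_) > threshold)
{code}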
[jira] [Commented] (SPARK-6885) Decision trees: predict class probabilities
[ https://issues.apache.org/jira/browse/SPARK-6885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14614629#comment-14614629 ]

Venkata Vineel commented on SPARK-6885:

[~josephkb] Can I work on this? Can you please assign it to me?

> Decision trees: predict class probabilities
>
> Key: SPARK-6885
> URL: https://issues.apache.org/jira/browse/SPARK-6885
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Affects Versions: 1.3.0
> Reporter: Joseph K. Bradley
>
> Under spark.ml, have DecisionTreeClassifier (currently being added) extend
> ProbabilisticClassifier.
[jira] [Commented] (SPARK-8636) CaseKeyWhen has incorrect NULL handling
[ https://issues.apache.org/jira/browse/SPARK-8636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14614627#comment-14614627 ]

Santiago M. Mola commented on SPARK-8636:

[~davies] NULL values are grouped together when using a GROUP BY clause. See
https://en.wikipedia.org/wiki/Null_%28SQL%29#When_two_nulls_are_equal:_grouping.2C_sorting.2C_and_some_set_operations

{quote}
Because SQL:2003 defines all Null markers as being unequal to one another, a
special definition was required in order to group Nulls together when performing
certain operations. SQL defines "any two values that are equal to one another, or
any two Nulls", as "not distinct". This definition of not distinct allows SQL to
group and sort Nulls when the GROUP BY clause (and other keywords that perform
grouping) are used.
{quote}

> CaseKeyWhen has incorrect NULL handling
>
> Key: SPARK-8636
> URL: https://issues.apache.org/jira/browse/SPARK-8636
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.4.0
> Reporter: Santiago M. Mola
> Labels: starter
>
> The CaseKeyWhen implementation in Spark uses the following equals implementation:
> {code}
> private def equalNullSafe(l: Any, r: Any) = {
>   if (l == null && r == null) {
>     true
>   } else if (l == null || r == null) {
>     false
>   } else {
>     l == r
>   }
> }
> {code}
> This is not correct, since in SQL, NULL is never equal to NULL (actually, it is
> not unequal either). In this case, a NULL value in a CASE WHEN expression should
> never match.
> For example, you can execute this in MySQL:
> {code}
> SELECT CASE NULL WHEN NULL THEN "NULL MATCHES" ELSE "NULL DOES NOT MATCH" END FROM DUAL;
> {code}
> And the result will be "NULL DOES NOT MATCH".
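A Scala sketch of the comparison semantics the report asks for (this is an illustration of the requested behavior, not the actual Spark patch): NULL on either side never matches, so a NULL CASE key falls through to the ELSE branch.

{code}
// A NULL key or NULL WHEN value is simply "no match"; only non-null,
// equal values match.
def matchesWhenKey(key: Any, whenValue: Any): Boolean =
  if (key == null || whenValue == null) false else key == whenValue
{code}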