[jira] [Updated] (SPARK-3410) The priority of shutdownhook for ApplicationMaster should not be integer literal
[ https://issues.apache.org/jira/browse/SPARK-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3410: -- Summary: The priority of shutdownhook for ApplicationMaster should not be integer literal (was: The priority of shutdownhook for ApplicationMaster should not be integer literal, rather than refer constant.) > The priority of shutdownhook for ApplicationMaster should not be integer > literal > > > Key: SPARK-3410 > URL: https://issues.apache.org/jira/browse/SPARK-3410 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.1.0 >Reporter: Kousuke Saruta >Priority: Minor > > In ApplicationMaster, the priority of the shutdown hook is set to 30, which > is expected to be higher than the priority of o.a.h.FileSystem's shutdown hook. > In FileSystem, the priority of the shutdown hook is exposed as a public constant > named "SHUTDOWN_HOOK_PRIORITY", so I think it's better to use this constant > for the priority of ApplicationMaster's shutdown hook. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
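For reference, a minimal sketch of what such a change could look like, using Hadoop's public ShutdownHookManager API and the FileSystem.SHUTDOWN_HOOK_PRIORITY constant; the hook body, object name, and the offset added to the constant are illustrative assumptions, not the actual ApplicationMaster code:

{code}
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.util.ShutdownHookManager

object AmShutdownHookSketch {
  def register(cleanup: () => Unit): Unit = {
    // Higher priority means the hook runs earlier; derive it from the public
    // constant instead of a bare literal such as 30, so the AM hook always
    // runs before FileSystem's own shutdown hook closes the file systems.
    val priority = FileSystem.SHUTDOWN_HOOK_PRIORITY + 20
    ShutdownHookManager.get().addShutdownHook(new Runnable {
      override def run(): Unit = cleanup()
    }, priority)
  }
}
{code}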
[jira] [Created] (SPARK-3411) Optimize the schedule procedure in Master
WangTaoTheTonic created SPARK-3411: -- Summary: Optimize the schedule procedure in Master Key: SPARK-3411 URL: https://issues.apache.org/jira/browse/SPARK-3411 Project: Spark Issue Type: Improvement Components: Deploy Reporter: WangTaoTheTonic Priority: Minor If the waiting driver array is too big, the drivers in it will be dispatched to the first worker we get (if it has enough resources), with or without the randomization. We should do the randomization every time we dispatch a driver, in order to balance drivers across workers better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
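A minimal sketch of the idea, not the actual Master.schedule() code; the Worker and Driver case classes below are simplified stand-ins for the real WorkerInfo/DriverInfo structures, and a real implementation would also deduct the assigned resources after each placement:

{code}
import scala.util.Random

case class Worker(id: String, freeCores: Int, freeMemMb: Int)
case class Driver(id: String, cores: Int, memMb: Int)

def dispatch(waitingDrivers: Seq[Driver], workers: Seq[Worker]): Map[String, String] =
  waitingDrivers.flatMap { d =>
    // Re-shuffle the candidates for every driver instead of shuffling once,
    // so a long waiting list does not pile all drivers onto the same worker.
    Random.shuffle(workers)
      .find(w => w.freeCores >= d.cores && w.freeMemMb >= d.memMb)
      .map(w => d.id -> w.id)
  }.toMap
{code}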
[jira] [Commented] (SPARK-3410) The priority of shutdownhook for ApplicationMaster should not be integer literal, rather than refer constant.
[ https://issues.apache.org/jira/browse/SPARK-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122465#comment-14122465 ] Apache Spark commented on SPARK-3410: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/2283 > The priority of shutdownhook for ApplicationMaster should not be integer > literal, rather than refer constant. > - > > Key: SPARK-3410 > URL: https://issues.apache.org/jira/browse/SPARK-3410 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.1.0 >Reporter: Kousuke Saruta >Priority: Minor > > In ApplicationMaster, the priority of the shutdown hook is set to 30, which > is expected to be higher than the priority of o.a.h.FileSystem's shutdown hook. > In FileSystem, the priority of the shutdown hook is exposed as a public constant > named "SHUTDOWN_HOOK_PRIORITY", so I think it's better to use this constant > for the priority of ApplicationMaster's shutdown hook. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3410) The priority of shutdownhook for ApplicationMaster should not be integer literal, rather than refer constant.
[ https://issues.apache.org/jira/browse/SPARK-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3410: -- Issue Type: Improvement (was: Bug) > The priority of shutdownhook for ApplicationMaster should not be integer > literal, rather than refer constant. > - > > Key: SPARK-3410 > URL: https://issues.apache.org/jira/browse/SPARK-3410 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.1.0 >Reporter: Kousuke Saruta >Priority: Minor > > In ApplicationMaster, the priority of the shutdown hook is set to 30, which > is expected to be higher than the priority of o.a.h.FileSystem's shutdown hook. > In FileSystem, the priority of the shutdown hook is exposed as a public constant > named "SHUTDOWN_HOOK_PRIORITY", so I think it's better to use this constant > for the priority of ApplicationMaster's shutdown hook. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3410) The priority of shutdownhook for ApplicationMaster should not be integer literal, rather than refer constant.
Kousuke Saruta created SPARK-3410: - Summary: The priority of shutdownhook for ApplicationMaster should not be integer literal, rather than refer constant. Key: SPARK-3410 URL: https://issues.apache.org/jira/browse/SPARK-3410 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Kousuke Saruta Priority: Minor In ApplicationMaster, the priority of the shutdown hook is set to 30, which is expected to be higher than the priority of o.a.h.FileSystem's shutdown hook. In FileSystem, the priority of the shutdown hook is exposed as a public constant named "SHUTDOWN_HOOK_PRIORITY", so I think it's better to use this constant for the priority of ApplicationMaster's shutdown hook. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3409) Avoid pulling in Exchange operator itself in Exchange's closures
[ https://issues.apache.org/jira/browse/SPARK-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122408#comment-14122408 ] Apache Spark commented on SPARK-3409: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/2282 > Avoid pulling in Exchange operator itself in Exchange's closures > > > Key: SPARK-3409 > URL: https://issues.apache.org/jira/browse/SPARK-3409 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > {code} > val rdd = child.execute().mapPartitions { iter => > if (sortBasedShuffleOn) { > iter.map(r => (null, r.copy())) > } else { > val mutablePair = new MutablePair[Null, Row]() > iter.map(r => mutablePair.update(null, r)) > } > } > {code} > The above snippet from Exchange references sortBasedShuffleOn within a > closure, which requires pulling in the entire Exchange object in the closure. > This is a tiny teeny optimization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3408) Limit operator doesn't work with sort based shuffle
[ https://issues.apache.org/jira/browse/SPARK-3408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122407#comment-14122407 ] Apache Spark commented on SPARK-3408: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/2281 > Limit operator doesn't work with sort based shuffle > --- > > Key: SPARK-3408 > URL: https://issues.apache.org/jira/browse/SPARK-3408 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1473) Feature selection for high dimensional datasets
[ https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122390#comment-14122390 ] sam commented on SPARK-1473: [~dmm...@gmail.com] mentioning you as well (I can't work out which David is the one that posted above) > Feature selection for high dimensional datasets > --- > > Key: SPARK-1473 > URL: https://issues.apache.org/jira/browse/SPARK-1473 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Ignacio Zendejas >Assignee: Alexander Ulanov >Priority: Minor > Labels: features > > For classification tasks involving large feature spaces in the order of tens > of thousands or higher (e.g., text classification with n-grams, where n > 1), > it is often useful to rank and filter features that are irrelevant thereby > reducing the feature space by at least one or two orders of magnitude without > impacting performance on key evaluation metrics (accuracy/precision/recall). > A feature evaluation interface which is flexible needs to be designed and at > least two methods should be implemented with Information Gain being a > priority as it has been shown to be amongst the most reliable. > Special consideration should be taken in the design to account for wrapper > methods (see research papers below) which are more practical for lower > dimensional data. > Relevant research: > * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional > likelihood maximisation: a unifying framework for information theoretic > feature selection.*The Journal of Machine Learning Research*, *13*, 27-66. > * Forman, George. "An extensive empirical study of feature selection metrics > for text classification." The Journal of machine learning research 3 (2003): > 1289-1305. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1473) Feature selection for high dimensional datasets
[ https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122386#comment-14122386 ] sam commented on SPARK-1473: Good paper; the theory is very solid. My only concern is that the paper does not explicitly tackle the problem of probability estimation in high dimensions, which will be even worse for sparse data. It just touches on the problem, saying: "This in turn causes increasingly poor judgements for the inclusion/exclusion of features. For precisely this reason, the research community have developed various low-dimensional approximations to (9). In the following sections, we will investigate the implicit statistical assumptions and empirical effects of these approximations" The sections it mentions do not go into theoretical detail, and therefore I disagree that the paper provides a "single unified information theoretic framework for feature selection", as it basically leaves the problem of probability estimation to the reader's choice, and merely suggests the reader assume some level of independence between features in order to implement an algorithm. [~dmborque] Do you know of any literature that does approach the problem of probability estimation in an information-theoretic and philosophically justified way? Anyway, despite my concerns, this paper is still by far the best treatment of feature selection I have seen. > Feature selection for high dimensional datasets > --- > > Key: SPARK-1473 > URL: https://issues.apache.org/jira/browse/SPARK-1473 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Ignacio Zendejas >Assignee: Alexander Ulanov >Priority: Minor > Labels: features > > For classification tasks involving large feature spaces in the order of tens > of thousands or higher (e.g., text classification with n-grams, where n > 1), > it is often useful to rank and filter features that are irrelevant thereby > reducing the feature space by at least one or two orders of magnitude without > impacting performance on key evaluation metrics (accuracy/precision/recall). > A feature evaluation interface which is flexible needs to be designed and at > least two methods should be implemented with Information Gain being a > priority as it has been shown to be amongst the most reliable. > Special consideration should be taken in the design to account for wrapper > methods (see research papers below) which are more practical for lower > dimensional data. > Relevant research: > * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional > likelihood maximisation: a unifying framework for information theoretic > feature selection.*The Journal of Machine Learning Research*, *13*, 27-66. > * Forman, George. "An extensive empirical study of feature selection metrics > for text classification." The Journal of machine learning research 3 (2003): > 1289-1305. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
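Since the issue description prioritizes Information Gain as the first ranking criterion, here is a small self-contained sketch of that criterion computed from raw counts; it is purely illustrative and is not an MLlib API (the function names and count layout are assumptions for the example):

{code}
// IG(Y; X) = H(Y) - H(Y | X) for one discrete feature X against a class label Y.
def entropy(counts: Seq[Long]): Double = {
  val total = counts.sum.toDouble
  counts.filter(_ > 0).map { c =>
    val p = c / total
    -p * math.log(p) / math.log(2)
  }.sum
}

/** countsByFeatureValue(v) = class-label counts among examples where X = v. */
def informationGain(countsByFeatureValue: Map[Int, Map[String, Long]]): Double = {
  val labelCounts = countsByFeatureValue.values.flatten
    .groupBy(_._1).map { case (_, kvs) => kvs.map(_._2).sum }.toSeq
  val total = labelCounts.sum.toDouble
  val conditionalEntropy = countsByFeatureValue.values.map { byLabel =>
    (byLabel.values.sum / total) * entropy(byLabel.values.toSeq)
  }.sum
  entropy(labelCounts) - conditionalEntropy
}
{code}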
[jira] [Created] (SPARK-3409) Avoid pulling in Exchange operator itself in Exchange's closures
Reynold Xin created SPARK-3409: -- Summary: Avoid pulling in Exchange operator itself in Exchange's closures Key: SPARK-3409 URL: https://issues.apache.org/jira/browse/SPARK-3409 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin {code} val rdd = child.execute().mapPartitions { iter => if (sortBasedShuffleOn) { iter.map(r => (null, r.copy())) } else { val mutablePair = new MutablePair[Null, Row]() iter.map(r => mutablePair.update(null, r)) } } {code} The above snippet from Exchange references sortBasedShuffleOn within a closure, which requires pulling in the entire Exchange object in the closure. This is a tiny teeny optimization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
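The usual fix for this kind of closure capture (a sketch of the pattern, not necessarily the exact change made for this issue) is to copy the field into a local val first, so the closure captures only a Boolean rather than the enclosing Exchange instance:

{code}
val sortBased = sortBasedShuffleOn  // local copy; only this value is captured
val rdd = child.execute().mapPartitions { iter =>
  if (sortBased) {
    iter.map(r => (null, r.copy()))
  } else {
    val mutablePair = new MutablePair[Null, Row]()
    iter.map(r => mutablePair.update(null, r))
  }
}
{code}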
[jira] [Created] (SPARK-3408) Limit operator doesn't work with sort based shuffle
Reynold Xin created SPARK-3408: -- Summary: Limit operator doesn't work with sort based shuffle Key: SPARK-3408 URL: https://issues.apache.org/jira/browse/SPARK-3408 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Reynold Xin Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3392) Set command always get for key "mapred.reduce.tasks"
[ https://issues.apache.org/jira/browse/SPARK-3392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3392. - Resolution: Fixed Fix Version/s: 1.2.0 > Set command always get for key "mapred.reduce.tasks" > > > Key: SPARK-3392 > URL: https://issues.apache.org/jira/browse/SPARK-3392 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Cheng Hao >Assignee: Cheng Hao >Priority: Trivial > Fix For: 1.2.0 > > > This is a tiny fix for getting the value of "mapred.reduce.tasks", which makes > more sense for the Hive user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122292#comment-14122292 ] Saisai Shao commented on SPARK-2926: Hi Matei, sorry for the late response. I will test more scenarios with your notes, and also factor things out to see if some code can be shared with ExternalSorter. Thanks a lot. > Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle > -- > > Key: SPARK-2926 > URL: https://issues.apache.org/jira/browse/SPARK-2926 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 1.1.0 >Reporter: Saisai Shao > Attachments: SortBasedShuffleRead.pdf, Spark Shuffle Test Report.pdf > > > Currently Spark has already integrated sort-based shuffle write, which > greatly improves the IO performance and reduces the memory consumption when > the reducer number is very large. But on the reducer side, it still adopts the > implementation of the hash-based shuffle reader, which neglects the ordering > attributes of map output data in some situations. > Here we propose an MR-style, sort-merge-like shuffle reader for sort-based > shuffle to further improve the performance of sort-based shuffle. > Work-in-progress code and a performance test report will be posted later > when some unit test bugs are fixed. > Any comments would be greatly appreciated. > Thanks a lot. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
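The core of an MR-style read path is merging the per-map sorted streams on the reduce side. A generic sketch of just that merge step (not the proposed SortShuffleReader itself, and independent of how much of it could be shared with ExternalSorter):

{code}
import scala.collection.mutable

// Merge k iterators that are each already sorted by key into one sorted iterator.
def mergeSorted[K, V](streams: Seq[Iterator[(K, V)]])
                     (implicit ord: Ordering[K]): Iterator[(K, V)] = {
  // Order buffered iterators by their head key; reversed because PriorityQueue is a max-heap.
  val heap = new mutable.PriorityQueue[BufferedIterator[(K, V)]]()(
    Ordering.by[BufferedIterator[(K, V)], K](_.head._1).reverse)
  streams.map(_.buffered).filter(_.hasNext).foreach(heap.enqueue(_))
  new Iterator[(K, V)] {
    override def hasNext: Boolean = heap.nonEmpty
    override def next(): (K, V) = {
      val it = heap.dequeue()
      val kv = it.next()
      if (it.hasNext) heap.enqueue(it)   // re-insert with its new head key
      kv
    }
  }
}
{code}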
[jira] [Created] (SPARK-3407) Add Date type support
Cheng Hao created SPARK-3407: Summary: Add Date type support Key: SPARK-3407 URL: https://issues.apache.org/jira/browse/SPARK-3407 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122274#comment-14122274 ] Saisai Shao commented on SPARK-3129: Hi [~hshreedharan], thanks for your reply. Is this PR (https://github.com/apache/spark/pull/1195) the one you mentioned regarding storeReliably()? As I understand it, this API aims to store a bunch of messages into the BlockManager directly to make them reliable, but for some receivers, like Kafka, socket and others, data is injected one message at a time. We can't call storeReliably() on every message because of efficiency and throughput concerns, so we need to buffer the data locally up to some amount and then flush it to the BlockManager using storeReliably(). So I think data can still potentially be lost while we buffer it locally. These days I have been thinking about the WAL approach; IMHO a WAL would be a better solution compared to the blocking store API. > Prevent data loss in Spark Streaming > > > Key: SPARK-3129 > URL: https://issues.apache.org/jira/browse/SPARK-3129 > Project: Spark > Issue Type: New Feature >Reporter: Hari Shreedharan >Assignee: Hari Shreedharan > Attachments: StreamingPreventDataLoss.pdf > > > Spark Streaming can lose small amounts of data when the driver goes down - and the > sending system cannot re-send the data (or the data has already expired on > the sender side). The document attached has more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
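A purely hypothetical sketch of the buffering pattern being discussed, to make the trade-off concrete; storeReliably here stands in for the proposed API from the linked PR (not an existing Spark method), and the flush threshold is made up:

{code}
import scala.collection.mutable.ArrayBuffer

class BatchingReceiverSketch[T](storeReliably: Seq[T] => Unit, flushSize: Int = 1000) {
  private val buffer = new ArrayBuffer[T]()

  // Called once per incoming message. Whatever is still sitting in `buffer`
  // when the receiver dies is exactly the data this comment worries about,
  // which is what a write-ahead log would additionally protect.
  def onMessage(msg: T): Unit = {
    buffer += msg
    if (buffer.size >= flushSize) flush()
  }

  def flush(): Unit = if (buffer.nonEmpty) {
    storeReliably(buffer.toList)
    buffer.clear()
  }
}
{code}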
[jira] [Commented] (SPARK-2430) Standardized Clustering Algorithm API and Framework
[ https://issues.apache.org/jira/browse/SPARK-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122273#comment-14122273 ] RJ Nowling commented on SPARK-2430: --- Hi Yu, The community has suggested looking into scikit-learn's API, so that is a good idea. I am hesitant to make backwards-incompatible API changes, however, until we know the new API will be stable for a long time. I think it would be best to implement a few more clustering algorithms to get a clear idea of what is similar versus what is different before making a new API. May I suggest you work on SPARK-2966 / SPARK-2429 first? RJ > Standardized Clustering Algorithm API and Framework > -- > > Key: SPARK-2430 > URL: https://issues.apache.org/jira/browse/SPARK-2430 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Priority: Minor > > Recently, there has been a chorus of voices on the mailing lists about adding > new clustering algorithms to MLlib. To support these additions, we should > develop a common framework and API to reduce code duplication and keep the > APIs consistent. > At the same time, we can also expand the current API to incorporate requested > features such as arbitrary distance metrics or pre-computed distance matrices. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2966) Add an approximation algorithm for hierarchical clustering to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122266#comment-14122266 ] RJ Nowling commented on SPARK-2966: --- No worries. Based on my reading of the Spark contribution guidelines ( https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark ), I think that the Spark community would prefer to have one good implementation of an algorithm instead of multiple similar algorithms. Since the community has stated a clear preference for divisive hierarchical clustering, I think that is a better aim. You seem very motivated and have made some good contributions -- would you like to take the lead on the hierarchical clustering? I can review your code to help you improve it. That said, I suggest you look at the comment I added to SPARK-2429 and see what you think of that approach. If you like the example code and papers, why don't you work on implementing it efficiently in Spark? > Add an approximation algorithm for hierarchical clustering to MLlib > --- > > Key: SPARK-2966 > URL: https://issues.apache.org/jira/browse/SPARK-2966 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Yu Ishikawa >Priority: Minor > > A hierarchical clustering algorithm is a useful unsupervised learning method. > Koga et al. proposed a highly scalable hierarchical clustering algorithm in > (1). > I would like to implement this method. > I suggest adding an approximate hierarchical clustering algorithm to MLlib. > I'd like this to be assigned to me. > h3. Reference > # Fast agglomerative hierarchical clustering algorithm using > Locality-Sensitive Hashing > http://dl.acm.org/citation.cfm?id=1266811 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2219) AddJar doesn't work
[ https://issues.apache.org/jira/browse/SPARK-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2219. - Resolution: Fixed Fix Version/s: 1.2.0 > AddJar doesn't work > --- > > Key: SPARK-2219 > URL: https://issues.apache.org/jira/browse/SPARK-2219 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Cheng Lian > Fix For: 1.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3310) Directly use currentTable without unnecessary implicit conversion
[ https://issues.apache.org/jira/browse/SPARK-3310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3310. - Resolution: Fixed Fix Version/s: 1.2.0 > Directly use currentTable without unnecessary implicit conversion > - > > Key: SPARK-3310 > URL: https://issues.apache.org/jira/browse/SPARK-3310 > Project: Spark > Issue Type: Improvement >Reporter: Liang-Chi Hsieh >Priority: Minor > Fix For: 1.2.0 > > > We can directly use currentTable in function cacheTable without unnecessary > implicit conversion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2966) Add an approximation algorithm for hierarchical clustering to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122249#comment-14122249 ] Yu Ishikawa commented on SPARK-2966: I'm sorry for not checking the community discussion and the existing JIRA issue. Thank you for letting me know. We would be able to implement an approximation algorithm for hierarchical clustering with LSH. I think the approach of this issue is different from that of [SPARK-2429]. Should we merge this issue into [SPARK-2429]? > Add an approximation algorithm for hierarchical clustering to MLlib > --- > > Key: SPARK-2966 > URL: https://issues.apache.org/jira/browse/SPARK-2966 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Yu Ishikawa >Priority: Minor > > A hierarchical clustering algorithm is a useful unsupervised learning method. > Koga et al. proposed a highly scalable hierarchical clustering algorithm in > (1). > I would like to implement this method. > I suggest adding an approximate hierarchical clustering algorithm to MLlib. > I'd like this to be assigned to me. > h3. Reference > # Fast agglomerative hierarchical clustering algorithm using > Locality-Sensitive Hashing > http://dl.acm.org/citation.cfm?id=1266811 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2430) Standardized Clustering Algorithm API and Framework
[ https://issues.apache.org/jira/browse/SPARK-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122242#comment-14122242 ] Yu Ishikawa commented on SPARK-2430: Hi [~rnowling], I am very interested in this issue. If possible, I am willing to work with you on it. I think MLlib's high-level API should be consistent, like scikit-learn's. You know, we can use almost all algorithms in scikit-learn through the `fit` and `predict` functions. A consistent API would be helpful for Spark users too. > Standardized Clustering Algorithm API and Framework > -- > > Key: SPARK-2430 > URL: https://issues.apache.org/jira/browse/SPARK-2430 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Priority: Minor > > Recently, there has been a chorus of voices on the mailing lists about adding > new clustering algorithms to MLlib. To support these additions, we should > develop a common framework and API to reduce code duplication and keep the > APIs consistent. > At the same time, we can also expand the current API to incorporate requested > features such as arbitrary distance metrics or pre-computed distance matrices. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
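Purely illustrative: one possible shape for a scikit-learn-style fit/predict contract for MLlib clustering, sketched as Scala traits; the trait and method names are assumptions for discussion, not an agreed-upon design:

{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

trait ClusteringAlgorithm[M <: ClusteringModel] {
  // Learn cluster structure from the data and return an immutable model.
  def fit(data: RDD[Vector]): M
}

trait ClusteringModel extends Serializable {
  // Assign a single point to a cluster index.
  def predict(point: Vector): Int
  // Bulk assignment, defined in terms of the single-point method.
  def predict(points: RDD[Vector]): RDD[Int] = points.map(p => predict(p))
}
{code}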
[jira] [Commented] (SPARK-3390) sqlContext.jsonRDD fails on a complex structure of array and hashmap nesting
[ https://issues.apache.org/jira/browse/SPARK-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122235#comment-14122235 ] Yin Huai commented on SPARK-3390: - Oh, I see the problem. I am out of town this week. Will fix it next week. > sqlContext.jsonRDD fails on a complex structure of array and hashmap nesting > > > Key: SPARK-3390 > URL: https://issues.apache.org/jira/browse/SPARK-3390 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.2 >Reporter: Vida Ha >Assignee: Yin Huai >Priority: Critical > > I found a valid JSON string which Spark SQL fails to parse correctly: > Try running these lines in a spark-shell to reproduce: > {code:borderStyle=solid} > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > val badJson = "{\"foo\": [[{\"bar\": 0}]]}" > val rdd = sc.parallelize(badJson :: Nil) > sqlContext.jsonRDD(rdd).count() > {code} > I've tried running these lines on the 1.0.2 release as well as the latest Spark 1.1 > release candidate, and I get this stack trace: > {panel} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 2.0:3 > failed 1 times, most recent failure: Exception failure in TID 7 on host > localhost: scala.MatchError: StructType(List()) (of class > org.apache.spark.sql.catalyst.types.StructType) > > org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:333) > > org.apache.spark.sql.json.JsonRDD$$anonfun$enforceCorrectType$1.apply(JsonRDD.scala:335) > > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > scala.collection.AbstractTraversable.map(Traversable.scala:105) > > org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:335) > > org.apache.spark.sql.json.JsonRDD$$anonfun$enforceCorrectType$1.apply(JsonRDD.scala:335) > > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > scala.collection.AbstractTraversable.map(Traversable.scala:105) > > org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:335) > > org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1$$anonfun$apply$12.apply(JsonRDD.scala:365) > scala.Option.map(Option.scala:145) > > org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:364) > > org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:349) > > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > > org.apache.spark.sql.json.JsonRDD$.org$apache$spark$sql$json$JsonRDD$$asRow(JsonRDD.scala:349) > > org.apache.spark.sql.json.JsonRDD$$anonfun$createLogicalPlan$1.apply(JsonRDD.scala:51) > > org.apache.spark.sql.json.JsonRDD$$anonfun$createLogicalPlan$1.apply(JsonRDD.scala:51) >
scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > > {panel} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3390) sqlContext.jsonRDD fails on a complex structure of JSON array and JSON object nesting
[ https://issues.apache.org/jira/browse/SPARK-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-3390: Summary: sqlContext.jsonRDD fails on a complex structure of JSON array and JSON object nesting (was: sqlContext.jsonRDD fails on a complex structure of array and hashmap nesting) > sqlContext.jsonRDD fails on a complex structure of JSON array and JSON object > nesting > - > > Key: SPARK-3390 > URL: https://issues.apache.org/jira/browse/SPARK-3390 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.2 >Reporter: Vida Ha >Assignee: Yin Huai >Priority: Critical > > I found a valid JSON string which Spark SQL fails to parse correctly: > Try running these lines in a spark-shell to reproduce: > {code:borderStyle=solid} > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > val badJson = "{\"foo\": [[{\"bar\": 0}]]}" > val rdd = sc.parallelize(badJson :: Nil) > sqlContext.jsonRDD(rdd).count() > {code} > I've tried running these lines on the 1.0.2 release as well as the latest Spark 1.1 > release candidate, and I get this stack trace: > {panel} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 2.0:3 > failed 1 times, most recent failure: Exception failure in TID 7 on host > localhost: scala.MatchError: StructType(List()) (of class > org.apache.spark.sql.catalyst.types.StructType) > > org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:333) > > org.apache.spark.sql.json.JsonRDD$$anonfun$enforceCorrectType$1.apply(JsonRDD.scala:335) > > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > scala.collection.AbstractTraversable.map(Traversable.scala:105) > > org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:335) > > org.apache.spark.sql.json.JsonRDD$$anonfun$enforceCorrectType$1.apply(JsonRDD.scala:335) > > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > scala.collection.AbstractTraversable.map(Traversable.scala:105) > > org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:335) > > org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1$$anonfun$apply$12.apply(JsonRDD.scala:365) > scala.Option.map(Option.scala:145) > > org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:364) > > org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:349) > > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > > org.apache.spark.sql.json.JsonRDD$.org$apache$spark$sql$json$JsonRDD$$asRow(JsonRDD.scala:349) > > org.apache.spark.sql.json.JsonRDD$$anonfun$createLogicalPlan$1.apply(JsonRDD.scala:51) > >
org.apache.spark.sql.json.JsonRDD$$anonfun$createLogicalPlan$1.apply(JsonRDD.scala:51) > scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > > {panel} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3406) Python persist API does not have a default storage level
holdenk created SPARK-3406: -- Summary: Python persist API does not have a default storage level Key: SPARK-3406 URL: https://issues.apache.org/jira/browse/SPARK-3406 Project: Spark Issue Type: Bug Components: PySpark Reporter: holdenk Priority: Minor PySpark's persist method on RDDs does not have a default storage level. This is different from the Scala API, which defaults to in-memory caching. This is minor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
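For context, this is the Scala-side behaviour the issue compares against: in a spark-shell session (so `sc` is already defined), the no-argument persist() defaults to in-memory storage, while the PySpark persist() of this era required an explicit StorageLevel argument. A brief sketch of the Scala side:

{code}
import org.apache.spark.storage.StorageLevel

val cached  = sc.parallelize(1 to 1000).persist()   // defaults to StorageLevel.MEMORY_ONLY
val spilled = sc.parallelize(1 to 1000).persist(StorageLevel.MEMORY_AND_DISK)  // explicit level
cached.count()   // first action materialises the cached data
{code}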
[jira] [Commented] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
[ https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122219#comment-14122219 ] Xiangrui Meng commented on SPARK-3403: -- I don't have a Windows system to test. There should be a runtime flag you can set to control the number of threads OpenBLAS uses. Could you try that? I will test the attached code on OS X and report back. > NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java) > - > > Key: SPARK-3403 > URL: https://issues.apache.org/jira/browse/SPARK-3403 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.2 > Environment: Setup: Windows 7, x64 libraries for netlib-java (as > described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and > MinGW64 precompiled dlls. >Reporter: Alexander Ulanov > Fix For: 1.1.0 > > Attachments: NativeNN.scala > > > Code: > val model = NaiveBayes.train(train) > val predictionAndLabels = test.map { point => > val score = model.predict(point.features) > (score, point.label) > } > predictionAndLabels.foreach(println) > Result: > program crashes with: "Process finished with exit code -1073741819 > (0xC005)" after displaying the first prediction -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3405) EC2 cluster creation on VPC
[ https://issues.apache.org/jira/browse/SPARK-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3405: --- Component/s: (was: PySpark) > EC2 cluster creation on VPC > --- > > Key: SPARK-3405 > URL: https://issues.apache.org/jira/browse/SPARK-3405 > Project: Spark > Issue Type: New Feature > Components: EC2 >Affects Versions: 1.0.2 > Environment: Ubuntu 12.04 >Reporter: Dawson Reid >Priority: Minor > > It would be very useful to be able to specify the EC2 VPC in which the Spark > cluster should be created. > When creating a Spark cluster on AWS via the spark-ec2 script there is no way > to specify a VPC id of the VPC you would like the cluster to be created in. > The script always creates the cluster in the default VPC. > In my case I have deleted the default VPC and the spark-ec2 script errors out > with the following : > Setting up security groups... > Creating security group test-master > ERROR:boto:400 Bad Request > ERROR:boto: > VPCIdNotSpecifiedNo default > VPC for this > user312a2281-81a1-4d3c-ba10-0593a886779d > Traceback (most recent call last): > File "./spark_ec2.py", line 860, in > main() > File "./spark_ec2.py", line 852, in main > real_main() > File "./spark_ec2.py", line 735, in real_main > conn, opts, cluster_name) > File "./spark_ec2.py", line 247, in launch_cluster > master_group = get_or_make_group(conn, cluster_name + "-master") > File "./spark_ec2.py", line 143, in get_or_make_group > return conn.create_security_group(name, "Spark EC2 group") > File > "/home/dawson/Develop/spark-1.0.2/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/connection.py", > line 2011, in create_security_group > File > "/home/dawson/Develop/spark-1.0.2/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/connection.py", > line 925, in get_object > boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request > > VPCIdNotSpecifiedNo default > VPC for this > user312a2281-81a1-4d3c-ba10-0593a886779d -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122114#comment-14122114 ] Hari Shreedharan commented on SPARK-3129: - Looks like simply moving the code that generates the secret and sets it in the UGI to the Client class should take care of that. > Prevent data loss in Spark Streaming > > > Key: SPARK-3129 > URL: https://issues.apache.org/jira/browse/SPARK-3129 > Project: Spark > Issue Type: New Feature >Reporter: Hari Shreedharan >Assignee: Hari Shreedharan > Attachments: StreamingPreventDataLoss.pdf > > > Spark Streaming can lose small amounts of data when the driver goes down - and the > sending system cannot re-send the data (or the data has already expired on > the sender side). The document attached has more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122103#comment-14122103 ] Hari Shreedharan commented on SPARK-3129: - I am less worried about client mode, since most streaming applications would run in cluster mode. We can make this available only in cluster mode. > Prevent data loss in Spark Streaming > > > Key: SPARK-3129 > URL: https://issues.apache.org/jira/browse/SPARK-3129 > Project: Spark > Issue Type: New Feature >Reporter: Hari Shreedharan >Assignee: Hari Shreedharan > Attachments: StreamingPreventDataLoss.pdf > > > Spark Streaming can lose small amounts of data when the driver goes down - and the > sending system cannot re-send the data (or the data has already expired on > the sender side). The document attached has more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3378) Replace the word "SparkSQL" with right word "Spark SQL"
[ https://issues.apache.org/jira/browse/SPARK-3378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3378. - Resolution: Fixed Fix Version/s: 1.2.0 > Replace the word "SparkSQL" with right word "Spark SQL" > --- > > Key: SPARK-3378 > URL: https://issues.apache.org/jira/browse/SPARK-3378 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 1.1.0 >Reporter: Kousuke Saruta >Priority: Trivial > Fix For: 1.2.0 > > > In programming-guide.md, there are 2 "SparkSQL". We should use "Spark SQL" > instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122064#comment-14122064 ] Thomas Graves commented on SPARK-3129: -- On YARN, it generates the secret automatically. In cluster mode, it does so in the ApplicationMaster. Since the secret is generated in the ApplicationMaster, it goes away when the ApplicationMaster dies. If the secret were generated on the client side and populated into the credentials in the UGI, similar to how we handle tokens, then a restart of the AM in cluster mode should be able to pick it back up. This won't work for client mode, though, since the client/Spark driver wouldn't have a way to get hold of the UGI again. > Prevent data loss in Spark Streaming > > > Key: SPARK-3129 > URL: https://issues.apache.org/jira/browse/SPARK-3129 > Project: Spark > Issue Type: New Feature >Reporter: Hari Shreedharan >Assignee: Hari Shreedharan > Attachments: StreamingPreventDataLoss.pdf > > > Spark Streaming can lose small amounts of data when the driver goes down - and the > sending system cannot re-send the data (or the data has already expired on > the sender side). The document attached has more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
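If the secret were carried in the UGI credentials as suggested, the Hadoop API involved would look roughly like the sketch below; the credential alias and the way the secret is generated are assumptions for illustration, not Spark's actual implementation:

{code}
import java.security.SecureRandom
import org.apache.hadoop.io.Text
import org.apache.hadoop.security.{Credentials, UserGroupInformation}

def attachSecretToUgi(): Array[Byte] = {
  val secret = new Array[Byte](32)
  new SecureRandom().nextBytes(secret)
  val creds = new Credentials()
  creds.addSecretKey(new Text("spark.authenticate.secret"), secret)  // alias is made up
  // A relaunched AM running under the same UGI could read it back with
  // UserGroupInformation.getCurrentUser.getCredentials.getSecretKey(...).
  UserGroupInformation.getCurrentUser.addCredentials(creds)
  secret
}
{code}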
[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122017#comment-14122017 ] Hari Shreedharan commented on SPARK-3129: - [~tgraves] - Am I correct in assuming that using Akka automatically gives us the shared-secret authentication if spark.authenticate is set to true, even when the AM is restarted by YARN itself (since it is the same application, it theoretically has access to the same shared secret and thus should be able to communicate via Akka)? > Prevent data loss in Spark Streaming > > > Key: SPARK-3129 > URL: https://issues.apache.org/jira/browse/SPARK-3129 > Project: Spark > Issue Type: New Feature >Reporter: Hari Shreedharan >Assignee: Hari Shreedharan > Attachments: StreamingPreventDataLoss.pdf > > > Spark Streaming can lose small amounts of data when the driver goes down - and the > sending system cannot re-send the data (or the data has already expired on > the sender side). The document attached has more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122006#comment-14122006 ] Hari Shreedharan commented on SPARK-3129: - Yes, so my initial goal is to be able to recover all the blocks that have not been made into an RDD yet (at which point it would be safe). There is also data which may not have become a block yet (data added using the += operator). For now, I am going to call it fair game to say that we will add storeReliably(ArrayBuffer/Iterable) methods, which are the only ones that store data such that it is guaranteed to be recoverable. At a later stage, we could use something like a WAL on HDFS to recover even the += data, though that would affect performance. > Prevent data loss in Spark Streaming > > > Key: SPARK-3129 > URL: https://issues.apache.org/jira/browse/SPARK-3129 > Project: Spark > Issue Type: New Feature >Reporter: Hari Shreedharan >Assignee: Hari Shreedharan > Attachments: StreamingPreventDataLoss.pdf > > > Spark Streaming can lose small amounts of data when the driver goes down - and the > sending system cannot re-send the data (or the data has already expired on > the sender side). The document attached has more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3405) EC2 cluster creation on VPC
Dawson Reid created SPARK-3405: -- Summary: EC2 cluster creation on VPC Key: SPARK-3405 URL: https://issues.apache.org/jira/browse/SPARK-3405 Project: Spark Issue Type: New Feature Components: EC2, PySpark Affects Versions: 1.0.2 Environment: Ubuntu 12.04 Reporter: Dawson Reid Priority: Minor It would be very useful to be able to specify the EC2 VPC in which the Spark cluster should be created. When creating a Spark cluster on AWS via the spark-ec2 script there is no way to specify a VPC id of the VPC you would like the cluster to be created in. The script always creates the cluster in the default VPC. In my case I have deleted the default VPC and the spark-ec2 script errors out with the following : Setting up security groups... Creating security group test-master ERROR:boto:400 Bad Request ERROR:boto: VPCIdNotSpecifiedNo default VPC for this user312a2281-81a1-4d3c-ba10-0593a886779d Traceback (most recent call last): File "./spark_ec2.py", line 860, in main() File "./spark_ec2.py", line 852, in main real_main() File "./spark_ec2.py", line 735, in real_main conn, opts, cluster_name) File "./spark_ec2.py", line 247, in launch_cluster master_group = get_or_make_group(conn, cluster_name + "-master") File "./spark_ec2.py", line 143, in get_or_make_group return conn.create_security_group(name, "Spark EC2 group") File "/home/dawson/Develop/spark-1.0.2/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/connection.py", line 2011, in create_security_group File "/home/dawson/Develop/spark-1.0.2/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/connection.py", line 925, in get_object boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request VPCIdNotSpecifiedNo default VPC for this user312a2281-81a1-4d3c-ba10-0593a886779d -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-640) Update Hadoop 1 version to 1.1.0 (especially on AMIs)
[ https://issues.apache.org/jira/browse/SPARK-640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121883#comment-14121883 ] Matei Zaharia commented on SPARK-640: - [~pwendell] what is our Hadoop 1 version on AMIs now? > Update Hadoop 1 version to 1.1.0 (especially on AMIs) > - > > Key: SPARK-640 > URL: https://issues.apache.org/jira/browse/SPARK-640 > Project: Spark > Issue Type: New Feature >Reporter: Matei Zaharia > > Hadoop 1.1.0 has a fix to the notorious "trailing slash for directory objects > in S3" issue: https://issues.apache.org/jira/browse/HADOOP-5836, so would be > good to support on the AMIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2334) Attribute Error calling PipelinedRDD.id() in pyspark
[ https://issues.apache.org/jira/browse/SPARK-2334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-2334: -- Affects Version/s: 1.1.0 > Attribute Error calling PipelinedRDD.id() in pyspark > > > Key: SPARK-2334 > URL: https://issues.apache.org/jira/browse/SPARK-2334 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.0.0, 1.1.0 >Reporter: Diana Carroll > > calling the id() function of a PipelinedRDD causes an error in PySpark. > (Works fine in Scala.) > The second id() call here fails, the first works: > {code} > r1 = sc.parallelize([1,2,3]) > r1.id() > r2=r1.map(lambda i: i+1) > r2.id() > {code} > Error: > {code} > --- > AttributeErrorTraceback (most recent call last) > in () > > 1 r2.id() > /usr/lib/spark/python/pyspark/rdd.py in id(self) > 180 A unique ID for this RDD (within its SparkContext). > 181 """ > --> 182 return self._id > 183 > 184 def __repr__(self): > AttributeError: 'PipelinedRDD' object has no attribute '_id' > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-3061) Maven build fails in Windows OS
[ https://issues.apache.org/jira/browse/SPARK-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-3061: - Assignee: Andrew Or (was: Josh Rosen) Re-assigning to Andrew, who's going to backport it. > Maven build fails in Windows OS > --- > > Key: SPARK-3061 > URL: https://issues.apache.org/jira/browse/SPARK-3061 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.0.2, 1.1.0 > Environment: Windows >Reporter: Masayoshi TSUZUKI >Assignee: Andrew Or >Priority: Minor > Fix For: 1.2.0 > > > Maven build fails in Windows OS with this error message. > {noformat} > [ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.2.1:exec > (default) on project spark-core_2.10: Command execution failed. Cannot run > program "unzip" (in directory "C:\path\to\gitofspark\python"): CreateProcess > error=2, The system cannot find the file specified -> [Help 1] > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2015) Spark UI issues at scale
[ https://issues.apache.org/jira/browse/SPARK-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-2015: -- Component/s: Web UI > Spark UI issues at scale > > > Key: SPARK-2015 > URL: https://issues.apache.org/jira/browse/SPARK-2015 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.0.0 >Reporter: Reynold Xin > > This is an umbrella ticket for issues related to Spark's web ui when we run > Spark at scale (large datasets, large number of machines, or large number of > tasks). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3284) saveAsParquetFile not working on windows
[ https://issues.apache.org/jira/browse/SPARK-3284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pravesh Jain updated SPARK-3284: Description: {code} object parquet { case class Person(name: String, age: Int) def main(args: Array[String]) { val sparkConf = new SparkConf().setMaster("local").setAppName("HdfsWordCount") val sc = new SparkContext(sparkConf) val sqlContext = new org.apache.spark.sql.SQLContext(sc) // createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD. import sqlContext.createSchemaRDD val people = sc.textFile("C:/Users/pravesh.jain/Desktop/people/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)) people.saveAsParquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet") val parquetFile = sqlContext.parquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet") } } {code} gives the error Exception in thread "main" java.lang.NullPointerException at org.apache.spark.parquet$.main(parquet.scala:16) which is the line saveAsParquetFile. This works fine on Linux, but running it in Eclipse on Windows gives the error. was: object parquet { case class Person(name: String, age: Int) def main(args: Array[String]) { val sparkConf = new SparkConf().setMaster("local").setAppName("HdfsWordCount") val sc = new SparkContext(sparkConf) val sqlContext = new org.apache.spark.sql.SQLContext(sc) // createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD. import sqlContext.createSchemaRDD val people = sc.textFile("C:/Users/pravesh.jain/Desktop/people/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)) people.saveAsParquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet") val parquetFile = sqlContext.parquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet") } } gives the error Exception in thread "main" java.lang.NullPointerException at org.apache.spark.parquet$.main(parquet.scala:16) which is the line saveAsParquetFile. This works fine on Linux, but running it in Eclipse on Windows gives the error. > saveAsParquetFile not working on windows > > > Key: SPARK-3284 > URL: https://issues.apache.org/jira/browse/SPARK-3284 > Project: Spark > Issue Type: Bug >Affects Versions: 1.0.2 > Environment: Windows >Reporter: Pravesh Jain >Priority: Minor > > {code} > object parquet { > case class Person(name: String, age: Int) > def main(args: Array[String]) { > val sparkConf = new > SparkConf().setMaster("local").setAppName("HdfsWordCount") > val sc = new SparkContext(sparkConf) > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > // createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD. > import sqlContext.createSchemaRDD > val people = > sc.textFile("C:/Users/pravesh.jain/Desktop/people/people.txt").map(_.split(",")).map(p > => Person(p(0), p(1).trim.toInt)) > > people.saveAsParquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet") > val parquetFile = > sqlContext.parquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet") > } > } > {code} > gives the error > Exception in thread "main" java.lang.NullPointerException at > org.apache.spark.parquet$.main(parquet.scala:16) > which is the line saveAsParquetFile. > This works fine on Linux, but running it in Eclipse on Windows gives the error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3286) Cannot view ApplicationMaster UI when Yarn’s url scheme is https
[ https://issues.apache.org/jira/browse/SPARK-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-3286: -- Component/s: Web UI > Cannot view ApplicationMaster UI when Yarn’s url scheme is https > > > Key: SPARK-3286 > URL: https://issues.apache.org/jira/browse/SPARK-3286 > Project: Spark > Issue Type: Bug > Components: Web UI, YARN >Affects Versions: 1.0.2 >Reporter: Benoy Antony > Attachments: SPARK-3286-branch-1-0.patch, SPARK-3286.patch > > > The spark Application Master starts its web UI at http://:port. > When Spark ApplicationMaster registers its URL with Resource Manager , the > URL does not contain URI scheme. > If the URL scheme is absent, Resource Manager’s web app proxy will use the > HTTP Policy of the Resource Manager.(YARN-1553) > If the HTTP Policy of the Resource Manager is https, then web app proxy will > try to access https://:port. > This will result in error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1078) Replace lift-json with json4s-jackson
[ https://issues.apache.org/jira/browse/SPARK-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-1078. --- Resolution: Fixed Fix Version/s: 1.0.0 It looks like this was fixed in SPARK-1132 / Spark 1.0.0, where we migrated to json4s.jackson. > Replace lift-json with json4s-jackson > - > > Key: SPARK-1078 > URL: https://issues.apache.org/jira/browse/SPARK-1078 > Project: Spark > Issue Type: Task > Components: Deploy, Web UI >Affects Versions: 0.9.0 >Reporter: William Benton >Priority: Minor > Fix For: 1.0.0 > > > json4s-jackson is a Jackson-backed implementation of the Json4s common JSON > API for Scala JSON libraries. (Evan Chan has a nice comparison of Scala JSON > libraries here: > http://engineering.ooyala.com/blog/comparing-scala-json-libraries) It is > Apache-licensed, mostly API-compatible with lift-json, and easier for > downstream operating system distributions to consume than lift-json. > In terms of performance, json4s-jackson is slightly slower but comparable to > lift-json on my machine when parsing very small JSON files (< 2kb and < ~30 > objects), around 40% faster than lift-json on medium-sized files (~50kb), and > significantly (~10x) faster on multi-megabyte files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
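For readers comparing the two libraries, a minimal json4s-jackson usage sketch (illustrative only, not code from Spark) showing the parse/DSL/render API that is largely compatible with lift-json:
{code}
import org.json4s._
import org.json4s.jackson.JsonMethods._
import org.json4s.JsonDSL._

object Json4sExample extends App {
  implicit val formats: Formats = DefaultFormats

  // Parse a JSON string into an AST and extract a field.
  val ast: JValue = parse("""{"name":"worker-1","cores":8}""")
  val cores = (ast \ "cores").extract[Int]

  // Build JSON with the DSL and render it back to a compact string.
  val doc: JValue = ("name" -> "worker-1") ~ ("cores" -> cores)
  println(compact(render(doc)))   // {"name":"worker-1","cores":8}
}
{code}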
[jira] [Commented] (SPARK-3404) SparkSubmitSuite fails with "spark-submit exits with code 1"
[ https://issues.apache.org/jira/browse/SPARK-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121653#comment-14121653 ] Sean Owen commented on SPARK-3404: -- It's 100% repeatable in Maven for me locally, which seems to be Jenkins' experience too. I don't see the same problem with SBT (/dev/run-tests) locally, although I can't say I run that regularly. I could rewrite the SparkSubmitSuite to submit a JAR file that actually contains the class it's trying to invoke. Maybe that's smarter? the problem here seems to be the vagaries of what the run-time classpath is during an SBT vs Maven test. Would anyone second that? Separately it would probably not hurt to get in that change that logs stdout / stderr from the Utils method. > SparkSubmitSuite fails with "spark-submit exits with code 1" > > > Key: SPARK-3404 > URL: https://issues.apache.org/jira/browse/SPARK-3404 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.0.2, 1.1.0 >Reporter: Sean Owen >Priority: Critical > > Maven-based Jenkins builds have been failing for over a month. For example: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/ > It's SparkSubmitSuite that fails. For example: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/541/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/consoleFull > {code} > SparkSubmitSuite > ... > - launch simple application with spark-submit *** FAILED *** > org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, > org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, > local, file:/tmp/1409815981504-0/testJar-1409815981505.jar) exited with code 1 > at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) > at > org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > ... 
> - spark submit includes jars passed in through --jar *** FAILED *** > org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, > org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, > local-cluster[2,1,512], --jars, > file:/tmp/1409815984960-0/testJar-1409815985029.jar,file:/tmp/1409815985030-0/testJar-1409815985087.jar, > file:/tmp/1409815984959-0/testJar-1409815984959.jar) exited with code 1 > at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) > at > org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > ... > {code} > SBT builds don't fail, so it is likely to be due to some difference in how > the tests are run rather than a problem with test or core project. > This is related to http://issues.apache.org/jira/browse/SPARK-3330 but the > cause identified in that JIRA is, at least, not the only cause. (Although, it > wouldn't hurt to be doubly-sure this is not an issue by changing the Jenkins > config to invoke {{mvn clean && mvn ... package}} {{mvn ... clean package}}.) > This JIRA tracks investigation into a different cause. Right now I have some > further information but not a PR yet. > Part of the issue is that there is no clue in the log about why > {{spark-submit}} exited with status 1. See > https://github.com/apache/spark/pull/2108/files and > https://issues.apache.org/jira/browse/SPARK-3193 for a change that would at > least print stdout to the log too. > The SparkSubmit program exits with 1 when the main class it is supposed to > run is not found > (https://github.com/a
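The suggestion above to log what {{spark-submit}} actually printed is easy to prototype. Below is a sketch (not the actual Utils.executeAndGetOutput implementation) that captures both streams with scala.sys.process so a non-zero exit code comes with context; the jar path is illustrative.
{code}
import scala.sys.process._

// Illustrative helper: run a command and return (exit code, stdout, stderr).
def runAndCapture(command: Seq[String]): (Int, String, String) = {
  val out = new StringBuilder
  val err = new StringBuilder
  val logger = ProcessLogger(line => out.append(line).append('\n'),
                             line => err.append(line).append('\n'))
  val exitCode = Process(command).!(logger)
  (exitCode, out.toString, err.toString)
}

// Usage mirroring the failing test's invocation:
val (code, stdout, stderr) = runAndCapture(Seq(
  "./bin/spark-submit",
  "--class", "org.apache.spark.deploy.SimpleApplicationTest",
  "--name", "testApp",
  "--master", "local",
  "file:/tmp/testJar.jar"))
if (code != 0) {
  println(s"spark-submit exited with $code\nstdout:\n$stdout\nstderr:\n$stderr")
}
{code}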
[jira] [Updated] (SPARK-3404) SparkSubmitSuite fails with "spark-submit exits with code 1"
[ https://issues.apache.org/jira/browse/SPARK-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3404: - Affects Version/s: 1.1.0 > SparkSubmitSuite fails with "spark-submit exits with code 1" > > > Key: SPARK-3404 > URL: https://issues.apache.org/jira/browse/SPARK-3404 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.0.2, 1.1.0 >Reporter: Sean Owen >Priority: Critical > > Maven-based Jenkins builds have been failing for over a month. For example: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/ > It's SparkSubmitSuite that fails. For example: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/541/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/consoleFull > {code} > SparkSubmitSuite > ... > - launch simple application with spark-submit *** FAILED *** > org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, > org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, > local, file:/tmp/1409815981504-0/testJar-1409815981505.jar) exited with code 1 > at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) > at > org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > ... > - spark submit includes jars passed in through --jar *** FAILED *** > org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, > org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, > local-cluster[2,1,512], --jars, > file:/tmp/1409815984960-0/testJar-1409815985029.jar,file:/tmp/1409815985030-0/testJar-1409815985087.jar, > file:/tmp/1409815984959-0/testJar-1409815984959.jar) exited with code 1 > at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) > at > org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > ... > {code} > SBT builds don't fail, so it is likely to be due to some difference in how > the tests are run rather than a problem with test or core project. > This is related to http://issues.apache.org/jira/browse/SPARK-3330 but the > cause identified in that JIRA is, at least, not the only cause. 
(Although, it > wouldn't hurt to be doubly-sure this is not an issue by changing the Jenkins > config to invoke {{mvn clean && mvn ... package}} {{mvn ... clean package}}.) > This JIRA tracks investigation into a different cause. Right now I have some > further information but not a PR yet. > Part of the issue is that there is no clue in the log about why > {{spark-submit}} exited with status 1. See > https://github.com/apache/spark/pull/2108/files and > https://issues.apache.org/jira/browse/SPARK-3193 for a change that would at > least print stdout to the log too. > The SparkSubmit program exits with 1 when the main class it is supposed to > run is not found > (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L322) > This is for example SimpleApplicationTest > (https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala#L339) > The test actually submits an empty JAR not containing this class. It relies > on {{spark-submit}} finding the class within the compiled test-classes of the > Spark project. However it does seem to be compiled and present even with > Maven. > If modified to print stdout and stderr, and dump the actual command, I see an > empty stdout, and only
[jira] [Commented] (SPARK-3404) SparkSubmitSuite fails with "spark-submit exits with code 1"
[ https://issues.apache.org/jira/browse/SPARK-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121650#comment-14121650 ] Andrew Or commented on SPARK-3404: -- I have updated the title to reflect this. > SparkSubmitSuite fails with "spark-submit exits with code 1" > > > Key: SPARK-3404 > URL: https://issues.apache.org/jira/browse/SPARK-3404 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.0.2, 1.1.0 >Reporter: Sean Owen >Priority: Critical > > Maven-based Jenkins builds have been failing for over a month. For example: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/ > It's SparkSubmitSuite that fails. For example: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/541/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/consoleFull > {code} > SparkSubmitSuite > ... > - launch simple application with spark-submit *** FAILED *** > org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, > org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, > local, file:/tmp/1409815981504-0/testJar-1409815981505.jar) exited with code 1 > at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) > at > org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > ... > - spark submit includes jars passed in through --jar *** FAILED *** > org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, > org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, > local-cluster[2,1,512], --jars, > file:/tmp/1409815984960-0/testJar-1409815985029.jar,file:/tmp/1409815985030-0/testJar-1409815985087.jar, > file:/tmp/1409815984959-0/testJar-1409815984959.jar) exited with code 1 > at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) > at > org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > ... > {code} > SBT builds don't fail, so it is likely to be due to some difference in how > the tests are run rather than a problem with test or core project. > This is related to http://issues.apache.org/jira/browse/SPARK-3330 but the > cause identified in that JIRA is, at least, not the only cause. 
(Although, it > wouldn't hurt to be doubly-sure this is not an issue by changing the Jenkins > config to invoke {{mvn clean && mvn ... package}} {{mvn ... clean package}}.) > This JIRA tracks investigation into a different cause. Right now I have some > further information but not a PR yet. > Part of the issue is that there is no clue in the log about why > {{spark-submit}} exited with status 1. See > https://github.com/apache/spark/pull/2108/files and > https://issues.apache.org/jira/browse/SPARK-3193 for a change that would at > least print stdout to the log too. > The SparkSubmit program exits with 1 when the main class it is supposed to > run is not found > (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L322) > This is for example SimpleApplicationTest > (https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala#L339) > The test actually submits an empty JAR not containing this class. It relies > on {{spark-submit}} finding the class within the compiled test-classes of the > Spark project. However it does seem to be compiled and present even with > Maven. > If modified to print stdout and s
[jira] [Updated] (SPARK-3404) SparkSubmitSuite fails with "spark-submit exits with code 1"
[ https://issues.apache.org/jira/browse/SPARK-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3404: - Target Version/s: 1.1.1 > SparkSubmitSuite fails with "spark-submit exits with code 1" > > > Key: SPARK-3404 > URL: https://issues.apache.org/jira/browse/SPARK-3404 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.0.2, 1.1.0 >Reporter: Sean Owen >Priority: Critical > > Maven-based Jenkins builds have been failing for over a month. For example: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/ > It's SparkSubmitSuite that fails. For example: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/541/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/consoleFull > {code} > SparkSubmitSuite > ... > - launch simple application with spark-submit *** FAILED *** > org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, > org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, > local, file:/tmp/1409815981504-0/testJar-1409815981505.jar) exited with code 1 > at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) > at > org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > ... > - spark submit includes jars passed in through --jar *** FAILED *** > org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, > org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, > local-cluster[2,1,512], --jars, > file:/tmp/1409815984960-0/testJar-1409815985029.jar,file:/tmp/1409815985030-0/testJar-1409815985087.jar, > file:/tmp/1409815984959-0/testJar-1409815984959.jar) exited with code 1 > at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) > at > org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > ... > {code} > SBT builds don't fail, so it is likely to be due to some difference in how > the tests are run rather than a problem with test or core project. > This is related to http://issues.apache.org/jira/browse/SPARK-3330 but the > cause identified in that JIRA is, at least, not the only cause. 
(Although, it > wouldn't hurt to be doubly-sure this is not an issue by changing the Jenkins > config to invoke {{mvn clean && mvn ... package}} {{mvn ... clean package}}.) > This JIRA tracks investigation into a different cause. Right now I have some > further information but not a PR yet. > Part of the issue is that there is no clue in the log about why > {{spark-submit}} exited with status 1. See > https://github.com/apache/spark/pull/2108/files and > https://issues.apache.org/jira/browse/SPARK-3193 for a change that would at > least print stdout to the log too. > The SparkSubmit program exits with 1 when the main class it is supposed to > run is not found > (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L322) > This is for example SimpleApplicationTest > (https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala#L339) > The test actually submits an empty JAR not containing this class. It relies > on {{spark-submit}} finding the class within the compiled test-classes of the > Spark project. However it does seem to be compiled and present even with > Maven. > If modified to print stdout and stderr, and dump the actual command, I see an > empty stdout, and only t
[jira] [Updated] (SPARK-3404) SparkSubmitSuite fails with "spark-submit exits with code 1"
[ https://issues.apache.org/jira/browse/SPARK-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3404: - Summary: SparkSubmitSuite fails with "spark-submit exits with code 1" (was: SparkSubmitSuite fails in Maven (only) - spark-submit exits with code 1) > SparkSubmitSuite fails with "spark-submit exits with code 1" > > > Key: SPARK-3404 > URL: https://issues.apache.org/jira/browse/SPARK-3404 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.0.2, 1.1.0 >Reporter: Sean Owen > > Maven-based Jenkins builds have been failing for over a month. For example: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/ > It's SparkSubmitSuite that fails. For example: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/541/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/consoleFull > {code} > SparkSubmitSuite > ... > - launch simple application with spark-submit *** FAILED *** > org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, > org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, > local, file:/tmp/1409815981504-0/testJar-1409815981505.jar) exited with code 1 > at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) > at > org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > ... > - spark submit includes jars passed in through --jar *** FAILED *** > org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, > org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, > local-cluster[2,1,512], --jars, > file:/tmp/1409815984960-0/testJar-1409815985029.jar,file:/tmp/1409815985030-0/testJar-1409815985087.jar, > file:/tmp/1409815984959-0/testJar-1409815984959.jar) exited with code 1 > at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) > at > org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > ... > {code} > SBT builds don't fail, so it is likely to be due to some difference in how > the tests are run rather than a problem with test or core project. > This is related to http://issues.apache.org/jira/browse/SPARK-3330 but the > cause identified in that JIRA is, at least, not the only cause. 
(Although, it > wouldn't hurt to be doubly-sure this is not an issue by changing the Jenkins > config to invoke {{mvn clean && mvn ... package}} {{mvn ... clean package}}.) > This JIRA tracks investigation into a different cause. Right now I have some > further information but not a PR yet. > Part of the issue is that there is no clue in the log about why > {{spark-submit}} exited with status 1. See > https://github.com/apache/spark/pull/2108/files and > https://issues.apache.org/jira/browse/SPARK-3193 for a change that would at > least print stdout to the log too. > The SparkSubmit program exits with 1 when the main class it is supposed to > run is not found > (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L322) > This is for example SimpleApplicationTest > (https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala#L339) > The test actually submits an empty JAR not containing this class. It relies > on {{spark-submit}} finding the class within the compiled test-classes of the > Spark project. However it does seem to be compiled and present even with > Maven. > If modifie
[jira] [Updated] (SPARK-3404) SparkSubmitSuite fails with "spark-submit exits with code 1"
[ https://issues.apache.org/jira/browse/SPARK-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3404: - Priority: Critical (was: Major) > SparkSubmitSuite fails with "spark-submit exits with code 1" > > > Key: SPARK-3404 > URL: https://issues.apache.org/jira/browse/SPARK-3404 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.0.2, 1.1.0 >Reporter: Sean Owen >Priority: Critical > > Maven-based Jenkins builds have been failing for over a month. For example: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/ > It's SparkSubmitSuite that fails. For example: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/541/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/consoleFull > {code} > SparkSubmitSuite > ... > - launch simple application with spark-submit *** FAILED *** > org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, > org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, > local, file:/tmp/1409815981504-0/testJar-1409815981505.jar) exited with code 1 > at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) > at > org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > ... > - spark submit includes jars passed in through --jar *** FAILED *** > org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, > org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, > local-cluster[2,1,512], --jars, > file:/tmp/1409815984960-0/testJar-1409815985029.jar,file:/tmp/1409815985030-0/testJar-1409815985087.jar, > file:/tmp/1409815984959-0/testJar-1409815984959.jar) exited with code 1 > at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) > at > org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > ... > {code} > SBT builds don't fail, so it is likely to be due to some difference in how > the tests are run rather than a problem with test or core project. > This is related to http://issues.apache.org/jira/browse/SPARK-3330 but the > cause identified in that JIRA is, at least, not the only cause. 
(Although, it > wouldn't hurt to be doubly-sure this is not an issue by changing the Jenkins > config to invoke {{mvn clean && mvn ... package}} {{mvn ... clean package}}.) > This JIRA tracks investigation into a different cause. Right now I have some > further information but not a PR yet. > Part of the issue is that there is no clue in the log about why > {{spark-submit}} exited with status 1. See > https://github.com/apache/spark/pull/2108/files and > https://issues.apache.org/jira/browse/SPARK-3193 for a change that would at > least print stdout to the log too. > The SparkSubmit program exits with 1 when the main class it is supposed to > run is not found > (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L322) > This is for example SimpleApplicationTest > (https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala#L339) > The test actually submits an empty JAR not containing this class. It relies > on {{spark-submit}} finding the class within the compiled test-classes of the > Spark project. However it does seem to be compiled and present even with > Maven. > If modified to print stdout and stderr, and dump the actual command, I see an > empty stdout, a
[jira] [Commented] (SPARK-3404) SparkSubmitSuite fails in Maven (only) - spark-submit exits with code 1
[ https://issues.apache.org/jira/browse/SPARK-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121646#comment-14121646 ] Andrew Or commented on SPARK-3404: -- Thanks for looking into this Sean. Does this happen all the time or only once in a while? We have observed the same tests failing on our Jenkins, which runs the test through sbt. The behavior is consistent with running it through maven. If we run it through 'sbt test-only SparkSubmitSuite' then it always passes, but if we run 'sbt test' then sometimes it fails. This has also been failing for a while for sbt. Very roughly I remember we began seeing it after https://github.com/apache/spark/pull/1777 went in. Though I have gone down that path to debug any possibilities of port collision to no avail. A related test failure is in DriverSuite, which also calls `Utils.executeAndGetOutput`. Have you seen that failing in maven? I will keep investigating it in parallel for sbt, though I suspect the root cause is the same. Let me know if you find anything. > SparkSubmitSuite fails in Maven (only) - spark-submit exits with code 1 > --- > > Key: SPARK-3404 > URL: https://issues.apache.org/jira/browse/SPARK-3404 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.0.2 >Reporter: Sean Owen > > Maven-based Jenkins builds have been failing for over a month. For example: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/ > It's SparkSubmitSuite that fails. For example: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/541/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/consoleFull > {code} > SparkSubmitSuite > ... > - launch simple application with spark-submit *** FAILED *** > org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, > org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, > local, file:/tmp/1409815981504-0/testJar-1409815981505.jar) exited with code 1 > at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) > at > org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > ... 
> - spark submit includes jars passed in through --jar *** FAILED *** > org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, > org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, > local-cluster[2,1,512], --jars, > file:/tmp/1409815984960-0/testJar-1409815985029.jar,file:/tmp/1409815985030-0/testJar-1409815985087.jar, > file:/tmp/1409815984959-0/testJar-1409815984959.jar) exited with code 1 > at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) > at > org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > ... > {code} > SBT builds don't fail, so it is likely to be due to some difference in how > the tests are run rather than a problem with test or core project. > This is related to http://issues.apache.org/jira/browse/SPARK-3330 but the > cause identified in that JIRA is, at least, not the only cause. (Although, it > wouldn't hurt to be doubly-sure this is not an issue by changing the Jenkins > config to invoke {{mvn clean && mvn ... package}} {{mvn ... clean package}}.) > This JIRA tracks investigation into a different cause. Right now I have some > further information but not a PR yet. > Part of the issue is that there is no clue in the log about why > {{spark-submit}} exited with status 1. See > https://github.com
[jira] [Commented] (SPARK-3286) Cannot view ApplicationMaster UI when Yarn’s url scheme is https
[ https://issues.apache.org/jira/browse/SPARK-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121608#comment-14121608 ] Apache Spark commented on SPARK-3286: - User 'benoyantony' has created a pull request for this issue: https://github.com/apache/spark/pull/2276 > Cannot view ApplicationMaster UI when Yarn’s url scheme is https > > > Key: SPARK-3286 > URL: https://issues.apache.org/jira/browse/SPARK-3286 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.0.2 >Reporter: Benoy Antony > Attachments: SPARK-3286-branch-1-0.patch, SPARK-3286.patch > > > The Spark ApplicationMaster starts its web UI at http://host:port. > When the Spark ApplicationMaster registers its URL with the Resource Manager, the > URL does not contain a URI scheme. > If the URL scheme is absent, the Resource Manager’s web app proxy will use the > HTTP Policy of the Resource Manager (YARN-1553). > If the HTTP Policy of the Resource Manager is https, then the web app proxy will > try to access https://host:port. > This will result in an error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2892) Socket Receiver does not stop when streaming context is stopped
[ https://issues.apache.org/jira/browse/SPARK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121561#comment-14121561 ] Helena Edelson edited comment on SPARK-2892 at 9/4/14 5:01 PM: --- I wonder if the ERROR should be a WARN or INFO since it occurs as a result of ReceiverSupervisorImpl receiving a StopReceiver, and " Deregistered receiver for stream" seems like the expected behavior. DEBUG 13:00:22,418 Stopping JobScheduler INFO 13:00:22,441 Received stop signal INFO 13:00:22,441 Sent stop signal to all 1 receivers INFO 13:00:22,442 Stopping receiver with message: Stopped by driver: INFO 13:00:22,442 Called receiver onStop INFO 13:00:22,443 Deregistering receiver 0 ERROR 13:00:22,445 Deregistered receiver for stream 0: Stopped by driver INFO 13:00:22,445 Stopped receiver 0 was (Author: helena_e): I wonder if the ERROR should be a WARN or INFO since it occurs as a result of ReceiverSupervisorImpl receiving a StopReceiver, and " Deregistered receiver for stream" seems like the expected behavior. > Socket Receiver does not stop when streaming context is stopped > --- > > Key: SPARK-2892 > URL: https://issues.apache.org/jira/browse/SPARK-2892 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.0.2 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > > Running NetworkWordCount with > {quote} > ssc.start(); Thread.sleep(1); ssc.stop(stopSparkContext = false); > Thread.sleep(6) > {quote} > gives the following error > {quote} > 14/08/06 18:37:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) > in 10047 ms on localhost (1/1) > 14/08/06 18:37:13 INFO DAGScheduler: Stage 0 (runJob at > ReceiverTracker.scala:275) finished in 10.056 s > 14/08/06 18:37:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks > have all completed, from pool > 14/08/06 18:37:13 INFO SparkContext: Job finished: runJob at > ReceiverTracker.scala:275, took 10.179263 s > 14/08/06 18:37:13 INFO ReceiverTracker: All of the receivers have been > terminated > 14/08/06 18:37:13 WARN ReceiverTracker: All of the receivers have not > deregistered, Map(0 -> > ReceiverInfo(0,SocketReceiver-0,null,false,localhost,Stopped by driver,)) > 14/08/06 18:37:13 INFO ReceiverTracker: ReceiverTracker stopped > 14/08/06 18:37:13 INFO JobGenerator: Stopping JobGenerator immediately > 14/08/06 18:37:13 INFO RecurringTimer: Stopped timer for JobGenerator after > time 1407375433000 > 14/08/06 18:37:13 INFO JobGenerator: Stopped JobGenerator > 14/08/06 18:37:13 INFO JobScheduler: Stopped JobScheduler > 14/08/06 18:37:13 INFO StreamingContext: StreamingContext stopped successfully > 14/08/06 18:37:43 INFO SocketReceiver: Stopped receiving > 14/08/06 18:37:43 INFO SocketReceiver: Closed socket to localhost: > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
[ https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121563#comment-14121563 ] Alexander Ulanov commented on SPARK-3403: - Yes, I tried using netlib-java separately with the same OpenBLAS setup and it worked properly, even within several threads. However I didn't mimic the same multi-threading setup as MLlib has because it is complicated. Do you want me to send you all DLLs that I used? I had troubles with compiling OpenBLAS for Windows so I used precompiled x64 versions from OpenBLAS and MinGW64 websites. > NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java) > - > > Key: SPARK-3403 > URL: https://issues.apache.org/jira/browse/SPARK-3403 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.2 > Environment: Setup: Windows 7, x64 libraries for netlib-java (as > described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and > MinGW64 precompiled dlls. >Reporter: Alexander Ulanov > Fix For: 1.1.0 > > Attachments: NativeNN.scala > > > Code: > val model = NaiveBayes.train(train) > val predictionAndLabels = test.map { point => > val score = model.predict(point.features) > (score, point.label) > } > predictionAndLabels.foreach(println) > Result: > program crashes with: "Process finished with exit code -1073741819 > (0xC005)" after displaying the first prediction -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2892) Socket Receiver does not stop when streaming context is stopped
[ https://issues.apache.org/jira/browse/SPARK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121561#comment-14121561 ] Helena Edelson commented on SPARK-2892: --- I wonder if the ERROR should be a WARN or INFO since it occurs as a result of ReceiverSupervisorImpl receiving a StopReceiver, and " Deregistered receiver for stream" seems like the expected behavior. > Socket Receiver does not stop when streaming context is stopped > --- > > Key: SPARK-2892 > URL: https://issues.apache.org/jira/browse/SPARK-2892 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.0.2 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > > Running NetworkWordCount with > {quote} > ssc.start(); Thread.sleep(1); ssc.stop(stopSparkContext = false); > Thread.sleep(6) > {quote} > gives the following error > {quote} > 14/08/06 18:37:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) > in 10047 ms on localhost (1/1) > 14/08/06 18:37:13 INFO DAGScheduler: Stage 0 (runJob at > ReceiverTracker.scala:275) finished in 10.056 s > 14/08/06 18:37:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks > have all completed, from pool > 14/08/06 18:37:13 INFO SparkContext: Job finished: runJob at > ReceiverTracker.scala:275, took 10.179263 s > 14/08/06 18:37:13 INFO ReceiverTracker: All of the receivers have been > terminated > 14/08/06 18:37:13 WARN ReceiverTracker: All of the receivers have not > deregistered, Map(0 -> > ReceiverInfo(0,SocketReceiver-0,null,false,localhost,Stopped by driver,)) > 14/08/06 18:37:13 INFO ReceiverTracker: ReceiverTracker stopped > 14/08/06 18:37:13 INFO JobGenerator: Stopping JobGenerator immediately > 14/08/06 18:37:13 INFO RecurringTimer: Stopped timer for JobGenerator after > time 1407375433000 > 14/08/06 18:37:13 INFO JobGenerator: Stopped JobGenerator > 14/08/06 18:37:13 INFO JobScheduler: Stopped JobScheduler > 14/08/06 18:37:13 INFO StreamingContext: StreamingContext stopped successfully > 14/08/06 18:37:43 INFO SocketReceiver: Stopped receiving > 14/08/06 18:37:43 INFO SocketReceiver: Closed socket to localhost: > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3404) SparkSubmitSuite fails in Maven (only) - spark-submit exits with code 1
Sean Owen created SPARK-3404: Summary: SparkSubmitSuite fails in Maven (only) - spark-submit exits with code 1 Key: SPARK-3404 URL: https://issues.apache.org/jira/browse/SPARK-3404 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.2 Reporter: Sean Owen Maven-based Jenkins builds have been failing for over a month. For example: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/ It's SparkSubmitSuite that fails. For example: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/541/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/consoleFull {code} SparkSubmitSuite ... - launch simple application with spark-submit *** FAILED *** org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, local, file:/tmp/1409815981504-0/testJar-1409815981505.jar) exited with code 1 at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) at org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) ... - spark submit includes jars passed in through --jar *** FAILED *** org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, local-cluster[2,1,512], --jars, file:/tmp/1409815984960-0/testJar-1409815985029.jar,file:/tmp/1409815985030-0/testJar-1409815985087.jar, file:/tmp/1409815984959-0/testJar-1409815984959.jar) exited with code 1 at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) at org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) ... {code} SBT builds don't fail, so it is likely to be due to some difference in how the tests are run rather than a problem with test or core project. This is related to http://issues.apache.org/jira/browse/SPARK-3330 but the cause identified in that JIRA is, at least, not the only cause. (Although, it wouldn't hurt to be doubly-sure this is not an issue by changing the Jenkins config to invoke {{mvn clean && mvn ... package}} {{mvn ... clean package}}.) This JIRA tracks investigation into a different cause. Right now I have some further information but not a PR yet. Part of the issue is that there is no clue in the log about why {{spark-submit}} exited with status 1. 
See https://github.com/apache/spark/pull/2108/files and https://issues.apache.org/jira/browse/SPARK-3193 for a change that would at least print stdout to the log too. The SparkSubmit program exits with 1 when the main class it is supposed to run is not found (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L322) This is for example SimpleApplicationTest (https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala#L339) The test actually submits an empty JAR not containing this class. It relies on {{spark-submit}} finding the class within the compiled test-classes of the Spark project. However it does seem to be compiled and present even with Maven. If modified to print stdout and stderr, and dump the actual command, I see an empty stdout, and only the command to stderr: {code} Spark Command: /Library/Java/JavaVirtualMachines/jdk1.8.0_20.jdk/Contents/Home/bin/java -cp null::/Users/srowen/Documents/spark/conf:/Users/srowen/Documents/spark/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop1.0.4.jar:/Users/srowen/Documents/spark/core/target/scala-2.10/test-classes:/Users/srowen/Documents/spark/repl
[jira] [Commented] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
[ https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121516#comment-14121516 ] Xiangrui Meng commented on SPARK-3403: -- Did you test the setup of netlib-java with OpenBLAS? I hit a JNI issue (a year ago, maybe fixed) with netlib-java and multithreading OpenBLAS. Could you try compiling OpenBLAS with `USE_THREAD=0`? If it still doesn't work, please attach the driver/executor logs. Thanks! > NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java) > - > > Key: SPARK-3403 > URL: https://issues.apache.org/jira/browse/SPARK-3403 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.2 > Environment: Setup: Windows 7, x64 libraries for netlib-java (as > described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and > MinGW64 precompiled dlls. >Reporter: Alexander Ulanov > Fix For: 1.1.0 > > Attachments: NativeNN.scala > > > Code: > val model = NaiveBayes.train(train) > val predictionAndLabels = test.map { point => > val score = model.predict(point.features) > (score, point.label) > } > predictionAndLabels.foreach(println) > Result: > program crashes with: "Process finished with exit code -1073741819 > (0xC005)" after displaying the first prediction -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
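One quick way to narrow this down is to confirm which implementation netlib-java actually loaded inside the Spark JVMs. A small check like the one below (a sketch, assuming netlib-java is on the classpath, e.g. via breeze) prints the concrete BLAS/LAPACK classes, distinguishing a native OpenBLAS binding from the pure-Java fallback:
{code}
import com.github.fommil.netlib.{BLAS, LAPACK}

object CheckNetlib extends App {
  // Prints e.g. com.github.fommil.netlib.NativeSystemBLAS when the native
  // libraries load correctly, or com.github.fommil.netlib.F2jBLAS when
  // netlib-java falls back to the Java implementation.
  println("BLAS:   " + BLAS.getInstance().getClass.getName)
  println("LAPACK: " + LAPACK.getInstance().getClass.getName)
}
{code}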
[jira] [Updated] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
[ https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-3403: Attachment: NativeNN.scala The file contains example that produces the same issue > NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java) > - > > Key: SPARK-3403 > URL: https://issues.apache.org/jira/browse/SPARK-3403 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.2 > Environment: Setup: Windows 7, x64 libraries for netlib-java (as > described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and > MinGW64 precompiled dlls. >Reporter: Alexander Ulanov > Fix For: 1.1.0 > > Attachments: NativeNN.scala > > > Code: > val model = NaiveBayes.train(train) > val predictionAndLabels = test.map { point => > val score = model.predict(point.features) > (score, point.label) > } > predictionAndLabels.foreach(println) > Result: > program crashes with: "Process finished with exit code -1073741819 > (0xC005)" after displaying the first prediction -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
Alexander Ulanov created SPARK-3403: --- Summary: NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java) Key: SPARK-3403 URL: https://issues.apache.org/jira/browse/SPARK-3403 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.2 Environment: Setup: Windows 7, x64 libraries for netlib-java (as described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and MinGW64 precompiled dlls. Reporter: Alexander Ulanov Fix For: 1.1.0 Code: val model = NaiveBayes.train(train) val predictionAndLabels = test.map { point => val score = model.predict(point.features) (score, point.label) } predictionAndLabels.foreach(println) Result: program crashes with: "Process finished with exit code -1073741819 (0xC005)" after displaying the first prediction -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3375) spark on yarn container allocation issues
[ https://issues.apache.org/jira/browse/SPARK-3375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121451#comment-14121451 ] Apache Spark commented on SPARK-3375: - User 'tgravescs' has created a pull request for this issue: https://github.com/apache/spark/pull/2275 > spark on yarn container allocation issues > - > > Key: SPARK-3375 > URL: https://issues.apache.org/jira/browse/SPARK-3375 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.2.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Blocker > > It looks like if yarn doesn't get the containers immediately it stops asking > for them and the yarn application hangs with never getting any executors. > This was introduced by https://github.com/apache/spark/pull/2169 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-3375) spark on yarn container allocation issues
[ https://issues.apache.org/jira/browse/SPARK-3375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-3375: Assignee: Thomas Graves > spark on yarn container allocation issues > - > > Key: SPARK-3375 > URL: https://issues.apache.org/jira/browse/SPARK-3375 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.2.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Blocker > > It looks like if yarn doesn't get the containers immediately it stops asking > for them and the yarn application hangs with never getting any executors. > This was introduced by https://github.com/apache/spark/pull/2169 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2892) Socket Receiver does not stop when streaming context is stopped
[ https://issues.apache.org/jira/browse/SPARK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121324#comment-14121324 ] Helena Edelson edited comment on SPARK-2892 at 9/4/14 1:12 PM: --- I see the same with 1.0.2 streaming, with or without stopGracefully = true ssc.stop(stopSparkContext = false, stopGracefully = true) ERROR 08:26:21,139 Deregistered receiver for stream 0: Stopped by driver WARN 08:26:21,211 Stopped executor without error WARN 08:26:21,213 All of the receivers have not deregistered, Map(0 -> ReceiverInfo(0,ActorReceiver-0,null,false,host,Stopped by driver,)) was (Author: helena_e): I see the same with 1.0.2 streaming: ERROR 08:26:21,139 Deregistered receiver for stream 0: Stopped by driver WARN 08:26:21,211 Stopped executor without error WARN 08:26:21,213 All of the receivers have not deregistered, Map(0 -> ReceiverInfo(0,ActorReceiver-0,null,false,host,Stopped by driver,)) > Socket Receiver does not stop when streaming context is stopped > --- > > Key: SPARK-2892 > URL: https://issues.apache.org/jira/browse/SPARK-2892 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.0.2 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > > Running NetworkWordCount with > {quote} > ssc.start(); Thread.sleep(1); ssc.stop(stopSparkContext = false); > Thread.sleep(6) > {quote} > gives the following error > {quote} > 14/08/06 18:37:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) > in 10047 ms on localhost (1/1) > 14/08/06 18:37:13 INFO DAGScheduler: Stage 0 (runJob at > ReceiverTracker.scala:275) finished in 10.056 s > 14/08/06 18:37:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks > have all completed, from pool > 14/08/06 18:37:13 INFO SparkContext: Job finished: runJob at > ReceiverTracker.scala:275, took 10.179263 s > 14/08/06 18:37:13 INFO ReceiverTracker: All of the receivers have been > terminated > 14/08/06 18:37:13 WARN ReceiverTracker: All of the receivers have not > deregistered, Map(0 -> > ReceiverInfo(0,SocketReceiver-0,null,false,localhost,Stopped by driver,)) > 14/08/06 18:37:13 INFO ReceiverTracker: ReceiverTracker stopped > 14/08/06 18:37:13 INFO JobGenerator: Stopping JobGenerator immediately > 14/08/06 18:37:13 INFO RecurringTimer: Stopped timer for JobGenerator after > time 1407375433000 > 14/08/06 18:37:13 INFO JobGenerator: Stopped JobGenerator > 14/08/06 18:37:13 INFO JobScheduler: Stopped JobScheduler > 14/08/06 18:37:13 INFO StreamingContext: StreamingContext stopped successfully > 14/08/06 18:37:43 INFO SocketReceiver: Stopped receiving > 14/08/06 18:37:43 INFO SocketReceiver: Closed socket to localhost: > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
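For completeness, the stop pattern being discussed, as a sketch of a self-contained reproduction; the socket source and sleep durations simply mirror the issue description, and stopGracefully is the flag the comment above refers to.
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StopRepro extends App {
  val conf = new SparkConf().setMaster("local[2]").setAppName("StopRepro")
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.socketTextStream("localhost", 9999).print()

  ssc.start()
  Thread.sleep(10000)
  // Keep the SparkContext alive; with stopGracefully = true the receivers are
  // asked to finish in-flight data before deregistering.
  ssc.stop(stopSparkContext = false, stopGracefully = true)
  Thread.sleep(60000)
}
{code}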
[jira] [Commented] (SPARK-2892) Socket Receiver does not stop when streaming context is stopped
[ https://issues.apache.org/jira/browse/SPARK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121324#comment-14121324 ] Helena Edelson commented on SPARK-2892: --- I see the same with 1.0.2 streaming: ERROR 08:26:21,139 Deregistered receiver for stream 0: Stopped by driver WARN 08:26:21,211 Stopped executor without error WARN 08:26:21,213 All of the receivers have not deregistered, Map(0 -> ReceiverInfo(0,ActorReceiver-0,null,false,host,Stopped by driver,)) > Socket Receiver does not stop when streaming context is stopped > --- > > Key: SPARK-2892 > URL: https://issues.apache.org/jira/browse/SPARK-2892 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.0.2 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > > Running NetworkWordCount with > {quote} > ssc.start(); Thread.sleep(1); ssc.stop(stopSparkContext = false); > Thread.sleep(6) > {quote} > gives the following error > {quote} > 14/08/06 18:37:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) > in 10047 ms on localhost (1/1) > 14/08/06 18:37:13 INFO DAGScheduler: Stage 0 (runJob at > ReceiverTracker.scala:275) finished in 10.056 s > 14/08/06 18:37:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks > have all completed, from pool > 14/08/06 18:37:13 INFO SparkContext: Job finished: runJob at > ReceiverTracker.scala:275, took 10.179263 s > 14/08/06 18:37:13 INFO ReceiverTracker: All of the receivers have been > terminated > 14/08/06 18:37:13 WARN ReceiverTracker: All of the receivers have not > deregistered, Map(0 -> > ReceiverInfo(0,SocketReceiver-0,null,false,localhost,Stopped by driver,)) > 14/08/06 18:37:13 INFO ReceiverTracker: ReceiverTracker stopped > 14/08/06 18:37:13 INFO JobGenerator: Stopping JobGenerator immediately > 14/08/06 18:37:13 INFO RecurringTimer: Stopped timer for JobGenerator after > time 1407375433000 > 14/08/06 18:37:13 INFO JobGenerator: Stopped JobGenerator > 14/08/06 18:37:13 INFO JobScheduler: Stopped JobScheduler > 14/08/06 18:37:13 INFO StreamingContext: StreamingContext stopped successfully > 14/08/06 18:37:43 INFO SocketReceiver: Stopped receiving > 14/08/06 18:37:43 INFO SocketReceiver: Closed socket to localhost: > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
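A minimal sketch of the stop sequence being discussed in SPARK-2892, assuming Spark Streaming 1.x and a socket source fed by "nc -lk 9999"; the object name, port, and sleep durations are illustrative and not taken from the report:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StopReceiverSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StopReceiverSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    // A socket receiver; assumes something is listening on localhost:9999.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    Thread.sleep(10000)
    // Stop only the streaming machinery; the graceful flag asks receivers to
    // drain buffered data before shutting down.
    ssc.stop(stopSparkContext = false, stopGracefully = true)
  }
}
{code}
Running this exercises the same driver-initiated receiver shutdown path that the warnings quoted above come from.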
[jira] [Commented] (SPARK-1473) Feature selection for high dimensional datasets
[ https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121276#comment-14121276 ] David commented on SPARK-1473: -- Hi all, I am Dr. David Martinez and this is my first comment on this project. We implemented all the feature selection methods included in Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. The Journal of Machine Learning Research, 13, 27-66, added more optimizations, and left the framework open to include more criteria. We opened a pull request in the past but did not finish it. You can have a look at our GitHub repository: https://github.com/LIDIAgroup/SparkFeatureSelection We would like to finish our pull request. > Feature selection for high dimensional datasets > --- > > Key: SPARK-1473 > URL: https://issues.apache.org/jira/browse/SPARK-1473 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Ignacio Zendejas >Assignee: Alexander Ulanov >Priority: Minor > Labels: features > > For classification tasks involving large feature spaces in the order of tens > of thousands or higher (e.g., text classification with n-grams, where n > 1), > it is often useful to rank and filter features that are irrelevant, thereby > reducing the feature space by at least one or two orders of magnitude without > impacting performance on key evaluation metrics (accuracy/precision/recall). > A flexible feature evaluation interface needs to be designed, and at > least two methods should be implemented, with Information Gain being a > priority as it has been shown to be amongst the most reliable. > Special consideration should be taken in the design to account for wrapper > methods (see research papers below) which are more practical for lower > dimensional data. > Relevant research: > * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional > likelihood maximisation: a unifying framework for information theoretic > feature selection. *The Journal of Machine Learning Research*, *13*, 27-66. > * Forman, George. "An extensive empirical study of feature selection metrics > for text classification." The Journal of Machine Learning Research 3 (2003): > 1289-1305. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
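For context on the Information Gain criterion prioritized in this ticket, here is a small, hypothetical sketch of ranking discrete features by information gain over an RDD. It is not the SparkFeatureSelection code linked above; the object name and toy data are made up:
{code}
import org.apache.spark.{SparkConf, SparkContext}

object InfoGainSketch {
  // Shannon entropy of a set of counts.
  def entropy(counts: Iterable[Long]): Double = {
    val total = counts.sum.toDouble
    counts.filter(_ > 0).map { c => val p = c / total; -p * math.log(p) }.sum
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("InfoGainSketch").setMaster("local[2]"))

    // Toy data: (label, discrete feature values).
    val data = sc.parallelize(Seq(
      (1.0, Array(1.0, 0.0)), (1.0, Array(1.0, 1.0)),
      (0.0, Array(0.0, 0.0)), (0.0, Array(0.0, 1.0))))

    val n = data.count()
    val hLabel = entropy(data.map(_._1).countByValue().values)
    val numFeatures = data.first()._2.length

    val gains = (0 until numFeatures).map { i =>
      // Joint counts of (feature value, label) for feature i.
      val joint = data.map { case (label, feats) => ((feats(i), label), 1L) }
        .reduceByKey(_ + _).collect()
      // Conditional entropy H(label | feature i).
      val hCond = joint.groupBy(_._1._1).values.map { group =>
        val counts = group.map(_._2)
        (counts.sum.toDouble / n) * entropy(counts)
      }.sum
      (i, hLabel - hCond)
    }

    // Rank features: gain = H(label) - H(label | feature).
    gains.sortBy(-_._2).foreach { case (i, g) => println(s"feature $i: gain $g") }
    sc.stop()
  }
}
{code}
Filtering would then keep only the top-k features by gain before training.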
[jira] [Commented] (SPARK-3402) Library for Natural Language Processing over Spark.
[ https://issues.apache.org/jira/browse/SPARK-3402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121190#comment-14121190 ] Nagamallikarjuna commented on SPARK-3402: - We have gone through Spark and its ecosystem, and we did not find any natural language processing library for Spark. We (Impetus) are working to implement some natural language processing features on Spark. We have already developed a working algorithms library using the OpenNLP toolkit, and will extend it to other NLP toolkits like Stanford, CTakes, NLTK, etc. We are planning to contribute our work to the existing MLlib or a new subproject. Thanks, Naga > Library for Natural Language Processing over Spark. > --- > > Key: SPARK-3402 > URL: https://issues.apache.org/jira/browse/SPARK-3402 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Nagamallikarjuna >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
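As a hypothetical illustration of the kind of integration this ticket describes, the following sketch runs OpenNLP's SimpleTokenizer inside a Spark job. It is not the Impetus library mentioned above; the object name and sample sentences are made up:
{code}
import opennlp.tools.tokenize.SimpleTokenizer
import org.apache.spark.{SparkConf, SparkContext}

object OpenNlpTokenizeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("OpenNlpTokenizeSketch").setMaster("local[2]"))

    val docs = sc.parallelize(Seq(
      "Spark makes distributed NLP pipelines practical.",
      "OpenNLP provides tokenizers, POS taggers and more."))

    // Obtain the tokenizer inside mapPartitions so the NLP object lives on the
    // executors and never needs to be shipped from the driver.
    val tokens = docs.mapPartitions { iter =>
      val tokenizer = SimpleTokenizer.INSTANCE
      iter.map(doc => tokenizer.tokenize(doc).toSeq)
    }

    tokens.collect().foreach(println)
    sc.stop()
  }
}
{code}
Heavier components (sentence detectors, POS taggers, parsers) would follow the same per-partition pattern, loading their models once per partition.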
[jira] [Created] (SPARK-3402) Library for Natural Language Processing over Spark.
Nagamallikarjuna created SPARK-3402: --- Summary: Library for Natural Language Processing over Spark. Key: SPARK-3402 URL: https://issues.apache.org/jira/browse/SPARK-3402 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Nagamallikarjuna Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2321) Design a proper progress reporting & event listener API
[ https://issues.apache.org/jira/browse/SPARK-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121173#comment-14121173 ] Chengxiang Li commented on SPARK-2321: -- I'm not sure whether I understand you right; here is my thought about the API design: 1. The JobStatus/JobStatistic API only contains getter methods. 2. JobProgressListener contains variables of JobStatusImpl/JobStatisticImpl. 3. DagScheduler posts events to JobProgressListener through the listener bus. 4. Callers get JobStatusImpl/JobStatisticImpl from JobProgressListener with updated state. So I think it should be a pull-style API. > Design a proper progress reporting & event listener API > --- > > Key: SPARK-2321 > URL: https://issues.apache.org/jira/browse/SPARK-2321 > Project: Spark > Issue Type: Improvement > Components: Java API, Spark Core >Affects Versions: 1.0.0 >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Critical > > This is a ticket to track progress on redesigning the SparkListener and > JobProgressListener API. > There are multiple problems with the current design, including: > 0. I'm not sure if the API is usable in Java (there are at least some enums > we used in Scala and a bunch of case classes that might complicate things). > 1. The whole API is marked as DeveloperApi, because we haven't paid a lot of > attention to it yet. Something as important as progress reporting deserves a > more stable API. > 2. There is no easy way to connect jobs with stages. Similarly, there is no > easy way to connect job groups with jobs / stages. > 3. JobProgressListener itself has no encapsulation at all. States can be > arbitrarily mutated by external programs. Variable names are sort of randomly > decided and inconsistent. > We should just revisit these and propose a new, concrete design. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
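To make the pull-style design above concrete, here is a hypothetical sketch of getter-only status views backed by a listener that callers poll; all names (JobStatus, JobExecutionState, JobProgressListenerSketch) are illustrative, not Spark's actual classes:
{code}
object JobExecutionState extends Enumeration {
  val RUNNING, SUCCEEDED, FAILED, KILLED = Value
}

// Getter-only view handed to callers.
trait JobStatus {
  def jobId: Int
  def state: JobExecutionState.Value
  def stageIds: Seq[Int]
  def numTasks: Int
  def numCompletedTasks: Int
  def numFailedTasks: Int
}

// The listener holds the mutable state, updated from scheduler events posted
// on the listener bus; callers pull snapshots through jobStatus().
class JobProgressListenerSketch {
  private val jobs = scala.collection.mutable.Map[Int, JobStatus]()

  def onJobEvent(updated: JobStatus): Unit = synchronized { jobs(updated.jobId) = updated }

  def jobStatus(jobId: Int): Option[JobStatus] = synchronized { jobs.get(jobId) }
}
{code}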
[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121160#comment-14121160 ] Saisai Shao commented on SPARK-3129: Hi [~hshreedharan], one more question: Is your design goal to fix the data loss caused by receiver node failure? It seems data could potentially be lost when it is only stored in the BlockGenerator and not yet in the BlockManager when the node fails. Your design doc mainly focuses on driver failure, so what are your thoughts? > Prevent data loss in Spark Streaming > > > Key: SPARK-3129 > URL: https://issues.apache.org/jira/browse/SPARK-3129 > Project: Spark > Issue Type: New Feature >Reporter: Hari Shreedharan >Assignee: Hari Shreedharan > Attachments: StreamingPreventDataLoss.pdf > > > Spark Streaming can lose small amounts of data when the driver goes down - and the > sending system cannot re-send the data (or the data has already expired on the > sender side). The document attached has more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-529) Have a single file that controls the environmental variables and spark config options
[ https://issues.apache.org/jira/browse/SPARK-529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-529. - Resolution: Won't Fix This looks obsolete and/or fixed, as variables like SPARK_MEM are deprecated, and I suppose there is spark-env.sh too. > Have a single file that controls the environmental variables and spark config > options > - > > Key: SPARK-529 > URL: https://issues.apache.org/jira/browse/SPARK-529 > Project: Spark > Issue Type: Improvement >Reporter: Reynold Xin > > E.g. multiple places in the code base uses SPARK_MEM and has its own default > set to 512. We need a central place to enforce default values as well as > documenting the variables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-640) Update Hadoop 1 version to 1.1.0 (especially on AMIs)
[ https://issues.apache.org/jira/browse/SPARK-640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-640. - Resolution: Fixed This looks stale, right? The Hadoop 1 version has been at 1.2.1 for some time. > Update Hadoop 1 version to 1.1.0 (especially on AMIs) > - > > Key: SPARK-640 > URL: https://issues.apache.org/jira/browse/SPARK-640 > Project: Spark > Issue Type: New Feature >Reporter: Matei Zaharia > > Hadoop 1.1.0 has a fix to the notorious "trailing slash for directory objects > in S3" issue: https://issues.apache.org/jira/browse/HADOOP-5836, so it would be > good to support on the AMIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3377) Don't mix metrics from different applications
[ https://issues.apache.org/jira/browse/SPARK-3377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3377: -- Summary: Don't mix metrics from different applications (was: codahale base Metrics data between applications can jumble up together) > Don't mix metrics from different applications > - > > Key: SPARK-3377 > URL: https://issues.apache.org/jira/browse/SPARK-3377 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Kousuke Saruta >Priority: Critical > > I'm using the Codahale-based MetricsSystem of Spark with JMX or Graphite, and I > saw the following 2 problems. > (1) When applications which have the same spark.app.name run on a cluster at the > same time, some metric names jumble up together, e.g., > SparkPi.DAGScheduler.stage.failedStages. > (2) When 2+ executors run on the same machine, the JVM metrics of each executor > jumble, e.g., the current implementation cannot distinguish which executor the > metric "jvm.memory" belongs to. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
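One hypothetical way to avoid the collisions described above is to prefix every metric name with the application and executor IDs before registering it with the Codahale registry; the IDs and helper below are illustrative, not the actual MetricsSystem change:
{code}
import com.codahale.metrics.MetricRegistry

object MetricsNamespaceSketch {
  // Produces names like "app-20140904000000-0001.1.jvm.memory", so two
  // executors, or two apps with the same spark.app.name, no longer collide.
  def qualifiedName(appId: String, executorId: String, metric: String): String =
    MetricRegistry.name(appId, executorId, metric)

  def main(args: Array[String]): Unit = {
    val registry = new MetricRegistry
    registry.counter(qualifiedName("app-20140904000000-0001", "1", "jvm.memory")).inc()
    registry.counter(qualifiedName("app-20140904000000-0002", "2", "jvm.memory")).inc()
    println(registry.getCounters.keySet())
  }
}
{code}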
[jira] [Commented] (SPARK-2978) Provide an MR-style shuffle transformation
[ https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121115#comment-14121115 ] Apache Spark commented on SPARK-2978: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/2274 > Provide an MR-style shuffle transformation > -- > > Key: SPARK-2978 > URL: https://issues.apache.org/jira/browse/SPARK-2978 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Sandy Ryza > > For Hive on Spark joins in particular, and for running legacy MR code in > general, I think it would be useful to provide a transformation with the > semantics of the Hadoop MR shuffle, i.e. one that > * groups by key: provides (Key, Iterator[Value]) > * within each partition, provides keys in sorted order > A couple of ways that it could make sense to expose this: > * Add a new operator: "groupAndSortByKey", > "groupByKeyAndSortWithinPartition", "hadoopStyleShuffle", maybe? > * Allow groupByKey to take an ordering param for keys within a partition -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
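A hypothetical sketch of the proposed semantics (grouped values per key, keys sorted within each partition), built from existing operators rather than any of the operator names suggested in the ticket:
{code}
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object SortedGroupByKeySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SortedGroupByKeySketch").setMaster("local[2]"))

    val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("a", 3), ("c", 4)))

    // groupByKey gives (Key, Iterable[Value]); sorting each partition's keys
    // afterwards approximates the Hadoop MR shuffle contract.
    val grouped = pairs.groupByKey(new HashPartitioner(2))
      .mapPartitions(iter => iter.toSeq.sortBy(_._1).iterator, preservesPartitioning = true)

    grouped.collect().foreach { case (k, vs) => println(s"$k -> ${vs.mkString(",")}") }
    sc.stop()
  }
}
{code}
This only approximates the MR contract, since it materializes each partition to sort it; a native operator could instead sort keys during the shuffle itself.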
[jira] [Commented] (SPARK-2321) Design a proper progress reporting & event listener API
[ https://issues.apache.org/jira/browse/SPARK-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121094#comment-14121094 ] Reynold Xin commented on SPARK-2321: What about pull vs. push? I.e., should this be a listener-like API, or some service with state that the caller can poll? > Design a proper progress reporting & event listener API > --- > > Key: SPARK-2321 > URL: https://issues.apache.org/jira/browse/SPARK-2321 > Project: Spark > Issue Type: Improvement > Components: Java API, Spark Core >Affects Versions: 1.0.0 >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Critical > > This is a ticket to track progress on redesigning the SparkListener and > JobProgressListener API. > There are multiple problems with the current design, including: > 0. I'm not sure if the API is usable in Java (there are at least some enums > we used in Scala and a bunch of case classes that might complicate things). > 1. The whole API is marked as DeveloperApi, because we haven't paid a lot of > attention to it yet. Something as important as progress reporting deserves a > more stable API. > 2. There is no easy way to connect jobs with stages. Similarly, there is no > easy way to connect job groups with jobs / stages. > 3. JobProgressListener itself has no encapsulation at all. States can be > arbitrarily mutated by external programs. Variable names are sort of randomly > decided and inconsistent. > We should just revisit these and propose a new, concrete design. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2321) Design a proper progress reporting & event listener API
[ https://issues.apache.org/jira/browse/SPARK-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121085#comment-14121085 ] Chengxiang Li commented on SPARK-2321: -- I collected some Hive-side requirements here, which should be helpful for the Spark job status and statistics API design. Hive should be able to get the following job status information through the Spark job status API: 1. job identifier 2. current job execution state, which should include RUNNING/SUCCEEDED/FAILED/KILLED. 3. running/failed/killed/total task number at the job level. 4. stage identifier 5. stage state, which should include RUNNING/SUCCEEDED/FAILED/KILLED 6. running/failed/killed/total task number at the stage level. MR/Tez use Counters to collect statistics; similar to the MR/Tez Counter, it would be better if the Spark job statistics API organized statistics with: 1. the same kind of statistics grouped by groupName. 2. a displayName for both groups and statistics, which would provide a uniform print string for frontends (Web UI/Hive CLI/...). > Design a proper progress reporting & event listener API > --- > > Key: SPARK-2321 > URL: https://issues.apache.org/jira/browse/SPARK-2321 > Project: Spark > Issue Type: Improvement > Components: Java API, Spark Core >Affects Versions: 1.0.0 >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Critical > > This is a ticket to track progress on redesigning the SparkListener and > JobProgressListener API. > There are multiple problems with the current design, including: > 0. I'm not sure if the API is usable in Java (there are at least some enums > we used in Scala and a bunch of case classes that might complicate things). > 1. The whole API is marked as DeveloperApi, because we haven't paid a lot of > attention to it yet. Something as important as progress reporting deserves a > more stable API. > 2. There is no easy way to connect jobs with stages. Similarly, there is no > easy way to connect job groups with jobs / stages. > 3. JobProgressListener itself has no encapsulation at all. States can be > arbitrarily mutated by external programs. Variable names are sort of randomly > decided and inconsistent. > We should just revisit these and propose a new, concrete design. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
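A hypothetical sketch of the Counter-style grouping requested above, with a groupName and displayName on each group and statistic; the case classes and example values are made up for illustration:
{code}
case class StatSketch(name: String, displayName: String, value: Long)

case class StatGroupSketch(groupName: String, displayName: String, stats: Seq[StatSketch]) {
  def find(name: String): Option[StatSketch] = stats.find(_.name == name)
}

object StatGroupSketch {
  // Example groups a job status API could expose to frontends.
  def example(): Seq[StatGroupSketch] = Seq(
    StatGroupSketch("shuffle", "Shuffle Statistics", Seq(
      StatSketch("bytesWritten", "Shuffle Bytes Written", 1024L),
      StatSketch("recordsRead", "Shuffle Records Read", 42L))),
    StatGroupSketch("tasks", "Task Statistics", Seq(
      StatSketch("failed", "Failed Tasks", 0L))))

  // Renders uniform display strings, e.g. for the Web UI or Hive CLI.
  def main(args: Array[String]): Unit =
    example().foreach(g => g.stats.foreach(s => println(s"${g.displayName} / ${s.displayName}: ${s.value}")))
}
{code}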
[jira] [Commented] (SPARK-3353) Stage id monotonicity (parent stage should have lower stage id)
[ https://issues.apache.org/jira/browse/SPARK-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121075#comment-14121075 ] Apache Spark commented on SPARK-3353: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/2273 > Stage id monotonicity (parent stage should have lower stage id) > --- > > Key: SPARK-3353 > URL: https://issues.apache.org/jira/browse/SPARK-3353 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Reynold Xin > > The way stage IDs are generated is that parent stages actually have higher > stage id. This is very confusing because parent stages get scheduled & > executed first. > We should reverse that order so the scheduling timeline of stages (absent of > failures) is monotonic, i.e. stages that are executed first have lower stage > ids. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org