[jira] [Updated] (SPARK-3410) The priority of shutdownhook for ApplicationMaster should not be integer literal
[ https://issues.apache.org/jira/browse/SPARK-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3410: -- Summary: The priority of shutdownhook for ApplicationMaster should not be integer literal (was: The priority of shutdownhook for ApplicationMaster should not be integer literal, rather than refer constant.) > The priority of shutdownhook for ApplicationMaster should not be integer > literal > > > Key: SPARK-3410 > URL: https://issues.apache.org/jira/browse/SPARK-3410 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.1.0 >Reporter: Kousuke Saruta >Priority: Minor > > In ApplicationMaster, the priority of the shutdown hook is set to 30, which > is expected to be higher than the priority of o.a.h.FileSystem's shutdown hook. > In FileSystem, the priority of the shutdown hook is exposed as a public constant > named "SHUTDOWN_HOOK_PRIORITY", so I think it's better to use this constant > for the priority of ApplicationMaster's shutdown hook. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
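For reference, a minimal sketch of what such a change could look like, using Hadoop's public ShutdownHookManager API and the FileSystem.SHUTDOWN_HOOK_PRIORITY constant; the hook body, object name, and the offset added to the constant are illustrative assumptions, not the actual ApplicationMaster code:

{code}
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.util.ShutdownHookManager

object AmShutdownHookSketch {
  def register(cleanup: () => Unit): Unit = {
    // Higher priority means the hook runs earlier; derive it from the public
    // constant instead of a bare literal such as 30, so the AM hook always
    // runs before FileSystem's own shutdown hook closes the file systems.
    val priority = FileSystem.SHUTDOWN_HOOK_PRIORITY + 20
    ShutdownHookManager.get().addShutdownHook(new Runnable {
      override def run(): Unit = cleanup()
    }, priority)
  }
}
{code}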
[jira] [Created] (SPARK-3411) Optimize the schedule procedure in Master
WangTaoTheTonic created SPARK-3411: -- Summary: Optimize the schedule procedure in Master Key: SPARK-3411 URL: https://issues.apache.org/jira/browse/SPARK-3411 Project: Spark Issue Type: Improvement Components: Deploy Reporter: WangTaoTheTonic Priority: Minor If the waiting driver array is too big, the drivers in it will be dispatched to the first worker we get (if it has enough resources), with or without the randomization. We should do the randomization every time we dispatch a driver, in order to balance drivers across workers better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
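A minimal sketch of the idea, not the actual Master.schedule() code; the Worker and Driver case classes below are simplified stand-ins for the real WorkerInfo/DriverInfo structures, and a real implementation would also deduct the assigned resources after each placement:

{code}
import scala.util.Random

case class Worker(id: String, freeCores: Int, freeMemMb: Int)
case class Driver(id: String, cores: Int, memMb: Int)

def dispatch(waitingDrivers: Seq[Driver], workers: Seq[Worker]): Map[String, String] =
  waitingDrivers.flatMap { d =>
    // Re-shuffle the candidates for every driver instead of shuffling once,
    // so a long waiting list does not pile all drivers onto the same worker.
    Random.shuffle(workers)
      .find(w => w.freeCores >= d.cores && w.freeMemMb >= d.memMb)
      .map(w => d.id -> w.id)
  }.toMap
{code}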
[jira] [Commented] (SPARK-3410) The priority of shutdownhook for ApplicationMaster should not be integer literal, rather than refer constant.
[ https://issues.apache.org/jira/browse/SPARK-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122465#comment-14122465 ] Apache Spark commented on SPARK-3410: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/2283 > The priority of shutdownhook for ApplicationMaster should not be integer > literal, rather than refer constant. > - > > Key: SPARK-3410 > URL: https://issues.apache.org/jira/browse/SPARK-3410 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.1.0 >Reporter: Kousuke Saruta >Priority: Minor > > In ApplicationMaster, the priority of the shutdown hook is set to 30, which > is expected to be higher than the priority of o.a.h.FileSystem's shutdown hook. > In FileSystem, the priority of the shutdown hook is exposed as a public constant > named "SHUTDOWN_HOOK_PRIORITY", so I think it's better to use this constant > for the priority of ApplicationMaster's shutdown hook. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3410) The priority of shutdownhook for ApplicationMaster should not be integer literal, rather than refer constant.
[ https://issues.apache.org/jira/browse/SPARK-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3410: -- Issue Type: Improvement (was: Bug) > The priority of shutdownhook for ApplicationMaster should not be integer > literal, rather than refer constant. > - > > Key: SPARK-3410 > URL: https://issues.apache.org/jira/browse/SPARK-3410 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.1.0 >Reporter: Kousuke Saruta >Priority: Minor > > In ApplicationMaster, the priority of the shutdown hook is set to 30, which > is expected to be higher than the priority of o.a.h.FileSystem's shutdown hook. > In FileSystem, the priority of the shutdown hook is exposed as a public constant > named "SHUTDOWN_HOOK_PRIORITY", so I think it's better to use this constant > for the priority of ApplicationMaster's shutdown hook. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3410) The priority of shutdownhook for ApplicationMaster should not be integer literal, rather than refer constant.
Kousuke Saruta created SPARK-3410: - Summary: The priority of shutdownhook for ApplicationMaster should not be integer literal, rather than refer constant. Key: SPARK-3410 URL: https://issues.apache.org/jira/browse/SPARK-3410 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Kousuke Saruta Priority: Minor In ApplicationMaster, the priority of the shutdown hook is set to 30, which is expected to be higher than the priority of o.a.h.FileSystem's shutdown hook. In FileSystem, the priority of the shutdown hook is exposed as a public constant named "SHUTDOWN_HOOK_PRIORITY", so I think it's better to use this constant for the priority of ApplicationMaster's shutdown hook. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3409) Avoid pulling in Exchange operator itself in Exchange's closures
[ https://issues.apache.org/jira/browse/SPARK-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122408#comment-14122408 ] Apache Spark commented on SPARK-3409: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/2282 > Avoid pulling in Exchange operator itself in Exchange's closures > > > Key: SPARK-3409 > URL: https://issues.apache.org/jira/browse/SPARK-3409 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > {code} > val rdd = child.execute().mapPartitions { iter => > if (sortBasedShuffleOn) { > iter.map(r => (null, r.copy())) > } else { > val mutablePair = new MutablePair[Null, Row]() > iter.map(r => mutablePair.update(null, r)) > } > } > {code} > The above snippet from Exchange references sortBasedShuffleOn within a > closure, which requires pulling in the entire Exchange object in the closure. > This is a tiny teeny optimization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3408) Limit operator doesn't work with sort based shuffle
[ https://issues.apache.org/jira/browse/SPARK-3408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122407#comment-14122407 ] Apache Spark commented on SPARK-3408: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/2281 > Limit operator doesn't work with sort based shuffle > --- > > Key: SPARK-3408 > URL: https://issues.apache.org/jira/browse/SPARK-3408 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1473) Feature selection for high dimensional datasets
[ https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122390#comment-14122390 ] sam commented on SPARK-1473: [~dmm...@gmail.com] mentioning you as well (I can't work out which David is the one that posted above) > Feature selection for high dimensional datasets > --- > > Key: SPARK-1473 > URL: https://issues.apache.org/jira/browse/SPARK-1473 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Ignacio Zendejas >Assignee: Alexander Ulanov >Priority: Minor > Labels: features > > For classification tasks involving large feature spaces in the order of tens > of thousands or higher (e.g., text classification with n-grams, where n > 1), > it is often useful to rank and filter features that are irrelevant thereby > reducing the feature space by at least one or two orders of magnitude without > impacting performance on key evaluation metrics (accuracy/precision/recall). > A feature evaluation interface which is flexible needs to be designed and at > least two methods should be implemented with Information Gain being a > priority as it has been shown to be amongst the most reliable. > Special consideration should be taken in the design to account for wrapper > methods (see research papers below) which are more practical for lower > dimensional data. > Relevant research: > * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional > likelihood maximisation: a unifying framework for information theoretic > feature selection.*The Journal of Machine Learning Research*, *13*, 27-66. > * Forman, George. "An extensive empirical study of feature selection metrics > for text classification." The Journal of machine learning research 3 (2003): > 1289-1305. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1473) Feature selection for high dimensional datasets
[ https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122386#comment-14122386 ] sam commented on SPARK-1473: Good paper; the theory is very solid. My only concern is that the paper does not explicitly tackle the problem of probability estimation in high dimensions, which will be even worse for sparse data. It just touches on the problem, saying: "This in turn causes increasingly poor judgements for the inclusion/exclusion of features. For precisely this reason, the research community have developed various low-dimensional approximations to (9). In the following sections, we will investigate the implicit statistical assumptions and empirical effects of these approximations" The sections it mentions do not go into theoretical detail, and therefore I disagree that the paper provides a "single unified information theoretic framework for feature selection", as it basically leaves the problem of probability estimation to the reader's choice, and merely suggests the reader assume some level of independence between features in order to implement an algorithm. [~dmborque] Do you know of any literature that does approach the problem of probability estimation in an information-theoretic and philosophically justified way? Anyway, despite my concerns, this paper is still by far the best treatment of feature selection I have seen. > Feature selection for high dimensional datasets > --- > > Key: SPARK-1473 > URL: https://issues.apache.org/jira/browse/SPARK-1473 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Ignacio Zendejas >Assignee: Alexander Ulanov >Priority: Minor > Labels: features > > For classification tasks involving large feature spaces in the order of tens > of thousands or higher (e.g., text classification with n-grams, where n > 1), > it is often useful to rank and filter features that are irrelevant thereby > reducing the feature space by at least one or two orders of magnitude without > impacting performance on key evaluation metrics (accuracy/precision/recall). > A feature evaluation interface which is flexible needs to be designed and at > least two methods should be implemented with Information Gain being a > priority as it has been shown to be amongst the most reliable. > Special consideration should be taken in the design to account for wrapper > methods (see research papers below) which are more practical for lower > dimensional data. > Relevant research: > * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional > likelihood maximisation: a unifying framework for information theoretic > feature selection.*The Journal of Machine Learning Research*, *13*, 27-66. > * Forman, George. "An extensive empirical study of feature selection metrics > for text classification." The Journal of machine learning research 3 (2003): > 1289-1305. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
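Since the issue description prioritizes Information Gain as the first ranking criterion, here is a small self-contained sketch of that criterion computed from raw counts; it is purely illustrative and is not an MLlib API (the function names and count layout are assumptions for the example):

{code}
// IG(Y; X) = H(Y) - H(Y | X) for one discrete feature X against a class label Y.
def entropy(counts: Seq[Long]): Double = {
  val total = counts.sum.toDouble
  counts.filter(_ > 0).map { c =>
    val p = c / total
    -p * math.log(p) / math.log(2)
  }.sum
}

/** countsByFeatureValue(v) = class-label counts among examples where X = v. */
def informationGain(countsByFeatureValue: Map[Int, Map[String, Long]]): Double = {
  val labelCounts = countsByFeatureValue.values.flatten
    .groupBy(_._1).map { case (_, kvs) => kvs.map(_._2).sum }.toSeq
  val total = labelCounts.sum.toDouble
  val conditionalEntropy = countsByFeatureValue.values.map { byLabel =>
    (byLabel.values.sum / total) * entropy(byLabel.values.toSeq)
  }.sum
  entropy(labelCounts) - conditionalEntropy
}
{code}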
[jira] [Created] (SPARK-3409) Avoid pulling in Exchange operator itself in Exchange's closures
Reynold Xin created SPARK-3409: -- Summary: Avoid pulling in Exchange operator itself in Exchange's closures Key: SPARK-3409 URL: https://issues.apache.org/jira/browse/SPARK-3409 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin {code} val rdd = child.execute().mapPartitions { iter => if (sortBasedShuffleOn) { iter.map(r => (null, r.copy())) } else { val mutablePair = new MutablePair[Null, Row]() iter.map(r => mutablePair.update(null, r)) } } {code} The above snippet from Exchange references sortBasedShuffleOn within a closure, which requires pulling in the entire Exchange object in the closure. This is a tiny teeny optimization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
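The usual fix for this kind of closure capture (a sketch of the pattern, not necessarily the exact change made for this issue) is to copy the field into a local val first, so the closure captures only a Boolean rather than the enclosing Exchange instance:

{code}
val sortBased = sortBasedShuffleOn  // local copy; only this value is captured
val rdd = child.execute().mapPartitions { iter =>
  if (sortBased) {
    iter.map(r => (null, r.copy()))
  } else {
    val mutablePair = new MutablePair[Null, Row]()
    iter.map(r => mutablePair.update(null, r))
  }
}
{code}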
[jira] [Created] (SPARK-3408) Limit operator doesn't work with sort based shuffle
Reynold Xin created SPARK-3408: -- Summary: Limit operator doesn't work with sort based shuffle Key: SPARK-3408 URL: https://issues.apache.org/jira/browse/SPARK-3408 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Reynold Xin Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3392) Set command always get for key "mapred.reduce.tasks"
[ https://issues.apache.org/jira/browse/SPARK-3392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3392. - Resolution: Fixed Fix Version/s: 1.2.0 > Set command always get for key "mapred.reduce.tasks" > > > Key: SPARK-3392 > URL: https://issues.apache.org/jira/browse/SPARK-3392 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Cheng Hao >Assignee: Cheng Hao >Priority: Trivial > Fix For: 1.2.0 > > > This is a tiny fix for getting the value of "mapred.reduce.tasks", which makes > more sense for the Hive user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122292#comment-14122292 ] Saisai Shao commented on SPARK-2926: Hi Matei, sorry for the late response. I will test more scenarios with your notes, and also factor things out to see if some code can be shared with ExternalSorter. Thanks a lot. > Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle > -- > > Key: SPARK-2926 > URL: https://issues.apache.org/jira/browse/SPARK-2926 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 1.1.0 >Reporter: Saisai Shao > Attachments: SortBasedShuffleRead.pdf, Spark Shuffle Test Report.pdf > > > Currently Spark has already integrated sort-based shuffle write, which > greatly improves the IO performance and reduces the memory consumption when > the reducer number is very large. But on the reducer side, it still adopts the > implementation of the hash-based shuffle reader, which neglects the ordering > attributes of map output data in some situations. > Here we propose an MR-style, sort-merge-like shuffle reader for sort-based > shuffle to further improve the performance of sort-based shuffle. > Work-in-progress code and a performance test report will be posted later > when some unit test bugs are fixed. > Any comments would be greatly appreciated. > Thanks a lot. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
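The core of an MR-style read path is merging the per-map sorted streams on the reduce side. A generic sketch of just that merge step (not the proposed SortShuffleReader itself, and independent of how much of it could be shared with ExternalSorter):

{code}
import scala.collection.mutable

// Merge k iterators that are each already sorted by key into one sorted iterator.
def mergeSorted[K, V](streams: Seq[Iterator[(K, V)]])
                     (implicit ord: Ordering[K]): Iterator[(K, V)] = {
  // Order buffered iterators by their head key; reversed because PriorityQueue is a max-heap.
  val heap = new mutable.PriorityQueue[BufferedIterator[(K, V)]]()(
    Ordering.by[BufferedIterator[(K, V)], K](_.head._1).reverse)
  streams.map(_.buffered).filter(_.hasNext).foreach(heap.enqueue(_))
  new Iterator[(K, V)] {
    override def hasNext: Boolean = heap.nonEmpty
    override def next(): (K, V) = {
      val it = heap.dequeue()
      val kv = it.next()
      if (it.hasNext) heap.enqueue(it)   // re-insert with its new head key
      kv
    }
  }
}
{code}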
[jira] [Created] (SPARK-3407) Add Date type support
Cheng Hao created SPARK-3407: Summary: Add Date type support Key: SPARK-3407 URL: https://issues.apache.org/jira/browse/SPARK-3407 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122274#comment-14122274 ] Saisai Shao commented on SPARK-3129: Hi [~hshreedharan], thanks for your reply. Is this PR (https://github.com/apache/spark/pull/1195) the one you mentioned regarding storeReliably()? As I understand it, this API aims to store a bunch of messages into the BlockManager directly to make them reliable, but for some receivers, like Kafka, socket and others, data is injected one message at a time. We can't call storeReliably() on every message because of efficiency and throughput concerns, so we need to buffer the data locally up to some amount and then flush it to the BlockManager using storeReliably(). So I think data can still potentially be lost while we buffer it locally. These days I have been thinking about the WAL approach; IMHO a WAL would be a better solution compared to the blocking store API. > Prevent data loss in Spark Streaming > > > Key: SPARK-3129 > URL: https://issues.apache.org/jira/browse/SPARK-3129 > Project: Spark > Issue Type: New Feature >Reporter: Hari Shreedharan >Assignee: Hari Shreedharan > Attachments: StreamingPreventDataLoss.pdf > > > Spark Streaming can lose small amounts of data when the driver goes down - and the > sending system cannot re-send the data (or the data has already expired on > the sender side). The document attached has more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
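A purely hypothetical sketch of the buffering pattern being discussed, to make the trade-off concrete; storeReliably here stands in for the proposed API from the linked PR (not an existing Spark method), and the flush threshold is made up:

{code}
import scala.collection.mutable.ArrayBuffer

class BatchingReceiverSketch[T](storeReliably: Seq[T] => Unit, flushSize: Int = 1000) {
  private val buffer = new ArrayBuffer[T]()

  // Called once per incoming message. Whatever is still sitting in `buffer`
  // when the receiver dies is exactly the data this comment worries about,
  // which is what a write-ahead log would additionally protect.
  def onMessage(msg: T): Unit = {
    buffer += msg
    if (buffer.size >= flushSize) flush()
  }

  def flush(): Unit = if (buffer.nonEmpty) {
    storeReliably(buffer.toList)
    buffer.clear()
  }
}
{code}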
[jira] [Commented] (SPARK-2430) Standardized Clustering Algorithm API and Framework
[ https://issues.apache.org/jira/browse/SPARK-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122273#comment-14122273 ] RJ Nowling commented on SPARK-2430: --- Hi Yu, The community has suggested looking into scikit-learn's API, so that is a good idea. I am hesitant to make backwards-incompatible API changes, however, until we know the new API will be stable for a long time. I think it would be best to implement a few more clustering algorithms to get a clear idea of what is similar versus what is different before making a new API. May I suggest you work on SPARK-2966 / SPARK-2429 first? RJ > Standardized Clustering Algorithm API and Framework > -- > > Key: SPARK-2430 > URL: https://issues.apache.org/jira/browse/SPARK-2430 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Priority: Minor > > Recently, there has been a chorus of voices on the mailing lists about adding > new clustering algorithms to MLlib. To support these additions, we should > develop a common framework and API to reduce code duplication and keep the > APIs consistent. > At the same time, we can also expand the current API to incorporate requested > features such as arbitrary distance metrics or pre-computed distance matrices. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2966) Add an approximation algorithm for hierarchical clustering to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122266#comment-14122266 ] RJ Nowling commented on SPARK-2966: --- No worries. Based on my reading of the Spark contribution guidelines ( https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark ), I think that the Spark community would prefer to have one good implementation of an algorithm instead of multiple similar algorithms. Since the community has stated a clear preference for divisive hierarchical clustering, I think that is a better aim. You seem very motivated and have made some good contributions -- would you like to take the lead on the hierarchical clustering? I can review your code to help you improve it. That said, I suggest you look at the comment I added to SPARK-2429 and see what you think of that approach. If you like the example code and papers, why don't you work on implementing it efficiently in Spark? > Add an approximation algorithm for hierarchical clustering to MLlib > --- > > Key: SPARK-2966 > URL: https://issues.apache.org/jira/browse/SPARK-2966 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Yu Ishikawa >Priority: Minor > > A hierarchical clustering algorithm is a useful unsupervised learning method. > Koga et al. proposed a highly scalable hierarchical clustering algorithm in > (1). > I would like to implement this method. > I suggest adding an approximate hierarchical clustering algorithm to MLlib. > I'd like this to be assigned to me. > h3. Reference > # Fast agglomerative hierarchical clustering algorithm using > Locality-Sensitive Hashing > http://dl.acm.org/citation.cfm?id=1266811 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2219) AddJar doesn't work
[ https://issues.apache.org/jira/browse/SPARK-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2219. - Resolution: Fixed Fix Version/s: 1.2.0 > AddJar doesn't work > --- > > Key: SPARK-2219 > URL: https://issues.apache.org/jira/browse/SPARK-2219 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Cheng Lian > Fix For: 1.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3310) Directly use currentTable without unnecessary implicit conversion
[ https://issues.apache.org/jira/browse/SPARK-3310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3310. - Resolution: Fixed Fix Version/s: 1.2.0 > Directly use currentTable without unnecessary implicit conversion > - > > Key: SPARK-3310 > URL: https://issues.apache.org/jira/browse/SPARK-3310 > Project: Spark > Issue Type: Improvement >Reporter: Liang-Chi Hsieh >Priority: Minor > Fix For: 1.2.0 > > > We can directly use currentTable in function cacheTable without unnecessary > implicit conversion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2966) Add an approximation algorithm for hierarchical clustering to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122249#comment-14122249 ] Yu Ishikawa commented on SPARK-2966: I'm sorry for not checking the community discussion and the existing JIRA issue. Thank you for letting me know. We would be able to implement an approximation algorithm for hierarchical clustering with LSH. I think the approach of this issue is different from that of [SPARK-2429]. Should we merge this issue into [SPARK-2429]? > Add an approximation algorithm for hierarchical clustering to MLlib > --- > > Key: SPARK-2966 > URL: https://issues.apache.org/jira/browse/SPARK-2966 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Yu Ishikawa >Priority: Minor > > A hierarchical clustering algorithm is a useful unsupervised learning method. > Koga et al. proposed a highly scalable hierarchical clustering algorithm in > (1). > I would like to implement this method. > I suggest adding an approximate hierarchical clustering algorithm to MLlib. > I'd like this to be assigned to me. > h3. Reference > # Fast agglomerative hierarchical clustering algorithm using > Locality-Sensitive Hashing > http://dl.acm.org/citation.cfm?id=1266811 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2430) Standardized Clustering Algorithm API and Framework
[ https://issues.apache.org/jira/browse/SPARK-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122242#comment-14122242 ] Yu Ishikawa commented on SPARK-2430: Hi [~rnowling], I am very interested in this issue. If possible, I am willing to work with you on it. I think MLlib's high-level API should be consistent, like scikit-learn's. You know, we can use almost all algorithms in scikit-learn through the `fit` and `predict` functions. A consistent API would be helpful for Spark users too. > Standardized Clustering Algorithm API and Framework > -- > > Key: SPARK-2430 > URL: https://issues.apache.org/jira/browse/SPARK-2430 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Priority: Minor > > Recently, there has been a chorus of voices on the mailing lists about adding > new clustering algorithms to MLlib. To support these additions, we should > develop a common framework and API to reduce code duplication and keep the > APIs consistent. > At the same time, we can also expand the current API to incorporate requested > features such as arbitrary distance metrics or pre-computed distance matrices. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
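Purely illustrative: one possible shape for a scikit-learn-style fit/predict contract for MLlib clustering, sketched as Scala traits; the trait and method names are assumptions for discussion, not an agreed-upon design:

{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

trait ClusteringAlgorithm[M <: ClusteringModel] {
  // Learn cluster structure from the data and return an immutable model.
  def fit(data: RDD[Vector]): M
}

trait ClusteringModel extends Serializable {
  // Assign a single point to a cluster index.
  def predict(point: Vector): Int
  // Bulk assignment, defined in terms of the single-point method.
  def predict(points: RDD[Vector]): RDD[Int] = points.map(p => predict(p))
}
{code}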
[jira] [Commented] (SPARK-3390) sqlContext.jsonRDD fails on a complex structure of array and hashmap nesting
[ https://issues.apache.org/jira/browse/SPARK-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122235#comment-14122235 ] Yin Huai commented on SPARK-3390: - Oh, I see the problem. I am out of town this week. Will fix it next week. > sqlContext.jsonRDD fails on a complex structure of array and hashmap nesting > > > Key: SPARK-3390 > URL: https://issues.apache.org/jira/browse/SPARK-3390 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.2 >Reporter: Vida Ha >Assignee: Yin Huai >Priority: Critical > > I found a valid JSON string which Spark SQL fails to parse correctly: > Try running these lines in a spark-shell to reproduce: > {code:borderStyle=solid} > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > val badJson = "{\"foo\": [[{\"bar\": 0}]]}" > val rdd = sc.parallelize(badJson :: Nil) > sqlContext.jsonRDD(rdd).count() > {code} > I've tried running these lines on the 1.0.2 release as well as the latest Spark 1.1 > release candidate, and I get this stack trace: > {panel} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 2.0:3 > failed 1 times, most recent failure: Exception failure in TID 7 on host > localhost: scala.MatchError: StructType(List()) (of class > org.apache.spark.sql.catalyst.types.StructType) > > org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:333) > > org.apache.spark.sql.json.JsonRDD$$anonfun$enforceCorrectType$1.apply(JsonRDD.scala:335) > > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > scala.collection.AbstractTraversable.map(Traversable.scala:105) > > org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:335) > > org.apache.spark.sql.json.JsonRDD$$anonfun$enforceCorrectType$1.apply(JsonRDD.scala:335) > > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > scala.collection.AbstractTraversable.map(Traversable.scala:105) > > org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:335) > > org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1$$anonfun$apply$12.apply(JsonRDD.scala:365) > scala.Option.map(Option.scala:145) > > org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:364) > > org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:349) > > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > > org.apache.spark.sql.json.JsonRDD$.org$apache$spark$sql$json$JsonRDD$$asRow(JsonRDD.scala:349) > > org.apache.spark.sql.json.JsonRDD$$anonfun$createLogicalPlan$1.apply(JsonRDD.scala:51) > > org.apache.spark.sql.json.JsonRDD$$anonfun$createLogicalPlan$1.apply(JsonRDD.scala:51) >
scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > > {panel} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3390) sqlContext.jsonRDD fails on a complex structure of JSON array and JSON object nesting
[ https://issues.apache.org/jira/browse/SPARK-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-3390: Summary: sqlContext.jsonRDD fails on a complex structure of JSON array and JSON object nesting (was: sqlContext.jsonRDD fails on a complex structure of array and hashmap nesting) > sqlContext.jsonRDD fails on a complex structure of JSON array and JSON object > nesting > - > > Key: SPARK-3390 > URL: https://issues.apache.org/jira/browse/SPARK-3390 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.2 >Reporter: Vida Ha >Assignee: Yin Huai >Priority: Critical > > I found a valid JSON string which Spark SQL fails to parse correctly: > Try running these lines in a spark-shell to reproduce: > {code:borderStyle=solid} > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > val badJson = "{\"foo\": [[{\"bar\": 0}]]}" > val rdd = sc.parallelize(badJson :: Nil) > sqlContext.jsonRDD(rdd).count() > {code} > I've tried running these lines on the 1.0.2 release as well as the latest Spark 1.1 > release candidate, and I get this stack trace: > {panel} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 2.0:3 > failed 1 times, most recent failure: Exception failure in TID 7 on host > localhost: scala.MatchError: StructType(List()) (of class > org.apache.spark.sql.catalyst.types.StructType) > > org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:333) > > org.apache.spark.sql.json.JsonRDD$$anonfun$enforceCorrectType$1.apply(JsonRDD.scala:335) > > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > scala.collection.AbstractTraversable.map(Traversable.scala:105) > > org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:335) > > org.apache.spark.sql.json.JsonRDD$$anonfun$enforceCorrectType$1.apply(JsonRDD.scala:335) > > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > scala.collection.AbstractTraversable.map(Traversable.scala:105) > > org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:335) > > org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1$$anonfun$apply$12.apply(JsonRDD.scala:365) > scala.Option.map(Option.scala:145) > > org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:364) > > org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:349) > > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > > org.apache.spark.sql.json.JsonRDD$.org$apache$spark$sql$json$JsonRDD$$asRow(JsonRDD.scala:349) > > org.apache.spark.sql.json.JsonRDD$$anonfun$createLogicalPlan$1.apply(JsonRDD.scala:51) > >
org.apache.spark.sql.json.JsonRDD$$anonfun$createLogicalPlan$1.apply(JsonRDD.scala:51) > scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > > {panel} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3406) Python persist API does not have a default storage level
holdenk created SPARK-3406: -- Summary: Python persist API does not have a default storage level Key: SPARK-3406 URL: https://issues.apache.org/jira/browse/SPARK-3406 Project: Spark Issue Type: Bug Components: PySpark Reporter: holdenk Priority: Minor PySpark's persist method on RDDs does not have a default storage level. This is different from the Scala API, which defaults to in-memory caching. This is minor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
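For context, this is the Scala-side behaviour the issue compares against: in a spark-shell session (so `sc` is already defined), the no-argument persist() defaults to in-memory storage, while the PySpark persist() of this era required an explicit StorageLevel argument. A brief sketch of the Scala side:

{code}
import org.apache.spark.storage.StorageLevel

val cached  = sc.parallelize(1 to 1000).persist()   // defaults to StorageLevel.MEMORY_ONLY
val spilled = sc.parallelize(1 to 1000).persist(StorageLevel.MEMORY_AND_DISK)  // explicit level
cached.count()   // first action materialises the cached data
{code}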
[jira] [Commented] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
[ https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122219#comment-14122219 ] Xiangrui Meng commented on SPARK-3403: -- I don't have a Windows system to test. There should be a runtime flag you can set to control the number of threads OpenBLAS uses. Could you try that? I will test the attached code on OS X and report back. > NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java) > - > > Key: SPARK-3403 > URL: https://issues.apache.org/jira/browse/SPARK-3403 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.2 > Environment: Setup: Windows 7, x64 libraries for netlib-java (as > described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and > MinGW64 precompiled dlls. >Reporter: Alexander Ulanov > Fix For: 1.1.0 > > Attachments: NativeNN.scala > > > Code: > val model = NaiveBayes.train(train) > val predictionAndLabels = test.map { point => > val score = model.predict(point.features) > (score, point.label) > } > predictionAndLabels.foreach(println) > Result: > program crashes with: "Process finished with exit code -1073741819 > (0xC005)" after displaying the first prediction -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3405) EC2 cluster creation on VPC
[ https://issues.apache.org/jira/browse/SPARK-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3405: --- Component/s: (was: PySpark) > EC2 cluster creation on VPC > --- > > Key: SPARK-3405 > URL: https://issues.apache.org/jira/browse/SPARK-3405 > Project: Spark > Issue Type: New Feature > Components: EC2 >Affects Versions: 1.0.2 > Environment: Ubuntu 12.04 >Reporter: Dawson Reid >Priority: Minor > > It would be very useful to be able to specify the EC2 VPC in which the Spark > cluster should be created. > When creating a Spark cluster on AWS via the spark-ec2 script there is no way > to specify a VPC id of the VPC you would like the cluster to be created in. > The script always creates the cluster in the default VPC. > In my case I have deleted the default VPC and the spark-ec2 script errors out > with the following : > Setting up security groups... > Creating security group test-master > ERROR:boto:400 Bad Request > ERROR:boto: > VPCIdNotSpecifiedNo default > VPC for this > user312a2281-81a1-4d3c-ba10-0593a886779d > Traceback (most recent call last): > File "./spark_ec2.py", line 860, in > main() > File "./spark_ec2.py", line 852, in main > real_main() > File "./spark_ec2.py", line 735, in real_main > conn, opts, cluster_name) > File "./spark_ec2.py", line 247, in launch_cluster > master_group = get_or_make_group(conn, cluster_name + "-master") > File "./spark_ec2.py", line 143, in get_or_make_group > return conn.create_security_group(name, "Spark EC2 group") > File > "/home/dawson/Develop/spark-1.0.2/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/connection.py", > line 2011, in create_security_group > File > "/home/dawson/Develop/spark-1.0.2/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/connection.py", > line 925, in get_object > boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request > > VPCIdNotSpecifiedNo default > VPC for this > user312a2281-81a1-4d3c-ba10-0593a886779d -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122114#comment-14122114 ] Hari Shreedharan commented on SPARK-3129: - Looks like simply moving the code that generates the secret and sets it in the UGI to the Client class should take care of that. > Prevent data loss in Spark Streaming > > > Key: SPARK-3129 > URL: https://issues.apache.org/jira/browse/SPARK-3129 > Project: Spark > Issue Type: New Feature >Reporter: Hari Shreedharan >Assignee: Hari Shreedharan > Attachments: StreamingPreventDataLoss.pdf > > > Spark Streaming can lose small amounts of data when the driver goes down - and the > sending system cannot re-send the data (or the data has already expired on > the sender side). The document attached has more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122103#comment-14122103 ] Hari Shreedharan commented on SPARK-3129: - I am less worried about client mode, since most streaming applications would run in cluster mode. We can make this available only in cluster mode. > Prevent data loss in Spark Streaming > > > Key: SPARK-3129 > URL: https://issues.apache.org/jira/browse/SPARK-3129 > Project: Spark > Issue Type: New Feature >Reporter: Hari Shreedharan >Assignee: Hari Shreedharan > Attachments: StreamingPreventDataLoss.pdf > > > Spark Streaming can lose small amounts of data when the driver goes down - and the > sending system cannot re-send the data (or the data has already expired on > the sender side). The document attached has more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3378) Replace the word "SparkSQL" with right word "Spark SQL"
[ https://issues.apache.org/jira/browse/SPARK-3378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3378. - Resolution: Fixed Fix Version/s: 1.2.0 > Replace the word "SparkSQL" with right word "Spark SQL" > --- > > Key: SPARK-3378 > URL: https://issues.apache.org/jira/browse/SPARK-3378 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 1.1.0 >Reporter: Kousuke Saruta >Priority: Trivial > Fix For: 1.2.0 > > > In programming-guide.md, there are 2 "SparkSQL". We should use "Spark SQL" > instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122064#comment-14122064 ] Thomas Graves commented on SPARK-3129: -- On YARN, it generates the secret automatically. In cluster mode, it does so in the ApplicationMaster. Since the secret is generated in the ApplicationMaster, it goes away when the ApplicationMaster dies. If the secret were generated on the client side and populated into the credentials in the UGI, similar to how we handle tokens, then a restart of the AM in cluster mode should be able to pick it back up. This won't work for client mode, though, since the client/Spark driver wouldn't have a way to get hold of the UGI again. > Prevent data loss in Spark Streaming > > > Key: SPARK-3129 > URL: https://issues.apache.org/jira/browse/SPARK-3129 > Project: Spark > Issue Type: New Feature >Reporter: Hari Shreedharan >Assignee: Hari Shreedharan > Attachments: StreamingPreventDataLoss.pdf > > > Spark Streaming can lose small amounts of data when the driver goes down - and the > sending system cannot re-send the data (or the data has already expired on > the sender side). The document attached has more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
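If the secret were carried in the UGI credentials as suggested, the Hadoop API involved would look roughly like the sketch below; the credential alias and the way the secret is generated are assumptions for illustration, not Spark's actual implementation:

{code}
import java.security.SecureRandom
import org.apache.hadoop.io.Text
import org.apache.hadoop.security.{Credentials, UserGroupInformation}

def attachSecretToUgi(): Array[Byte] = {
  val secret = new Array[Byte](32)
  new SecureRandom().nextBytes(secret)
  val creds = new Credentials()
  creds.addSecretKey(new Text("spark.authenticate.secret"), secret)  // alias is made up
  // A relaunched AM running under the same UGI could read it back with
  // UserGroupInformation.getCurrentUser.getCredentials.getSecretKey(...).
  UserGroupInformation.getCurrentUser.addCredentials(creds)
  secret
}
{code}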
[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122017#comment-14122017 ] Hari Shreedharan commented on SPARK-3129: - [~tgraves] - Am I correct in assuming that using Akka automatically gives us the shared-secret authentication if spark.authenticate is set to true, even when the AM is restarted by YARN itself (since it is the same application, it theoretically has access to the same shared secret and thus should be able to communicate via Akka)? > Prevent data loss in Spark Streaming > > > Key: SPARK-3129 > URL: https://issues.apache.org/jira/browse/SPARK-3129 > Project: Spark > Issue Type: New Feature >Reporter: Hari Shreedharan >Assignee: Hari Shreedharan > Attachments: StreamingPreventDataLoss.pdf > > > Spark Streaming can lose small amounts of data when the driver goes down - and the > sending system cannot re-send the data (or the data has already expired on > the sender side). The document attached has more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122006#comment-14122006 ] Hari Shreedharan commented on SPARK-3129: - Yes, so my initial goal is to be able to recover all the blocks that have not been made into an RDD yet (at which point it would be safe). There is also data which may not have become a block yet (data added using the += operator). For now, I am going to call it fair game to say that we will add storeReliably(ArrayBuffer/Iterable) methods, which are the only ones that store data such that it is guaranteed to be recoverable. At a later stage, we could use something like a WAL on HDFS to recover even the += data, though that would affect performance. > Prevent data loss in Spark Streaming > > > Key: SPARK-3129 > URL: https://issues.apache.org/jira/browse/SPARK-3129 > Project: Spark > Issue Type: New Feature >Reporter: Hari Shreedharan >Assignee: Hari Shreedharan > Attachments: StreamingPreventDataLoss.pdf > > > Spark Streaming can lose small amounts of data when the driver goes down - and the > sending system cannot re-send the data (or the data has already expired on > the sender side). The document attached has more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3405) EC2 cluster creation on VPC
Dawson Reid created SPARK-3405: -- Summary: EC2 cluster creation on VPC Key: SPARK-3405 URL: https://issues.apache.org/jira/browse/SPARK-3405 Project: Spark Issue Type: New Feature Components: EC2, PySpark Affects Versions: 1.0.2 Environment: Ubuntu 12.04 Reporter: Dawson Reid Priority: Minor It would be very useful to be able to specify the EC2 VPC in which the Spark cluster should be created. When creating a Spark cluster on AWS via the spark-ec2 script there is no way to specify a VPC id of the VPC you would like the cluster to be created in. The script always creates the cluster in the default VPC. In my case I have deleted the default VPC and the spark-ec2 script errors out with the following : Setting up security groups... Creating security group test-master ERROR:boto:400 Bad Request ERROR:boto: VPCIdNotSpecifiedNo default VPC for this user312a2281-81a1-4d3c-ba10-0593a886779d Traceback (most recent call last): File "./spark_ec2.py", line 860, in main() File "./spark_ec2.py", line 852, in main real_main() File "./spark_ec2.py", line 735, in real_main conn, opts, cluster_name) File "./spark_ec2.py", line 247, in launch_cluster master_group = get_or_make_group(conn, cluster_name + "-master") File "./spark_ec2.py", line 143, in get_or_make_group return conn.create_security_group(name, "Spark EC2 group") File "/home/dawson/Develop/spark-1.0.2/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/connection.py", line 2011, in create_security_group File "/home/dawson/Develop/spark-1.0.2/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/connection.py", line 925, in get_object boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request VPCIdNotSpecifiedNo default VPC for this user312a2281-81a1-4d3c-ba10-0593a886779d -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-640) Update Hadoop 1 version to 1.1.0 (especially on AMIs)
[ https://issues.apache.org/jira/browse/SPARK-640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121883#comment-14121883 ] Matei Zaharia commented on SPARK-640: - [~pwendell] what is our Hadoop 1 version on AMIs now? > Update Hadoop 1 version to 1.1.0 (especially on AMIs) > - > > Key: SPARK-640 > URL: https://issues.apache.org/jira/browse/SPARK-640 > Project: Spark > Issue Type: New Feature >Reporter: Matei Zaharia > > Hadoop 1.1.0 has a fix to the notorious "trailing slash for directory objects > in S3" issue: https://issues.apache.org/jira/browse/HADOOP-5836, so would be > good to support on the AMIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2334) Attribute Error calling PipelinedRDD.id() in pyspark
[ https://issues.apache.org/jira/browse/SPARK-2334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-2334: -- Affects Version/s: 1.1.0 > Attribute Error calling PipelinedRDD.id() in pyspark > > > Key: SPARK-2334 > URL: https://issues.apache.org/jira/browse/SPARK-2334 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.0.0, 1.1.0 >Reporter: Diana Carroll > > calling the id() function of a PipelinedRDD causes an error in PySpark. > (Works fine in Scala.) > The second id() call here fails, the first works: > {code} > r1 = sc.parallelize([1,2,3]) > r1.id() > r2=r1.map(lambda i: i+1) > r2.id() > {code} > Error: > {code} > --- > AttributeErrorTraceback (most recent call last) > in () > > 1 r2.id() > /usr/lib/spark/python/pyspark/rdd.py in id(self) > 180 A unique ID for this RDD (within its SparkContext). > 181 """ > --> 182 return self._id > 183 > 184 def __repr__(self): > AttributeError: 'PipelinedRDD' object has no attribute '_id' > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-3061) Maven build fails in Windows OS
[ https://issues.apache.org/jira/browse/SPARK-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-3061: - Assignee: Andrew Or (was: Josh Rosen) Re-assigning to Andrew, who's going to backport it. > Maven build fails in Windows OS > --- > > Key: SPARK-3061 > URL: https://issues.apache.org/jira/browse/SPARK-3061 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.0.2, 1.1.0 > Environment: Windows >Reporter: Masayoshi TSUZUKI >Assignee: Andrew Or >Priority: Minor > Fix For: 1.2.0 > > > Maven build fails in Windows OS with this error message. > {noformat} > [ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.2.1:exec > (default) on project spark-core_2.10: Command execution failed. Cannot run > program "unzip" (in directory "C:\path\to\gitofspark\python"): CreateProcess > error=2, The system cannot find the file specified -> [Help 1] > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2015) Spark UI issues at scale
[ https://issues.apache.org/jira/browse/SPARK-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-2015: -- Component/s: Web UI > Spark UI issues at scale > > > Key: SPARK-2015 > URL: https://issues.apache.org/jira/browse/SPARK-2015 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.0.0 >Reporter: Reynold Xin > > This is an umbrella ticket for issues related to Spark's web ui when we run > Spark at scale (large datasets, large number of machines, or large number of > tasks). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3284) saveAsParquetFile not working on windows
[ https://issues.apache.org/jira/browse/SPARK-3284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pravesh Jain updated SPARK-3284: Description: {code} object parquet { case class Person(name: String, age: Int) def main(args: Array[String]) { val sparkConf = new SparkConf().setMaster("local").setAppName("HdfsWordCount") val sc = new SparkContext(sparkConf) val sqlContext = new org.apache.spark.sql.SQLContext(sc) // createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD. import sqlContext.createSchemaRDD val people = sc.textFile("C:/Users/pravesh.jain/Desktop/people/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)) people.saveAsParquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet") val parquetFile = sqlContext.parquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet") } } {code} gives the error Exception in thread "main" java.lang.NullPointerException at org.apache.spark.parquet$.main(parquet.scala:16) which is the line saveAsParquetFile. This works fine on Linux, but running it in Eclipse on Windows gives the error. was: object parquet { case class Person(name: String, age: Int) def main(args: Array[String]) { val sparkConf = new SparkConf().setMaster("local").setAppName("HdfsWordCount") val sc = new SparkContext(sparkConf) val sqlContext = new org.apache.spark.sql.SQLContext(sc) // createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD. import sqlContext.createSchemaRDD val people = sc.textFile("C:/Users/pravesh.jain/Desktop/people/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)) people.saveAsParquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet") val parquetFile = sqlContext.parquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet") } } gives the error Exception in thread "main" java.lang.NullPointerException at org.apache.spark.parquet$.main(parquet.scala:16) which is the line saveAsParquetFile. This works fine on Linux, but running it in Eclipse on Windows gives the error. > saveAsParquetFile not working on windows > > > Key: SPARK-3284 > URL: https://issues.apache.org/jira/browse/SPARK-3284 > Project: Spark > Issue Type: Bug >Affects Versions: 1.0.2 > Environment: Windows >Reporter: Pravesh Jain >Priority: Minor > > {code} > object parquet { > case class Person(name: String, age: Int) > def main(args: Array[String]) { > val sparkConf = new > SparkConf().setMaster("local").setAppName("HdfsWordCount") > val sc = new SparkContext(sparkConf) > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > // createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD. > import sqlContext.createSchemaRDD > val people = > sc.textFile("C:/Users/pravesh.jain/Desktop/people/people.txt").map(_.split(",")).map(p > => Person(p(0), p(1).trim.toInt)) > > people.saveAsParquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet") > val parquetFile = > sqlContext.parquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet") > } > } > {code} > gives the error > Exception in thread "main" java.lang.NullPointerException at > org.apache.spark.parquet$.main(parquet.scala:16) > which is the line saveAsParquetFile. > This works fine on Linux, but running it in Eclipse on Windows gives the error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3286) Cannot view ApplicationMaster UI when Yarn’s url scheme is https
[ https://issues.apache.org/jira/browse/SPARK-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-3286: -- Component/s: Web UI > Cannot view ApplicationMaster UI when Yarn’s url scheme is https > > > Key: SPARK-3286 > URL: https://issues.apache.org/jira/browse/SPARK-3286 > Project: Spark > Issue Type: Bug > Components: Web UI, YARN >Affects Versions: 1.0.2 >Reporter: Benoy Antony > Attachments: SPARK-3286-branch-1-0.patch, SPARK-3286.patch > > > The spark Application Master starts its web UI at http://:port. > When Spark ApplicationMaster registers its URL with Resource Manager , the > URL does not contain URI scheme. > If the URL scheme is absent, Resource Manager’s web app proxy will use the > HTTP Policy of the Resource Manager.(YARN-1553) > If the HTTP Policy of the Resource Manager is https, then web app proxy will > try to access https://:port. > This will result in error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1078) Replace lift-json with json4s-jackson
[ https://issues.apache.org/jira/browse/SPARK-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-1078. --- Resolution: Fixed Fix Version/s: 1.0.0 It looks like this was fixed in SPARK-1132 / Spark 1.0.0, where we migrated to json4s.jackson. > Replace lift-json with json4s-jackson > - > > Key: SPARK-1078 > URL: https://issues.apache.org/jira/browse/SPARK-1078 > Project: Spark > Issue Type: Task > Components: Deploy, Web UI >Affects Versions: 0.9.0 >Reporter: William Benton >Priority: Minor > Fix For: 1.0.0 > > > json4s-jackson is a Jackson-backed implementation of the Json4s common JSON > API for Scala JSON libraries. (Evan Chan has a nice comparison of Scala JSON > libraries here: > http://engineering.ooyala.com/blog/comparing-scala-json-libraries) It is > Apache-licensed, mostly API-compatible with lift-json, and easier for > downstream operating system distributions to consume than lift-json. > In terms of performance, json4s-jackson is slightly slower but comparable to > lift-json on my machine when parsing very small JSON files (< 2kb and < ~30 > objects), around 40% faster than lift-json on medium-sized files (~50kb), and > significantly (~10x) faster on multi-megabyte files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
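For readers comparing the two libraries, a minimal json4s-jackson usage sketch (illustrative only, not code from Spark) showing the parse/DSL/render API that is largely compatible with lift-json:
{code}
import org.json4s._
import org.json4s.jackson.JsonMethods._
import org.json4s.JsonDSL._

object Json4sExample extends App {
  implicit val formats: Formats = DefaultFormats

  // Parse a JSON string into an AST and extract a field.
  val ast: JValue = parse("""{"name":"worker-1","cores":8}""")
  val cores = (ast \ "cores").extract[Int]

  // Build JSON with the DSL and render it back to a compact string.
  val doc: JValue = ("name" -> "worker-1") ~ ("cores" -> cores)
  println(compact(render(doc)))   // {"name":"worker-1","cores":8}
}
{code}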
[jira] [Commented] (SPARK-3404) SparkSubmitSuite fails with "spark-submit exits with code 1"
[ https://issues.apache.org/jira/browse/SPARK-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121653#comment-14121653 ] Sean Owen commented on SPARK-3404: -- It's 100% repeatable in Maven for me locally, which seems to be Jenkins' experience too. I don't see the same problem with SBT (/dev/run-tests) locally, although I can't say I run that regularly. I could rewrite the SparkSubmitSuite to submit a JAR file that actually contains the class it's trying to invoke. Maybe that's smarter? the problem here seems to be the vagaries of what the run-time classpath is during an SBT vs Maven test. Would anyone second that? Separately it would probably not hurt to get in that change that logs stdout / stderr from the Utils method. > SparkSubmitSuite fails with "spark-submit exits with code 1" > > > Key: SPARK-3404 > URL: https://issues.apache.org/jira/browse/SPARK-3404 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.0.2, 1.1.0 >Reporter: Sean Owen >Priority: Critical > > Maven-based Jenkins builds have been failing for over a month. For example: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/ > It's SparkSubmitSuite that fails. For example: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/541/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/consoleFull > {code} > SparkSubmitSuite > ... > - launch simple application with spark-submit *** FAILED *** > org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, > org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, > local, file:/tmp/1409815981504-0/testJar-1409815981505.jar) exited with code 1 > at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) > at > org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > ... 
> - spark submit includes jars passed in through --jar *** FAILED *** > org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, > org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, > local-cluster[2,1,512], --jars, > file:/tmp/1409815984960-0/testJar-1409815985029.jar,file:/tmp/1409815985030-0/testJar-1409815985087.jar, > file:/tmp/1409815984959-0/testJar-1409815984959.jar) exited with code 1 > at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) > at > org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > ... > {code} > SBT builds don't fail, so it is likely to be due to some difference in how > the tests are run rather than a problem with test or core project. > This is related to http://issues.apache.org/jira/browse/SPARK-3330 but the > cause identified in that JIRA is, at least, not the only cause. (Although, it > wouldn't hurt to be doubly-sure this is not an issue by changing the Jenkins > config to invoke {{mvn clean && mvn ... package}} {{mvn ... clean package}}.) > This JIRA tracks investigation into a different cause. Right now I have some > further information but not a PR yet. > Part of the issue is that there is no clue in the log about why > {{spark-submit}} exited with status 1. See > https://github.com/apache/spark/pull/2108/files and > https://issues.apache.org/jira/browse/SPARK-3193 for a change that would at > least print stdout to the log too. > The SparkSubmit program exits with 1 when the main class it is supposed to > run is not found > (https://github.com/a
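The suggestion above to log what {{spark-submit}} actually printed is easy to prototype. Below is a sketch (not the actual Utils.executeAndGetOutput implementation) that captures both streams with scala.sys.process so a non-zero exit code comes with context; the jar path is illustrative.
{code}
import scala.sys.process._

// Illustrative helper: run a command and return (exit code, stdout, stderr).
def runAndCapture(command: Seq[String]): (Int, String, String) = {
  val out = new StringBuilder
  val err = new StringBuilder
  val logger = ProcessLogger(line => out.append(line).append('\n'),
                             line => err.append(line).append('\n'))
  val exitCode = Process(command).!(logger)
  (exitCode, out.toString, err.toString)
}

// Usage mirroring the failing test's invocation:
val (code, stdout, stderr) = runAndCapture(Seq(
  "./bin/spark-submit",
  "--class", "org.apache.spark.deploy.SimpleApplicationTest",
  "--name", "testApp",
  "--master", "local",
  "file:/tmp/testJar.jar"))
if (code != 0) {
  println(s"spark-submit exited with $code\nstdout:\n$stdout\nstderr:\n$stderr")
}
{code}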
[jira] [Updated] (SPARK-3404) SparkSubmitSuite fails with "spark-submit exits with code 1"
[ https://issues.apache.org/jira/browse/SPARK-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3404: - Affects Version/s: 1.1.0 > SparkSubmitSuite fails with "spark-submit exits with code 1" > > > Key: SPARK-3404 > URL: https://issues.apache.org/jira/browse/SPARK-3404 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.0.2, 1.1.0 >Reporter: Sean Owen >Priority: Critical > > Maven-based Jenkins builds have been failing for over a month. For example: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/ > It's SparkSubmitSuite that fails. For example: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/541/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/consoleFull > {code} > SparkSubmitSuite > ... > - launch simple application with spark-submit *** FAILED *** > org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, > org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, > local, file:/tmp/1409815981504-0/testJar-1409815981505.jar) exited with code 1 > at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) > at > org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > ... > - spark submit includes jars passed in through --jar *** FAILED *** > org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, > org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, > local-cluster[2,1,512], --jars, > file:/tmp/1409815984960-0/testJar-1409815985029.jar,file:/tmp/1409815985030-0/testJar-1409815985087.jar, > file:/tmp/1409815984959-0/testJar-1409815984959.jar) exited with code 1 > at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) > at > org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > ... > {code} > SBT builds don't fail, so it is likely to be due to some difference in how > the tests are run rather than a problem with test or core project. > This is related to http://issues.apache.org/jira/browse/SPARK-3330 but the > cause identified in that JIRA is, at least, not the only cause. 
(Although, it > wouldn't hurt to be doubly-sure this is not an issue by changing the Jenkins > config to invoke {{mvn clean && mvn ... package}} {{mvn ... clean package}}.) > This JIRA tracks investigation into a different cause. Right now I have some > further information but not a PR yet. > Part of the issue is that there is no clue in the log about why > {{spark-submit}} exited with status 1. See > https://github.com/apache/spark/pull/2108/files and > https://issues.apache.org/jira/browse/SPARK-3193 for a change that would at > least print stdout to the log too. > The SparkSubmit program exits with 1 when the main class it is supposed to > run is not found > (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L322) > This is for example SimpleApplicationTest > (https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala#L339) > The test actually submits an empty JAR not containing this class. It relies > on {{spark-submit}} finding the class within the compiled test-classes of the > Spark project. However it does seem to be compiled and present even with > Maven. > If modified to print stdout and stderr, and dump the actual command, I see an > empty stdout, and only
[jira] [Commented] (SPARK-3404) SparkSubmitSuite fails with "spark-submit exits with code 1"
[ https://issues.apache.org/jira/browse/SPARK-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121650#comment-14121650 ] Andrew Or commented on SPARK-3404: -- I have updated the title to reflect this. > SparkSubmitSuite fails with "spark-submit exits with code 1" > > > Key: SPARK-3404 > URL: https://issues.apache.org/jira/browse/SPARK-3404 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.0.2, 1.1.0 >Reporter: Sean Owen >Priority: Critical > > Maven-based Jenkins builds have been failing for over a month. For example: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/ > It's SparkSubmitSuite that fails. For example: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/541/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/consoleFull > {code} > SparkSubmitSuite > ... > - launch simple application with spark-submit *** FAILED *** > org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, > org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, > local, file:/tmp/1409815981504-0/testJar-1409815981505.jar) exited with code 1 > at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) > at > org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > ... > - spark submit includes jars passed in through --jar *** FAILED *** > org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, > org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, > local-cluster[2,1,512], --jars, > file:/tmp/1409815984960-0/testJar-1409815985029.jar,file:/tmp/1409815985030-0/testJar-1409815985087.jar, > file:/tmp/1409815984959-0/testJar-1409815984959.jar) exited with code 1 > at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) > at > org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > ... > {code} > SBT builds don't fail, so it is likely to be due to some difference in how > the tests are run rather than a problem with test or core project. > This is related to http://issues.apache.org/jira/browse/SPARK-3330 but the > cause identified in that JIRA is, at least, not the only cause. 
(Although, it > wouldn't hurt to be doubly-sure this is not an issue by changing the Jenkins > config to invoke {{mvn clean && mvn ... package}} {{mvn ... clean package}}.) > This JIRA tracks investigation into a different cause. Right now I have some > further information but not a PR yet. > Part of the issue is that there is no clue in the log about why > {{spark-submit}} exited with status 1. See > https://github.com/apache/spark/pull/2108/files and > https://issues.apache.org/jira/browse/SPARK-3193 for a change that would at > least print stdout to the log too. > The SparkSubmit program exits with 1 when the main class it is supposed to > run is not found > (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L322) > This is for example SimpleApplicationTest > (https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala#L339) > The test actually submits an empty JAR not containing this class. It relies > on {{spark-submit}} finding the class within the compiled test-classes of the > Spark project. However it does seem to be compiled and present even with > Maven. > If modified to print stdout and s
[jira] [Updated] (SPARK-3404) SparkSubmitSuite fails with "spark-submit exits with code 1"
[ https://issues.apache.org/jira/browse/SPARK-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3404: - Target Version/s: 1.1.1 > SparkSubmitSuite fails with "spark-submit exits with code 1" > > > Key: SPARK-3404 > URL: https://issues.apache.org/jira/browse/SPARK-3404 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.0.2, 1.1.0 >Reporter: Sean Owen >Priority: Critical > > Maven-based Jenkins builds have been failing for over a month. For example: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/ > It's SparkSubmitSuite that fails. For example: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/541/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/consoleFull > {code} > SparkSubmitSuite > ... > - launch simple application with spark-submit *** FAILED *** > org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, > org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, > local, file:/tmp/1409815981504-0/testJar-1409815981505.jar) exited with code 1 > at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) > at > org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > ... > - spark submit includes jars passed in through --jar *** FAILED *** > org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, > org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, > local-cluster[2,1,512], --jars, > file:/tmp/1409815984960-0/testJar-1409815985029.jar,file:/tmp/1409815985030-0/testJar-1409815985087.jar, > file:/tmp/1409815984959-0/testJar-1409815984959.jar) exited with code 1 > at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) > at > org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > ... > {code} > SBT builds don't fail, so it is likely to be due to some difference in how > the tests are run rather than a problem with test or core project. > This is related to http://issues.apache.org/jira/browse/SPARK-3330 but the > cause identified in that JIRA is, at least, not the only cause. 
(Although, it > wouldn't hurt to be doubly-sure this is not an issue by changing the Jenkins > config to invoke {{mvn clean && mvn ... package}} {{mvn ... clean package}}.) > This JIRA tracks investigation into a different cause. Right now I have some > further information but not a PR yet. > Part of the issue is that there is no clue in the log about why > {{spark-submit}} exited with status 1. See > https://github.com/apache/spark/pull/2108/files and > https://issues.apache.org/jira/browse/SPARK-3193 for a change that would at > least print stdout to the log too. > The SparkSubmit program exits with 1 when the main class it is supposed to > run is not found > (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L322) > This is for example SimpleApplicationTest > (https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala#L339) > The test actually submits an empty JAR not containing this class. It relies > on {{spark-submit}} finding the class within the compiled test-classes of the > Spark project. However it does seem to be compiled and present even with > Maven. > If modified to print stdout and stderr, and dump the actual command, I see an > empty stdout, and only t
[jira] [Updated] (SPARK-3404) SparkSubmitSuite fails with "spark-submit exits with code 1"
[ https://issues.apache.org/jira/browse/SPARK-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3404: - Summary: SparkSubmitSuite fails with "spark-submit exits with code 1" (was: SparkSubmitSuite fails in Maven (only) - spark-submit exits with code 1) > SparkSubmitSuite fails with "spark-submit exits with code 1" > > > Key: SPARK-3404 > URL: https://issues.apache.org/jira/browse/SPARK-3404 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.0.2, 1.1.0 >Reporter: Sean Owen > > Maven-based Jenkins builds have been failing for over a month. For example: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/ > It's SparkSubmitSuite that fails. For example: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/541/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/consoleFull > {code} > SparkSubmitSuite > ... > - launch simple application with spark-submit *** FAILED *** > org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, > org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, > local, file:/tmp/1409815981504-0/testJar-1409815981505.jar) exited with code 1 > at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) > at > org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > ... > - spark submit includes jars passed in through --jar *** FAILED *** > org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, > org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, > local-cluster[2,1,512], --jars, > file:/tmp/1409815984960-0/testJar-1409815985029.jar,file:/tmp/1409815985030-0/testJar-1409815985087.jar, > file:/tmp/1409815984959-0/testJar-1409815984959.jar) exited with code 1 > at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) > at > org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > ... > {code} > SBT builds don't fail, so it is likely to be due to some difference in how > the tests are run rather than a problem with test or core project. > This is related to http://issues.apache.org/jira/browse/SPARK-3330 but the > cause identified in that JIRA is, at least, not the only cause. 
(Although, it > wouldn't hurt to be doubly-sure this is not an issue by changing the Jenkins > config to invoke {{mvn clean && mvn ... package}} {{mvn ... clean package}}.) > This JIRA tracks investigation into a different cause. Right now I have some > further information but not a PR yet. > Part of the issue is that there is no clue in the log about why > {{spark-submit}} exited with status 1. See > https://github.com/apache/spark/pull/2108/files and > https://issues.apache.org/jira/browse/SPARK-3193 for a change that would at > least print stdout to the log too. > The SparkSubmit program exits with 1 when the main class it is supposed to > run is not found > (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L322) > This is for example SimpleApplicationTest > (https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala#L339) > The test actually submits an empty JAR not containing this class. It relies > on {{spark-submit}} finding the class within the compiled test-classes of the > Spark project. However it does seem to be compiled and present even with > Maven. > If modifie
[jira] [Updated] (SPARK-3404) SparkSubmitSuite fails with "spark-submit exits with code 1"
[ https://issues.apache.org/jira/browse/SPARK-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3404: - Priority: Critical (was: Major) > SparkSubmitSuite fails with "spark-submit exits with code 1" > > > Key: SPARK-3404 > URL: https://issues.apache.org/jira/browse/SPARK-3404 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.0.2, 1.1.0 >Reporter: Sean Owen >Priority: Critical > > Maven-based Jenkins builds have been failing for over a month. For example: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/ > It's SparkSubmitSuite that fails. For example: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/541/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/consoleFull > {code} > SparkSubmitSuite > ... > - launch simple application with spark-submit *** FAILED *** > org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, > org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, > local, file:/tmp/1409815981504-0/testJar-1409815981505.jar) exited with code 1 > at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) > at > org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > ... > - spark submit includes jars passed in through --jar *** FAILED *** > org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, > org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, > local-cluster[2,1,512], --jars, > file:/tmp/1409815984960-0/testJar-1409815985029.jar,file:/tmp/1409815985030-0/testJar-1409815985087.jar, > file:/tmp/1409815984959-0/testJar-1409815984959.jar) exited with code 1 > at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) > at > org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > ... > {code} > SBT builds don't fail, so it is likely to be due to some difference in how > the tests are run rather than a problem with test or core project. > This is related to http://issues.apache.org/jira/browse/SPARK-3330 but the > cause identified in that JIRA is, at least, not the only cause. 
(Although, it > wouldn't hurt to be doubly-sure this is not an issue by changing the Jenkins > config to invoke {{mvn clean && mvn ... package}} {{mvn ... clean package}}.) > This JIRA tracks investigation into a different cause. Right now I have some > further information but not a PR yet. > Part of the issue is that there is no clue in the log about why > {{spark-submit}} exited with status 1. See > https://github.com/apache/spark/pull/2108/files and > https://issues.apache.org/jira/browse/SPARK-3193 for a change that would at > least print stdout to the log too. > The SparkSubmit program exits with 1 when the main class it is supposed to > run is not found > (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L322) > This is for example SimpleApplicationTest > (https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala#L339) > The test actually submits an empty JAR not containing this class. It relies > on {{spark-submit}} finding the class within the compiled test-classes of the > Spark project. However it does seem to be compiled and present even with > Maven. > If modified to print stdout and stderr, and dump the actual command, I see an > empty stdout, a
[jira] [Commented] (SPARK-3404) SparkSubmitSuite fails in Maven (only) - spark-submit exits with code 1
[ https://issues.apache.org/jira/browse/SPARK-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121646#comment-14121646 ] Andrew Or commented on SPARK-3404: -- Thanks for looking into this Sean. Does this happen all the time or only once in a while? We have observed the same tests failing on our Jenkins, which runs the test through sbt. The behavior is consistent with running it through maven. If we run it through 'sbt test-only SparkSubmitSuite' then it always passes, but if we run 'sbt test' then sometimes it fails. This has also been failing for a while for sbt. Very roughly I remember we began seeing it after https://github.com/apache/spark/pull/1777 went in. Though I have gone down that path to debug any possibilities of port collision to no avail. A related test failure is in DriverSuite, which also calls `Utils.executeAndGetOutput`. Have you seen that failing in maven? I will keep investigating it in parallel for sbt, though I suspect the root cause is the same. Let me know if you find anything. > SparkSubmitSuite fails in Maven (only) - spark-submit exits with code 1 > --- > > Key: SPARK-3404 > URL: https://issues.apache.org/jira/browse/SPARK-3404 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.0.2 >Reporter: Sean Owen > > Maven-based Jenkins builds have been failing for over a month. For example: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/ > It's SparkSubmitSuite that fails. For example: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/541/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/consoleFull > {code} > SparkSubmitSuite > ... > - launch simple application with spark-submit *** FAILED *** > org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, > org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, > local, file:/tmp/1409815981504-0/testJar-1409815981505.jar) exited with code 1 > at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) > at > org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > ... 
> - spark submit includes jars passed in through --jar *** FAILED *** > org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, > org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, > local-cluster[2,1,512], --jars, > file:/tmp/1409815984960-0/testJar-1409815985029.jar,file:/tmp/1409815985030-0/testJar-1409815985087.jar, > file:/tmp/1409815984959-0/testJar-1409815984959.jar) exited with code 1 > at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) > at > org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) > at > org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > ... > {code} > SBT builds don't fail, so it is likely to be due to some difference in how > the tests are run rather than a problem with test or core project. > This is related to http://issues.apache.org/jira/browse/SPARK-3330 but the > cause identified in that JIRA is, at least, not the only cause. (Although, it > wouldn't hurt to be doubly-sure this is not an issue by changing the Jenkins > config to invoke {{mvn clean && mvn ... package}} {{mvn ... clean package}}.) > This JIRA tracks investigation into a different cause. Right now I have some > further information but not a PR yet. > Part of the issue is that there is no clue in the log about why > {{spark-submit}} exited with status 1. See > https://github.com
[jira] [Commented] (SPARK-3286) Cannot view ApplicationMaster UI when Yarn’s url scheme is https
[ https://issues.apache.org/jira/browse/SPARK-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121608#comment-14121608 ] Apache Spark commented on SPARK-3286: - User 'benoyantony' has created a pull request for this issue: https://github.com/apache/spark/pull/2276 > Cannot view ApplicationMaster UI when Yarn’s url scheme is https > > > Key: SPARK-3286 > URL: https://issues.apache.org/jira/browse/SPARK-3286 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.0.2 >Reporter: Benoy Antony > Attachments: SPARK-3286-branch-1-0.patch, SPARK-3286.patch > > > The Spark ApplicationMaster starts its web UI at http://host:port. > When the Spark ApplicationMaster registers its URL with the Resource Manager, the > URL does not contain a URI scheme. > If the URL scheme is absent, the Resource Manager’s web app proxy will use the > HTTP Policy of the Resource Manager (YARN-1553). > If the HTTP Policy of the Resource Manager is https, then the web app proxy will > try to access https://host:port. > This will result in an error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2892) Socket Receiver does not stop when streaming context is stopped
[ https://issues.apache.org/jira/browse/SPARK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121561#comment-14121561 ] Helena Edelson edited comment on SPARK-2892 at 9/4/14 5:01 PM: --- I wonder if the ERROR should be a WARN or INFO since it occurs as a result of ReceiverSupervisorImpl receiving a StopReceiver, and " Deregistered receiver for stream" seems like the expected behavior. DEBUG 13:00:22,418 Stopping JobScheduler INFO 13:00:22,441 Received stop signal INFO 13:00:22,441 Sent stop signal to all 1 receivers INFO 13:00:22,442 Stopping receiver with message: Stopped by driver: INFO 13:00:22,442 Called receiver onStop INFO 13:00:22,443 Deregistering receiver 0 ERROR 13:00:22,445 Deregistered receiver for stream 0: Stopped by driver INFO 13:00:22,445 Stopped receiver 0 was (Author: helena_e): I wonder if the ERROR should be a WARN or INFO since it occurs as a result of ReceiverSupervisorImpl receiving a StopReceiver, and " Deregistered receiver for stream" seems like the expected behavior. > Socket Receiver does not stop when streaming context is stopped > --- > > Key: SPARK-2892 > URL: https://issues.apache.org/jira/browse/SPARK-2892 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.0.2 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > > Running NetworkWordCount with > {quote} > ssc.start(); Thread.sleep(1); ssc.stop(stopSparkContext = false); > Thread.sleep(6) > {quote} > gives the following error > {quote} > 14/08/06 18:37:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) > in 10047 ms on localhost (1/1) > 14/08/06 18:37:13 INFO DAGScheduler: Stage 0 (runJob at > ReceiverTracker.scala:275) finished in 10.056 s > 14/08/06 18:37:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks > have all completed, from pool > 14/08/06 18:37:13 INFO SparkContext: Job finished: runJob at > ReceiverTracker.scala:275, took 10.179263 s > 14/08/06 18:37:13 INFO ReceiverTracker: All of the receivers have been > terminated > 14/08/06 18:37:13 WARN ReceiverTracker: All of the receivers have not > deregistered, Map(0 -> > ReceiverInfo(0,SocketReceiver-0,null,false,localhost,Stopped by driver,)) > 14/08/06 18:37:13 INFO ReceiverTracker: ReceiverTracker stopped > 14/08/06 18:37:13 INFO JobGenerator: Stopping JobGenerator immediately > 14/08/06 18:37:13 INFO RecurringTimer: Stopped timer for JobGenerator after > time 1407375433000 > 14/08/06 18:37:13 INFO JobGenerator: Stopped JobGenerator > 14/08/06 18:37:13 INFO JobScheduler: Stopped JobScheduler > 14/08/06 18:37:13 INFO StreamingContext: StreamingContext stopped successfully > 14/08/06 18:37:43 INFO SocketReceiver: Stopped receiving > 14/08/06 18:37:43 INFO SocketReceiver: Closed socket to localhost: > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
[ https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121563#comment-14121563 ] Alexander Ulanov commented on SPARK-3403: - Yes, I tried using netlib-java separately with the same OpenBLAS setup and it worked properly, even within several threads. However I didn't mimic the same multi-threading setup as MLlib has because it is complicated. Do you want me to send you all DLLs that I used? I had troubles with compiling OpenBLAS for Windows so I used precompiled x64 versions from OpenBLAS and MinGW64 websites. > NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java) > - > > Key: SPARK-3403 > URL: https://issues.apache.org/jira/browse/SPARK-3403 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.2 > Environment: Setup: Windows 7, x64 libraries for netlib-java (as > described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and > MinGW64 precompiled dlls. >Reporter: Alexander Ulanov > Fix For: 1.1.0 > > Attachments: NativeNN.scala > > > Code: > val model = NaiveBayes.train(train) > val predictionAndLabels = test.map { point => > val score = model.predict(point.features) > (score, point.label) > } > predictionAndLabels.foreach(println) > Result: > program crashes with: "Process finished with exit code -1073741819 > (0xC005)" after displaying the first prediction -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2892) Socket Receiver does not stop when streaming context is stopped
[ https://issues.apache.org/jira/browse/SPARK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121561#comment-14121561 ] Helena Edelson commented on SPARK-2892: --- I wonder if the ERROR should be a WARN or INFO since it occurs as a result of ReceiverSupervisorImpl receiving a StopReceiver, and " Deregistered receiver for stream" seems like the expected behavior. > Socket Receiver does not stop when streaming context is stopped > --- > > Key: SPARK-2892 > URL: https://issues.apache.org/jira/browse/SPARK-2892 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.0.2 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > > Running NetworkWordCount with > {quote} > ssc.start(); Thread.sleep(1); ssc.stop(stopSparkContext = false); > Thread.sleep(6) > {quote} > gives the following error > {quote} > 14/08/06 18:37:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) > in 10047 ms on localhost (1/1) > 14/08/06 18:37:13 INFO DAGScheduler: Stage 0 (runJob at > ReceiverTracker.scala:275) finished in 10.056 s > 14/08/06 18:37:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks > have all completed, from pool > 14/08/06 18:37:13 INFO SparkContext: Job finished: runJob at > ReceiverTracker.scala:275, took 10.179263 s > 14/08/06 18:37:13 INFO ReceiverTracker: All of the receivers have been > terminated > 14/08/06 18:37:13 WARN ReceiverTracker: All of the receivers have not > deregistered, Map(0 -> > ReceiverInfo(0,SocketReceiver-0,null,false,localhost,Stopped by driver,)) > 14/08/06 18:37:13 INFO ReceiverTracker: ReceiverTracker stopped > 14/08/06 18:37:13 INFO JobGenerator: Stopping JobGenerator immediately > 14/08/06 18:37:13 INFO RecurringTimer: Stopped timer for JobGenerator after > time 1407375433000 > 14/08/06 18:37:13 INFO JobGenerator: Stopped JobGenerator > 14/08/06 18:37:13 INFO JobScheduler: Stopped JobScheduler > 14/08/06 18:37:13 INFO StreamingContext: StreamingContext stopped successfully > 14/08/06 18:37:43 INFO SocketReceiver: Stopped receiving > 14/08/06 18:37:43 INFO SocketReceiver: Closed socket to localhost: > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3404) SparkSubmitSuite fails in Maven (only) - spark-submit exits with code 1
Sean Owen created SPARK-3404: Summary: SparkSubmitSuite fails in Maven (only) - spark-submit exits with code 1 Key: SPARK-3404 URL: https://issues.apache.org/jira/browse/SPARK-3404 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.2 Reporter: Sean Owen Maven-based Jenkins builds have been failing for over a month. For example: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/ It's SparkSubmitSuite that fails. For example: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/541/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/consoleFull {code} SparkSubmitSuite ... - launch simple application with spark-submit *** FAILED *** org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, local, file:/tmp/1409815981504-0/testJar-1409815981505.jar) exited with code 1 at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) at org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) ... - spark submit includes jars passed in through --jar *** FAILED *** org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, local-cluster[2,1,512], --jars, file:/tmp/1409815984960-0/testJar-1409815985029.jar,file:/tmp/1409815985030-0/testJar-1409815985087.jar, file:/tmp/1409815984959-0/testJar-1409815984959.jar) exited with code 1 at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837) at org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) ... {code} SBT builds don't fail, so it is likely to be due to some difference in how the tests are run rather than a problem with test or core project. This is related to http://issues.apache.org/jira/browse/SPARK-3330 but the cause identified in that JIRA is, at least, not the only cause. (Although, it wouldn't hurt to be doubly-sure this is not an issue by changing the Jenkins config to invoke {{mvn clean && mvn ... package}} {{mvn ... clean package}}.) This JIRA tracks investigation into a different cause. Right now I have some further information but not a PR yet. Part of the issue is that there is no clue in the log about why {{spark-submit}} exited with status 1. 
See https://github.com/apache/spark/pull/2108/files and https://issues.apache.org/jira/browse/SPARK-3193 for a change that would at least print stdout to the log too. The SparkSubmit program exits with 1 when the main class it is supposed to run is not found (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L322) This is for example SimpleApplicationTest (https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala#L339) The test actually submits an empty JAR not containing this class. It relies on {{spark-submit}} finding the class within the compiled test-classes of the Spark project. However it does seem to be compiled and present even with Maven. If modified to print stdout and stderr, and dump the actual command, I see an empty stdout, and only the command to stderr: {code} Spark Command: /Library/Java/JavaVirtualMachines/jdk1.8.0_20.jdk/Contents/Home/bin/java -cp null::/Users/srowen/Documents/spark/conf:/Users/srowen/Documents/spark/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop1.0.4.jar:/Users/srowen/Documents/spark/core/target/scala-2.10/test-classes:/Users/srowen/Documents/spark/repl
[jira] [Commented] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
[ https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121516#comment-14121516 ] Xiangrui Meng commented on SPARK-3403: -- Did you test the setup of netlib-java with OpenBLAS? I hit a JNI issue (a year ago, maybe fixed) with netlib-java and multithreading OpenBLAS. Could you try compiling OpenBLAS with `USE_THREAD=0`? If it still doesn't work, please attach the driver/executor logs. Thanks! > NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java) > - > > Key: SPARK-3403 > URL: https://issues.apache.org/jira/browse/SPARK-3403 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.2 > Environment: Setup: Windows 7, x64 libraries for netlib-java (as > described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and > MinGW64 precompiled dlls. >Reporter: Alexander Ulanov > Fix For: 1.1.0 > > Attachments: NativeNN.scala > > > Code: > val model = NaiveBayes.train(train) > val predictionAndLabels = test.map { point => > val score = model.predict(point.features) > (score, point.label) > } > predictionAndLabels.foreach(println) > Result: > program crashes with: "Process finished with exit code -1073741819 > (0xC005)" after displaying the first prediction -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
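One quick way to narrow this down is to confirm which implementation netlib-java actually loaded inside the Spark JVMs. A small check like the one below (a sketch, assuming netlib-java is on the classpath, e.g. via breeze) prints the concrete BLAS/LAPACK classes, distinguishing a native OpenBLAS binding from the pure-Java fallback:
{code}
import com.github.fommil.netlib.{BLAS, LAPACK}

object CheckNetlib extends App {
  // Prints e.g. com.github.fommil.netlib.NativeSystemBLAS when the native
  // libraries load correctly, or com.github.fommil.netlib.F2jBLAS when
  // netlib-java falls back to the Java implementation.
  println("BLAS:   " + BLAS.getInstance().getClass.getName)
  println("LAPACK: " + LAPACK.getInstance().getClass.getName)
}
{code}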
[jira] [Updated] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
[ https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-3403: Attachment: NativeNN.scala The file contains example that produces the same issue > NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java) > - > > Key: SPARK-3403 > URL: https://issues.apache.org/jira/browse/SPARK-3403 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.2 > Environment: Setup: Windows 7, x64 libraries for netlib-java (as > described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and > MinGW64 precompiled dlls. >Reporter: Alexander Ulanov > Fix For: 1.1.0 > > Attachments: NativeNN.scala > > > Code: > val model = NaiveBayes.train(train) > val predictionAndLabels = test.map { point => > val score = model.predict(point.features) > (score, point.label) > } > predictionAndLabels.foreach(println) > Result: > program crashes with: "Process finished with exit code -1073741819 > (0xC005)" after displaying the first prediction -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
Alexander Ulanov created SPARK-3403: --- Summary: NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java) Key: SPARK-3403 URL: https://issues.apache.org/jira/browse/SPARK-3403 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.2 Environment: Setup: Windows 7, x64 libraries for netlib-java (as described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and MinGW64 precompiled dlls. Reporter: Alexander Ulanov Fix For: 1.1.0 Code: val model = NaiveBayes.train(train) val predictionAndLabels = test.map { point => val score = model.predict(point.features) (score, point.label) } predictionAndLabels.foreach(println) Result: program crashes with: "Process finished with exit code -1073741819 (0xC005)" after displaying the first prediction -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3375) spark on yarn container allocation issues
[ https://issues.apache.org/jira/browse/SPARK-3375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121451#comment-14121451 ] Apache Spark commented on SPARK-3375: - User 'tgravescs' has created a pull request for this issue: https://github.com/apache/spark/pull/2275 > spark on yarn container allocation issues > - > > Key: SPARK-3375 > URL: https://issues.apache.org/jira/browse/SPARK-3375 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.2.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Blocker > > It looks like if yarn doesn't get the containers immediately it stops asking > for them and the yarn application hangs with never getting any executors. > This was introduced by https://github.com/apache/spark/pull/2169 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-3375) spark on yarn container allocation issues
[ https://issues.apache.org/jira/browse/SPARK-3375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-3375: Assignee: Thomas Graves > spark on yarn container allocation issues > - > > Key: SPARK-3375 > URL: https://issues.apache.org/jira/browse/SPARK-3375 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.2.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Blocker > > It looks like if yarn doesn't get the containers immediately it stops asking > for them and the yarn application hangs with never getting any executors. > This was introduced by https://github.com/apache/spark/pull/2169 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2892) Socket Receiver does not stop when streaming context is stopped
[ https://issues.apache.org/jira/browse/SPARK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121324#comment-14121324 ] Helena Edelson edited comment on SPARK-2892 at 9/4/14 1:12 PM: --- I see the same with 1.0.2 streaming, with or without stopGracefully = true ssc.stop(stopSparkContext = false, stopGracefully = true) ERROR 08:26:21,139 Deregistered receiver for stream 0: Stopped by driver WARN 08:26:21,211 Stopped executor without error WARN 08:26:21,213 All of the receivers have not deregistered, Map(0 -> ReceiverInfo(0,ActorReceiver-0,null,false,host,Stopped by driver,)) was (Author: helena_e): I see the same with 1.0.2 streaming: ERROR 08:26:21,139 Deregistered receiver for stream 0: Stopped by driver WARN 08:26:21,211 Stopped executor without error WARN 08:26:21,213 All of the receivers have not deregistered, Map(0 -> ReceiverInfo(0,ActorReceiver-0,null,false,host,Stopped by driver,)) > Socket Receiver does not stop when streaming context is stopped > --- > > Key: SPARK-2892 > URL: https://issues.apache.org/jira/browse/SPARK-2892 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.0.2 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > > Running NetworkWordCount with > {quote} > ssc.start(); Thread.sleep(1); ssc.stop(stopSparkContext = false); > Thread.sleep(6) > {quote} > gives the following error > {quote} > 14/08/06 18:37:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) > in 10047 ms on localhost (1/1) > 14/08/06 18:37:13 INFO DAGScheduler: Stage 0 (runJob at > ReceiverTracker.scala:275) finished in 10.056 s > 14/08/06 18:37:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks > have all completed, from pool > 14/08/06 18:37:13 INFO SparkContext: Job finished: runJob at > ReceiverTracker.scala:275, took 10.179263 s > 14/08/06 18:37:13 INFO ReceiverTracker: All of the receivers have been > terminated > 14/08/06 18:37:13 WARN ReceiverTracker: All of the receivers have not > deregistered, Map(0 -> > ReceiverInfo(0,SocketReceiver-0,null,false,localhost,Stopped by driver,)) > 14/08/06 18:37:13 INFO ReceiverTracker: ReceiverTracker stopped > 14/08/06 18:37:13 INFO JobGenerator: Stopping JobGenerator immediately > 14/08/06 18:37:13 INFO RecurringTimer: Stopped timer for JobGenerator after > time 1407375433000 > 14/08/06 18:37:13 INFO JobGenerator: Stopped JobGenerator > 14/08/06 18:37:13 INFO JobScheduler: Stopped JobScheduler > 14/08/06 18:37:13 INFO StreamingContext: StreamingContext stopped successfully > 14/08/06 18:37:43 INFO SocketReceiver: Stopped receiving > 14/08/06 18:37:43 INFO SocketReceiver: Closed socket to localhost: > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
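For completeness, the stop pattern being discussed, as a sketch of a self-contained reproduction; the socket source and sleep durations simply mirror the issue description, and stopGracefully is the flag the comment above refers to.
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StopRepro extends App {
  val conf = new SparkConf().setMaster("local[2]").setAppName("StopRepro")
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.socketTextStream("localhost", 9999).print()

  ssc.start()
  Thread.sleep(10000)
  // Keep the SparkContext alive; with stopGracefully = true the receivers are
  // asked to finish in-flight data before deregistering.
  ssc.stop(stopSparkContext = false, stopGracefully = true)
  Thread.sleep(60000)
}
{code}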
[jira] [Commented] (SPARK-2892) Socket Receiver does not stop when streaming context is stopped
[ https://issues.apache.org/jira/browse/SPARK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121324#comment-14121324 ] Helena Edelson commented on SPARK-2892: --- I see the same with 1.0.2 streaming: ERROR 08:26:21,139 Deregistered receiver for stream 0: Stopped by driver WARN 08:26:21,211 Stopped executor without error WARN 08:26:21,213 All of the receivers have not deregistered, Map(0 -> ReceiverInfo(0,ActorReceiver-0,null,false,host,Stopped by driver,)) > Socket Receiver does not stop when streaming context is stopped > --- > > Key: SPARK-2892 > URL: https://issues.apache.org/jira/browse/SPARK-2892 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.0.2 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > > Running NetworkWordCount with > {quote} > ssc.start(); Thread.sleep(1); ssc.stop(stopSparkContext = false); > Thread.sleep(6) > {quote} > gives the following error > {quote} > 14/08/06 18:37:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) > in 10047 ms on localhost (1/1) > 14/08/06 18:37:13 INFO DAGScheduler: Stage 0 (runJob at > ReceiverTracker.scala:275) finished in 10.056 s > 14/08/06 18:37:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks > have all completed, from pool > 14/08/06 18:37:13 INFO SparkContext: Job finished: runJob at > ReceiverTracker.scala:275, took 10.179263 s > 14/08/06 18:37:13 INFO ReceiverTracker: All of the receivers have been > terminated > 14/08/06 18:37:13 WARN ReceiverTracker: All of the receivers have not > deregistered, Map(0 -> > ReceiverInfo(0,SocketReceiver-0,null,false,localhost,Stopped by driver,)) > 14/08/06 18:37:13 INFO ReceiverTracker: ReceiverTracker stopped > 14/08/06 18:37:13 INFO JobGenerator: Stopping JobGenerator immediately > 14/08/06 18:37:13 INFO RecurringTimer: Stopped timer for JobGenerator after > time 1407375433000 > 14/08/06 18:37:13 INFO JobGenerator: Stopped JobGenerator > 14/08/06 18:37:13 INFO JobScheduler: Stopped JobScheduler > 14/08/06 18:37:13 INFO StreamingContext: StreamingContext stopped successfully > 14/08/06 18:37:43 INFO SocketReceiver: Stopped receiving > 14/08/06 18:37:43 INFO SocketReceiver: Closed socket to localhost: > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
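A minimal sketch of the stop sequence being discussed in SPARK-2892, assuming Spark Streaming 1.x and a socket source fed by "nc -lk 9999"; the object name, port, and sleep durations are illustrative and not taken from the report:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StopReceiverSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StopReceiverSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    // A socket receiver; assumes something is listening on localhost:9999.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    Thread.sleep(10000)
    // Stop only the streaming machinery; the graceful flag asks receivers to
    // drain buffered data before shutting down.
    ssc.stop(stopSparkContext = false, stopGracefully = true)
  }
}
{code}
Running this exercises the same driver-initiated receiver shutdown path that the warnings quoted above come from.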
[jira] [Commented] (SPARK-1473) Feature selection for high dimensional datasets
[ https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121276#comment-14121276 ] David commented on SPARK-1473: -- Hi all, I am Dr. David Martinez and this is my first comment on this project. We implemented all the feature selection methods included in Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. The Journal of Machine Learning Research, 13, 27-66, added more optimizations, and left the framework open to include more criteria. We opened a pull request in the past but did not finish it. You can have a look at our GitHub repository: https://github.com/LIDIAgroup/SparkFeatureSelection We would like to finish our pull request. > Feature selection for high dimensional datasets > --- > > Key: SPARK-1473 > URL: https://issues.apache.org/jira/browse/SPARK-1473 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Ignacio Zendejas >Assignee: Alexander Ulanov >Priority: Minor > Labels: features > > For classification tasks involving large feature spaces in the order of tens > of thousands or higher (e.g., text classification with n-grams, where n > 1), > it is often useful to rank and filter features that are irrelevant, thereby > reducing the feature space by at least one or two orders of magnitude without > impacting performance on key evaluation metrics (accuracy/precision/recall). > A flexible feature evaluation interface needs to be designed, and at > least two methods should be implemented, with Information Gain being a > priority as it has been shown to be amongst the most reliable. > Special consideration should be taken in the design to account for wrapper > methods (see research papers below) which are more practical for lower > dimensional data. > Relevant research: > * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional > likelihood maximisation: a unifying framework for information theoretic > feature selection. *The Journal of Machine Learning Research*, *13*, 27-66. > * Forman, George. "An extensive empirical study of feature selection metrics > for text classification." The Journal of Machine Learning Research 3 (2003): > 1289-1305. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
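For context on the Information Gain criterion prioritized in this ticket, here is a small, hypothetical sketch of ranking discrete features by information gain over an RDD. It is not the SparkFeatureSelection code linked above; the object name and toy data are made up:
{code}
import org.apache.spark.{SparkConf, SparkContext}

object InfoGainSketch {
  // Shannon entropy of a set of counts.
  def entropy(counts: Iterable[Long]): Double = {
    val total = counts.sum.toDouble
    counts.filter(_ > 0).map { c => val p = c / total; -p * math.log(p) }.sum
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("InfoGainSketch").setMaster("local[2]"))

    // Toy data: (label, discrete feature values).
    val data = sc.parallelize(Seq(
      (1.0, Array(1.0, 0.0)), (1.0, Array(1.0, 1.0)),
      (0.0, Array(0.0, 0.0)), (0.0, Array(0.0, 1.0))))

    val n = data.count()
    val hLabel = entropy(data.map(_._1).countByValue().values)
    val numFeatures = data.first()._2.length

    val gains = (0 until numFeatures).map { i =>
      // Joint counts of (feature value, label) for feature i.
      val joint = data.map { case (label, feats) => ((feats(i), label), 1L) }
        .reduceByKey(_ + _).collect()
      // Conditional entropy H(label | feature i).
      val hCond = joint.groupBy(_._1._1).values.map { group =>
        val counts = group.map(_._2)
        (counts.sum.toDouble / n) * entropy(counts)
      }.sum
      (i, hLabel - hCond)
    }

    // Rank features: gain = H(label) - H(label | feature).
    gains.sortBy(-_._2).foreach { case (i, g) => println(s"feature $i: gain $g") }
    sc.stop()
  }
}
{code}
Filtering would then keep only the top-k features by gain before training.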
[jira] [Commented] (SPARK-3402) Library for Natural Language Processing over Spark.
[ https://issues.apache.org/jira/browse/SPARK-3402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121190#comment-14121190 ] Nagamallikarjuna commented on SPARK-3402: - We have gone through Spark and its ecosystem, and we did not find any natural language processing library for Spark. We (Impetus) are working to implement some natural language processing features on Spark. We have already developed a working algorithms library using the OpenNLP toolkit, and will extend it to other NLP toolkits like Stanford, CTakes, NLTK, etc. We are planning to contribute our work to the existing MLlib or a new subproject. Thanks, Naga > Library for Natural Language Processing over Spark. > --- > > Key: SPARK-3402 > URL: https://issues.apache.org/jira/browse/SPARK-3402 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Nagamallikarjuna >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
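As a hypothetical illustration of the kind of integration this ticket describes, the following sketch runs OpenNLP's SimpleTokenizer inside a Spark job. It is not the Impetus library mentioned above; the object name and sample sentences are made up:
{code}
import opennlp.tools.tokenize.SimpleTokenizer
import org.apache.spark.{SparkConf, SparkContext}

object OpenNlpTokenizeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("OpenNlpTokenizeSketch").setMaster("local[2]"))

    val docs = sc.parallelize(Seq(
      "Spark makes distributed NLP pipelines practical.",
      "OpenNLP provides tokenizers, POS taggers and more."))

    // Obtain the tokenizer inside mapPartitions so the NLP object lives on the
    // executors and never needs to be shipped from the driver.
    val tokens = docs.mapPartitions { iter =>
      val tokenizer = SimpleTokenizer.INSTANCE
      iter.map(doc => tokenizer.tokenize(doc).toSeq)
    }

    tokens.collect().foreach(println)
    sc.stop()
  }
}
{code}
Heavier components (sentence detectors, POS taggers, parsers) would follow the same per-partition pattern, loading their models once per partition.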
[jira] [Created] (SPARK-3402) Library for Natural Language Processing over Spark.
Nagamallikarjuna created SPARK-3402: --- Summary: Library for Natural Language Processing over Spark. Key: SPARK-3402 URL: https://issues.apache.org/jira/browse/SPARK-3402 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Nagamallikarjuna Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2321) Design a proper progress reporting & event listener API
[ https://issues.apache.org/jira/browse/SPARK-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121173#comment-14121173 ] Chengxiang Li commented on SPARK-2321: -- I'm not sure whether I understand you right; here is my thought about the API design: 1. The JobStatus/JobStatistic API only contains getter methods. 2. JobProgressListener contains variables of JobStatusImpl/JobStatisticImpl. 3. DagScheduler posts events to JobProgressListener through the listener bus. 4. Callers get JobStatusImpl/JobStatisticImpl from JobProgressListener with updated state. So I think it should be a pull-style API. > Design a proper progress reporting & event listener API > --- > > Key: SPARK-2321 > URL: https://issues.apache.org/jira/browse/SPARK-2321 > Project: Spark > Issue Type: Improvement > Components: Java API, Spark Core >Affects Versions: 1.0.0 >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Critical > > This is a ticket to track progress on redesigning the SparkListener and > JobProgressListener API. > There are multiple problems with the current design, including: > 0. I'm not sure if the API is usable in Java (there are at least some enums > we used in Scala and a bunch of case classes that might complicate things). > 1. The whole API is marked as DeveloperApi, because we haven't paid a lot of > attention to it yet. Something as important as progress reporting deserves a > more stable API. > 2. There is no easy way to connect jobs with stages. Similarly, there is no > easy way to connect job groups with jobs / stages. > 3. JobProgressListener itself has no encapsulation at all. States can be > arbitrarily mutated by external programs. Variable names are sort of randomly > decided and inconsistent. > We should just revisit these and propose a new, concrete design. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
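To make the pull-style design above concrete, here is a hypothetical sketch of getter-only status views backed by a listener that callers poll; all names (JobStatus, JobExecutionState, JobProgressListenerSketch) are illustrative, not Spark's actual classes:
{code}
object JobExecutionState extends Enumeration {
  val RUNNING, SUCCEEDED, FAILED, KILLED = Value
}

// Getter-only view handed to callers.
trait JobStatus {
  def jobId: Int
  def state: JobExecutionState.Value
  def stageIds: Seq[Int]
  def numTasks: Int
  def numCompletedTasks: Int
  def numFailedTasks: Int
}

// The listener holds the mutable state, updated from scheduler events posted
// on the listener bus; callers pull snapshots through jobStatus().
class JobProgressListenerSketch {
  private val jobs = scala.collection.mutable.Map[Int, JobStatus]()

  def onJobEvent(updated: JobStatus): Unit = synchronized { jobs(updated.jobId) = updated }

  def jobStatus(jobId: Int): Option[JobStatus] = synchronized { jobs.get(jobId) }
}
{code}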
[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121160#comment-14121160 ] Saisai Shao commented on SPARK-3129: Hi [~hshreedharan], one more question: Is your design goal to fix the data loss caused by receiver node failure? It seems data could potentially be lost when it is only stored in the BlockGenerator and not yet in the BlockManager when the node fails. Your design doc mainly focuses on driver failure, so what are your thoughts? > Prevent data loss in Spark Streaming > > > Key: SPARK-3129 > URL: https://issues.apache.org/jira/browse/SPARK-3129 > Project: Spark > Issue Type: New Feature >Reporter: Hari Shreedharan >Assignee: Hari Shreedharan > Attachments: StreamingPreventDataLoss.pdf > > > Spark Streaming can lose small amounts of data when the driver goes down - and the > sending system cannot re-send the data (or the data has already expired on the > sender side). The document attached has more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-529) Have a single file that controls the environmental variables and spark config options
[ https://issues.apache.org/jira/browse/SPARK-529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-529. - Resolution: Won't Fix This looks obsolete and/or fixed, as variables like SPARK_MEM are deprecated, and I suppose there is spark-env.sh too. > Have a single file that controls the environmental variables and spark config > options > - > > Key: SPARK-529 > URL: https://issues.apache.org/jira/browse/SPARK-529 > Project: Spark > Issue Type: Improvement >Reporter: Reynold Xin > > E.g. multiple places in the code base uses SPARK_MEM and has its own default > set to 512. We need a central place to enforce default values as well as > documenting the variables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-640) Update Hadoop 1 version to 1.1.0 (especially on AMIs)
[ https://issues.apache.org/jira/browse/SPARK-640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-640. - Resolution: Fixed This looks stale, right? The Hadoop 1 version has been at 1.2.1 for some time. > Update Hadoop 1 version to 1.1.0 (especially on AMIs) > - > > Key: SPARK-640 > URL: https://issues.apache.org/jira/browse/SPARK-640 > Project: Spark > Issue Type: New Feature >Reporter: Matei Zaharia > > Hadoop 1.1.0 has a fix to the notorious "trailing slash for directory objects > in S3" issue: https://issues.apache.org/jira/browse/HADOOP-5836, so it would be > good to support on the AMIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3377) Don't mix metrics from different applications
[ https://issues.apache.org/jira/browse/SPARK-3377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3377: -- Summary: Don't mix metrics from different applications (was: codahale base Metrics data between applications can jumble up together) > Don't mix metrics from different applications > - > > Key: SPARK-3377 > URL: https://issues.apache.org/jira/browse/SPARK-3377 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Kousuke Saruta >Priority: Critical > > I'm using the Codahale-based MetricsSystem of Spark with JMX or Graphite, and I > saw the following 2 problems. > (1) When applications which have the same spark.app.name run on a cluster at the > same time, some metric names jumble up together, e.g., > SparkPi.DAGScheduler.stage.failedStages. > (2) When 2+ executors run on the same machine, the JVM metrics of each executor > jumble, e.g., the current implementation cannot distinguish which executor the > metric "jvm.memory" belongs to. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
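One hypothetical way to avoid the collisions described above is to prefix every metric name with the application and executor IDs before registering it with the Codahale registry; the IDs and helper below are illustrative, not the actual MetricsSystem change:
{code}
import com.codahale.metrics.MetricRegistry

object MetricsNamespaceSketch {
  // Produces names like "app-20140904000000-0001.1.jvm.memory", so two
  // executors, or two apps with the same spark.app.name, no longer collide.
  def qualifiedName(appId: String, executorId: String, metric: String): String =
    MetricRegistry.name(appId, executorId, metric)

  def main(args: Array[String]): Unit = {
    val registry = new MetricRegistry
    registry.counter(qualifiedName("app-20140904000000-0001", "1", "jvm.memory")).inc()
    registry.counter(qualifiedName("app-20140904000000-0002", "2", "jvm.memory")).inc()
    println(registry.getCounters.keySet())
  }
}
{code}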
[jira] [Commented] (SPARK-2978) Provide an MR-style shuffle transformation
[ https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121115#comment-14121115 ] Apache Spark commented on SPARK-2978: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/2274 > Provide an MR-style shuffle transformation > -- > > Key: SPARK-2978 > URL: https://issues.apache.org/jira/browse/SPARK-2978 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Sandy Ryza > > For Hive on Spark joins in particular, and for running legacy MR code in > general, I think it would be useful to provide a transformation with the > semantics of the Hadoop MR shuffle, i.e. one that > * groups by key: provides (Key, Iterator[Value]) > * within each partition, provides keys in sorted order > A couple of ways that it could make sense to expose this: > * Add a new operator: "groupAndSortByKey", > "groupByKeyAndSortWithinPartition", "hadoopStyleShuffle", maybe? > * Allow groupByKey to take an ordering param for keys within a partition -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
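A hypothetical sketch of the proposed semantics (grouped values per key, keys sorted within each partition), built from existing operators rather than any of the operator names suggested in the ticket:
{code}
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object SortedGroupByKeySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SortedGroupByKeySketch").setMaster("local[2]"))

    val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("a", 3), ("c", 4)))

    // groupByKey gives (Key, Iterable[Value]); sorting each partition's keys
    // afterwards approximates the Hadoop MR shuffle contract.
    val grouped = pairs.groupByKey(new HashPartitioner(2))
      .mapPartitions(iter => iter.toSeq.sortBy(_._1).iterator, preservesPartitioning = true)

    grouped.collect().foreach { case (k, vs) => println(s"$k -> ${vs.mkString(",")}") }
    sc.stop()
  }
}
{code}
This only approximates the MR contract, since it materializes each partition to sort it; a native operator could instead sort keys during the shuffle itself.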
[jira] [Commented] (SPARK-2321) Design a proper progress reporting & event listener API
[ https://issues.apache.org/jira/browse/SPARK-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121094#comment-14121094 ] Reynold Xin commented on SPARK-2321: What about pull vs. push? I.e., should this be a listener-like API, or some service with state that the caller can poll? > Design a proper progress reporting & event listener API > --- > > Key: SPARK-2321 > URL: https://issues.apache.org/jira/browse/SPARK-2321 > Project: Spark > Issue Type: Improvement > Components: Java API, Spark Core >Affects Versions: 1.0.0 >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Critical > > This is a ticket to track progress on redesigning the SparkListener and > JobProgressListener API. > There are multiple problems with the current design, including: > 0. I'm not sure if the API is usable in Java (there are at least some enums > we used in Scala and a bunch of case classes that might complicate things). > 1. The whole API is marked as DeveloperApi, because we haven't paid a lot of > attention to it yet. Something as important as progress reporting deserves a > more stable API. > 2. There is no easy way to connect jobs with stages. Similarly, there is no > easy way to connect job groups with jobs / stages. > 3. JobProgressListener itself has no encapsulation at all. States can be > arbitrarily mutated by external programs. Variable names are sort of randomly > decided and inconsistent. > We should just revisit these and propose a new, concrete design. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2321) Design a proper progress reporting & event listener API
[ https://issues.apache.org/jira/browse/SPARK-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121085#comment-14121085 ] Chengxiang Li commented on SPARK-2321: -- I collected some Hive-side requirements here, which should be helpful for the Spark job status and statistics API design. Hive should be able to get the following job status information through the Spark job status API: 1. job identifier 2. current job execution state, which should include RUNNING/SUCCEEDED/FAILED/KILLED. 3. running/failed/killed/total task number at the job level. 4. stage identifier 5. stage state, which should include RUNNING/SUCCEEDED/FAILED/KILLED 6. running/failed/killed/total task number at the stage level. MR/Tez use Counters to collect statistics; similar to the MR/Tez Counter, it would be better if the Spark job statistics API organized statistics with: 1. the same kind of statistics grouped by groupName. 2. a displayName for both groups and statistics, which would provide a uniform print string for frontends (Web UI/Hive CLI/...). > Design a proper progress reporting & event listener API > --- > > Key: SPARK-2321 > URL: https://issues.apache.org/jira/browse/SPARK-2321 > Project: Spark > Issue Type: Improvement > Components: Java API, Spark Core >Affects Versions: 1.0.0 >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Critical > > This is a ticket to track progress on redesigning the SparkListener and > JobProgressListener API. > There are multiple problems with the current design, including: > 0. I'm not sure if the API is usable in Java (there are at least some enums > we used in Scala and a bunch of case classes that might complicate things). > 1. The whole API is marked as DeveloperApi, because we haven't paid a lot of > attention to it yet. Something as important as progress reporting deserves a > more stable API. > 2. There is no easy way to connect jobs with stages. Similarly, there is no > easy way to connect job groups with jobs / stages. > 3. JobProgressListener itself has no encapsulation at all. States can be > arbitrarily mutated by external programs. Variable names are sort of randomly > decided and inconsistent. > We should just revisit these and propose a new, concrete design. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
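A hypothetical sketch of the Counter-style grouping requested above, with a groupName and displayName on each group and statistic; the case classes and example values are made up for illustration:
{code}
case class StatSketch(name: String, displayName: String, value: Long)

case class StatGroupSketch(groupName: String, displayName: String, stats: Seq[StatSketch]) {
  def find(name: String): Option[StatSketch] = stats.find(_.name == name)
}

object StatGroupSketch {
  // Example groups a job status API could expose to frontends.
  def example(): Seq[StatGroupSketch] = Seq(
    StatGroupSketch("shuffle", "Shuffle Statistics", Seq(
      StatSketch("bytesWritten", "Shuffle Bytes Written", 1024L),
      StatSketch("recordsRead", "Shuffle Records Read", 42L))),
    StatGroupSketch("tasks", "Task Statistics", Seq(
      StatSketch("failed", "Failed Tasks", 0L))))

  // Renders uniform display strings, e.g. for the Web UI or Hive CLI.
  def main(args: Array[String]): Unit =
    example().foreach(g => g.stats.foreach(s => println(s"${g.displayName} / ${s.displayName}: ${s.value}")))
}
{code}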
[jira] [Commented] (SPARK-3353) Stage id monotonicity (parent stage should have lower stage id)
[ https://issues.apache.org/jira/browse/SPARK-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121075#comment-14121075 ] Apache Spark commented on SPARK-3353: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/2273 > Stage id monotonicity (parent stage should have lower stage id) > --- > > Key: SPARK-3353 > URL: https://issues.apache.org/jira/browse/SPARK-3353 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Reynold Xin > > The way stage IDs are generated is that parent stages actually have higher > stage id. This is very confusing because parent stages get scheduled & > executed first. > We should reverse that order so the scheduling timeline of stages (absent of > failures) is monotonic, i.e. stages that are executed first have lower stage > ids. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org