[jira] [Commented] (SPARK-6664) Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
[ https://issues.apache.org/jira/browse/SPARK-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394147#comment-14394147 ]

Florian Verhein commented on SPARK-6664:

I guess the other thing is - we can union RDDs, so why not be able to 'undo' that?

Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
----------------------------------------------------------------------

Key: SPARK-6664
URL: https://issues.apache.org/jira/browse/SPARK-6664
Project: Spark
Issue Type: New Feature
Components: Spark Core
Reporter: Florian Verhein

I can't find this functionality (if I missed something, apologies!), but it would be very useful for evaluating ML models.

*Use case example*

Suppose you have pre-processed web logs for a few months, and now want to split them into a training set (where you train a model to predict some aspect of site accesses, perhaps per user) and an out-of-time test set (where you evaluate how well your model performs in the future). This example has just a single split, but in general you could want more for cross validation. You may also want multiple overlapping intervals.

*Specification*

1. Given an ordered RDD and an ordered sequence of n boundaries (i.e. keys), return n+1 RDDs such that values in the ith RDD fall between the (i-1)th and ith boundary.
2. More complex alternative (but similar under the hood): provide a sequence of possibly overlapping intervals (ordered by the start key of the interval), and return the RDDs containing values within those intervals.

*Implementation ideas / notes for 1*

- The ordered RDDs are likely RangePartitioned (or there should be a simple way to find ranges from partitions in an ordered RDD)
- Find the partitions containing each boundary, and split them in two.
- Construct the new RDDs from the original partitions (and any split ones)

I suspect this could be done by launching only a few jobs to split the partitions containing the boundaries. Alternatively, it might be possible to decorate these partitions and use them in more than one RDD. I.e. let one of these partitions (for boundary i) be p. Apply two decorators p' and p'', where p' masks out values above the ith boundary, and p'' masks out values below the ith boundary. Any operations on these partitions apply only to values not masked out. Then assign p' to the ith output RDD and p'' to the (i+1)th output RDD. If I understand Spark correctly, this should not require any jobs. Not sure whether it's worth trying this optimisation.

*Implementation ideas / notes for 2*

This is very similar, except that we have to handle entire partitions (or parts of partitions) belonging to more than one output RDD, since the intervals are no longer mutually exclusive. But since RDDs are immutable(??), the decorator idea should still work? Thoughts?

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
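The splitting semantics of spec 1 can be illustrated outside Spark. Below is a minimal plain-Python sketch (not Spark code; `split_by_boundaries` is a hypothetical name) that splits an ordered key-value sequence at n boundary keys into n+1 buckets:

```python
import bisect

def split_by_boundaries(sorted_pairs, boundaries):
    """Split an ordered sequence of (key, value) pairs at n boundary keys,
    returning n+1 buckets: bucket 0 holds keys below the first boundary,
    bucket i holds keys in [boundaries[i-1], boundaries[i])."""
    keys = [k for k, _ in sorted_pairs]
    buckets = []
    start = 0
    for b in boundaries:
        end = bisect.bisect_left(keys, b)  # first index with key >= b
        buckets.append(sorted_pairs[start:end])
        start = end
    buckets.append(sorted_pairs[start:])   # everything at/after the last boundary
    return buckets

pairs = [(1, 'a'), (3, 'b'), (5, 'c'), (7, 'd'), (9, 'e')]
print(split_by_boundaries(pairs, [4, 8]))
# → [[(1, 'a'), (3, 'b')], [(5, 'c'), (7, 'd')], [(9, 'e')]]
```

In the proposed RDD version, the `bisect` step corresponds to locating the partitions containing each boundary via the RangePartitioner, and the slicing corresponds to splitting those partitions.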
[jira] [Commented] (SPARK-6664) Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
[ https://issues.apache.org/jira/browse/SPARK-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394141#comment-14394141 ]

Florian Verhein commented on SPARK-6664:

Thanks [~sowen]. I disagree :-) ... If you think there's non-stationarity, you almost certainly want to see how well a model trained in the past holds up in the future (possibly with more than one out-of-time sample if one is used for pruning, etc.), and you can do this for temporal data by adjusting the way you do cross validation. Actually, the exact method you describe is one common approach for time series data, e.g. see http://stats.stackexchange.com/questions/14099/using-k-fold-cross-validation-for-time-series-model-selection

Doing this multiple times does exactly what it does for normal cross validation - it gives you a distribution (a sample) of your error estimate, rather than a single value. So it's quite important. The size of the data isn't really relevant to this argument (also consider that I might like to employ larger datasets to remove the risk of overfitting a more complex but better-fitting model, rather than to improve my error estimates).

Note that this proposal doesn't define how the split RDDs are used (i.e. unioned) to create training and test sets. So the test set can be a single RDD, or multiple ones; it's entirely up to the user.

Allowing overlapping intervals (i.e. part 2) is a little different, because you probably wouldn't union the resulting RDDs due to duplication. It would be more useful as a primitive for bootstrapping the performance measures of streaming models or simulations (so you're not resampling records, but resampling subsequences). Alternatively, if you have big data but a class imbalance problem, you might need to resort to overlaps in the training sets to get multiple test sets with enough examples of your minority class.

From what I understand, MLUtils.kFold is standard randomised k-fold cross validation *but without shuffling* (from a cursory look at the code, it looks like ordering will always be maintained... which should probably be documented if that is the case, because it can lead to bad things... and adds another argument for #6665). Either way, since the elements of its splits are non-consecutive, it's not applicable to time series.

Do you know how the performance of filterByRange would compare? It should be pretty performant if and only if the data is RangePartitioned, right?

Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
----------------------------------------------------------------------

Key: SPARK-6664
URL: https://issues.apache.org/jira/browse/SPARK-6664
Project: Spark
Issue Type: New Feature
Components: Spark Core
Reporter: Florian Verhein
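The out-of-time evaluation scheme discussed in this thread (train on the past, test on the adjacent future, repeated to get a distribution of the error estimate) can be sketched in plain Python; `forward_chaining_splits` is a hypothetical helper, not a Spark or MLlib API:

```python
def forward_chaining_splits(ordered, n_folds):
    """Yield (train, test) splits for time-ordered data: fold i trains on an
    initial prefix and tests on the immediately following block, so each test
    set is 'out of time' relative to its training set."""
    fold_size = len(ordered) // (n_folds + 1)
    for i in range(1, n_folds + 1):
        train = ordered[:i * fold_size]
        test = ordered[i * fold_size:(i + 1) * fold_size]
        yield train, test

# Pretend these indices are time-ordered records.
for train, test in forward_chaining_splits(list(range(12)), 3):
    print(len(train), test)
```

With the proposed boundary-splitting primitive, each `train`/`test` pair would be built by unioning the appropriate split RDDs rather than slicing a list.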
[jira] [Commented] (SPARK-6665) Randomly Shuffle an RDD
[ https://issues.apache.org/jira/browse/SPARK-6665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394291#comment-14394291 ]

Florian Verhein commented on SPARK-6665:

Fair enough. I'll have to implement it because I need it, so I may as well report back when I've had the chance to (perhaps there's a better place for it - e.g. not in the core API).

Randomly Shuffle an RDD
-----------------------

Key: SPARK-6665
URL: https://issues.apache.org/jira/browse/SPARK-6665
Project: Spark
Issue Type: New Feature
Components: Spark Shell
Reporter: Florian Verhein
Priority: Minor

*Use case*

An RDD is created in a way that has some ordering, but you need to shuffle it because the ordering would cause problems downstream. E.g.
- it will be used to train an ML algorithm that makes stochastic assumptions (like SGD)
- it is used as input for cross validation: after the shuffle, you could just grab partitions (or part files, if saved to HDFS) as folds

Related question on the mailing list: http://apache-spark-user-list.1001560.n3.nabble.com/random-shuffle-streaming-RDDs-td17965.html

*Possible implementation*

As mentioned by [~sowen] in the above thread, one could sort by a good hash of the element (or key, if it's paired) and a random salt.
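The sort-by-salted-hash idea from the thread can be sketched in plain Python; sorting a local list stands in for Spark's sortBy, and `shuffle_by_salted_hash` is a hypothetical name:

```python
import hashlib
import random

def shuffle_by_salted_hash(items, salt=None):
    """Permute items by sorting on a cryptographic hash of (salt, item).
    A fresh random salt gives an independent permutation on each call;
    a fixed salt makes the permutation reproducible."""
    if salt is None:
        salt = str(random.random())

    def sort_key(item):
        return hashlib.sha256((salt + repr(item)).encode()).digest()

    return sorted(items, key=sort_key)

print(shuffle_by_salted_hash(list(range(10)), salt="demo"))
```

In Spark this would amount to `rdd.sortBy(x => hash(salt, x))`, which incurs one shuffle but leaves the result evenly partitioned, which is what makes the grab-partitions-as-folds trick work.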
[jira] [Commented] (SPARK-6665) Randomly Shuffle an RDD
[ https://issues.apache.org/jira/browse/SPARK-6665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394089#comment-14394089 ]

Florian Verhein commented on SPARK-6665:

Thanks for the quick response [~sowen]. I agree with your observation, but consider a) distributing the cross validation itself (so one job will achieve all the training and scoring on the k fold selections), and b) using the pre-processed and shuffled data for non-Spark modelling, such as in R, Python or Vowpal Wabbit (perhaps all running within Spark jobs, using something like sc.parallelize(jobs, jobs.size).map(_()) to treat Spark as a grid). So if the splits already exist on HDFS it is very easy to use them - and since you can easily control the number of partitions, this gives a very simple way to quickly get something up and running in R or Python, even if the data is big. But this is really just a nice data-science hacking side effect of this feature, rather than a driving use case.

I don't really agree that taking random subsamples is better, because you run the risk of never selecting some instances. I agree that the most important use case is random order for subsequent serial access (but disagree that it's limited to small RDDs). For example, if you use Spark for pre-processing followed by a large-scale learner like Vowpal Wabbit (note that vw has features that MLlib SGD doesn't have yet), the data should be shuffled, since vw processes out of core and so cannot perform the randomisation itself through order selection (and it would really slow down the algorithm if it did).

It's worth pointing out that shuffling a dataset is a common enough operation for it to exist in other big data frameworks - e.g. I've used it in ML pipelines written in Scoobi and Scalding. I haven't implemented it myself, but I'm pretty sure it's non-trivial to make it performant with good randomness properties. So I think there's a good case to add it.

Randomly Shuffle an RDD
-----------------------

Key: SPARK-6665
URL: https://issues.apache.org/jira/browse/SPARK-6665
Project: Spark
Issue Type: New Feature
Components: Spark Shell
Reporter: Florian Verhein
Priority: Minor
[jira] [Updated] (SPARK-6664) Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
[ https://issues.apache.org/jira/browse/SPARK-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Florian Verhein updated SPARK-6664:
-----------------------------------

Description:

I can't find this functionality (if I missed something, apologies!), but it would be very useful for evaluating ML models.

*Use case example*

Suppose you have pre-processed web logs for a few months, and now want to split them into a training set (where you train a model to predict some aspect of site accesses, perhaps per user) and an out-of-time test set (where you evaluate how well your model performs in the future). This example has just a single split, but in general you could want more for cross validation. You may also want multiple overlapping intervals.

*Specification*

1. Given an ordered RDD and an ordered sequence of n boundaries (i.e. keys), return n+1 RDDs such that values in the ith RDD fall between the (i-1)th and ith boundary.
2. More complex alternative (but similar under the hood): provide a sequence of possibly overlapping intervals (ordered by the start key of the interval), and return the RDDs containing values within those intervals.

*Implementation ideas / notes for 1*

- The ordered RDDs are likely RangePartitioned (or there should be a simple way to find ranges from partitions in an ordered RDD)
- Find the partitions containing each boundary, and split them in two.
- Construct the new RDDs from the original partitions (and any split ones)

I suspect this could be done by launching only a few jobs to split the partitions containing the boundaries. Alternatively, it might be possible to decorate these partitions and use them in more than one RDD. I.e. let one of these partitions (for boundary i) be p. Apply two decorators p' and p'', where p' masks out values above the ith boundary, and p'' masks out values below the ith boundary. Any operations on these partitions apply only to values not masked out. Then assign p' to the ith output RDD and p'' to the (i+1)th output RDD. If I understand Spark correctly, this should not require any jobs. Not sure whether it's worth trying this optimisation.

*Implementation ideas / notes for 2*

This is very similar, except that we have to handle entire partitions (or parts of partitions) belonging to more than one output RDD, since the intervals are no longer mutually exclusive. But since RDDs are immutable(??), the decorator idea should still work? Thoughts?

was: (the previous description: the same text with plain headings, without the clause requiring the intervals in part 2 to be ordered by their start key)

Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
----------------------------------------------------------------------

Key: SPARK-6664
URL: https://issues.apache.org/jira/browse/SPARK-6664
[jira] [Created] (SPARK-6665) Randomly Shuffle an RDD
Florian Verhein created SPARK-6665:
-----------------------------------

Summary: Randomly Shuffle an RDD
Key: SPARK-6665
URL: https://issues.apache.org/jira/browse/SPARK-6665
Project: Spark
Issue Type: New Feature
Components: Spark Shell
Reporter: Florian Verhein
Priority: Minor

*Use case*

An RDD is created in a way that has some ordering, but you need to shuffle it because the ordering would cause problems downstream. E.g.
- it will be used to train an ML algorithm that makes stochastic assumptions (like SGD)
- it is used as input for cross validation: after the shuffle, you could just grab partitions (or part files, if saved to HDFS) as folds

Related question on the mailing list: http://apache-spark-user-list.1001560.n3.nabble.com/random-shuffle-streaming-RDDs-td17965.html

*Possible implementation*

As mentioned by [~sowen] in the above thread, one could sort by a good hash of the element (or key, if it's paired) and a random salt.
[jira] [Commented] (SPARK-6664) Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
[ https://issues.apache.org/jira/browse/SPARK-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391950#comment-14391950 ]

Florian Verhein commented on SPARK-6664:

The closest existing approach I've found that should achieve the same result is calling OrderedRDDFunctions.filterByRange n+1 times. I assume this approach would be much slower, but... it may not be if it's completely lazy (??). I don't know Spark well enough yet to be anywhere near sure of this.

Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
----------------------------------------------------------------------

Key: SPARK-6664
URL: https://issues.apache.org/jira/browse/SPARK-6664
Project: Spark
Issue Type: New Feature
Components: Spark Core
Reporter: Florian Verhein
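The n+1 filterByRange workaround can be sketched in plain Python. `filter_by_range` below is a local stand-in, using half-open intervals for simplicity (Spark's filterByRange(lower, upper) is inclusive of both endpoints); the point it illustrates is that every bucket re-filters the whole dataset, whereas a true split would make a single pass:

```python
def filter_by_range(sorted_pairs, lower, upper):
    """Stand-in for a range filter over (key, value) pairs: keep pairs whose
    key k satisfies lower <= k < upper, where None means unbounded."""
    return [(k, v) for k, v in sorted_pairs
            if (lower is None or k >= lower) and (upper is None or k < upper)]

def split_via_repeated_filters(sorted_pairs, boundaries):
    """Emulate calling the range filter n+1 times: one full scan per bucket."""
    edges = [None] + list(boundaries) + [None]
    return [filter_by_range(sorted_pairs, lo, hi)
            for lo, hi in zip(edges, edges[1:])]

pairs = [(1, 'a'), (3, 'b'), (5, 'c'), (7, 'd'), (9, 'e')]
print(split_via_repeated_filters(pairs, [4, 8]))
# → [[(1, 'a'), (3, 'b')], [(5, 'c'), (7, 'd')], [(9, 'e')]]
```

On a RangePartitioned RDD, Spark's filterByRange can prune whole partitions before filtering, which is why the comment's "performant if and only if RangePartitioned" intuition seems plausible.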
[jira] [Created] (SPARK-6664) Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
Florian Verhein created SPARK-6664:
-----------------------------------

Summary: Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
Key: SPARK-6664
URL: https://issues.apache.org/jira/browse/SPARK-6664
Project: Spark
Issue Type: New Feature
Components: Spark Core
Reporter: Florian Verhein

I can't find this functionality (if I missed something, apologies!), but it would be very useful for evaluating ML models.

Use case example: suppose you have pre-processed web logs for a few months, and now want to split them into a training set (where you train a model to predict some aspect of site accesses, perhaps per user) and an out-of-time test set (where you evaluate how well your model performs in the future). This example has just a single split, but in general you could want more for cross validation. You may also want multiple overlapping intervals.

Specification:
1. Given an ordered RDD and an ordered sequence of n boundaries (i.e. keys), return n+1 RDDs such that values in the ith RDD fall between the (i-1)th and ith boundary.
2. More complex alternative (but similar under the hood): provide a sequence of possibly overlapping intervals, and return the RDDs containing values within those intervals.

Implementation ideas / notes for 1:
- The ordered RDDs are likely RangePartitioned (or there should be a simple way to find ranges from partitions in an ordered RDD)
- Find the partitions containing each boundary, and split them in two.
- Construct the new RDDs from the original partitions (and any split ones)

I suspect this could be done by launching only a few jobs to split the partitions containing the boundaries. Alternatively, it might be possible to decorate these partitions and use them in more than one RDD. I.e. let one of these partitions (for boundary i) be p. Apply two decorators p' and p'', where p' masks out values above the ith boundary, and p'' masks out values below the ith boundary. Any operations on these partitions apply only to values not masked out. Then assign p' to the ith output RDD and p'' to the (i+1)th output RDD. If I understand Spark correctly, this should not require any jobs. Not sure whether it's worth trying this optimisation.

Implementation ideas / notes for 2:
This is very similar, except that we have to handle entire partitions (or parts of partitions) belonging to more than one output RDD, since the intervals are no longer mutually exclusive. But since RDDs are immutable(?), the decorator idea should still work? Thoughts?
[jira] [Updated] (SPARK-6601) Add HDFS NFS gateway module to spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-6601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Florian Verhein updated SPARK-6601:
-----------------------------------

Description: Add module hdfs-nfs-gateway, which sets up the gateway for (say) ephemeral-hdfs, as well as mounts (e.g. to /hdfs_nfs) on all nodes. Note: for NFS to be available outside AWS, this also requires [#6600]

was: (the same description, with the issue referenced as plain #6600 rather than linked as [#6600])

Add HDFS NFS gateway module to spark-ec2
----------------------------------------

Key: SPARK-6601
URL: https://issues.apache.org/jira/browse/SPARK-6601
Project: Spark
Issue Type: New Feature
Components: EC2
Reporter: Florian Verhein
[jira] [Created] (SPARK-6601) Add HDFS NFS gateway module to spark-ec2
Florian Verhein created SPARK-6601:
-----------------------------------

Summary: Add HDFS NFS gateway module to spark-ec2
Key: SPARK-6601
URL: https://issues.apache.org/jira/browse/SPARK-6601
Project: Spark
Issue Type: New Feature
Components: EC2
Reporter: Florian Verhein

Add module hdfs-nfs-gateway, which sets up the gateway for (say) ephemeral-hdfs, as well as mounts (e.g. to /hdfs_nfs) on all nodes. Note: for NFS to be available outside AWS, this also requires #6600
[jira] [Updated] (SPARK-6600) Open ports in spark-ec2.py to allow HDFS NFS gateway
[ https://issues.apache.org/jira/browse/SPARK-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Florian Verhein updated SPARK-6600:
-----------------------------------

Description:

Use case: a user has set up the Hadoop HDFS NFS gateway service on their spark-ec2.py launched cluster, and wants to mount it on their local machine. This requires the following ports to be opened in the incoming rule set for MASTER, for both UDP and TCP: 111, 2049, 4242. (I have tried this and it works.)

Note that this issue *does not* cover the implementation of an HDFS NFS gateway module in the spark-ec2 project. That should be a separate issue (TODO).

Reference: https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html

was: (the same description, referencing the r2.3.0 version of the HdfsNfsGateway documentation)

Open ports in spark-ec2.py to allow HDFS NFS gateway
----------------------------------------------------

Key: SPARK-6600
URL: https://issues.apache.org/jira/browse/SPARK-6600
Project: Spark
Issue Type: New Feature
Components: EC2
Reporter: Florian Verhein
[jira] [Updated] (SPARK-6600) Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway
[ https://issues.apache.org/jira/browse/SPARK-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Florian Verhein updated SPARK-6600:
-----------------------------------

Summary: Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway (was: Open ports in spark-ec2.py to allow HDFS NFS gateway)

Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway
--------------------------------------------------------

Key: SPARK-6600
URL: https://issues.apache.org/jira/browse/SPARK-6600
Project: Spark
Issue Type: New Feature
Components: EC2
Reporter: Florian Verhein
[jira] [Updated] (SPARK-6600) Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway
[ https://issues.apache.org/jira/browse/SPARK-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Florian Verhein updated SPARK-6600:
-----------------------------------

Description:

Use case: a user has set up the Hadoop HDFS NFS gateway service on their spark_ec2.py launched cluster, and wants to mount it on their local machine. This requires the following ports to be opened in the incoming rule set for MASTER, for both UDP and TCP: 111, 2049, 4242. (I have tried this and it works.)

Note that this issue *does not* cover the implementation of an HDFS NFS gateway module in the spark-ec2 project. See [#6601] for that.

Reference: https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html

was: (the same description, ending "That should be a separate issue (TODO)." instead of linking to [#6601])

Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway
--------------------------------------------------------

Key: SPARK-6600
URL: https://issues.apache.org/jira/browse/SPARK-6600
Project: Spark
Issue Type: New Feature
Components: EC2
Reporter: Florian Verhein
[jira] [Updated] (SPARK-6601) Add HDFS NFS gateway module to spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-6601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florian Verhein updated SPARK-6601:
---
Description:
Add a module hdfs-nfs-gateway, which sets up the gateway for (say) ephemeral-hdfs and mounts it (e.g. at /hdfs_nfs) on all nodes. Note: for NFS to be available outside AWS, this also requires #6600.

was:
Add a module hdfs-nfs-gateway, which sets up the gateway for (say) ephemeral-hdfs and mounts it (e.g. at /hdfs_nfs) on all nodes. Note: for NFS to be available outside AWS, this also requires [#6600].

Add HDFS NFS gateway module to spark-ec2
--
Key: SPARK-6601 URL: https://issues.apache.org/jira/browse/SPARK-6601 Project: Spark Issue Type: New Feature Components: EC2 Reporter: Florian Verhein

Add a module hdfs-nfs-gateway, which sets up the gateway for (say) ephemeral-hdfs and mounts it (e.g. at /hdfs_nfs) on all nodes. Note: for NFS to be available outside AWS, this also requires #6600.
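The proposed module would essentially run the few commands from the HdfsNfsGateway guide. A sketch only: the Hadoop home and mount point are assumptions about the spark-ec2 image layout, and `gateway_setup_commands` is a hypothetical helper, not an existing spark-ec2 function:

```python
# Hypothetical sketch of what an hdfs-nfs-gateway spark-ec2 module would run.
# Commands follow the Hadoop 2.4 HdfsNfsGateway guide; paths are assumptions.

HADOOP_HOME = "/root/ephemeral-hdfs"      # assumed spark-ec2 layout
MOUNT_POINT = "/hdfs_nfs"

def gateway_setup_commands(master_host):
    """Commands to start the gateway on the master and mount it on a node."""
    return [
        # start the portmap and nfs3 daemons on the gateway node (the master)
        f"{HADOOP_HOME}/sbin/hadoop-daemon.sh start portmap",
        f"{HADOOP_HOME}/sbin/hadoop-daemon.sh start nfs3",
        # mount HDFS via NFSv3 on each node, per the HdfsNfsGateway guide
        f"mkdir -p {MOUNT_POINT}",
        f"mount -t nfs -o vers=3,proto=tcp,nolock {master_host}:/ {MOUNT_POINT}",
    ]
```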
[jira] [Commented] (SPARK-5879) spark_ec2.py should expose/return master and slave lists (e.g. write to file)
[ https://issues.apache.org/jira/browse/SPARK-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328612#comment-14328612 ] Florian Verhein commented on SPARK-5879:
cc [~shivaram], any opinions on how to best do this?

spark_ec2.py should expose/return master and slave lists (e.g. write to file)
-
Key: SPARK-5879 URL: https://issues.apache.org/jira/browse/SPARK-5879 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Florian Verhein

After running spark_ec2.py, it is often useful or necessary to know the master's IP / DNS name, particularly if running spark_ec2.py is part of a larger pipeline. For example, consider a wrapper that launches a cluster, then waits for completion of some application running on it (e.g. polling via ssh), before destroying the cluster. Some options:
- write `launch-variables.sh` with MASTERS and SLAVES exports (i.e. basically a subset of the ec2_variables.sh that is temporarily created as part of deploy_files variable substitution)
- write `launch-variables.json` (the same info, but as JSON)
Both would be useful depending on the wrapper language. I think we should incorporate the cluster name, for the case where multiple clusters are launched, e.g. cluster_name_variables.sh/.json. Thoughts?
[jira] [Created] (SPARK-5879) spark_ec2.py should expose/return master and slave lists (e.g. write to file)
Florian Verhein created SPARK-5879:
--
Summary: spark_ec2.py should expose/return master and slave lists (e.g. write to file)
Key: SPARK-5879 URL: https://issues.apache.org/jira/browse/SPARK-5879 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Florian Verhein

After running spark_ec2.py, it is often useful or necessary to know the master's IP / DNS name, particularly if running spark_ec2.py is part of a larger pipeline. For example, consider a wrapper that launches a cluster, then waits for completion of some application running on it (e.g. polling via ssh), before destroying the cluster. Some options:
- write `launch-variables.sh` with MASTERS and SLAVES exports (i.e. basically a subset of the ec2_variables.sh that is temporarily created as part of deploy_files variable substitution)
- write `launch-variables.json` (the same info, but as JSON)
Both would be useful depending on the wrapper language. I think we should incorporate the cluster name, for the case where multiple clusters are launched, e.g. cluster_name_variables.sh/.json. Thoughts?
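The two output options above can be sketched as one small helper. This is an illustration only: `write_launch_variables` and the `<cluster_name>_variables.*` naming are assumptions from the proposal, not anything spark_ec2.py provides today:

```python
import json
import os

def write_launch_variables(cluster_name, masters, slaves, out_dir="."):
    """Write <cluster_name>_variables.sh and .json with the node lists.

    Hypothetical helper illustrating SPARK-5879; names and layout are
    assumptions, not existing spark_ec2.py behaviour.
    """
    sh_path = os.path.join(out_dir, f"{cluster_name}_variables.sh")
    json_path = os.path.join(out_dir, f"{cluster_name}_variables.json")
    with open(sh_path, "w") as f:
        # space-separated lists, easy to `source` from a shell wrapper
        f.write(f'export MASTERS="{" ".join(masters)}"\n')
        f.write(f'export SLAVES="{" ".join(slaves)}"\n')
    with open(json_path, "w") as f:
        json.dump({"cluster_name": cluster_name,
                   "masters": masters, "slaves": slaves}, f, indent=2)
    return sh_path, json_path
```

A shell wrapper could then do `source mycluster_variables.sh && ssh root@$MASTERS ...`, while a Python wrapper would read the JSON file.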
[jira] [Commented] (SPARK-5851) spark_ec2.py ssh failure retry handling not always appropriate
[ https://issues.apache.org/jira/browse/SPARK-5851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14324986#comment-14324986 ] Florian Verhein commented on SPARK-5851:
That makes sense. Yeah, I ran into it yesterday. My spark-ec2/setup.sh failed (I had `set -u` set in a new component I was testing), resulting in looping over setup.sh calls. In this case, spark_ec2.py shouldn't retry, but fail gracefully (ideally after performing cleanup of the cluster, and returning a failure code).

spark_ec2.py ssh failure retry handling not always appropriate
--
Key: SPARK-5851 URL: https://issues.apache.org/jira/browse/SPARK-5851 Project: Spark Issue Type: Bug Components: EC2 Reporter: Florian Verhein Priority: Minor

The following function doesn't distinguish between the ssh call failing (e.g. presumably a connection issue) and the remote command that it executes failing (e.g. setup.sh). The latter should probably not result in a retry. Perhaps tries could be an argument that is set to 1 for certain usages.
# Run a command on a host through ssh, retrying up to five times
# and then throwing an exception if ssh continues to fail.
spark-ec2: [{{def ssh(host, opts, command)}}|https://github.com/apache/spark/blob/d8f69cf78862d13a48392a0b94388b8d403523da/ec2/spark_ec2.py#L953-L975]
[jira] [Created] (SPARK-5851) spark_ec2.py ssh failure retry handling not always appropriate
Florian Verhein created SPARK-5851:
--
Summary: spark_ec2.py ssh failure retry handling not always appropriate
Key: SPARK-5851 URL: https://issues.apache.org/jira/browse/SPARK-5851 Project: Spark Issue Type: Bug Components: EC2 Reporter: Florian Verhein Priority: Minor

The following function doesn't distinguish between the ssh call failing (e.g. presumably a connection issue) and the remote command that it executes failing (e.g. setup.sh). The latter should probably not result in a retry. Perhaps tries could be an argument that is set to 1 for certain usages.
# Run a command on a host through ssh, retrying up to five times
# and then throwing an exception if ssh continues to fail.
def ssh(host, opts, command):
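One way to separate the two failure modes is ssh's exit status: OpenSSH exits with 255 when the connection or ssh itself fails, while any other nonzero status is the remote command's own exit code. A sketch only, not the actual spark_ec2.py implementation (which shells out via subprocess); `run` is an injectable stand-in for the subprocess call, purely to make the policy testable:

```python
import time

class RemoteCommandError(Exception):
    """The remote command ran but exited nonzero -- do not retry."""

class SshConnectionError(Exception):
    """ssh itself failed (exit status 255) on every attempt."""

def ssh(host, command, run, tries=5, delay=0):
    """Sketch of a retry policy for SPARK-5851 (hypothetical signature).

    `run(host, command)` returns an exit status, standing in for
    `subprocess.call(["ssh", host, command])`. OpenSSH reserves status
    255 for ssh/connection errors; anything else came from the command.
    """
    for _ in range(tries):
        status = run(host, command)
        if status == 0:
            return
        if status != 255:
            # the remote command itself failed (e.g. a setup.sh bug);
            # retrying would just loop, so fail fast instead
            raise RemoteCommandError(f"{command!r} exited {status} on {host}")
        if delay:
            time.sleep(delay)
    raise SshConnectionError(f"could not connect to {host} after {tries} tries")
```

Setting `tries=1` then recovers the "no retry at all" behaviour suggested in the issue for certain usages.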
[jira] [Commented] (SPARK-5813) Spark-ec2: Switch to OracleJDK
[ https://issues.apache.org/jira/browse/SPARK-5813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322611#comment-14322611 ] Florian Verhein commented on SPARK-5813:
I think it's a good idea to stick to vendor recommendations, but since I can't point to any concrete benefits and there is complexity around handling licensing issues, I don't think there's a good argument for tackling this.

Spark-ec2: Switch to OracleJDK
--
Key: SPARK-5813 URL: https://issues.apache.org/jira/browse/SPARK-5813 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Florian Verhein Priority: Minor

Currently using OpenJDK; however, it is generally recommended to use Oracle JDK, esp. for Hadoop deployments, etc.
[jira] [Closed] (SPARK-5813) Spark-ec2: Switch to OracleJDK
[ https://issues.apache.org/jira/browse/SPARK-5813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florian Verhein closed SPARK-5813.
--
Resolution: Won't Fix

Spark-ec2: Switch to OracleJDK
--
Key: SPARK-5813 URL: https://issues.apache.org/jira/browse/SPARK-5813 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Florian Verhein Priority: Minor

Currently using OpenJDK; however, it is generally recommended to use Oracle JDK, esp. for Hadoop deployments, etc.
[jira] [Commented] (SPARK-5813) Spark-ec2: Switch to OracleJDK
[ https://issues.apache.org/jira/browse/SPARK-5813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321764#comment-14321764 ] Florian Verhein commented on SPARK-5813:
IANAL, but here are my thoughts: the user ends up downloading it from Oracle and accepting the license terms in that process, so as long as they are (or are made) aware, I don't really see a problem. It's just providing a mechanism for them to do this, i.e. it's not a redistribution issue. I think a reasonable solution would be to have OpenJDK as the default, with OracleJDK as an option that the user must specifically request (with the option's documentation indicating that this entails acceptance of a license, etc.).
At least, *the above is true in the case where the user builds their own AMI (that's the approach I take, since it best suits my requirements). With provided AMIs I think this is more complex, because I would assume that is redistribution*. I guess that applies to any software that is put on the AMI, actually... so this may be an issue that needs looking at more generally. I don't know how to best approach that case, other than adhering to any redistribution terms and including these as part of an EULA for spark-ec2/AMIs or something? But with the work [~nchammas] has done, I suppose the easiest way would be to provide the public AMIs with OpenJDK, and add an option to build ones with OracleJDK if the user is inclined to do this themselves. Hmmm... is this worthwhile?

Spark-ec2: Switch to OracleJDK
--
Key: SPARK-5813 URL: https://issues.apache.org/jira/browse/SPARK-5813 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Florian Verhein Priority: Minor

Currently using OpenJDK; however, it is generally recommended to use Oracle JDK, esp. for Hadoop deployments, etc.
[jira] [Commented] (SPARK-5813) Spark-ec2: Switch to OracleJDK
[ https://issues.apache.org/jira/browse/SPARK-5813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322208#comment-14322208 ] Florian Verhein commented on SPARK-5813:
Good point. I think you're right re: scripting it away - I understand it's sometimes done by sysadmins/ops to automate their in-house installation processes, but that is a different situation. Thanks for that. spark_ec2 works by looking up an existing AMI and using it to instantiate EC2 instances. I don't know who currently maintains these.

Spark-ec2: Switch to OracleJDK
--
Key: SPARK-5813 URL: https://issues.apache.org/jira/browse/SPARK-5813 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Florian Verhein Priority: Minor

Currently using OpenJDK; however, it is generally recommended to use Oracle JDK, esp. for Hadoop deployments, etc.
[jira] [Commented] (SPARK-5813) Spark-ec2: Switch to OracleJDK
[ https://issues.apache.org/jira/browse/SPARK-5813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321748#comment-14321748 ] Florian Verhein commented on SPARK-5813:
No specific technical reason, esp. WRT Spark... It's more of an attempt to keep in line with recommendations for Hadoop in production (relevant since Hadoop is included in spark-ec2 - and CDH seems to be favoured). For example, CDH supports OracleJDK, Hortonworks didn't support OpenJDK before 1.7, and OracleJDK still seems to be the favoured choice in production deployments, e.g. http://wiki.apache.org/hadoop/HadoopJavaVersions. I don't have first-hand data about how they compare performance-wise. I've heard OracleJDK being preferred for Hadoop on that front, but I also found this http://www.slideshare.net/PrincipledTechnologies/big-data-technology-on-red-hat-enterprise-linux-openjdk-vs-oracle-jdk, so perhaps performance is less of a reason these days? Do you know of any performance analysis done with Spark or Tachyon on OpenJDK vs OracleJDK?
In terms of difficulty, it's not hard to script installation of OracleJDK. E.g. I've gone down the path of supporting both for the above reasons here (link may break in future): https://github.com/florianverhein/spark-ec2/blob/packer/packer/java-setup.sh
Aside: based on the bugs you mentioned, is there a list somewhere of which JDK versions to avoid WRT Spark?

Spark-ec2: Switch to OracleJDK
--
Key: SPARK-5813 URL: https://issues.apache.org/jira/browse/SPARK-5813 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Florian Verhein Priority: Minor

Currently using OpenJDK; however, it is generally recommended to use Oracle JDK, esp. for Hadoop deployments, etc.
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320995#comment-14320995 ] Florian Verhein commented on SPARK-3821:
RE: Java, that reminds me... we should probably be using OracleJDK rather than OpenJDK. But I think this should be a separate issue, so I just created SPARK-5813.

Develop an automated way of creating Spark images (AMI, Docker, and others)
---
Key: SPARK-3821 URL: https://issues.apache.org/jira/browse/SPARK-3821 Project: Spark Issue Type: Improvement Components: Build, EC2 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Attachments: packer-proposal.html

Right now the creation of Spark AMIs or Docker containers is done manually. With tools like [Packer|http://www.packer.io/], we should be able to automate this work, and do so in such a way that multiple types of machine images can be created from a single template.
[jira] [Created] (SPARK-5813) Spark-ec2: Switch to OracleJDK
Florian Verhein created SPARK-5813:
--
Summary: Spark-ec2: Switch to OracleJDK
Key: SPARK-5813 URL: https://issues.apache.org/jira/browse/SPARK-5813 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Florian Verhein Priority: Minor

Currently using OpenJDK; however, it is generally recommended to use Oracle JDK, esp. for Hadoop deployments, etc.
[jira] [Updated] (SPARK-5641) Allow spark_ec2.py to copy arbitrary files to cluster
[ https://issues.apache.org/jira/browse/SPARK-5641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florian Verhein updated SPARK-5641:
---
Description:
*Updated - no longer via deploy.generic, no substitutions*
Essentially, give users an easy way to rcp a directory structure to the master's / as part of the cluster launch, at a useful point in the workflow (before setup.sh is called on the master). Useful if binary files need to be uploaded, e.g. I use this for rpm transfer, to install extra stuff at cluster deployment time. However, note that it could also be used to override / add to either:
- what's on the image
- what gets cloned from spark-ec2 (e.g. add a new module)

was:
Useful if binary files need to be uploaded, e.g. I use this for rpm transfer, to install extra stuff at cluster deployment time. However, note that it could also be used to override either:
- what's on the image
- what gets cloned from spark-ec2 (since deploy_files runs afterwards)
The idea is that the user can just dump the files into ec2/deploy.generic/. This can be implemented by modifying deploy_files so that it simply copies the file (if it is of certain types), rather than treating it as a text file and attempting to replace template variables. Detecting binary files is non-trivial, so the proposal is to have a list of file extensions that will trigger simple file copying.

Allow spark_ec2.py to copy arbitrary files to cluster
-
Key: SPARK-5641 URL: https://issues.apache.org/jira/browse/SPARK-5641 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Florian Verhein Priority: Minor

*Updated - no longer via deploy.generic, no substitutions*
Essentially, give users an easy way to rcp a directory structure to the master's / as part of the cluster launch, at a useful point in the workflow (before setup.sh is called on the master). Useful if binary files need to be uploaded, e.g. I use this for rpm transfer, to install extra stuff at cluster deployment time. However, note that it could also be used to override / add to either:
- what's on the image
- what gets cloned from spark-ec2 (e.g. add a new module)
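The updated design above (push a user directory tree to the master's / before setup.sh runs) can be sketched as the command spark_ec2.py would issue. A sketch only: `copy_user_files` is a hypothetical helper, and the rsync invocation mirrors the style of spark_ec2.py's existing deploy_files push rather than quoting it exactly:

```python
def copy_user_files(user_dir, master_host, identity_file):
    """Build an rsync command to push a user directory tree to the master's /.

    Hypothetical sketch for SPARK-5641; argument names and the exact rsync
    flags are assumptions, modelled on how spark_ec2.py pushes deploy_files.
    """
    return [
        "rsync", "-rv",
        "-e", f"ssh -o StrictHostKeyChecking=no -i {identity_file}",
        f"{user_dir.rstrip('/')}/",     # trailing slash: copy contents, not the dir
        f"root@{master_host}:/",
    ]
    # spark_ec2.py would pass this list to subprocess.check_call(...)
```

Since the tree is copied verbatim, binary files (rpms etc.) need no special handling, which is what makes this simpler than the earlier deploy.generic substitution approach.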
[jira] [Updated] (SPARK-5641) Allow spark_ec2.py to copy arbitrary files to cluster via deploy.generic
[ https://issues.apache.org/jira/browse/SPARK-5641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florian Verhein updated SPARK-5641:
---
Description:
Useful if binary files need to be uploaded, e.g. I use this for rpm transfer, to install extra stuff at cluster deployment time. However, note that it could also be used to override either:
- what's on the image
- what gets cloned from spark-ec2 (since deploy_files runs afterwards)
The idea is that the user can just dump the files into ec2/deploy.generic/. This can be implemented by modifying deploy_files so that it simply copies the file (if it is of certain types), rather than treating it as a text file and attempting to replace template variables. Detecting binary files is non-trivial, so the proposal is to have a list of file extensions that will trigger simple file copying.

was:
Useful if binary files need to be uploaded, e.g. I use this for rpm transfer, to install extra stuff at cluster deployment time. Could also be used to override what's on the image, etc. The idea is that the user can just dump the files into deploy.generic. This can be implemented by modifying deploy_templates so that it simply copies the file (if it is of certain types), rather than treating it as a text file and replacing template variables.

Allow spark_ec2.py to copy arbitrary files to cluster via deploy.generic
-
Key: SPARK-5641 URL: https://issues.apache.org/jira/browse/SPARK-5641 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Florian Verhein Priority: Minor

Useful if binary files need to be uploaded, e.g. I use this for rpm transfer, to install extra stuff at cluster deployment time. However, note that it could also be used to override either:
- what's on the image
- what gets cloned from spark-ec2 (since deploy_files runs afterwards)
The idea is that the user can just dump the files into ec2/deploy.generic/. This can be implemented by modifying deploy_files so that it simply copies the file (if it is of certain types), rather than treating it as a text file and attempting to replace template variables. Detecting binary files is non-trivial, so the proposal is to have a list of file extensions that will trigger simple file copying.
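The extension-based dispatch proposed above can be sketched in a few lines. Everything here is an assumption for illustration: the helper name, the extension list, and the use of `string.Template` ($var markers) as a self-contained stand-in for spark-ec2's own template substitution:

```python
import shutil
import string

# Extensions treated as binary and copied verbatim -- an assumed list,
# per the proposal's "file extensions that will trigger simple copying".
RAW_COPY_EXTENSIONS = (".rpm", ".jar", ".tar", ".gz", ".zip")

def deploy_file(src, dest, template_vars):
    """Sketch of the proposed deploy_files change (hypothetical helper)."""
    if src.endswith(RAW_COPY_EXTENSIONS):
        # binary by extension: byte-for-byte copy, no substitution attempted
        shutil.copyfile(src, dest)
        return
    with open(src) as f:
        text = f.read()
    # text file: substitute template variables; string.Template is used
    # here only to keep the sketch self-contained and runnable
    with open(dest, "w") as f:
        f.write(string.Template(text).safe_substitute(template_vars))
```

The point of the extension whitelist is that it sidesteps content sniffing entirely: a file either matches a known-binary extension or is assumed to be a substitutable text template.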
[jira] [Comment Edited] (SPARK-5676) License missing from spark-ec2 repo
[ https://issues.apache.org/jira/browse/SPARK-5676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313102#comment-14313102 ] Florian Verhein edited comment on SPARK-5676 at 2/9/15 11:06 PM:
-
[~srowen] Yep, that's the one. True. However, it is the key part in providing the functionality of Spark deployment on EC2, which is documented quite prominently on the Spark site, and the entry point of which is in the Spark repo (ec2/spark_ec2.py). Bugs against this functionality are therefore also filed here under the EC2 component. I assume the decision to have a separate repo was for implementation/design reasons ( ?? ). Having spark_ec2.py cause this repo to be cloned and executed on EC2 is a really nice way of providing the functionality. But that's an assumption on my part, and [~shivaram] would know best. So from a user perspective, it would appear to be part of Spark (users may not even be aware that part of the functionality lives in a separate repo). Since it's a great way to get Spark running on EC2, it would be great to get the licensing sorted out. This appears to be the best place to raise this issue.

was (Author: florianverhein):
[~srowen] Yep, that's the one. True. However, it is the key part in providing the functionality of Spark deployment on EC2, which is documented quite prominently on the Spark site, and the entry point of which is in the Spark repo (ec2/spark_ec2.py). Bugs against this functionality are therefore also filed here under the EC2 component. I assume the decision to have a separate repo was for implementation/design reasons ( ?? ). Having spark_ec2.py cause this repo to be cloned and executed on EC2 is a really nice way of providing the functionality. But that's an assumption on my part, and [~shivaram] would know best. So from a user perspective, it would appear to be part of Spark (users may not even be aware that part of the functionality lives in a separate repo). Since it's a great way to get Spark running on EC2, it would be great to get the licensing sorted out. This appears to be the best place to raise this issue.

License missing from spark-ec2 repo
---
Key: SPARK-5676 URL: https://issues.apache.org/jira/browse/SPARK-5676 Project: Spark Issue Type: Bug Components: EC2 Reporter: Florian Verhein

There is no LICENSE file or license headers in the code in the spark-ec2 repo. Also, I believe there is no contributor license agreement notification in place (like there is in the main Spark repo). It would be great to fix this (sooner rather than later, while the contributors list is small), so that users wishing to use this part of Spark are not in doubt over licensing issues.
[jira] [Commented] (SPARK-5676) License missing from spark-ec2 repo
[ https://issues.apache.org/jira/browse/SPARK-5676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313102#comment-14313102 ] Florian Verhein commented on SPARK-5676:
[~srowen] Yep, that's the one. True. However, it is the key part in providing the functionality of Spark deployment on EC2, which is documented quite prominently on the Spark site, and the entry point of which is in the Spark repo (ec2/spark_ec2.py). Bugs against this functionality are therefore also filed here under the EC2 component. I assume the decision to have a separate repo was for implementation/design reasons ( ?? ). Having spark_ec2.py cause this repo to be cloned and executed on EC2 is a really nice way of providing the functionality. But that's an assumption on my part, and [~shivaram] would know best. So from a user perspective, it would appear to be part of Spark (users may not even be aware that part of the functionality lives in a separate repo). Since it's a great way to get Spark running on EC2, it would be great to get the licensing sorted out. This appears to be the best place to raise this issue.

License missing from spark-ec2 repo
---
Key: SPARK-5676 URL: https://issues.apache.org/jira/browse/SPARK-5676 Project: Spark Issue Type: Bug Components: EC2 Reporter: Florian Verhein

There is no LICENSE file or license headers in the code in the spark-ec2 repo. Also, I believe there is no contributor license agreement notification in place (like there is in the main Spark repo). It would be great to fix this (sooner rather than later, while the contributors list is small), so that users wishing to use this part of Spark are not in doubt over licensing issues.
[jira] [Commented] (SPARK-5676) License missing from spark-ec2 repo
[ https://issues.apache.org/jira/browse/SPARK-5676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313135#comment-14313135 ] Florian Verhein commented on SPARK-5676:
Makes sense. Thanks.

License missing from spark-ec2 repo
---
Key: SPARK-5676 URL: https://issues.apache.org/jira/browse/SPARK-5676 Project: Spark Issue Type: Bug Components: EC2 Reporter: Florian Verhein

There is no LICENSE file or license headers in the code in the spark-ec2 repo. Also, I believe there is no contributor license agreement notification in place (like there is in the main Spark repo). It would be great to fix this (sooner rather than later, while the contributors list is small), so that users wishing to use this part of Spark are not in doubt over licensing issues.
[jira] [Created] (SPARK-5676) License missing from spark-ec2 repo
Florian Verhein created SPARK-5676:
--
Summary: License missing from spark-ec2 repo
Key: SPARK-5676 URL: https://issues.apache.org/jira/browse/SPARK-5676 Project: Spark Issue Type: Bug Components: EC2 Reporter: Florian Verhein

There is no LICENSE file or license headers in the code in the spark-ec2 repo. Also, I believe there is no contributor license agreement notification in place (like there is in the main Spark repo). It would be great to fix this (sooner rather than later, while the contributors list is small), so that users wishing to use this part of Spark are not in doubt over licensing issues.
[jira] [Commented] (SPARK-3185) SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting JOURNAL_FOLDER
[ https://issues.apache.org/jira/browse/SPARK-3185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308644#comment-14308644 ] Florian Verhein commented on SPARK-3185:
[~dvohra] Sure, but the exception is thrown by Tachyon... so you're not going to be able to fix it by changing the Spark build.

SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting JOURNAL_FOLDER
---
Key: SPARK-3185 URL: https://issues.apache.org/jira/browse/SPARK-3185 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.0.2
Environment: Amazon Linux AMI
[ec2-user@ip-172-30-1-145 ~]$ uname -a
Linux ip-172-30-1-145 3.10.42-52.145.amzn1.x86_64 #1 SMP Tue Jun 10 23:46:43 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
https://aws.amazon.com/amazon-linux-ami/2014.03-release-notes/
The build I used (and MD5 verified):
[ec2-user@ip-172-30-1-145 ~]$ wget http://supergsego.com/apache/spark/spark-1.0.2/spark-1.0.2-bin-hadoop2.tgz
Reporter: Jeremy Chambers

{code}
org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4
{code}

When I launch Spark 1.0.2 on Hadoop 2 in a new EC2 cluster, the above Tachyon exception is thrown when formatting JOURNAL_FOLDER. No exception occurs when I launch on Hadoop 1.
Launch used:
{code}
./spark-ec2 -k spark_cluster -i /home/ec2-user/kagi/spark_cluster.ppk --zone=us-east-1a --hadoop-major-version=2 --spot-price=0.0165 -s 3 launch sparkProd
{code}
{code}
log snippet
Formatting Tachyon Master @ ec2-54-80-49-244.compute-1.amazonaws.com
Formatting JOURNAL_FOLDER: /root/tachyon/libexec/../journal/
Exception in thread main java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4
    at tachyon.util.CommonUtils.runtimeException(CommonUtils.java:246)
    at tachyon.UnderFileSystemHdfs.init(UnderFileSystemHdfs.java:73)
    at tachyon.UnderFileSystemHdfs.getClient(UnderFileSystemHdfs.java:53)
    at tachyon.UnderFileSystem.get(UnderFileSystem.java:53)
    at tachyon.Format.main(Format.java:54)
Caused by: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4
    at org.apache.hadoop.ipc.Client.call(Client.java:1070)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
    at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
    at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
    at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:238)
    at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:203)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
    at tachyon.UnderFileSystemHdfs.init(UnderFileSystemHdfs.java:69)
    ... 3 more
Killed 0 processes
Killed 0 processes
ec2-54-167-219-159.compute-1.amazonaws.com: Killed 0 processes
ec2-54-198-198-17.compute-1.amazonaws.com: Killed 0 processes
ec2-54-166-36-0.compute-1.amazonaws.com: Killed 0 processes
---end snippet---
{code}

*I don't have this problem when I launch without --hadoop-major-version=2 (which defaults to Hadoop 1.x).*
[jira] [Created] (SPARK-5641) Allow spark_ec2.py to copy arbitrary files to cluster via deploy.generic
Florian Verhein created SPARK-5641:
--
Summary: Allow spark_ec2.py to copy arbitrary files to cluster via deploy.generic
Key: SPARK-5641 URL: https://issues.apache.org/jira/browse/SPARK-5641 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Florian Verhein Priority: Minor

Useful if binary files need to be uploaded, e.g. I use this for rpm transfer, to install extra stuff at cluster deployment time. Could also be used to override what's on the image, etc. The idea is that the user can just dump the files into deploy.generic. This can be implemented by modifying deploy_templates so that it simply copies the file (if it is of certain types), rather than treating it as a text file and replacing template variables.
[jira] [Commented] (SPARK-5552) Automated data science AMI creation and data science cluster deployment on EC2
[ https://issues.apache.org/jira/browse/SPARK-5552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14304412#comment-14304412 ] Florian Verhein commented on SPARK-5552:
Thanks [~srowen]. So it wouldn't fit in the Spark repo itself (the only change there would be to add an option in spark_ec2.py to use an alternate spark-ec2 repo/branch). It would naturally live in spark-ec2, as it involves changes to spark-ec2 for both use cases:
- Image creation is based on the work soon to be added to spark-ec2 for this: https://issues.apache.org/jira/browse/SPARK-3821
- Cluster deployment+configuration is done using the spark-ec2 scripts themselves (but with many modifications/fixes).
Since there is a dependency between the image and the configuration (init.sh and setup.sh) scripts, it's not possible to solve this with just an AMI. The extra components (actually, just vowpal wabbit and more Python libraries - the rest already exists in the spark-ec2 AMI) are just added to the image for data science convenience.

Automated data science AMI creation and data science cluster deployment on EC2
--
Key: SPARK-5552 URL: https://issues.apache.org/jira/browse/SPARK-5552 Project: Spark Issue Type: New Feature Components: EC2 Reporter: Florian Verhein

Issue created RE: https://github.com/mesos/spark-ec2/pull/90#issuecomment-72597154 (please read for background)
Goal: extend the spark-ec2 scripts to create an automated data science cluster deployment on EC2, suitable for almost(?)-production use.
Use cases:
- A user can build their own custom data science AMIs from a CentOS minimal image by calling a packer configuration (good defaults should be provided, with some options for flexibility).
- A user can then easily deploy a new (correctly configured) cluster using these AMIs, and do so as quickly as possible.
Components/modules: Spark + Tachyon + HDFS (on instance storage) + Python + R + vowpal wabbit + any rpms + ... + ganglia.
Focus is on reliability (rather than e.g. supporting many versions / dev testing) and speed of deployment. Use Hadoop 2 so there is the option to lift into YARN later.
My current solution is here: https://github.com/florianverhein/spark-ec2/tree/packer. It includes other fixes/improvements as needed to get it working. Now that it seems to work (but has deviated a lot more from the existing code base than I was expecting), I'm wondering what to do with it... Keen to hear ideas if anyone is interested.
[jira] [Created] (SPARK-5552) Automated data science AMIs creation and cluster deployment on EC2
Florian Verhein created SPARK-5552:
--
Summary: Automated data science AMIs creation and cluster deployment on EC2
Key: SPARK-5552
URL: https://issues.apache.org/jira/browse/SPARK-5552
Project: Spark
Issue Type: New Feature
Components: EC2
Reporter: Florian Verhein
[jira] [Updated] (SPARK-5552) Automated data science AMI creation and data science cluster deployment on EC2
[ https://issues.apache.org/jira/browse/SPARK-5552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Florian Verhein updated SPARK-5552:
--
Summary: Automated data science AMI creation and data science cluster deployment on EC2 (was: Automated data science AMIs creation and cluster deployment on EC2)
[jira] [Commented] (SPARK-3185) SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting JOURNAL_FOLDER
[ https://issues.apache.org/jira/browse/SPARK-3185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290923#comment-14290923 ]

Florian Verhein commented on SPARK-3185:

Sure [~grzegorz-dubicki]. You need to build with the correct version profiles. See for example:
https://github.com/florianverhein/spark-ec2/blob/packer/spark/init.sh
https://github.com/florianverhein/spark-ec2/blob/packer/tachyon/init.sh
Note that I'm using Hadoop 2.4.1 (which I install on the image).

SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting JOURNAL_FOLDER
--
Key: SPARK-3185
URL: https://issues.apache.org/jira/browse/SPARK-3185
Project: Spark
Issue Type: Bug
Affects Versions: 1.0.2
Environment: Amazon Linux AMI
[ec2-user@ip-172-30-1-145 ~]$ uname -a
Linux ip-172-30-1-145 3.10.42-52.145.amzn1.x86_64 #1 SMP Tue Jun 10 23:46:43 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
https://aws.amazon.com/amazon-linux-ami/2014.03-release-notes/
The build I used (and MD5 verified):
[ec2-user@ip-172-30-1-145 ~]$ wget http://supergsego.com/apache/spark/spark-1.0.2/spark-1.0.2-bin-hadoop2.tgz
Reporter: Jeremy Chambers

{code}
org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4
{code}

When I launch Spark 1.0.2 on Hadoop 2 in a new EC2 cluster, the above Tachyon exception is thrown when formatting JOURNAL_FOLDER. No exception occurs when I launch on Hadoop 1.
Launch used:
{code}
./spark-ec2 -k spark_cluster -i /home/ec2-user/kagi/spark_cluster.ppk --zone=us-east-1a --hadoop-major-version=2 --spot-price=0.0165 -s 3 launch sparkProd
{code}

{code}
--- log snippet ---
Formatting Tachyon Master @ ec2-54-80-49-244.compute-1.amazonaws.com
Formatting JOURNAL_FOLDER: /root/tachyon/libexec/../journal/
Exception in thread "main" java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4
	at tachyon.util.CommonUtils.runtimeException(CommonUtils.java:246)
	at tachyon.UnderFileSystemHdfs.<init>(UnderFileSystemHdfs.java:73)
	at tachyon.UnderFileSystemHdfs.getClient(UnderFileSystemHdfs.java:53)
	at tachyon.UnderFileSystem.get(UnderFileSystem.java:53)
	at tachyon.Format.main(Format.java:54)
Caused by: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4
	at org.apache.hadoop.ipc.Client.call(Client.java:1070)
	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
	at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source)
	at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
	at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
	at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
	at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:238)
	at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:203)
	at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
	at tachyon.UnderFileSystemHdfs.<init>(UnderFileSystemHdfs.java:69)
	... 3 more
Killed 0 processes
Killed 0 processes
ec2-54-167-219-159.compute-1.amazonaws.com: Killed 0 processes
ec2-54-198-198-17.compute-1.amazonaws.com: Killed 0 processes
ec2-54-166-36-0.compute-1.amazonaws.com: Killed 0 processes
--- end snippet ---
{code}

*I don't have this problem when I launch without --hadoop-major-version=2 (which defaults to Hadoop 1.x).*
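The comment above says the fix is to build with the correct version profiles. As a sketch (not the actual init.sh contents - flags follow the Spark 1.x build documentation, and the Hadoop version is the 2.4.1 used elsewhere in this thread), these are the Maven invocations one would expect, printed rather than executed here:

```shell
# Sketch only: pin Spark and Tachyon to the same Hadoop client version so
# their HDFS clients speak the same IPC version as the NameNode.
HADOOP_VERSION=2.4.1

# Spark: select the hadoop-2.4 profile and pin the exact client version.
SPARK_BUILD_CMD="mvn -Phadoop-2.4 -Dhadoop.version=$HADOOP_VERSION -DskipTests clean package"

# Tachyon: rebuild against the same Hadoop version (assumed property name,
# per Tachyon's build docs of that era).
TACHYON_BUILD_CMD="mvn -Dhadoop.version=$HADOOP_VERSION -DskipTests clean package"

# Print the commands; run each inside the respective source checkout.
echo "spark:   $SPARK_BUILD_CMD"
echo "tachyon: $TACHYON_BUILD_CMD"
```

Run each command from the corresponding source checkout; the key point is that both builds agree on `hadoop.version`.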
[jira] [Updated] (SPARK-5331) Spark workers can't find tachyon master as spark-ec2 doesn't set spark.tachyonStore.url
[ https://issues.apache.org/jira/browse/SPARK-5331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Florian Verhein updated SPARK-5331:
--
Component/s: EC2

Description:
ps -ef | grep Tachyon shows Tachyon running on the master (and the slave) node with the correct setting:
-Dtachyon.master.hostname=ec2-54-252-156-187.ap-southeast-2.compute.amazonaws.com
However, from the stderr log on a worker running the SparkTachyonPi example:

15/01/20 06:00:56 INFO CacheManager: Partition rdd_0_0 not found, computing it
15/01/20 06:00:56 INFO : Trying to connect master @ localhost/127.0.0.1:19998
15/01/20 06:00:56 ERROR : Failed to connect (1) to master localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
15/01/20 06:00:57 ERROR : Failed to connect (2) to master localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
15/01/20 06:00:58 ERROR : Failed to connect (3) to master localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
15/01/20 06:00:59 ERROR : Failed to connect (4) to master localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
15/01/20 06:01:00 ERROR : Failed to connect (5) to master localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
15/01/20 06:01:01 WARN TachyonBlockManager: Attempt 1 to create tachyon dir null failed
java.io.IOException: Failed to connect to master localhost/127.0.0.1:19998 after 5 attempts
	at tachyon.client.TachyonFS.connect(TachyonFS.java:293)
	at tachyon.client.TachyonFS.getFileId(TachyonFS.java:1011)
	at tachyon.client.TachyonFS.exist(TachyonFS.java:633)
	at org.apache.spark.storage.TachyonBlockManager$$anonfun$createTachyonDirs$2.apply(TachyonBlockManager.scala:117)
	at org.apache.spark.storage.TachyonBlockManager$$anonfun$createTachyonDirs$2.apply(TachyonBlockManager.scala:106)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
	at org.apache.spark.storage.TachyonBlockManager.createTachyonDirs(TachyonBlockManager.scala:106)
	at org.apache.spark.storage.TachyonBlockManager.<init>(TachyonBlockManager.scala:57)
	at org.apache.spark.storage.BlockManager.tachyonStore$lzycompute(BlockManager.scala:94)
	at org.apache.spark.storage.BlockManager.tachyonStore(BlockManager.scala:88)
	at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:773)
	at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638)
	at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:145)
	at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
	at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
	at org.apache.spark.scheduler.Task.run(Task.scala:56)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Caused by: tachyon.org.apache.thrift.TException: Failed to connect to master localhost/127.0.0.1:19998 after 5 attempts
	at tachyon.master.MasterClient.connect(MasterClient.java:178)
	at tachyon.client.TachyonFS.connect(TachyonFS.java:290)
	... 28 more
Caused by: tachyon.org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
	at tachyon.org.apache.thrift.transport.TSocket.open(TSocket.java:185)
	at tachyon.org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:81)
	at tachyon.master.MasterClient.connect(MasterClient.java:156)
	... 29 more
Caused by: java.net.ConnectException: Connection refused
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
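The issue title says spark-ec2 never sets spark.tachyonStore.url, and in Spark 1.2 that property defaults to tachyon://localhost:19998, which matches the "connect master @ localhost" attempts in the log above. A sketch of a workaround (not from the ticket; the hostname and conf directory below are placeholders for your deployment):

```shell
# Sketch: point executors at the real Tachyon master via spark-defaults.conf.
# TACHYON_MASTER is the master's public hostname from the log above;
# SPARK_CONF_DIR is wherever your Spark conf lives (placeholder default).
TACHYON_MASTER=ec2-54-252-156-187.ap-southeast-2.compute.amazonaws.com
SPARK_CONF_DIR=${SPARK_CONF_DIR:-./spark-conf}

mkdir -p "$SPARK_CONF_DIR"
echo "spark.tachyonStore.url tachyon://$TACHYON_MASTER:19998" \
    >> "$SPARK_CONF_DIR/spark-defaults.conf"
```

Restart the executors after the change so they pick up the new default.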
[jira] [Commented] (SPARK-3185) SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting JOURNAL_FOLDER
[ https://issues.apache.org/jira/browse/SPARK-3185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283493#comment-14283493 ]

Florian Verhein commented on SPARK-3185:

I built Tachyon with the correct Hadoop version; that fixed this problem for me.
Correction: Spark 1.2.0 uses Tachyon 0.5.0 as far as I can see... but the spark-ec2 config is for Tachyon 0.4.1 (and this causes a few problems when actually trying to use Tachyon).
[jira] [Created] (SPARK-5331) Tachyon workers seem to ignore tachyon.master.hostname and use localhost instead
Florian Verhein created SPARK-5331:
--
Summary: Tachyon workers seem to ignore tachyon.master.hostname and use localhost instead
Key: SPARK-5331
URL: https://issues.apache.org/jira/browse/SPARK-5331
Project: Spark
Issue Type: Bug
Environment: Running on EC2 via modified spark-ec2 scripts (to get dependencies right so Tachyon starts). Using Tachyon 0.5.0 built against Hadoop 2.4.1; Spark 1.2.0 built against Tachyon 0.5.0 and Hadoop 2.4.1. Tachyon configured using the template in 0.5.0 but updated with the slave list and master variables etc.
Reporter: Florian Verhein
[jira] [Commented] (SPARK-3185) SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting JOURNAL_FOLDER
[ https://issues.apache.org/jira/browse/SPARK-3185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276436#comment-14276436 ]

Florian Verhein commented on SPARK-3185:

I'm also getting this, though with "Server IPC version 9" now that I'm using Hadoop 2.4.1 (a modification of the various hadoop init.sh scripts). I'm also using Spark 1.2.0. My understanding is that spark-1.2.0-bin-hadoop2.4.tgz is built against Hadoop 2.4 and Tachyon 0.4.1. But I suspect the Tachyon 0.4.1 that is installed by the spark-ec2 scripts is built against Hadoop 1... Does this mean building Tachyon against Hadoop 2.4.1 would fix this?
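The thread mentions IPC versions 4, 7, and 9 in different setups. As a rough decoder (an editorial aid, not from the ticket; the mapping reflects how these errors are commonly reported), each Hadoop RPC generation bumps the wire IPC version, so the mismatch message tells you which Hadoop line each side was built against:

```shell
# Rough mapping from "Server IPC version X cannot communicate with client
# version Y" to the Hadoop release line each side was compiled against.
ipc_to_hadoop() {
    case "$1" in
        4) echo "Hadoop 0.20.x / 1.x client libraries" ;;
        7) echo "Hadoop 2.0.x (e.g. CDH4) client libraries" ;;
        9) echo "Hadoop 2.2+ client libraries" ;;
        *) echo "unknown IPC version $1" ;;
    esac
}

# The original report ("server 7, client 4") would read as: HDFS is a
# Hadoop 2.0.x/CDH4 build, while Tachyon was built against Hadoop 1.x.
ipc_to_hadoop 7
ipc_to_hadoop 4
```

Under this reading, rebuilding Tachyon against the cluster's Hadoop version (as the comment suggests) is exactly what makes the two sides agree.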
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276572#comment-14276572 ]

Florian Verhein commented on SPARK-3821:

Thanks [~nchammas], that makes sense. Created SPARK-5241. I'm not sure about the pre-built scenario, but am guessing e.g. http://s3.amazonaws.com/spark-related-packages/spark-1.2.0-bin-hadoop2.4.tgz != http://s3.amazonaws.com/spark-related-packages/spark-1.2.0-bin-cdh4.tgz. So perhaps the intent is that the spark-ec2 scripts only support CDH distributions...

Develop an automated way of creating Spark images (AMI, Docker, and others)
--
Key: SPARK-3821
URL: https://issues.apache.org/jira/browse/SPARK-3821
Project: Spark
Issue Type: Improvement
Components: Build, EC2
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
Attachments: packer-proposal.html

Right now the creation of Spark AMIs or Docker containers is done manually. With tools like [Packer|http://www.packer.io/], we should be able to automate this work, and do so in such a way that multiple types of machine images can be created from a single template.
[jira] [Created] (SPARK-5241) spark-ec2 spark init scripts do not handle all hadoop (or tachyon?) dependencies correctly
Florian Verhein created SPARK-5241:
--
Summary: spark-ec2 spark init scripts do not handle all hadoop (or tachyon?) dependencies correctly
Key: SPARK-5241
URL: https://issues.apache.org/jira/browse/SPARK-5241
Project: Spark
Issue Type: Bug
Components: Build, EC2
Reporter: Florian Verhein

spark-ec2/spark/init.sh doesn't completely adhere to the hadoop dependencies. This may also be an issue for the tachyon dependencies. Related: tachyon appears to require builds against the right version of hadoop as well (probably the cause of SPARK-3185).

This applies to the spark build from a git checkout in spark/init.sh (I suspect this should also be changed to use mvn, as that's the reference build according to the docs?). It may apply to the pre-built spark in spark/init.sh as well, but I'm not sure about this - e.g. I thought that the hadoop2.4 and cdh4.2 builds of spark are different.

Also note that hadoop native is built from hadoop 2.4.1 on the AMI, and this is used regardless of HADOOP_MAJOR_VERSION in the *-hdfs modules. Tachyon is hard-coded to 0.4.1 (which is probably built against hadoop 1.x?).
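A hypothetical sketch of the kind of mapping spark/init.sh would need so the pre-built Spark package actually matches HADOOP_MAJOR_VERSION. The package names are the real spark-related-packages artifacts for 1.2.0; the "1"/"2"/"yarn" values and the "2" = CDH4 reading follow spark-ec2's convention as discussed in this thread, but the exact mapping here is an assumption, not the scripts' actual behavior:

```shell
# Hypothetical version-aware package selection for spark/init.sh.
SPARK_VERSION=1.2.0

spark_package_for() {
    # $1 is HADOOP_MAJOR_VERSION as spark-ec2 passes it ("1", "2", "yarn").
    case "$1" in
        1)    echo "spark-$SPARK_VERSION-bin-hadoop1.tgz" ;;
        2)    echo "spark-$SPARK_VERSION-bin-cdh4.tgz" ;;     # "2" means CDH4 here
        yarn) echo "spark-$SPARK_VERSION-bin-hadoop2.4.tgz" ;;
        *)    echo "unsupported HADOOP_MAJOR_VERSION: $1" >&2; return 1 ;;
    esac
}

# The chosen name would then be appended to the spark-related-packages URL.
spark_package_for 2
```

The point of the sketch is only that the download choice must be a function of HADOOP_MAJOR_VERSION, not a hard-coded name.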
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276263#comment-14276263 ]

Florian Verhein commented on SPARK-3821:

This is great stuff! It'll also help serve as some documentation for AMI requirements when using the spark-ec2 scripts.

Re the above, I think everything in create_image.sh can be refactored to packer (+ duplicate removal - e.g. root login). I've attempted to do this in a fork of [~nchammas]'s work, but my use case is a bit different in that I need to go from a fresh centos6 minimal image (rather than an Amazon Linux AMI) and then add other things.

Possibly related to AMI generation in general: I've noticed that the version dependencies in the spark-ec2 scripts are broken. I suspect this will need to be handled in both the image and the setup. For example:
- It looks like Spark needs to be built with the right hadoop profile to work, but this isn't adhered to. This applies whether spark is built from a git checkout or from an existing build. This is likely also the case with Tachyon. Probably the cause of https://issues.apache.org/jira/browse/SPARK-3185
- The hadoop native libs are built on the image using 2.4.1, but then copied into whatever hadoop build is downloaded in the ephemeral-hdfs and persistent-hdfs scripts. I suspect that could cause issues too. Since building hadoop is very time consuming, it's something you'd want on the image - hence creating a dependency.
- The version dependencies for other things like ganglia aren't documented (I believe it is installed on the image but duplicated again in spark-ec2/ganglia). I've found that the ganglia config doesn't work for me (but recall I'm using a different base AMI, so I'll likely get a different ganglia version). I have a sneaking suspicion that the hadoop configs in spark-ec2 won't work across the hadoop versions either (but, fingers crossed!).

Re the above, I might try keeping the entire hadoop build (from the image creation) for the hdfs setup. Sorry for the sidetrack, but I'm struggling through all this, so hoping it might ring a bell for someone.

p.s. With the image automation, it might also be worth considering putting more on the image as an option (esp. for people happy to build their own AMIs). For example, I see no reason why the module init.sh scripts can't be run from packer in order to speed up start-up times of the cluster :)
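The p.s. above suggests running the module init.sh scripts from Packer at image-build time instead of cluster-launch time. A minimal illustration of that idea (not the actual packer-proposal.html attachment; all IDs, paths, and names below are placeholders):

```shell
# Write a minimal Packer template: one amazon-ebs builder plus shell
# provisioners that bake the module init scripts into the AMI itself.
cat > spark-ami.json <<'EOF'
{
  "builders": [{
    "type": "amazon-ebs",
    "region": "us-east-1",
    "source_ami": "ami-xxxxxxxx",
    "instance_type": "m3.large",
    "ssh_username": "root",
    "ami_name": "spark-data-science-{{timestamp}}"
  }],
  "provisioners": [
    { "type": "shell", "script": "create_image.sh" },
    { "type": "shell", "script": "spark/init.sh" }
  ]
}
EOF
```

One would then run `packer build spark-ami.json`; cluster launch only needs the (fast) setup.sh steps, since the slow installs already happened at bake time.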