[jira] [Commented] (SPARK-6664) Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
[ https://issues.apache.org/jira/browse/SPARK-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394147#comment-14394147 ]

Florian Verhein commented on SPARK-6664:

I guess the other thing is - we can union RDDs, so why not be able to 'undo' that?

Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
----------------------------------------------------------------------

Key: SPARK-6664
URL: https://issues.apache.org/jira/browse/SPARK-6664
Project: Spark
Issue Type: New Feature
Components: Spark Core
Reporter: Florian Verhein

I can't find this functionality (if I missed something, apologies!), but it would be very useful for evaluating ML models.

*Use case example*

Suppose you have pre-processed web logs for a few months, and now want to split them into a training set (where you train a model to predict some aspect of site accesses, perhaps per user) and an out-of-time test set (where you evaluate how well your model performs in the future). This example has just a single split, but in general you could want more for cross validation. You may also want multiple overlapping intervals.

*Specification*

1. Given an ordered RDD and an ordered sequence of n boundaries (i.e. keys), return n+1 RDDs such that values in the ith RDD fall between the (i-1)th and ith boundary.
2. More complex alternative (but similar under the hood): provide a sequence of possibly overlapping intervals (ordered by the start key of the interval), and return the RDDs containing values within those intervals.

*Implementation ideas / notes for 1*

- The ordered RDDs are likely RangePartitioned (or there should be a simple way to find ranges from partitions in an ordered RDD)
- Find the partitions containing each boundary, and split them in two.
- Construct the new RDDs from the original partitions (and any split ones)

I suspect this could be done by launching only a few jobs to split the partitions containing the boundaries. Alternatively, it might be possible to decorate these partitions and use them in more than one RDD. I.e. let one of these partitions (for boundary i) be p. Apply two decorators p' and p'', where p' masks out values above the ith boundary, and p'' masks out values below the ith boundary. Any operations on these partitions apply only to values not masked out. Then assign p' to the ith output RDD and p'' to the (i+1)th output RDD. If I understand Spark correctly, this should not require any jobs. Not sure whether it's worth trying this optimisation.

*Implementation ideas / notes for 2*

This is very similar, except that we have to handle entire partitions (or parts of partitions) belonging to more than one output RDD, since the intervals are no longer mutually exclusive. But since RDDs are immutable(??), the decorator idea should still work? Thoughts?

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
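The splitting semantics of spec 1 can be illustrated outside Spark. Below is a minimal plain-Python sketch (not Spark code; `split_by_boundaries` is a hypothetical name) that splits an ordered key-value sequence at n boundary keys into n+1 buckets:

```python
import bisect

def split_by_boundaries(sorted_pairs, boundaries):
    """Split an ordered sequence of (key, value) pairs at n boundary keys,
    returning n+1 buckets: bucket 0 holds keys below the first boundary,
    bucket i holds keys in [boundaries[i-1], boundaries[i])."""
    keys = [k for k, _ in sorted_pairs]
    buckets = []
    start = 0
    for b in boundaries:
        end = bisect.bisect_left(keys, b)  # first index with key >= b
        buckets.append(sorted_pairs[start:end])
        start = end
    buckets.append(sorted_pairs[start:])   # everything at/after the last boundary
    return buckets

pairs = [(1, 'a'), (3, 'b'), (5, 'c'), (7, 'd'), (9, 'e')]
print(split_by_boundaries(pairs, [4, 8]))
# → [[(1, 'a'), (3, 'b')], [(5, 'c'), (7, 'd')], [(9, 'e')]]
```

In the proposed RDD version, the `bisect` step corresponds to locating the partitions containing each boundary via the RangePartitioner, and the slicing corresponds to splitting those partitions.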
[jira] [Commented] (SPARK-6664) Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
[ https://issues.apache.org/jira/browse/SPARK-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394141#comment-14394141 ]

Florian Verhein commented on SPARK-6664:

Thanks [~sowen]. I disagree :-) ... If you think there's non-stationarity, you almost certainly want to see how well a model trained in the past holds up in the future (possibly with more than one out-of-time sample if one is used for pruning, etc.), and you can do this for temporal data by adjusting the way you do cross validation. Actually, the exact method you describe is one common approach for time series data, e.g. see http://stats.stackexchange.com/questions/14099/using-k-fold-cross-validation-for-time-series-model-selection

Doing this multiple times does exactly what it does for normal cross validation - it gives you a distribution (a sample) of your error estimate, rather than a single value. So it's quite important. The size of the data isn't really relevant to this argument (also consider that I might like to employ larger datasets to remove the risk of overfitting a more complex but better-fitting model, rather than to improve my error estimates).

Note that this proposal doesn't define how the split RDDs are used (i.e. unioned) to create training and test sets. So the test set can be a single RDD, or multiple ones; it's entirely up to the user.

Allowing overlapping intervals (i.e. part 2) is a little different, because you probably wouldn't union the resulting RDDs due to duplication. It would be more useful as a primitive for bootstrapping the performance measures of streaming models or simulations (so you're not resampling records, but resampling subsequences). Alternatively, if you have big data but a class imbalance problem, you might need to resort to overlaps in the training sets to get multiple test sets with enough examples of your minority class.

From what I understand, MLUtils.kFold is standard randomised k-fold cross validation *but without shuffling* (from a cursory look at the code, it looks like ordering will always be maintained... which should probably be documented if that is the case, because it can lead to bad things... and adds another argument for #6665). Either way, since the elements of its splits are non-consecutive, it's not applicable to time series.

Do you know how the performance of filterByRange would compare? It should be pretty performant if and only if the data is RangePartitioned, right?

Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
----------------------------------------------------------------------

Key: SPARK-6664
URL: https://issues.apache.org/jira/browse/SPARK-6664
Project: Spark
Issue Type: New Feature
Components: Spark Core
Reporter: Florian Verhein
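The out-of-time evaluation scheme discussed in this thread (train on the past, test on the adjacent future, repeated to get a distribution of the error estimate) can be sketched in plain Python; `forward_chaining_splits` is a hypothetical helper, not a Spark or MLlib API:

```python
def forward_chaining_splits(ordered, n_folds):
    """Yield (train, test) splits for time-ordered data: fold i trains on an
    initial prefix and tests on the immediately following block, so each test
    set is 'out of time' relative to its training set."""
    fold_size = len(ordered) // (n_folds + 1)
    for i in range(1, n_folds + 1):
        train = ordered[:i * fold_size]
        test = ordered[i * fold_size:(i + 1) * fold_size]
        yield train, test

# Pretend these indices are time-ordered records.
for train, test in forward_chaining_splits(list(range(12)), 3):
    print(len(train), test)
```

With the proposed boundary-splitting primitive, each `train`/`test` pair would be built by unioning the appropriate split RDDs rather than slicing a list.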
[jira] [Commented] (SPARK-6665) Randomly Shuffle an RDD
[ https://issues.apache.org/jira/browse/SPARK-6665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394291#comment-14394291 ]

Florian Verhein commented on SPARK-6665:

Fair enough. I'll have to implement it because I need it, so I may as well report back when I've had the chance to (perhaps there's a better place for it - e.g. not in the core API).

Randomly Shuffle an RDD
-----------------------

Key: SPARK-6665
URL: https://issues.apache.org/jira/browse/SPARK-6665
Project: Spark
Issue Type: New Feature
Components: Spark Shell
Reporter: Florian Verhein
Priority: Minor

*Use case*

An RDD is created in a way that has some ordering, but you need to shuffle it because the ordering would cause problems downstream. E.g.
- it will be used to train an ML algorithm that makes stochastic assumptions (like SGD)
- it is used as input for cross validation: after the shuffle, you could just grab partitions (or part files, if saved to HDFS) as folds

Related question on the mailing list: http://apache-spark-user-list.1001560.n3.nabble.com/random-shuffle-streaming-RDDs-td17965.html

*Possible implementation*

As mentioned by [~sowen] in the above thread, one could sort by a good hash of the element (or key, if it's paired) and a random salt.
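The sort-by-salted-hash idea from the thread can be sketched in plain Python; sorting a local list stands in for Spark's sortBy, and `shuffle_by_salted_hash` is a hypothetical name:

```python
import hashlib
import random

def shuffle_by_salted_hash(items, salt=None):
    """Permute items by sorting on a cryptographic hash of (salt, item).
    A fresh random salt gives an independent permutation on each call;
    a fixed salt makes the permutation reproducible."""
    if salt is None:
        salt = str(random.random())

    def sort_key(item):
        return hashlib.sha256((salt + repr(item)).encode()).digest()

    return sorted(items, key=sort_key)

print(shuffle_by_salted_hash(list(range(10)), salt="demo"))
```

In Spark this would amount to `rdd.sortBy(x => hash(salt, x))`, which incurs one shuffle but leaves the result evenly partitioned, which is what makes the grab-partitions-as-folds trick work.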
[jira] [Commented] (SPARK-6665) Randomly Shuffle an RDD
[ https://issues.apache.org/jira/browse/SPARK-6665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394089#comment-14394089 ]

Florian Verhein commented on SPARK-6665:

Thanks for the quick response [~sowen]. I agree with your observation, but consider a) distributing the cross validation itself (so one job will achieve all the training and scoring on the k fold selections), and b) using the pre-processed and shuffled data for non-Spark modelling, such as in R, Python or Vowpal Wabbit (perhaps all running within Spark jobs, using something like sc.parallelize(jobs, jobs.size).map(_()) to treat Spark as a grid). So if the splits already exist on HDFS it is very easy to use them - and since you can easily control the number of partitions, this gives a very simple way to quickly get something up and running in R or Python, even if the data is big. But this is really just a nice data-science hacking side effect of this feature, rather than a driving use case.

I don't really agree that taking random subsamples is better, because you run the risk of never selecting some instances. I agree that the most important use case is random order for subsequent serial access (but disagree that it's limited to small RDDs). For example, if you use Spark for pre-processing followed by a large-scale learner like Vowpal Wabbit (note that vw has features that MLlib SGD doesn't have yet), the data should be shuffled, since vw processes out of core and so cannot perform the randomisation itself through order selection (and it would really slow down the algorithm if it did).

It's worth pointing out that shuffling a dataset is a common enough operation for it to exist in other big data frameworks - e.g. I've used it in ML pipelines written in Scoobi and Scalding. I haven't implemented it myself, but I'm pretty sure it's non-trivial to make it performant with good randomness properties. So I think there's a good case to add it.

Randomly Shuffle an RDD
-----------------------

Key: SPARK-6665
URL: https://issues.apache.org/jira/browse/SPARK-6665
Project: Spark
Issue Type: New Feature
Components: Spark Shell
Reporter: Florian Verhein
Priority: Minor
[jira] [Updated] (SPARK-6664) Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
[ https://issues.apache.org/jira/browse/SPARK-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Florian Verhein updated SPARK-6664:
-----------------------------------

Description:

I can't find this functionality (if I missed something, apologies!), but it would be very useful for evaluating ML models.

*Use case example*

Suppose you have pre-processed web logs for a few months, and now want to split them into a training set (where you train a model to predict some aspect of site accesses, perhaps per user) and an out-of-time test set (where you evaluate how well your model performs in the future). This example has just a single split, but in general you could want more for cross validation. You may also want multiple overlapping intervals.

*Specification*

1. Given an ordered RDD and an ordered sequence of n boundaries (i.e. keys), return n+1 RDDs such that values in the ith RDD fall between the (i-1)th and ith boundary.
2. More complex alternative (but similar under the hood): provide a sequence of possibly overlapping intervals (ordered by the start key of the interval), and return the RDDs containing values within those intervals.

*Implementation ideas / notes for 1*

- The ordered RDDs are likely RangePartitioned (or there should be a simple way to find ranges from partitions in an ordered RDD)
- Find the partitions containing each boundary, and split them in two.
- Construct the new RDDs from the original partitions (and any split ones)

I suspect this could be done by launching only a few jobs to split the partitions containing the boundaries. Alternatively, it might be possible to decorate these partitions and use them in more than one RDD. I.e. let one of these partitions (for boundary i) be p. Apply two decorators p' and p'', where p' masks out values above the ith boundary, and p'' masks out values below the ith boundary. Any operations on these partitions apply only to values not masked out. Then assign p' to the ith output RDD and p'' to the (i+1)th output RDD. If I understand Spark correctly, this should not require any jobs. Not sure whether it's worth trying this optimisation.

*Implementation ideas / notes for 2*

This is very similar, except that we have to handle entire partitions (or parts of partitions) belonging to more than one output RDD, since the intervals are no longer mutually exclusive. But since RDDs are immutable(??), the decorator idea should still work? Thoughts?

was: (the previous description: the same text with plain headings, without the clause requiring the intervals in part 2 to be ordered by their start key)

Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
----------------------------------------------------------------------

Key: SPARK-6664
URL: https://issues.apache.org/jira/browse/SPARK-6664
[jira] [Created] (SPARK-6665) Randomly Shuffle an RDD
Florian Verhein created SPARK-6665:
-----------------------------------

Summary: Randomly Shuffle an RDD
Key: SPARK-6665
URL: https://issues.apache.org/jira/browse/SPARK-6665
Project: Spark
Issue Type: New Feature
Components: Spark Shell
Reporter: Florian Verhein
Priority: Minor

*Use case*

An RDD is created in a way that has some ordering, but you need to shuffle it because the ordering would cause problems downstream. E.g.
- it will be used to train an ML algorithm that makes stochastic assumptions (like SGD)
- it is used as input for cross validation: after the shuffle, you could just grab partitions (or part files, if saved to HDFS) as folds

Related question on the mailing list: http://apache-spark-user-list.1001560.n3.nabble.com/random-shuffle-streaming-RDDs-td17965.html

*Possible implementation*

As mentioned by [~sowen] in the above thread, one could sort by a good hash of the element (or key, if it's paired) and a random salt.
[jira] [Commented] (SPARK-6664) Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
[ https://issues.apache.org/jira/browse/SPARK-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391950#comment-14391950 ]

Florian Verhein commented on SPARK-6664:

The closest existing approach I've found that should achieve the same result is calling OrderedRDDFunctions.filterByRange n+1 times. I assume this approach would be much slower, but... it may not be if it's completely lazy (??). I don't know Spark well enough yet to be anywhere near sure of this.

Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
----------------------------------------------------------------------

Key: SPARK-6664
URL: https://issues.apache.org/jira/browse/SPARK-6664
Project: Spark
Issue Type: New Feature
Components: Spark Core
Reporter: Florian Verhein
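The n+1 filterByRange workaround can be sketched in plain Python. `filter_by_range` below is a local stand-in, using half-open intervals for simplicity (Spark's filterByRange(lower, upper) is inclusive of both endpoints); the point it illustrates is that every bucket re-filters the whole dataset, whereas a true split would make a single pass:

```python
def filter_by_range(sorted_pairs, lower, upper):
    """Stand-in for a range filter over (key, value) pairs: keep pairs whose
    key k satisfies lower <= k < upper, where None means unbounded."""
    return [(k, v) for k, v in sorted_pairs
            if (lower is None or k >= lower) and (upper is None or k < upper)]

def split_via_repeated_filters(sorted_pairs, boundaries):
    """Emulate calling the range filter n+1 times: one full scan per bucket."""
    edges = [None] + list(boundaries) + [None]
    return [filter_by_range(sorted_pairs, lo, hi)
            for lo, hi in zip(edges, edges[1:])]

pairs = [(1, 'a'), (3, 'b'), (5, 'c'), (7, 'd'), (9, 'e')]
print(split_via_repeated_filters(pairs, [4, 8]))
# → [[(1, 'a'), (3, 'b')], [(5, 'c'), (7, 'd')], [(9, 'e')]]
```

On a RangePartitioned RDD, Spark's filterByRange can prune whole partitions before filtering, which is why the comment's "performant if and only if RangePartitioned" intuition seems plausible.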
[jira] [Created] (SPARK-6664) Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
Florian Verhein created SPARK-6664:
-----------------------------------

Summary: Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
Key: SPARK-6664
URL: https://issues.apache.org/jira/browse/SPARK-6664
Project: Spark
Issue Type: New Feature
Components: Spark Core
Reporter: Florian Verhein

I can't find this functionality (if I missed something, apologies!), but it would be very useful for evaluating ML models.

Use case example: suppose you have pre-processed web logs for a few months, and now want to split them into a training set (where you train a model to predict some aspect of site accesses, perhaps per user) and an out-of-time test set (where you evaluate how well your model performs in the future). This example has just a single split, but in general you could want more for cross validation. You may also want multiple overlapping intervals.

Specification:
1. Given an ordered RDD and an ordered sequence of n boundaries (i.e. keys), return n+1 RDDs such that values in the ith RDD fall between the (i-1)th and ith boundary.
2. More complex alternative (but similar under the hood): provide a sequence of possibly overlapping intervals, and return the RDDs containing values within those intervals.

Implementation ideas / notes for 1:
- The ordered RDDs are likely RangePartitioned (or there should be a simple way to find ranges from partitions in an ordered RDD)
- Find the partitions containing each boundary, and split them in two.
- Construct the new RDDs from the original partitions (and any split ones)

I suspect this could be done by launching only a few jobs to split the partitions containing the boundaries. Alternatively, it might be possible to decorate these partitions and use them in more than one RDD. I.e. let one of these partitions (for boundary i) be p. Apply two decorators p' and p'', where p' masks out values above the ith boundary, and p'' masks out values below the ith boundary. Any operations on these partitions apply only to values not masked out. Then assign p' to the ith output RDD and p'' to the (i+1)th output RDD. If I understand Spark correctly, this should not require any jobs. Not sure whether it's worth trying this optimisation.

Implementation ideas / notes for 2:
This is very similar, except that we have to handle entire partitions (or parts of partitions) belonging to more than one output RDD, since the intervals are no longer mutually exclusive. But since RDDs are immutable(?), the decorator idea should still work? Thoughts?
[jira] [Updated] (SPARK-6601) Add HDFS NFS gateway module to spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-6601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Florian Verhein updated SPARK-6601:
-----------------------------------

Description: Add module hdfs-nfs-gateway, which sets up the gateway for (say) ephemeral-hdfs, as well as mounts (e.g. to /hdfs_nfs) on all nodes. Note: for NFS to be available outside AWS, this also requires [#6600]

was: (the same description, with the issue referenced as plain #6600 rather than linked as [#6600])

Add HDFS NFS gateway module to spark-ec2
----------------------------------------

Key: SPARK-6601
URL: https://issues.apache.org/jira/browse/SPARK-6601
Project: Spark
Issue Type: New Feature
Components: EC2
Reporter: Florian Verhein
[jira] [Created] (SPARK-6601) Add HDFS NFS gateway module to spark-ec2
Florian Verhein created SPARK-6601:
-----------------------------------

Summary: Add HDFS NFS gateway module to spark-ec2
Key: SPARK-6601
URL: https://issues.apache.org/jira/browse/SPARK-6601
Project: Spark
Issue Type: New Feature
Components: EC2
Reporter: Florian Verhein

Add module hdfs-nfs-gateway, which sets up the gateway for (say) ephemeral-hdfs, as well as mounts (e.g. to /hdfs_nfs) on all nodes. Note: for NFS to be available outside AWS, this also requires #6600
[jira] [Updated] (SPARK-6600) Open ports in spark-ec2.py to allow HDFS NFS gateway
[ https://issues.apache.org/jira/browse/SPARK-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Florian Verhein updated SPARK-6600:
-----------------------------------

Description:

Use case: a user has set up the Hadoop HDFS NFS gateway service on their spark-ec2.py launched cluster, and wants to mount it on their local machine. This requires the following ports to be opened in the incoming rule set for MASTER, for both UDP and TCP: 111, 2049, 4242. (I have tried this and it works.)

Note that this issue *does not* cover the implementation of an HDFS NFS gateway module in the spark-ec2 project. That should be a separate issue (TODO).

Reference: https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html

was: (the same description, referencing the r2.3.0 version of the HdfsNfsGateway documentation)

Open ports in spark-ec2.py to allow HDFS NFS gateway
----------------------------------------------------

Key: SPARK-6600
URL: https://issues.apache.org/jira/browse/SPARK-6600
Project: Spark
Issue Type: New Feature
Components: EC2
Reporter: Florian Verhein
[jira] [Updated] (SPARK-6600) Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway
[ https://issues.apache.org/jira/browse/SPARK-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Florian Verhein updated SPARK-6600:
-----------------------------------

Summary: Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway (was: Open ports in spark-ec2.py to allow HDFS NFS gateway)

Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway
--------------------------------------------------------

Key: SPARK-6600
URL: https://issues.apache.org/jira/browse/SPARK-6600
Project: Spark
Issue Type: New Feature
Components: EC2
Reporter: Florian Verhein
[jira] [Updated] (SPARK-6600) Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway
[ https://issues.apache.org/jira/browse/SPARK-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Florian Verhein updated SPARK-6600:
-----------------------------------

Description:

Use case: a user has set up the Hadoop HDFS NFS gateway service on their spark_ec2.py launched cluster, and wants to mount it on their local machine. This requires the following ports to be opened in the incoming rule set for MASTER, for both UDP and TCP: 111, 2049, 4242. (I have tried this and it works.)

Note that this issue *does not* cover the implementation of an HDFS NFS gateway module in the spark-ec2 project. See [#6601] for that.

Reference: https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html

was: (the same description, ending "That should be a separate issue (TODO)." instead of linking to [#6601])

Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway
--------------------------------------------------------

Key: SPARK-6600
URL: https://issues.apache.org/jira/browse/SPARK-6600
Project: Spark
Issue Type: New Feature
Components: EC2
Reporter: Florian Verhein
[jira] [Updated] (SPARK-6601) Add HDFS NFS gateway module to spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-6601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florian Verhein updated SPARK-6601:
---
Description:
Add a module hdfs-nfs-gateway, which sets up the gateway for (say) ephemeral-hdfs and mounts it (e.g. at /hdfs_nfs) on all nodes. Note: for NFS to be available outside AWS, this also requires #6600.

was:
Add a module hdfs-nfs-gateway, which sets up the gateway for (say) ephemeral-hdfs and mounts it (e.g. at /hdfs_nfs) on all nodes. Note: for NFS to be available outside AWS, this also requires [#6600].

Add HDFS NFS gateway module to spark-ec2
--
Key: SPARK-6601 URL: https://issues.apache.org/jira/browse/SPARK-6601 Project: Spark Issue Type: New Feature Components: EC2 Reporter: Florian Verhein

Add a module hdfs-nfs-gateway, which sets up the gateway for (say) ephemeral-hdfs and mounts it (e.g. at /hdfs_nfs) on all nodes. Note: for NFS to be available outside AWS, this also requires #6600.
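The proposed module would essentially run the few commands from the HdfsNfsGateway guide. A sketch only: the Hadoop home and mount point are assumptions about the spark-ec2 image layout, and `gateway_setup_commands` is a hypothetical helper, not an existing spark-ec2 function:

```python
# Hypothetical sketch of what an hdfs-nfs-gateway spark-ec2 module would run.
# Commands follow the Hadoop 2.4 HdfsNfsGateway guide; paths are assumptions.

HADOOP_HOME = "/root/ephemeral-hdfs"      # assumed spark-ec2 layout
MOUNT_POINT = "/hdfs_nfs"

def gateway_setup_commands(master_host):
    """Commands to start the gateway on the master and mount it on a node."""
    return [
        # start the portmap and nfs3 daemons on the gateway node (the master)
        f"{HADOOP_HOME}/sbin/hadoop-daemon.sh start portmap",
        f"{HADOOP_HOME}/sbin/hadoop-daemon.sh start nfs3",
        # mount HDFS via NFSv3 on each node, per the HdfsNfsGateway guide
        f"mkdir -p {MOUNT_POINT}",
        f"mount -t nfs -o vers=3,proto=tcp,nolock {master_host}:/ {MOUNT_POINT}",
    ]
```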
[jira] [Commented] (SPARK-5879) spark_ec2.py should expose/return master and slave lists (e.g. write to file)
[ https://issues.apache.org/jira/browse/SPARK-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328612#comment-14328612 ] Florian Verhein commented on SPARK-5879:
cc [~shivaram], any opinions on how to best do this?

spark_ec2.py should expose/return master and slave lists (e.g. write to file)
-
Key: SPARK-5879 URL: https://issues.apache.org/jira/browse/SPARK-5879 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Florian Verhein

After running spark_ec2.py, it is often useful or necessary to know the master's IP / DNS name, particularly if running spark_ec2.py is part of a larger pipeline. For example, consider a wrapper that launches a cluster, then waits for completion of some application running on it (e.g. polling via ssh), before destroying the cluster. Some options:
- write `launch-variables.sh` with MASTERS and SLAVES exports (i.e. basically a subset of the ec2_variables.sh that is temporarily created as part of deploy_files variable substitution)
- write `launch-variables.json` (the same info, but as JSON)
Both would be useful depending on the wrapper language. I think we should incorporate the cluster name, for the case where multiple clusters are launched, e.g. cluster_name_variables.sh/.json. Thoughts?
[jira] [Created] (SPARK-5879) spark_ec2.py should expose/return master and slave lists (e.g. write to file)
Florian Verhein created SPARK-5879:
--
Summary: spark_ec2.py should expose/return master and slave lists (e.g. write to file)
Key: SPARK-5879 URL: https://issues.apache.org/jira/browse/SPARK-5879 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Florian Verhein

After running spark_ec2.py, it is often useful or necessary to know the master's IP / DNS name, particularly if running spark_ec2.py is part of a larger pipeline. For example, consider a wrapper that launches a cluster, then waits for completion of some application running on it (e.g. polling via ssh), before destroying the cluster. Some options:
- write `launch-variables.sh` with MASTERS and SLAVES exports (i.e. basically a subset of the ec2_variables.sh that is temporarily created as part of deploy_files variable substitution)
- write `launch-variables.json` (the same info, but as JSON)
Both would be useful depending on the wrapper language. I think we should incorporate the cluster name, for the case where multiple clusters are launched, e.g. cluster_name_variables.sh/.json. Thoughts?
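The two output options above can be sketched as one small helper. This is an illustration only: `write_launch_variables` and the `<cluster_name>_variables.*` naming are assumptions from the proposal, not anything spark_ec2.py provides today:

```python
import json
import os

def write_launch_variables(cluster_name, masters, slaves, out_dir="."):
    """Write <cluster_name>_variables.sh and .json with the node lists.

    Hypothetical helper illustrating SPARK-5879; names and layout are
    assumptions, not existing spark_ec2.py behaviour.
    """
    sh_path = os.path.join(out_dir, f"{cluster_name}_variables.sh")
    json_path = os.path.join(out_dir, f"{cluster_name}_variables.json")
    with open(sh_path, "w") as f:
        # space-separated lists, easy to `source` from a shell wrapper
        f.write(f'export MASTERS="{" ".join(masters)}"\n')
        f.write(f'export SLAVES="{" ".join(slaves)}"\n')
    with open(json_path, "w") as f:
        json.dump({"cluster_name": cluster_name,
                   "masters": masters, "slaves": slaves}, f, indent=2)
    return sh_path, json_path
```

A shell wrapper could then do `source mycluster_variables.sh && ssh root@$MASTERS ...`, while a Python wrapper would read the JSON file.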
[jira] [Commented] (SPARK-5851) spark_ec2.py ssh failure retry handling not always appropriate
[ https://issues.apache.org/jira/browse/SPARK-5851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14324986#comment-14324986 ] Florian Verhein commented on SPARK-5851:
That makes sense. Yeah, I ran into it yesterday. My spark-ec2/setup.sh failed (I had `set -u` set in a new component I was testing), resulting in looping over setup.sh calls. In this case, spark_ec2.py shouldn't retry, but fail gracefully (ideally after performing cleanup of the cluster, and returning a failure code).

spark_ec2.py ssh failure retry handling not always appropriate
--
Key: SPARK-5851 URL: https://issues.apache.org/jira/browse/SPARK-5851 Project: Spark Issue Type: Bug Components: EC2 Reporter: Florian Verhein Priority: Minor

The following function doesn't distinguish between the ssh call failing (e.g. presumably a connection issue) and the remote command that it executes failing (e.g. setup.sh). The latter should probably not result in a retry. Perhaps tries could be an argument that is set to 1 for certain usages.
# Run a command on a host through ssh, retrying up to five times
# and then throwing an exception if ssh continues to fail.
spark-ec2: [{{def ssh(host, opts, command)}}|https://github.com/apache/spark/blob/d8f69cf78862d13a48392a0b94388b8d403523da/ec2/spark_ec2.py#L953-L975]
[jira] [Created] (SPARK-5851) spark_ec2.py ssh failure retry handling not always appropriate
Florian Verhein created SPARK-5851:
--
Summary: spark_ec2.py ssh failure retry handling not always appropriate
Key: SPARK-5851 URL: https://issues.apache.org/jira/browse/SPARK-5851 Project: Spark Issue Type: Bug Components: EC2 Reporter: Florian Verhein Priority: Minor

The following function doesn't distinguish between the ssh call failing (e.g. presumably a connection issue) and the remote command that it executes failing (e.g. setup.sh). The latter should probably not result in a retry. Perhaps tries could be an argument that is set to 1 for certain usages.
# Run a command on a host through ssh, retrying up to five times
# and then throwing an exception if ssh continues to fail.
def ssh(host, opts, command):
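One way to separate the two failure modes is ssh's exit status: OpenSSH exits with 255 when the connection or ssh itself fails, while any other nonzero status is the remote command's own exit code. A sketch only, not the actual spark_ec2.py implementation (which shells out via subprocess); `run` is an injectable stand-in for the subprocess call, purely to make the policy testable:

```python
import time

class RemoteCommandError(Exception):
    """The remote command ran but exited nonzero -- do not retry."""

class SshConnectionError(Exception):
    """ssh itself failed (exit status 255) on every attempt."""

def ssh(host, command, run, tries=5, delay=0):
    """Sketch of a retry policy for SPARK-5851 (hypothetical signature).

    `run(host, command)` returns an exit status, standing in for
    `subprocess.call(["ssh", host, command])`. OpenSSH reserves status
    255 for ssh/connection errors; anything else came from the command.
    """
    for _ in range(tries):
        status = run(host, command)
        if status == 0:
            return
        if status != 255:
            # the remote command itself failed (e.g. a setup.sh bug);
            # retrying would just loop, so fail fast instead
            raise RemoteCommandError(f"{command!r} exited {status} on {host}")
        if delay:
            time.sleep(delay)
    raise SshConnectionError(f"could not connect to {host} after {tries} tries")
```

Setting `tries=1` then recovers the "no retry at all" behaviour suggested in the issue for certain usages.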
[jira] [Commented] (SPARK-5813) Spark-ec2: Switch to OracleJDK
[ https://issues.apache.org/jira/browse/SPARK-5813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322611#comment-14322611 ] Florian Verhein commented on SPARK-5813:
I think it's a good idea to stick to vendor recommendations, but since I can't point to any concrete benefits and there is complexity around handling licensing issues, I don't think there's a good argument for tackling this.

Spark-ec2: Switch to OracleJDK
--
Key: SPARK-5813 URL: https://issues.apache.org/jira/browse/SPARK-5813 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Florian Verhein Priority: Minor

Currently using OpenJDK; however, it is generally recommended to use Oracle JDK, esp. for Hadoop deployments, etc.
[jira] [Closed] (SPARK-5813) Spark-ec2: Switch to OracleJDK
[ https://issues.apache.org/jira/browse/SPARK-5813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florian Verhein closed SPARK-5813.
--
Resolution: Won't Fix

Spark-ec2: Switch to OracleJDK
--
Key: SPARK-5813 URL: https://issues.apache.org/jira/browse/SPARK-5813 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Florian Verhein Priority: Minor

Currently using OpenJDK; however, it is generally recommended to use Oracle JDK, esp. for Hadoop deployments, etc.
[jira] [Commented] (SPARK-5813) Spark-ec2: Switch to OracleJDK
[ https://issues.apache.org/jira/browse/SPARK-5813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321764#comment-14321764 ] Florian Verhein commented on SPARK-5813:
IANAL, but here are my thoughts: the user ends up downloading it from Oracle and accepting the license terms in that process, so as long as they are (or are made) aware, I don't really see a problem. It's just providing a mechanism for them to do this, i.e. it's not a redistribution issue. I think a reasonable solution would be to have OpenJDK as the default, with OracleJDK as an option that the user must specifically request (with the option's documentation indicating that this entails acceptance of a license, etc.).
At least, *the above is true in the case where the user builds their own AMI (that's the approach I take, since it best suits my requirements). With provided AMIs I think this is more complex, because I would assume that is redistribution*. I guess that applies to any software that is put on the AMI, actually... so this may be an issue that needs looking at more generally. I don't know how to best approach that case, other than adhering to any redistribution terms and including these as part of an EULA for spark-ec2/AMIs or something? But with the work [~nchammas] has done, I suppose the easiest way would be to provide the public AMIs with OpenJDK, and add an option to build ones with OracleJDK if the user is inclined to do this themselves. Hmmm... is this worthwhile?

Spark-ec2: Switch to OracleJDK
--
Key: SPARK-5813 URL: https://issues.apache.org/jira/browse/SPARK-5813 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Florian Verhein Priority: Minor

Currently using OpenJDK; however, it is generally recommended to use Oracle JDK, esp. for Hadoop deployments, etc.
[jira] [Commented] (SPARK-5813) Spark-ec2: Switch to OracleJDK
[ https://issues.apache.org/jira/browse/SPARK-5813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322208#comment-14322208 ] Florian Verhein commented on SPARK-5813:
Good point. I think you're right re: scripting it away - I understand it's sometimes done by sysadmins/ops to automate their in-house installation processes, but that is a different situation. Thanks for that. spark_ec2 works by looking up an existing AMI and using it to instantiate EC2 instances. I don't know who currently maintains these.

Spark-ec2: Switch to OracleJDK
--
Key: SPARK-5813 URL: https://issues.apache.org/jira/browse/SPARK-5813 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Florian Verhein Priority: Minor

Currently using OpenJDK; however, it is generally recommended to use Oracle JDK, esp. for Hadoop deployments, etc.
[jira] [Commented] (SPARK-5813) Spark-ec2: Switch to OracleJDK
[ https://issues.apache.org/jira/browse/SPARK-5813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321748#comment-14321748 ] Florian Verhein commented on SPARK-5813:
No specific technical reason, esp. WRT Spark... It's more of an attempt to keep in line with recommendations for Hadoop in production (relevant since Hadoop is included in spark-ec2 - and CDH seems to be favoured). For example, CDH supports OracleJDK, Hortonworks didn't support OpenJDK before 1.7, and OracleJDK still seems to be the favoured choice in production deployments, e.g. http://wiki.apache.org/hadoop/HadoopJavaVersions. I don't have first-hand data about how they compare performance-wise. I've heard OracleJDK being preferred for Hadoop on that front, but I also found this http://www.slideshare.net/PrincipledTechnologies/big-data-technology-on-red-hat-enterprise-linux-openjdk-vs-oracle-jdk, so perhaps performance is less of a reason these days? Do you know of any performance analysis done with Spark or Tachyon on OpenJDK vs OracleJDK?
In terms of difficulty, it's not hard to script installation of OracleJDK. E.g. I've gone down the path of supporting both for the above reasons here (link may break in future): https://github.com/florianverhein/spark-ec2/blob/packer/packer/java-setup.sh
Aside: based on the bugs you mentioned, is there a list somewhere of which JDK versions to avoid WRT Spark?

Spark-ec2: Switch to OracleJDK
--
Key: SPARK-5813 URL: https://issues.apache.org/jira/browse/SPARK-5813 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Florian Verhein Priority: Minor

Currently using OpenJDK; however, it is generally recommended to use Oracle JDK, esp. for Hadoop deployments, etc.
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14320995#comment-14320995 ] Florian Verhein commented on SPARK-3821:
RE: Java, that reminds me... we should probably be using OracleJDK rather than OpenJDK. But I think this should be a separate issue, so I just created SPARK-5813.

Develop an automated way of creating Spark images (AMI, Docker, and others)
---
Key: SPARK-3821 URL: https://issues.apache.org/jira/browse/SPARK-3821 Project: Spark Issue Type: Improvement Components: Build, EC2 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Attachments: packer-proposal.html

Right now the creation of Spark AMIs or Docker containers is done manually. With tools like [Packer|http://www.packer.io/], we should be able to automate this work, and do so in such a way that multiple types of machine images can be created from a single template.
[jira] [Created] (SPARK-5813) Spark-ec2: Switch to OracleJDK
Florian Verhein created SPARK-5813:
--
Summary: Spark-ec2: Switch to OracleJDK
Key: SPARK-5813 URL: https://issues.apache.org/jira/browse/SPARK-5813 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Florian Verhein Priority: Minor

Currently using OpenJDK; however, it is generally recommended to use Oracle JDK, esp. for Hadoop deployments, etc.
[jira] [Updated] (SPARK-5641) Allow spark_ec2.py to copy arbitrary files to cluster
[ https://issues.apache.org/jira/browse/SPARK-5641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florian Verhein updated SPARK-5641:
---
Description:
*Updated - no longer via deploy.generic, no substitutions*
Essentially, give users an easy way to rcp a directory structure to the master's / as part of the cluster launch, at a useful point in the workflow (before setup.sh is called on the master). Useful if binary files need to be uploaded, e.g. I use this for rpm transfer, to install extra stuff at cluster deployment time. However, note that it could also be used to override / add to either:
- what's on the image
- what gets cloned from spark-ec2 (e.g. add a new module)

was:
Useful if binary files need to be uploaded, e.g. I use this for rpm transfer, to install extra stuff at cluster deployment time. However, note that it could also be used to override either:
- what's on the image
- what gets cloned from spark-ec2 (since deploy_files runs afterwards)
The idea is that the user can just dump the files into ec2/deploy.generic/. This can be implemented by modifying deploy_files so that it simply copies the file (if it is of certain types), rather than treating it as a text file and attempting to replace template variables. Detecting binary files is non-trivial, so the proposal is to have a list of file extensions that will trigger simple file copying.

Allow spark_ec2.py to copy arbitrary files to cluster
-
Key: SPARK-5641 URL: https://issues.apache.org/jira/browse/SPARK-5641 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Florian Verhein Priority: Minor

*Updated - no longer via deploy.generic, no substitutions*
Essentially, give users an easy way to rcp a directory structure to the master's / as part of the cluster launch, at a useful point in the workflow (before setup.sh is called on the master). Useful if binary files need to be uploaded, e.g. I use this for rpm transfer, to install extra stuff at cluster deployment time. However, note that it could also be used to override / add to either:
- what's on the image
- what gets cloned from spark-ec2 (e.g. add a new module)
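The updated design above (push a user directory tree to the master's / before setup.sh runs) can be sketched as the command spark_ec2.py would issue. A sketch only: `copy_user_files` is a hypothetical helper, and the rsync invocation mirrors the style of spark_ec2.py's existing deploy_files push rather than quoting it exactly:

```python
def copy_user_files(user_dir, master_host, identity_file):
    """Build an rsync command to push a user directory tree to the master's /.

    Hypothetical sketch for SPARK-5641; argument names and the exact rsync
    flags are assumptions, modelled on how spark_ec2.py pushes deploy_files.
    """
    return [
        "rsync", "-rv",
        "-e", f"ssh -o StrictHostKeyChecking=no -i {identity_file}",
        f"{user_dir.rstrip('/')}/",     # trailing slash: copy contents, not the dir
        f"root@{master_host}:/",
    ]
    # spark_ec2.py would pass this list to subprocess.check_call(...)
```

Since the tree is copied verbatim, binary files (rpms etc.) need no special handling, which is what makes this simpler than the earlier deploy.generic substitution approach.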
[jira] [Updated] (SPARK-5641) Allow spark_ec2.py to copy arbitrary files to cluster via deploy.generic
[ https://issues.apache.org/jira/browse/SPARK-5641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florian Verhein updated SPARK-5641:
---
Description:
Useful if binary files need to be uploaded, e.g. I use this for rpm transfer, to install extra stuff at cluster deployment time. However, note that it could also be used to override either:
- what's on the image
- what gets cloned from spark-ec2 (since deploy_files runs afterwards)
The idea is that the user can just dump the files into ec2/deploy.generic/. This can be implemented by modifying deploy_files so that it simply copies the file (if it is of certain types), rather than treating it as a text file and attempting to replace template variables. Detecting binary files is non-trivial, so the proposal is to have a list of file extensions that will trigger simple file copying.

was:
Useful if binary files need to be uploaded, e.g. I use this for rpm transfer, to install extra stuff at cluster deployment time. Could also be used to override what's on the image, etc. The idea is that the user can just dump the files into deploy.generic. This can be implemented by modifying deploy_templates so that it simply copies the file (if it is of certain types), rather than treating it as a text file and replacing template variables.

Allow spark_ec2.py to copy arbitrary files to cluster via deploy.generic
-
Key: SPARK-5641 URL: https://issues.apache.org/jira/browse/SPARK-5641 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Florian Verhein Priority: Minor

Useful if binary files need to be uploaded, e.g. I use this for rpm transfer, to install extra stuff at cluster deployment time. However, note that it could also be used to override either:
- what's on the image
- what gets cloned from spark-ec2 (since deploy_files runs afterwards)
The idea is that the user can just dump the files into ec2/deploy.generic/. This can be implemented by modifying deploy_files so that it simply copies the file (if it is of certain types), rather than treating it as a text file and attempting to replace template variables. Detecting binary files is non-trivial, so the proposal is to have a list of file extensions that will trigger simple file copying.
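The extension-based dispatch proposed above can be sketched in a few lines. Everything here is an assumption for illustration: the helper name, the extension list, and the use of `string.Template` ($var markers) as a self-contained stand-in for spark-ec2's own template substitution:

```python
import shutil
import string

# Extensions treated as binary and copied verbatim -- an assumed list,
# per the proposal's "file extensions that will trigger simple copying".
RAW_COPY_EXTENSIONS = (".rpm", ".jar", ".tar", ".gz", ".zip")

def deploy_file(src, dest, template_vars):
    """Sketch of the proposed deploy_files change (hypothetical helper)."""
    if src.endswith(RAW_COPY_EXTENSIONS):
        # binary by extension: byte-for-byte copy, no substitution attempted
        shutil.copyfile(src, dest)
        return
    with open(src) as f:
        text = f.read()
    # text file: substitute template variables; string.Template is used
    # here only to keep the sketch self-contained and runnable
    with open(dest, "w") as f:
        f.write(string.Template(text).safe_substitute(template_vars))
```

The point of the extension whitelist is that it sidesteps content sniffing entirely: a file either matches a known-binary extension or is assumed to be a substitutable text template.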
[jira] [Comment Edited] (SPARK-5676) License missing from spark-ec2 repo
[ https://issues.apache.org/jira/browse/SPARK-5676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313102#comment-14313102 ] Florian Verhein edited comment on SPARK-5676 at 2/9/15 11:06 PM:
-
[~srowen] Yep, that's the one. True. However, it is the key part in providing the functionality of Spark deployment on EC2, which is documented quite prominently on the Spark site, and the entry point of which is in the Spark repo (ec2/spark_ec2.py). Bugs against this functionality are therefore also filed here under the EC2 component. I assume the decision to have a separate repo was for implementation/design reasons ( ?? ). Having spark_ec2.py cause this repo to be cloned and executed on EC2 is a really nice way of providing the functionality. But that's an assumption on my part, and [~shivaram] would know best. So from a user perspective, it would appear to be part of Spark (users may not even be aware that part of the functionality lives in a separate repo). Since it's a great way to get Spark running on EC2, it would be great to get the licensing sorted out. This appears to be the best place to raise this issue.

was (Author: florianverhein):
[~srowen] Yep, that's the one. True. However, it is the key part in providing the functionality of Spark deployment on EC2, which is documented quite prominently on the Spark site, and the entry point of which is in the Spark repo (ec2/spark_ec2.py). Bugs against this functionality are therefore also filed here under the EC2 component. I assume the decision to have a separate repo was for implementation/design reasons ( ?? ). Having spark_ec2.py cause this repo to be cloned and executed on EC2 is a really nice way of providing the functionality. But that's an assumption on my part, and [~shivaram] would know best. So from a user perspective, it would appear to be part of Spark (users may not even be aware that part of the functionality lives in a separate repo). Since it's a great way to get Spark running on EC2, it would be great to get the licensing sorted out. This appears to be the best place to raise this issue.

License missing from spark-ec2 repo
---
Key: SPARK-5676 URL: https://issues.apache.org/jira/browse/SPARK-5676 Project: Spark Issue Type: Bug Components: EC2 Reporter: Florian Verhein

There is no LICENSE file or license headers in the code in the spark-ec2 repo. Also, I believe there is no contributor license agreement notification in place (like there is in the main Spark repo). It would be great to fix this (sooner rather than later, while the contributors list is small), so that users wishing to use this part of Spark are not in doubt over licensing issues.
[jira] [Commented] (SPARK-5676) License missing from spark-ec2 repo
[ https://issues.apache.org/jira/browse/SPARK-5676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313102#comment-14313102 ] Florian Verhein commented on SPARK-5676:
[~srowen] Yep, that's the one. True. However, it is the key part in providing the functionality of Spark deployment on EC2, which is documented quite prominently on the Spark site, and the entry point of which is in the Spark repo (ec2/spark_ec2.py). Bugs against this functionality are therefore also filed here under the EC2 component. I assume the decision to have a separate repo was for implementation/design reasons ( ?? ). Having spark_ec2.py cause this repo to be cloned and executed on EC2 is a really nice way of providing the functionality. But that's an assumption on my part, and [~shivaram] would know best. So from a user perspective, it would appear to be part of Spark (users may not even be aware that part of the functionality lives in a separate repo). Since it's a great way to get Spark running on EC2, it would be great to get the licensing sorted out. This appears to be the best place to raise this issue.

License missing from spark-ec2 repo
---
Key: SPARK-5676 URL: https://issues.apache.org/jira/browse/SPARK-5676 Project: Spark Issue Type: Bug Components: EC2 Reporter: Florian Verhein

There is no LICENSE file or license headers in the code in the spark-ec2 repo. Also, I believe there is no contributor license agreement notification in place (like there is in the main Spark repo). It would be great to fix this (sooner rather than later, while the contributors list is small), so that users wishing to use this part of Spark are not in doubt over licensing issues.
[jira] [Commented] (SPARK-5676) License missing from spark-ec2 repo
[ https://issues.apache.org/jira/browse/SPARK-5676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313135#comment-14313135 ] Florian Verhein commented on SPARK-5676:
Makes sense. Thanks.

License missing from spark-ec2 repo
---
Key: SPARK-5676 URL: https://issues.apache.org/jira/browse/SPARK-5676 Project: Spark Issue Type: Bug Components: EC2 Reporter: Florian Verhein

There is no LICENSE file or license headers in the code in the spark-ec2 repo. Also, I believe there is no contributor license agreement notification in place (like there is in the main Spark repo). It would be great to fix this (sooner rather than later, while the contributors list is small), so that users wishing to use this part of Spark are not in doubt over licensing issues.
[jira] [Created] (SPARK-5676) License missing from spark-ec2 repo
Florian Verhein created SPARK-5676:
--
Summary: License missing from spark-ec2 repo
Key: SPARK-5676 URL: https://issues.apache.org/jira/browse/SPARK-5676 Project: Spark Issue Type: Bug Components: EC2 Reporter: Florian Verhein

There is no LICENSE file or license headers in the code in the spark-ec2 repo. Also, I believe there is no contributor license agreement notification in place (like there is in the main Spark repo). It would be great to fix this (sooner rather than later, while the contributors list is small), so that users wishing to use this part of Spark are not in doubt over licensing issues.
[jira] [Commented] (SPARK-3185) SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting JOURNAL_FOLDER
[ https://issues.apache.org/jira/browse/SPARK-3185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308644#comment-14308644 ] Florian Verhein commented on SPARK-3185:
[~dvohra] Sure, but the exception is thrown by Tachyon... so you're not going to be able to fix it by changing the Spark build.

SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting JOURNAL_FOLDER
---
Key: SPARK-3185 URL: https://issues.apache.org/jira/browse/SPARK-3185 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.0.2
Environment: Amazon Linux AMI
[ec2-user@ip-172-30-1-145 ~]$ uname -a
Linux ip-172-30-1-145 3.10.42-52.145.amzn1.x86_64 #1 SMP Tue Jun 10 23:46:43 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
https://aws.amazon.com/amazon-linux-ami/2014.03-release-notes/
The build I used (and MD5 verified):
[ec2-user@ip-172-30-1-145 ~]$ wget http://supergsego.com/apache/spark/spark-1.0.2/spark-1.0.2-bin-hadoop2.tgz
Reporter: Jeremy Chambers

{code}
org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4
{code}

When I launch Spark 1.0.2 on Hadoop 2 in a new EC2 cluster, the above Tachyon exception is thrown when formatting JOURNAL_FOLDER. No exception occurs when I launch on Hadoop 1.
Launch used:
{code}
./spark-ec2 -k spark_cluster -i /home/ec2-user/kagi/spark_cluster.ppk --zone=us-east-1a --hadoop-major-version=2 --spot-price=0.0165 -s 3 launch sparkProd
{code}
{code}
log snippet
Formatting Tachyon Master @ ec2-54-80-49-244.compute-1.amazonaws.com
Formatting JOURNAL_FOLDER: /root/tachyon/libexec/../journal/
Exception in thread main java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4
    at tachyon.util.CommonUtils.runtimeException(CommonUtils.java:246)
    at tachyon.UnderFileSystemHdfs.init(UnderFileSystemHdfs.java:73)
    at tachyon.UnderFileSystemHdfs.getClient(UnderFileSystemHdfs.java:53)
    at tachyon.UnderFileSystem.get(UnderFileSystem.java:53)
    at tachyon.Format.main(Format.java:54)
Caused by: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4
    at org.apache.hadoop.ipc.Client.call(Client.java:1070)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
    at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
    at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
    at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:238)
    at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:203)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
    at tachyon.UnderFileSystemHdfs.init(UnderFileSystemHdfs.java:69)
    ... 3 more
Killed 0 processes
Killed 0 processes
ec2-54-167-219-159.compute-1.amazonaws.com: Killed 0 processes
ec2-54-198-198-17.compute-1.amazonaws.com: Killed 0 processes
ec2-54-166-36-0.compute-1.amazonaws.com: Killed 0 processes
---end snippet---
{code}

*I don't have this problem when I launch without --hadoop-major-version=2 (which defaults to Hadoop 1.x).*
[jira] [Created] (SPARK-5641) Allow spark_ec2.py to copy arbitrary files to cluster via deploy.generic
Florian Verhein created SPARK-5641:
--
Summary: Allow spark_ec2.py to copy arbitrary files to cluster via deploy.generic
Key: SPARK-5641 URL: https://issues.apache.org/jira/browse/SPARK-5641 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Florian Verhein Priority: Minor

Useful if binary files need to be uploaded, e.g. I use this for rpm transfer, to install extra stuff at cluster deployment time. Could also be used to override what's on the image, etc. The idea is that the user can just dump the files into deploy.generic. This can be implemented by modifying deploy_templates so that it simply copies the file (if it is of certain types), rather than treating it as a text file and replacing template variables.
[jira] [Commented] (SPARK-5552) Automated data science AMI creation and data science cluster deployment on EC2
[ https://issues.apache.org/jira/browse/SPARK-5552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14304412#comment-14304412 ] Florian Verhein commented on SPARK-5552:
Thanks [~srowen]. So it wouldn't fit in the Spark repo itself (the only change there would be to add an option in spark_ec2.py to use an alternate spark-ec2 repo/branch). It would naturally live in spark-ec2, as it involves changes to spark-ec2 for both use cases:
- Image creation is based on the work soon to be added to spark-ec2 for this: https://issues.apache.org/jira/browse/SPARK-3821
- Cluster deployment+configuration is done using the spark-ec2 scripts themselves (but with many modifications/fixes).
Since there is a dependency between the image and the configuration (init.sh and setup.sh) scripts, it's not possible to solve this with just an AMI. The extra components (actually, just vowpal wabbit and more Python libraries - the rest already exists in the spark-ec2 AMI) are just added to the image for data science convenience.

Automated data science AMI creation and data science cluster deployment on EC2
--
Key: SPARK-5552 URL: https://issues.apache.org/jira/browse/SPARK-5552 Project: Spark Issue Type: New Feature Components: EC2 Reporter: Florian Verhein

Issue created RE: https://github.com/mesos/spark-ec2/pull/90#issuecomment-72597154 (please read for background)
Goal: extend the spark-ec2 scripts to create an automated data science cluster deployment on EC2, suitable for almost(?)-production use.
Use cases:
- A user can build their own custom data science AMIs from a CentOS minimal image by calling a packer configuration (good defaults should be provided, with some options for flexibility).
- A user can then easily deploy a new (correctly configured) cluster using these AMIs, and do so as quickly as possible.
Components/modules: Spark + Tachyon + HDFS (on instance storage) + Python + R + vowpal wabbit + any rpms + ... + ganglia.
Focus is on reliability (rather than e.g. supporting many versions / dev testing) and speed of deployment. Use Hadoop 2 so there is the option to lift into YARN later.
My current solution is here: https://github.com/florianverhein/spark-ec2/tree/packer. It includes other fixes/improvements as needed to get it working. Now that it seems to work (but has deviated a lot more from the existing code base than I was expecting), I'm wondering what to do with it... Keen to hear ideas if anyone is interested.
[jira] [Created] (SPARK-5552) Automated data science AMIs creation and cluster deployment on EC2
Florian Verhein created SPARK-5552:
--
Summary: Automated data science AMIs creation and cluster deployment on EC2
Key: SPARK-5552
URL: https://issues.apache.org/jira/browse/SPARK-5552
Project: Spark
Issue Type: New Feature
Components: EC2
Reporter: Florian Verhein
[jira] [Updated] (SPARK-5552) Automated data science AMI creation and data science cluster deployment on EC2
[ https://issues.apache.org/jira/browse/SPARK-5552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Florian Verhein updated SPARK-5552:
--
Summary: Automated data science AMI creation and data science cluster deployment on EC2 (was: Automated data science AMIs creation and cluster deployment on EC2)
[jira] [Commented] (SPARK-3185) SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting JOURNAL_FOLDER
[ https://issues.apache.org/jira/browse/SPARK-3185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290923#comment-14290923 ]

Florian Verhein commented on SPARK-3185:

Sure [~grzegorz-dubicki]. You need to build with the correct version profiles. See for example:
https://github.com/florianverhein/spark-ec2/blob/packer/spark/init.sh
https://github.com/florianverhein/spark-ec2/blob/packer/tachyon/init.sh
Note that I'm using Hadoop 2.4.1 (which I install on the image).

SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting JOURNAL_FOLDER
--
Key: SPARK-3185
URL: https://issues.apache.org/jira/browse/SPARK-3185
Project: Spark
Issue Type: Bug
Affects Versions: 1.0.2
Environment: Amazon Linux AMI
[ec2-user@ip-172-30-1-145 ~]$ uname -a
Linux ip-172-30-1-145 3.10.42-52.145.amzn1.x86_64 #1 SMP Tue Jun 10 23:46:43 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
https://aws.amazon.com/amazon-linux-ami/2014.03-release-notes/
The build I used (and MD5 verified):
[ec2-user@ip-172-30-1-145 ~]$ wget http://supergsego.com/apache/spark/spark-1.0.2/spark-1.0.2-bin-hadoop2.tgz
Reporter: Jeremy Chambers

{code}
org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4
{code}

When I launch Spark 1.0.2 on Hadoop 2 in a new EC2 cluster, the above Tachyon exception is thrown when formatting JOURNAL_FOLDER. No exception occurs when I launch on Hadoop 1.
Launch used:
{code}
./spark-ec2 -k spark_cluster -i /home/ec2-user/kagi/spark_cluster.ppk --zone=us-east-1a --hadoop-major-version=2 --spot-price=0.0165 -s 3 launch sparkProd
{code}

{code}
--- log snippet ---
Formatting Tachyon Master @ ec2-54-80-49-244.compute-1.amazonaws.com
Formatting JOURNAL_FOLDER: /root/tachyon/libexec/../journal/
Exception in thread "main" java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4
	at tachyon.util.CommonUtils.runtimeException(CommonUtils.java:246)
	at tachyon.UnderFileSystemHdfs.<init>(UnderFileSystemHdfs.java:73)
	at tachyon.UnderFileSystemHdfs.getClient(UnderFileSystemHdfs.java:53)
	at tachyon.UnderFileSystem.get(UnderFileSystem.java:53)
	at tachyon.Format.main(Format.java:54)
Caused by: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4
	at org.apache.hadoop.ipc.Client.call(Client.java:1070)
	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
	at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source)
	at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
	at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
	at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
	at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:238)
	at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:203)
	at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
	at tachyon.UnderFileSystemHdfs.<init>(UnderFileSystemHdfs.java:69)
	... 3 more
Killed 0 processes
Killed 0 processes
ec2-54-167-219-159.compute-1.amazonaws.com: Killed 0 processes
ec2-54-198-198-17.compute-1.amazonaws.com: Killed 0 processes
ec2-54-166-36-0.compute-1.amazonaws.com: Killed 0 processes
--- end snippet ---
{code}

*I don't have this problem when I launch without --hadoop-major-version=2 (which defaults to Hadoop 1.x).*
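The comment above says the fix is to build with the correct version profiles. As a sketch (not the actual init.sh contents - flags follow the Spark 1.x build documentation, and the Hadoop version is the 2.4.1 used elsewhere in this thread), these are the Maven invocations one would expect, printed rather than executed here:

```shell
# Sketch only: pin Spark and Tachyon to the same Hadoop client version so
# their HDFS clients speak the same IPC version as the NameNode.
HADOOP_VERSION=2.4.1

# Spark: select the hadoop-2.4 profile and pin the exact client version.
SPARK_BUILD_CMD="mvn -Phadoop-2.4 -Dhadoop.version=$HADOOP_VERSION -DskipTests clean package"

# Tachyon: rebuild against the same Hadoop version (assumed property name,
# per Tachyon's build docs of that era).
TACHYON_BUILD_CMD="mvn -Dhadoop.version=$HADOOP_VERSION -DskipTests clean package"

# Print the commands; run each inside the respective source checkout.
echo "spark:   $SPARK_BUILD_CMD"
echo "tachyon: $TACHYON_BUILD_CMD"
```

Run each command from the corresponding source checkout; the key point is that both builds agree on `hadoop.version`.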
[jira] [Updated] (SPARK-5331) Spark workers can't find tachyon master as spark-ec2 doesn't set spark.tachyonStore.url
[ https://issues.apache.org/jira/browse/SPARK-5331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Florian Verhein updated SPARK-5331:
--
Component/s: EC2

Description:
ps -ef | grep Tachyon shows Tachyon running on the master (and the slave) node with the correct setting:
-Dtachyon.master.hostname=ec2-54-252-156-187.ap-southeast-2.compute.amazonaws.com
However, from the stderr log on a worker running the SparkTachyonPi example:

15/01/20 06:00:56 INFO CacheManager: Partition rdd_0_0 not found, computing it
15/01/20 06:00:56 INFO : Trying to connect master @ localhost/127.0.0.1:19998
15/01/20 06:00:56 ERROR : Failed to connect (1) to master localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
15/01/20 06:00:57 ERROR : Failed to connect (2) to master localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
15/01/20 06:00:58 ERROR : Failed to connect (3) to master localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
15/01/20 06:00:59 ERROR : Failed to connect (4) to master localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
15/01/20 06:01:00 ERROR : Failed to connect (5) to master localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
15/01/20 06:01:01 WARN TachyonBlockManager: Attempt 1 to create tachyon dir null failed
java.io.IOException: Failed to connect to master localhost/127.0.0.1:19998 after 5 attempts
	at tachyon.client.TachyonFS.connect(TachyonFS.java:293)
	at tachyon.client.TachyonFS.getFileId(TachyonFS.java:1011)
	at tachyon.client.TachyonFS.exist(TachyonFS.java:633)
	at org.apache.spark.storage.TachyonBlockManager$$anonfun$createTachyonDirs$2.apply(TachyonBlockManager.scala:117)
	at org.apache.spark.storage.TachyonBlockManager$$anonfun$createTachyonDirs$2.apply(TachyonBlockManager.scala:106)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
	at org.apache.spark.storage.TachyonBlockManager.createTachyonDirs(TachyonBlockManager.scala:106)
	at org.apache.spark.storage.TachyonBlockManager.<init>(TachyonBlockManager.scala:57)
	at org.apache.spark.storage.BlockManager.tachyonStore$lzycompute(BlockManager.scala:94)
	at org.apache.spark.storage.BlockManager.tachyonStore(BlockManager.scala:88)
	at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:773)
	at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638)
	at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:145)
	at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
	at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
	at org.apache.spark.scheduler.Task.run(Task.scala:56)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Caused by: tachyon.org.apache.thrift.TException: Failed to connect to master localhost/127.0.0.1:19998 after 5 attempts
	at tachyon.master.MasterClient.connect(MasterClient.java:178)
	at tachyon.client.TachyonFS.connect(TachyonFS.java:290)
	... 28 more
Caused by: tachyon.org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
	at tachyon.org.apache.thrift.transport.TSocket.open(TSocket.java:185)
	at tachyon.org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:81)
	at tachyon.master.MasterClient.connect(MasterClient.java:156)
	... 29 more
Caused by: java.net.ConnectException: Connection refused
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
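The issue title says spark-ec2 never sets spark.tachyonStore.url, and in Spark 1.2 that property defaults to tachyon://localhost:19998, which matches the "connect master @ localhost" attempts in the log above. A sketch of a workaround (not from the ticket; the hostname and conf directory below are placeholders for your deployment):

```shell
# Sketch: point executors at the real Tachyon master via spark-defaults.conf.
# TACHYON_MASTER is the master's public hostname from the log above;
# SPARK_CONF_DIR is wherever your Spark conf lives (placeholder default).
TACHYON_MASTER=ec2-54-252-156-187.ap-southeast-2.compute.amazonaws.com
SPARK_CONF_DIR=${SPARK_CONF_DIR:-./spark-conf}

mkdir -p "$SPARK_CONF_DIR"
echo "spark.tachyonStore.url tachyon://$TACHYON_MASTER:19998" \
    >> "$SPARK_CONF_DIR/spark-defaults.conf"
```

Restart the executors after the change so they pick up the new default.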
[jira] [Commented] (SPARK-3185) SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting JOURNAL_FOLDER
[ https://issues.apache.org/jira/browse/SPARK-3185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283493#comment-14283493 ]

Florian Verhein commented on SPARK-3185:

I built Tachyon with the correct Hadoop version; that fixed this problem for me.
Correction: Spark 1.2.0 uses Tachyon 0.5.0 as far as I can see... but the spark-ec2 config is for Tachyon 0.4.1 (and this causes a few problems when actually trying to use Tachyon).
[jira] [Created] (SPARK-5331) Tachyon workers seem to ignore tachyon.master.hostname and use localhost instead
Florian Verhein created SPARK-5331:
--
Summary: Tachyon workers seem to ignore tachyon.master.hostname and use localhost instead
Key: SPARK-5331
URL: https://issues.apache.org/jira/browse/SPARK-5331
Project: Spark
Issue Type: Bug
Environment: Running on EC2 via modified spark-ec2 scripts (to get dependencies right so Tachyon starts). Using Tachyon 0.5.0 built against Hadoop 2.4.1; Spark 1.2.0 built against Tachyon 0.5.0 and Hadoop 2.4.1. Tachyon configured using the template in 0.5.0 but updated with the slave list and master variables etc.
Reporter: Florian Verhein
[jira] [Commented] (SPARK-3185) SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting JOURNAL_FOLDER
[ https://issues.apache.org/jira/browse/SPARK-3185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276436#comment-14276436 ]

Florian Verhein commented on SPARK-3185:

I'm also getting this, though with "Server IPC version 9" now that I'm using Hadoop 2.4.1 (a modification of the various hadoop init.sh scripts). I'm also using Spark 1.2.0. My understanding is that spark-1.2.0-bin-hadoop2.4.tgz is built against Hadoop 2.4 and Tachyon 0.4.1. But I suspect the Tachyon 0.4.1 that is installed by the spark-ec2 scripts is built against Hadoop 1... Does this mean building Tachyon against Hadoop 2.4.1 would fix this?
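The thread mentions IPC versions 4, 7, and 9 in different setups. As a rough decoder (an editorial aid, not from the ticket; the mapping reflects how these errors are commonly reported), each Hadoop RPC generation bumps the wire IPC version, so the mismatch message tells you which Hadoop line each side was built against:

```shell
# Rough mapping from "Server IPC version X cannot communicate with client
# version Y" to the Hadoop release line each side was compiled against.
ipc_to_hadoop() {
    case "$1" in
        4) echo "Hadoop 0.20.x / 1.x client libraries" ;;
        7) echo "Hadoop 2.0.x (e.g. CDH4) client libraries" ;;
        9) echo "Hadoop 2.2+ client libraries" ;;
        *) echo "unknown IPC version $1" ;;
    esac
}

# The original report ("server 7, client 4") would read as: HDFS is a
# Hadoop 2.0.x/CDH4 build, while Tachyon was built against Hadoop 1.x.
ipc_to_hadoop 7
ipc_to_hadoop 4
```

Under this reading, rebuilding Tachyon against the cluster's Hadoop version (as the comment suggests) is exactly what makes the two sides agree.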
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276572#comment-14276572 ]

Florian Verhein commented on SPARK-3821:

Thanks [~nchammas], that makes sense. Created SPARK-5241. I'm not sure about the pre-built scenario, but am guessing e.g. http://s3.amazonaws.com/spark-related-packages/spark-1.2.0-bin-hadoop2.4.tgz != http://s3.amazonaws.com/spark-related-packages/spark-1.2.0-bin-cdh4.tgz. So perhaps the intent is that the spark-ec2 scripts only support CDH distributions...

Develop an automated way of creating Spark images (AMI, Docker, and others)
--
Key: SPARK-3821
URL: https://issues.apache.org/jira/browse/SPARK-3821
Project: Spark
Issue Type: Improvement
Components: Build, EC2
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
Attachments: packer-proposal.html

Right now the creation of Spark AMIs or Docker containers is done manually. With tools like [Packer|http://www.packer.io/], we should be able to automate this work, and do so in such a way that multiple types of machine images can be created from a single template.
[jira] [Created] (SPARK-5241) spark-ec2 spark init scripts do not handle all hadoop (or tachyon?) dependencies correctly
Florian Verhein created SPARK-5241:
--
Summary: spark-ec2 spark init scripts do not handle all hadoop (or tachyon?) dependencies correctly
Key: SPARK-5241
URL: https://issues.apache.org/jira/browse/SPARK-5241
Project: Spark
Issue Type: Bug
Components: Build, EC2
Reporter: Florian Verhein

spark-ec2/spark/init.sh doesn't completely adhere to the hadoop dependencies. This may also be an issue for the tachyon dependencies. Related: tachyon appears to require builds against the right version of hadoop as well (probably the cause of SPARK-3185).

This applies to the spark build from a git checkout in spark/init.sh (I suspect this should also be changed to use mvn, as that's the reference build according to the docs?). It may apply to the pre-built spark in spark/init.sh as well, but I'm not sure about this - e.g. I thought that the hadoop2.4 and cdh4.2 builds of spark are different.

Also note that hadoop native is built from hadoop 2.4.1 on the AMI, and this is used regardless of HADOOP_MAJOR_VERSION in the *-hdfs modules. Tachyon is hard-coded to 0.4.1 (which is probably built against hadoop 1.x?).
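A hypothetical sketch of the kind of mapping spark/init.sh would need so the pre-built Spark package actually matches HADOOP_MAJOR_VERSION. The package names are the real spark-related-packages artifacts for 1.2.0; the "1"/"2"/"yarn" values and the "2" = CDH4 reading follow spark-ec2's convention as discussed in this thread, but the exact mapping here is an assumption, not the scripts' actual behavior:

```shell
# Hypothetical version-aware package selection for spark/init.sh.
SPARK_VERSION=1.2.0

spark_package_for() {
    # $1 is HADOOP_MAJOR_VERSION as spark-ec2 passes it ("1", "2", "yarn").
    case "$1" in
        1)    echo "spark-$SPARK_VERSION-bin-hadoop1.tgz" ;;
        2)    echo "spark-$SPARK_VERSION-bin-cdh4.tgz" ;;     # "2" means CDH4 here
        yarn) echo "spark-$SPARK_VERSION-bin-hadoop2.4.tgz" ;;
        *)    echo "unsupported HADOOP_MAJOR_VERSION: $1" >&2; return 1 ;;
    esac
}

# The chosen name would then be appended to the spark-related-packages URL.
spark_package_for 2
```

The point of the sketch is only that the download choice must be a function of HADOOP_MAJOR_VERSION, not a hard-coded name.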
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276263#comment-14276263 ]

Florian Verhein commented on SPARK-3821:

This is great stuff! It'll also help serve as some documentation for AMI requirements when using the spark-ec2 scripts.

Re the above, I think everything in create_image.sh can be refactored to packer (+ duplicate removal - e.g. root login). I've attempted to do this in a fork of [~nchammas]'s work, but my use case is a bit different in that I need to go from a fresh centos6 minimal image (rather than an Amazon Linux AMI) and then add other things.

Possibly related to AMI generation in general: I've noticed that the version dependencies in the spark-ec2 scripts are broken. I suspect this will need to be handled in both the image and the setup. For example:
- It looks like Spark needs to be built with the right hadoop profile to work, but this isn't adhered to. This applies whether spark is built from a git checkout or from an existing build. This is likely also the case with Tachyon. Probably the cause of https://issues.apache.org/jira/browse/SPARK-3185
- The hadoop native libs are built on the image using 2.4.1, but then copied into whatever hadoop build is downloaded in the ephemeral-hdfs and persistent-hdfs scripts. I suspect that could cause issues too. Since building hadoop is very time consuming, it's something you'd want on the image - hence creating a dependency.
- The version dependencies for other things like ganglia aren't documented (I believe it is installed on the image but duplicated again in spark-ec2/ganglia). I've found that the ganglia config doesn't work for me (but recall I'm using a different base AMI, so I'll likely get a different ganglia version). I have a sneaking suspicion that the hadoop configs in spark-ec2 won't work across the hadoop versions either (but, fingers crossed!).

Re the above, I might try keeping the entire hadoop build (from the image creation) for the hdfs setup. Sorry for the sidetrack, but I'm struggling through all this, so hoping it might ring a bell for someone.

p.s. With the image automation, it might also be worth considering putting more on the image as an option (esp. for people happy to build their own AMIs). For example, I see no reason why the module init.sh scripts can't be run from packer in order to speed up start-up times of the cluster :)
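The p.s. above suggests running the module init.sh scripts from Packer at image-build time instead of cluster-launch time. A minimal illustration of that idea (not the actual packer-proposal.html attachment; all IDs, paths, and names below are placeholders):

```shell
# Write a minimal Packer template: one amazon-ebs builder plus shell
# provisioners that bake the module init scripts into the AMI itself.
cat > spark-ami.json <<'EOF'
{
  "builders": [{
    "type": "amazon-ebs",
    "region": "us-east-1",
    "source_ami": "ami-xxxxxxxx",
    "instance_type": "m3.large",
    "ssh_username": "root",
    "ami_name": "spark-data-science-{{timestamp}}"
  }],
  "provisioners": [
    { "type": "shell", "script": "create_image.sh" },
    { "type": "shell", "script": "spark/init.sh" }
  ]
}
EOF
```

One would then run `packer build spark-ami.json`; cluster launch only needs the (fast) setup.sh steps, since the slow installs already happened at bake time.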