Re: s3 vfs on Mesos Slaves

2015-05-14 Thread Haoyuan Li
Another way is to configure S3 as Tachyon's under storage system, and then
run Spark on Tachyon.

More info: http://tachyon-project.org/Setup-UFS.html
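
Roughly, it looks like this (a sketch only; the bucket and keys are
placeholders, and property names can vary across Tachyon versions):

# conf/tachyon-env.sh
export TACHYON_UNDERFS_ADDRESS=s3n://your-bucket/tachyon
export TACHYON_JAVA_OPTS="-Dfs.s3n.awsAccessKeyId=YOUR_ACCESS_KEY -Dfs.s3n.awsSecretAccessKey=YOUR_SECRET_KEY"

Your Spark job then reads through Tachyon instead of hitting S3 directly,
e.g. sc.textFile("tachyon://tachyon-master:19998/my-data"), and Tachyon
takes care of the S3 side.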

Best,

Haoyuan


Re: s3 vfs on Mesos Slaves

2015-05-13 Thread Stephen Carman
Thank you for the suggestions. The problem is that we need to initialize the
vfs s3 driver, so what you suggested, Akhil, wouldn’t fix it.

Basically, a job is submitted to the cluster and tries to pull down the data
from s3, but it fails because the s3 scheme hasn’t been initialized in the vfs,
so it doesn’t know how to handle the URI.

What I’m asking is: how do we, before the job is run, run some bootstrapping
or setup code that performs this initialization or configuration step for the
vfs, so that when the job executes it has the information it needs to handle
the s3 URI?
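
To make it concrete, something like this rough, untested sketch is the shape
of what we want (the S3FileProvider package comes from the vfs-s3 project and
may differ in your version):

import org.apache.commons.vfs2.impl.DefaultFileSystemManager
import com.intridea.io.vfs.provider.s3.S3FileProvider // from vfs-s3; verify the package

object VfsBootstrap {
  // A lazy val runs this once per executor JVM, the first time it is touched.
  lazy val manager: DefaultFileSystemManager = {
    val mgr = new DefaultFileSystemManager()
    mgr.addProvider(Array("s3", "s3n"), new S3FileProvider())
    mgr.init()
    mgr
  }
}

// Given some rdd whose compute path resolves s3 URIs through the vfs, touch
// the manager on the executors before any s3 URI is used:
rdd.mapPartitions { iter =>
  VfsBootstrap.manager
  iter
}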

Thanks,
Steve


Re: s3 vfs on Mesos Slaves

2015-05-13 Thread jay vyas
Might I ask why vfs? I'm new to vfs and not sure whether or not it predates
the hadoop file system interfaces (HCFS).

After all, spark natively supports any HCFS by leveraging the hadoop
FileSystem api, class loaders, and so on.

So simply putting those resources on your classpath should be sufficient to
connect directly to s3, using the sc.hadoopFile(...) or sc.textFile(...)
commands.
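
For instance (a rough sketch; the keys and bucket are placeholders, and you
need the s3n filesystem jars, e.g. jets3t, on the executor classpath):

val conf = sc.hadoopConfiguration
conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")
// sc.textFile goes through the same hadoop FileSystem machinery as sc.hadoopFile
val data = sc.textFile("s3n://your-bucket/path/to/data")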


Re: s3 vfs on Mesos Slaves

2015-05-13 Thread Akhil Das
Did you happen to have a look at this? https://github.com/abashev/vfs-s3
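
If it helps, the sbt coordinates should look something like this (going from
memory of that repo's README, so double-check the group id and pick the
current version there):

libraryDependencies += "com.github.abashev" % "vfs-s3" % "x.y.z" // version from the README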

Thanks
Best Regards



s3 vfs on Mesos Slaves

2015-05-12 Thread Stephen Carman
We have a small mesos cluster and these slaves need to have a vfs setup on them 
so that the slaves can pull down the data they need from S3 when spark runs.

There doesn’t seem to be any obvious way online on how to do this, or how to
accomplish it easily. Does anyone have some best practices or ideas about how
to go about it?

An example stack trace from when a job is run on the mesos cluster…

Any idea how to get this going? Like somehow bootstrapping spark at launch or
something?
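
For example, is something along these lines the right direction? Everything
here is a placeholder sketch, not something we have working:

spark-submit \
  --master mesos://zk://zk-host:2181/mesos \
  --jars /path/to/commons-vfs2.jar,/path/to/vfs-s3.jar \
  --class com.example.OurJob \
  our-job-assembly.jar

As far as I can tell --jars only ships the jars to the executors; it doesn't
run any initialization code there, which is the piece we're missing.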

Thanks,
Steve


java.io.IOException: Unsupported scheme s3n for URI s3n://removed
at com.coldlight.ccc.vfs.NeuronPath.toPath(NeuronPath.java:43)
at 
com.coldlight.neuron.data.ClquetPartitionedData.makeInputStream(ClquetPartitionedData.java:465)
at 
com.coldlight.neuron.data.ClquetPartitionedData.access$200(ClquetPartitionedData.java:42)
at 
com.coldlight.neuron.data.ClquetPartitionedData$Iter.<init>(ClquetPartitionedData.java:330)
at 
com.coldlight.neuron.data.ClquetPartitionedData.compute(ClquetPartitionedData.java:304)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/05/12 13:57:51 ERROR Executor: Exception in task 0.1 in stage 0.0 (TID 1)
java.lang.RuntimeException: java.io.IOException: Unsupported scheme s3n for URI 
s3n://removed
at 
com.coldlight.neuron.data.ClquetPartitionedData.compute(ClquetPartitionedData.java:307)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Unsupported scheme s3n for URI s3n://removed
at com.coldlight.ccc.vfs.NeuronPath.toPath(NeuronPath.java:43)
at 
com.coldlight.neuron.data.ClquetPartitionedData.makeInputStream(ClquetPartitionedData.java:465)
at 
com.coldlight.neuron.data.ClquetPartitionedData.access$200(ClquetPartitionedData.java:42)
at 
com.coldlight.neuron.data.ClquetPartitionedData$Iter.<init>(ClquetPartitionedData.java:330)
at 
com.coldlight.neuron.data.ClquetPartitionedData.compute(ClquetPartitionedData.java:304)
... 8 more
