Re: ephemeral storage level in spark ?
The off-heap storage level is currently tied to Tachyon, but it might support other forms of off-heap storage later. However, it’s not really designed to be mixed with the other storage levels. For this use case you may want to rely on memory locality and have some custom code to push the data to the accelerator. If you can think of a general way to extend the storage level concept to handle this, do send a proposal.

Matei

On Apr 5, 2014, at 5:14 PM, Mridul Muralidharan mri...@gmail.com wrote:

> No, I am thinking along the lines of writing to an accelerator card or a dedicated card with its own memory.
>
> Regards,
> Mridul
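As a rough illustration of the "custom code to push the data to the accelerator" idea, here is a minimal sketch of a per-worker cache that remembers which blocks are already resident on the device and copies only on a miss. Nothing here is a Spark API: `DeviceBlockCache`, `copy_to_device`, `load_block`, and the block ID `"rdd_0_3"` are all hypothetical names chosen for the example, and the fake copy function stands in for whatever driver call would really move the bytes.

```python
from typing import Callable, Dict, Hashable


class DeviceBlockCache:
    """Tracks which blocks are already resident on a worker-local device.

    Hypothetical helper, not part of Spark. `copy_to_device` is an assumed
    callable that moves bytes onto the accelerator and returns a handle.
    """

    def __init__(self, copy_to_device: Callable[[Hashable, bytes], object]):
        self._copy_to_device = copy_to_device
        self._resident: Dict[Hashable, object] = {}
        self.copies = 0  # number of host-to-device copies performed

    def get_or_copy(self, block_id: Hashable, load_block: Callable[[], bytes]):
        """Return the device handle for block_id, copying only on a miss."""
        if block_id not in self._resident:
            # Spark stays the source of truth: read the block from it,
            # then push the bytes to the device once.
            self._resident[block_id] = self._copy_to_device(block_id, load_block())
            self.copies += 1
        return self._resident[block_id]

    def evict(self, block_id: Hashable) -> None:
        """Forget a block, e.g. after the device has reclaimed its memory."""
        self._resident.pop(block_id, None)


def fake_copy(block_id, data):
    # Stand-in for a real device transfer; returns a fake handle.
    return ("device", block_id, len(data))


cache = DeviceBlockCache(fake_copy)
first = cache.get_or_copy("rdd_0_3", lambda: b"abcd")   # miss: copies to device
second = cache.get_or_copy("rdd_0_3", lambda: b"abcd")  # hit: no second copy
```

The cache only pays off if tasks for the same block keep landing on the same worker, which is why it leans on Spark's locality-aware scheduling rather than on a storage level. Because the device copy is ephemeral, `evict` simply drops the entry and the next access reloads the block from Spark.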
ephemeral storage level in spark ?
Hi,

We have a requirement to use (potentially) ephemeral storage that is not within the VM and is strongly tied to a worker node. The source of truth for a block would still be within Spark, but to actually do computation we would need to copy the data to an external device. The data might stay there for a while, so data locality really helps: if the block is already present on the device, we can avoid a subsequent copy when computing on the same block again.

I was wondering if the recently added storage level for Tachyon would help in this case (note: Tachyon itself won't help; just the storage level might). What sort of guarantees does it provide? How extensible is it? Or is it strongly tied to Tachyon with only a generic name?

Thanks,
Mridul
Re: ephemeral storage level in spark ?
Hi Mridul,

Do you mean the scenario where different Spark applications need to read the same raw data, which is stored in a remote cluster or on remote machines, and the goal is to load the remote raw data only once?

Haoyuan

On Sat, Apr 5, 2014 at 4:30 PM, Mridul Muralidharan mri...@gmail.com wrote:

> Hi,
>
> We have a requirement to use (potentially) ephemeral storage that is not within the VM and is strongly tied to a worker node. The source of truth for a block would still be within Spark, but to actually do computation we would need to copy the data to an external device. The data might stay there for a while, so data locality really helps: if the block is already present on the device, we can avoid a subsequent copy when computing on the same block again.
>
> I was wondering if the recently added storage level for Tachyon would help in this case (note: Tachyon itself won't help; just the storage level might). What sort of guarantees does it provide? How extensible is it? Or is it strongly tied to Tachyon with only a generic name?
>
> Thanks,
> Mridul

--
Haoyuan Li
Algorithms, Machines, People Lab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/
Re: ephemeral storage level in spark ?
No, I am thinking along the lines of writing to an accelerator card or a dedicated card with its own memory.

Regards,
Mridul

On Apr 6, 2014 5:19 AM, Haoyuan Li haoyuan...@gmail.com wrote:

> Hi Mridul,
>
> Do you mean the scenario where different Spark applications need to read the same raw data, which is stored in a remote cluster or on remote machines, and the goal is to load the remote raw data only once?
>
> Haoyuan