Re: spark disk-to-disk

2015-03-24 Thread Koert Kuipers
imran,
great, i will take a look at the pull request. it seems we are interested
in similar things.



Re: spark disk-to-disk

2015-03-24 Thread Imran Rashid
I think writing to hdfs and reading it back again is totally reasonable.
In fact, in my experience, writing to hdfs and reading back in actually
gives you a good opportunity to handle some other issues as well:

a) instead of just writing as an object file, I've found it helpful to
write in a format that is a little more readable: JSON if efficiency
doesn't matter :), or you could use something like Avro, which at least has
a good set of command-line tools.
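
For example, a quick sketch of the JSON variant (the paths and the pair
type are made up, and sc is a spark-shell style context):

val counts: org.apache.spark.rdd.RDD[(String, Long)] =
  sc.textFile("hdfs:///pipeline/step-10").map(w => (w, 1L)).reduceByKey(_ + _)
counts
  .map { case (k, v) => s"""{"key":"$k","count":$v}""" }  // assumes keys need no escaping
  .saveAsTextFile("hdfs:///pipeline/step-11-json")       // one json object per line, greppable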

b) when developing, I hate it when I introduce a bug in step 12 of a long
pipeline, and need to re-run the whole thing.  If you save to disk, you can
write a little application logic that realizes step 11 is already sitting
on disk, and just restart from there.
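
A sketch of the resume logic I mean, assuming one output directory per step
(all the names here are illustrative):

import org.apache.hadoop.fs.Path

def runStep(n: Int)(compute: => org.apache.spark.rdd.RDD[String]): org.apache.spark.rdd.RDD[String] = {
  val out = new Path(s"hdfs:///pipeline/step-$n")
  val fs = out.getFileSystem(sc.hadoopConfiguration)
  // a _SUCCESS marker means the step already completed, so never recompute it
  if (fs.exists(new Path(out, "_SUCCESS"))) sc.textFile(out.toString)
  else {
    val rdd = compute
    rdd.saveAsTextFile(out.toString)
    sc.textFile(out.toString)
  }
}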

c) writing to disk is also a good opportunity to do a little crude
auto-tuning of the number of partitions.  You can look at the size of
each partition on hdfs, and then adjust the number of partitions.
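
Roughly like this (the 128mb-per-partition target is just an assumption):

import org.apache.hadoop.fs.Path

// pick the next step's partition count from the bytes the previous step wrote
def tunedPartitions(dir: String, targetBytes: Long = 128L * 1024 * 1024): Int = {
  val path = new Path(dir)
  val total = path.getFileSystem(sc.hadoopConfiguration).getContentSummary(path).getLength
  math.max(1, (total / targetBytes).toInt)
}

sc.textFile("hdfs:///pipeline/step-11").repartition(tunedPartitions("hdfs:///pipeline/step-11"))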

And I completely agree that losing the partitioning info is a major
limitation -- I submitted a PR to help deal w/ it:

https://github.com/apache/spark/pull/4449

getting narrow dependencies w/ partitioners can lead to pretty big
performance improvements, so I do think it's important to make it easily
accessible to the user.  Though now I'm thinking that maybe this API is a
little clunky, and this should get rolled into the other changes you are
proposing to hadoop RDD & friends -- but I'll go into more discussion on
that thread.
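
For a bit of background on why the partitioner matters so much, a tiny
sketch: when both sides of a join already share the same partitioner, the
join becomes a narrow dependency and avoids another shuffle.

import org.apache.spark.HashPartitioner

val p = new HashPartitioner(8)
val left  = sc.parallelize(Seq("a" -> 1, "b" -> 2)).partitionBy(p)
val right = sc.parallelize(Seq("a" -> "x", "b" -> "y")).partitionBy(p)
left.join(right)  // co-partitioned: each output partition reads exactly one
                  // parent partition from each side, no extra shuffle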



On Mon, Mar 23, 2015 at 12:55 PM, Koert Kuipers ko...@tresata.com wrote:

 there is a way to reinstate the partitioner, but that requires
 sc.objectFile to read exactly what i wrote, which means sc.objectFile
 should never split files on reading (a feature of hadoop's FileInputFormat
 that gets in the way here).







Re: spark disk-to-disk

2015-03-23 Thread Reynold Xin
Maybe implement a very simple function that uses the Hadoop API to read in
based on the file names (i.e. the part files)?
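
E.g., very roughly (just a sketch of the idea with made-up paths, not a
worked-out API):

import org.apache.hadoop.fs.Path

val dir = new Path("hdfs:///pipeline/step-11")
val fs  = dir.getFileSystem(sc.hadoopConfiguration)
val parts = fs.listStatus(dir).map(_.getPath)
  .filter(_.getName.startsWith("part-"))
  .sortBy(_.getName)
// one rdd per part file, in the original partition order; note a file bigger
// than an hdfs block can still be split unless the split size is raised too
val perPart = parts.map(f => sc.objectFile[(String, Long)](f.toString, minPartitions = 1))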

On Mon, Mar 23, 2015 at 10:55 AM, Koert Kuipers ko...@tresata.com wrote:

 there is a way to reinstate the partitioner, but that requires
 sc.objectFile to read exactly what i wrote, which means sc.objectFile
 should never split files on reading (a feature of hadoop's FileInputFormat
 that gets in the way here).







Re: spark disk-to-disk

2015-03-23 Thread Koert Kuipers
i just realized the major limitation is that i lose partitioning info...

On Mon, Mar 23, 2015 at 1:34 AM, Reynold Xin r...@databricks.com wrote:


 On Sun, Mar 22, 2015 at 6:03 PM, Koert Kuipers ko...@tresata.com wrote:

 so finally i can resort to:
 rdd.saveAsObjectFile(...)
 sc.objectFile(...)
 but that seems like a rather broken abstraction.


 This seems like a fine solution to me.




Re: spark disk-to-disk

2015-03-23 Thread Koert Kuipers
there is a way to reinstate the partitioner, but that requires
sc.objectFile to read exactly what i wrote, which means sc.objectFile
should never split files on reading (a feature of hadoop's FileInputFormat
that gets in the way here).
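
one way to keep sc.objectFile from splitting, as a rough sketch (the
property name assumes hadoop 2.x, and the paths are made up):

// force one input split per file, so partition i lines up with part-0000i again
sc.hadoopConfiguration.setLong("mapreduce.input.fileinputformat.split.minsize", Long.MaxValue)
val back = sc.objectFile[(String, Long)]("hdfs:///pipeline/step-11")
// the partitions now match the saved files 1-to-1, but there is still no public
// way to tell spark "this rdd is hash partitioned" without shuffling it again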

On Mon, Mar 23, 2015 at 1:39 PM, Koert Kuipers ko...@tresata.com wrote:

 i just realized the major limitation is that i lose partitioning info...






spark disk-to-disk

2015-03-22 Thread Koert Kuipers
i would like to use spark for some algorithms where i make no attempt to
work in memory, so i read from hdfs and write to hdfs for every step.
of course i would like every step to be evaluated only once, and i have no
need for spark's RDD lineage info, since i persist to reliable storage.

the trouble is, i am not sure how to proceed.

rdd.checkpoint() seems like the obvious candidate to force my computations
to write intermediate data to hdfs and cut the lineage, but
rdd.checkpoint() does not actually trigger a job by itself. the checkpoint
only runs after some other action has triggered a job, so the rdd ends up
being computed twice.
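
to spell out the behavior i mean, a small sketch (paths made up):

sc.setCheckpointDir("hdfs:///tmp/checkpoints")

def expensive(line: String): String = line.toUpperCase  // stand-in for real work
val step = sc.textFile("hdfs:///input").map(expensive)
step.checkpoint()  // only marks the rdd, nothing runs yet
step.count()       // the action computes the lineage once for the count, then the
                   // checkpoint recomputes the same lineage to write it to hdfs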

the suggestion in the docs is to do:
rdd.cache(); rdd.checkpoint()
but that won't work for me since the data does not fit in memory.

instead i could do:
rdd.persist(StorageLevel.DISK_ONLY_2); rdd.checkpoint()
but that leads to the data being written to disk twice in a row, which
seems wasteful.

so finally i can resort to:
rdd.saveAsObjectFile(...)
sc.objectFile(...)
but that seems like a rather broken abstraction.
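
spelled out, the round trip looks like this (sketch, made-up paths):

val counts = sc.textFile("hdfs:///pipeline/step-01").map(w => (w, 1L)).reduceByKey(_ + _)
counts.saveAsObjectFile("hdfs:///pipeline/step-02")
val restored = sc.objectFile[(String, Long)]("hdfs:///pipeline/step-02")  // note the explicit element type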

any ideas? i feel like i am missing something obvious. or i am running yet
again into spark's historical in-memory bias?


Re: spark disk-to-disk

2015-03-22 Thread Reynold Xin
On Sun, Mar 22, 2015 at 6:03 PM, Koert Kuipers ko...@tresata.com wrote:

 so finally i can resort to:
 rdd.saveAsObjectFile(...)
 sc.objectFile(...)
 but that seems like a rather broken abstraction.


This seems like a fine solution to me.