
> Given that our agg sizes will exceed memory, we expect to cache them to
> disk, so save-as-object (assuming there are no out of the ordinary
> performance issues) may solve the problem, but I was hoping to store data
> in a column-oriented format.  However I think this in general is not
> possible - Spark can *read* Parquet, but I think it cannot write Parquet as
> a disk-based RDD format.

Spark can write Parquet, via the ParquetOutputFormat class that is distributed
with Parquet. If you'd like example code for writing out to Parquet, please
see the adamSave function in
https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMRDDFunctions.scala,
starting at line 62. There is a bit of setup necessary for the Parquet
write codec, but otherwise it is fairly straightforward.
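
To give a rough idea, the pattern boils down to something like the sketch
below. This is not the ADAM code verbatim: MyRecord stands in for an
Avro-generated record class, saveAsParquet is just a wrapper name, and the
codec/dictionary settings are only examples, so adapt as needed (the parquet.*
imports come from the parquet-hadoop and parquet-avro artifacts).

  import org.apache.hadoop.mapreduce.Job
  import org.apache.spark.rdd.RDD
  import parquet.avro.AvroParquetOutputFormat
  import parquet.hadoop.ParquetOutputFormat
  import parquet.hadoop.metadata.CompressionCodecName

  // MyRecord is a placeholder for any Avro-generated (SpecificRecord) class.
  def saveAsParquet(rdd: RDD[MyRecord], path: String): Unit = {
    // Hadoop 2 API; on Hadoop 1 use `new Job(conf)` instead.
    val job = Job.getInstance(rdd.context.hadoopConfiguration)

    // Setup for the Parquet write codec and the Avro schema.
    ParquetOutputFormat.setCompression(job, CompressionCodecName.GZIP)
    ParquetOutputFormat.setEnableDictionary(job, true)
    AvroParquetOutputFormat.setSchema(job, MyRecord.getClassSchema())

    // Parquet's Hadoop output format expects (key, value) pairs; the key is
    // ignored, so a null key is fine.
    rdd.map(r => (null, r))
       .saveAsNewAPIHadoopFile(
         path,
         classOf[java.lang.Void],
         classOf[MyRecord],
         classOf[AvroParquetOutputFormat],
         job.getConfiguration)
  }

Reading it back can go through AvroParquetInputFormat in the same way, or via
SQLContext.parquetFile in Spark 1.0.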

Frank Austin Nothaft
fnoth...@berkeley.edu
fnoth...@eecs.berkeley.edu
202-340-0466


On Thu, Jun 12, 2014 at 7:03 AM, Christopher Nguyen <c...@adatao.com> wrote:

> Toby, #saveAsTextFile() and #saveAsObjectFile() are probably what you want
> for your use case. As for Parquet support, that's newly arrived in Spark
> 1.0.0 together with SparkSQL so continue to watch this space.
>
> Gerard's suggestion to look at JobServer, which you can generalize as
> "building a long-running application which allows multiple clients to
> load/share/persist/save/collaborate-on RDDs" satisfies a larger, more
> complex use case. That is indeed the job of a higher-level application,
> subject to a wide variety of higher-level design choices. A number of us
> have successfully built Spark-based apps around that model.
> --
> Christopher T. Nguyen
> Co-founder & CEO, Adatao <http://adatao.com>
> linkedin.com/in/ctnguyen
>
>
>
> On Thu, Jun 12, 2014 at 4:35 AM, Toby Douglass <t...@avocet.io> wrote:
>
>> On Thu, Jun 12, 2014 at 11:36 AM, Gerard Maas <gerard.m...@gmail.com>
>> wrote:
>>
>>> The goal of rdd.persist is to create a cached RDD so that computations
>>> *in the same job* that use that RDD can re-use that intermediate result
>>> instead of recomputing it, but it's not meant to survive between job
>>> runs.
>>>
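>>> A minimal sketch of that within-one-job reuse (the input path and the
>>> aggregation here are just placeholders, and `sc` is an existing
>>> SparkContext):
>>>
>>>   import org.apache.spark.storage.StorageLevel
>>>
>>>   // Any RDD that is expensive to recompute.
>>>   val lines = sc.textFile("hdfs:///data/events")
>>>   val errors = lines.filter(_.contains("ERROR"))
>>>
>>>   // MEMORY_AND_DISK spills partitions to local disk when they do not fit
>>>   // in memory; the cache lives only for the lifetime of this application.
>>>   errors.persist(StorageLevel.MEMORY_AND_DISK)
>>>
>>>   errors.count()   // the first action computes and caches the RDD
>>>   errors.take(10)  // later actions in the same job reuse the cache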
>>
>> As I understand it, Spark is designed for interactive querying, in the
>> sense that the caching of intermediate results eliminates the need to
>> recompute those results.
>>
>> However, if intermediate results last only for the duration of a job
>> (e.g. a Python script), how exactly is interactive querying actually
>> performed?  A script is not an interactive medium.  Is the shell the only
>> medium for interactive querying?
>>
>> Consider a common use case: a web site which offers reporting on a large
>> data set.  Users issue arbitrary queries.  A few queries (just with
>> different arguments) dominate the query load, so we thought to create
>> intermediate RDDs to service those queries, so that only those much
>> smaller RDDs (an order of magnitude or more smaller) would need to be
>> processed.  Where this is not possible, we can only use Spark for
>> reporting by issuing each query over the whole data set - i.e. Spark is
>> just like Impala is just like Presto is just like [nnn].  The enormous
>> benefit of RDDs - the entire point of Spark, so profoundly useful here -
>> is not available.  What a huge and unexpected loss!  Spark seemingly
>> renders itself ordinary.  It is for this reason I am surprised to find
>> this functionality is not available.
>>
>>
>>> If you need to ad-hoc persist to files, you can save RDDs using
>>> rdd.saveAsObjectFile(...) [1] and load them afterwards using
>>> sparkContext.objectFile(...)
>>>
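>>> For instance, a minimal round trip (the paths, the element type and the
>>> SparkContext `sc` are assumptions for the sake of the example):
>>>
>>>   // Job 1: compute the intermediate result once and materialise it.
>>>   val agg = sc.textFile("hdfs:///data/events")
>>>               .map(line => (line.split(',')(0), 1L))
>>>               .reduceByKey(_ + _)
>>>   agg.saveAsObjectFile("hdfs:///cache/agg-by-key")
>>>
>>>   // Job 2, possibly a different application run later: reload it rather
>>>   // than recomputing it from the raw data.
>>>   val reloaded = sc.objectFile[(String, Long)]("hdfs:///cache/agg-by-key")
>>>   reloaded.filter(_._2 > 100).collect()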
>>
>> I've been using this site for docs;
>>
>> http://spark.apache.org
>>
>> Here we find, through the top-of-the-page menus, the link "API Docs" ->
>> "Python API", which brings us to;
>>
>> http://spark.apache.org/docs/latest/api/python/pyspark.rdd.RDD-class.html
>>
>> Where this page does not show the function saveAsObjectFile().
>>
>> I find now from your link here;
>>
>>
>> https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.rdd.RDD
>>
>> What appears to be a second and more complete set of the same
>> documentation, using a different web-interface to boot.
>>
>> It appears, at least, that there are two sets of documentation for the
>> same APIs, where one set is out of date and the other is not, and the
>> out-of-date set is the one linked to from the main site?
>>
>> Given that our agg sizes will exceed memory, we expect to cache them to
>> disk, so save-as-object (assuming there are no out of the ordinary
>> performance issues) may solve the problem, but I was hoping to store data
>> in a column-oriented format.  However I think this in general is not
>> possible - Spark can *read* Parquet, but I think it cannot write Parquet as
>> a disk-based RDD format.
>>
>> If you want to preserve the RDDs in memory between job runs, you should
>>> look at the Spark-JobServer [3]
>>>
>>
>> Thank you.
>>
>> I view this with some trepidation.  It took two man-days to get Spark
>> running (and I've spent another man-day now trying to get a map/reduce to
>> run; I'm getting there, but not there yet) - the bring-up/config experience
>> for end-users is not tested or accurately documented (although, to be
>> clear, it is no better and no worse than is normal for open source; Spark
>> is not exceptional).  Having to bring up another open source project is a
>> significant barrier to entry; it's always such a headache.
>>
>> The save-to-disk function you mentioned earlier will allow intermediate
>> RDDs to go to disk, but we do in fact have a use case where in-memory would
>> be useful; it might allow us to ditch Cassandra, which would be wonderful,
>> since it would reduce the system count by one.
>>
>> I have to say, having to install JobServer to achieve this one end seems
>> an extraordinarily heavyweight solution - a whole new application, when all
>> that is wished for is for Spark to persist RDDs across jobs; so small a
>> feature seems to open the door to so much functionality.
>>
>>
>>
>
