On 5/16/2016 9:53 AM, Yuval Itzchakov wrote:

AFAIK, the underlying data represented by the Dataset[T] abstraction will be formatted in Tachyon under the hood, but as with RDDs, it will be spilled to local disk on the worker if needed.
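
To make the "spill to local disk if needed" idea concrete, here is a toy sketch in plain Python. It is not Spark's implementation: the SpillingCache class, its max_in_memory limit, and the pickle files are all made up for illustration. In Spark itself, the closest user-facing knob is persisting with StorageLevel.MEMORY_AND_DISK, which keeps partitions that do not fit in memory on disk and reads them back when they are needed.

```python
import os
import pickle
import tempfile


class SpillingCache:
    """Toy cache (hypothetical): holds up to `max_in_memory` partitions
    in memory and writes any further partitions to local disk."""

    def __init__(self, max_in_memory):
        self.max_in_memory = max_in_memory
        self.memory = {}                     # partition id -> data kept in memory
        self.on_disk = {}                    # partition id -> path of spilled file
        self.spill_dir = tempfile.mkdtemp()  # stands in for the worker's local disk

    def put(self, part_id, data):
        # Keep the partition in memory while there is room, otherwise spill it.
        if part_id in self.memory or len(self.memory) < self.max_in_memory:
            self.memory[part_id] = data
        else:
            path = os.path.join(self.spill_dir, f"part-{part_id}.pkl")
            with open(path, "wb") as f:
                pickle.dump(data, f)
            self.on_disk[part_id] = path

    def get(self, part_id):
        # Serve from memory when possible, otherwise read the spilled file back.
        if part_id in self.memory:
            return self.memory[part_id]
        with open(self.on_disk[part_id], "rb") as f:
            return pickle.load(f)

    def total(self):
        # A "query" that spans every partition, in memory or on disk.
        ids = list(self.memory) + list(self.on_disk)
        return sum(sum(self.get(p)) for p in ids)


cache = SpillingCache(max_in_memory=2)
for i in range(4):
    cache.put(i, list(range(i * 10, i * 10 + 10)))
```

With room for only two partitions in memory, partitions 2 and 3 end up on disk, yet cache.total() still covers all four, which is the point of the question: a query spanning data that is not in memory remains answerable, just slower.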



For RDDs there is another option: the Apache Ignite project, a memory grid/distributed cache that supports Spark RDDs. The nice thing about Ignite is that everything is done automatically for you. You can also duplicate caches for resiliency, load caches from disk, partition them, etc., and you get automatic spillover to SQL (and NoSQL) capable backends via read-through/write-through capabilities. I think there is also an effort to support DataFrames. Ignite supports standard SQL for querying the caches too.

On Mon, May 16, 2016, 19:47 Benjamin Kim <bbuil...@gmail.com> wrote:

    I have a curiosity question. These forever/unlimited
    DataFrames/DataSets will persist and be query capable. I still am
    foggy about how this data will be stored. As far as I know, memory
    is finite. Will the data be spilled to disk and be retrievable if
    the query spans data not in memory? Is Tachyon (Alluxio), HDFS
    (Parquet), NoSQL (HBase, Cassandra), RDBMS (PostgreSQL, MySQL),
    Object Store (S3, Swift), or anything else I can’t think of going to be
    the underlying near real-time storage system?

    Thanks,
    Ben

