On 5/16/2016 9:53 AM, Yuval Itzchakov wrote:
AFAIK, the underlying data represented under the DataSet[T]
abstraction will be formatted in Tachyon under the hood, but as with
RDDs, it will be spilled to local disk on the worker if needed.
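To make the spill-to-disk point concrete, here is a minimal sketch, assuming the Spark 2.x+ SparkSession API and a local master. The Event case class, the object name and the row count are hypothetical and only for illustration. It asks Spark explicitly for the MEMORY_AND_DISK storage level, under which partitions that do not fit in memory are written to the worker's local disk and read back when a query touches them.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  // Hypothetical record type, used only for illustration.
  case class Event(id: Long, payload: String)

  // Returns (total rows, rows with an even id) computed from a persisted Dataset.
  def run(spark: SparkSession): (Long, Long) = {
    import spark.implicits._

    // Build a small Dataset[Event] from a local collection.
    val events: Dataset[Event] =
      (0L until 100000L).map(i => Event(i, s"payload-$i")).toDS()

    // Keep partitions in memory where possible; partitions that do not fit
    // are spilled to local disk on the worker instead of being recomputed.
    events.persist(StorageLevel.MEMORY_AND_DISK)

    val total = events.count()                        // first action materializes the cache
    val even  = events.filter($"id" % 2 === 0).count() // served from the persisted data

    events.unpersist()
    (total, even)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, so the sketch runs without a cluster.
    val spark = SparkSession.builder()
      .appName("persist-sketch")
      .master("local[*]")
      .getOrCreate()
    try println(run(spark))
    finally spark.stop()
  }
}
```

As far as I know, calling cache() on a Dataset uses this same MEMORY_AND_DISK level by default, so the explicit persist call above mainly serves to make the choice visible.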
There is another option in the case of RDDs: the Apache Ignite project, a
memory grid/distributed cache that supports Spark RDDs. The nice thing
about Ignite is that everything is done automatically for you. You can
also duplicate caches for resiliency, load caches from disk, partition
them, etc., and you also get automatic spillover to SQL (and NoSQL)
capable backends via read-through/write-through capabilities. I think
there is also an effort to support DataFrames. Ignite supports standard
SQL to query the caches too.
On Mon, May 16, 2016, 19:47 Benjamin Kim <bbuil...@gmail.com> wrote:
I have a curiosity question. These forever/unlimited
DataFrames/DataSets will persist and be query capable. I am still
foggy about how this data will be stored. As far as I know, memory
is finite. Will the data be spilled to disk and be retrievable if
the query spans data not in memory? Is Tachyon (Alluxio), HDFS
(Parquet), NoSQL (HBase, Cassandra), RDBMS (PostgreSQL, MySQL),
Object Store (S3, Swift), or anything else I can’t think of going to
be the underlying near real-time storage system?
Thanks,
Ben