Writing to Parquet and querying the result via SparkSQL works great (except for
some strange SQL parser errors). However, the problem remains: how do I get that
data back to a dashboard? So I guess I’ll have to use a database after all.
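Roughly what I’m doing, pasted from spark-shell (a minimal sketch with the
Spark 1.1-era API; the paths, the table name, and the WordCount case class are
made up):

case class WordCount(word: String, count: Long)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)  // sc comes with the shell
import sqlContext.createSchemaRDD  // implicit RDD[case class] -> SchemaRDD

val counts = sc.textFile("hdfs:///data/input")
  .flatMap(_.split("\\s+"))
  .map((_, 1L))
  .reduceByKey(_ + _)
  .map { case (w, c) => WordCount(w, c) }

// Write the counts out as Parquet...
counts.saveAsParquetFile("hdfs:///out/word_counts.parquet")

// ...and query them back via SparkSQL.
val table = sqlContext.parquetFile("hdfs:///out/word_counts.parquet")
table.registerTempTable("word_counts")
sqlContext.sql("SELECT word, count FROM word_counts ORDER BY count DESC LIMIT 10")
  .collect().foreach(println)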
If your dashboard is doing AJAX/pull requests against, say, a REST API, you
can always create a Spark context in your REST service and use SparkSQL to
query over the Parquet files. The Parquet files are already on disk, so it
seems silly to write both to Parquet and to a DB... unless I'm missing something.
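Something like this hypothetical sketch, say (the HTTP framework, Spray or Play
or whatever, is omitted, and CountsService, the path, and the table name are all
invented for illustration):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// A long-lived query layer the REST endpoints could call into.
class CountsService {
  private val sc = new SparkContext(new SparkConf().setAppName("dashboard-api"))
  private val sqlContext = new SQLContext(sc)

  sqlContext.parquetFile("hdfs:///out/word_counts.parquet")
    .registerTempTable("word_counts")

  // Called by e.g. GET /top-words?n=10
  def topWords(n: Int): Array[(String, Long)] =
    sqlContext.sql(s"SELECT word, count FROM word_counts ORDER BY count DESC LIMIT $n")
      .collect()
      .map(row => (row(0).toString, row(1).asInstanceOf[Long]))
}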
Thank you guys, I’ll try Parquet, and if that’s not quick enough I’ll go the
usual route with either a read-only or a normal database.
On 13.09.2014, at 12:45, andy petrella andy.petre...@gmail.com wrote:
However, the cache is not guaranteed to remain: if other jobs are launched in
the cluster...
I'm using Parquet in ADAM, and I can say that it works pretty well!
Enjoy ;-)
aℕdy ℙetrella
about.me/noootsab
On Mon, Sep 15, 2014 at 1:41 PM, Marius Soutier mps@gmail.com wrote:
Thank you guys, I’ll try Parquet, and if that’s not quick enough I’ll go the
usual route with either a read-only or a normal database.
So you are living the dream of using HDFS as a database? ;)
On 15.09.2014, at 13:50, andy petrella andy.petre...@gmail.com wrote:
I'm using Parquet in ADAM, and I can say that it works pretty well!
Enjoy ;-)
You can cache data in memory and query it using the Spark Job Server. Most
folks dump data down to a queue/DB for retrieval.
You can also batch up data and store it into Parquet partitions, then query it
using another SparkSQL shell; the JDBC driver for SparkSQL is part of 1.1, I believe.
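For the in-memory route, a minimal sketch of what the long-lived context would
do (Spark 1.1 SQLContext API; this shows SparkSQL's own table cache rather than
the Job Server itself, and the path and table name are made up):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("cached-counts"))
val sqlContext = new SQLContext(sc)

sqlContext.parquetFile("hdfs:///out/word_counts.parquet")
  .registerTempTable("word_counts")
sqlContext.cacheTable("word_counts")  // columnar in-memory cache

// Repeated dashboard queries now hit the cache instead of re-reading Parquet.
sqlContext.sql("SELECT count(*) FROM word_counts").collect()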
--
Regards,
Mayur
However, the cache is not guaranteed to remain: if other jobs are launched
in the cluster and require more memory than what's left in the overall
caching memory, previous RDDs will be discarded.
Using an off-heap cache like Tachyon as a dump repo can help, e.g.:
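(A minimal sketch, assuming a Tachyon-backed deployment; OFF_HEAP is the
experimental Tachyon storage level in Spark 1.x, and `counts` stands for
whatever RDD the dashboard needs.)

import org.apache.spark.storage.StorageLevel

// Stored in Tachyon rather than on the executor heap, so it survives
// executor memory pressure; Tachyon itself may still evict blocks.
counts.persist(StorageLevel.OFF_HEAP)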
In general, I'd say that using a
Hi there,
I’m pretty new to Spark, and so far I’ve written my jobs the same way I wrote
Scalding jobs: one-off, read data from HDFS, count words, write counts back to
HDFS.
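A minimal sketch of what such a job looks like, with made-up paths:

import org.apache.spark.{SparkConf, SparkContext}

object WordCountJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("word-count"))
    sc.textFile("hdfs:///data/input")
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)
      .map { case (word, n) => s"$word\t$n" }
      .saveAsTextFile("hdfs:///out/word_counts")
    sc.stop()
  }
}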
Now I want to display these counts in a dashboard. Since Spark allows caching
RDDs in memory and you have to