Thanks for the feedback, everyone. We've had a look at different SQL-based
solutions and have got good performance out of them, but some of the
reports we make can't be generated with a single SQL query. This is just
an investigation to see if Spark is a viable alternative.
I've got another
I agree with the others that a dedicated NoSQL datastore can make sense. You
should look at the lambda architecture paradigm. Keep in mind that more memory
does not necessarily mean more performance; what matters is having the right
data structure for your users' queries. Additionally, if your queries
Any specific reason to choose Spark? It sounds like you have a
write-once, read-many dataset, which is logically partitioned across
customers, sitting in some data store. And essentially you are looking for
a fast way to access it, and most likely you will use the same partition
key for
Hi Allan,
Where is the data stored right now? If it's in a relational database, and you
are using Spark with Hadoop, I feel it would make sense to import the data
into HDFS, just because it would be faster to access the data there. You
could use Sqoop to do that.
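For reference, a Sqoop import along those lines might look like the sketch below. The connection string, credentials, table name, split column, and target path are all hypothetical placeholders, not details from this thread:

```shell
# Hypothetical sketch: bulk-import a relational table into HDFS with Sqoop.
# All names (host, database, table, paths) are placeholders.
sqoop import \
  --connect jdbc:mysql://db-host:3306/reports \
  --username report_user -P \
  --table customer_events \
  --split-by customer_id \
  --num-mappers 8 \
  --target-dir /data/customer_events \
  --as-parquetfile
```

Splitting on the customer column (`--split-by`) parallelizes the import and lines up naturally with a dataset that is logically partitioned per customer; `--as-parquetfile` stores it in a columnar format that Spark reads efficiently.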
In terms of having a
Hi,
I am looking to use Spark to help execute queries against a reasonably
large dataset (1 billion rows). I'm a bit lost with all the different
libraries/add-ons to Spark, and am looking for some direction as to what
I should look at / what may be helpful.
A couple of relevant points:
- The