Great. Thank you very much, Michael :-D

On Mon, Oct 27, 2014 at 2:03 PM, Michael Armbrust <mich...@databricks.com> wrote:
> I'd suggest checking out the Spark SQL programming guide to answer this
> type of query:
> http://spark.apache.org/docs/latest/sql-programming-guide.html
>
> You could also perform it using the raw Spark RDD API
> <http://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.rdd.RDD>,
> but it's often the case that the in-memory columnar caching of Spark SQL is
> faster and more space-efficient.
>
> On Mon, Oct 27, 2014 at 6:27 AM, Peter Wolf <opus...@gmail.com> wrote:
>
>> I agree. I'd like to avoid SQL.
>>
>> If I could store everything in Cassandra or Mongo and process in Spark,
>> that would be far preferable to creating a temporary working set.
>>
>> I'd like to write a performance test. Let's say I have two large
>> collections, A and B. Each collection has 2 columns and many, many rows.
>> The columns are Id and Value.
>>
>> I want to create a third collection that is the equivalent of the SQL
>> query
>>
>> select A.Id, A.Value, B.Value from A, B where A.Id = B.Id
>>
>> This new collection is the inner join of A and B. It has 3 columns (A.Id,
>> A.Value, B.Value) and one row for each Id that A and B have in common.
>>
>> Furthermore, this table is only needed temporarily, as part of
>> processing. It needs to be created efficiently and accessed quickly.
>>
>> Can someone give me a pointer to the appropriate API and/or example code?
>>
>> Thanks again
>> P
>>
>> On Mon, Oct 27, 2014 at 1:04 AM, Michael Hausenblas <
>> michael.hausenb...@gmail.com> wrote:
>>
>>> > Given that you are storing event data (which is basically things that
>>> have happened in the past AND cannot be modified) you should definitely
>>> look at Event sourcing.
>>> > http://martinfowler.com/eaaDev/EventSourcing.html
>>>
>>> Agreed.
>>> In this context: a lesser-known fact is that the Lambda
>>> Architecture is, in a nutshell, an extension of Fowler's ES, so you might
>>> also want to check out:
>>>
>>> https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
>>>
>>> Cheers,
>>> Michael
>>>
>>> --
>>> Michael Hausenblas
>>> Ireland, Europe
>>> http://mhausenblas.info/
>>>
>>> > On 27 Oct 2014, at 01:14, Soumya Simanta <soumya.sima...@gmail.com> wrote:
>>> >
>>> > Given that you are storing event data (which is basically things that
>>> have happened in the past AND cannot be modified) you should definitely
>>> look at Event sourcing.
>>> > http://martinfowler.com/eaaDev/EventSourcing.html
>>> >
>>> > If all you are doing is storing events, then I don't think you need a
>>> relational database. Rather, an event log is ideal. Please see:
>>> > http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
>>> >
>>> > There are many other datastores that can do a better job of storing
>>> your events. You can process your data and then store the results in a
>>> relational database to query later.
>>> >
>>> > On Sun, Oct 26, 2014 at 9:01 PM, Peter Wolf <opus...@gmail.com> wrote:
>>> > Thanks for all the useful responses.
>>> >
>>> > We have the usual task of mining a stream of events coming from our
>>> many users. We need to store these events and process them. We use a
>>> standard star schema to represent our data.
>>> >
>>> > For the moment, it looks like we should store these events in SQL.
>>> When appropriate, we will do analysis with relational queries. Or, when
>>> appropriate, we will extract data into working sets in Spark.
>>> >
>>> > I imagine this is a pretty common use case for Spark.
>>> >
>>> > On Sun, Oct 26, 2014 at 10:05 AM, Rick Richardson <
>>> rick.richard...@gmail.com> wrote:
>>> > Spark's API definitely covers all of the things that a relational
>>> database can do.
>>> It will probably outperform a relational star schema if
>>> all of your *working* data set can fit into RAM on your cluster. It will
>>> still perform quite well if most of the data fits and some has to spill
>>> over to disk.
>>> >
>>> > What are your requirements, exactly?
>>> > What is "massive amounts of data", exactly?
>>> > How big is your cluster?
>>> >
>>> > Note that Spark is not for data storage, only data analysis. It pulls
>>> data into working data sets called RDDs.
>>> >
>>> > As a migration path, you could probably pull the data out of a
>>> relational database to analyze. But in the long run, I would recommend
>>> using a more purpose-built, huge-storage database such as Cassandra. If
>>> your data is very static, you could also just store it in files.
>>> > On Oct 26, 2014 9:19 AM, "Peter Wolf" <opus...@gmail.com> wrote:
>>> > My understanding is that Spark SQL allows one to access Spark data as
>>> if it were stored in a relational database. It compiles SQL queries into
>>> a series of calls to the Spark API.
>>> >
>>> > I need the performance of a SQL database, but I don't care about doing
>>> queries with SQL.
>>> >
>>> > I create the input to MLlib by doing a massive JOIN query. So, I am
>>> creating a single collection by combining many collections. This sort of
>>> operation is very inefficient in Mongo, Cassandra or HDFS.
>>> >
>>> > I could store my data in a relational database and copy the query
>>> results to Spark for processing. However, I was hoping I could keep
>>> everything in Spark.
>>> >
>>> > On Sat, Oct 25, 2014 at 11:34 PM, Soumya Simanta <
>>> soumya.sima...@gmail.com> wrote:
>>> > 1. What data store do you want to store your data in? HDFS, HBase,
>>> Cassandra, S3 or something else?
>>> > 2. Have you looked at Spark SQL (https://spark.apache.org/sql/)?
>>> >
>>> > One option is to process the data in Spark and then store it in the
>>> relational database of your choice.
>>> >
>>> > On Sat, Oct 25, 2014 at 11:18 PM, Peter Wolf <opus...@gmail.com>
>>> wrote:
>>> > Hello all,
>>> >
>>> > We are considering Spark for our organization. It is obviously a
>>> superb platform for processing massive amounts of data... how about
>>> retrieving it?
>>> >
>>> > We are currently storing our data in a relational database in a star
>>> schema. Retrieving our data requires doing many complicated joins across
>>> many tables.
>>> >
>>> > Can we use Spark as a relational database? Or, if not, can we put
>>> Spark on top of a relational database?
>>> >
>>> > Note that we don't care about SQL. Accessing our data via standard
>>> queries is nice, but we are equally happy (or happier) to write Scala
>>> code.
>>> >
>>> > What is important to us is doing relational queries on huge amounts of
>>> data. Is Spark good at this?
>>> >
>>> > Thank you very much in advance,
>>> > Peter
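For reference, the inner join Peter asks for example code on can be sketched in plain Python as a hash join: build a lookup table on one side, then probe it with the other. This is only an illustration of the relational operation itself, not Spark API code; the collections and values here are invented.

```python
# Two collections with columns (Id, Value), as in Peter's example.
a = [(1, "a1"), (2, "a2"), (3, "a3")]
b = [(2, "b2"), (3, "b3"), (4, "b4")]

def inner_join(left, right):
    """Hash join: index the left side by Id, then probe with the right side.

    Returns rows (Id, left_value, right_value), one per Id common to both.
    """
    lookup = {}
    for key, value in left:
        lookup.setdefault(key, []).append(value)
    joined = []
    for key, right_value in right:
        for left_value in lookup.get(key, []):
            joined.append((key, left_value, right_value))
    return joined

print(inner_join(a, b))  # [(2, 'a2', 'b2'), (3, 'a3', 'b3')]
```

In Spark itself the equivalent would be a `join` on two pair RDDs keyed by Id, or a Spark SQL join over registered tables, as the programming guide linked above describes.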
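Michael's remark that in-memory columnar caching is more space-efficient comes down to storing each column contiguously, which lets low-cardinality columns compress well, e.g. with run-length encoding. A minimal sketch of that idea (the column data is made up):

```python
def run_length_encode(column):
    """Compress a column into [value, run_length] pairs."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([value, 1])  # start a new run
    return runs

# A repetitive column, as is common for dimension keys in a star schema:
# 10 values collapse to 3 runs.
status = ["ok"] * 5 + ["error"] * 2 + ["ok"] * 3
print(run_length_encode(status))  # [['ok', 5], ['error', 2], ['ok', 3]]
```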
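Soumya's event-sourcing suggestion boils down to an append-only log: events are immutable facts, and current state is derived by replaying them. A toy sketch of that pattern, with invented event shapes:

```python
class EventLog:
    """Append-only event log: events are never modified, only appended."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def replay(self, state, apply_event):
        """Derive current state by folding apply_event over all events."""
        for event in self._events:
            state = apply_event(state, event)
        return state

log = EventLog()
log.append({"type": "deposit", "amount": 100})
log.append({"type": "withdraw", "amount": 30})

def apply_event(total, e):
    # State transition: deposits add, withdrawals subtract.
    return total + e["amount"] if e["type"] == "deposit" else total - e["amount"]

print(log.replay(0, apply_event))  # 70
```

The LinkedIn article linked above develops the same idea at data-center scale: the log is the system of record, and downstream stores (relational or otherwise) are derived views.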