Hi Chelios,

I didn't mean that you should use an external engine or Riak; I meant that you should learn what they do so that you can get ideas from these frameworks/engines.
The thing with map reduce is the ability to split data sets among nodes, where a query is the sum of the filtered dataset per node. So how are you going to index your data per node? The answer to that is sharding: once an actor starts on a node, it tells the map reduce engine (your engine), "I'm residing here, index me." Do you get the idea now?

I don't think distributed data is ideal for that, because you need to split the workload among nodes. The filtering task can be heavy, so you need something that partitions and rebuilds the index data for you in case a node goes down; hence sharding is ideal. Say you have the following:

node 1 has items 1, 2 and 3
node 2 has items 4, 5 and 6
node 3 has items 7, 8 and 9

Each item has indexable properties like name, description, etc.

So, how do you query? You send the query to each node's query coordinator, which prepares that node's result and sends it back. So you send the query to each node and each node answers back to you, right? Then you concatenate those results and do something else with them. By sharding you are automatically scaling. See my point now?

HTH,

Guido.

On Thursday, March 31, 2016 at 3:35:26 AM UTC+1, Chelios wrote:
>
> Hey Guido,
>
> Thanks heaps for this info. I only have small theoretical experience with
> map reduce. I will have to study the info you gave me.
>
> The reason I thought of not using any external database is because I'm
> trying to get every small Actor (Customer, Product etc) to manage its own
> small piece of data and live anywhere on the cluster, hoping this will get
> rid of the problem of sharding and partitioning the database. If I used Riak,
> the data would be living in Riak instead of the Actors I instantiated in my
> application, and I'm trying to manage the data by myself. I'm not sure if
> this is a good idea or not. But your comments are helping me.
>
> Apache Crunch looks great, maybe there is a Scala client for this. I will
> read up on it more.
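The sharded query Guido describes (each node filters only its own items, and the partial results are concatenated) might look roughly like the sketch below. It is only an illustration under stated assumptions: plain Scala collections stand in for actors and cluster nodes, and every name in it (`Item`, `Node`, `QueryCoordinator`, the sample descriptions) is hypothetical, not something from the thread or from Akka.

```scala
// Hypothetical names throughout; plain collections stand in for actors and nodes.
final case class Item(id: Int, name: String, description: String)

// One shard per node: a node only knows about the items that live on it.
final case class Node(name: String, items: Vector[Item]) {
  // "Map" step: filter the local shard and return only the ids that match.
  def query(term: String): Vector[Int] =
    items.filter(_.description.toLowerCase.contains(term.toLowerCase)).map(_.id)
}

object QueryCoordinator {
  // "Reduce" step: ask every node, then concatenate the partial results.
  def search(nodes: Seq[Node], term: String): Vector[Int] =
    nodes.flatMap(_.query(term)).toVector.sorted
}

object ShardedSearchDemo {
  // Same layout as in the message: three nodes holding items 1-3, 4-6 and 7-9.
  val nodes: Seq[Node] = Seq(
    Node("node 1", Vector(
      Item(1, "kettle", "steel kettle"),
      Item(2, "mug", "ceramic mug"),
      Item(3, "jar", "glass jar"))),
    Node("node 2", Vector(
      Item(4, "pan", "steel pan"),
      Item(5, "spoon", "wooden spoon"),
      Item(6, "cup", "paper cup"))),
    Node("node 3", Vector(
      Item(7, "box", "plastic box"),
      Item(8, "fork", "steel fork"),
      Item(9, "towel", "cotton towel")))
  )

  def main(args: Array[String]): Unit =
    println(QueryCoordinator.search(nodes, "steel"))
}
```

In a real cluster each `query` call would be a message to a remote node and the replies would arrive asynchronously, but the shape stays the same: the expensive filtering happens where the data lives, and only the small per-node results travel back.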
>
> On Wednesday, March 30, 2016 at 9:53:51 PM UTC+11, Guido Medina wrote:
>>
>> Even if you want to do it yourself, you still have to reduce data from a
>> map. There are papers if you want to create your own implementation of a
>> "map reduce engine".
>> You won't escape that fact if you want your implementation to be
>> competitive. Take a look at Riak: they do the same in Erlang, they have
>> actors too, and they still have to use BloomFilters from Google.
>>
>> They are all basically copies of the same paper, which tells you
>> ways to reduce data very fast using well-known hashing techniques.
>>
>> Guido.
>>
>> On Wednesday, March 30, 2016 at 11:49:24 AM UTC+1, Guido Medina wrote:
>>>
>>> Hi Chelios,
>>>
>>> The problem you are solving is divided in two, and I think it has been
>>> resolved before. It is quite complex, but if you divide and conquer it
>>> might turn out to be easy.
>>> IMHO here are the main aspects of your problem:
>>>
>>> - Your data is distributed; each node with data will return the
>>>   result to the node querying it.
>>> - A query coordinator actor (one of these has to live on each node
>>>   for the sake of saving round-trips) will send such a query to each
>>>   node and expect a list of "map reduced" results.
>>>
>>> The key is to "map reduce". I'm assuming you first want to get the list
>>> of actors that comply with your search criteria and then, once you have
>>> that list, do something with them or via "them".
>>> In that case you want a map reduce in-memory data structure per node
>>> holding data. Assuming each node has a list of workers to parallelize the
>>> query, the rest is simple.
>>>
>>> Some ideas in the following link:
>>> http://www.infoq.com/articles/ApacheCrunch
>>>
>>> HTH,
>>>
>>> Guido.
>>>
>>> On Wednesday, March 30, 2016 at 10:09:04 AM UTC+1, Chelios wrote:
>>>>
>>>> Hi Konrad,
>>>>
>>>> Your reply gave me the confidence to continue with implementing the
>>>> Actor based search. Thank You :D ...
>>>> I'm doing this just for research
>>>> purposes. I'm just trying to see if I can get a highly performant,
>>>> distributed, in-memory system by just using Eventsourcing with Akka
>>>> Actors without using any other external database or tool, other than an
>>>> Eventstore database.
>>>>
>>>> Can I also attend the workshop? Seeing Actor design patterns applied
>>>> to designing a search engine architecture is something I need to learn for
>>>> this :)
>>>>
>>>> Cheers,
>>>> Chel
>>>>
>>>> On Wednesday, March 30, 2016 at 7:29:33 PM UTC+11, Konrad Malawski
>>>> wrote:
>>>>>
>>>>> Technically it's doable, but I'm not sure if that'll reduce
>>>>> complexity :-)
>>>>> Search really has to be "good" in order to be useful, just "fast but
>>>>> bad results" often won't satisfy anyone,
>>>>> thus I'm not sure implementing your own custom search engine is a good
>>>>> idea (unless that is exactly the goal of your business
>>>>> – be a search engine).
>>>>>
>>>>> A fun fact, one of the workshops I do is basically that, a multi-tier
>>>>> search engine architecture, however it depends if your entire job is to
>>>>> build the search engine, or you just should use an out of the box one
>>>>> because it's one of the 100 things you work on :-)
>>>>>
>>>>> --
>>>>> Cheers,
>>>>> Konrad 'ktoso' Malawski
>>>>> Akka <http://akka.io> @ Lightbend <http://typesafe.com> <http://lightbend.com>
>>>>>
>>>>> On 30 March 2016 at 08:42:57, Chelios (chelios....@gmail.com) wrote:
>>>>>
>>>>> Hey guys
>>>>>
>>>>> I've got an Eventsource based application (Not CQRS - Read and write
>>>>> are both on the write side). The state of all the
>>>>> entities/aggregates/actors is stored in memory because the data is not
>>>>> going to go above 120GB and I have a machine with 265GB RAM.
>>>>>
>>>>> *Problem:*
>>>>> Suppose I have a million Products where each *Product* is an Actor
>>>>> supervised by *ProductSupervisorActor* and I want to perform the
>>>>> following query:
>>>>> *Query*: Find all the products where the *product description*
>>>>> matches some user input.
>>>>>
>>>>> I'm wondering if I could get away with just querying the state of the
>>>>> million actors and aggregating the result into one
>>>>> *SearchRequestHandlerActor* instead of using a search database like
>>>>> SOLR? I've used SOLR before and it's super fast, but I'm just trying to
>>>>> reduce the complexity in my application. If the state is already in
>>>>> memory, maybe I can just find a way to query it instead of introducing
>>>>> another moving part (SOLR) into the system that I have to manage and
>>>>> keep synchronized with the data.
>>>>>
>>>>> I would really like to find a solution to perform the above query
>>>>> efficiently by just using Actors with paging. If I can achieve this then
>>>>> I can have *ProductActor*s running anywhere in a cluster and the search
>>>>> would work just fine. Instead, if I was using SOLR I would have to shard
>>>>> or partition the database, which is just another hassle.
>>>>>
>>>>> Right now I've got a *ProductSearchRequestHandlerActor* which, on
>>>>> initialization, accepts *totalNumberOfMessagesExpect: Long* and
>>>>> accepts messages of type *Option[ProductState]* until the
>>>>> *totalNumberOfMessagesExpect* is reached. *I have not implemented
>>>>> paging yet.*
>>>>>
>>>>> I just wanted to get your opinion or ideas on how I can achieve this
>>>>> efficiently, or any tips, or whether I'm being silly for trying this
>>>>> because there is no central index?
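The collect-until-count handler described above, plus the paging that is not implemented yet, could be sketched along these lines. This is not the poster's actor: it is a plain immutable value standing in for the actor's state, and `SearchAggregator`, `onReply`, `page` and the fields of `ProductState` are all assumed names for illustration. Only `Option[ProductState]` and the idea of a fixed expected reply count come from the message.

```scala
// Assumed shape; the thread only mentions the type name ProductState.
final case class ProductState(id: Long, description: String)

// Stand-in for the handler's state: replies seen so far and matches collected.
final case class SearchAggregator(
    expected: Long,
    received: Long = 0L,
    matches: Vector[ProductState] = Vector.empty) {

  // One call per product reply; None means "this product did not match".
  def onReply(reply: Option[ProductState]): SearchAggregator =
    copy(received = received + 1, matches = matches ++ reply.toList)

  // True once every expected product has answered.
  def isComplete: Boolean = received >= expected

  // Zero-based paging over the matches, sorted by id so pages are stable.
  def page(number: Int, size: Int): Vector[ProductState] =
    matches.sortBy(_.id).slice(number * size, number * size + size)
}

object SearchAggregatorDemo {
  val replies: Seq[Option[ProductState]] = Seq(
    Some(ProductState(3L, "red steel kettle")),
    None,
    Some(ProductState(1L, "steel pan")),
    None,
    Some(ProductState(2L, "steel fork")))

  // Feed the replies in arrival order, as the actor would receive them.
  val done: SearchAggregator =
    replies.foldLeft(SearchAggregator(expected = replies.size.toLong)) {
      (aggregator, reply) => aggregator.onReply(reply)
    }

  def main(args: Array[String]): Unit = {
    println(done.isComplete)
    println(done.page(0, 2).map(_.id))
  }
}
```

One design caveat: counting replies only terminates if every product answers, so an actor version would probably also want a timeout for replies that never arrive.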
>>>>>
>>>>> Chel
>>>>> --
>>>>> >>>>>>>>>> Read the docs: http://akka.io/docs/
>>>>> >>>>>>>>>> Check the FAQ: http://doc.akka.io/docs/akka/current/additional/faq.html
>>>>> >>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
>>>>> ---
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "Akka User List" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to akka-user+...@googlegroups.com.
>>>>> To post to this group, send email to akka...@googlegroups.com.
>>>>> Visit this group at https://groups.google.com/group/akka-user.
>>>>> For more options, visit https://groups.google.com/d/optout.