Hello all, thanks for your generous help.
I think I now have everything I need. (What I want to do is build a web crawler and index the documents it finds.) I will start with the setup Ephraim suggested: several sharded masters, each with at least one slave for reads, plus some aggregators for querying. This is only a prototype to learn more... The Google PDF from Walter is also very interesting; that is something I can try if I hit the limits of the setup above. But before that, I have to learn much more about indexing, index building, and Solr/Lucene in general.

Thanks again for your help!!

best regards
jens

2011/4/7 Walter Underwood <wun...@wunderwood.org>

> On Apr 6, 2011, at 10:29 PM, Jens Mueller wrote:
>
> > Walter, thanks for the advice. Well, you are right about my mentioning
> > Google. My question was also meant to help me understand how such large
> > systems like Google/Facebook actually work. So my numbers are just
> > theoretical and made up. My system will be smaller, but I would be very
> > happy to understand how such large systems are built, and I think the
> > approach Ephraim showed should work quite well at large scale.
>
> Understanding what Google does will NOT help you build your engine, just
> as understanding an F1 race car does not help you build a Toyota Camry.
> One is built for performance only and requires LOTS of support; the other
> for supportability and stability. Very different engineering goals and
> designs.
>
> Here is one view of Google's search setup:
> http://www.linesave.co.uk/google_search_engine.html
>
> This talk gives a lot more detail. Summary in the blog post, slides in
> the PDF. Google's search is entirely in-memory: they load off disk and
> run.
>
> http://glinden.blogspot.com/2009/02/jeff-dean-keynote-at-wsdm-2009.html
> http://research.google.com/people/jeff/WSDM09-keynote.pdf
>
> How big will your system be? Does it require real-time updates?
>
> wunder
> --
> Walter Underwood
> Lead Engineer, MarkLogic
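For anyone following along: the "aggregators for querying" in the setup described above map onto Solr's classic (pre-SolrCloud) distributed search, where the query node passes a `shards` parameter listing the read slaves and Solr fans the query out and merges the per-shard results. A minimal sketch of how such a distributed query URL is built — all host names here are made up for illustration, and this assumes the standard `/select` handler:

```python
from urllib.parse import urlencode

# Hypothetical read slaves, one per sharded master.
SHARD_SLAVES = [
    "slave1.example.com:8983/solr",
    "slave2.example.com:8983/solr",
    "slave3.example.com:8983/solr",
]

def distributed_query_url(aggregator, q, rows=10):
    """Build the URL an aggregator node would hit to fan a query
    out over all shards; Solr merges the per-shard results."""
    params = urlencode({
        "q": q,
        "rows": rows,
        # Comma-separated shard list, as Solr's distributed search expects.
        "shards": ",".join(SHARD_SLAVES),
    })
    return f"http://{aggregator}/solr/select?{params}"

url = distributed_query_url("agg1.example.com:8983", "web crawler")
print(url)
```

Note that in this scheme the aggregator itself can be any Solr node; it only needs the `shards` list, while writes still go to the individual masters and replicate down to the slaves.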