I just wanted to bump this topic a little, because these questions come up quite often and the subject is very interesting to me.
Thank you!
- Mitch

MitchK wrote:
>
> Hello community, and a nice Saturday,
>
> from several discussions about Solr and Nutch, I got some questions for a
> virtual web search engine.
>
> The requirements:
> I. I need a scalable solution for a growing index that becomes larger than
> one machine can handle. If I add more hardware, I want performance to
> improve linearly.
>
> II. I want to use technologies like the OPIC algorithm (the default
> algorithm in Nutch) or PageRank or whatever else is out there to improve
> the ranking of the webpages.
>
> III. I want to be able to easily add more fields to my documents. Imagine
> one retrieves information from a webpage's content; then I want to make it
> searchable.
>
> IV. While fetching my data, I want to make special searches possible. For
> example, I want to retrieve pictures from a webpage, index the
> picture-related content into another search index, and save a small
> thumbnail of the picture itself. Btw: this is (as far as I know) not
> possible with Solr, because Solr was not intended to do such special
> indexing logic.
>
> V. I want to use filter queries (e.g. the main query "christopher lee"
> returns 1.5 million results; for the subquery "action", the main query
> would become a filter query and "action" would be the actual query, so a
> search within search results would be easy to offer).
>
> VI. I want to be able to use different logic for different pages. Maybe I
> have a pool of 100 domains that I know better than others, and I have
> special scripts that retrieve more specific information from those 100
> domains. Then I want to apply my special logic to those 100 domains, but
> every other domain should use the default logic.
>
> -----------------
>
> The project is only virtual. So why am I asking?
> I want to learn more about web search and I would like to gain some new
> experience.
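
To make requirement V concrete, here is a minimal sketch of how "search within search results" could map onto Solr's standard `q` and `fq` request parameters. The function name and the base URL (a local Solr instance on the default port) are assumptions for illustration only; the sketch just builds the request URL and does not contact a server.

```python
from urllib.parse import urlencode, parse_qs, urlsplit


def build_search_within_results_url(base_url, main_query, sub_query, rows=10):
    """Build a Solr select URL where the earlier main query becomes a filter.

    base_url is a hypothetical Solr select endpoint. The sub query goes into
    `q` (it is scored), the main query goes into `fq` (it only restricts the
    result set and does not influence the score).
    """
    params = [
        ("q", sub_query),    # the actual, scored query, e.g. action
        ("fq", main_query),  # the former main query, now a filter
        ("rows", str(rows)),
        ("wt", "json"),
    ]
    return base_url + "?" + urlencode(params)


# Usage: refine the "christopher lee" result set with the sub query "action".
url = build_search_within_results_url(
    "http://localhost:8983/solr/select",  # assumed local Solr instance
    '"christopher lee"',
    "action",
)
parsed = parse_qs(urlsplit(url).query)
print(parsed["q"], parsed["fq"])
```

One thing to keep in mind with this approach: because a filter query does not contribute to scoring, the results would be ranked by "action" alone, which may or may not be what a search-within-results feature should do.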
>
> What do I know about Solr + Nutch:
> As it is said on lucidimagination.com, Solr + Nutch does not scale if the
> index is too large.
> The article was somewhat old, and I don't know whether this problem has
> been fixed by the new distributed abilities of Solr.
>
> Furthermore, I don't want to index the pages with Nutch and reindex them
> with Solr.
> The only exception would be: if the content of a webpage gets indexed by
> Nutch, I want to use the already tokenized content of the body with some
> Solr copyField operations to extend the search (e.g. to make fuzzy search
> possible). At the moment, I don't think this is possible.
>
> I don't know much about the Droids project and how well it is documented.
> But from what I can read in some posts by Otis, it seems to be usable as a
> crawler framework.
>
>
> Pros for Nutch are: It is very scalable! Thanks to Hadoop and MapReduce it
> is a scaling monster (from what I've read).
>
> Cons: The search is not as rich as what is possible with Solr. Extending
> Nutch's search abilities *seems* to be more complicated than with Solr.
> Furthermore, if I want to use Solr to search Nutch's index, looking at my
> requirements I would need to reindex the whole thing - without the
> benefits of Hadoop.
>
> What I don't know at the moment is how it is possible to use algorithms
> like those mentioned in II. with Solr.
>
> I hope you understand the problem here - Solr *seems* to me not to be
> the best solution for a web search engine, because of scaling issues in
> indexing.
>
>
> Where should I dive deeper?
> Solr + Droids?
> Solr + Nutch?
> Nutch + howToExtendNutchToMakeSearchBetter?
>
>
> Thanks for the discussion!
> - Mitch
>

--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp890640p894391.html
Sent from the Solr - User mailing list archive at Nabble.com.