Luke,
Thank you for your comments - please see my revised proposal below.
Luke Baker wrote:
On 10/21/2004 11:30 AM, Andrzej Bialecki wrote:
I think you're right here for the most part. Wouldn't using NDFS for fetching be somewhat limiting in circumstances where transferring/writing the data saturates the network connection enough to slow down the fetching? I would think that this "limitation" is not really something to worry about now, except to keep it in the back of our minds as we plan how to distribute the search.
This also depends on the topology of your production network. Fetching and NDFS writing don't have to use the same network segment...
My idea is to operate on units that make sense to DS$Server - which means these units must be part of the segment data. Assuming we have the tools to cut&paste fragments of segment data as we wish (which we have, they just need to be wrapped in command-line tools), I propose the following scenario:
Are you saying that we have tools to cut&paste fragments of indexed segments as we please? Meaning we can basically create index segments of any size we want and rearrange them as we want, within reason? Or
We can split/merge non-indexed segments. It is not possible (AFAIK) to split a Lucene index, but it's possible to merge it with another.
are you talking about segments of fetched content (non-indexed)? Perhaps
Yes - they may already be parsed, but still non-indexed.
2. I guess we should dedup the data before distributing it, otherwise it would be more difficult - but it would be nice to be able to do it in step 4...
If we dedup here, how can we make sure we delete the duplicates between this "new" content and the data already being searched on the search servers?
Well, I have several ideas for that, I'm just not sure which way would be the most efficient...
One way is to tell each server to deduplicate the documents locally, right after it receives a new slice. However, deduplication is quite resource-intensive...
In the following paragraphs I describe a different approach, which I think gives us the most flexibility and avoids merging and deduplication on the search servers. I assume that the system is set up to have a single master DB server, which contains the database and a copy of all segments; there are multiple search servers (nodes), which perform searching over local indexes; and there are one or more search frontends which dispatch queries to the search servers.
Segments are a natural unit of work for fetching and updating the DB. However, in parallel to segments we would now create "slices", i.e. fragments of segments of fixed size, for deployment purposes. Each slice would contain e.g. 1 million entries (which corresponds roughly to 20 GB of segment data - 1 million documents seems like a good unit from the point of view of Lucene's MultiReader).
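Just to make the MultiReader angle concrete, here's a rough sketch of how a search node could open its locally deployed (already indexed) slices as one logical index. I'm using a recent Lucene API purely for illustration - class and method names differ between Lucene versions - and the directory layout is made up:

import java.io.IOException;
import java.nio.file.Path;
import java.util.List;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

class SliceSearcher {
  // Open every active slice index on this node and expose them as one searcher.
  static IndexSearcher open(List<Path> sliceIndexDirs) throws IOException {
    IndexReader[] readers = new IndexReader[sliceIndexDirs.size()];
    for (int i = 0; i < readers.length; i++)
      readers[i] = DirectoryReader.open(FSDirectory.open(sliceIndexDirs.get(i)));
    return new IndexSearcher(new MultiReader(readers));
  }
}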
The slices wouldn't have to be populated with the real content right away; they could be created just before deployment, based on entries from a "slice DB" containing segment/docNum references for each generated sliceId. That "slice DB" would always keep track of which slices contain which versions of documents, i.e. the slice DB would keep tuples:
<sliceId, entryNum, segmentId, docNum, urlHash, contentHash, score, lastFetched>
(urlHash, contentHash, score and lastFetched are all the data required for a standard deduplication.) In addition, the slice DB would keep a list of active/inactive slices.
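In code, a slice DB entry might look something like the sketch below (modern Java, purely for illustration; all names are mine):

import java.util.*;

// One slice DB tuple, mirroring the fields listed above.
record SliceEntry(int sliceId, int entryNum, String segmentId, int docNum,
                  long urlHash, long contentHash, float score, long lastFetched) {}

// Minimal slice DB: entries grouped by slice, plus the active/inactive flag per slice.
class SliceDB {
  final Map<Integer, List<SliceEntry>> entriesBySlice = new HashMap<>();
  final Map<Integer, Boolean> activeSlices = new HashMap<>();
}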
Then you can perform deduplication on the master DB server, using the data from the newly fetched segments and the data from the slice DB, i.e. the working set for the dedup operation would be the active entries from slice DB + documents from the new segments.
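To sketch what that dedup pass could do (the keep-newest-URL / keep-best-content policy below is just my assumption of a reasonable default, reusing the SliceEntry record from above):

import java.util.*;

class SliceDedup {
  // Working set = active SliceEntry rows from the slice DB + entries for the new segments.
  // Keep the most recently fetched entry per urlHash, then the highest-scoring entry per
  // contentHash; everything else ends up on the delete list.
  static List<SliceEntry> deleteList(Collection<SliceEntry> workingSet) {
    Map<Long, SliceEntry> byUrl = new HashMap<>();
    for (SliceEntry e : workingSet)
      byUrl.merge(e.urlHash(), e, (a, b) -> a.lastFetched() >= b.lastFetched() ? a : b);

    Map<Long, SliceEntry> byContent = new HashMap<>();
    for (SliceEntry e : byUrl.values())
      byContent.merge(e.contentHash(), e, (a, b) -> a.score() >= b.score() ? a : b);

    Set<SliceEntry> keep = new HashSet<>(byContent.values());
    List<SliceEntry> deletes = new ArrayList<>();
    for (SliceEntry e : workingSet)
      if (!keep.contains(e)) deletes.add(e);
    return deletes;
  }
}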
Unique data from the new segments would be added to form new slices. The deduplication process would also produce "delete lists" for old slices, to remove obsolete copies of content from search servers. These delete lists would be applied to the "slice DB" to invalidate old entries, and they would also be distributed to the appropriate search servers (i.e. those that contain the relevant slices) - the delete operation on a Lucene index is relatively quick and simple. New slices would also be distributed, potentially to other search servers than the ones that received delete lists, and added to the "slice DB" as active slices.
It would also make sense, after a single slice reaches a certain threshold of deleted entries, to migrate all the still-valid data from that slice to new slices and invalidate the whole slice. This would be roughly equivalent to a distributed version of SegmentMergeTool.
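A trivial sketch of that retirement policy (the threshold value would be arbitrary, e.g. 1/3):

import java.util.*;

class SliceCompactor {
  record SliceStats(int sliceId, int totalEntries, int deletedEntries) {}

  // Slices whose deleted fraction exceeds the threshold get retired: their still-valid
  // entries are folded into the next batch of new slices, and the old slice is invalidated.
  static List<Integer> slicesToRetire(Collection<SliceStats> slices, double threshold) {
    List<Integer> retire = new ArrayList<>();
    for (SliceStats s : slices)
      if (s.totalEntries() > 0 && (double) s.deletedEntries() / s.totalEntries() > threshold)
        retire.add(s.sliceId());
    return retire;
  }
}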
I think this should solve the problem of local duplicates on search servers. Additional measures are still needed to ensure that many replicas of each slice can co-exist in the cluster while still getting valid results from search servers (without duplicate hits, this time caused by using multiple slice replicas). See below for my comments on that.
4. now I guess there is some work to do on the search server. The newly received slice needs to be indexed and de-duplicated with the already existing older slices on the server. It would be nice to have some method to do this across the whole cluster of search servers before the slices are sent to search servers, but if not, the global de-duplication must take place in step 2.
Skip that. I think the scenario described above should solve this issue.
How would this work?
(1) Have all the search servers send the relevant info (URL, content md5, etc.) to a single location (NDFS?).
Let's forget the NDFS for a moment - it is useful just for bulk data storage, not for operations related to indexing and searching.
For now I consider a scenario in which the smallest block (shard, slice or whatever...) of data is not an opaque blob, but is a usable part of fetched segment data of pre-defined size.
(2) Find duplicates and which to delete on one server/process.
(3) Then tell the search servers what to delete from their indexes.
I'm just throwing this out, and I'm not sure about the performance/scalability of it or even if there is a worthwhile advantage to doing it this way.
It's relatively easy to remove individual documents from a search index (their content will stay around, but they won't be returned in search hits). However, after many delete operations a Lucene index becomes less efficient and needs to be optimized, which is a resource-intensive operation - but still less intensive than re-creating the index.
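For illustration, applying a delete list on a search node could look roughly like this (recent Lucene API used purely as a sketch - older Lucene spells these operations differently, with deletion on IndexReader and an explicit optimize() call - and the "url" field name is my assumption):

import java.io.IOException;
import java.nio.file.Path;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;

class DeleteListApplier {
  // Mark every document on the delete list as deleted; this is the cheap part.
  static void apply(Path indexDir, Iterable<String> urlsToDelete) throws IOException {
    try (IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir),
                                              new IndexWriterConfig(new StandardAnalyzer()))) {
      for (String url : urlsToDelete)
        writer.deleteDocuments(new Term("url", url));
      // Reclaiming the space requires merging segments (the expensive "optimize"-style
      // step, writer.forceMerge(1) in recent Lucene) and can be done far less often.
    }
  }
}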
Again, see the above paragraphs for a possible solution.
It sounds like you're implying that there would only be one copy of a slice active at a given time. Is that correct? This makes sense in a situation where, say, we have 3 search servers and each server can only reasonably handle 1/3 of the total indexed data. However, to increase search throughput someone might want to have 9 servers that can each handle 1/3 of the total indexed data, so it'd be good to have the search master do load balancing of the searches. Adding this ability
Yes, you are right - I overlooked this in the original post...
though means we can't just broadcast the searches to every search server. We need to make sure we don't search multiple active copies of the same "shard" on different search servers. Perhaps the solution would be to have the master server keep track of every "shard", which servers it is on, and whether it is active on each server. Then when searching, the master sends (or tells clients to send) the searches to a subset of search servers, but also tells each search server which "shard(s)" that server should return results for. Maybe this is already what you were planning; if so, ignore this. :-)
No, I didn't plan this yet, but I think your idea would play nicely with the "slice DB" concept.
Let's see: when you deploy the slices to search servers, you could also deploy the current list of active slices to the search frontend (currently, the DistributedSearch$Client instance). I think we would have to drop the current scheme of using a static text file to configure the list of available search servers - perhaps some solution based on multicast polling would be better, to discover the current set of active search servers.
The search frontend would periodically issue multicast heartbeat messages to check which search servers are alive at the moment, and each server would send in response the following information:
<unicastIP, currentLoad, spaceAvail, sliceId[]>
where unicastIP is the unicast IP address to use when contacting the server, currentLoad expresses its current load, spaceAvail is its available disk space, and sliceId[] is the array of active slices handled by this search server. Based on this information, the search frontend would keep a constantly updated list of which servers it should query, and which sliceId(s) each of these servers should use when answering a query.
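A rough sketch of the frontend side of this (all names are made up, and the "pick the least loaded holder of each slice" policy is just one option):

import java.util.*;

class SearchRouter {
  // One heartbeat reply, mirroring <unicastIP, currentLoad, spaceAvail, sliceId[]>.
  record Heartbeat(String unicastIP, double currentLoad, long spaceAvail, int[] sliceIds) {}

  // sliceId -> servers currently advertising that slice; rebuilt after every heartbeat round.
  private final Map<Integer, List<Heartbeat>> sliceToServers = new HashMap<>();

  void update(List<Heartbeat> replies) {
    sliceToServers.clear();
    for (Heartbeat hb : replies)
      for (int sliceId : hb.sliceIds())
        sliceToServers.computeIfAbsent(sliceId, k -> new ArrayList<>()).add(hb);
  }

  // For one query: per active slice, pick the least loaded holder, and record exactly
  // which slices each chosen server should search (server address -> sliceIds).
  Map<String, Set<Integer>> plan(Set<Integer> activeSlices) {
    Map<String, Set<Integer>> assignment = new HashMap<>();
    for (int sliceId : activeSlices)
      sliceToServers.getOrDefault(sliceId, List.of()).stream()
          .min(Comparator.comparingDouble(Heartbeat::currentLoad))
          .ifPresent(hb -> assignment
              .computeIfAbsent(hb.unicastIP(), k -> new HashSet<>()).add(sliceId));
    return assignment;
  }
}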
If a search server doesn't respond to a heartbeat, the search frontend should assume it's dead and initiate an appropriate replication procedure, based on the space available on other nodes and their current load. The affected slices should be replicated from other nodes that contain them (or from the "slice DB" server). This replication should probably take place out-of-band, i.e. directly between the search nodes and/or the slice DB. A special case of this procedure is when a new (empty) search node appears on the net - then the search frontend will initiate replication of multiple slices from other search nodes, to ensure appropriate replication and load balancing (e.g. by replicating slices from the most loaded servers).
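Something along these lines (again just a sketch; the selection criteria - most free space, then lowest load - are only an example, and "sliceDB" as the fallback source is hypothetical):

import java.util.*;

class ReplicationPlanner {
  record Node(String ip, double load, long spaceAvail, Set<Integer> slices) {}

  // For every slice held by the dead node, pick a source among the surviving holders
  // (falling back to the slice DB server) and a target with enough free space,
  // preferring the node with the most space and, on ties, the lowest load.
  static List<String> planFor(Node dead, List<Node> alive, long sliceSize) {
    List<String> plan = new ArrayList<>();
    for (int sliceId : dead.slices()) {
      String from = alive.stream()
          .filter(n -> n.slices().contains(sliceId))
          .map(Node::ip).findAny().orElse("sliceDB");
      alive.stream()
          .filter(n -> !n.slices().contains(sliceId) && n.spaceAvail() >= sliceSize)
          .max(Comparator.comparingLong(Node::spaceAvail)
                         .thenComparing(Comparator.comparingDouble(Node::load).reversed()))
          .ifPresent(target ->
              plan.add("copy slice " + sliceId + " from " + from + " to " + target.ip()));
    }
    return plan;
  }
}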
This is my revised proposal for now - let's discuss whether it fits the bill...
Thanks again for excellent comments!
-- Best regards, Andrzej Bialecki
------------------------------------------------- Software Architect, System Integration Specialist CEN/ISSS EC Workshop, ECIMF project chair EU FP6 E-Commerce Expert/Evaluator ------------------------------------------------- FreeBSD developer (http://www.freebsd.org)
