mapSearcher (was Re: Index update and Google Dance)
Hi Doug,

In the future I would like to implement a more automated distributed search system than Nutch currently has. One way to do this might be to use MapReduce. Each map task's input could be an index and some segment data. The map method would serve queries, i.e., run a Nutch DistributedSearch.Server. It would first copy the index out of NDFS to the local disk, for better performance.

I have two questions regarding this mechanism. First, how do you plan to make the running search servers known to the master (search client)? I can imagine a mechanism similar to the one the tasktracker and jobtracker use, a kind of heartbeat message. Second, wouldn't there also be a possibility to solve NUTCH-92 (DistributedSearch incorrectly scores results) by first running a MapReduce job over the indexes that counts terms, and then holding this somehow in the memory of the master (search client)? But I'm not sure, that may be too much data.

Stefan
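The scoring fix Stefan hints at amounts to summing per-server document frequencies so the client can compute one corpus-wide idf instead of each server using its local one. A minimal sketch of that aggregation (the function names and the stats format are hypothetical, not Nutch API):

```python
import math

def merge_doc_freqs(per_server_stats):
    """Combine per-server (doc_count, {term: df}) pairs, as each search
    server might report them, into corpus-wide totals held by the client."""
    total_docs = 0
    global_df = {}
    for doc_count, term_dfs in per_server_stats:
        total_docs += doc_count
        for term, df in term_dfs.items():
            global_df[term] = global_df.get(term, 0) + df
    return total_docs, global_df

def global_idf(term, total_docs, global_df):
    # Lucene-style idf: 1 + ln(numDocs / (docFreq + 1)), now computed
    # from the merged, corpus-wide counts rather than one server's index.
    return 1.0 + math.log(total_docs / (global_df.get(term, 0) + 1))
```

Whether the merged table fits in the master's memory is exactly Stefan's open question; for large vocabularies it may not.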
Re: Index update and Google Dance
Jack Tang wrote:

Hi Andrzej

In the document, Michael said: "I'd strongly recommend using the system with a replication rate of 3 copies, 2 minimum." Desired replication can be set in the Nutch config file using the ndfs.replication property, and the MIN_REPLICATION constant is located in ndfs/FSNamesystem.java (and set to 1 by default).

This is pertinent only to the parts of the system that use NDFS. The distributed search part of Nutch does not have to use NDFS; in fact, using multiple local storages gives performance benefits... although it creates a maintenance problem...

The difference between how GFS and NDFS work is that Google FS chunks play the role of Nutch segments (these chunks are fully usable fragments of the index). However, NDFS does NOT work on such a high level: it does not replicate segments, only opaque data blocks, which are not directly usable by high-level apps (like a segment reader or index reader) and need to be wrapped in a facade. NDFS is ideal for sequential access, but much less so for random access - and Lucene indexes require efficient random access. You can think of NDFS as a very simple distributed filesystem like any other out there (e.g. Coda) - currently it doesn't have the high-level semantics of GFS (yet).

Say I own 1 master machine and 6 data nodes, and I set ndfs.replication to 3. After crawling, what is the distribution of chunks (suppose only 2 chunks exist)? Please correct me if I am wrong:

  DataNode   Chunk No.
  A          1#
  B          1#
  C          1#
  D          2#
  E          2#
  F          2#

If I own 5 datanodes, it should be:

  DataNode   Chunk No.
  A          1#
  B          1#
  C          1#
  D          2#
  E          2#

I mean B and C are total mirrors of A. Is it true in NDFS?

This is probably true for NDFS - the master is an NDFS name node, and the data nodes are NDFS data nodes - with one important correction: the replication is NOT on the high level of chunks (segments), but on the low level of data blocks. This means that high-level structures like segments are NOT replicated - what you end up with is that they are backed by replicated storage.
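Andrzej's correction can be made concrete with a toy placement routine: replication happens per opaque block, with each block copied to `replication` distinct datanodes. This is only an illustration of the idea (round-robin placement is an assumption; real NDFS placement logic is different):

```python
def place_blocks(blocks, datanodes, replication=3):
    """Toy round-robin placement: each block gets `replication` distinct
    datanodes. Illustrates that replication is per data block, not per
    segment; actual NDFS placement decisions differ."""
    assert replication <= len(datanodes)
    placement, i = {}, 0
    for block in blocks:
        placement[block] = [datanodes[(i + r) % len(datanodes)]
                            for r in range(replication)]
        i += replication
    return placement
```

With 6 datanodes and 2 blocks this reproduces Jack's first table exactly; with only 5 datanodes, note that the third replica of block 2# would have to wrap around onto A, rather than being dropped as Jack's second table suggests.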
Below is the Google architecture in my brain:

                   DataNode A
    Master         DataNode B
    GoogleCrawler  DataNode C ...

The GoogleCrawler is kept running all the time. One day, it gets a fetchlist from DataNode A, crawls all the pages and indexes them, then it tells the Master it wants to update DataNode A's index; finally it acquires a read lock and a write lock, and the index is updated. The same operations are applied to DataNodes B and C.

That's not how it works in Nutch. In Nutch you simply deploy new segments to the search servers and delete the old ones, and the individual search servers periodically check the list of available segments to update their internal lists.

--
Best regards,
Andrzej Bialecki

 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|   Information Retrieval, Semantic Web
___|||__||  \|  || |    Embedded Unix, System Integration
http://www.sigram.com   Contact: info at sigram dot com
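The periodic check Andrzej describes is essentially a set diff between what a search server currently serves and what is deployed. A sketch (the function name and segment names are made up for illustration):

```python
def refresh_segments(serving, deployed):
    """Diff the segments a search server currently serves against the
    deployed set. Returns (to_open, to_close): segments newly deployed
    that should be opened, and segments removed that should be dropped."""
    serving, deployed = set(serving), set(deployed)
    return sorted(deployed - serving), sorted(serving - deployed)
```

Each server would run this on a timer against the deployment directory listing, so new segments start serving and deleted ones are retired without any central locking.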
Re: Index update and Google Dance
...and three copies of chunks are distributed on the slaves. If slave 1 is 90% busy, slave 2 is 80% busy, and slave 3 is idle, how does NDFS behave in this case?

Actually you have to do that manually, but there will be an automatic solution later.

Or could you tell me where I should start learning?

The Nutch wiki, and maybe this URL, can help: http://wiki.media-style.com/display/nutchDocu/Home

Stefan
Re: Index update and Google Dance
Thanks for your explanation, Andrzej. I am going to read some of the NDFS source code and ask smarter questions later. Thanks again.

Regards
/Jack

On 11/9/05, Andrzej Bialecki [EMAIL PROTECTED] wrote:
[...]

--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
Re: Index update and Google Dance
Jack Tang wrote:
Below is the Google architecture in my brain: [...] The GoogleCrawler is kept running all the time. One day, it gets a fetchlist from DataNode A, crawls all pages and indexes them, then it tells the Master it wants to update DataNode A's index; finally it acquires a read lock and a write lock, and the index is updated. The same operations are applied to DataNodes B and C.

Do you have evidence that this is how Google updates their index? I've never seen much published about that.

In the future I would like to implement a more automated distributed search system than Nutch currently has. One way to do this might be to use MapReduce. Each map task's input could be an index and some segment data. The map method would serve queries, i.e., run a Nutch DistributedSearch.Server. It would first copy the index out of NDFS to the local disk, for better performance. It would never exit normally, but rather map forever.

When a new version of the index (new set of segments, new boosts, and/or new deletions, etc.) is ready to deploy, a new job could be submitted. If the number of map tasks (i.e., indexes) is kept equal to or less than the number of nodes, and each node is permitted to run two or more tasks, then two versions of the index can be served at once. Once the new version has been deployed (listening for searches on different ports), and the search front-ends are using it, the old version can be stopped by killing its MapReduce job.

If a node dies, the MapReduce job tracker would automatically restart its task on another node. If there were an affinity method between tasks and task trackers, then attempts could be made to re-deploy new versions of indexes in which, e.g., only boosts or deletions have changed to the same nodes as before. Then the copy of the index to the local disk could be incremental, only copying the parts of the index/segments that have changed.

Doug
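Doug's two-versions-at-once scheme can be modeled as a tiny toy: each index version is a long-running job whose tasks listen on their own ports, the new version is deployed alongside the old, and the old job is killed after the front-ends switch over. Everything here (class name, port scheme) is an illustrative assumption, not Nutch or MapReduce API:

```python
class ServingCluster:
    """Toy model of the scheme above: a 'job' per index version, whose
    map tasks serve searches forever; versions coexist on different
    ports until the old job is killed."""
    def __init__(self):
        self.jobs = {}  # version -> {index_id: port}

    def deploy(self, version, index_ids, base_port):
        # Submit a new job: one serving task per index, each on its own port.
        self.jobs[version] = {idx: base_port + n
                              for n, idx in enumerate(index_ids)}

    def kill(self, version):
        # Killing the job stops all of that version's serving tasks.
        self.jobs.pop(version, None)

    def live_ports(self):
        return sorted(port for tasks in self.jobs.values()
                      for port in tasks.values())
```

The capacity condition Doug states falls out of this model: serving both versions at once requires tasks-per-index <= nodes and at least two task slots per node.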
Re: Index update and Google Dance
Hi Doug

On 11/10/05, Doug Cutting [EMAIL PROTECTED] wrote:
Do you have evidence that this is how Google updates their index? I've never seen much published about that.

No. Google engineers came to my university for recruitment, and I asked them how Google updates its index. One of them told me that, according to the Google FS architecture, the chunks will be locked first, and of course the Master knows it, so it would not let any search query refer to the chunks being updated. I read the GFS document and tried to give my explanation of Google index updating.

Regards
/Jack

[...]

--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
Re: Index update and Google Dance
Nutch uses the concept of segments, and yes, you are able to update part of the index by just deleting older segments and generating/fetching new segments.

Stefan

On 08.11.2005, at 18:38, Jack Tang wrote:

Hi

I read the GFS document and the NDFS document on the wiki. One interesting question here: does NDFS support updating the index on the fly? As you know, Google updates its index via the Google Dance. It is said that the replicator in GFS places three copies of chunks on different datanodes. During an index update, two chunks will be locked and updated first; finally the remaining one is updated. I do not know whether this is true or not. However, from the NDFS document (http://wiki.apache.org/nutch/NutchDistributedFileSystem) I learned:

1. Files can only be written once. After the first write, they become read-only. (Although they can be deleted.)
2. Files are stream-oriented; you can only append bytes, and you can only read/seek forward.
3. There are no user permissions or quotas, although these could be added fairly easily.

There are no write and lock I/O semantics, so we cannot update the index dynamically, right?

Regards
/Jack

P.S. What is the Google Dance: http://www.metamend.com/google-dance.html
Google Dance - The Index Update of the Google Search Engine: http://dance.efactory.de/

--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

---
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net
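The write-once, append-only semantics Jack quotes from the NDFS document can be captured in a few lines, which also shows why "updating" an index under NDFS means writing a whole new segment and deleting the old one (the class is an illustrative model, not NDFS code):

```python
class WriteOnceFile:
    """Models the quoted NDFS semantics: append-only while being
    written, read-only once closed; no in-place updates or locks."""
    def __init__(self):
        self._data = bytearray()
        self._closed = False

    def append(self, chunk: bytes):
        if self._closed:
            # After the first write completes, the file is immutable.
            raise IOError("NDFS files are read-only after the first write")
        self._data.extend(chunk)

    def close(self):
        self._closed = True

    def read(self) -> bytes:
        return bytes(self._data)
```

So the answer to Jack's question is yes: with no in-place write or lock semantics, the unit of update is a whole file (segment), exactly the deploy-new/delete-old cycle Stefan describes.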
Re: Index update and Google Dance
Jack Tang wrote:

Hi Stefan

Deleting is totally OK if there are NO references to the chunks (segments). Also, will the master balance the search requests? Say there are 3 slaves (slave 1, 2, 3) and three copies of chunks are distributed on the slaves. If slave 1 is 90% busy, slave 2 is 80% busy, and slave 3 is idle, how does NDFS behave in this case?

NDFS has almost nothing to do with this case, because the distributed search uses a separate distributed protocol (DistributedSearch$Server and DistributedSearch$Client), which can use both local segment data and segment data on NDFS. Unfortunately, there is no mechanism (yet) in this IPC to do load balancing. What's more important, the current architecture assumes that there is exactly one copy of each unique document in the segments deployed for searching (so there is no redundancy on this level), and the process to manage this is largely manual... :-(

This is clearly an area to be developed further. In my opinion the master (which is the DistributedSearch$Client) should have the knowledge of which segments are deployed on which servers, and it should also know what constitutes a complete set of segments comprising the total index. If you combine this with information about the current load per server, the master could dispatch partial queries to be run against a subset of segments on each server, depending on its local load, in such a way that the partial runs globally cover the complete set of segments. In this scenario it would be possible to deploy replicas of segments across the set of DS$Server-s without getting duplicate results.

--
Best regards,
Andrzej Bialecki
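Andrzej's proposed dispatch policy reduces to: for each segment, pick exactly one replica-holding server, preferring the least loaded one, so every unique document is searched exactly once and replicas never yield duplicate results. A sketch under those assumptions (no such mechanism exists in the current IPC, as noted above; names are hypothetical):

```python
def plan_query(segment_replicas, load):
    """Given {segment: [servers holding a replica]} and {server: load},
    pick the least-loaded replica holder for each segment. The plan
    covers the complete segment set exactly once, so deploying replicas
    across servers cannot produce duplicate results."""
    return {seg: min(servers, key=lambda s: load[s])
            for seg, servers in segment_replicas.items()}
```

In Jack's example (slave 1 at 90%, slave 2 at 80%, slave 3 idle), every segment with a replica on slave 3 would be dispatched there.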