mapSearcher (was: Re: Index update and Google Dance)

2005-11-11 Thread Stefan Groschupf

Hi Doug,
In the future I would like to implement a more automated  
distributed search system than Nutch currently has.  One way to do  
this might be to use MapReduce.  Each map task's input could be an  
index and some segment data.  The map method would serve queries,  
i.e., run a Nutch DistributedSearch.Server.  It would first copy  
the index out of NDFS to the local disk, for better performance.


I have two questions regarding this mechanism.
First, how do you plan to make the running search servers known to the  
master (search client)? I can imagine a mechanism similar to the one the  
tasktracker and jobtracker use, a kind of heartbeat message.
Second, wouldn't there also be a possibility to solve NUTCH-92  
(DistributedSearch incorrectly scores results) by first running a  
MapReduce task over the indexes that counts terms, and then holding this  
somehow in the memory of the master (search client)? But I'm not  
sure whether that might be too much data.
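The heartbeat idea above could be sketched roughly as follows. This is a hypothetical illustration, not existing Nutch code; the class name, timeout, and method signatures are all invented:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical registry the master (search client) could keep: each search
// server sends a periodic heartbeat, and servers that miss the timeout
// window are considered dead, much like tasktrackers and the jobtracker.
public class SearchServerRegistry {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();

    public SearchServerRegistry(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called (e.g. via IPC) whenever a server reports in.
    public void heartbeat(String serverAddress, long nowMillis) {
        lastSeen.put(serverAddress, nowMillis);
    }

    // Servers whose last heartbeat is within the timeout window.
    public List<String> liveServers(long nowMillis) {
        List<String> live = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMillis - e.getValue() <= timeoutMillis) {
                live.add(e.getKey());
            }
        }
        return live;
    }
}
```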


Stefan




Re: Index update and Google Dance

2005-11-09 Thread Andrzej Bialecki

Jack Tang wrote:


Hi Andrzej

In document, Michael said:
I'd strongly recommend using the system with a replication rate of 3
copies, 2 minimum. Desired replication can be set in nutch config file
using ndfs.replication property, and MIN_REPLICATION constant is
located in ndfs/FSNamesystem.java (and set to 1 by default).
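For reference, the property named above would be set in the Nutch configuration file like this (the value is only an example, following the recommendation quoted above):

```xml
<!-- nutch-site.xml: desired NDFS replication; the value 3 follows the
     recommendation quoted above and is only an example -->
<property>
  <name>ndfs.replication</name>
  <value>3</value>
</property>
```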
 



This is pertinent only to the parts of the system that use NDFS. The 
distributed search part of Nutch does not have to use NDFS; in fact, 
using local storage on multiple machines gives performance benefits... although it 
creates a maintenance problem...


The difference between how GFS and NDFS work is that Google FS chunks 
play the role of Nutch segments (these chunks are fully usable fragments 
of the index) - however, NDFS does NOT work on such high level: it does 
not replicate segments, only opaque data blocks, which are not directly 
usable by high-level apps (like segment reader or index reader) and need 
to be wrapped in a facade. NDFS is ideal for sequential access, but much 
less so for random access - and Lucene indexes require efficient random 
access.


You can think of NDFS as a very simple distributed filesystem like any 
other out there (e.g. Coda) - currently it doesn't have this high-level 
semantics of GFS (yet).



Say I own 1 master machine and 6 data nodes, and I set
ndfs.replication to 3. After crawling, what is the distribution of
chunks (suppose only 2 chunks exist)? Please correct me if I am wrong.

DataNode  Chunk No.
A  1#
B  1#
C  1#

D  2#
E  2#
F  2#

If I own 5 datanodes, it should be

DataNode   Chunk No.
A  1#
B  1#
C  1#

D  2#
E  2#

I mean B and C are complete mirrors of A. Is that true in NDFS?
 



This is probably true for NDFS - master is an NDFS name node, and data 
nodes are NDFS data nodes - with one important correction: the 
replication is NOT on the high-level of chunks (segments), but on the 
low level of data blocks. This means that high-level structures like 
segments are NOT replicated - what you end up with is that they are 
backed by a replicated storage.
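The distinction can be made concrete with a toy model. Everything here is invented for illustration (the round-robin placement in particular is not how NDFS actually places blocks); the point is only that replication happens per block, not per segment:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of block-level replication: a file is split into fixed-size
// blocks, and each block (not the whole file/segment) is placed on R
// distinct nodes. Placement here is round-robin purely for illustration;
// real NDFS placement differs.
public class BlockPlacement {
    public static int blockCount(long fileSize, long blockSize) {
        return (int) ((fileSize + blockSize - 1) / blockSize); // ceiling division
    }

    // For block b, replica i goes to node (b + i) % numNodes.
    public static List<List<Integer>> place(int blocks, int replication, int numNodes) {
        List<List<Integer>> placement = new ArrayList<>();
        for (int b = 0; b < blocks; b++) {
            List<Integer> nodes = new ArrayList<>();
            for (int i = 0; i < replication; i++) {
                nodes.add((b + i) % numNodes);
            }
            placement.add(nodes);
        }
        return placement;
    }
}
```

Note how two blocks of the same segment can end up on overlapping but different node sets, which is why the segment as a whole is not mirrored anywhere, only backed by replicated blocks.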



Below is the Google architecture in my head:

           DataNode A
Master     DataNode B     GoogleCrawler
           DataNode C
           ..

GoogleCrawler keeps running all the time. One day, it gets a fetchlist
from DataNode A, crawls all the pages and indexes them, then it tells the
Master it wants to update DataNode A's index; finally it acquires the
read lock and write lock, and the index is updated. The same operation
is applied to DataNodes B and C.
 



That's not how it works in Nutch. In Nutch you simply deploy new 
segments to search servers, and delete the old ones, and the individual 
search servers periodically check the list of available segments to 
update their internal lists.
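That periodic check could look roughly like this sketch (the class and method names are invented; the real Nutch code is organized differently): a server diffs the list of segments published for searching against what it currently serves:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical refresh step for a search server: compare the published
// segment list with the segments currently served, open the new ones and
// close the removed ones. Deploying a new index then reduces to publishing
// new segments and deleting old ones.
public class SegmentRefresher {
    private final Set<String> serving = new HashSet<>();

    public Set<String> currentlyServing() {
        return new HashSet<>(serving);
    }

    // Returns the segments that were newly opened by this refresh.
    public Set<String> refresh(Set<String> available) {
        Set<String> added = new HashSet<>(available);
        added.removeAll(serving);           // segments to open
        Set<String> removed = new HashSet<>(serving);
        removed.removeAll(available);       // segments to close
        serving.removeAll(removed);
        serving.addAll(added);
        return added;
    }
}
```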


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Index update and Google Dance

2005-11-09 Thread Stefan Groschupf



and three copies of chunks are distributed on the slaves. If slave 1
is 90% busy, and 2 is 80% busy, 3 is idle. How does NDFS handle this
case?

Currently you have to do that manually, but there will be an  
automated solution later.



Or could you tell me where should I start learning?

The Nutch wiki and maybe this URL can help:
http://wiki.media-style.com/display/nutchDocu/Home

Stefan



Re: Index update and Google Dance

2005-11-09 Thread Jack Tang
Thanks for your explanation, Andrzej.
I am going to read some of the NDFS source code and ask smarter questions later.
Thanks again.

Regards
/Jack






--
Keep Discovering ... ...
http://www.jroller.com/page/jmars


Re: Index update and Google Dance

2005-11-09 Thread Doug Cutting

Jack Tang wrote:

Below is the Google architecture in my head:

           DataNode A
Master     DataNode B     GoogleCrawler
           DataNode C
           ..

GoogleCrawler keeps running all the time. One day, it gets a fetchlist
from DataNode A, crawls all the pages and indexes them, then it tells the
Master it wants to update DataNode A's index; finally it acquires the
read lock and write lock, and the index is updated. The same operation
is applied to DataNodes B and C.


Do you have evidence that this is how Google updates their index?  I've 
never seen much published about that.


In the future I would like to implement a more automated distributed 
search system than Nutch currently has.  One way to do this might be to 
use MapReduce.  Each map task's input could be an index and some segment 
data.  The map method would serve queries, i.e., run a Nutch 
DistributedSearch.Server.  It would first copy the index out of NDFS to 
the local disk, for better performance.  It would never exit normally, 
but rather map forever.  When a new version of the index (new set of 
segments, new boosts, and/or new deletions, etc.) is ready to deploy, 
then a new job could be submitted.  If the number of map tasks (i.e., 
indexes) is kept equal or less than the number of nodes, and each node 
is permitted to run two or more tasks, then two versions of the index 
can be served at once.  Once the new version has been deployed 
(listening for searches on different ports), and search front-ends are 
using it, then the old version can be stopped by killing its MapReduce 
job.  If a node dies, the MapReduce job tracker would automatically 
re-start its task on another node.
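The side-by-side serving this describes can be sketched with some invented port arithmetic (nothing like this exists in Nutch; the scheme and names are purely illustrative): each index version gets its own listening port, so two consecutive versions of the same index can be served at once until the old job is killed.

```java
// Invented illustration of serving two index versions side by side:
// version v of index i listens on basePort + 2*i + (v % 2), so two
// consecutive versions of the same index never collide, and a version's
// port is freed for reuse once its MapReduce job is killed.
public class VersionedPorts {
    private final int basePort;

    public VersionedPorts(int basePort) {
        this.basePort = basePort;
    }

    public int portFor(int indexId, int version) {
        return basePort + 2 * indexId + (version % 2);
    }
}
```

With this scheme the search front-ends only need to know the index id and the current version number to find the right server port.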


If there were an affinity method between tasks and task trackers, then 
attempts could be made to re-deploy new versions of indexes whose, e.g., 
only boosts or deletions have changed, to the same nodes as before. 
Then the copy of the index to the local disk could be incremental, only 
copying the parts of the index/segment that have changed.
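The incremental copy could work roughly like this sketch. The checksum comparison is an assumption on my part; since Lucene index files are mostly write-once, unchanged files keep stable checksums, so only new or changed files need copying:

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Sketch of an incremental index copy: given filename -> checksum maps for
// the remote (new) index version and the local (old) copy, select only the
// files that are new or whose checksum changed. "Checksum" here stands in
// for whatever integrity metadata the filesystem provides.
public class IncrementalCopy {
    public static Set<String> filesToCopy(Map<String, String> remote,
                                          Map<String, String> local) {
        Set<String> toCopy = new TreeSet<>();
        for (Map.Entry<String, String> e : remote.entrySet()) {
            String have = local.get(e.getKey());
            if (have == null || !have.equals(e.getValue())) {
                toCopy.add(e.getKey());
            }
        }
        return toCopy;
    }
}
```

For a deployment where only boosts or deletions changed, this would typically select just a few small files instead of the whole index.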


Doug


Re: Index update and Google Dance

2005-11-09 Thread Jack Tang
Hi Doug

On 11/10/05, Doug Cutting [EMAIL PROTECTED] wrote:
 Do you have evidence that this is how Google updates their index?  I've
 never seen much published about that.
No. Google engineers came to my university for recruiting, and I
asked them how Google updates its index. One of them told me that,
according to the Google FS architecture, the chunks are locked first,
and of course the Master knows this, so it will not let any search
query refer to the chunks being updated.
I read the GFS document and tried to give my own explanation of how
Google updates its index.

Regards
/Jack




--
Keep Discovering ... ...
http://www.jroller.com/page/jmars


Re: Index update and Google Dance

2005-11-08 Thread Stefan Groschupf
Nutch uses the concept of segments, and yes, you are able to update  
part of the index by just deleting older segments and generating /  
fetching new segments.

Stefan

Am 08.11.2005 um 18:38 schrieb Jack Tang:


Hi

I read the GFS document and the NDFS document on the wiki. One interesting
question here: does NDFS support updating the index on the fly?

As you know, Google updates its index via the Google dance. It is said
that the replicator in GFS places three copies of chunks on different
datanodes. During index updating, two of the chunks will be locked and
updated first; finally the remaining one is updated. I do not know whether
this is true or not. However, from the NDFS
document (http://wiki.apache.org/nutch/NutchDistributedFileSystem) I
learned:

   1. Files can only be written once. After the first write, they
become read-only. (Although they can be deleted.)
   2. Files are stream-oriented; you can only append bytes, and you
can only read/seek forward.
   3. There are no user permissions or quotas, although these could be
added fairly easily.

There are no write-and-lock I/O semantics, so we cannot update the
index dynamically, right?

Regards
/Jack

P.S.
what is the google dance: http://www.metamend.com/google-dance.html
Google Dance - The Index Update of the Google Search Engine :
http://dance.efactory.de/
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars



---
company:http://www.media-style.com
forum:http://www.text-mining.org
blog:http://www.find23.net




Re: Index update and Google Dance

2005-11-08 Thread Andrzej Bialecki

Jack Tang wrote:


Hi Stefan

Deleting is totally OK if there are NO references to the chunks (segments).
Also, will the master balance the search requests? Say there are 3
slaves: Slave 1, 2, 3,
and three copies of chunks are distributed on the slaves. If slave 1
is 90% busy, slave 2 is 80% busy, and slave 3 is idle, how does NDFS
handle this case?
 



NDFS has almost nothing to do with this case, because the distributed 
search uses a separate distributed protocol (DistributedSearch$Server 
and DistributedSearch$Client), which can use both local segment data and 
segment data on NDFS.


Unfortunately, there is no mechanism (yet) in this IPC to do load 
balancing. What's more important, the current architecture assumes that 
there is exactly one copy of each unique document in the segments 
deployed for searching (so there is no redundancy on this level), and 
the process to manage this is largely manual... :-( This is clearly an 
area to be further developed.


In my opinion the master (which is the DistributedSearch$Client) 
should have the knowledge of which segments are deployed on which 
servers, and also it should know what is a complete set of segments 
comprising the total index. If you combine this with the information 
about the current load per server, the master could dispatch partial 
queries to be run against a subset of segments on each server, depending 
on its local load, in such a way that partial runs should globally 
involve the complete set of segments. In this scenario it would be 
possible to deploy replicas of segments across the set of DS$Server-s 
without getting duplicate results.
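That dispatch idea could be sketched as follows. Everything here is hypothetical (no such scheduler exists in Nutch): for each segment, the master picks the least-loaded server holding a replica, so the partial runs cover the complete set of segments with each segment searched exactly once:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical query dispatch for the master (DistributedSearch$Client):
// segmentLocations maps each segment to the servers holding a replica of
// it; load maps each server to its current load (0.0 = idle). Each segment
// is assigned to exactly one server, the least-loaded replica holder, so
// the whole index is covered with no duplicate results.
public class LoadAwareDispatcher {
    public static Map<String, String> dispatch(Map<String, List<String>> segmentLocations,
                                               Map<String, Double> load) {
        Map<String, String> assignment = new HashMap<>();
        for (Map.Entry<String, List<String>> e : segmentLocations.entrySet()) {
            String best = null;
            for (String server : e.getValue()) {
                if (best == null || load.get(server) < load.get(best)) {
                    best = server;
                }
            }
            assignment.put(e.getKey(), best);
        }
        return assignment;
    }
}
```

A real implementation would also need to account for the load each new assignment adds, but even this greedy pass shows how replicas could be deployed across the DS$Server-s without producing duplicate hits.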


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com