Re: scalability limits getDetails, mapFile Readers?

2006-03-02 Thread Byron Miller
I would like to see something organized as active, in-process, and
inbound. Active data is live and on the query servers (both indexes
and their correlating segments); in-process covers the tasks currently
being mapped out; and inbound is the processes/data still pending
processing.

Active nodes report as being in the search pool. In-process nodes are
really data nodes doing all of the number crunching/import/merging/
indexing, and inbound is everything in fetch/pre-processing.

The cycle would be a pull cycle: active nodes pull from the
corresponding data nodes, which in turn pull from the corresponding
inbound nodes. Events/batches could trigger the pull so that each pull
hands over a complete or usable data set. Some lightweight workflow
engine could let you process/manage the cycle of data.

I would also like to see it DFS block aware, able to process the data
on the active data server where that data resides (as much as
possible). Such a file table could be used to associate the data
through the entire process stream and allow for fairly linear growth.

Such a system could also be aware of its own capacity: inbound
processes would fail/halt if the disk space on the DFS system isn't
capable of handling new tasks, and vice versa, if the active nodes are
at capacity, tasks could be told to stop/hold. You could use this
logic to add more nodes where necessary, resume processing, and chart
your growth.
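
Very roughly, the kind of cycle and capacity check I have in mind
looks like the sketch below; every class and method name in it is made
up purely to illustrate the workflow and is not an existing Nutch or
Hadoop API.

// Hypothetical sketch of a capacity-aware pull cycle; none of these
// classes exist in Nutch/Hadoop, they only illustrate the idea above.
public class PullCycle {

  interface CapacityReport {          // reported by each tier via heartbeat
    long freeDfsBytes();              // free space on the data (in-process) tier
    boolean activePoolHasCapacity();  // can the query servers take a new index?
  }

  interface Batch { }                 // a complete/usable unit of data

  interface Tier {
    Batch pullNextBatch();            // returns null when nothing is ready
    void accept(Batch b);
  }

  private static final long MIN_FREE = 50L * 1024 * 1024 * 1024; // arbitrary

  /** One turn of the cycle: active pulls from data, data pulls from inbound. */
  void runOnce(Tier inbound, Tier data, Tier active, CapacityReport report) {
    if (report.freeDfsBytes() > MIN_FREE) {
      Batch b = inbound.pullNextBatch();       // fetch/pre-processing output
      if (b != null) data.accept(b);           // merge/index on the data nodes
    }                                          // else: hold inbound tasks
    if (report.activePoolHasCapacity()) {
      Batch ready = data.pullNextBatch();      // a finished index + segments
      if (ready != null) active.accept(ready); // deploy to the search pool
    }                                          // else: hold, add nodes, resume
  }
}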

I come from the ERP/Oracle world, so I have very much learned to
appreciate distributed architecture and the concept of an aware
system: concurrent processing that is spread across many nodes, aware
of the status of each task, able to hold/wait or act on the condition
of the system, and able to grow fairly linearly as needed.

-byron

--- Stefan Groschupf [EMAIL PROTECTED] wrote:

 Hi Andrzej,
 
 
 * merge 80 segments into 1. A lot of IO involved... and you have to
 repeat it from time to time. Ugly.
 I agree.

 * implement a search server as a map task. Several challenges: it
 needs to partition the Lucene index, and it has to copy all parts
 of segments and indexes from DFS to the local storage, otherwise
 performance will suffer. However, the number of open files per
 machine would be reduced, because (ideally) each machine would deal
 with a few or a single part of a segment and a single part of the
 index...
 
 Well, I played around and already have a kind of prototype.
 I ran into the following problems:

 + having a kind of repository of active search servers
 possibility A: find all tasktrackers running a specific task (already
 discussed in the hadoop mailing list)
 possibility B: having an rpc server running in the jvm that runs the
 search server client, add the hostname to the jobconf and, similar to
 the task -> jobtracker case, let the search server announce itself via
 heartbeat to the search server 'repository'.

 + having the index locally and the segment in the dfs.
 ++ adding to the NutchBean init one dfs for the index and one for
 segments could fix this, or, more generally, adding support for stream
 handlers like dfs:// vs file://. (very long term)

 + downloading an index from dfs when the mapper starts, or just
 indexing the segment data to the local hdd and letting the mapper run
 for the next 30 days?

 Stefan
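
To make the 'possibility B' quoted above a bit more concrete: a rough
sketch of the announce-via-heartbeat idea could look like the
following. The SearchServerRepository interface and every other name
in the sketch are made up for illustration and are not an existing
Nutch or Hadoop API.

// Hypothetical heartbeat loop for a search server announcing itself to a
// 'search server repository'; the SearchServerRepository interface is made
// up for illustration and does not exist in Nutch or Hadoop.
import java.net.InetAddress;

public class SearchServerHeartbeat implements Runnable {

  /** Hypothetical RPC-facing interface of the repository. */
  public interface SearchServerRepository {
    void announce(String hostname, int port, String[] indexParts);
  }

  private static final long HEARTBEAT_INTERVAL_MS = 10000L; // arbitrary

  private final SearchServerRepository repository; // an RPC proxy in practice
  private final int port;
  private final String[] indexParts;               // index parts this node serves

  public SearchServerHeartbeat(SearchServerRepository repository, int port,
                               String[] indexParts) {
    this.repository = repository;
    this.port = port;
    this.indexParts = indexParts;
  }

  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        String host = InetAddress.getLocalHost().getHostName();
        repository.announce(host, port, indexParts); // "I'm alive, here is what I serve"
        Thread.sleep(HEARTBEAT_INTERVAL_MS);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();          // stop cleanly on shutdown
      } catch (Exception e) {
        // network hiccup: keep trying on the next beat
      }
    }
  }
}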
 



scalability limits getDetails, mapFile Readers?

2006-03-01 Thread Stefan Groschupf

Hi,

We ran into a problem with Nutch using MapFileOutputFormat#getReaders
and getEntry.
In detail, this happens during summary generation, where we open, for
each segment, as many readers as there are parts (part-0 to part-n).

Having 80 tasktrackers and 80 segments means:
80 x 80 x 4 (parseData, parseText, content, crawl) = 25,600 open
readers. A search server also needs to open as many files as the index
searcher requires.

So the problem is a FileNotFoundException (Too many open files).

Opening and closing readers for each detail makes no sense. We could
perhaps limit the number of readers somehow and close the readers that
haven't been used for the longest time.
But I'm not that happy with this solution, so any thoughts on how we
can solve this problem in general?
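
Something like the following is roughly what I have in mind, as a
sketch only: the MapFile.Reader constructor is the one I know from the
Hadoop io package, and the cache itself is just a
java.util.LinkedHashMap in access order, nothing Nutch-specific.

// Sketch of an LRU cache that caps the number of open MapFile readers and
// closes the least-recently-used one when the limit is exceeded.
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;

public class ReaderCache {

  private final int maxOpenReaders;
  private final FileSystem fs;
  private final Configuration conf;
  private final Map<String, MapFile.Reader> readers;

  public ReaderCache(FileSystem fs, Configuration conf, int maxOpenReaders) {
    this.fs = fs;
    this.conf = conf;
    this.maxOpenReaders = maxOpenReaders;
    // accessOrder=true turns the LinkedHashMap into an LRU structure
    this.readers = new LinkedHashMap<String, MapFile.Reader>(16, 0.75f, true) {
      protected boolean removeEldestEntry(Map.Entry<String, MapFile.Reader> eldest) {
        if (size() > ReaderCache.this.maxOpenReaders) {
          try {
            eldest.getValue().close();   // close before dropping from the cache
          } catch (IOException e) {
            // ignore: the reader is being evicted anyway
          }
          return true;
        }
        return false;
      }
    };
  }

  /** Returns a cached reader for the given part, opening it on first use. */
  public synchronized MapFile.Reader getReader(String partDir) throws IOException {
    MapFile.Reader reader = readers.get(partDir);
    if (reader == null) {
      reader = new MapFile.Reader(fs, partDir, conf); // signature from the old Hadoop API
      readers.put(partDir, reader);
    }
    return reader;
  }
}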


Thanks.
Stefan

P.S. We also noticed that the .crc files double the number of open files.


Re: scalability limits getDetails, mapFile Readers?

2006-03-01 Thread Andrzej Bialecki

Stefan Groschupf wrote:

Hi,

We ran into a problem with Nutch using MapFileOutputFormat#getReaders
and getEntry.
In detail, this happens during summary generation, where we open, for
each segment, as many readers as there are parts (part-0 to part-n).

Having 80 tasktrackers and 80 segments means:
80 x 80 x 4 (parseData, parseText, content, crawl) = 25,600 open
readers. A search server also needs to open as many files as the index
searcher requires.

So the problem is a FileNotFoundException (Too many open files).

Opening and closing readers for each detail makes no sense. We could
perhaps limit the number of readers somehow and close the readers that
haven't been used for the longest time.
But I'm not that happy with this solution, so any thoughts on how we
can solve this problem in general?


I don't think we can reduce the number of open files in this case... The 
solutions that come to my mind are:


* merge 80 segments into 1. A lot of IO involved... and you have to 
repeat it from time to time. Ugly.


* implement a search server as a map task. Several challenges: it needs
to partition the Lucene index, and it has to copy all parts of segments
and indexes from DFS to the local storage, otherwise performance will
suffer. However, the number of open files per machine would be reduced,
because (ideally) each machine would deal with a few or a single part
of a segment and a single part of the index...


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: scalability limits getDetails, mapFile Readers?

2006-03-01 Thread Ken Krugler


* merge 80 segments into 1. A lot of IO involved... and you have to 
repeat it from time to time. Ugly.

I agree.


* implement a search server as a map task. Several challenges: it
needs to partition the Lucene index, and it has to copy all parts
of segments and indexes from DFS to the local storage, otherwise
performance will suffer. However, the number of open files per
machine would be reduced, because (ideally) each machine would deal
with a few or a single part of a segment and a single part of the index...


Well, I played around and already have a kind of prototype.
I ran into the following problems:

+ having a kind of repository of active search servers
possibility A: find all tasktrackers running a specific task (already
discussed in the hadoop mailing list)
possibility B: having an rpc server running in the jvm that runs the
search server client, add the hostname to the jobconf and, similar to
the task -> jobtracker case, let the search server announce itself via
heartbeat to the search server 'repository'.


+ having the index locally and the segment in the dfs.
++ adding to the NutchBean init one dfs for the index and one for
segments could fix this, or, more generally, adding support for stream
handlers like dfs:// vs file://. (very long term)


This seems most interesting to me, as working around the issues of a
distributed index (while still getting reasonable performance) seems
tricky. But being able to build a local index from NDFS, and to access
the segment data via NDFS when needed for summaries and cached files,
seems fairly straightforward.
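
Roughly what I picture, as a hedged sketch only: the paths below are
made up, and the FileSystem.copyToLocalFile and Lucene
FSDirectory/IndexSearcher calls are just the plain APIs as I remember
them, nothing Nutch-specific.

// Sketch: pull one index part out of (N)DFS onto the local disk, search it
// locally, and leave the segment data in DFS for summaries/cached content.
import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

public class LocalIndexSetup {

  public static IndexSearcher openLocalSearcher(Configuration conf) throws Exception {
    FileSystem dfs = FileSystem.get(conf);

    // Hypothetical paths: one index part in DFS, a scratch dir on local disk.
    Path dfsIndex = new Path("crawl/indexes/part-0");
    Path localIndex = new Path("/local/search/index/part-0");

    // Copy the index part to the local disk once, before serving queries...
    dfs.copyToLocalFile(dfsIndex, localIndex);

    // Segment data (parseText/content for summaries and cached pages) would
    // still be read through the DFS readers, and only when a hit needs details.

    // ...then search against the local copy, so query-time IO never hits DFS.
    FSDirectory dir = FSDirectory.getDirectory(new File(localIndex.toString()), false);
    return new IndexSearcher(dir);
  }
}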


Though every time I think I understand Nutch I'm wrong - or the code changes :)

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
Find Code, Find Answers