Re: scalability limits getDetails, mapFile Readers?
I would like to see something like active, in-process, and inbound stages. Active data is live and on the query servers (both indexes and the corresponding segments), in-process is tasks currently being mapped out, and inbound is processes/data pending processing. Active nodes report as being in the search pool. In-process nodes are really data nodes doing all of the number crunching/import/merging/indexing, and inbound is everything in fetch/pre-processing.

The cycle would be a pull cycle: active nodes pull from the corresponding data nodes, which in turn pull from the corresponding inbound nodes. Events/batches could trigger the pull so that it is a complete or usable data set. Some lightweight workflow engine could allow you to process/manage the cycle of data.

I would also like to see it be DFS block aware - able to process the data on the active data server where that data resides (as much as possible). Such a file table could be used to associate the data through the entire process stream and allow for fairly linear growth.

Such a system could also be aware of its own capacity: the inbound processes would fail/halt if disk space on the DFS system isn't capable of handling new tasks, and vice versa, if the active nodes are at capacity, tasks could be told to stop/hold. You could use this logic to add more nodes where necessary, resume processing, and chart your growth (a rough sketch of such a pull cycle follows this message).

I come from the ERP/Oracle world, so I have very much learned to appreciate the distributed architecture and the concept of an aware system - such as concurrent processing that is distributed across many nodes, aware of the status of each task, able to hold/wait or act upon the condition of the system, and able to grow fairly linearly as needed.

-byron

--- Stefan Groschupf [EMAIL PROTECTED] wrote:

> Hi Andrzej,
>
> > * merge 80 segments into 1. A lot of IO involved... and you have to
> > repeat it from time to time. Ugly.
>
> I agree.
>
> > * implement a search server as a map task. Several challenges: it
> > needs to partition the Lucene index, and it has to copy all parts of
> > segments and indexes from DFS to the local storage, otherwise
> > performance will suffer. However, the number of open files per
> > machine would be reduced, because (ideally) each machine would deal
> > with few or a single part of a segment and a single part of an index...
>
> Well, I played around and already had a kind of prototype. I saw the
> following problems:
>
> + having a kind of repository of active search servers
>   possibility A: find all tasktrackers running a specific task
>   (already discussed in the hadoop mailing list)
>   possibility B: have an RPC server running in the JVM that runs the
>   search server client, add the hostname to the jobconf, and - similar
>   to the task/jobtracker relationship - have the search server announce
>   itself via heartbeat to the search server 'repository'.
>
> + having the index locally and the segment in the dfs.
>   ++ adding to NutchBean init a dfs for the index and one for the
>   segments could fix this, or more generally, adding support for
>   stream handlers like dfs:// vs file://. (very long term)
>
> + download an index from dfs when the mapper starts, or just index the
>   segment data to local hdd and let the mapper run for the next 30 days?
>
> Stefan
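A minimal, purely illustrative sketch of the capacity-aware pull cycle described above - none of these classes exist in Nutch or Hadoop, and all names are hypothetical:

// Hypothetical sketch: each stage pulls work only when it has capacity and
// the upstream stage has a complete batch; otherwise the work simply holds.
public class PullCycleSketch {

  interface Node {
    boolean hasCapacity();        // e.g. enough free disk space on the DFS volume
    boolean hasCompleteBatch();   // event/batch trigger: a complete, usable data set
    void pullFrom(Node upstream); // fetch and process one batch from upstream
  }

  /** One pass of the cycle: inbound -> data (in-process) -> active. */
  static void runCycle(Node inbound, Node dataNode, Node activeNode) {
    if (dataNode.hasCapacity() && inbound.hasCompleteBatch()) {
      dataNode.pullFrom(inbound);    // number crunching/import/merging/indexing
    }
    if (activeNode.hasCapacity() && dataNode.hasCompleteBatch()) {
      activeNode.pullFrom(dataNode); // publish to the search pool
    }
    // When neither branch fires, processing stops/holds until capacity frees
    // up or more nodes are added - the growth signal described above.
  }
}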
scalability limits getDetails, mapFile Readers?
Hi,

We ran into a problem in Nutch with MapFileOutputFormat#getReaders and getEntry. In detail, this happens during summary generation, where for each segment we open as many readers as there are parts (part-00000 to part-n). Having 80 tasktrackers and 80 segments means 80 x 80 x 4 readers (parseData, parseText, content, crawl). A search server also needs to open as many files as the index searcher requires. So the problem is a FileNotFoundException ("Too many open files").

Opening and closing readers for each detail makes no sense. We could limit the number of readers somehow and close the readers that haven't been used for the longest time, but I'm not that happy with this solution, so any thoughts on how we can solve this problem in general?

Thanks,
Stefan

P.S. The .crc files also double the number of open files.
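A minimal sketch of the "limit the readers and close the least recently used ones" idea mentioned above, assuming a plain java.util.LinkedHashMap in access order; the ReaderCache class and its Opener callback are hypothetical, not existing Nutch code:

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;

/**
 * Hypothetical LRU cache that bounds how many MapFile readers stay open.
 * When the limit is exceeded, the least recently used reader is closed.
 */
public class ReaderCache {

  /** Callback so the caller decides how a reader for a part file is opened. */
  public interface Opener {
    MapFile.Reader open(Path part) throws IOException;
  }

  private final Map<Path, MapFile.Reader> readers;

  public ReaderCache(final int maxOpen) {
    // accessOrder=true makes iteration order follow recent use (LRU).
    this.readers = new LinkedHashMap<Path, MapFile.Reader>(16, 0.75f, true) {
      protected boolean removeEldestEntry(Map.Entry<Path, MapFile.Reader> eldest) {
        if (size() <= maxOpen) {
          return false;
        }
        try {
          eldest.getValue().close(); // evict the reader idle for the longest time
        } catch (IOException e) {
          // the reader is being dropped anyway
        }
        return true;
      }
    };
  }

  /** Returns a cached reader for the given part file, opening it on demand. */
  public synchronized MapFile.Reader get(Path part, Opener opener) throws IOException {
    MapFile.Reader reader = readers.get(part);
    if (reader == null) {
      reader = opener.open(part);
      readers.put(part, reader);
    }
    return reader;
  }
}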
Re: scalability limits getDetails, mapFile Readers?
Stefan Groschupf wrote:

> Hi,
>
> We ran into a problem in Nutch with MapFileOutputFormat#getReaders and
> getEntry. In detail, this happens during summary generation, where for
> each segment we open as many readers as there are parts (part-00000 to
> part-n). Having 80 tasktrackers and 80 segments means 80 x 80 x 4
> readers (parseData, parseText, content, crawl). A search server also
> needs to open as many files as the index searcher requires. So the
> problem is a FileNotFoundException ("Too many open files").
>
> Opening and closing readers for each detail makes no sense. We could
> limit the number of readers somehow and close the readers that haven't
> been used for the longest time, but I'm not that happy with this
> solution, so any thoughts on how we can solve this problem in general?

I don't think we can reduce the number of open files in this case... The solutions that come to my mind are:

* merge 80 segments into 1. A lot of IO involved... and you have to repeat it from time to time. Ugly.

* implement a search server as a map task. Several challenges: it needs to partition the Lucene index, and it has to copy all parts of segments and indexes from DFS to the local storage, otherwise performance will suffer. However, the number of open files per machine would be reduced, because (ideally) each machine would deal with few or a single part of a segment and a single part of an index...

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
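Back-of-the-envelope arithmetic for the two options, based only on the figures quoted in this thread (80 segments, roughly one part per tasktracker, four MapFiles per part); the numbers are illustrative, not measured:

// Rough open-file counts from the numbers in this thread.
public class OpenFileEstimate {
  public static void main(String[] args) {
    int segments = 80;
    int partsPerSegment = 80; // roughly one part per tasktracker
    int mapFilesPerPart = 4;  // parseData, parseText, content, crawl

    // Summarizer opening readers for every part of every segment:
    int everything = segments * partsPerSegment * mapFilesPerPart;
    System.out.println("readers, all parts on one machine: " + everything);   // 25600
    System.out.println("with .crc files doubling it: " + (2 * everything));   // 51200

    // Option 1: merge the 80 segments into a single segment:
    int merged = 1 * partsPerSegment * mapFilesPerPart;
    System.out.println("readers after merging into one segment: " + merged);  // 320

    // Option 2: search server as a map task, one segment part per machine:
    int perMachine = segments * 1 * mapFilesPerPart;
    System.out.println("readers per machine, one part each: " + perMachine);  // 320
  }
}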
Re: scalability limits getDetails, mapFile Readers?
> > * merge 80 segments into 1. A lot of IO involved... and you have to
> > repeat it from time to time. Ugly.
>
> I agree.
>
> > * implement a search server as a map task. Several challenges: it
> > needs to partition the Lucene index, and it has to copy all parts of
> > segments and indexes from DFS to the local storage, otherwise
> > performance will suffer. However, the number of open files per
> > machine would be reduced, because (ideally) each machine would deal
> > with few or a single part of a segment and a single part of an index...
>
> Well, I played around and already had a kind of prototype. I saw the
> following problems:
>
> + having a kind of repository of active search servers
>   possibility A: find all tasktrackers running a specific task
>   (already discussed in the hadoop mailing list)
>   possibility B: have an RPC server running in the JVM that runs the
>   search server client, add the hostname to the jobconf, and - similar
>   to the task/jobtracker relationship - have the search server announce
>   itself via heartbeat to the search server 'repository'.
>
> + having the index locally and the segment in the dfs.
>   ++ adding to NutchBean init a dfs for the index and one for the
>   segments could fix this, or more generally, adding support for
>   stream handlers like dfs:// vs file://. (very long term)

This seems most interesting to me, as working around the issues of a distributed index (while still getting reasonable performance) seems tricky. But being able to build a local index from NDFS, and access the segment data via NDFS when needed for summaries/cached files, would seem fairly straightforward.

Though every time I think I understand Nutch I'm wrong - or the code changes :)

-- Ken

--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
Find Code, Find Answers
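A rough sketch of the split Ken describes - index copied to local disk, segment data left in (N)DFS - using the Hadoop FileSystem API; the paths and the idea of copying the index at startup are assumptions for illustration, not existing Nutch behaviour:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Hypothetical sketch: pull the Lucene index down to local disk once at
 * startup, keep the segment data in DFS for summary/detail lookups.
 */
public class LocalIndexSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    FileSystem dfs = FileSystem.get(conf);        // the (N)DFS holding the crawl output
    FileSystem local = FileSystem.getLocal(conf); // local disk of the search node

    Path dfsIndex = new Path("crawl/index");        // hypothetical index location in DFS
    Path localIndex = new Path("/tmp/nutch-index"); // hypothetical local copy

    // Lucene's random-access pattern makes searching directly against DFS
    // slow, so copy the index to local disk before opening a searcher.
    dfs.copyToLocalFile(dfsIndex, localIndex);

    // Segment data (parseData, parseText, content) is read sequentially and
    // only per hit, so it can stay in DFS and be opened through the dfs handle.
    Path segments = new Path("crawl/segments");
    System.out.println("index served from local path " + localIndex
        + ", segments read via dfs from " + segments);
  }
}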