Re: New tools: CrawlDbMerger, LinkDbMerger, SegmentMerger
Lukas Vlcek wrote:
> New Wiki pages are created:
> http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergedb
> http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergelinkdb
> http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergesegs
> I hope I didn't introduce a lot of typos and issues into text. Also I
> tried to stick to original style.

Thanks a lot, looks great.

--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _ __
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: New tools: CrawlDbMerger, LinkDbMerger, SegmentMerger
New Wiki pages are created:
http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergedb
http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergelinkdb
http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergesegs
I hope I didn't introduce a lot of typos and issues into text. Also I
tried to stick to original style.

Regards,
Lukas

On 5/10/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Lukas Vlcek wrote:
> > Andrzej,
> >
> > My pleasure. I would choose the following location:
> > http://wiki.apache.org/nutch/DevelopmentCommandLineOptions
> > Let me know if you can think of anything better otherwise I'll do it.
>
> Thanks, that would be a perfect place.
>
> --
> Best regards,
> Andrzej Bialecki
Re: New tools: CrawlDbMerger, LinkDbMerger, SegmentMerger
Lukas Vlcek wrote:
> Andrzej,
>
> My pleasure. I would choose the following location:
> http://wiki.apache.org/nutch/DevelopmentCommandLineOptions
> Let me know if you can think of anything better; otherwise I'll do it.

Thanks, that would be a perfect place.

--
Best regards,
Andrzej Bialecki
Re: New tools: CrawlDbMerger, LinkDbMerger, SegmentMerger
Andrzej,

My pleasure. I would choose the following location:
http://wiki.apache.org/nutch/DevelopmentCommandLineOptions
Let me know if you can think of anything better; otherwise I'll do it.

Regards,
Lukas

On 5/9/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Lukas Vlcek wrote:
> > Andrzej,
> > Thanks for your effort!
> >
> > Are you going to post tool descriptions somewhere on the wiki or in
> > the tutorial? It would be great if this information could be available
> > to people outside the dev mailing list as well.
>
> If you have some spare cycles, would you be willing to do this? Take
> excerpts from my email and from the Javadoc - I tried to make the
> Javadoc especially as complete as possible...
>
> --
> Best regards,
> Andrzej Bialecki
Re: New tools: CrawlDbMerger, LinkDbMerger, SegmentMerger
Lukas Vlcek wrote:
> Andrzej,
> Thanks for your effort!
>
> Are you going to post tool descriptions somewhere on the wiki or in
> the tutorial? It would be great if this information could be available
> to people outside the dev mailing list as well.

If you have some spare cycles, would you be willing to do this? Take
excerpts from my email and from the Javadoc - I tried to make the
Javadoc especially as complete as possible...

--
Best regards,
Andrzej Bialecki
Re: New tools: CrawlDbMerger, LinkDbMerger, SegmentMerger
Andrzej,

Thanks for your effort!

Are you going to post tool descriptions somewhere on the wiki or in the
tutorial? It would be great if this information could be available to
people outside the dev mailing list as well.

Regards,
Lukas

On 5/9/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> I just committed a couple of new tools, and I'd like to briefly explain
> their purpose and intended use.
>
> * CrawlDbMerger: available from bin/nutch as 'mergedb'. It merges
>   several existing DBs into one, which is useful if you ran several
>   partial crawls and would like to combine the DBs. Optionally, you can
>   run the current URLFilters on the URLs in the databases to filter out
>   unwanted URLs. This also works with just one input DB, so you can use
>   the tool to weed out unwanted URLs from a single DB.
>
> * LinkDbMerger: available from bin/nutch as 'mergelinkdb', with a
>   similar purpose and similar options. Please note that URLFilters, if
>   activated, apply to both target and source URLs. This tool is useful
>   if you built partial linkdbs from groups of segments and now need to
>   integrate them into one (e.g. for indexing or for searching). You can
>   also use it with a single linkdb, just to filter out unwanted URLs.
>
> * SegmentMerger: available as 'mergesegs'. This tool merges several
>   input segments into one or more output segments, with optional
>   filtering as above. Optionally, the output data can be divided into
>   several smaller segments of fixed size. There are many dos and don'ts
>   regarding the use of this tool, described in the Javadoc - please be
>   sure to read them before using it. The tool is meant for, e.g.,
>   re-shaping your segments (in preparation for deployment to search
>   servers), filtering out unwanted data, or minimizing the number of
>   active segments.
>
> --
> Best regards,
> Andrzej Bialecki
New tools: CrawlDbMerger, LinkDbMerger, SegmentMerger
Hi all,

I just committed a couple of new tools, and I'd like to briefly explain
their purpose and intended use.

* CrawlDbMerger: available from bin/nutch as 'mergedb'. It merges
  several existing DBs into one, which is useful if you ran several
  partial crawls and would like to combine the DBs. Optionally, you can
  run the current URLFilters on the URLs in the databases to filter out
  unwanted URLs. This also works with just one input DB, so you can use
  the tool to weed out unwanted URLs from a single DB.

* LinkDbMerger: available from bin/nutch as 'mergelinkdb', with a
  similar purpose and similar options. Please note that URLFilters, if
  activated, apply to both target and source URLs. This tool is useful
  if you built partial linkdbs from groups of segments and now need to
  integrate them into one (e.g. for indexing or for searching). You can
  also use it with a single linkdb, just to filter out unwanted URLs.

* SegmentMerger: available as 'mergesegs'. This tool merges several
  input segments into one or more output segments, with optional
  filtering as above. Optionally, the output data can be divided into
  several smaller segments of fixed size. There are many dos and don'ts
  regarding the use of this tool, described in the Javadoc - please be
  sure to read them before using it. The tool is meant for, e.g.,
  re-shaping your segments (in preparation for deployment to search
  servers), filtering out unwanted data, or minimizing the number of
  active segments.

--
Best regards,
Andrzej Bialecki
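For anyone who wants a mental model of what the three tools described
above do, here is a minimal Python sketch. It is not Nutch code and does
not use the Nutch API: the function names, the plain dicts standing in
for a crawldb and a linkdb, and the url_filter callback are all
hypothetical. Two behaviours are assumptions the announcement does not
state: when the same URL appears in more than one crawldb, the entry with
the newer fetch time is kept, and a linkdb target whose sources are all
filtered out is dropped. Check the Javadoc for what the real tools do.

```python
from typing import Callable, Dict, Iterable, List, Optional, Set

# Hypothetical stand-in for URLFilters: returns True to keep a URL.
UrlFilter = Callable[[str], bool]


def merge_crawl_dbs(dbs: Iterable[Dict[str, int]],
                    url_filter: Optional[UrlFilter] = None) -> Dict[str, int]:
    """Merge URL -> fetch-time maps into one map.

    Assumption: on a clash, the entry with the newer fetch time wins.
    """
    merged: Dict[str, int] = {}
    for db in dbs:
        for url, fetch_time in db.items():
            if url_filter is not None and not url_filter(url):
                continue  # weed out unwanted URLs
            if url not in merged or fetch_time > merged[url]:
                merged[url] = fetch_time
    return merged


def merge_link_dbs(dbs: Iterable[Dict[str, Set[str]]],
                   url_filter: Optional[UrlFilter] = None) -> Dict[str, Set[str]]:
    """Merge target URL -> set-of-source-URLs maps.

    The filter is applied to both target and source URLs.
    Assumption: a target left with no sources is dropped.
    """
    merged: Dict[str, Set[str]] = {}
    for db in dbs:
        for target, sources in db.items():
            if url_filter is not None and not url_filter(target):
                continue
            kept = {s for s in sources
                    if url_filter is None or url_filter(s)}
            if kept:
                merged.setdefault(target, set()).update(kept)
    return merged


def slice_records(records: List[str], slice_size: int) -> List[List[str]]:
    """Split merged records into consecutive slices of at most slice_size."""
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    return [records[i:i + slice_size]
            for i in range(0, len(records), slice_size)]
```

Passing a single DB together with a filter to merge_crawl_dbs or
merge_link_dbs mirrors the "weed out unwanted URLs from a single DB" use
mentioned above, and slice_records mirrors dividing the merged output
into smaller segments of fixed size.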