Re: New tools: CrawlDbMerger, LinkDbMerger, SegmentMerger

2006-05-10 Thread Andrzej Bialecki

Lukas Vlcek wrote:

New Wiki pages have been created:
http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergedb
http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergelinkdb
http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergesegs

I hope I didn't introduce a lot of typos and issues into the text.
Also, I tried to stick to the original style.



Thanks a lot, looks great.

--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: New tools: CrawlDbMerger, LinkDbMerger, SegmentMerger

2006-05-10 Thread Lukas Vlcek

New Wiki pages have been created:
http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergedb
http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergelinkdb
http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergesegs

I hope I didn't introduce a lot of typos and issues into the text.
Also, I tried to stick to the original style.

Regards,
Lukas

On 5/10/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

Lukas Vlcek wrote:
> Andrzej,
>
> My pleasure. I would choose the following location:
> http://wiki.apache.org/nutch/DevelopmentCommandLineOptions
> Let me know if you can think of anything better; otherwise I'll do it.

Thanks, that would be a perfect place.






Re: New tools: CrawlDbMerger, LinkDbMerger, SegmentMerger

2006-05-09 Thread Andrzej Bialecki

Lukas Vlcek wrote:

Andrzej,

My pleasure. I would choose the following location:
http://wiki.apache.org/nutch/DevelopmentCommandLineOptions
Let me know if you can think of anything better; otherwise I'll do it.


Thanks, that would be a perfect place.

--
Best regards,
Andrzej Bialecki <><




Re: New tools: CrawlDbMerger, LinkDbMerger, SegmentMerger

2006-05-09 Thread Lukas Vlcek

Andrzej,

My pleasure. I would choose the following location:
http://wiki.apache.org/nutch/DevelopmentCommandLineOptions
Let me know if you can think of anything better; otherwise I'll do it.

Regards,
Lukas

On 5/9/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

Lukas Vlcek wrote:
> Andrzej,
> Thanks for your effort!
>
> Are you going to post tool descriptions somewhere on the wiki or
> tutorial? It would be great if this information could be available to
> people outside the dev mailing list as well.

If you have some spare cycles, would you be willing to do this? Take
excerpts from my email and from the Javadoc - I tried to make the
Javadoc in particular as complete as possible...






Re: New tools: CrawlDbMerger, LinkDbMerger, SegmentMerger

2006-05-09 Thread Andrzej Bialecki

Lukas Vlcek wrote:

Andrzej,
Thanks for your effort!

Are you going to post tool descriptions somewhere on the wiki or
tutorial? It would be great if this information could be available to
people outside the dev mailing list as well.


If you have some spare cycles, would you be willing to do this? Take
excerpts from my email and from the Javadoc - I tried to make the
Javadoc in particular as complete as possible...


--
Best regards,
Andrzej Bialecki <><




Re: New tools: CrawlDbMerger, LinkDbMerger, SegmentMerger

2006-05-08 Thread Lukas Vlcek

Andrzej,
Thanks for your effort!

Are you going to post tool descriptions somewhere on the wiki or
tutorial? It would be great if this information could be available to
people outside the dev mailing list as well.

Regards,
Lukas

On 5/9/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

Hi all,

I just committed a couple of new tools, and I'd like to briefly explain
their purpose and intended use.

* CrawlDbMerger: available from bin/nutch as 'mergedb'. You can merge
several existing DBs into one. This comes in handy if you ran several
partial crawls and you'd like to combine the DBs. Optionally, you can
run the current URLFilters on URLs in the databases, to filter out unwanted
URLs. This also works if you run it with just one input DB, which means
that you can use this tool for weeding out unwanted URLs from a single DB.

* LinkDbMerger: available from bin/nutch as 'mergelinkdb', with a
similar purpose as above, and with similar options. Please note that
URLFilters, if activated, will apply to both target and source URLs.
This tool can be useful if you built partial linkdb-s from groups of
segments, and then you need to integrate them into one (e.g. for
indexing or for searching). Or you can use it with a single linkdb, just
to filter out unwanted URLs.

* SegmentMerger: available as 'mergesegs'. This tool merges several
input segments into one or more output segments, with optional filtering
as above. Optionally, the output data can be divided into several
smaller segments of fixed size. There are many dos and don'ts regarding
the use of this tool, described in the Javadoc - please be sure to read them
before using it. This tool can be used, for example, to re-shape your segments
(in preparation for deployment to search servers), to filter out
unwanted data, or to minimize the number of active segments.






New tools: CrawlDbMerger, LinkDbMerger, SegmentMerger

2006-05-08 Thread Andrzej Bialecki

Hi all,

I just committed a couple of new tools, and I'd like to briefly explain 
their purpose and intended use.


* CrawlDbMerger: available from bin/nutch as 'mergedb'. You can merge
several existing DBs into one. This comes in handy if you ran several
partial crawls and you'd like to combine the DBs. Optionally, you can
run the current URLFilters on URLs in the databases, to filter out unwanted
URLs. This also works if you run it with just one input DB, which means
that you can use this tool for weeding out unwanted URLs from a single DB.


* LinkDbMerger: available from bin/nutch as 'mergelinkdb', with a 
similar purpose as above, and with similar options. Please note that 
URLFilters, if activated, will apply to both target and source URLs. 
This tool can be useful if you built partial linkdb-s from groups of 
segments, and then you need to integrate them into one (e.g. for 
indexing or for searching). Or you can use it with a single linkdb, just 
to filter out unwanted URLs.


* SegmentMerger: available as 'mergesegs'. This tool merges several
input segments into one or more output segments, with optional filtering
as above. Optionally, the output data can be divided into several
smaller segments of fixed size. There are many dos and don'ts regarding
the use of this tool, described in the Javadoc - please be sure to read them
before using it. This tool can be used, for example, to re-shape your segments
(in preparation for deployment to search servers), to filter out
unwanted data, or to minimize the number of active segments.
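
As a rough illustration of the behaviour described for the three tools
(merge several inputs into one, optionally filter URLs, optionally split
the output into fixed-size pieces), here is a minimal Python sketch. It is
not Nutch code: the dict-based "DB" and the names merge_dbs, slice_records
and url_filter are hypothetical, and the rule of keeping the entry with the
latest fetch time when a URL appears in several inputs is an assumption made
for this example only. The real tools are run through bin/nutch and work on
Nutch's own data structures.

```python
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical stand-in for a DB: URL -> fetch time. Not a Nutch type.
Db = Dict[str, int]
UrlFilter = Callable[[str], bool]


def merge_dbs(dbs: Iterable[Db], url_filter: Optional[UrlFilter] = None) -> Db:
    """Merge several URL -> fetch_time maps into one.

    If url_filter is given, URLs for which it returns False are dropped.
    Passing a single input DB therefore acts as a pure filtering step.
    """
    merged: Db = {}
    for db in dbs:
        for url, fetch_time in db.items():
            if url_filter is not None and not url_filter(url):
                continue  # URL rejected by the filter
            # Assumption for this sketch: the newest entry wins.
            if url not in merged or fetch_time > merged[url]:
                merged[url] = fetch_time
    return merged


def slice_records(urls: List[str], slice_size: int) -> List[List[str]]:
    """Split a list of records into consecutive slices of at most slice_size."""
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    return [urls[i:i + slice_size] for i in range(0, len(urls), slice_size)]


if __name__ == "__main__":
    db_a = {"http://a.example/": 10, "http://b.example/": 20}
    db_b = {"http://b.example/": 30, "http://spam.example/": 5}

    # Merge two DBs, then use a single DB with a filter to weed out URLs.
    print(merge_dbs([db_a, db_b]))
    print(merge_dbs([db_b], url_filter=lambda url: "spam" not in url))

    # Divide merged output into smaller pieces of fixed size.
    print(slice_records(sorted(merge_dbs([db_a, db_b])), 2))
```

The filter is applied before the merge decision, so a rejected URL never
reaches the output regardless of which input it came from.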


--
Best regards,
Andrzej Bialecki <><