On Tue, 30 Nov 2004, sam wrote:
I'm looking for a way to crawl only pages in a given language.
The subject is a pretty big domain spread across different servers.
There are mostly two languages available, and I want to index only one of them.
As I don't have any influence over how the pages get saved, and don't even know most of the cases yet, I hoped there would be a way to have the crawler detect the language and store only English or only German content in the db.
In addition to the other suggestions that have been provided, it might also be worth taking a look at the external parser/converter support in ht://Dig.
http://www.htdig.org/attrs.html#external_parsers
If you could come up with a decent algorithm for determining whether a document is in the language you are interested in, you might be able to filter at this level.
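As a rough illustration of what such a filter could look like, here is a minimal sketch in Python that guesses the language by counting common stopwords. Everything in it is an assumption made for illustration: the names guess_language, STOPWORDS and WANTED are hypothetical, the word lists are far too small for real use, and the script simply assumes it is handed the path of the fetched document as its first argument. Whether ht://Dig leaves a document out of the index when the program prints nothing is something to check against the external_parsers documentation rather than take from this sketch.

```python
import re
import sys

# Tiny illustrative stopword lists (an assumption, not part of ht://Dig);
# a real filter would need much larger lists or a proper detection library.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "for", "with", "not"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "mit", "für", "ein", "zu"},
}

# Hypothetical setting: the language we want to keep in the index.
WANTED = "en"


def guess_language(text):
    """Return the language code with the most stopword hits, or None."""
    # Split into lowercase runs of letters (digits and underscores excluded).
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = {
        lang: sum(1 for word in words if word in stops)
        for lang, stops in STOPWORDS.items()
    }
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    best_lang, best_hits = ranked[0]
    # No hits at all, or a tie between the top two: refuse to guess.
    if best_hits == 0 or (len(ranked) > 1 and ranked[1][1] == best_hits):
        return None
    return best_lang


if __name__ == "__main__":
    # Assumption: the first argument is the path of the fetched document.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        content = handle.read()
    # Pass the content through only when it looks like the wanted language.
    if guess_language(content) == WANTED:
        sys.stdout.write(content)
```

Stopword counting is crude but cheap, and it tends to work reasonably well when there are only two candidate languages, as in this case. Pages that mix both languages, or that contain very little text, would need a more careful approach.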
Jim