Hello everybody

I think this announcement may be of interest to some people on the
list, so I'm forwarding it here.

Translate.org.za developed CorpusCatcher to help build web corpora,
specifically for use in building spell checkers. The idea is that it
can easily be extended for other specific applications.

For any comments or to contribute improvements, please join the
translate-devel mailing list here:
https://lists.sourceforge.net/lists/listinfo/translate-devel

Keep well
Friedel

-------- Forwarded Message --------
From: Walter Leibbrandt
To: [EMAIL PROTECTED]
Subject: [Translate-devel] Introducing CorpusCatcher 0.1
Date: Thu, 17 Jul 2008 16:24:49 +0200

The first version of CorpusCatcher was released recently. CorpusCatcher 
is a toolset for creating language corpora by crawling the Web. It was 
based on BootCaT 
(http://sslmit.unibo.it/~baroni/tools_and_resources.html), but evolved 
into a stand-alone project. Thanks to Kevin Scannell for his advice in 
this regard.

Its main features are:
- Querying Yahoo! for pages containing specific seed words.
- Crawling the web for relevant pages.
- Extracting the text from found pages.
- Filtering results based on positive and/or negative word lists.
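
To illustrate the last feature, here is a minimal sketch of filtering
extracted text against positive and negative word lists. It is not
CorpusCatcher's actual code or API: the names tokenize, passes_filter
and min_positive are hypothetical, and the rule used (reject on any
negative word, otherwise require a minimum number of distinct positive
words) is only an assumption about how such a filter might work.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower(), re.UNICODE)


def passes_filter(text, positive=(), negative=(), min_positive=1):
    """Decide whether a page's text should be kept (hypothetical helper).

    Assumed rule: reject the text if it contains any word from the
    negative list; otherwise, if a positive list is given, keep it only
    when at least min_positive distinct positive words occur.
    """
    words = set(tokenize(text))
    if words & set(w.lower() for w in negative):
        return False
    if positive:
        hits = words & set(w.lower() for w in positive)
        return len(hits) >= min_positive
    return True


# Example word lists for keeping Afrikaans pages and dropping English ones.
positive_words = ["die", "nie", "en"]
negative_words = ["the", "and"]

keep = passes_filter("Dit is nie die regte antwoord nie",
                     positive_words, negative_words, min_positive=2)
drop = passes_filter("This is not the right answer",
                     positive_words, negative_words)
```

Common function words of the target language make reasonable positive
words, and function words of a dominant neighbouring language make
reasonable negative words.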

The release is available for download at 
https://sourceforge.net/project/showfiles.php?group_id=91920&package_id=284333
The live documentation is available on the wiki at 
http://translate.sourceforge.net/wiki/corpuscatcher/index

Dependencies for using CorpusCatcher:
- Python >= 2.4
- mechanize 0.1.7b
- pYsearch 3.0

See 
http://translate.sourceforge.net/wiki/corpuscatcher/readme#installation 
for installation details.

Please report any bugs found at http://bugs.locamotion.org

