Hi Pablo,
Thanks for your response.
On 15/12/12 17:39, Pablo N. Mendes wrote:
Hi Rafa,
The part that is perhaps confusing here is that the stopword list is
used in multiple places. The SpanishAnalyzer removes them from the
context index (used in disambiguation). What you report is that you
see stopwords being spotted, which is a problem with your spotter
dictionary (and the class that created it) or the spotter implementation.
Punctuation marks are also being spotted (mainly quotes and dots).
Try this:
1) check if your *indexing.es.properties* configuration is pointing to
the right stopwords file for Spanish. If yes, check if that file
contains the undesired words you see spotted. If no, that's your problem.
Checked. The indexing.properties file points to the correct file, and
that file contains a proper list of words, one per line, encoded in
UTF-8.
2) check if surfaceForms.tsv contains these spurious stopwords. If yes,
then you need to double check what's happening in
IndexLingPipeSpotter. Create a small surfaceForms.tsv and
stopwords.txt and step through the code.
Yes, the generated surfaceForms.tsv file contains stopwords, but this
file is produced by the ExtractOccsFromWikipedia launcher and a
post-processing stage, before IndexLingPipeSpotter is executed. I assume
IndexLingPipeSpotter is the launcher that should remove the stopwords,
so, although I may be wrong, I don't think that having stopwords in
surfaceForms.tsv is the problem in itself. I'm going to test
IndexLingPipeSpotter with small datasets as you suggest, and I'll also
look through the code to find where the stopwords are filtered.
Anyway, do you think that stopwords should also be filtered from the
input text, to be on the safe side?
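For the surfaceForms.tsv check discussed above, a quick script along
these lines could list which entries are stopwords or bare punctuation
before stepping through IndexLingPipeSpotter. This is only a sketch, not
part of Spotlight: the function names are made up for illustration, and
it assumes the surface form is the first tab-separated column of
surfaceForms.tsv and that the stopwords file has one word per line.

```python
import unicodedata


def load_stopwords(lines):
    """Build a lowercase stopword set from an iterable of lines (one word per line)."""
    return {line.strip().lower() for line in lines if line.strip()}


def is_punctuation_only(text):
    """Return True if every non-space character is Unicode punctuation or a symbol."""
    chars = [c for c in text if not c.isspace()]
    return bool(chars) and all(
        unicodedata.category(c)[0] in ("P", "S") for c in chars
    )


def suspicious_surface_forms(tsv_lines, stopwords):
    """Yield (line_number, surface_form, reason) for stopword or punctuation entries.

    Assumption: the surface form is the first tab-separated column.
    """
    for number, line in enumerate(tsv_lines, start=1):
        surface_form = line.rstrip("\n").split("\t", 1)[0].strip()
        if not surface_form:
            continue
        if surface_form.lower() in stopwords:
            yield number, surface_form, "stopword"
        elif is_punctuation_only(surface_form):
            yield number, surface_form, "punctuation"
```

To run it against the real files, open both with encoding="utf-8" and
pass the file objects directly, e.g.
suspicious_surface_forms(open("surfaceForms.tsv", encoding="utf-8"), stopwords).
The comparison is case-insensitive, so "La" at the start of a sentence
is reported as well as "la".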
Which spotter are you using? I am assuming it is LingPipeSpotter.
Yes, I'm using LingPipeSpotter
Cheers
pablo
Thanks. Pablo
On Dec 15, 2012 12:13 AM, "Rafa Haro" <[email protected]> wrote:
Hi all,
I'm not sure if this is a bug, a problem with my local installation or
an issue in the project. Testing our local installation in Spanish, we
are having problems with the list of stopwords. I'm almost sure that the
list is being used properly during indexing with Lucene's
SpanishAnalyzer. But then, when we annotate a text in Spanish, some
stopwords are spotted as surface forms and finally linked to a candidate.
This also happens sometimes with punctuation marks (dots, quotes...).
Actually, I don't know if the system applies a stopword removal step
to the input text, but I assumed it would, to prevent this behaviour.
Am I right?
Thanks. Regards
This message should be regarded as confidential. If you have
received this email in error please notify the sender and destroy
it immediately. Statements of intent shall only become binding
when confirmed in hard copy by an authorised signatory.
Zaizi Ltd is registered in England and Wales with the registration
number 6440931. The Registered Office is 222 Westbourne Studios,
242 Acklam Road, London W10 5JJ, UK.
_______________________________________________
Dbp-spotlight-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbp-spotlight-users