Hi Chris,
A NullPointerException can be caused by not checking
newToken for null after this line:
Token newToken = input.next();
I think Hoss meant to call next() on the input as long as returned
tokens do not satisfy the check for being a named entity.
Also, this code assumes there is white space inside the token, which you won't
have since you are using a WhitespaceAnalyzer.
For returning single word names I think something like this should work:
Token t;
while ((t = input.next()) != null
       && !Character.isUpperCase(t.termText().charAt(0))) {
    // skip tokens that do not start with an upper case character
}
return t;
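
As a rough, self-contained sketch of the same loop (no Lucene classes here: a plain Iterator<String> stands in for the token stream, and the class and method names are made up for illustration), with an extra guard for empty terms:

```java
import java.util.Arrays;
import java.util.Iterator;

public class CapitalizedFilterSketch {

    // Hypothetical stand-in for next(): returns the next term that starts
    // with an upper case character, or null when the input is exhausted.
    static String nextCapitalized(Iterator<String> input) {
        while (input.hasNext()) {
            String term = input.next();
            // Guard against empty terms before calling charAt(0).
            if (!term.isEmpty() && Character.isUpperCase(term.charAt(0))) {
                return term;
            }
        }
        return null; // end of stream, mirrors the null returned by next()
    }

    public static void main(String[] args) {
        Iterator<String> tokens =
                Arrays.asList("the", "city", "of", "Lisbon", "and", "Porto").iterator();
        String t;
        while ((t = nextCapitalized(tokens)) != null) {
            System.out.println(t);
        }
    }
}
```

The caller keeps asking for the next term until it gets null, which is the same contract the filter's next() has to honour.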
For identifying two consecutive tokens starting with an upper case character
and returning them as a single name, a bit more code is required.
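
One possible shape for that, again only a sketch with plain strings instead of Lucene Tokens (the names NameMergerSketch, startsUpper and extractNames are hypothetical). It assumes a run of any length should be joined, so a single capitalized word comes back as a name on its own and three in a row come back as one name:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class NameMergerSketch {

    static boolean startsUpper(String term) {
        return !term.isEmpty() && Character.isUpperCase(term.charAt(0));
    }

    // Joins each run of consecutive capitalized terms into one name,
    // separated by single spaces; all other terms are dropped.
    static List<String> extractNames(Iterator<String> input) {
        List<String> names = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        while (input.hasNext()) {
            String term = input.next();
            if (startsUpper(term)) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(term);
            } else if (current.length() > 0) {
                // A lower case term ends the current run.
                names.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            names.add(current.toString()); // run that reaches end of input
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "yesterday", "John", "Smith", "met", "Mary", "in", "New", "York", "City");
        System.out.println(extractNames(words.iterator()));
    }
}
```

In a real TokenFilter the same idea needs one token of look-ahead buffered between calls to next(), since the filter only finds out that a run has ended after reading the token that follows it.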
Btw, I don't understand why the NGram wrapper is needed here.
HTH, Doron
On Jan 8, 2008 5:05 PM, chris.b <[EMAIL PROTECTED]> wrote:
>
> Following your suggestion (I think), I built a tokenfilter with the following
> code for next():
>
> public final Token next() throws IOException {
>     Token newToken = input.next();
>     termText = newToken.termText();
>     Character tempChar = termText.charAt(0);
>     if (Character.isUpperCase(tempChar)) {
>         for (int current = 0; current < termText.length(); current++) {
>             Character currentChar = termText.charAt(current);
>             if (Character.isWhitespace(currentChar) &
>                     Character.isUpperCase(currentChar + 1) & current != termText.length()) {
>                 return newToken;
>             }
>         }
>     }
>     return null;
> }
>
> -----------
> and in calling this filter, i'm also calling NGramAnalyzerWrapper wrapping
> WhitespaceAnalyzer (these two work together), but when building my index i
> get the following error:
>
> Exception in thread "main" java.lang.NullPointerException
>         at rem.NamedEntityTokenFilter.next(NamedEntityTokenFilter.java:21)
>         at org.apache.lucene.index.DocumentWriter.invertDocument(DocumentWriter.java:219)
>         at org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:95)
>         at org.apache.lucene.index.IndexWriter.buildSingleDocSegment(IndexWriter.java:1013)
>         at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1001)
>         at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:983)
>         at ancorpMethods.Handlers.handleDOC(Handlers.java:92)
>         at ancorpMethods.Handlers.handleDir(Handlers.java:32)
>         at ancorpMethods.Handlers.handleDir(Handlers.java:30)
>         at ancorpMethods.Handlers.handleDir(Handlers.java:30)
>         at ancorpMethods.Handlers.handleDir(Handlers.java:30)
>         at ancorpMethods.Handlers.handleDir(Handlers.java:30)
>         at Base.Indexer.indexCapitalNgrams(Indexer.java:155)
>         at Base.Indexer.main(Indexer.java:81)
>
> ----------
> am I forgetting something or am I going the wrong way? :|
>
>