Reading Payloads

2013-04-23 Thread Carsten Schnober
Hi,
I'm trying to extract payloads from an index for specific tokens the
following way (inserting sample document number and term):

Terms terms = reader.getTermVector(16504, "term");
TokenStream tokenstream = TokenSources.getTokenStream(terms);
while (tokenstream.incrementToken()) {
  OffsetAttribute offset = tokenstream.getAttribute(OffsetAttribute.class);
  int start = offset.startOffset();
  int end = offset.endOffset();
  String token = tokenstream.getAttribute(CharTermAttribute.class).toString();

  PayloadAttribute payloadAttr = tokenstream.addAttribute(PayloadAttribute.class);
  BytesRef payloadBytes = payloadAttr.getPayload();

  ...
}

This works fine for the OffsetAttribute and the CharTermAttribute, but
payloadAttr.getPayload() always returns null for all documents and all
tokens, unfortunately. However, I know that the payloads are stored in
the index, as I can retrieve them through a SpanQuery with
Spans.getPayload(). I actually expect every token to carry a payload, as
my custom tokenizer implementation has the following lines:

public class KoraTokenizer extends Tokenizer {
  ...
  private PayloadAttribute payloadAttr =
addAttribute(PayloadAttribute.class);
  ...
  public boolean incrementToken() {
...
payloadAttr.setPayload(new BytesRef(payloadString));
...
  }
  ...
}

I've asserted that the payloadString variable is never an empty String,
and as I said above, I can retrieve the payloads with
Spans.getPayload(). So what am I doing wrong in my
tokenstream.addAttribute(PayloadAttribute.class) call? BTW, I had used
tokenstream.getAttribute() before, as for the other attributes, but this
threw an IllegalArgumentException, so I followed the recommendation
given in the documentation and replaced it with addAttribute().

Thanks!
Carsten




-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform




RE: Reading Payloads

2013-04-23 Thread Uwe Schindler
TermVectors are per-document and do not contain payloads. You are reading the
per-document TermVectors, which are a "small index" *stored* for each document as
a binary blob. This blob only contains the terms of this document with their
positions/offsets, but no payloads (offsets are used e.g. for highlighting).

To retrieve payloads, you have to use the main TermsEnum and main postings
lists, but this does *not* work per document. In general you would execute a
query and then retrieve the payload for each hit while iterating the scorer
(e.g. function queries can do this).
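
For what it's worth, a rough, untested sketch of that term-centric access path
on the 4.x API (field and term are placeholders, and the wrapper class is made
up for illustration); note that it walks one term's postings across all
documents, not one document's tokens:

import java.io.IOException;

import org.apache.lucene.index.DocsAndPositionsEnum;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.BytesRef;

class PayloadDump {

  // Walks the main postings list of a single term and prints every payload.
  // The field must have been indexed with positions and payloads.
  static void dumpPayloads(IndexReader reader, String field, String termText) throws IOException {
    Bits liveDocs = MultiFields.getLiveDocs(reader);
    DocsAndPositionsEnum postings = MultiFields.getTermPositionsEnum(
        reader, liveDocs, field, new BytesRef(termText), DocsAndPositionsEnum.FLAG_PAYLOADS);
    if (postings == null) {
      return; // term does not exist, or the field was indexed without positions
    }
    while (postings.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
      for (int i = 0; i < postings.freq(); i++) {
        int position = postings.nextPosition();
        BytesRef payload = postings.getPayload(); // null if this position carries no payload
        System.out.println("doc=" + postings.docID() + " pos=" + position + " payload=" + payload);
      }
    }
  }
}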

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de





Re: Reading Payloads

2013-04-23 Thread Michael McCandless
Actually, term vectors can store payloads now (LUCENE-1888), so if that
field was indexed with FieldType.setStoreTermVectorPayloads they should be
there.

But I suspect the TokenSources.getTokenStream API (which I think un-inverts
the term vectors to recreate the token stream = very slow?) wasn't fixed to
also carry the payloads through?
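
If the vectors were indexed that way, one possible workaround (a rough sketch,
not tested) is to skip TokenSources and read the payloads straight off the term
vector's own TermsEnum/DocsAndPositionsEnum:

import java.io.IOException;

import org.apache.lucene.index.DocsAndPositionsEnum;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

class TermVectorPayloads {

  // Reads payloads directly from one document's term vector, bypassing
  // TokenSources. Requires term vector positions + payloads to be stored.
  static void dump(IndexReader reader, int docId, String field) throws IOException {
    Terms vector = reader.getTermVector(docId, field);
    if (vector == null) {
      return; // no term vector stored for this document/field
    }
    TermsEnum termsEnum = vector.iterator(null);
    DocsAndPositionsEnum positions = null;
    BytesRef term;
    while ((term = termsEnum.next()) != null) {
      positions = termsEnum.docsAndPositions(null, positions, DocsAndPositionsEnum.FLAG_PAYLOADS);
      if (positions == null) {
        continue; // positions were not stored in the vector
      }
      positions.nextDoc(); // a term vector holds exactly one (pseudo) document
      for (int i = 0; i < positions.freq(); i++) {
        int pos = positions.nextPosition();
        BytesRef payload = positions.getPayload(); // null if none stored at this position
        System.out.println(term.utf8ToString() + " pos=" + pos + " payload=" + payload);
      }
    }
  }
}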

Mike McCandless

http://blog.mikemccandless.com




Re: Reading Payloads

2013-04-23 Thread Carsten Schnober
Am 23.04.2013 13:21, schrieb Michael McCandless:
> Actually, term vectors can store payloads now (LUCENE-1888), so if that
> field was indexed with FieldType.setStoreTermVectorPayloads they should be
> there.
> 
> But I suspect the TokenSources.getTokenStream API (which I think un-inverts
> the term vectors to recreate the token stream = very slow?) wasn't fixed to
> also carry the payloads through?

I use the following FieldType:

private static final FieldType textFieldWithTermVector = new FieldType(TextField.TYPE_STORED);
textFieldWithTermVector.setStoreTermVectors(true);
textFieldWithTermVector.setStoreTermVectorPositions(true);
textFieldWithTermVector.setStoreTermVectorOffsets(true);
textFieldWithTermVector.setStoreTermVectorPayloads(true);

So I suppose your assumption is right that the
TokenSources.getTokenStream API is not ready to make use of this.

I'm trying to figure out a way to use a query as Uwe suggested. My
scenario is to perform a query and then retrieve some of the payloads
upon user request, so there is no obvious way to wrap this into a query,
as I can't know what (terms) to query for.
Best,
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform




How to use TokenStream build two fields

2013-04-23 Thread 808
I am a Lucene user from China, so my English is bad. I will try my best to
explain my problem.
The version I use is 4.2. I have a problem while using Lucene.
Here is my code:
public void testIndex() throws IOException, SQLException {
    NewsDao ndao = new NewsDao();
    List<News> newsList = ndao.getNewsListAll();
    Analyzer analyzer = new IKAnalyzer(true);
    Directory directory = FSDirectory.open(new File(INDEX_DRICTORY));

    IndexWriterConfig config = new IndexWriterConfig(MatchVersion, analyzer);
    config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);

    IndexWriter writer = new IndexWriter(directory, config);
    StringField idField = new StringField("nid", String.valueOf(0), Field.Store.YES);
    TokenStream title_ts = null;
    TokenStream content_ts = null;
    for (News n : newsList) {
        Document doc = new Document();
        idField.setStringValue(String.valueOf(n.getId()));
        content_ts = analyzer.tokenStream("content",
                new StringReader(HTMLFilter.delHTMLTag(n.getNewsContext())));
        title_ts = analyzer.tokenStream("title", new StringReader(n.getNewsTitle()));
        getTokens(content_ts);
        doc.add(idField);
        doc.add(new TextField("content", content_ts));
        doc.add(new TextField("title", title_ts));
        writer.addDocument(doc);
    }
    if (content_ts != null) {
        try {
            content_ts.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    writer.close(true);
    directory.close();
}



I just want to use TokenStream to get the tokenized result, but I get a
NullPointerException, as follows:
Exception in thread "main" java.lang.NullPointerException
at 
org.wltea.analyzer.core.AnalyzeContext.fillBuffer(AnalyzeContext.java:124)
at org.wltea.analyzer.core.IKSegmenter.next(IKSegmenter.java:122)
at 
org.wltea.analyzer.lucene.IKTokenizer.incrementToken(IKTokenizer.java:78)
at 
org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:102)
at 
org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:254)
at 
org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:256)
at 
org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:376)
at 
org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1473)
at 
org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1148)
at 
org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1129)
at manage.lucene.LuceneTools.testIndex(LuceneTools.java:130)
at manage.lucene.LuceneTools.main(LuceneTools.java:95)

How can I solve this problem? Thanks~

Re: How to use TokenStream build two fields

2013-04-23 Thread Simon Willnauer
hey there,

I think your English is perfectly fine! Given the info you provided
it's very hard to answer your question... I can't look into
org.wltea.analyzer.core.AnalyzeContext.fillBuffer(AnalyzeContext.java:124),
but apparently there is a NullPointerException happening there. Maybe you can
track it down to that class or debug it, but from my perspective we
can't really help here.

simon




Re: Reading Payloads

2013-04-23 Thread Carsten Schnober
Am 23.04.2013 13:47, schrieb Carsten Schnober:
> I'm trying to figure out a way to use a query as Uwe suggested. My
> scenario is to perform a query and then retrieve some of the payloads
> upon user request, so there no obvious way to wrap this into a query as
> I can't know what (terms) to query for.

I wonder: is there a way to perform a (Span)Query restricting the search
to tokens within certain offsets in a document, e.g. by a Filter?
Thanks!
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform




org.apache.lucene.classification - bug in SimpleNaiveBayesClassifier

2013-04-23 Thread Alexey Anatolevitch
Hi,

Is anybody actively working on the classification package?

I was trying it with 4.2.1, and SimpleNaiveBayesClassifier seems to have a
bug: the local copy of the BytesRef referenced by foundClass is affected by
subsequent TermsEnum.iterator.next() calls, as the shared BytesRef.bytes
changes... I can provide a test case if that was not clear.

I believe either BytesRef.clone() needs to create a full copy of the
underlying array, or SimpleNaiveBayesClassifier needs a local fix to actually
copy the bytes instead of calling clone().
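
For illustration only, a self-contained toy (not the classifier code itself)
showing the difference between the shallow clone() and deepCopyOf() when the
underlying buffer is reused:

import java.nio.charset.StandardCharsets;

import org.apache.lucene.util.BytesRef;

public class DeepCopyDemo {
  public static void main(String[] args) {
    // Simulate the byte[] that a TermsEnum reuses across next() calls.
    byte[] shared = "classA".getBytes(StandardCharsets.UTF_8);
    BytesRef current = new BytesRef(shared);

    BytesRef shallow = current.clone();            // still references the shared array
    BytesRef deep = BytesRef.deepCopyOf(current);  // owns an independent copy of the bytes

    shared[5] = 'B'; // the enum moves on and overwrites its buffer

    System.out.println(shallow.utf8ToString()); // "classB" -- silently changed
    System.out.println(deep.utf8ToString());    // "classA" -- unaffected
  }
}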

Alexey


Re: Reading Payloads

2013-04-23 Thread Alan Woodward
There's the SpanPositionCheckQuery family - SpanPositionRangeQuery, SpanFirstQuery,
etc.  Is that the sort of thing you're looking for?
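
For example, something along these lines (field and term names are just
placeholders; note these restrict by token position, not character offset):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanFirstQuery;
import org.apache.lucene.search.spans.SpanPositionRangeQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

class PositionRestrictedSpans {

  // Matches only if the span ends within the first 10 positions of the field.
  static SpanQuery firstTen(String field, String text) {
    return new SpanFirstQuery(new SpanTermQuery(new Term(field, text)), 10);
  }

  // Matches only if the span lies between the given token positions.
  static SpanQuery betweenPositions(String field, String text, int start, int end) {
    return new SpanPositionRangeQuery(new SpanTermQuery(new Term(field, text)), start, end);
  }
}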

Alan Woodward
www.flax.co.uk





Re: Reading Payloads

2013-04-23 Thread Carsten Schnober
Am 23.04.2013 15:27, schrieb Alan Woodward:
> There's the SpanPositionCheckQuery family - SpanRangeQuery, SpanFirstQuery, 
> etc.  Is that the sort of thing you're looking for?

Hi Alan,
thanks for the pointer, this is the right direction indeed. However,
these queries are based on a SpanQuery which depends on a specific
expression to search for. In my use case, I need to retrieve Spans
specified by their offsets only, and then get their payloads and process
them further. Alternatively, I could query for the occurrence of certain
string patterns in the payloads and check the offsets subsequently, but
either way I'm no longer interested in the actual term at that point.
I don't see a way to do this with these Query types, or is there?
Carsten


-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform




Re: Reading Payloads

2013-04-23 Thread Alan Woodward
Hi Carsten,

It doesn't sound as though an inverted index is really what you want to be 
querying here, if I'm reading you right.  You want to get the payloads for 
spans at a specific position, but you don't particularly care about the actual 
term at that position?  You might find that BinaryDocValues are a better fit 
here, but it's difficult to tell without knowing what your actual use case is.
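
Just to sketch what that could look like (the field name, helper class and blob
encoding are made up; this is only an illustration): index the annotations once
per document as a BinaryDocValuesField and look them up by document id at
search time, without any query:

import java.io.IOException;

import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.util.BytesRef;

class AnnotationDocValues {

  // Indexing side: attach the per-document annotation blob to the document.
  static void addAnnotations(Document doc, byte[] annotationBytes) {
    doc.add(new BinaryDocValuesField("annotations", new BytesRef(annotationBytes)));
  }

  // Search side: fetch the blob for a top-level document id, without any query.
  static BytesRef readAnnotations(IndexReader reader, int docId) throws IOException {
    for (AtomicReaderContext ctx : reader.leaves()) {
      int local = docId - ctx.docBase;
      if (local < 0 || local >= ctx.reader().maxDoc()) {
        continue; // docId lives in another segment
      }
      BinaryDocValues values = ctx.reader().getBinaryDocValues("annotations");
      if (values == null) {
        return null; // field not present in this segment
      }
      BytesRef result = new BytesRef();
      values.get(local, result);
      return result;
    }
    return null;
  }
}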

Alan Woodward
www.flax.co.uk





Re: Reading Payloads

2013-04-23 Thread Carsten Schnober
Am 23.04.2013 16:17, schrieb Alan Woodward:

> It doesn't sound as though an inverted index is really what you want to be 
> querying here, if I'm reading you right.  You want to get the payloads for 
> spans at a specific position, but you don't particularly care about the 
> actual term at that position?  You might find that BinaryDocValues are a 
> better fit here, but it's difficult to tell without knowing what your actual 
> use case is.

Hi Alan,
you are right that this specific aspect is not really suitable for an
inverted index. I've still been hoping that I could misuse it for some
cases. Let me sketch my use case:
A user performs a query that is parsed and executed in the form of a
SpanQuery. The offsets of the match(es) are extracted and returned. From
that point on, the user uses these offsets to retrieve certain segments
of a document from an external database.
However, I also store additional information (linguistic annotations) in
the token payloads because they are also used for more complex queries
that filter matches depending on these payloads. As they are stored in
the index anyway, I thought I could as well extract them upon request. I
am aware that such a request wouldn't perform very well, but apart from
that, I think it would be very handy if I were able to extract the
payloads for a given span.
However, I can't find a way to do that other than via
TokenSources.getTokenStream, and that apparently doesn't work.
I'm now thinking about keeping the resulting Spans in memory so that I
could extract the payloads upon user request. However, that still
wouldn't allow me to extract the payloads of any other token, which would
be a typical use case, e.g. when a user wants to retrieve annotations for
adjacent tokens.
Carsten


-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform




Re: How to use TokenStream build two fields

2013-04-23 Thread 808
Hello!
Thank you for your reply. It is my oversight that I did not append the code at
AnalyzeContext.java:124.
But when I try to use the StandardAnalyzer to do the same thing, I get the same
exception.
Here is my code (the IndexWriter has already been initialized):
private static void indexFile(IndexWriter writer, File f) throws IOException {
    Analyzer analyzer = new StandardAnalyzer(MatchVersion);
    if (f.isHidden() || !f.exists() || !f.canRead()) {
        return;
    }
    System.out.println("Indexing " + f.getCanonicalPath());
    Document doc = new Document();
    Reader reader = new FileReader(f);
    TokenStream ts = analyzer.tokenStream("contents", reader);
    doc.add(new TextField("contents", ts));
    TokenStream fileName_ts = analyzer.tokenStream("name", new StringReader(f.getName()));
    doc.add(new TextField("name", fileName_ts)); // here is lucene.demo.Indexer.indexFile(Indexer.java:85)
    writer.addDocument(doc);
    ts.close();
    ts.close();
    fileName_ts.close();
}

The exception MyEclipse gives me is as follows:
Exception in thread "main" java.lang.NullPointerException
at 
org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:923)
at 
org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:1133)
at 
org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:180)
at 
org.apache.lucene.analysis.standard.StandardFilter.incrementToken(StandardFilter.java:49)
at 
org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
at 
org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:50)
at 
org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:102)
at 
org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:254)
at 
org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:256)
at 
org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:376)
at 
org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1473)
at 
org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1148)
at 
org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1129)
at lucene.demo.Indexer.indexFile(Indexer.java:85)
at lucene.demo.Indexer.indexDirectory(Indexer.java:66)
at lucene.demo.Indexer.indexDirectory(Indexer.java:64)
at lucene.demo.Indexer.Index(Indexer.java:53)
at lucene.demo.Indexer.main(Indexer.java:33)

I debugged the code. It is true that the Document built from the TokenStream is null.
In addition, when I try to use a single TokenStream to build a Field, there is
nothing wrong with it, and I can use the incrementToken() method to get the
tokenized result successfully.


Thanks for your help again.





