Reading Payloads

2013-04-23 Thread Carsten Schnober
Hi,
I'm trying to extract payloads from an index for specific tokens the
following way (inserting sample document number and term):

Terms terms = reader.getTermVector(16504, term);
TokenStream tokenstream = TokenSources.getTokenStream(terms);
while (tokenstream.incrementToken()) {
  OffsetAttribute offset = tokenstream.getAttribute(OffsetAttribute.class);
  int start = offset.startOffset();
  int end = offset.endOffset();
  String token = tokenstream.getAttribute(CharTermAttribute.class).toString();

  PayloadAttribute payloadAttr = tokenstream.addAttribute(PayloadAttribute.class);
  BytesRef payloadBytes = payloadAttr.getPayload();

  ...
}

This works fine for the OffsetAttribute and the CharTermAttribute, but
payloadAttr.getPayload() always returns null for all documents and all
tokens, unfortunately. However, I know that the payloads are stored in
the index as I can retrieve them through a SpanQuery with
Spans.getPayload(). I actually expect every token to carry a payload, as
my custom tokenizer implementation has the following lines:

public class KoraTokenizer extends Tokenizer {
  ...
  private PayloadAttribute payloadAttr =
addAttribute(PayloadAttribute.class);
  ...
  public boolean incrementToken() {
...
payloadAttr.setPayload(new BytesRef(payloadString));
...
  }
  ...
}

I've asserted that the payloadString variable is never an empty String
and, as I said above, I can retrieve the payloads with
Spans.getPayload(). So what am I doing wrong in my
tokenstream.addAttribute(PayloadAttribute.class) call? BTW, I used
tokenstream.getAttribute() before, as for the other attributes, but this
obviously threw an IllegalArgumentException, so I followed the
recommendation given in the documentation and replaced it with addAttribute().
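
For reference, a minimal sketch (assuming Lucene 4.x and a field indexed
with term vector positions, offsets, and payloads) of reading payloads
directly from a term vector via TermsEnum/DocsAndPositionsEnum, without
going through TokenSources; the document number and the field name "text"
are placeholders:

-
Terms vector = reader.getTermVector(docId, "text");
TermsEnum termsEnum = vector.iterator(null);
BytesRef termBytes;
DocsAndPositionsEnum dpe = null;
while ((termBytes = termsEnum.next()) != null) {
  // null if positions were not stored for this term vector
  dpe = termsEnum.docsAndPositions(null, dpe);
  if (dpe == null) continue;
  dpe.nextDoc(); // a term vector contains a single (virtual) document
  for (int i = 0; i < dpe.freq(); i++) {
    int position = dpe.nextPosition();
    BytesRef payload = dpe.getPayload(); // null if no payload at this position
    // process termBytes, position, dpe.startOffset(), dpe.endOffset(), payload ...
  }
}
-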

Thanks!
Carsten




-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform




Re: Reading Payloads

2013-04-23 Thread Carsten Schnober
On 23.04.2013 13:21, Michael McCandless wrote:
 Actually, term vectors can store payloads now (LUCENE-1888), so if that
 field was indexed with FieldType.setStoreTermVectorPayloads they should be
 there.
 
 But I suspect the TokenSources.getTokenStream API (which I think un-inverts
 the term vectors to recreate the token stream = very slow?) wasn't fixed to
 also carry the payloads through?

I use the following FieldType:

private final static FieldType textFieldWithTermVector = new
FieldType(TextField.TYPE_STORED);
textFieldWithTermVector.setStoreTermVectors(true);
textFieldWithTermVector.setStoreTermVectorPositions(true);
textFieldWithTermVector.setStoreTermVectorOffsets(true);
textFieldWithTermVector.setStoreTermVectorPayloads(true);

So I suppose your assumption is right that the
TokenSources.getTokenStream API is not ready to make use of this.

I'm trying to figure out a way to use a query as Uwe suggested. My
scenario is to perform a query and then retrieve some of the payloads
upon user request, so there is no obvious way to wrap this into a query,
as I can't know what (terms) to query for.
Best,
Carsten




Re: Reading Payloads

2013-04-23 Thread Carsten Schnober
On 23.04.2013 13:47, Carsten Schnober wrote:
 I'm trying to figure out a way to use a query as Uwe suggested. My
 scenario is to perform a query and then retrieve some of the payloads
 upon user request, so there is no obvious way to wrap this into a query,
 as I can't know what (terms) to query for.

I wonder: is there a way to perform a (Span)Query restricting the search
to tokens within certain offsets in a document, e.g. by a Filter?
Thanks!
Carsten




Re: Reading Payloads

2013-04-23 Thread Carsten Schnober
On 23.04.2013 15:27, Alan Woodward wrote:
 There's the SpanPositionCheckQuery family - SpanPositionRangeQuery, SpanFirstQuery,
 etc.  Is that the sort of thing you're looking for?

Hi Alan,
thanks for the pointer, this is the right direction indeed. However,
these queries are based on a SpanQuery which depends on a specific
expression to search for. In my use case, I need to retrieve Spans
specified by their offsets only, and then get their payloads and process
them further. Alternatively, I could query for the occurrence of certain
string patterns in the payloads and check the offsets subsequently, but
either way I'm no longer interested in the actual term at that point.
I don't see a way to do this with these Query types, or is there?
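
For reference, a minimal sketch (Lucene 4.x) of the position-check family
mentioned above; note that it restricts matches by token position, not by
character offset, and the field and term used here are placeholders:

-
SpanTermQuery inner = new SpanTermQuery(new Term("text", "house"));
// keep only matches whose span lies within token positions 10..20
SpanPositionRangeQuery positioned = new SpanPositionRangeQuery(inner, 10, 20);

Map<Term, TermContext> termContexts = new HashMap<Term, TermContext>();
for (AtomicReaderContext atomic : reader.leaves()) {
  Spans spans = positioned.getSpans(atomic, atomic.reader().getLiveDocs(), termContexts);
  while (spans.next()) {
    // spans.getPayload() etc.
  }
}
-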
Carsten





Re: Reading Payloads

2013-04-23 Thread Carsten Schnober
On 23.04.2013 16:17, Alan Woodward wrote:

 It doesn't sound as though an inverted index is really what you want to be 
 querying here, if I'm reading you right.  You want to get the payloads for 
 spans at a specific position, but you don't particularly care about the 
 actual term at that position?  You might find that BinaryDocValues are a 
 better fit here, but it's difficult to tell without knowing what your actual 
 use case is.

Hi Alan,
you are right that this specific aspect is not really suitable for an
inverted index. I've still been hoping that I could misuse it for some
cases. Let me sketch my use case:
A user performs a query that is parsed and executed in the form of a
SpanQuery. The offsets of the match(es) are extracted and returned. From
that point on, the user uses these offsets to retrieve certain segments
of a document from an external database.
However, I also store additional information (linguistic annotations) in
the token payloads because they are also used for more complex queries
that filter matches depending on these payloads. As they are stored in
the index anyway, I thought I could as well extract them upon request. I
am aware that such a request wouldn't perform very well, but apart from
that, I think it would be very handy if I were able to extract the
payloads for a given span.
However, I can't find a way to do that other than via
TokenSources.getTokenStream, and that apparently doesn't work.
I'm now thinking about storing the resulting Spans in memory so that I
could extract the payloads upon user request. However, that still
wouldn't allow me to extract the payloads of any other token, which would
be a typical use case when a user wants to retrieve annotations for
adjacent tokens, for example.
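
For reference, a minimal sketch (assuming Lucene 4.2+, where
BinaryDocValuesField is available) of storing one opaque annotation blob
per document as doc values and reading it back per index leaf; the field
name "annotations" is a placeholder:

-
// index time
Document doc = new Document();
doc.add(new BinaryDocValuesField("annotations", new BytesRef(annotationBytes)));
writer.addDocument(doc);

// search time: random access by (leaf-local) document id
for (AtomicReaderContext atomic : reader.leaves()) {
  BinaryDocValues dv = atomic.reader().getBinaryDocValues("annotations");
  if (dv == null) continue; // field absent in this segment
  BytesRef value = new BytesRef();
  dv.get(localDocId, value); // localDocId is relative to this leaf
  // decode 'value' ...
}
-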
Carsten





Re: Statically store sub-collections for search (faceted search?)

2013-04-15 Thread Carsten Schnober
On 12.04.2013 20:08, SUJIT PAL wrote:
 Hi Carsten,
 
 Why not use your idea of the BooleanQuery but wrap it in a Filter instead? 
 Since you are not doing any scoring (only filtering), the max boolean clauses 
 limit should not apply to a filter.

Hi Sujit,
thanks for your suggestion! I wasn't aware that the max clause limit
does not apply to a BooleanQuery wrapped in a filter. I suppose the
ideal way would be to use a BooleanFilter rather than a QueryWrapperFilter,
right?

However, I am also not sure how to apply a filter in my use case because
I perform a SpanQuery. Although SpanQuery#getSpans() does take a Bits
object as an argument (acceptDocs), I haven't been able to figure out
how to generate this Bits object correctly from a Filter object.

Best,
Carsten




Re: Statically store sub-collections for search (faceted search?)

2013-04-15 Thread Carsten Schnober
On 15.04.2013 11:27, Uwe Schindler wrote:

Hi again,

 You are somehow misusing acceptDocs and DocIdSet here, so you have
 to take care, semantics are different:
 - For acceptDocs null means all documents allowed - no deleted
 documents
 - For DocIdSet null means no documents matched

 Okay, as described above, I would now pass either the result of
 getLiveDocs() or a Bits.MatchAllBits instance as the acceptDocs argument to
 getDocIdSet():

 Map<Term, TermContext> termContexts = new HashMap<Term, TermContext>();
 AtomicReaderContext atomic = ...
 ChainedFilter filter = ...
 
 You just pass getLiveDocs(), no null check needed. Using your code would 
 bring a slowdown for indexes without deletions.

This makes sense to me, but now I get zero matches in all searches using
the filter. I am pondering this remark in the documentation of
Filter.getDocIdSet(AtomicReaderContext context, Bits acceptDocs):
"acceptDocs - Bits that represent the allowable docs to match (typically
deleted docs but possibly filtering other documents)"

I understand that getLiveDocs() returns the bit set that represents
NON-deleted documents, which seems to match the first part of
the description (allowable docs). However, why does it say "typically
deleted docs" in brackets? I had ignored this so far, but as I get zero
results now, this might be relevant.

I am also thinking about how to possibly make use of a
BitsFilteredDocIdSet in the following way:

ChainedFilter filter = ...
AtomicReaderContext atomic = ...

Bits alldocs = atomic.reader().getLiveDocs();
DocIdSet docids = filter.getDocIdSet(atomic, alldocs);
BitsFilteredDocIdSet filtered = new BitsFilteredDocIdSet(docids, alldocs);
Spans luceneSpans = sq.getSpans(atomic, filtered.bits(), termContexts);

However, the documentation of the constructor public
BitsFilteredDocIdSet(DocIdSet innerSet, Bits acceptDocs) does not make
it clear to me whether I am applying the arguments correctly. I especially
fail to understand the acceptDocs argument again:
"acceptDocs - Allowed docs, all docids not in this set will not be
returned by this DocIdSet"

Would this be the correct way to apply a filter on a SpanQuery?
Thanks!
Carsten




Re: Statically store sub-collections for search (faceted search?)

2013-04-15 Thread Carsten Schnober
On 15.04.2013 13:43, Uwe Schindler wrote:

Hi,

 Passing NULL means all documents are allowed, if this would not be the case, 
 whole Lucene queries and filters would not work at all, so if you get 0 docs, 
 you must have missed something else. If this is not the case, your filter may 
 behave wrong. Look at e.g. FilteredQuery, IndexSearcher or any other query in 
 Lucene that handles acceptDocs - those pass getLiveDocs() down. If they are 
 null, that means all documents are allowed. The javadocs on Scorer/Filter/... 
 should be more clear about this. Can you open an issue about Javadocs?

I'll open an issue as soon as I have understood how this should be
corrected. :)
I think I've pinpointed my problem: I use a TermsFilter, get a DocIdSet
with TermsFilter.getDocIdSet(atomic, atomic.reader().getLiveDocs()), and
eventually retrieve a Bits object from that with DocIdSet.bits().
However, the latter always returns null. Wrapping the TermsFilter into a
CachingWrapperFilter doesn't change that. I was using a
QueryWrapperFilter before, which would give me a DocIdSet object from
which I could get a proper Bits object to pass to SpanQuery.getSpans().
Is there any way I could extract a Bits object from a TermsFilter?


 Would this be the correct way to apply a filter on a SpanQuery?
 
 new FilteredQuery(SpanQuery,Filter)?

Okay, I formulated the question wrongly. I need to call
SpanQuery.getSpans() because I have to process the resulting Spans
object. Therefore, I actually meant: what is the general way to generate
a Bits object from a Filter that can be used as the 'acceptDocs' argument?
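
For reference, a sketch (assuming Lucene 4.x) of the way this is commonly
done per leaf: use the DocIdSet's bits() when the implementation offers
random access, and otherwise materialize a FixedBitSet from its iterator:

-
Map<Term, TermContext> termContexts = new HashMap<Term, TermContext>();
for (AtomicReaderContext atomic : reader.leaves()) {
  DocIdSet docIdSet = filter.getDocIdSet(atomic, atomic.reader().getLiveDocs());
  if (docIdSet == null) continue; // filter matches nothing in this segment
  Bits acceptDocs = docIdSet.bits(); // may be null (no random access)
  if (acceptDocs == null) {
    DocIdSetIterator it = docIdSet.iterator();
    if (it == null) continue;
    FixedBitSet fbs = new FixedBitSet(atomic.reader().maxDoc());
    int doc;
    while ((doc = it.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
      fbs.set(doc);
    }
    acceptDocs = fbs;
  }
  Spans spans = spanQuery.getSpans(atomic, acceptDocs, termContexts);
  while (spans.next()) {
    // ...
  }
}
-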

Best,
Carsten




No documents in TermsFilter.getDocIdSet()

2013-04-15 Thread Carsten Schnober
Hi,
tying in with the previous thread Statically store sub-collections for
search, I'm trying to focus on the root of the problem that has
occurred to me.

At first, I generate a TermsFilter with potentially many terms, all in one field:

-
List<Term> docnames = new ArrayList<Term>(resource.getDocIDs().size());
for (String docid : resource.getDocIDs()) {
  docnames.add(new Term("id", docid));
}
TermsFilter filter = new TermsFilter(docnames);
-

This filter is used to generate a DocIdSet object holding the allowable
documents in a loop over the atomic segments of my IndexReader reader:

-
for (AtomicReaderContext atomic : reader.leaves()) {
  DocIdSet docids = filter.getDocIdSet(atomic,
atomic.reader().getLiveDocs());
  DocIdSetIterator iterator = docids.iterator();
  while (iterator.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
...
  }
  ...
}
-

The while-loop is never entered, i.e. there are no documents in docids.
However, iterator() does return a DocIdSetIterator object that is not null. The
same technique works fine with another Filter (a QueryWrapperFilter). Is
this a bug or am I addressing the TermsFilter (or the resulting DocIdSet)
in the wrong way? Are there any working examples for how to get a
properly populated DocIdSet from a TermsFilter?

I read that the iterator() method has to be implemented for every
DocIdSet implementation. Also, TermsFilter.getDocIdSet() seems to return
either null or a FixedBitSet, which implements its iterator() via an
OpenBitSetIterator.

Best,
Carsten




Statically store sub-collections for search (faceted search?)

2013-04-12 Thread Carsten Schnober
Dear list,
I would like to create a sub-set of the documents in an index that is to
be used for further searches. However, the criteria that lead to the
creation of that sub-set are not predefined, so I think that faceted
search cannot be applied to this use case.

For instance:
A user searches for documents that contain token 'A' in a field 'text'.
These results form a set of documents that is persistently stored (in a
database). Each document in the index has a field 'id' that identifies
it, so these external IDs are stored in the database.

Later on, a user loads the document IDs from the database and wants to
execute another search on this set of documents only. However,
performing a search on the full index and subsequently filtering the
results against that list of documents takes very long if there are many
matches. This is obvious as I have to retrieve the external id from each
matching document and check whether it is part of the desired sub-set.
Constructing a BooleanQuery in the style id:Doc1 OR id:Doc2 ... is not
suitable either because there could be thousands of documents exceeding
any limit for Boolean clauses.

Any suggestions how to solve this? I would have gone for the Lucene
document numbers and store them as a bit set that I could use as a
filter during later searches, but I read that the document numbers are
ephemeral.

One possible way out seems to be to create another index from the
documents that have matched the initial search, but this seems quite an
overkill, especially if there are plenty of them...
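
For reference, a sketch of the filter-based direction suggested later in
this thread (no max-clause limit, no scoring), assuming the external IDs
are indexed in a field named "id":

-
// build a filter from the externally stored document IDs
List<Term> idTerms = new ArrayList<Term>();
for (String docid : storedIds) {
  idTerms.add(new Term("id", docid));
}
TermsFilter subCollection = new TermsFilter(idTerms);

// run any later query restricted to that sub-collection
Query userQuery = new TermQuery(new Term("text", "A"));
TopDocs hits = searcher.search(new FilteredQuery(userQuery, subCollection), 10);
-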

Thanks for any hint!
Carsten




Update a bunch of documents

2013-04-11 Thread Carsten Schnober
Hi,
I have the following scenario: I have an index of very large size
(although I'm testing with around 200,000 documents, it should scale to
many millions) and I want to perform a search on a certain field.
According to that search, I would like to manipulate a different field
for all the matching documents.
The only approach I could come up with so far would be to load the
matching documents' IDs into a Collector, iterate over them, load the
Document objects with IndexReader.document(docid), and manipulate them
one by one. Finally, I would delete all the documents matching the
initial query with IndexWriter.deleteDocuments(Query query) and write
the edited ones with IndexWriter.addDocuments(Iterable<? extends
Iterable<? extends IndexableField>> docs).

However, the iteration seems to be very time-consuming as it can concern
large portions of the indexed documents and I wonder if there is a
smarter way to perform the document manipulation. This is limited to one
field only (not the one on which the query is typically performed!),
shouldn't that help?
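
For reference, a minimal sketch (assuming each document carries a unique
"id" field and that the field to change is a stored field named
"annotations" - both placeholders) of the usual per-document update path;
Lucene offers no in-place field update, so each matching document is
re-added as a whole:

-
TopDocs hits = searcher.search(query, reader.maxDoc());
for (ScoreDoc sd : hits.scoreDocs) {
  // note: only stored field values are recovered here; index-time settings
  // of the fields must be re-applied when re-adding the document
  Document doc = reader.document(sd.doc);
  doc.removeField("annotations");
  doc.add(new StoredField("annotations", newValue));
  // updateDocument() = delete by unique term + add, in one call
  writer.updateDocument(new Term("id", doc.get("id")), doc);
}
writer.commit();
-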

Thanks!
Carsten




Re: Luke?

2013-03-14 Thread Carsten Schnober
On 13.03.2013 10:23, dizh wrote:
 I just recompiled it.

 Luckily, it didn't need much work: only a few modifications according
 to the Lucene 4.1 API change doc.

That's great news. Are you going to publish a ready-made version somewhere?
Also, in my experience Luke 4.0.0-ALPHA cannot deal with
indexes in which term vectors are stored. This might as well be caused
by the fact that custom field types are in use.
Carsten





Re: Rewrite for RegexpQuery

2013-03-12 Thread Carsten Schnober
On 11.03.2013 18:22, Michael McCandless wrote:
 On Mon, Mar 11, 2013 at 9:32 AM, Carsten Schnober
 schno...@ids-mannheim.de wrote:
 On 11.03.2013 13:38, Michael McCandless wrote:
 On Mon, Mar 11, 2013 at 7:08 AM, Uwe Schindler u...@thetaphi.de wrote:

 Set the rewrite method to e.g. SCORING_BOOLEAN_QUERY_REWRITE, then this 
 should work (after rewrite your query is a BooleanQuery, which supports 
 extractTerms()).

 ... as long as you don't exceed the max number of terms allowed by BQ
 (1024 by default, but you can raise it).

 True, I've noticed this meanwhile. Are there any recommendations for
 this setting where the limit is as large as possible while staying
 within a reasonable performance? Of course, this is highly subjective,
 but what's the magnitude here? Will a limit of 1,024,000 typically
 increase the query time by the factor 1,000 too?
 Carsten
 
 I think 1024 may already be too high ;)
 
 But really it depends on your situation: test different limits and see.
 
 How much slower a larger query is depends on the specifics of the terms ...

For the purpose of initial testing, I've increased the limit by a
factor of 1,000. As Uwe pointed out, I don't actually execute the query,
but only extract the terms. In this regard, there are no performance
issues with thousands of terms, although I have yet to perform a
systematic evaluation.
Best,
Carsten





Re: Rewrite for RegexpQuery

2013-03-12 Thread Carsten Schnober
On 12.03.2013 10:39, Uwe Schindler wrote:

 I would suggest to use my example code with the fake query and custom 
 rewrite. This does not have the overhead of BooleanQuery and more important: 
 You don't need to change the *global* and *static* default in BooleanQuery. 
 Otherwise you could introduce a denial of service case into your application, 
 if you at some other place execute a wildcard using Boolean rewrite with 
 unlimited number of terms.

Hi Uwe,
many thanks for your code sample! I've made tiny adaptations in
GetTermsRewrite to make the overridden methods match their counterparts
in the superclass (ScoringRewrite). I suppose that your version was not
written for Lucene 4.0, right? It looks like this now:

final class GetTermsRewrite extends ScoringRewrite<TermHolderQuery> {
  @Override
  protected void addClause(TermHolderQuery topLevel, Term term,
      int docCount, float boost, TermContext states) {
    topLevel.add(term);
  }

  @Override
  protected TermHolderQuery getTopLevelQuery() {
    return new TermHolderQuery();
  }

  @Override
  protected void checkMaxClauseCount(int count) throws IOException {
    // TODO Auto-generated method stub
  }
}
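
(TermHolderQuery comes from Uwe's sample code, which is not reproduced in
this thread; presumably it is a small dummy Query that merely collects the
rewritten terms. A hypothetical sketch of such a class, not the original
code:)

-
// hypothetical reconstruction of the term-collecting dummy query used above
final class TermHolderQuery extends Query {
  private final Set<Term> terms = new HashSet<Term>();

  void add(Term term) {
    terms.add(term);
  }

  Set<Term> getTerms() {
    return terms;
  }

  @Override
  public String toString(String field) {
    return "TermHolderQuery(" + terms + ")";
  }
}
-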


I'm not sure what checkMaxClauseCount() is supposed to do, though;
apart from that, everything works great. Thanks!


The code I use for calling this:

IndexSearcher searcher = ...;
String queryString = ...;

MultiTermQuery query = new RegexpQuery(new Term("text", queryString));
query.setRewriteMethod(new GetTermsRewrite());
TermHolderQuery queryRewritten = (TermHolderQuery) searcher.rewrite(query);
Set<Term> terms = queryRewritten.getTerms();


There's another thing that is not entirely clear to me: when calling
query.setRewriteMethod(new GetTermsRewrite()), does this really apply to
the IndexSearcher, in the sense that IndexSearcher.rewrite() uses the
given rewrite method? It seems to work fine, but I am not sure why it
does and whether it always will.

Best,
Carsten





Term Statistics for MultiTermQuery

2013-03-12 Thread Carsten Schnober
Hi,
here's another question involving MultiTermQueries. My aim is to get a
frequency count for a MultiTermQuery without needing to execute the
query. The naive approach would be to create the Query, extract the
terms, and get each term's frequency, approximately as follows:

IndexSearcher searcher = ...;
PrefixQuery query = new PrefixQuery(new Term("field", "abc"));
Query rewritten = searcher.rewrite(query);
Set<Term> terms = new HashSet<Term>();
rewritten.extractTerms(terms);
...

And eventually read the term frequencies for each term. However, this
seems rather costly for a large number of terms and I am actually
interested in the total frequencies, so there would be no need for a
term-by-term analysis.
My use case is that I have an index containing part-of-speech tags in
the form "tag:token", and I may be searching for tag frequencies.
My alternative solution would be to create a dedicated index in which
the original tokens are completely replaced by the tags, so that I had
documents of the form "DET NN ..." and corresponding tokens. Would you
rather recommend this?
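
For reference, a short sketch of the term-by-term variant, summing
IndexReader.totalTermFreq() over the extracted terms (assuming the rewrite
produced a query that supports extractTerms()):

-
long total = 0;
for (Term t : terms) {
  long freq = reader.totalTermFreq(t); // -1 if the codec does not store this statistic
  if (freq > 0) {
    total += freq;
  }
}
-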

Thanks,
Carsten





Rewrite for RegexpQuery

2013-03-11 Thread Carsten Schnober
Hi,
I'm trying to get the terms that match a certain RegexpQuery. My (naive)
approach:

1. Create a RegexpQuery from the queryString (e.g. "abc.*"):
Query q = new RegexpQuery(new Term("text", queryString));

2. Rewrite the Query using the IndexReader reader:
q = q.rewrite(reader);

3. Write the terms into a previously initialized empty set terms:
Set<Term> terms = new HashSet<Term>();
q.extractTerms(terms);

However, this results in an empty set. I believe this is due to the fact
that the rewritten query is a ConstantScoreQuery object;
q.extractTerms(terms) does not yield any terms anyway. q.getQuery()
returns null, however; according to the documentation, this should happen
when it wraps a filter, which it supposedly does not.
This is Lucene 4.0. Any hints?
Thanks!
Carsten





Re: Rewrite for RegexpQuery

2013-03-11 Thread Carsten Schnober
On 11.03.2013 12:08, Uwe Schindler wrote:

 This works for this query, but in general you have to rewrite until it is 
 completely rewritten: A while loop that exits when the result of the rewrite 
 is identical to the original query. IndexSearcher.rewrite() does this for 
 you. 
 
 3. Write the terms into a previously initialized empty set terms:
 Set<Term> terms = new HashSet<Term>();
 q.extractTerms(terms);
 
 Set the rewrite method to e.g. SCORING_BOOLEAN_QUERY_REWRITE, then this 
 should work (after rewrite your query is a BooleanQuery, which supports 
 extractTerms()).

This does work for my case, thank you! For the sake of completeness,
the full solution (for my specific case) is as follows:

Set<Term> terms = new HashSet<Term>();
MultiTermQuery query = new RegexpQuery(new Term("text", queryString));
query.setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);
BooleanQuery bq = (BooleanQuery) query.rewrite(reader);
bq.extractTerms(terms);


Regarding the application of IndexSearcher.rewrite(Query) instead: I
don't see a way to set the rewrite method there because the Query's
rewrite method does not seem to apply to IndexSearcher.rewrite().

Best,
Carsten




Re: Rewrite for RegexpQuery

2013-03-11 Thread Carsten Schnober
On 11.03.2013 14:13, Uwe Schindler wrote:

 Regarding the application of IndexSearcher.rewrite(Query) instead: I don't
 see a way to set the rewrite method there because the Query's rewrite
 method does not seem to apply to IndexSearcher.rewrite().
 
 Replace:
 BooleanQuery bq = (BooleanQuery) query.rewrite(reader);
 
 With:
 BooleanQuery bq = (BooleanQuery) indexSearcher.rewrite(query);
 
 Otherwise you have to create a while-loop that rewrites the return value 
 again until rewrite() returns itself.

Right, I was under the false impression that the rewrite method set for
the initial query was not considered by IndexSearcher.rewrite().
Best,
Carsten




Re: Rewrite for RegexpQuery

2013-03-11 Thread Carsten Schnober
On 11.03.2013 13:38, Michael McCandless wrote:
 On Mon, Mar 11, 2013 at 7:08 AM, Uwe Schindler u...@thetaphi.de wrote:
 
 Set the rewrite method to e.g. SCORING_BOOLEAN_QUERY_REWRITE, then this 
 should work (after rewrite your query is a BooleanQuery, which supports 
 extractTerms()).
 
 ... as long as you don't exceed the max number of terms allowed by BQ
 (1024 by default, but you can raise it).

True, I've noticed this in the meantime. Are there any recommendations for
this setting, where the limit is as large as possible while staying
within reasonable performance? Of course, this is highly subjective,
but what's the magnitude here? Will a limit of 1,024,000 typically
increase the query time by a factor of 1,000 too?
Carsten





ProximityQueryNode

2013-02-21 Thread Carsten Schnober
Hi,
I'm interested in the functionality supposedly implemented through
ProximityQueryNode. Currently, it seems like it is not used by the
default QueryParser or anywhere else in Lucene, right? This makes
perfect sense since I don't see a Lucene index storing any notion of
sentences, paragraphs, etc. Is that right too?
I would be interested whether anyone (else) is working on implementing
this in some query parser, and in any theoretical and practical
approaches to indexing the given types. Also, I think that the type
should (at some point in the future) be more flexible than the given
values enumerated in the class so that one could also index arbitrary
custom units, e.g. pages, discourse units, syntactic chunks, etc.

My current approach to indexing sentence and paragraph information is to
store it in token payloads and then check for matching tokens whether
their respective sentences satisfy the given distance query. Any
better ideas?
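
For reference, a minimal sketch (hypothetical, not the KorAP code) of a
TokenFilter that stamps each token with the ID of its sentence as a
payload; the sentence-boundary check is only stubbed out here. It uses
org.apache.lucene.analysis.TokenFilter/TokenStream, the tokenattributes
package, and org.apache.lucene.util.BytesRef:

-
final class SentencePayloadFilter extends TokenFilter {
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private int sentenceId = 0;

  SentencePayloadFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // store the current sentence ID as this token's payload
    payloadAtt.setPayload(new BytesRef(Integer.toString(sentenceId)));
    if (isSentenceFinal(termAtt)) { // hypothetical boundary check
      sentenceId++;
    }
    return true;
  }

  private boolean isSentenceFinal(CharSequence term) {
    return term.length() == 1 && term.charAt(0) == '.';
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    sentenceId = 0;
  }
}
-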

Best,
Carsten




Re: ANTLR and Custom Query Syntax/Parser

2013-01-30 Thread Carsten Schnober
On 29.01.2013 00:24, Trejkaz wrote:
 On Tue, Jan 29, 2013 at 3:42 AM, Andrew Gilmartin
 and...@andrewgilmartin.com wrote:
 When I first started using Lucene, Lucene's Query classes were not suitable
 for use with the Visitor pattern and so I created my own query class
 equivalents and other more specialized ones. Lucene's classes might have
 changed since then (I do not know).
 
 On that subject, the infrastructure behind StandardQueryParser is
 along those lines. Query itself is still not very flexible, but
 QueryNode is much more convenient and there are processors for walking
 the tree to do transformations.
 
 We ended up using ANTLR to do the syntax parsing for our stuff and
 then using most of the standard transformations as-is, decorated in
 some cases (either to customise or to work around bugs.) Of course we
 had to add our own for all the new features, but we got a fair bit of
 reuse out of the new framework.

Hi,
thanks for your hints, everyone! I am still a little bit puzzled about
where to start though.
The general task is to generate SpanQueries from the tree provided by
the ANTLR query syntax parser. The special feature about that query
language (that I have not specified and that I cannot change) is that
there are binary operators such as /s0 indicating that the payloads of
two tokens have to be identical and implying an AND. The query A /s0 B
means find documents that contain A AND B where A and B have identical
payloads.
My intuitive solution would be to make a filter from a BooleanQuery with
A AND B and apply that filter in two separate SpanTermQuerys for A and
for B respectively. Then, I would perform an intersection on the hits
based on the payloads.
However, I am still puzzled how to approach this coming from an
ANTLR-generated tree. This may be due to a certain lack of routine in
dealing with ANTLR output, but when the parser returns an object of some
subclass of RuleReturnScope, how would I be able to derive appropriate
Lucene Query subclasses?
Best,
Carsten





Re: Lucene 4.0 scalability and performance.

2012-12-24 Thread Carsten Schnober
On 23.12.2012 12:11, vitaly_arte...@mcafee.com wrote:


 This means that we need to index millions of documents with terabytes of
 content and search in it.
 For now we want to define only one indexed field, containing the content of
 the documents, with the possibility to search for terms and retrieve the term
 offsets.
 Has somebody already tested Lucene with terabytes of data?
 Does Lucene have some known limitations related to the number of indexed
 documents or to the size of the indexed documents?
 What about search performance on huge sets of data?

Hi Vitali,
we've been working on a linguistic search engine based on Lucene 4.0 and
have performed a few tests with large text corpora. There is at least
some overlap with the functionality you mentioned (term offsets). See
http://www.oegai.at/konvens2012/proceedings/27_schnober12p/ (mainly
section 5).
Carsten




Match intersection by Payload

2012-12-19 Thread Carsten Schnober
Hi,
I have a search scenario in which I search for multiple terms and retain
only those matches that share a common payload. I'm using this to
search for multiple terms that all occur in one sentence; I've stored a
sentence ID in the payload of each token.

So far, I've done so by specifying a list of terms and creating a
BooleanQuery that connects these terms (as in [house, car]) with
Occur.MUST. That BooleanQuery is wrapped into a filter.
In the next step, I perform a separate SpanQuery for each of the terms
(one for "house" and one for "car"), using the previously created
filter's DocIdSet to restrict the search to documents that contain all
of the terms, e.g. for "house":

SpanQuery sq = (SpanQuery) new SpanMultiTermQueryWrapper<RegexpQuery>(
    new RegexpQuery(new Term("text", "house"))).rewrite(reader);

The resulting spans are stored in a map with the terms as keys and the
matching Spans as values. Finally, I retain only those matches that have
the same payload (=sentence) in the same document.

This works well for ordinary terms and is reasonably fast since the
SpanQueries are typically restricted to a manageable document set.
However, I would prefer to use the Lucene query language rather than
specifying a static list of terms, especially because I'd like to have
features such as regular expressions, wildcards, ranges, etc.
However, this makes the above solution impossible because the
QueryParser can expand what is meant to be one term (e.g. "hous*") into
multiple ones ("house", "houses"). Then, the intersection as described
above no longer makes sense: I don't want sentences that
contain both "house" and "houses", but sentences that contain either one
and "car" too.

I have three potential solutions in my mind:

1. Track back the terms generated by a rewritten MultiTermQuery.
I could try to figure out automatically whether the terms retrieved from
the StandardQueryParser should be combined by union (as they are derived
from the same term, as in "hous*") or by intersection (as "hous*" and
"car"). I'm not sure how to do that reliably, though, because the single
terms are extracted only after generating a Query through a
StandardQueryParser, and thus there is no distinction between these terms.

2. Implement my own QueryParser that distinguishes between terms
that are derived from one regex ("hous*") and those that are derived
from another ("car"). In that case, the scenario from 1. with unions and
intersections would be easy, logically at least.

3. Use a PayloadTermQuery. In that case, I'd hope to throw away the
apparently redundant query generation (one for the filter and one for
the SpanQuery) and substitute it with a Query that makes matching payloads
a pre-condition. I'm not sure how to do that either, as I don't know
beforehand which payload string to match; it just has to be the same for
the different terms.

All these ways seem equally promising (and complicated) to me, so do you
have any advice on which one is most likely to lead to an actual
solution?

Thanks,
Carsten





Re: Boolean and SpanQuery: different results

2012-12-19 Thread Carsten Schnober
On 13.12.2012 18:00, Jack Krupansky wrote:
 Can you provide some examples of terms that don't work and the index
 token stream they fail on?
 
 Make sure that the Analyzer you are using doesn't do any magic on the
 indexed terms - your query term is unanalyzed. Maybe multiple, but
 distinct, index terms are analyzing to the same, but unexpected term.

Apart from the answer I've already given myself, here's another note
about the issue. I've been using WhitespaceAnalyzer for both indexing
and query parsing, but apparently the query parser lowercases expanded
terms by default, while WhitespaceAnalyzer does not lowercase at indexing
time. Therefore, QueryParser.setLowercaseExpandedTerms(false) is necessary
in order to get the same results.
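
For reference, a two-line sketch of the setting in question (the "text"
field name and the analyzer choice are placeholders):

-
QueryParser parser = new QueryParser(Version.LUCENE_40, "text",
    new WhitespaceAnalyzer(Version.LUCENE_40));
parser.setLowercaseExpandedTerms(false); // keep wildcard/regexp terms as typed
-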

Best,
Carsten





Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-18 Thread Carsten Schnober
On 18.12.2012 12:36, Michael McCandless wrote:
 On Thu, Dec 13, 2012 at 8:32 AM, Carsten Schnober
 schno...@ids-mannheim.de wrote:


 This is a relatively easy example, but how would deal with e.g.
 annotations that include multiple tokens (as in spans), such as chunks,
 or relations between tokens (and token spans), as in the coreference
 links example given by Steven above?
 
 I think you'd do something like what SynonymFilter does for
 multi-token synonyms.
 
 Eg a synonym for "wireless network" -> "wifi" would insert a new token
 ("wifi"), overlapped on "wireless".
 
 Lucene doesn't store the end span, but if this is really important for
 your use case, you could add a payload to that wifi token that would
 encode the number of positions that the inserted token spans (2 in
 this case), and then the information would be present in the index.
 
 You'd still need to do something custom at read/search time to decode
 this end position and do something interesting with it ...

Thanks for the pointer!
I'm still puzzled whether there is an optimal way to encode
(labelled) relations between tokens or even spans; the latter part would
probably lead back to the synonym-like solution.
Best,
Carsten




Re: Boolean and SpanQuery: different results

2012-12-17 Thread Carsten Schnober
On 13.12.2012 18:00, Jack Krupansky wrote:
 Can you provide some examples of terms that don't work and the index
 token stream they fail on?
 
 Make sure that the Analyzer you are using doesn't do any magic on the
 indexed terms - your query term is unanalyzed. Maybe multiple, but
 distinct, index terms are analyzing to the same, but unexpected term.

I've done some further analysis and it turns out that, for some reason,
the SpanQuery described previously only returns matches for the first entry
(of 18 existing ones) in the list returned by reader.leaves().

As stated in my first post in this thread, my code builds a SpanQuery
for each AtomicReaderContext in a list retrieved through
MultiReader.leaves(). That SpanQuery is identical to a BooleanQuery with
TermQueries for the exactly same terms performed with
IndexSearcher.search() on that same MultiReader.

The document ids of the hits found through the SpanQuery correspond to
the ones returned by the BooleanQuery for the same term. However, the
documents returned by the BooleanQuery that do not lie within the first
AtomicReaderContext are not found by the SpanQuery.

Might this have to do with the docbase? I collect the document IDs from
the BooleanQuery through a Collector, adding the actual ID to the
current AtomicReaderContext.docbase. In the corresponding SpanQuery, I
pass these document IDs as a DocIdBitSet as an argument to
SpanQuery.getSpans().

Thanks!
Carsten





Re: Boolean and SpanQuery: different results

2012-12-17 Thread Carsten Schnober
On 17.12.2012 11:54, Carsten Schnober wrote:

 Might this have to do with the docbase? I collect the document IDs from
 the BooleanQuery through a Collector, adding the actual ID to the
 current AtomicReaderContext.docbase. In the corresponding SpanQuery, I
 pass these document IDs as a DocIdBitSet as an argument to
 SpanQuery.getSpans().

Answering my own question that has made me think about the document base
issue: indeed, I should be collecting document IDs relative to their
respective AtomicReaderContext rather than adding the context's docbase
because the subsequent SpanQuery is performed within an
AtomicReaderContext as well.
Best,
Carsten





Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Carsten Schnober
On 13.12.2012 12:27, Michael McCandless wrote:

 For example:
  - part of speech of a token.
  - syntactic parse subtree (over a span).
  - semantically normalized phrase (to canonical text or ontological code).
  - semantic group (of a span).
  - coreference link.
 
 So for example part-of-speech is a per-Token-position attribute.
 
 Today the easiest way to handle this is to encode these attributes
 into a Payload, which is straightforward (make a custom TokenFilter
 that creates the payload).
 
 At search time you would then use e.g. PayloadTermQuery to decode the
 Payload and do something with it to alter how the query is being
 scored.

This is a relatively easy example, but how would one deal with e.g.
annotations that include multiple tokens (i.e. spans), such as chunks,
or relations between tokens (and token spans), as in the coreference
links example given by Steven above?
Best,
Carsten




Boolean and SpanQuery: different results

2012-12-13 Thread Carsten Schnober
Hi,
I'm following Grant's advice on how to combine BooleanQuery and
SpanQuery
(http://mail-archives.apache.org/mod_mbox/lucene-java-user/201003.mbox/%3c08c90e81-1c33-487a-9e7d-2f05b2779...@apache.org%3E).

The strategy is to perform a BooleanQuery, get the document ID set and
perform a SpanQuery restricted by those documents. The purpose is that I
need to retrieve Spans for different terms in order to extract their
respective payloads separately, but a precondition is that possibly
multiple terms occur within the documents. My code looks like this:

/* reader and terms are class variables and have been declared final
before */
IndexReader reader = ...;
List<String> terms = ...;

/* perform the BooleanQuery and store the document IDs in a BitSet */
BitSet bits = new BitSet(reader.maxDoc());
AllDocCollector collector = new AllDocCollector();
BooleanQuery bq = new BooleanQuery();
for (String term : terms)
  bq.add(new org.apache.lucene.search.RegexpQuery(new
    Term(config.getFieldname(), term)), Occur.MUST);
IndexSearcher searcher = new IndexSearcher(reader);
searcher.search(bq, collector);
for (ScoreDoc doc : collector.getHits())
  bits.set(doc.doc);

/* get the spans for each term separately */
for (String term : terms) {
  String payloads = retrieveSpans(term, bits);
  // process and print payloads for term ...
}

String retrieveSpans(String term, BitSet bits) {
  StringBuilder payloads = new StringBuilder();
  Map<Term, TermContext> termContexts = new HashMap<Term, TermContext>();
  Spans spans;
  SpanQuery sq = (SpanQuery) new SpanMultiTermQueryWrapper<RegexpQuery>(
      new RegexpQuery(new Term("text", term))).rewrite(reader);

  for (AtomicReaderContext atomic : reader.leaves()) {
    spans = sq.getSpans(atomic, new DocIdBitSet(bits), termContexts);
    while (spans.next()) {
      // extract and store payloads in the 'payloads' StringBuilder
    }
  }
  return payloads.toString();
}


This construction seemed to be working fine at first, but I noticed a
disturbing behaviour: for many terms, the BooleanQuery, when fed with only
one RegexpQuery, matches a larger number of documents than the SpanQuery
constructed from the same RegexpQuery.
With the BooleanQuery containing only one RegexpQuery, the numbers should
be identical, while with multiple Queries added to the BooleanQuery, the
SpanQuery should return an equal number of results or more. This behaviour
is reliably reproducible even after re-indexing, but not for all tokens.
Does anyone have an explanation for that?

Best,
Carsten




Re: Boolean and SpanQuery: different results

2012-12-13 Thread Carsten Schnober
On 13.12.2012 18:00, Jack Krupansky wrote:
 Can you provide some examples of terms that don't work and the index
 token stream they fail on?

The index I'm testing with is German Wikipedia and I've been testing
with different (arbitrarily chosen) terms. I'm listing some results; the
first number is the number of documents matched by the BooleanQuery, the
second the number of documents matched by the SpanQuery:

- Knacklaut 24/19
- schönes   70/70
- zufällige 71/70
- wunderbar 24/24
- Himmel773/753
- Sonne 1190/1152


 Make sure that the Analyzer you are using doesn't do any magic on the
 indexed terms - your query term is unanalyzed. Maybe multiple, but
 distinct, index terms are analyzing to the same, but unexpected term.

I'm using a custom Analyzer during indexing. Regarding the analyzer
applied during search, I'm not sure: as I haven't defined any specific
one, what does Lucene choose? I wasn't thinking about that because I
assumed that this should make no difference regarding the BooleanQuery
vs. SpanQuery issue.
Thanks for the hint anyway, I'll have a closer look there.
Best,
Carsten




SpanQuery and Bits

2012-12-06 Thread Carsten Schnober
Hi,
I have a problem understanding and applying the BitSets concept in
Lucene 4.0. Unfortunately, there does not seem to be a lot of
documentation about the topic.

The general task is to extract Spans matching a SpanQuery which works
with the following snippet:

for (AtomicReaderContext atomic : reader.getContext().leaves()) {   
  Spans spans = query.getSpans(atomic, new Bits.MatchAllBits(0),
termContexts);
  while (spans.next()) {
// extract payloads etc.
  }
}

I understand that the acceptDocs parameter in SpanQuery.getSpans()
restricts the search to a set of documents. In the example given above,
it searches all documents (Bits.MatchAllBits), right?

What I would like to do is generate a Bits object that is based on a
BooleanQuery beforehand in order to restrict the search through
getSpans() to a set of documents that contain certain terms.
I also have a MultiReader object that handles multiple indexes.
My intuitive approach would be to apply a QueryWrapperFilter like this:

MultiReader reader = ...;
BooleanQuery bq = ...;
DocIdSet bitset = ???;
Filter filter = new QueryWrapperFilter(bq);
for (AtomicReaderContext context : reader.getContext().leaves()) {
  filter.getDocIdSet(context, new Bits.MatchAllBits(0));
}

The obvious question is: how do I handle the context bitsets returned by
getDocIdSet() correctly so that I can pass the 'bitset' variable to the
getSpans() call?

Or am I on the wrong path for this kind of problem?
Thanks!
Carsten





Specialized Analyzer for names

2012-11-23 Thread Carsten Schnober
Hi,
I'm indexing names in a dedicated Lucene field and I wonder which
analyzer to use for that purpose. Typically, the names are in the format
"John Smith", so the WhitespaceAnalyzer is likely the best in most
cases. The field type to choose seems to be the TextField.
Or, would you rather recommend using the KeywordAnalyzer? I'm a bit
cautious about that because I'm afraid of wildcard or regex queries such
as "*Smith" or ".*Smith" respectively.

However, there might also be special cases and spelling exceptions of
all kinds, e.g. "Smith, John", "John 'Hammer' Smith", "Abd al-Aziz",
"Stan van Hoop", and whatever else one could imagine. Is there a special
Analyzer that is optimized for dealing with such cases or do I have to do
normalization beforehand?
I see that such special characters and spellings can easily be covered
by the right queries, but that requires the user to know the exact
spelling, which is what I'm trying to spare her.

Best regards,
Carsten




Potential Resource Leak warning in Analyer.createComponents()

2012-11-21 Thread Carsten Schnober
Hi,
I use a custom analyzer and tokenizer. The analyzer is very basic and it
merely comprises the method createComponents():

-
@Override
protected TokenStreamComponents createComponents(String fieldName,
Reader reader) {
  return new TokenStreamComponents(new KoraTokenizer(reader));
}
-

Eclipse gives me a "potential resource leak" warning, though, because the
tokenizer is never closed. This is clearly true, but closing it is not
desirable either, is it?
To get rid of the warning, I had experimentally changed the method to this:

Tokenizer source = new KoraTokenizer(reader);
TokenStreamComponents ts = new TokenStreamComponents(source);
source.close();
return ts;

This yields what I had expected, namely a null TokenStream during
analysis. So regarding the results, I think the initial version is
right, but I am suspicious of the resource leak warning. How
serious is it?
Best,
Carsten





Re: TokenStreamComponents in Lucene 4.0

2012-11-20 Thread Carsten Schnober
On 19.11.2012 17:44, Carsten Schnober wrote:

Hi,

 However, after switching to Lucene 4 and TokenStreamComponents, I'm
 getting a strange behaviour: only the first document in the collection
 is tokenized properly. The others do appear in the index, but
 un-tokenized, although I have tried not to change anything in the logic.
 The Analyzer now has this createComponents() method calling the custom
 TokenStreamComponents class with my custom Tokenizer:

After some debugging, it turns out that the Analyzer method
createComponents() is called only once, for the first document. This
seems to be the problem: the other documents are just not analyzed.
Here's the loop that creates the fields and supposedly calls the
analyzer. Does anyone have a hint why this only happens for the
first document, even though the loop itself runs once for every document:

---

List<de.ids_mannheim.korap.main.Document> documents;
Version lucene_version = Version.LUCENE_40;
Analyzer analyzer = new KoraAnalyzer();
IndexWriterConfig config = new IndexWriterConfig(lucene_version, analyzer);
IndexWriter writer = new IndexWriter(dir, config);
[...]

for (de.ids_mannheim.korap.main.Document doc : documents) {
  luceneDocument = new Document();

  /* Store document name/ID */
  Field idField = new StringField(titleFieldName, doc.getDocid(),
Field.Store.YES);

  /* Store tokens */
  String layerFile = layer.getFile();
  Field textFieldAnalyzed = new TextField(textFieldName, layerFile,
Field.Store.YES);

  luceneDocument.add(textFieldAnalyzed);
  luceneDocument.add(idField);

  try {
writer.addDocument(luceneDocument);
  } catch (IOException e) {
    jlog.error("Error adding document "
        + doc.getDocid() + ":\n" + e.getLocalizedMessage());
  }
}
[...]
writer.close();
---

The class de.ids_mannheim.korap.main.Document defines our own document
objects from which the relevant information can be read as shown in the
loop. The list 'documents' is filled in an intermediately called method.
Best,
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform




Re: TokenStreamComponents in Lucene 4.0

2012-11-20 Thread Carsten Schnober
On 20.11.2012 10:22, Uwe Schindler wrote:

Hi,

 The createComponents() method of Analyzers is only called *once* for each 
 thread and the Tokenstream is *reused* for later documents. The Analyzer will 
 call the final method Tokenizer#setReader() to notify the Tokenizer of a new 
 Reader (this method will update the protected input field in the Tokenizer 
 base class) and then it will reset() the whole tokenization chain. The custom 
 TokenStream components must initialize themselves with the new settings on 
 the reset() method.

Thanks, Uwe!
I think what changed in comparison to Lucene 3.6 is that reset() is
called upon initialization, too, instead of after processing the first
document only, right? Apart from the fact that it used not to be
obligatory to make all components reusable, I suppose.
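
For my own understanding, a Tokenizer that follows this contract would look
roughly like the following (a sketch with made-up state, not the actual
KoraTokenizer):

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Reusable Tokenizer sketch: all per-document state is (re)initialized in
// reset(), which the Analyzer calls after setReader() for every new document.
public final class ReusableTokenizerSketch extends Tokenizer {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private boolean consumed;   // made-up per-document state

  public ReusableTokenizerSketch(Reader input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (consumed) {
      return false;
    }
    clearAttributes();
    // ... in a real tokenizer: read from the protected 'input' Reader
    // and fill the attributes accordingly ...
    termAtt.append("dummy");
    consumed = true;
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    consumed = false;   // start over for the next document
  }
}
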
Best,
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform




TokenStreamComponents in Lucene 4.0

2012-11-19 Thread Carsten Schnober
Hi,
I have recently updated to Lucene 4.0, but having problems with my
custom Analyzer/Tokenizer.

In the days of Lucene 3.6, it would work like this:

0. define constants lucene_version and indexdir
1. create an Analyzer: analyzer = new KoraAnalyzer() (our custom Analyzer)
2. create an IndexWriterConfiguration: config = new
IndexWriterConfig(lucene_version, analyzer)
3. create an IndexWriter writer = (indexdir, config)
4. for each document:
4.1. create a Document: Document doc = new Document()
4.2. create a Field: Field field = new Field(text, layerFile,
Field.Store.YES, Field.Index.ANALYZED_NO_NORMS,
Field.TermVector.WITH_POSITIONS_OFFSETS);
4.3. add field to document: doc.add(field)
4.4. add document to writer: writer.add(doc)
5. close the writer (write to disk)

However, after switching to Lucene 4 and TokenStreamComponents, I'm
getting a strange behaviour: only the first document in the collection
is tokenized properly. The others do appear in the index, but
un-tokenized, although I have tried not to change anything in the logic.
The Analyzer now has this createComponents() method calling the custom
TokenStreamComponents class with my custom Tokenizer:

@Override
protected TokenStreamComponents createComponents(String fieldName,
Reader reader) {
  final Tokenizer source = new KoraTokenizer(reader);
  final TokenStreamComponents tokenstream = new
KoraTokenStreamComponents(source);
  try {
source.close();
  } catch (IOException e) {
jlog.error(e.getLocalizedMessage());
e.printStackTrace();
  }
  return tokenstream;
}


The custom TokenStreamComponents class uses this constructor:

public KoraTokenStreamComponents(Tokenizer tokenizer) {
  super(tokenizer);
  try {
tokenizer.reset();
  } catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
  }
}


Since I have not changed anything in the Tokenizer, I suspect the error
to be in the new class KoraTokenStreamComponents. This may be due to the
fact that I do not fully understand why the TokenStreamComponents class
has been introduced.
Any hints on that? Thanks!
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform




Re: TokenStreamComponents in Lucene 4.0

2012-11-19 Thread Carsten Schnober
On 19.11.2012 17:44, Carsten Schnober wrote:

Hi again,
just a little update:

 However, after switching to Lucene 4 and TokenStreamComponents, I'm
 getting a strange behaviour: only the first document in the collection
 is tokenized properly. The others do appear in the index, but
 un-tokenized, although I have tried not to change anything in the logic.
 The Analyzer now has this createComponents() method calling the custom
 TokenStreamComponents class with my custom Tokenizer:
 
 @Override
 protected TokenStreamComponents createComponents(String fieldName,
 Reader reader) {
   final Tokenizer source = new KoraTokenizer(reader);
   final TokenStreamComponents tokenstream = new
 KoraTokenStreamComponents(source);
   try {
 source.close();
   } catch (IOException e) {
 jlog.error(e.getLocalizedMessage());
 e.printStackTrace();
   }
   return tokenstream;
 }

When using the packaged Analyzer.TokenStreamComponents class instead of
my custom KoraTokenStreamComponents class, the behaviour does not seem
to change:

-  final TokenStreamComponents tokenstream = new
KoraTokenStreamComponents(source);
+  final TokenStreamComponents tokenstream = new
TokenStreamComponents(source);

Best,
Carsten


-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform




Re: SpanQuery, Filter, BooleanQuery

2012-10-30 Thread Carsten Schnober
On 29.10.2012 13:40, Carsten Schnober wrote:

 Now, I'd like to add the option to filter the resulting Spans object by
 another WildcardQuery on a different field that contains document
 titles. My intuitive approach would have been to use a filter like this:

I'd like to sum up my previous post more concisely: I need to

a) combine two WildcardQueries so that I can still use
SpanMultiTermQueryWrapper to generate a SpanQuery.

b) apply a filter to a WildcardQuery so that the WildcardQuery's results
are reduced before converting it to a SpanQuery using
SpanMultiTermQueryWrapper.

Option b) intuitively seems the way to go, but I have not found the
correct approach yet because the filter does not work as intended (see
my previous post).
Option a) does not seem feasible here either because
SpanMultiTermQueryWrapper requires a MultiTermQuery, not a BooleanQuery.

Any hints on that?
Best,
Carsten


-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform




SpanQuery, Filter, BooleanQuery

2012-10-29 Thread Carsten Schnober
Hi,
I've got a setup in which I would like to perform an arbitrary query
over one field (typically realised through a WildcardQuery) and the
matches are returned as a SpanQuery because the result payloads are
further processed using Spans.next() and Spans.getPayload(). This works
fine with the following code (extract), using Lucene 4.0.0:

-
// these fields are initialized externally through public methods:
private final MultiReader reader;
private final String termString;
private final String fieldname;
private final int maxHits;

private Map<Term, TermContext> termContexts = new HashMap<Term, TermContext>();
WildcardQuery wildcard;
Term term = new Term(fieldname, termString);
SpanQuery query;// Lucene query
Spans luceneSpans;

wildcard = new WildcardQuery(term);
query = (SpanQuery) new
SpanMultiTermQueryWrapper<WildcardQuery>(wildcard).rewrite(reader);

for (AtomicReaderContext atomic : reader.getContext().leaves()) {
  spans = query.getSpans(atomic, matchingTitleIDs.bits(), termContexts);
  while (luceneSpans.next() && total <= maxHits) {
...
  }
}
-

Now, I'd like to add the option to filter the resulting Spans object by
another WildcardQuery on a different field that contains document
titles. My intuitive approach would have been to use a filter like this:

Filter filter = new QueryWrapperFilter(new WildcardQuery(new
Term("titlefield", titles)));

The filter is applied in a dedicated method with this line:

DocIdSet matchingTitleIDs = filter.getDocIdSet(context, new
Bits.MatchAllBits(0));

And subsequently, the getSpan() call from above is substituted by:

spans = query.getSpans(atomic, matchingTitleIDs.bits(), termContexts);

However, this yields either a NullPointerException when there are no
hits or does not affect the results at all in comparison to no filtering.

I've come across the thread "lucene-4.0: QueryWrapperFilter & docBase"
[1] in which Uwe suggests not to use QueryWrapperFilter, but to use
another Query and to combine it using a BooleanQuery in such a
scenario, if I understand correctly. Does this still hold for Lucene 4.0?
However, I am not sure how to use a BooleanQuery here because I need the
SpanQuery result.

Any thoughts about what I'm doing wrong and how to fix this?
Thank you very much!
Carsten


[1]
http://mail-archives.apache.org/mod_mbox/lucene-java-user/201210.mbox/%3CCABY_-Z7r=z0301yf1-1uvbqyw3jf48srpuhe6syt1eh28vn...@mail.gmail.com%3E

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform




Lucene in Corpus Linguistics

2012-09-26 Thread Carsten Schnober

Hi,
in case someone is interested in an application of the Lucene indexing 
engine in the field of corpus linguistics rather than information 
retrieval: we have worked on that subject for some time and have 
recently published a conference paper about it:

http://korap.ids-mannheim.de/2012/09/konvens-proceedings-online/

Central issues addressed in this work have been how to handle externally
produced and concurrent tokenizations as well as multiple linguistic
annotations on different levels.


Best,
Carsten

--
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform




UnsupportedOperationException: Query should have been rewritten

2012-08-14 Thread Carsten Schnober
Dear list,
I am trying to combine a WildcardQuery and a SpanQuery because I need to
extract spans from the index for further processing. I realise that
there have been a few public discussions about this topic around, but I
still fail to get what I am missing here. My code is this (Lucene 3.6.0):

==
WildcardQuery wildcard = new WildcardQuery(new Term("field", "bro*"));
SpanQuery query = new SpanMultiTermQueryWrapper<WildcardQuery>(wildcard);

// query = query.rewrite(reader);   
Spans luceneSpans = query.getSpans(reader);
==

This throws the following exception:
==
Exception in thread main java.lang.UnsupportedOperationException:
Query should have been rewritten at
org.apache.lucene.search.spans.SpanMultiTermQueryWrapper.getSpans(SpanMultiTermQueryWrapper.java:114)
==

I am basically aware of the problem that I cannot apply a MultiTermQuery
instance (like a WildcardQuery) without calling rewrite(), but on the
other hand, rewrite() returns a Query object that I cannot use as a
SpanQuery instance.

I'm almost sure that there is a reasonable solution for this problem
that I am not able to spot. Or do I have to either migrate to Lucene 4
or use a SpanRegexQuery instead, which I would rather not do because it
is marked as deprecated?

Thank you very much!
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform




Re: UnsupportedOperationException: Query should have been rewritten

2012-08-14 Thread Carsten Schnober
On 14.08.2012 11:00, Uwe Schindler wrote:
 You have to rewrite the wrapper query.

Thanks, Uwe! I had tried that way but it failed because the rewrite()
method would return a Query (not a SpanQuery) object. A cast seems to
solve the problem, I'm re-posting the code snippet to the list for the
sake of completeness:


WildcardQuery wildcard = new WildcardQuery(new Term("field", "bro*"));
SpanQuery query = (SpanQuery) new
SpanMultiTermQueryWrapper<WildcardQuery>(wildcard).rewrite(reader);
Spans spans = query.getSpans(reader);


All I am still wondering about is whether this cast is totally safe,
i.e. robust to all kinds of variable search terms.
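
Purely as a defensive sketch (same variables as in the snippet above), one
could check the rewrite result before casting:

Query rewritten = new SpanMultiTermQueryWrapper<WildcardQuery>(wildcard)
    .rewrite(reader);
if (!(rewritten instanceof SpanQuery)) {
  // should not happen for a SpanMultiTermQueryWrapper, but a clear message
  // beats a ClassCastException further down the line
  throw new IllegalStateException("Rewritten query is not a SpanQuery: "
      + rewritten.getClass().getName());
}
SpanQuery query = (SpanQuery) rewritten;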

Best,
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform




Re: Small Vocabulary

2012-08-07 Thread Carsten Schnober
On 06.08.2012 20:29, Mike Sokolov wrote:

Hi Mike,

 There was some interesting work done on optimizing queries including
 very common words (stop words) that I think overlaps with your problem.
 See this blog post
 http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2
 from the Hathi Trust.
 
 The upshot in a nutshell was that queries including terms with very
 large postings lists (ie high occurrences) were slow, and the approach
 they took to dealing with this was to index n-grams (ie pairs and
 triplets of adjacent tokens).  However I'm not sure this would help much
 if your queries will typically include only a single token.

This is very interesting for our use case indeed. However, you are right
that indexing n-grams is not (per se) a solution for my given problem
because I'm working on an application using multiple indexes. A query
for one isolated frequent term will indeed be rare presumably, or at
least rare enough to tolerate slow response times, but the results will
typically be intersected with results from other indexes.

To illustrate this more practically: the index I described, with its
relatively few distinct but in part extremely frequent tokens, indexes
part-of-speech (POS) tags with positional information stored in the
payload. A parallel index indexes the actual text; a typical query may look
for a certain POS tag in one index and a word X at the same position
with a matching payload in the other index. So both indexes need to be
queried completely before the intersection can be performed.

Best,
Carsten



-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform




Re: Small Vocabulary

2012-08-07 Thread Carsten Schnober
On 07.08.2012 10:20, Danil ŢORIN wrote:

Hi Danil,

 If you do intersection (not join), maybe it make sense to put every
 thing into 1 index?

Just a note on that: my application performs intersections and joins
(unions) on the results, depending on the query. So the index structure
has to be ready for both, but intersections are clearly more complicated.

 Just transform your input like brown fox into ADJ:brown|your
 payload NOUN:fox|other payload

I understand that this denotes ADJ and NOUN to be interpreted as the
actual token and brown and fox as payloads (followed by other
payload), right?

This is a very neat approach and I have vaguely considered that. One
problem is that I aim for a very high level of flexibility, meaning that
additional annotations have to be addable at any point and different
tokenizations apply. However, I will re-consider your suggestion,
possibly applying one of multiple tokenizations as a default in this sense.

 Of course I'm not aware of all the details, so my solution might not
 be applicable to your project.
 Maybe you could share more details, so this won't transform in XY problem.
 
 Keep in mind : always optimize your index for the query usecase,
 instead of blindly processing the input data.

Thanks for that reminder; this becomes quite difficult in my scenario
though since we want to allow for flexible changes in the index types,
representing different annotations, tokenization logics etc.
Best,
Carsten


-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform




Re: Small Vocabulary

2012-08-07 Thread Carsten Schnober
Hi Danil,

 Just transform your input like brown fox into ADJ:brown|your
 payload NOUN:fox|other payload
 
 I understand that this denotes ADJ and NOUN to be interpreted as the
 actual token and brown and fox as payloads (followed by other
 payload), right?

Sorry for replying to myself, but I've realised only now that you
probably meant to replace the full token string (brown) by ADJ:brown
and use the payload otherwise, right? Regarding incoming queries, this
method makes it necessary to perform a Wildcard query (e.g. NOUN:*)
when I am not interested in the actual text (brown) -- which may
happen more or less frequently -- am I right? However, this might be an
acceptable trade-off...
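
For illustration, the layout as I now understand it (the field name "text"
and the ':' separator are just my assumptions, not something from your mail):

// Combined-term sketch: the POS tag becomes part of the indexed term,
// while the surface form (plus other data) goes into the payload.
String pos = "NOUN";
String surface = "fox";
String indexedTerm = pos + ":" + surface;          // term text: "NOUN:fox"

// "Any noun, regardless of the actual word" then needs no general wildcard;
// a prefix query over the combined terms is sufficient:
Query allNouns = new PrefixQuery(new Term("text", "NOUN:"));
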
Best regards,
Carsten


-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform




Re: Small Vocabulary

2012-08-02 Thread Carsten Schnober
On 31.07.2012 12:10, Ian Lea wrote:

Hi Ian,

 Lucene 4.0 allows you to use custom codecs and there may be one that
 would be better for this sort of data, or you could write one.
 
 In your tests is it the searching that is slow or are you reading lots
 of data for lots of docs?  The latter is always likely to be slow.
 General performance advice as in
 http://wiki.apache.org/lucene-java/ImproveSearchingSpeed may be
 relevant.  SSDs and loads of RAM never hurt.

You are very right, there are many results from many docs for the
slower searches performed on that index. However, I am still wondering
about the theoretical implications: having a small vocabulary with many
tokens in an inverted index would yield a rather long list of
occurrences for some/many/all (depending on the actual distribution) of
the search terms.
Thanks for your pointer to the codecs in Lucene 4; I suppose that this
will be the actual point to attack for this scenario. It may be a silly
question, but one that might be of interest for the whole community ;-):
can someone point me to an in-depth documentation of Lucene 4 codecs,
ideally covering both the theoretical background and the implementation?
There are numerous helpful blog entries, presentations, etc. available on
the net, but if there is some central resource, I have not been able to
find it so far.
Thanks!
Best regards,
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform




Small Vocabulary

2012-07-30 Thread Carsten Schnober
Dear list,
I'm considering using Lucene for indexing sequences of part-of-speech
(POS) tags instead of words; for those who don't know, POS tags are
linguistically motivated labels that are assigned to tokens (words) to
describe their morpho-syntactic function. Instead of sequences of words, I
would like to index sequences of tags, for instance "ART ADV ADJA NN".
The aim is to be able to search (efficiently) for occurrences of "ADJA".

The question is whether Lucene can be applied to deal with that data
cleverly, because the statistical properties of such pseudo-texts are very
distinct from natural language texts and make me wonder whether Lucene's
inverted indexes are suitable. Especially the small vocabulary size (50
distinct tokens, depending on the tagging system) is problematic, I suppose.

First trials, for which I have implemented an analyzer that just outputs
Lucene tokens such as "ART", "ADV", "ADJA", etc., yield results that are
not exactly perfect regarding search performance in a test corpus with
a few million tokens. The number of tokens in production mode is
expected to be much larger, so I wonder whether this approach is
promising at all.
Does Lucene (4.0?) provide optimization techniques for extremely small
vocabulary sizes?

Thank you very much,
Carsten Schnober


-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform




Re: Offsets in 3.6/4.0

2012-07-17 Thread Carsten Schnober
On 16.07.2012 13:07, karsten-s...@gmx.de wrote:

Dear Karsten,

 abstract of your post:
 you need the offset to perform your search/ranking like the position is 
 needed for phrase queries.
 You are using reader.getTermFreqVector to get the offset. 
 This is too slow for your application and you think about a switch to version 
 4.0

Yes, that's about it.

 imho you should use payloads.
 You also could switch to version 4 because in version 4 you can store the 
 offset to each term like the position in version 3.x.
 But this is basically the same as the use of payloads:
  * http://lucene.apache.org/core/3_6_0/fileformats.html#Positions
  * 
 http://lucene.apache.org/core/4_0_0-ALPHA/core/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.html#Positions

I now use payloads and this fulfils my functional requirements. I was
hoping to avoid that because I am also storing other information in the
Payload which makes it feel a bit messy; especially as it seemed
sensible to me to actually make use of the Offsets field as it already
exists. Anyway, the problem is solved so far, thank you very much!
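
For completeness, the payload variant boils down to something like the
following sketch (Lucene 3.6 API; the fixed two-int byte layout is only for
illustration and not necessarily what I actually store):

import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.index.Payload;

// Copies each token's offsets into its payload so that they can later be
// read back via Spans.getPayload().
public final class OffsetPayloadFilter extends TokenFilter {
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

  public OffsetPayloadFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    byte[] bytes = ByteBuffer.allocate(8)
        .putInt(offsetAtt.startOffset())
        .putInt(offsetAtt.endOffset())
        .array();
    payloadAtt.setPayload(new Payload(bytes));
    return true;
  }
}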

I still wonder what the purpose of the Offset field is as it is so
inefficient to access. It seems like a wasteful redundancy to even store
the Offsets during indexing, considering that I also store it as a
payload. Or am I missing something?

Best,
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform






Offsets in 3.6/4.0

2012-07-13 Thread Carsten Schnober
Dear list,
I am working on a search application that depends on retrieving offsets
for each match. Currently (in Lucene 3.6), this seems to be overly
costly, at least in my solution that looks like this:

---
TermPositionVector tfv;
int index;
TermVectorOffsetInfo[] offsets;

tfv = (TermPositionVector) reader.getTermFreqVector(docid, fieldname);
index = tfv.indexOf(term.text());
offsets = tfv.getOffsets(index);
---
So I can use the suitable TermVectorOffsetInfo from the offsets[] array
to retrieve the offset information of a span. However, this slows down
the search to an unacceptable level.

Reviewing the thread 'Retrieving Offsets'
(http://lucene.472066.n3.nabble.com/Retrieving-offsets-td3658238.html)
indicates that there has not been any more efficient way to go in Lucene
3.6. Am I right?

However, I understand that the patch LUCENE-3684
(https://issues.apache.org/jira/browse/LUCENE-3684) has improved the
situation. I am wondering now whether this is worth migrating to Lucene
4.0 in terms of search performance. It is currently not entirely clear
to me, whether Lucene 4.0 alpha actually allows the retrieval of offsets
from an index without having to read the TermFreqVector though.

Who can give me some advice about the potential search performance gain
for such an application and, ideally, some pointers about how to
resolve the problem?

Thank you very much,
Carsten Schnober


-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform





Re: Field value vs TokenStream

2012-04-20 Thread Carsten Schnober
On 18.04.2012 20:06, Uwe Schindler wrote:

Hi,

 You should inform yourself about the difference between stored and
 indexed fields: The tokens in the .tis file are in fact the analyzed
 tokens retrieved from the TokenStream. This is controlled by the Field
 parameter Field.Index. The Field.Store parameter has nothing to do with
 indexing: if a field is marked as stored, the full and unchanged string /
 binary is stored in the stored fields file (.fdt). Stored fields are used

Thanks for that clarification!
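
For anyone finding this thread later, the distinction in a nutshell (Lucene
3.x API; field name and text are made up):

// Store.YES keeps the verbatim string in the stored fields file (.fdt);
// Index.ANALYZED sends the analyzer's terms to the inverted index (.tis etc.).
// Document.toString() on a retrieved document shows only the stored value.
Field textField = new Field("text", "The quick brown fox",
    Field.Store.YES, Field.Index.ANALYZED);
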
Best,
Carsten

-- 
Carsten Schnober
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP -- Korpusanalyseplattform der nächsten Generation
http://korap.ids-mannheim.de/ | Tel.: +49-(0)621-1581-238




Field value vs TokenStream

2012-04-18 Thread Carsten Schnober
Dear list,
I'm studying the Lucene index file formats and I wonder: after having
initialized a field with Field(String name, String value, Field.Store
store, Field.Index index), where is the value String stored?

I understand that the chosen analyzer does its processing on that value,
including tokenization, and returns a TokenStream from which the Indexer
retrieves the attributes that it stores in the index.
When I use a binary editor to inspect the term infos (tis) file in the
index directory, I can see every single token (term).
For experimentation purposes, I implemented an analyzer that transforms the
value that is input to the field and noticed the following: the TokenStream
still correctly generates the terms that end up to be stored in the tis
file, but the initial input value is still displayed as the field value
when I retrieve a document from the index and output it with
Document.toString(). I tried to analyse the Field's tokenStream, but
tokenStreamValue() returns null; is that normal when retrieving a
document from an existing index?

Can someone let me know what happens to a Field's value string and at
which point in the pipeline it is replaced by the (term) attributes
generated by the TokenStream?

Thank you very much!
Best,
Carsten


-- 
Carsten Schnober
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP -- Korpusanalyseplattform der nächsten Generation
http://korap.ids-mannheim.de/ | Tel.: +49-(0)621-1581-238




Indexing Pre-analyzed Field

2012-04-11 Thread Carsten Schnober
Hi,
I've been wondering about the best way to index a pre-analyzed field. With
pre-analyzed, I mean essentially one that I'd like to initialize with
the constructor Field(String name, TokenStream tokenStream).

There is a loop over a bunch of documents, all with pre-defined
tokenizations that are stored in the variable tokenizations. One by one,
the Lucene documents are added to the writer. The writer is an
IndexWriter object that has been initialized and configured before.

I have implemented a custom TokenStream class for that purpose, so I've
approached the problem like the following:

CustomTokenStream ts = new CustomTokenStream();
for (tokenization : tokenizations) {
idField = new Field("id", doc.getDocid(), Field.Store.YES,
Field.Index.NOT_ANALYZED);

ts.setTokenization(tokenization);   
textField = new Field("text", ts);

luceneDocument.add(idField);
luceneDocument.add(textField);
try {
writer.addDocument(luceneDocument);
} catch (IOException e) {
System.err.println("Error adding document:\n"
+ e.getLocalizedMessage());
}
}

The problem is clearly that I cannot query the text field, can I?

I've tried other ways though like initializing the text field with

textField = new Field(String name, String value, Field.Store.YES,
Field.Index.ANALYZED)

and setting

textField.setTokenStream(ts);


However, this does not seem to make sense since I don't want to use a
Lucene built-in analyzer and I'm not quite clear about what I should use
for the value in the latter approach.

Any help is very welcome! Thank you very much!
Best regards,
Carsten

-- 
Carsten Schnober
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP -- Korpusanalyseplattform der nächsten Generation
http://korap.ids-mannheim.de/ | Tel.: +49-(0)621-1581-238




Apply custom tokenization

2012-03-06 Thread Carsten Schnober
Dear list,
I have a quite specific issue on which I would appreciate very much
having some thoughts before I start the actual implementation. Here's my
task description:
I would like to index corpora that have already been tokenized by an
external tokenizer. This tokenization is stored in an external file and
is the one I want to use for the Lucene index too. For each document,
there is a file that describes each token in the document by character
offsets, e.g. <token start="0" end="3" />. Leaving aside the XML format,
I'll write an appropriate XML parser so that we just have that
tokenization information.
I do not want do to any additional analysis on the input text, i.e. no
stopword filtering etc.; each token that is specified in the external
tokenization is supposed to result in an indexed token.

My approach to achieve this goal would be to implement an Analyzer that
reads the external tokenization information and generates a TokenStream
containing all the Token objects with offsets set according to the
external tokenization, i.e. without a Tokenizer implementation of its own. I'm
working with Lucene 3.5, which is why one very concrete question at this
point is: how would you implement this using the Attribute interface;
still use Token objects, or can/should I avoid them altogether? The
documentation is quite vague about that point and so is the Lucene in
Action (2nd ed.) textbook.
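
To make the question more concrete, what I have in mind is roughly the
following (a sketch against the 3.x Attribute API; ExternalToken is just a
made-up holder for the parsed token entries):

import java.io.IOException;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

// Replays externally supplied tokens as a TokenStream, using only the
// Attribute API (no Token objects). A production version would also
// override reset() so that the stream can be replayed.
public final class ExternalTokenStream extends TokenStream {

  public static final class ExternalToken {
    final String text;
    final int start;
    final int end;
    public ExternalToken(String text, int start, int end) {
      this.text = text;
      this.start = start;
      this.end = end;
    }
  }

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  private final Iterator<ExternalToken> tokens;

  public ExternalTokenStream(List<ExternalToken> externalTokens) {
    this.tokens = externalTokens.iterator();
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!tokens.hasNext()) {
      return false;
    }
    clearAttributes();
    ExternalToken t = tokens.next();
    termAtt.append(t.text);
    offsetAtt.setOffset(t.start, t.end);
    return true;
  }
}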

The background is that I need to allow different tokenizations, so there
will potentially be multiple indexes for a text. Queries will have to be
tokenized by a user-defined tokenizer and the suitable index will then
be searched. So what are your thoughts about that approach? Is it the
right strategy for the task? Please recall that a given fact is that the
tokenization has to be read from an external file.

In general, I am afraid that Lucene almost hardwires the analysis
process. Even though it does allow custom tokenizers to be
implemented, it does not seem to be intended that one comes up with a
completely self-made text analysis process, is it?

Thank you very much!
Carsten


-- 
Carsten Schnober
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP -- Korpusanalyseplattform der nächsten Generation
http://www.ids-mannheim.de/kl/projekte/korap/
Tel.: +49-(0)621-1581-238
