Reading Payloads
Hi, I'm trying to extract payloads from an index for specific tokens in the following way (inserting a sample document number and term):

    Terms terms = reader.getTermVector(16504, term);
    TokenStream tokenstream = TokenSources.getTokenStream(terms);
    while (tokenstream.incrementToken()) {
      OffsetAttribute offset = tokenstream.getAttribute(OffsetAttribute.class);
      int start = offset.startOffset();
      int end = offset.endOffset();
      String token = tokenstream.getAttribute(CharTermAttribute.class).toString();
      PayloadAttribute payloadAttr = tokenstream.addAttribute(PayloadAttribute.class);
      BytesRef payloadBytes = payloadAttr.getPayload();
      ...
    }

This works fine for the OffsetAttribute and the CharTermAttribute, but payloadAttr.getPayload() unfortunately always returns null, for all documents and all tokens. However, I know that the payloads are stored in the index, as I can retrieve them through a SpanQuery with Spans.getPayload(). I actually expect every token to carry a payload, as my custom tokenizer implementation has the following lines:

    public class KoraTokenizer extends Tokenizer {
      ...
      private PayloadAttribute payloadAttr = addAttribute(PayloadAttribute.class);
      ...
      public boolean incrementToken() {
        ...
        payloadAttr.setPayload(new BytesRef(payloadString));
        ...
      }
      ...
    }

I've asserted that the payloadString variable is never an empty String, and as I said above, I can retrieve the payloads with Spans.getPayload(). So what am I doing wrong in my tokenstream.addAttribute(PayloadAttribute.class) call? BTW, I used tokenstream.getAttribute() before, as for the other attributes, but this threw an IllegalArgumentException, so I followed the recommendation in the documentation and replaced it with addAttribute().

Thanks!
Carsten

--
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789 | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform

To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
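As a possible workaround, payloads stored in term vectors can be read directly through the term vector's TermsEnum and DocsAndPositionsEnum instead of going through TokenSources. A hedged, untested sketch against the Lucene 4.x API; the document ID and field name are illustrative:

```java
// Sketch (Lucene 4.x): iterate a document's term vector and read per-position
// payloads directly; requires the field to have been indexed with term vector
// positions and payloads.
Terms vector = reader.getTermVector(16504, "text");
TermsEnum termsEnum = vector.iterator(null);
BytesRef term;
DocsAndPositionsEnum dpe = null;
while ((term = termsEnum.next()) != null) {
    dpe = termsEnum.docsAndPositions(null, dpe);
    if (dpe == null || dpe.nextDoc() == DocIdSetIterator.NO_MORE_DOCS) {
        continue;   // no positions stored for this term
    }
    for (int i = 0; i < dpe.freq(); i++) {
        int position = dpe.nextPosition();
        BytesRef payload = dpe.getPayload();   // null if no payload at this position
        // process term/position/payload ...
    }
}
```

Whether the payloads actually survive into the term vector depends on FieldType.setStoreTermVectorPayloads() having been enabled at index time.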
Re: Reading Payloads
Am 23.04.2013 13:21, schrieb Michael McCandless:
> Actually, term vectors can store payloads now (LUCENE-1888), so if that
> field was indexed with FieldType.setStoreTermVectorPayloads they should
> be there. But I suspect the TokenSources.getTokenStream API (which I
> think un-inverts the term vectors to recreate the token stream = very
> slow?) wasn't fixed to also carry the payloads through?

I use the following FieldType:

    private final static FieldType textFieldWithTermVector = new FieldType(TextField.TYPE_STORED);
    textFieldWithTermVector.setStoreTermVectors(true);
    textFieldWithTermVector.setStoreTermVectorPositions(true);
    textFieldWithTermVector.setStoreTermVectorOffsets(true);
    textFieldWithTermVector.setStoreTermVectorPayloads(true);

So I suppose your assumption is right that the TokenSources.getTokenStream API is not ready to make use of this. I'm trying to figure out a way to use a query as Uwe suggested. My scenario is to perform a query and then retrieve some of the payloads upon user request, so there is no obvious way to wrap this into a query, as I can't know what (terms) to query for.

Best,
Carsten
Re: Reading Payloads
Am 23.04.2013 13:47, schrieb Carsten Schnober:
> I'm trying to figure out a way to use a query as Uwe suggested. My
> scenario is to perform a query and then retrieve some of the payloads
> upon user request, so there is no obvious way to wrap this into a query,
> as I can't know what (terms) to query for.

I wonder: is there a way to perform a (Span)Query restricting the search to tokens within certain offsets in a document, e.g. by a Filter?

Thanks!
Carsten
Re: Reading Payloads
Am 23.04.2013 15:27, schrieb Alan Woodward:
> There's the SpanPositionCheckQuery family - SpanRangeQuery,
> SpanFirstQuery, etc. Is that the sort of thing you're looking for?

Hi Alan,
thanks for the pointer, this is the right direction indeed. However, these queries are based on a SpanQuery, which depends on a specific expression to search for. In my use case, I need to retrieve Spans specified by their offsets only, and then get their payloads and process them further. Alternatively, I could query for the occurrence of certain string patterns in the payloads and check the offsets subsequently, but either way I'm no longer interested in the actual term at that point. I don't see a way to do this with these Query types, or is there?

Carsten
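For reference, a position-restricted span query in Lucene 4.x looks roughly like the sketch below. Note that SpanPositionRangeQuery restricts by token position, not by character offset, so offsets would still have to be mapped to positions somehow; all names are illustrative and the snippet is untested:

```java
// Sketch: keep only matches of the inner span query whose positions
// fall into [0, 10) within each document.
SpanQuery inner = new SpanTermQuery(new Term("text", "house"));
SpanQuery restricted = new SpanPositionRangeQuery(inner, 0, 10);
Spans spans = restricted.getSpans(atomicContext, acceptDocs, termContexts);
```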
Re: Reading Payloads
Am 23.04.2013 16:17, schrieb Alan Woodward:
> It doesn't sound as though an inverted index is really what you want to
> be querying here, if I'm reading you right. You want to get the payloads
> for spans at a specific position, but you don't particularly care about
> the actual term at that position? You might find that BinaryDocValues
> are a better fit here, but it's difficult to tell without knowing what
> your actual use case is.

Hi Alan,
you are right that this specific aspect is not really suitable for an inverted index. I've still been hoping that I could misuse it for some cases. Let me sketch my use case:

A user performs a query that is parsed and executed in the form of a SpanQuery. The offsets of the match(es) are extracted and returned. From that point on, the user uses these offsets to retrieve certain segments of a document from an external database. However, I also store additional information (linguistic annotations) in the token payloads, because they are also used for more complex queries that filter matches depending on these payloads. As they are stored in the index anyway, I thought I could just as well extract them upon request. I am aware that such a request wouldn't perform very well, but apart from that, I think it would be very handy if I were able to extract the payloads for a given span. However, I can't find a way other than via TokenSources.getTokenStream, but that apparently doesn't work.

I'm now thinking about storing the resulting Spans in memory so that I could extract the payloads upon user request. However, that still wouldn't allow me to extract the payloads of any other token, which would be a typical use case when a user wants to retrieve annotations for adjacent tokens, for example.

Carsten
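If the per-document annotations were moved out of payloads, Alan's BinaryDocValues suggestion might look roughly like this (Lucene 4.2+ API; the field name and the idea of serializing the annotations into one byte blob per document are assumptions):

```java
// Indexing sketch: store the serialized annotations per document.
Document doc = new Document();
doc.add(new BinaryDocValuesField("annotations", new BytesRef(serializedAnnotations)));

// Retrieval sketch: random access by docID, no un-inverting of the index needed.
BinaryDocValues dv = atomicReader.getBinaryDocValues("annotations");
BytesRef result = new BytesRef();
dv.get(docId, result);   // then decode 'result' into the annotation structure
```

The trade-off is granularity: doc values are per document, so per-token annotations would need an encoding (e.g. position-keyed) inside the blob.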
Re: Statically store sub-collections for search (faceted search?)
Am 12.04.2013 20:08, schrieb SUJIT PAL:
> Hi Carsten,
> Why not use your idea of the BooleanQuery but wrap it in a Filter
> instead? Since you are not doing any scoring (only filtering), the max
> boolean clauses limit should not apply to a filter.

Hi Sujit,
thanks for your suggestion! I wasn't aware that the max clause limit does not apply to a BooleanQuery wrapped in a filter. I suppose the ideal way would be to use a BooleanFilter rather than a QueryWrapperFilter, right?

However, I am also not sure how to apply a filter in my use case, because I perform a SpanQuery. Although SpanQuery#getSpans() does take a Bits object as an argument (acceptDocs), I haven't been able to figure out how to generate this Bits object correctly from a Filter object.

Best,
Carsten
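A BooleanFilter built from TermFilters (both in the Lucene 4.x queries module) would avoid the BooleanQuery clause limit entirely. A hedged sketch, with the field name and the `docIds` collection assumed:

```java
// Sketch: OR-combine per-document id filters without BooleanQuery's clause limit.
BooleanFilter filter = new BooleanFilter();
for (String docid : docIds) {
    filter.add(new TermFilter(new Term("id", docid)), BooleanClause.Occur.SHOULD);
}
```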
Re: Statically store sub-collections for search (faceted search?)
Am 15.04.2013 11:27, schrieb Uwe Schindler:
> Hi again,
> You are somehow misusing acceptDocs and DocIdSet here, so you have to
> take care, semantics are different:
> - For acceptDocs, null means "all documents allowed - no deleted documents"
> - For DocIdSet, null means "no documents matched"

Okay, as described above, I would now pass either the result of getLiveDocs() or Bits.MatchAllDocuments() as the acceptDocs argument to getDocIdSet():

    Map<Term, TermContext> termContexts = new HashMap<Term, TermContext>();
    AtomicReaderContext atomic = ...
    ChainedFilter filter = ...

> You just pass getLiveDocs(), no null check needed. Using your code would
> bring a slowdown for indexes without deletions.

This makes sense to me, but now I get zero matches in all searches using the filter. I am pondering this remark in the documentation of Filter.getDocIdSet(AtomicReaderContext context, Bits acceptDocs):

    acceptDocs - Bits that represent the allowable docs to match
    (typically deleted docs but possibly filtering other documents)

I understand that getLiveDocs() returns the document bit set that represents NON-deleted documents, which seems to match the first part of the description ("allowable docs"). However, why does it say "typically deleted docs" in brackets? I had ignored this so far, but as I get zero results now, this might be relevant.

I am also thinking about how to possibly make use of a BitsFilteredDocIdSet in the following way:

    ChainedFilter filter = ...
    AtomicReaderContext atomic = ...
    Bits alldocs = atomic.reader().getLiveDocs();
    DocIdSet docids = filter.getDocIdSet(atomic, alldocs);
    BitsFilteredDocIdSet filtered = new BitsFilteredDocIdSet(docids, alldocs);
    Spans luceneSpans = sq.getSpans(atomic, filtered.bits(), termContexts);

However, the documentation of the constructor

    public BitsFilteredDocIdSet(DocIdSet innerSet, Bits acceptDocs)

does not make it clear to me whether I am applying the arguments correctly. I especially fail to understand the acceptDocs argument again:

    acceptDocs - Allowed docs, all docids not in this set will not be
    returned by this DocIdSet

Would this be the correct way to apply a filter on a SpanQuery?

Thanks!
Carsten
Re: Statically store sub-collections for search (faceted search?)
Am 15.04.2013 13:43, schrieb Uwe Schindler:
> Hi,
> Passing NULL means all documents are allowed, if this would not be the
> case, whole Lucene queries and filters would not work at all, so if you
> get 0 docs, you must have missed something else. If this is not the
> case, your filter may behave wrong. Look at e.g. FilteredQuery,
> IndexSearcher or any other query in Lucene that handles acceptDocs -
> those pass getLiveDocs() down. If they are null, that means all
> documents are allowed. The javadocs on Scorer/Filter/... should be more
> clear about this. Can you open an issue about Javadocs?

I'll open an issue as soon as I have understood how this should be corrected. :)

I think I've pinpointed my problem: I use a TermsFilter, get a DocIdSet with TermsFilter.getDocIdSet(atomic, atomic.reader().getLiveDocs()), and eventually retrieve a Bits object from that with DocIdSet.bits(). However, the latter always returns null. Wrapping the TermsFilter into a CachingWrapperFilter doesn't change that. I was using a QueryWrapperFilter before, which would give me a DocIdSet object from which I could get a proper Bits object to pass to SpanQuery.getSpans(). Is there any way I could extract a Bits object from a TermsFilter?

>> Would this be the correct way to apply a filter on a SpanQuery?
> new FilteredQuery(SpanQuery, Filter)?

Okay, I formulated the question wrongly. I need to call SpanQuery.getSpans() because I have to process the resulting Spans object. Therefore, I actually meant: what is the general way to generate a Bits object from a Filter that can be used as the acceptDocs argument?

Best,
Carsten
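One general pattern for turning a Filter into a Bits instance, when DocIdSet.bits() returns null (i.e. the set does not support random access), is to materialize the iterator into a FixedBitSet, which itself implements Bits. An untested sketch for Lucene 4.x, with variable names assumed:

```java
// Sketch: per segment, collect the filter's matches into a FixedBitSet
// and pass it to SpanQuery.getSpans() as acceptDocs.
for (AtomicReaderContext ctx : reader.leaves()) {
    DocIdSet docIdSet = filter.getDocIdSet(ctx, ctx.reader().getLiveDocs());
    if (docIdSet == null) continue;          // filter matched nothing in this segment
    DocIdSetIterator it = docIdSet.iterator();
    if (it == null) continue;
    FixedBitSet bits = new FixedBitSet(ctx.reader().maxDoc());
    int doc;
    while ((doc = it.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
        bits.set(doc);
    }
    Spans spans = spanQuery.getSpans(ctx, bits, termContexts);
    // process spans ...
}
```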
No documents in TermsFilter.getDocIdSet()
Hi,
tying in with the previous thread "Statically store sub-collections for search", I'm trying to focus on the root of the problem that has occurred to me. At first, I generate a TermsFilter with potentially many terms:

    List<Term> docnames = new ArrayList<Term>(resource.getDocIDs().size());
    for (String docid : resource.getDocIDs()) {
      docnames.add(new Term("id", docid));
    }
    TermsFilter filter = new TermsFilter(docnames);

This filter is used to generate a DocIdSet object holding the allowable documents, in a loop over the atomic segments of my IndexReader reader:

    for (AtomicReaderContext atomic : reader.leaves()) {
      DocIdSet docids = filter.getDocIdSet(atomic, atomic.reader().getLiveDocs());
      DocIdSetIterator iterator = docids.iterator();
      while (iterator.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
        ...
      }
      ...
    }

The while-loop is never entered, i.e. there are no documents in docids. However, iterator() does return a DocIdSetIterator object and is not null. The same technique works fine with another Filter (a QueryWrapperFilter). Is this a bug, or am I addressing the TermsFilter (or the resulting DocIdSet) in the wrong way? Are there any working examples for how to get a properly populated DocIdSet from a TermsFilter?

I read that the iterator() method has to be implemented for every DocIdSet implementation. Also, TermsFilter.getDocIdSet() seems to return null or a FixedBitSet, which seems to implement its iterator() by an OpenBitSetIterator.

Best,
Carsten
Statically store sub-collections for search (faceted search?)
Dear list,
I would like to create a sub-set of the documents in an index that is to be used for further searches. However, the criteria that lead to the creation of that sub-set are not predefined, so I think that faceted search cannot be applied to this use case.

For instance: a user searches for documents that contain token 'A' in a field 'text'. These results form a set of documents that is persistently stored (in a database). Each document in the index has a field 'id' that identifies it, so these external IDs are stored in the database. Later on, a user loads the document IDs from the database and wants to execute another search on this set of documents only.

However, performing a search on the full index and subsequently filtering the results against that list of documents takes very long if there are many matches. This is obvious, as I have to retrieve the external id from each matching document and check whether it is part of the desired sub-set. Constructing a BooleanQuery in the style "id:Doc1 OR id:Doc2 ..." is not suitable either, because there could be thousands of documents, exceeding any limit for Boolean clauses.

Any suggestions how to solve this? I would have gone for the Lucene document numbers and stored them as a bit set that I could use as a filter during later searches, but I read that the document numbers are ephemeral. One possible way out seems to be to create another index from the documents that matched the initial search, but this seems quite an overkill, especially if there are plenty of them...

Thanks for any hint!
Carsten
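One possible approach is a TermsFilter over the stored external IDs, which is not subject to the Boolean clause limit. A hedged sketch for Lucene 4.x; `storedIds` (the IDs loaded back from the database) and `laterQuery` are assumed names:

```java
// Sketch: restrict the follow-up search to the stored sub-collection
// by filtering on the external 'id' field.
List<Term> idTerms = new ArrayList<Term>();
for (String docid : storedIds) {
    idTerms.add(new Term("id", docid));
}
Filter subset = new TermsFilter(idTerms);
TopDocs hits = searcher.search(new FilteredQuery(laterQuery, subset), 10);
```

Since the filter is keyed on the stable external IDs rather than Lucene's ephemeral document numbers, it survives merges and re-indexing.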
Update a bunch of documents
Hi,
I have the following scenario: I have an index of very large size (I'm testing with around 200,000 documents, but it should scale to many millions) and I want to perform a search on a certain field. According to that search, I would like to manipulate a different field for all the matching documents.

The only approach I could come up with so far would be to load the matching document ids into a Collector, iterate over them, load the Document objects with IndexReader.document(docid), and manipulate them one by one. Finally, I would delete all the documents matching the initial query with

    IndexWriter.deleteDocuments(Query query)

and write the edited ones with

    IndexWriter.addDocuments(Iterable<? extends Iterable<? extends IndexableField>> docs)

However, the iteration seems to be very time-consuming, as it can concern large portions of the indexed documents, and I wonder if there is a smarter way to perform the document manipulation. The manipulation is limited to one field only (not the one on which the query is typically performed!); shouldn't that help?

Thanks!
Carsten
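For completeness, per-document updates in Lucene 4.x go through IndexWriter.updateDocument(), which atomically deletes by term and re-adds. One caveat: a Document retrieved from the index only contains its stored fields, so re-adding it silently drops anything that was indexed but not stored. A hedged sketch; field names and the `matchingDocIds` collection are illustrative:

```java
// Sketch: rewrite one field for each matching document, keyed on a unique id field.
for (int docId : matchingDocIds) {        // collected from the initial search
    Document doc = reader.document(docId);
    doc.removeFields("annotations");
    doc.add(new StoredField("annotations", newValue));
    writer.updateDocument(new Term("id", doc.get("id")), doc);
}
writer.commit();
```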
Re: Luke?
Am 13.03.2013 10:23, schrieb dizh:
> I just recompiled it. Luckily, it didn't need much work - only a few
> modifications according to the Lucene 4.1 API change doc.

That's great news. Are you going to publish a ready-made version somewhere?

Also, in my experience, Luke 4.0.0-ALPHA cannot deal with indexes in which term vectors are stored. This might as well be caused by the fact that custom field types are in use.

Best,
Carsten
Re: Rewrite for RegexpQuery
Am 11.03.2013 18:22, schrieb Michael McCandless:
> On Mon, Mar 11, 2013 at 9:32 AM, Carsten Schnober
> schno...@ids-mannheim.de wrote:
>> Am 11.03.2013 13:38, schrieb Michael McCandless:
>>> On Mon, Mar 11, 2013 at 7:08 AM, Uwe Schindler u...@thetaphi.de wrote:
>>>> Set the rewrite method to e.g. SCORING_BOOLEAN_QUERY_REWRITE, then
>>>> this should work (after rewrite your query is a BooleanQuery, which
>>>> supports extractTerms()).
>>> ... as long as you don't exceed the max number of terms allowed by BQ
>>> (1024 by default, but you can raise it).
>> True, I've noticed this meanwhile. Are there any recommendations for
>> this setting where the limit is as large as possible while staying
>> within a reasonable performance? Of course, this is highly subjective,
>> but what's the magnitude here? Will a limit of 1,024,000 typically
>> increase the query time by the factor 1,000 too?
> I think 1024 may already be too high ;) But really it depends on your
> situation: test different limits and see. How much slower a larger
> query is depends on the specifics of the terms ...

For the purpose of initial testing, I've increased the limit by the factor 1,000. As Uwe pointed out, I don't actually execute the query but only extract the terms. In this regard, there are no performance issues with thousands of terms, although I still have to perform a systematic evaluation.

Best,
Carsten
Re: Rewrite for RegexpQuery
Am 12.03.2013 10:39, schrieb Uwe Schindler:
> I would suggest to use my example code with the fake query and custom
> rewrite. This does not have the overhead of BooleanQuery and, more
> important: You don't need to change the *global* and *static* default
> in BooleanQuery. Otherwise you could introduce a denial of service case
> into your application, if you at some other place execute a wildcard
> using Boolean rewrite with unlimited number of terms.

Hi Uwe,
many thanks for your code sample! I've made tiny adaptations in GetTermsRewrite to make the overridden methods match their counterparts in the superclass (ScoringRewrite). I suppose that your version was not written for Lucene 4.0, right? It looks like this now:

    final class GetTermsRewrite extends ScoringRewrite<TermHolderQuery> {
      @Override
      protected void addClause(TermHolderQuery topLevel, Term term, int docCount, float boost, TermContext states) {
        topLevel.add(term);
      }

      @Override
      protected TermHolderQuery getTopLevelQuery() {
        return new TermHolderQuery();
      }

      @Override
      protected void checkMaxClauseCount(int count) throws IOException {}
    }

I'm not sure what checkMaxClauseCount() is supposed to do, though, but apart from that, everything works great. Thanks! The code I use for calling this:

    IndexSearcher searcher = ...;
    String queryString = ...;
    MultiTermQuery query = new RegexpQuery(new Term("text", queryString));
    query.setRewriteMethod(new GetTermsRewrite());
    TermHolderQuery queryRewritten = (TermHolderQuery) searcher.rewrite(query);
    Set<Term> terms = queryRewritten.getTerms();

There's another thing that is not entirely clear to me: when calling query.setRewriteMethod(new GetTermsRewrite()), does this really apply to the IndexSearcher in the sense that IndexSearcher.rewrite() uses the given rewrite method? It seems to work fine, but I am not sure why it does and whether it always will.

Best,
Carsten
Term Statistics for MultiTermQuery
Hi,
here's another question involving MultiTermQuerys. My aim is to get a frequency count for a MultiTermQuery, while I don't need to execute the query. The naive approach would be to create the Query, extract the terms, and get each term's frequency, approximately as follows:

    IndexSearcher searcher = ...;
    PrefixQuery query = new PrefixQuery(new Term("field", "abc"));
    Query rewritten = searcher.rewrite(query);
    Set<Term> terms = new HashSet<Term>();
    rewritten.extractTerms(terms);
    ...

And eventually read the term frequencies for each term. However, this seems rather costly for a large number of terms, and I am actually interested in the total frequencies, so there would be no need for a term-by-term analysis.

My use case is that I have an index containing part-of-speech tags in the form "tag:token", and I may be searching for tag frequencies. My alternative solution would be to create a dedicated index in which the original tokens are completely replaced by the tags, so that I had documents in the form "DET NN ..." and corresponding tokens. Would you rather recommend this?

Thanks,
Carsten
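A total frequency for a prefix can also be summed directly from the terms dictionary, without constructing or rewriting a query. An untested sketch for Lucene 4.x; the field name and prefix are illustrative:

```java
// Sketch: walk the terms dictionary of each segment from the prefix onwards
// and sum totalTermFreq until the prefix no longer matches.
long total = 0;
BytesRef prefix = new BytesRef("abc");
for (AtomicReaderContext ctx : reader.leaves()) {
    Terms terms = ctx.reader().terms("field");
    if (terms == null) continue;
    TermsEnum te = terms.iterator(null);
    if (te.seekCeil(prefix) == TermsEnum.SeekStatus.END) continue;
    do {
        if (!StringHelper.startsWith(te.term(), prefix)) break;
        total += te.totalTermFreq();   // -1 if term frequencies were not indexed
    } while (te.next() != null);
}
```

Summing per segment is fine here, because each segment's statistics are independent and the per-segment counts add up to the index-wide occurrence count.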
Rewrite for RegexpQuery
Hi,
I'm trying to get the terms that match a certain RegexpQuery. My (naive) approach:

1. Create a RegexpQuery from the queryString (e.g. "abc.*"):

    Query q = new RegexpQuery(new Term("text", queryString));

2. Rewrite the Query using the IndexReader reader:

    q = q.rewrite(reader);

3. Write the terms into a previously initialized empty set terms:

    Set<Term> terms = new HashSet<Term>();
    q.extractTerms(terms);

However, this results in an empty set. I believe this is due to the fact that the rewritten query is a ConstantScoreQuery object; q.extractTerms(terms) does not yield any terms anyway. q.getQuery() returns null, however; according to the documentation, this should happen when it wraps a filter, which it supposedly does not. This is Lucene 4.0. Any hints?

Thanks!
Carsten
Re: Rewrite for RegexpQuery
Am 11.03.2013 12:08, schrieb Uwe Schindler:
> This works for this query, but in general you have to rewrite until it
> is completely rewritten: a while loop that exits when the result of the
> rewrite is identical to the original query. IndexSearcher.rewrite()
> does this for you.
>> 3. Write the terms into a previously initialized empty set terms:
>>     Set<Term> terms = new HashSet<Term>();
>>     q.extractTerms(terms);
> Set the rewrite method to e.g. SCORING_BOOLEAN_QUERY_REWRITE, then this
> should work (after rewrite your query is a BooleanQuery, which supports
> extractTerms()).

This does work for my case, thank you! For the matter of completeness, the full solution (for my specific case) is as follows:

    Set<Term> terms = new HashSet<Term>();
    MultiTermQuery query = new RegexpQuery(new Term("text", queryString));
    query.setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);
    BooleanQuery bq = (BooleanQuery) query.rewrite(reader);
    bq.extractTerms(terms);

Regarding the application of IndexSearcher.rewrite(Query) instead: I don't see a way to set the rewrite method there, because the Query's rewrite method does not seem to apply to IndexSearcher.rewrite().

Best,
Carsten
Re: Rewrite for RegexpQuery
Am 11.03.2013 14:13, schrieb Uwe Schindler:
>> Regarding the application of IndexSearcher.rewrite(Query) instead: I
>> don't see a way to set the rewrite method there, because the Query's
>> rewrite method does not seem to apply to IndexSearcher.rewrite().
> Replace:
>     BooleanQuery bq = (BooleanQuery) query.rewrite(reader);
> with:
>     BooleanQuery bq = (BooleanQuery) indexSearcher.rewrite(query);
> Otherwise you have to create a while-loop that rewrites the return
> value again until rewrite() returns itself.

Right, I was under the false impression that the rewrite method set for the initial query was not considered by IndexSearcher.rewrite().

Best,
Carsten
Re: Rewrite for RegexpQuery
Am 11.03.2013 13:38, schrieb Michael McCandless:
> On Mon, Mar 11, 2013 at 7:08 AM, Uwe Schindler u...@thetaphi.de wrote:
>> Set the rewrite method to e.g. SCORING_BOOLEAN_QUERY_REWRITE, then
>> this should work (after rewrite your query is a BooleanQuery, which
>> supports extractTerms()).
> ... as long as you don't exceed the max number of terms allowed by BQ
> (1024 by default, but you can raise it).

True, I've noticed this meanwhile. Are there any recommendations for this setting where the limit is as large as possible while staying within a reasonable performance? Of course, this is highly subjective, but what's the magnitude here? Will a limit of 1,024,000 typically increase the query time by the factor 1,000 too?

Carsten
ProximityQueryNode
Hi,
I'm interested in the functionality supposedly implemented through ProximityQueryNode. Currently, it seems like it is not used by the default QueryParser or anywhere else in Lucene, right? This makes perfect sense, since I don't see a Lucene index store any notion of sentences, paragraphs, etc. Is that right too?

I would be interested whether anyone (else) is working on implementing this into some query parser, and in any theoretical and practical approaches to indexing the given types. Also, I think that the type should (at some point in the future) be more flexible than the given values enumerated in the class, so that one could also index arbitrary custom units, e.g. pages, discourse units, syntactic chunks, etc.

My current approach to indexing sentence and paragraph information is to store them in token payloads and then to check for matching tokens whether their respective sentences satisfy the given distance query. Any better ideas?

Best,
Carsten
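The payload-based distance check described above can be illustrated independently of Lucene. Assuming each token's payload encodes a 4-byte big-endian sentence ID (the encoding and all names below are assumptions for illustration), a match is kept only if its tokens' sentence IDs lie within the allowed distance:

```java
import java.util.Arrays;
import java.util.List;

public class SentenceCheck {
    // Decode a 4-byte big-endian sentence ID from a token payload (assumed encoding).
    static int sentenceId(byte[] payload) {
        return ((payload[0] & 0xff) << 24) | ((payload[1] & 0xff) << 16)
             | ((payload[2] & 0xff) << 8)  |  (payload[3] & 0xff);
    }

    // Encode a sentence ID into the payload format above.
    static byte[] encode(int s) {
        return new byte[] { (byte) (s >>> 24), (byte) (s >>> 16), (byte) (s >>> 8), (byte) s };
    }

    // A match is kept only if all of its tokens carry sentence IDs within
    // the allowed distance (0 = all tokens in the same sentence).
    static boolean withinDistance(List<byte[]> payloads, int maxDistance) {
        int min = Integer.MAX_VALUE, max = Integer.MIN_VALUE;
        for (byte[] p : payloads) {
            int s = sentenceId(p);
            min = Math.min(min, s);
            max = Math.max(max, s);
        }
        return max - min <= maxDistance;
    }

    public static void main(String[] args) {
        System.out.println(withinDistance(Arrays.asList(encode(7), encode(7)), 0));  // true
        System.out.println(withinDistance(Arrays.asList(encode(7), encode(9)), 1));  // false
    }
}
```

In the Lucene setting, the `byte[]` values would come from the BytesRef payloads of the matching span's tokens.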
Re: ANTLR and Custom Query Syntax/Parser
Am 29.01.2013 00:24, schrieb Trejkaz: On Tue, Jan 29, 2013 at 3:42 AM, Andrew Gilmartin and...@andrewgilmartin.com wrote: When I first started using Lucene, Lucene's Query classes where not suitable for use with the Visitor pattern and so I created my own query class equivalants and other more specialized ones. Lucene's classes might have changed since then (I do not know) On that subject, the infrastructure behind StandardQueryParser is along those lines. Query itself is still not very flexible, but QueryNode is much more convenient and there are processors for walking the tree to do transformations. We ended up using ANTLR to do the syntax parsing for our stuff and then using most of the standard transformations as-is, decorated in some cases (either to customise or to work around bugs.) Of course we had to add our own for all the new features, but we got a fair bit of reuse out of the new framework. Hi, thanks for your hints, everyone! I am still a little bit puzzled about where to start though. The general task is to generate SpanQueries from the tree provided by the ANTLR query syntax parser. The special feature about that query language (that I have not specified and that I cannot change) is that there are binary operators such as /s0 indicating that the payloads of two tokens have to be identical and implying an AND. The query A /s0 B means find documents that contain A AND B where A and B have identical payloads. My intuitive solution would be to make a filter from a BooleanQuery with A AND B and apply that filter in two separate SpanTermQuerys for A and for B respectively. Then, I would perform an intersection on the hits based on the payloads. However, I am still puzzled how to approach this coming from a an Antlr-generated tree. This may be due to a certain lack of routine dealing with Antlr output, but when the parser returns an object of some subclass of RuleReturnScope, how would I be able to derive appropriate Lucene Query subclasses? 
Best, Carsten -- Institut für Deutsche Sprache | http://www.ids-mannheim.de Projekt KorAP | http://korap.ids-mannheim.de Tel. +49-(0)621-43740789 | schno...@ids-mannheim.de Korpusanalyseplattform der nächsten Generation Next Generation Corpus Analysis Platform - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
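The tree-to-query translation discussed above can be sketched without ANTLR or Lucene at all. The following is a minimal, self-contained stand-in: the AST node classes and the string results are hypothetical placeholders (a real implementation would walk the parser's RuleReturnScope tree and return SpanQuery objects instead of strings), but it shows the recursive shape such a translator would have.

```java
// Sketch only: a recursive walk over an ANTLR-style AST mapping operator
// nodes to query objects. All class and method names here are hypothetical,
// not ANTLR or Lucene API.
public class QueryTreeWalker {
    // Minimal AST: either a leaf term or a binary operator like "/s0".
    static abstract class Node {}
    static class TermNode extends Node {
        final String term;
        TermNode(String term) { this.term = term; }
    }
    static class OpNode extends Node {
        final String op;
        final Node left, right;
        OpNode(String op, Node left, Node right) {
            this.op = op; this.left = left; this.right = right;
        }
    }

    // Recursive translation; in a real implementation the branches would
    // build SpanTermQuery / payload-intersection queries instead of strings.
    static String translate(Node n) {
        if (n instanceof TermNode) {
            return "SpanTermQuery(" + ((TermNode) n).term + ")";
        }
        OpNode op = (OpNode) n;
        if (op.op.equals("/s0")) {
            // A /s0 B: both terms must match with identical payloads.
            return "PayloadIntersection(" + translate(op.left) + ", " + translate(op.right) + ")";
        }
        throw new IllegalArgumentException("unknown operator: " + op.op);
    }

    public static void main(String[] args) {
        Node tree = new OpNode("/s0", new TermNode("A"), new TermNode("B"));
        System.out.println(translate(tree));
    }
}
```

The point of the sketch is that the operator dispatch lives in one recursive function, so adding further operators of the query language is a matter of adding branches.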
Re: Lucene 4.0 scalability and performance.
Am 23.12.2012 12:11, schrieb vitaly_arte...@mcafee.com: This means that we need to index millions of documents with terabytes of content and search in them. For now we want to define only one indexed field containing the content of the documents, with the possibility to search terms and retrieve the term offsets. Has somebody already tested Lucene with terabytes of data? Does Lucene have any known limitations related to the number or size of indexed documents? What about search performance on a huge data set? Hi Vitali, we've been working on a linguistic search engine based on Lucene 4.0 and have performed a few tests with large text corpora. There are at least some overlaps with the functionality you mentioned (term offsets). See http://www.oegai.at/konvens2012/proceedings/27_schnober12p/ (mainly section 5). Carsten
Match intersection by Payload
Hi, I have a search scenario in which I search for multiple terms and retain only those matches that share a common payload. I'm using this to search for multiple terms that all occur in one sentence; I've stored a sentence ID in the payload of each token. So far, I've done this by specifying a list of terms and creating a BooleanQuery that connects these terms (as in [house, car]) with Occur.MUST. That BooleanQuery is wrapped into a filter. In the next step, I perform a separate SpanQuery for each of the terms (one for "house" and one for "car"), using the previously created filter's DocIdSet to restrict the search to documents that contain all of the terms, e.g. for "house": SpanQuery sq = (SpanQuery) new SpanMultiTermQueryWrapper<RegexpQuery>(new RegexpQuery(new Term("text", "house"))).rewrite(reader); The resulting spans are stored in a map with the terms as keys and the matching Spans as values. Finally, I retain only those matches that have the same payload (= sentence) in the same document. This works well for ordinary terms and is reasonably fast, since the SpanQuerys are typically restricted to a manageable document set. However, I would prefer to use the Lucene query language rather than specifying a static list of terms, especially because I'd like to have features such as regular expressions, wildcards, ranges etc. This makes the above solution impossible, though, because the QueryParser can expand what is meant to be one term (e.g. "hous*") into multiple ones ("house", "houses"). Then the intersection as described above no longer makes sense: I don't want sentences that contain both "house" and "houses", but sentences that contain either one, and "car" too. I have three potential solutions in mind: 1. Track back the terms generated by a rewritten MultiTermQuery. I could try to figure out automatically whether the terms retrieved from the StandardQueryParser should be unionised (because they are derived from the same term, as in "hous*") or intersected (as "hous*" and "car").
I'm not sure how to do that reliably, though, because the single terms are extracted only after generating a Query through a StandardQueryParser, and thus there is no distinction between these terms. 2. Implement my own QueryParser that distinguishes between terms derived from one regex ("hous*") and those derived from another ("car"). In that case, the scenario from 1. with unions and intersections would be easy, logically at least. 3. Use a PayloadTermQuery. In that case, I'd hope to throw away the apparently redundant query generation (one for the filter and one for the SpanQuery) and substitute it with a Query that makes matching payloads a precondition. I'm not sure how to do that either, as I don't know beforehand which payload string to match; it just has to be the same for the different terms. All these ways seem equally promising (and complicated) to me, so would you have some advice on which one seems most likely to lead to an actual solution? Thanks, Carsten
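The payload-intersection step described above can be sketched in isolation, independently of Lucene. In this self-contained sketch each match is encoded as a "docId:sentenceId" string (an arbitrary encoding chosen for brevity, not anything from the original code); a match survives only if every term has a match with the same document and sentence.

```java
// Sketch of the payload-intersection step only, independent of Lucene:
// every term maps to its set of matches, each match being a (docId,
// sentenceId) pair encoded as "docId:sentenceId".
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class PayloadIntersection {
    static Set<String> intersect(Map<String, Set<String>> matchesPerTerm) {
        Set<String> result = null;
        for (Set<String> matches : matchesPerTerm.values()) {
            if (result == null) result = new HashSet<>(matches);
            else result.retainAll(matches); // keep only shared (doc, sentence) pairs
        }
        return result == null ? Collections.<String>emptySet() : result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> matches = new HashMap<>();
        matches.put("house", new HashSet<>(Arrays.asList("7:2", "7:5", "9:1")));
        matches.put("car",   new HashSet<>(Arrays.asList("7:5", "9:3")));
        System.out.println(intersect(matches)); // only doc 7, sentence 5 survives
    }
}
```

For option 1 above, the sets per term would first be unionised across all terms expanded from the same input pattern, and only then intersected across different input patterns.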
Re: Boolean and SpanQuery: different results
Am 13.12.2012 18:00, schrieb Jack Krupansky: Can you provide some examples of terms that don't work and the index token stream they fail on? Make sure that the Analyzer you are using doesn't do any magic on the indexed terms - your query term is unanalyzed. Maybe multiple, but distinct, index terms are analyzing to the same, but unexpected, term. Apart from the answer I've already given myself, here's another note about the issue. I've been using WhitespaceAnalyzer for both indexing and query parsing, but apparently the query parser lowercases expanded terms by default, while WhitespaceAnalyzer does not. Therefore, QueryParser.setLowercaseExpandedTerms(false) is necessary in order to get the same results. Best, Carsten
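For reference, a minimal configuration fragment showing the setting in context (a sketch assuming the classic Lucene 4.0 QueryParser; the field name "text" is a placeholder):

```java
QueryParser parser = new QueryParser(Version.LUCENE_40, "text",
        new WhitespaceAnalyzer(Version.LUCENE_40));
// Without this, multi-term queries (wildcard/prefix/range) are lowercased at
// parse time, even though WhitespaceAnalyzer leaves indexed terms untouched.
parser.setLowercaseExpandedTerms(false);
```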
Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?
Am 18.12.2012 12:36, schrieb Michael McCandless: On Thu, Dec 13, 2012 at 8:32 AM, Carsten Schnober schno...@ids-mannheim.de wrote: This is a relatively easy example, but how would one deal with e.g. annotations that include multiple tokens (as in spans), such as chunks, or relations between tokens (and token spans), as in the coreference links example given by Steven above? I think you'd do something like what SynonymFilter does for multi-token synonyms. E.g. a synonym mapping "wireless network" -> "wifi" would insert a new token ("wifi"), overlapped on "wireless". Lucene doesn't store the end span, but if this is really important for your use case, you could add a payload to that "wifi" token encoding the number of positions that the inserted token spans (2 in this case), and then the information would be present in the index. You'd still need to do something custom at read/search time to decode this end position and do something interesting with it ... Thanks for the pointer! I'm still puzzled whether there is an optimal way to encode (labelled) relations between tokens or even spans; the latter part would probably lead back to the synonym-like solution. Best, Carsten
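The encode-span-length-in-a-payload idea can be illustrated without Lucene. This sketch uses a single byte for the position count (an assumption for brevity; real code would wrap the bytes in a BytesRef and could use a variable-length encoding for larger values):

```java
// Lucene-free sketch: encode the number of positions an inserted token
// spans into a payload byte array, and decode it again at "search time".
public class SpanLengthPayload {
    static byte[] encode(int positionLength) {
        if (positionLength < 0 || positionLength > 127)
            throw new IllegalArgumentException("demo supports 0..127 only");
        return new byte[] { (byte) positionLength };
    }

    static int decode(byte[] payload) {
        return payload[0];
    }

    public static void main(String[] args) {
        // "wifi" overlapping "wireless network" spans 2 positions
        byte[] payload = encode(2);
        System.out.println(decode(payload)); // 2
    }
}
```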
Re: Boolean and SpanQuery: different results
Am 13.12.2012 18:00, schrieb Jack Krupansky: Can you provide some examples of terms that don't work and the index token stream they fail on? Make sure that the Analyzer you are using doesn't do any magic on the indexed terms - your query term is unanalyzed. Maybe multiple, but distinct, index terms are analyzing to the same, but unexpected, term. I've done some further analysis, and it turns out that, for some reason, the SpanQuery described previously returns matches only for the first entry (of 18 existing ones) in the list returned by reader.leaves(). As stated in my first post in this thread, my code builds a SpanQuery for each AtomicReaderContext in a list retrieved through MultiReader.leaves(). That SpanQuery is identical to a BooleanQuery with TermQueries for exactly the same terms, performed with IndexSearcher.search() on that same MultiReader. The document IDs of the hits found through the SpanQuery correspond to the ones returned by the BooleanQuery for the same term. However, the documents returned by the BooleanQuery that do not lie within the first AtomicReaderContext are not found by the SpanQuery. Might this have to do with the docBase? I collect the document IDs from the BooleanQuery through a Collector, adding the actual ID to the current AtomicReaderContext.docBase. In the corresponding SpanQuery, I pass these document IDs as a DocIdBitSet as an argument to SpanQuery.getSpans(). Thanks! Carsten
Re: Boolean and SpanQuery: different results
Am 17.12.2012 11:54, schrieb Carsten Schnober: Might this have to do with the docBase? I collect the document IDs from the BooleanQuery through a Collector, adding the actual ID to the current AtomicReaderContext.docBase. In the corresponding SpanQuery, I pass these document IDs as a DocIdBitSet as an argument to SpanQuery.getSpans(). Answering my own question, which has made me think about the document base issue: indeed, I should be collecting document IDs relative to their respective AtomicReaderContext rather than adding the context's docBase, because the subsequent SpanQuery is performed within an AtomicReaderContext as well. Best, Carsten
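The docBase pitfall described above is pure arithmetic and can be shown without Lucene. A Collector sees per-segment (AtomicReaderContext-relative) document IDs; adding docBase yields global IDs. If the consumer (here, a per-segment getSpans call) also works per segment, the IDs must stay segment-relative:

```java
// Illustration of the docBase arithmetic only; the segment boundaries are
// made-up example values.
public class DocBaseDemo {
    static int toGlobal(int segmentRelativeId, int docBase) {
        return docBase + segmentRelativeId;
    }

    static int toSegmentRelative(int globalId, int docBase) {
        return globalId - docBase;
    }

    public static void main(String[] args) {
        int docBase = 1000;  // second segment starts at global ID 1000
        int relative = 42;   // ID as seen inside that segment
        int global = toGlobal(relative, docBase);
        System.out.println(global);                             // 1042
        System.out.println(toSegmentRelative(global, docBase)); // 42 again
    }
}
```

Mixing the two views (collecting globally, filtering per segment) is exactly what made all segments but the first appear empty.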
Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?
Am 13.12.2012 12:27, schrieb Michael McCandless: For example: - part of speech of a token - syntactic parse subtree (over a span) - semantically normalized phrase (to canonical text or ontological code) - semantic group (of a span) - coreference link. So, for example, part-of-speech is a per-token-position attribute. Today the easiest way to handle this is to encode these attributes into a payload, which is straightforward (make a custom TokenFilter that creates the payload). At search time you would then use e.g. PayloadTermQuery to decode the payload and do something with it to alter how the query is being scored. This is a relatively easy example, but how would one deal with e.g. annotations that include multiple tokens (as in spans), such as chunks, or relations between tokens (and token spans), as in the coreference links example given by Steven above? Best, Carsten
Boolean and SpanQuery: different results
Hi, I'm following Grant's advice on how to combine BooleanQuery and SpanQuery (http://mail-archives.apache.org/mod_mbox/lucene-java-user/201003.mbox/%3c08c90e81-1c33-487a-9e7d-2f05b2779...@apache.org%3E). The strategy is to perform a BooleanQuery, get the document ID set, and perform a SpanQuery restricted to those documents. The purpose is that I need to retrieve Spans for different terms in order to extract their respective payloads separately, but a precondition is that possibly multiple terms occur within the documents. My code looks like this (reader and terms are class variables and have been declared final before):

IndexReader reader = ...;
List<String> terms = ...;

/* perform the BooleanQuery and store the document IDs in a BitSet */
BitSet bits = new BitSet(reader.maxDoc());
AllDocCollector collector = new AllDocCollector();
BooleanQuery bq = new BooleanQuery();
for (String term : terms)
  bq.add(new org.apache.lucene.search.RegexpQuery(new Term(config.getFieldname(), term)), Occur.MUST);
IndexSearcher searcher = new IndexSearcher(reader);
searcher.search(bq, collector);
for (ScoreDoc doc : collector.getHits())
  bits.set(doc.doc);

/* get the spans for each term separately */
for (String term : terms) {
  String payloads = retrieveSpans(term, bits);
  // process and print payloads for term ...
}

private String retrieveSpans(String term, BitSet bits) {
  StringBuilder payloads = new StringBuilder();
  Map<Term, TermContext> termContexts = new HashMap<Term, TermContext>();
  Spans spans;
  SpanQuery sq = (SpanQuery) new SpanMultiTermQueryWrapper<RegexpQuery>(new RegexpQuery(new Term("text", term))).rewrite(reader);
  for (AtomicReaderContext atomic : reader.leaves()) {
    spans = sq.getSpans(atomic, new DocIdBitSet(bits), termContexts);
    while (spans.next()) {
      // extract and store payloads in the 'payloads' StringBuilder
    }
  }
  return payloads.toString();
}

This construction seemed to be working fine at first, but I noticed a disturbing behaviour: for many terms, the BooleanQuery, even when fed with only one RegexpQuery, matches a larger number of documents than the SpanQuery constructed from the same RegexpQuery. With the BooleanQuery containing only one RegexpQuery, the numbers should be identical, while with multiple queries added to the BooleanQuery, the SpanQuery should return an equal number of results or more. This behaviour is reliably reproducible, even after re-indexing, but not for all tokens. Does anyone have an explanation for that? Best, Carsten
Re: Boolean and SpanQuery: different results
Am 13.12.2012 18:00, schrieb Jack Krupansky: Can you provide some examples of terms that don't work and the index token stream they fail on? The index I'm testing with is the German Wikipedia, and I've been testing with different (arbitrarily chosen) terms. Some results are listed below; the first number is the number of documents matched by a BooleanQuery, the second the number of documents matched by a SpanQuery:
- Knacklaut 24/19
- schönes 70/70
- zufällige 71/70
- wunderbar 24/24
- Himmel 773/753
- Sonne 1190/1152
Make sure that the Analyzer you are using doesn't do any magic on the indexed terms - your query term is unanalyzed. Maybe multiple, but distinct, index terms are analyzing to the same, but unexpected, term. I'm using a custom Analyzer during indexing. Regarding the analyzer applied during search, I'm not sure: as I haven't defined any specific one, what does Lucene choose? I wasn't thinking about that because I assumed it should make no difference regarding the BooleanQuery vs. SpanQuery issue. Thanks for the hint anyway, I'll have a closer look there. Best, Carsten
SpanQuery and Bits
Hi, I have a problem understanding and applying the BitSets concept in Lucene 4.0. Unfortunately, there does not seem to be a lot of documentation on the topic. The general task is to extract Spans matching a SpanQuery, which works with the following snippet:

for (AtomicReaderContext atomic : reader.getContext().leaves()) {
  Spans spans = query.getSpans(atomic, new Bits.MatchAllBits(0), termContexts);
  while (spans.next()) {
    // extract payloads etc.
  }
}

I understand that the acceptDocs parameter in SpanQuery.getSpans() restricts the search to a set of documents. In the example given above, it searches all documents (Bits.MatchAllBits), right? What I would like to do is generate a Bits object based on a BooleanQuery beforehand, in order to restrict the search through getSpans() to a set of documents that contain certain terms. I also have a MultiReader object that handles multiple indexes. My intuitive approach would be to apply a QueryWrapperFilter like this:

MultiReader reader = ...;
BooleanQuery bq = ...;
DocIdSet bitset = ???;
Filter filter = new QueryWrapperFilter(bq);
for (AtomicReaderContext context : reader.getContext().leaves()) {
  filter.getDocIdSet(context, new Bits.MatchAllBits(0));
}

The obvious question is: how do I handle the per-context bitsets returned by getDocIdSet() correctly, so that I can pass the 'bitset' variable to the getSpans() call? Or am I on the wrong path for this kind of problem? Thanks! Carsten
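The per-segment bookkeeping behind the question above can be sketched with plain java.util.BitSet, without any Lucene classes: a filter produces one doc ID set per AtomicReaderContext with segment-relative IDs, so a single global bit set has to be sliced per segment by shifting with the segment's docBase. The docBase/maxDoc values here are made-up example numbers.

```java
// Sketch (plain java.util.BitSet, no Lucene): extract the bits belonging to
// one segment [docBase, docBase + maxDoc) as segment-relative IDs.
import java.util.BitSet;

public class PerSegmentBits {
    static BitSet sliceForSegment(BitSet global, int docBase, int maxDoc) {
        BitSet slice = new BitSet(maxDoc);
        for (int i = global.nextSetBit(docBase);
             i >= 0 && i < docBase + maxDoc;
             i = global.nextSetBit(i + 1)) {
            slice.set(i - docBase); // shift global ID to segment-relative ID
        }
        return slice;
    }

    public static void main(String[] args) {
        BitSet global = new BitSet();
        global.set(3);     // in segment 1 (docBase 0)
        global.set(1001);  // in segment 2 (docBase 1000)
        global.set(1005);
        BitSet seg2 = sliceForSegment(global, 1000, 100);
        System.out.println(seg2); // {1, 5}
    }
}
```

Each slice would then play the role of the acceptDocs argument for that segment's getSpans() call.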
Specialized Analyzer for names
Hi, I'm indexing names in a dedicated Lucene field and I wonder which analyzer to use for that purpose. Typically, the names are in the format "John Smith", so the WhitespaceAnalyzer is likely the best choice in most cases. The field type to choose seems to be TextField. Or would you rather recommend using the KeywordAnalyzer? I'm a bit cautious about that because I'm wary of wildcard or regex queries such as *Smith or .*Smith, respectively. However, there might also be special cases and spelling exceptions of all kinds, e.g. "Smith, John", "John 'Hammmer' Smith", "Abd al-Aziz", "Stan van Hoop", and whatever else one could imagine. Is there a special Analyzer that is optimized for dealing with such cases, or do I have to do normalization beforehand? I see that such special characters and spellings can easily be covered by the right queries, but that requires the user to know the exact spelling, which is what I'm trying to spare her. Best regards, Carsten
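One possible answer to the normalization question is to pre-normalize names before indexing. The following is a sketch of such a step, not a packaged Lucene analyzer: it lowercases, reorders "Smith, John", and strips quote characters, so that a simple whitespace tokenizer then sees a canonical form. The exact rules are assumptions for illustration.

```java
// Hypothetical pre-normalization for a names field; the rule set is an
// example, not an established library.
import java.util.Locale;

public class NameNormalizer {
    static String normalize(String name) {
        String s = name.trim();
        int comma = s.indexOf(',');
        if (comma >= 0) { // "Smith, John" -> "John Smith"
            s = s.substring(comma + 1).trim() + " " + s.substring(0, comma).trim();
        }
        s = s.replaceAll("['\"]", ""); // drop quote characters
        s = s.replaceAll("\\s+", " "); // collapse whitespace
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(normalize("Smith, John"));          // john smith
        System.out.println(normalize("John 'Hammmer' Smith")); // john hammmer smith
    }
}
```

The same normalization would have to be applied to query input so that indexed and queried forms match.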
Potential Resource Leak warning in Analyer.createComponents()
Hi, I use a custom analyzer and tokenizer. The analyzer is very basic and merely comprises the method createComponents():

@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
  return new TokenStreamComponents(new KoraTokenizer(reader));
}

Eclipse gives me a "potential resource leak" warning, though, because the tokenizer is never closed. This is clearly true, but closing it here is not desirable either, is it? To get rid of the warning, I had experimentally changed the method to this:

Tokenizer source = new KoraTokenizer(reader);
TokenStreamComponents ts = new TokenStreamComponents(source);
source.close();
return ts;

This yields what I had expected, namely a null TokenStream during analysis. So regarding the results, I think the initial version is right, but I am suspicious about the resource leak warning. How serious is it? Best, Carsten
Re: TokenStreamComponents in Lucene 4.0
Am 19.11.2012 17:44, schrieb Carsten Schnober: Hi, However, after switching to Lucene 4 and TokenStreamComponents, I'm getting a strange behaviour: only the first document in the collection is tokenized properly. The others do appear in the index, but un-tokenized, although I have tried not to change anything in the logic. The Analyzer now has this createComponents() method calling the custom TokenStreamComponents class with my custom Tokenizer. After some debugging, it turns out that the Analyzer method createComponents() is called only once, for the first document. This seems to be the problem: the other documents are just not analyzed. Here's the loop that creates the fields and supposedly calls the analyzer. Does anyone have a hint why this happens only for the first document? The loop itself runs once for every document:

List<de.ids_mannheim.korap.main.Document> documents;
Version lucene_version = Version.LUCENE_40;
Analyzer analyzer = new KoraAnalyzer();
IndexWriterConfig config = new IndexWriterConfig(lucene_version, analyzer);
IndexWriter writer = new IndexWriter(dir, config);
[...]
for (de.ids_mannheim.korap.main.Document doc : documents) {
  luceneDocument = new Document();
  /* Store document name/ID */
  Field idField = new StringField(titleFieldName, doc.getDocid(), Field.Store.YES);
  /* Store tokens */
  String layerFile = layer.getFile();
  Field textFieldAnalyzed = new TextField(textFieldName, layerFile, Field.Store.YES);
  luceneDocument.add(textFieldAnalyzed);
  luceneDocument.add(idField);
  try {
    writer.addDocument(luceneDocument);
  } catch (IOException e) {
    jlog.error("Error adding document " + doc.getDocid() + ":\n" + e.getLocalizedMessage());
  }
}
[...]
writer.close();

The class de.ids_mannheim.korap.main.Document defines our own document objects from which the relevant information can be read, as shown in the loop. The list 'documents' is filled in an intermediately called method.
Best, Carsten
Re: TokenStreamComponents in Lucene 4.0
Am 20.11.2012 10:22, schrieb Uwe Schindler: Hi, the createComponents() method of Analyzers is only called *once* for each thread, and the TokenStream is *reused* for later documents. The Analyzer will call the final method Tokenizer#setReader() to notify the Tokenizer of a new Reader (this method will update the protected input field in the Tokenizer base class), and then it will reset() the whole tokenization chain. The custom TokenStream components must initialize themselves with the new settings in the reset() method. Thanks, Uwe! I think what changed in comparison to Lucene 3.6 is that reset() is called upon initialization, too, instead of only after processing the first document, right? Apart from the fact that it used not to be obligatory to make all components reusable, I suppose. Best, Carsten
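The reuse contract Uwe describes can be modelled without Lucene classes. In this self-contained sketch (hypothetical class and method names, mirroring but not using the Tokenizer API), the "tokenizer" is created once; for each new document the framework calls setReader() and then reset(), and all per-document state is reinitialized in reset(), never only in the constructor.

```java
// Minimal model of the analyzer-reuse contract: one instance, many documents.
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ReusableTokenizerModel {
    private Reader input;
    private int pos; // per-document state

    void setReader(Reader reader) { this.input = reader; }

    void reset() { this.pos = 0; } // reinitialize here, once per document

    String readAll() {
        try {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = input.read()) != -1) { sb.append((char) c); pos++; }
            return sb.toString();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        ReusableTokenizerModel t = new ReusableTokenizerModel();
        for (String doc : new String[] { "first doc", "second doc" }) {
            t.setReader(new StringReader(doc)); // framework-style reuse
            t.reset();
            System.out.println(t.readAll());
        }
    }
}
```

A component that initializes its state only in the constructor works for the first document and silently misbehaves for all later ones, which matches the symptom in this thread.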
TokenStreamComponents in Lucene 4.0
Hi, I have recently updated to Lucene 4.0, but I am having problems with my custom Analyzer/Tokenizer. In the days of Lucene 3.6, it would work like this: 0. define constants lucene_version and indexdir 1. create an Analyzer: analyzer = new KoraAnalyzer() (our custom Analyzer) 2. create an IndexWriterConfig: config = new IndexWriterConfig(lucene_version, analyzer) 3. create an IndexWriter: writer = new IndexWriter(indexdir, config) 4. for each document: 4.1. create a Document: Document doc = new Document() 4.2. create a Field: Field field = new Field("text", layerFile, Field.Store.YES, Field.Index.ANALYZED_NO_NORMS, Field.TermVector.WITH_POSITIONS_OFFSETS) 4.3. add the field to the document: doc.add(field) 4.4. add the document to the writer: writer.addDocument(doc) 5. close the writer (write to disk) However, after switching to Lucene 4 and TokenStreamComponents, I'm getting a strange behaviour: only the first document in the collection is tokenized properly. The others do appear in the index, but un-tokenized, although I have tried not to change anything in the logic. The Analyzer now has this createComponents() method calling the custom TokenStreamComponents class with my custom Tokenizer:

@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
  final Tokenizer source = new KoraTokenizer(reader);
  final TokenStreamComponents tokenstream = new KoraTokenStreamComponents(source);
  try {
    source.close();
  } catch (IOException e) {
    jlog.error(e.getLocalizedMessage());
    e.printStackTrace();
  }
  return tokenstream;
}

The custom TokenStreamComponents class uses this constructor:

public KoraTokenStreamComponents(Tokenizer tokenizer) {
  super(tokenizer);
  try {
    tokenizer.reset();
  } catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
  }
}

Since I have not changed anything in the Tokenizer, I suspect the error to be in the new class KoraTokenStreamComponents.
This may be due to the fact that I do not fully understand why the TokenStreamComponents class has been introduced. Any hints on that? Thanks! Carsten
Re: TokenStreamComponents in Lucene 4.0
Am 19.11.2012 17:44, schrieb Carsten Schnober: Hi again, just a little update: However, after switching to Lucene 4 and TokenStreamComponents, I'm getting a strange behaviour: only the first document in the collection is tokenized properly. The others do appear in the index, but un-tokenized, although I have tried not to change anything in the logic. The Analyzer now has this createComponents() method calling the custom TokenStreamComponents class with my custom Tokenizer:

@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
  final Tokenizer source = new KoraTokenizer(reader);
  final TokenStreamComponents tokenstream = new KoraTokenStreamComponents(source);
  try {
    source.close();
  } catch (IOException e) {
    jlog.error(e.getLocalizedMessage());
    e.printStackTrace();
  }
  return tokenstream;
}

When using the packaged Analyzer.TokenStreamComponents class instead of my custom KoraTokenStreamComponents class, the behaviour does not seem to change:

- final TokenStreamComponents tokenstream = new KoraTokenStreamComponents(source);
+ final TokenStreamComponents tokenstream = new TokenStreamComponents(source);

Best, Carsten
Re: SpanQuery, Filter, BooleanQuery
Am 29.10.2012 13:40, schrieb Carsten Schnober: Now, I'd like to add the option to filter the resulting Spans object by another WildcardQuery on a different field that contains document titles. My intuitive approach would have been to use a filter like this: I'd like to conclude my previous post in a less elaborate way. I need to either a) combine two WildcardQueries so that I can still use SpanMultiTermQueryWrapper to generate a SpanQuery, or b) apply a filter to a WildcardQuery so that the WildcardQuery's results are reduced before converting it to a SpanQuery using SpanMultiTermQueryWrapper. Option b) intuitively seems the way to go, but I can't quite find the correct path, because the filter does not work as intended (see my previous post). Option a) does not seem feasible here either, because SpanMultiTermQueryWrapper requires a MultiTermQuery, not a BooleanQuery. Any hints on that? Best, Carsten
SpanQuery, Filter, BooleanQuery
Hi, I've got a setup in which I would like to perform an arbitrary query over one field (typically realised through a WildcardQuery), with the matches returned as Spans, because the result payloads are further processed using Spans.next() and Spans.getPayload(). This works fine with the following code (extract), using Lucene 4.0.0:

// these fields are initialized externally through public methods:
private final MultiReader reader;
private final String termString;
private final String fieldname;
private final int maxHits;
private Map<Term, TermContext> termContexts = new HashMap<Term, TermContext>();

WildcardQuery wildcard;
Term term = new Term(fieldname, termString);
SpanQuery query; // Lucene query
Spans luceneSpans;
wildcard = new WildcardQuery(term);
query = (SpanQuery) new SpanMultiTermQueryWrapper(wildcard).rewrite(reader);
for (AtomicReaderContext atomic : reader.getContext().leaves()) {
  luceneSpans = query.getSpans(atomic, matchingTitleIDs.bits(), termContexts);
  while (luceneSpans.next() && total < maxHits) {
    ...
  }
}

Now, I'd like to add the option to filter the resulting Spans object by another WildcardQuery on a different field that contains document titles. My intuitive approach would have been to use a filter like this:

Filter filter = new QueryWrapperFilter(new WildcardQuery(new Term(titlefield, titles)));

The filter is applied in a dedicated method with this line:

DocIdSet matchingTitleIDs = filter.getDocIdSet(context, new Bits.MatchAllBits(0));

And subsequently, the getSpans() call from above is substituted by:

luceneSpans = query.getSpans(atomic, matchingTitleIDs.bits(), termContexts);

However, this either yields a NullPointerException when there are no hits, or does not affect the results at all compared to no filtering.
I've come across the thread "lucene-4.0: QueryWrapperFilter docBase" [1], in which Uwe suggests not using QueryWrapperFilter in such a scenario, but using another Query and combining the two with a BooleanQuery, if I understand correctly. Does this still apply to Lucene 4.0? However, I am not sure how to use a BooleanQuery here, because I need the SpanQuery result. Any thoughts about what I'm doing wrong and how to fix this? Thank you very much! Carsten [1] http://mail-archives.apache.org/mod_mbox/lucene-java-user/201210.mbox/%3CCABY_-Z7r=z0301yf1-1uvbqyw3jf48srpuhe6syt1eh28vn...@mail.gmail.com%3E
Lucene in Corpus Linguistics
Hi, in case someone is interested in an application of the Lucene indexing engine in the field of corpus linguistics rather than information retrieval: we have worked on this subject for some time and have recently published a conference paper about it: http://korap.ids-mannheim.de/2012/09/konvens-proceedings-online/ Central issues addressed in this work have been externally produced and concurrent tokenizations as well as multiple linguistic annotations on different levels.
Best,
Carsten
UnsupportedOperationException: Query should have been rewritten
Dear list, I am trying to combine a WildcardQuery and a SpanQuery because I need to extract spans from the index for further processing. I realise that there have been a few public discussions about this topic, but I still fail to see what I am missing here. My code is this (Lucene 3.6.0):

==
WildcardQuery wildcard = new WildcardQuery(new Term(field, "bro*"));
SpanQuery query = new SpanMultiTermQueryWrapper<WildcardQuery>(wildcard);
// query = query.rewrite(reader);
Spans luceneSpans = query.getSpans(reader);
==

This throws the following exception:

==
Exception in thread "main" java.lang.UnsupportedOperationException: Query should have been rewritten
at org.apache.lucene.search.spans.SpanMultiTermQueryWrapper.getSpans(SpanMultiTermQueryWrapper.java:114)
==

I am basically aware of the problem that I cannot apply a MultiTermQuery instance (like a WildcardQuery) without calling rewrite(), but on the other hand, rewrite() returns a Query object that I cannot use as a SpanQuery instance. I'm almost sure that there is a reasonable solution to this problem that I am simply not able to spot. Or do I have to either migrate to Lucene 4 or use a SpanRegexQuery instead, which I would rather not do because it is marked as deprecated? Thank you very much!
Carsten
Re: UnsupportedOperationException: Query should have been rewritten
Am 14.08.2012 11:00, schrieb Uwe Schindler: "You have to rewrite the wrapper query."

Thanks, Uwe! I had tried that, but it failed because the rewrite() method returns a Query (not a SpanQuery) object. A cast seems to solve the problem; I'm re-posting the code snippet to the list for the sake of completeness:

WildcardQuery wildcard = new WildcardQuery(new Term(field, "bro*"));
SpanQuery query = (SpanQuery) new SpanMultiTermQueryWrapper<WildcardQuery>(wildcard).rewrite(reader);
Spans spans = query.getSpans(reader);

All I am still wondering about is whether this cast is totally safe, i.e. robust to all kinds of variable search terms.
Best,
Carsten
Re: Small Vocabulary
Am 06.08.2012 20:29, schrieb Mike Sokolov: "There was some interesting work done on optimizing queries including very common words (stop words) that I think overlaps with your problem. See this blog post http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2 from the Hathi Trust. The upshot in a nutshell was that queries including terms with very large postings lists (i.e. high occurrences) were slow, and the approach they took to dealing with this was to index n-grams (i.e. pairs and triplets of adjacent tokens). However, I'm not sure this would help much if your queries will typically include only a single token."

Hi Mike, this is very interesting for our use case indeed. However, you are right that indexing n-grams is not (per se) a solution to my given problem, because I'm working on an application using multiple indexes. A query for one isolated frequent term will presumably be rare, or at least rare enough to tolerate slow response times, but the results will typically be intersected with results from other indexes. To illustrate this more practically: the index I described as having relatively few distinct and partially extremely frequent tokens indexes part-of-speech (POS) tags, with positional information stored in the payload. A parallel index indexes the actual text; a typical query may look for a certain POS tag in one index and a word X at the same position with a matching payload in the other index. So both indexes need to be queried completely before the intersection can be performed.
Best,
Carsten
Re: Small Vocabulary
Am 07.08.2012 10:20, schrieb Danil ŢORIN: "If you do intersection (not join), maybe it makes sense to put everything into one index?"

Hi Danil, just a note on that: my application performs both intersections and joins (unions) on the results, depending on the query. So the index structure has to be ready for both, but intersections are clearly more complicated.

"Just transform your input like 'brown fox' into 'ADJ:brown|<your payload> NOUN:fox|<other payload>'"

I understand that this denotes ADJ and NOUN to be interpreted as the actual tokens, and "brown" and "fox" as payloads (followed by the other payload), right? This is a very neat approach and I have vaguely considered it. One problem is that I aim for a very high level of flexibility, meaning that additional annotations have to be addable at any point and different tokenizations apply. However, I will re-consider your suggestion, possibly applying one of multiple tokenizations as a default in this sense.

"Of course I'm not aware of all the details, so my solution might not be applicable to your project. Maybe you could share more details, so this won't transform into an XY problem. Keep in mind: always optimize your index for the query use case, instead of blindly processing the input data."

Thanks for that reminder; this becomes quite difficult in my scenario, though, since we want to allow for flexible changes in the index types, representing different annotations, tokenization logics, etc.
Best,
Carsten
Re: Small Vocabulary
Hi Danil,

"Just transform your input like 'brown fox' into 'ADJ:brown|<your payload> NOUN:fox|<other payload>'" -- "I understand that this denotes ADJ and NOUN to be interpreted as the actual tokens, and 'brown' and 'fox' as payloads (followed by the other payload), right?"

Sorry for replying to myself, but I've only now realised that you probably meant to replace the full token string ("brown") by "ADJ:brown" and use the payload otherwise, right? Regarding incoming queries, this method makes it necessary to perform a wildcard query (e.g. "NOUN:*") when I am not interested in the actual text ("brown") -- which may happen more or less frequently -- am I right? However, this might be an acceptable trade-off...
Best regards,
Carsten
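For concreteness, the combined-token scheme discussed above can be sketched in a few lines of plain Java (no Lucene classes; the string format is my reading of Danil's suggestion): the POS tag and the surface token are fused into a single indexable term such as "NOUN:fox", so a prefix pattern like "NOUN:*" selects all nouns regardless of the text.

```java
public class CombinedToken {
    // Fuse a POS tag and surface token into one indexable term string,
    // e.g. combine("ADJ", "brown") -> "ADJ:brown".
    static String combine(String tag, String token) {
        return tag + ":" + token;
    }

    // Split a combined term back into {tag, token} for result display.
    static String[] split(String term) {
        int i = term.indexOf(':');
        return new String[] { term.substring(0, i), term.substring(i + 1) };
    }

    public static void main(String[] args) {
        String term = combine("NOUN", "fox");
        System.out.println(term);            // NOUN:fox
        System.out.println(split(term)[1]);  // fox
    }
}
```

In Lucene terms, the trade-off mentioned above could be softened: a query for a tag alone ("NOUN:*") is a prefix pattern, and a PrefixQuery is generally cheaper than an arbitrary WildcardQuery, since only one term-dictionary range scan is needed.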
Re: Small Vocabulary
Am 31.07.2012 12:10, schrieb Ian Lea: "Lucene 4.0 allows you to use custom codecs and there may be one that would be better for this sort of data, or you could write one. In your tests, is it the searching that is slow or are you reading lots of data for lots of docs? The latter is always likely to be slow. General performance advice as in http://wiki.apache.org/lucene-java/ImproveSearchingSpeed may be relevant. SSDs and loads of RAM never hurt."

Hi Ian, you are very right, there are many results from many docs for the slower searches performed on that index. However, I am still wondering about the theoretical implications: a small vocabulary with many tokens in an inverted index yields rather long lists of occurrences for some/many/all (depending on the actual distribution) of the search terms. Thanks for your pointer to the codecs in Lucene 4; I suppose that this will be the actual point of attack for this scenario. It may be a silly question, but one that might be of interest for the whole community ;-): can someone point me to an in-depth documentation of Lucene 4 codecs, ideally covering both theoretical background and implementation? There are numerous helpful blog entries, presentations, etc. available on the net, but if there is some central resource, I have not been able to find it anywhere. Thanks!
Best regards,
Carsten
Small Vocabulary
Dear list, I'm considering using Lucene for indexing sequences of part-of-speech (POS) tags instead of words; for those who don't know, POS tags are linguistically motivated labels that are assigned to tokens (words) to describe their morpho-syntactic functions. Instead of sequences of words, I would like to index sequences of tags, for instance "ART ADV ADJA NN". The aim is to be able to search (efficiently) for occurrences of, e.g., "ADJA". The question is whether Lucene can deal with such data cleverly, because the statistical properties of such pseudo-texts are very distinct from natural-language texts and make me wonder whether Lucene's inverted indexes are suitable. Especially the small vocabulary size (50 distinct tokens, depending on the tagging system) is problematic, I suppose. First trials, for which I have implemented an analyzer that just outputs Lucene tokens such as ART, ADV, ADJA, etc., yield results that are not exactly perfect regarding search performance, in a test corpus with a few million tokens. The number of tokens in production mode is expected to be much larger, so I wonder whether this approach is promising at all. Does Lucene (4.0?) provide optimization techniques for extremely small vocabulary sizes?
Thank you very much,
Carsten Schnober
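The concern can be made concrete with a toy simulation (plain Java, no Lucene; the uniform random distribution is an assumption for illustration): with V distinct tags over N tokens, each postings list grows to roughly N/V entries, so any single-term query touches a large fraction of the whole index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class SmallVocabularySketch {
    // Build a toy inverted index over randomly distributed POS tags and
    // return the postings list (list of token positions) per tag.
    static Map<String, List<Integer>> buildPostings(String[] vocab, int nTokens, long seed) {
        Random rnd = new Random(seed);
        Map<String, List<Integer>> postings = new HashMap<>();
        for (int pos = 0; pos < nTokens; pos++) {
            String tag = vocab[rnd.nextInt(vocab.length)];
            postings.computeIfAbsent(tag, k -> new ArrayList<>()).add(pos);
        }
        return postings;
    }

    public static void main(String[] args) {
        String[] vocab = {"ART", "ADV", "ADJA", "NN"};
        Map<String, List<Integer>> postings = buildPostings(vocab, 100_000, 42L);
        // With only 4 "terms", each postings list holds roughly a quarter of
        // all positions; with 50 tags it would still be ~2% of a corpus that
        // may contain billions of tokens.
        for (String tag : vocab) {
            System.out.println(tag + ": " + postings.get(tag).size() + " postings");
        }
    }
}
```

Contrast this with natural-language text, where Zipf-distributed vocabularies keep most postings lists short; here every list is long, which is exactly the situation the stop-word optimization literature addresses.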
Re: Offsets in 3.6/4.0
Am 16.07.2012 13:07, schrieb karsten-s...@gmx.de: "Abstract of your post: you need the offset to perform your search/ranking, like the position is needed for phrase queries. You are using reader.getTermFreqVector to get the offset. This is too slow for your application and you think about a switch to version 4.0."

Dear Karsten, yes, that's about it.

"Imho you should use payloads. You could also switch to version 4, because in version 4 you can store the offset for each term like the position in version 3.x. But this is basically the same as the use of payloads:
* http://lucene.apache.org/core/3_6_0/fileformats.html#Positions
* http://lucene.apache.org/core/4_0_0-ALPHA/core/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.html#Positions"

I now use payloads and this fulfils my functional requirements. I was hoping to avoid that because I am also storing other information in the payload, which makes it feel a bit messy; it had seemed sensible to me to actually make use of the offsets field as it already exists. Anyway, the problem is solved so far, thank you very much! I still wonder what the purpose of the offsets field is, as it is so inefficient to access. It seems like wasteful redundancy to even store the offsets during indexing, considering that I also store them as payloads. Or am I missing something?
Best,
Carsten
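For reference, packing character offsets into a payload is a small amount of code; below is a minimal stdlib-only sketch of the encoding side (class and method names are made up; in an actual Lucene tokenizer one would wrap the resulting byte[] in a BytesRef and hand it to PayloadAttribute.setPayload()):

```java
import java.nio.ByteBuffer;

public class OffsetPayload {
    // Pack start and end character offsets into an 8-byte payload,
    // big-endian, 4 bytes per int.
    static byte[] encode(int start, int end) {
        return ByteBuffer.allocate(8).putInt(start).putInt(end).array();
    }

    // Unpack a payload produced by encode() back into {start, end}.
    static int[] decode(byte[] payload) {
        ByteBuffer buf = ByteBuffer.wrap(payload);
        return new int[] { buf.getInt(), buf.getInt() };
    }

    public static void main(String[] args) {
        byte[] p = encode(17, 25);
        int[] offsets = decode(p);
        System.out.println(offsets[0] + "-" + offsets[1]);  // 17-25
    }
}
```

If other information shares the payload, as described above, a fixed-width prefix like this keeps the offsets cheap to slice off without parsing the rest.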
Offsets in 3.6/4.0
Dear list, I am working on a search application that depends on retrieving offsets for each match. Currently (in Lucene 3.6), this seems to be overly costly, at least in my solution, which looks like this:

---
TermPositionVector tfv;
int index;
TermVectorOffsetInfo[] offsets;

tfv = (TermPositionVector) reader.getTermFreqVector(docid, fieldname);
index = tfv.indexOf(term.text());
offsets = tfv.getOffsets(index);
---

So I can use the suitable TermVectorOffsetInfo from the offsets[] array to retrieve the offset information of a span. However, this slows down the search to an unacceptable level. Reviewing the thread 'Retrieving Offsets' (http://lucene.472066.n3.nabble.com/Retrieving-offsets-td3658238.html) indicates that there has not been any more efficient way in Lucene 3.6. Am I right? However, I understand that the patch LUCENE-3684 (https://issues.apache.org/jira/browse/LUCENE-3684) has improved the situation. I am now wondering whether migrating to Lucene 4.0 is worth it in terms of search performance. It is currently not entirely clear to me whether Lucene 4.0 alpha actually allows the retrieval of offsets from an index without having to read the term frequency vector. Who can give me some advice about the potential search performance gain for such an application, and ideally some pointers about how to resolve the problem?
Thank you very much,
Carsten Schnober
Re: Field value vs TokenStream
Am 18.04.2012 20:06, schrieb Uwe Schindler: "Hi, you should inform yourself about the difference between stored and indexed fields. The tokens in the .tis file are in fact the analyzed tokens retrieved from the TokenStream. This is controlled by the field parameter Field.Index. The Field.Store parameter has nothing to do with indexing: if a field is marked as stored, the full and unchanged string / binary is stored in the stored fields file (.fdt). Stored fields are used [...]"

Thanks for that clarification!
Best,
Carsten

--
Carsten Schnober
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP -- Korpusanalyseplattform der nächsten Generation
http://korap.ids-mannheim.de/ | Tel.: +49-(0)621-1581-238
Field value vs TokenStream
Dear list, I'm studying the Lucene index file formats and I wonder: after initializing a field with Field(String name, String value, Field.Store store, Field.Index index), where is the value string stored? I understand that the chosen analyzer does its processing on that value, including tokenization, and returns a TokenStream from which the indexer retrieves the attributes that it stores in the index. When I use a binary editor to inspect the term infos (.tis) file in the index directory, I can see every single token (term). For experimenting purposes, I implemented an analyzer that converts the value input to the field and noticed the following: the TokenStream still correctly generates the terms that end up being stored in the .tis file, but the initial input value is still displayed as the field value when I retrieve a document from the index and output it with Document.toString(). I tried to analyse the field's token stream, but tokenStreamValue() returns null; is that normal when retrieving a document from an existing index? Can someone let me know what happens to a field's value string, and at which point in the pipeline it is replaced by the (term) attributes generated by the TokenStream? Thank you very much!
Best,
Carsten
Indexing Pre-analyzed Field
Hi, I've been wondering about the best way to index a pre-analyzed field. By pre-analyzed, I mean essentially one that I'd like to initialize with the constructor Field(String name, TokenStream tokenStream). There is a loop over a number of documents, all with pre-defined tokenizations stored in the variable tokenizations. One by one, the Lucene documents are added to the writer, an IndexWriter object that has been initialized and configured beforehand. I have implemented a custom TokenStream class for that purpose, so I've approached the problem like the following:

CustomTokenStream ts = new CustomTokenStream();
for (Tokenization tokenization : tokenizations) {
    idField = new Field("id", doc.getDocid(), Field.Store.YES, Field.Index.NOT_ANALYZED);
    ts.setTokenization(tokenization);
    textField = new Field("text", ts);
    luceneDocument.add(idField);
    luceneDocument.add(textField);
    try {
        writer.addDocument(luceneDocument);
    } catch (IOException e) {
        System.err.println("Error adding document:\n" + e.getLocalizedMessage());
    }
}

The problem is that, this way, I apparently cannot query the text field, can I? I've tried other ways, though, like initializing the text field with textField = new Field(String name, String value, Field.Store.YES, Field.Index.ANALYZED) and setting textField.setTokenStream(ts). However, this does not seem to make sense, since I don't want to use a Lucene built-in analyzer, and I'm not quite clear about what I should use for the value in the latter approach. Any help is very welcome! Thank you very much!
Best regards,
Carsten
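The shape of such a pre-analyzed stream can be sketched without Lucene: a class that mimics TokenStream.incrementToken() by stepping through precomputed tokens instead of analyzing text. All names below are made up for illustration; a real implementation would extend Lucene's TokenStream and expose the values through CharTermAttribute and OffsetAttribute rather than a current() accessor.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class PreAnalyzedStream {
    // A token with a precomputed surface form and character offsets.
    static class Token {
        final String term;
        final int start;
        final int end;
        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    private final Iterator<Token> it;
    private Token current;

    PreAnalyzedStream(List<Token> tokens) {
        this.it = tokens.iterator();
    }

    // Mimics TokenStream.incrementToken(): advance to the next precomputed
    // token; return false when the stream is exhausted.
    boolean incrementToken() {
        if (!it.hasNext()) return false;
        current = it.next();
        return true;
    }

    Token current() {
        return current;
    }

    public static void main(String[] args) {
        PreAnalyzedStream ts = new PreAnalyzedStream(Arrays.asList(
                new Token("the", 0, 3), new Token("fox", 4, 7)));
        while (ts.incrementToken()) {
            Token t = ts.current();
            System.out.println(t.term + " [" + t.start + "," + t.end + "]");
        }
    }
}
```

The point of the sketch is that nothing about the consumption loop cares where the tokens came from; the indexer pulls whatever the stream yields, which is why a Field built from such a stream is indexed with exactly the pre-defined tokenization.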
Apply custom tokenization
Dear list, I have a quite specific issue on which I would very much appreciate some thoughts before I start the actual implementation. Here's my task description: I would like to index corpora that have already been tokenized by an external tokenizer. This tokenization is stored in an external file and is the one I want to use for the Lucene index too. For each document, there is a file that describes each token in the document by character offsets, e.g. <token start="0" end="3" />. Leave aside the XML format; I'll write an appropriate XML parser so that we just have that tokenization information. I do not want to do any additional analysis on the input text, i.e. no stopword filtering etc.; each token that is specified in the external tokenization is supposed to result in an indexed token. My approach to achieving this goal would be to implement an Analyzer that reads the external tokenization information and generates a TokenStream containing all the Token objects with offsets set according to the external tokenization, i.e. without an own Tokenizer implementation. I'm working with Lucene 3.5, which is why one very concrete question at this point is: how would you implement this using the Attribute interface; should I still use Token objects, or can/should I work around them altogether? The documentation is quite vague on that point, and so is the Lucene in Action (2nd ed.) textbook. The background is that I need to allow different tokenizations, so there will potentially be multiple indexes for a text. Queries will have to be tokenized by a user-defined tokenizer, and the suitable index will then be searched. So what are your thoughts about this approach? Is it the right strategy for the task? Please recall that it is a given that the tokenization has to be read from an external file. In general, I am afraid that Lucene almost hardwires the analysis process. Even though it does allow custom tokenizers to be implemented, it does not seem to be intended that one comes up with a completely self-made text analysis process, is it? Thank you very much!
Carsten
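Reading the external tokenization file described above is plain XML work; a stdlib-only sketch of extracting the (start, end) pairs from elements like <token start="0" end="3" /> might look as follows (the element and attribute names are taken from the example in the post; the class name is made up):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class TokenOffsetParser {
    // Parse all <token start=".." end=".."/> elements from an XML string
    // into a list of {start, end} offset pairs, in document order.
    static List<int[]> parse(String xml) throws Exception {
        NodeList tokens = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getElementsByTagName("token");
        List<int[]> offsets = new ArrayList<>();
        for (int i = 0; i < tokens.getLength(); i++) {
            Element e = (Element) tokens.item(i);
            offsets.add(new int[] {
                    Integer.parseInt(e.getAttribute("start")),
                    Integer.parseInt(e.getAttribute("end")) });
        }
        return offsets;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<tokens><token start=\"0\" end=\"3\"/>"
                   + "<token start=\"4\" end=\"7\"/></tokens>";
        for (int[] o : parse(xml)) {
            System.out.println(o[0] + "-" + o[1]);
        }
    }
}
```

Such a list of offset pairs, together with the raw document text, is all the custom Tokenizer needs to emit tokens with the externally defined boundaries.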