SortingAtomicReader alternate to Tim-Sort...
We were experimenting with SortingMergePolicy and came across an alternative to the TimSort of the postings list, using a FixedBitSet and a GrowableWriter. I have attached the relevant code snippet. It would be nice if someone could clarify whether this is a good idea to implement...

public class SortingAtomicReader {
  ...
  class SortingDocsEnum {

    // The last two parameters, newdoclist and olddocToFreq, are passed in
    // through the constructor. It is assumed that they are initialized when
    // the merge starts and are then reused until the merge completes...
    public SortingDocsEnum(int maxDoc, final DocsEnum in, boolean withFreqs,
                           final Sorter.DocMap docMap, FixedBitSet newdoclist,
                           GrowableWriter olddocToFreq) throws IOException {
      ...
      // Instead of Tim-sorting as in the existing code:
      while ((doc = in.nextDoc()) != NO_MORE_DOCS) {
        int newdoc = docMap.oldToNew(doc);
        newdoclist.set(newdoc);
        if (withFreqs) {
          olddocToFreq.set(doc, in.freq());
        }
      }
      ...
    }

    @Override
    public int nextDoc() throws IOException {
      if (++docIt >= upto) {
        return NO_MORE_DOCS;
      }
      currDoc = newdoclist.nextSetBit(++currDoc);
      if (currDoc == -1) {
        return NO_MORE_DOCS;
      }
      // Clear the set bit here before returning...
      newdoclist.clear(currDoc);
      return currDoc;
    }

    @Override
    public int freq() throws IOException {
      if (withFreqs && docIt < upto) {
        return (int) olddocToFreq.getMutable()
                                 .get(docMap.newToOld(currDoc));
      }
      return 1;
    }
  }
}
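The core idea above can be sketched without any Lucene dependencies: mark each remapped doc ID in a bit set, stash the frequency keyed by the old doc ID, and then scan the set bits in ascending order, so the postings come out sorted without an explicit TimSort. This is a minimal, hypothetical sketch using plain JDK types (java.util.BitSet and int[] stand in for FixedBitSet and GrowableWriter); all names here are illustrative, not the actual patch.

```java
import java.util.BitSet;

public class SortedRemapSketch {

    // Remaps (oldDoc, freq) pairs through oldToNew and returns them in
    // ascending new-doc order by scanning set bits -- no explicit sort.
    static int[][] remap(int[] oldDocs, int[] freqs, int[] oldToNew, int maxDoc) {
        BitSet newDocs = new BitSet(maxDoc);
        int[] oldDocToFreq = new int[maxDoc]; // freq keyed by old doc ID
        int[] newToOld = new int[maxDoc];     // reverse map for freq lookup
        for (int i = 0; i < oldDocs.length; i++) {
            int newDoc = oldToNew[oldDocs[i]];
            newDocs.set(newDoc);
            oldDocToFreq[oldDocs[i]] = freqs[i];
            newToOld[newDoc] = oldDocs[i];
        }
        // Iterating set bits visits new doc IDs in ascending order.
        int[][] out = new int[oldDocs.length][2];
        int k = 0;
        for (int d = newDocs.nextSetBit(0); d >= 0; d = newDocs.nextSetBit(d + 1)) {
            out[k][0] = d;
            out[k][1] = oldDocToFreq[newToOld[d]];
            k++;
        }
        return out;
    }

    public static void main(String[] args) {
        // Postings {0, 2, 3} with freqs {5, 1, 2}, remapped by a hypothetical
        // doc map old->new = {3, 0, 1, 2}.
        int[][] r = remap(new int[]{0, 2, 3}, new int[]{5, 1, 2},
                          new int[]{3, 0, 1, 2}, 4);
        for (int[] p : r) {
            System.out.println("doc=" + p[0] + " freq=" + p[1]);
        }
    }
}
```

The trade-off this illustrates: the scan is O(maxDoc/64) words per postings list rather than O(n log n) comparisons, which pays off for dense lists but not necessarily for sparse ones.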
Span near query with payloads
Hi,

Why did you add this note in SpanPayloadCheckQuery: "Do not use this with a SpanQuery that contains a SpanNearQuery (http://lucene.apache.org/core/4_10_2/core/org/apache/lucene/search/spans/SpanNearQuery.html). Instead, use SpanNearPayloadCheckQuery (http://lucene.apache.org/core/4_10_2/core/org/apache/lucene/search/spans/SpanNearPayloadCheckQuery.html) since it properly handles the fact that payloads aren't ordered by SpanNearQuery (http://lucene.apache.org/core/4_10_2/core/org/apache/lucene/search/spans/SpanNearQuery.html)."?

I used SpanNearQuery and provided, as its clauses[], an array of SpanPayloadCheckQuery, and it seems to work OK. What I am trying to do is search for terms in a certain window, where each term has its own payload. I use the payload to encode part of speech. For example:

Doc1: I booked a flight. (booked is a verb, flight is a noun)
Doc2: I read the book the Hobbit during the flight to London. (book is a noun, flight is a noun)

When I searched for "book flight" (with verb and noun as payloads, respectively), it matched the correct window properly (depending on whether I removed stopwords or not).

So what did you mean by this note?
Thank you,
Shay Hummel
[ANNOUNCE] Apache Lucene 5.1.0 released
14 April 2015 - The Lucene PMC is pleased to announce the release of Apache Lucene 5.1.0. The release is available for immediate download at: http://www.apache.org/dyn/closer.cgi/lucene/java/5.1.0 Lucene 5.1.0 includes 9 new features, 10 bug fixes, and 24 optimizations / other changes from 18 unique contributors. For detailed information about what is included in the 5.1.0 release, please see: http://lucene.apache.org/core/5_1_0/changes/Changes.html Enjoy! - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Text dependent analyzer
Hi,

I would like to create a text-dependent analyzer. That is, given a string, the analyzer will:
1. Read the entire text and break it into sentences.
2. Each sentence will then be tokenized, have possessives removed, be lowercased, have terms marked, and be stemmed.

The second part is essentially what happens in the English analyzer (createComponents). However, that part is not dependent on the text it receives, which is the first part of what I am trying to do. So ... how can this be achieved?
Thank you,
Shay Hummel
Re: Text dependent analyzer
Hi Hummel,

You can perform sentence detection outside of Solr, using OpenNLP for instance, and then feed the sentences to Solr.
https://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.sentdetect

Ahmet

On Tuesday, April 14, 2015 8:12 PM, Shay Hummel shay.hum...@gmail.com wrote:

Hi,
I would like to create a text-dependent analyzer. That is, given a string, the analyzer will:
1. Read the entire text and break it into sentences.
2. Each sentence will then be tokenized, have possessives removed, be lowercased, have terms marked, and be stemmed.
The second part is essentially what happens in the English analyzer (createComponents). However, that part is not dependent on the text it receives, which is the first part of what I am trying to do. So ... how can this be achieved?
Thank you,
Shay Hummel
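For the sentence-detection pass specifically, the JDK's java.text.BreakIterator is a lighter-weight alternative to OpenNLP that may suffice when no trained model is needed. A minimal sketch (class and method names are illustrative, not from any Lucene API); each returned sentence could then be fed through the usual per-sentence analysis chain:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitSketch {

    // Splits text into sentences using the JDK's locale-aware sentence
    // BreakIterator; no external NLP library required.
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
             start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) {
                out.add(s);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String s : sentences("I booked a flight. I read the book the Hobbit.")) {
            System.out.println(s);
        }
    }
}
```

Note that BreakIterator uses simple locale rules (periods, abbreviations are handled only roughly), so OpenNLP's trained model is likely more accurate on messy text.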