SortingAtomicReader alternate to Tim-Sort...

2015-04-14 Thread Ravikumar Govindarajan
We were experimenting with SortingMergePolicy and came across an alternative
to the TimSort of the postings list, using a FixedBitSet (FBS) and a
GrowableWriter.

I have attached the relevant code snippet. It would be nice if someone could
clarify whether it is a good idea to implement...

public class SortingAtomicReader {
…
…
class SortingDocsEnum {

  // The last two parameters, newdoclist and olddocToFreq, are added to the
  // constructor. It is assumed that these two variables are initialized at
  // merge start and are then re-used until the merge completes...

  public SortingDocsEnum(int maxDoc, final DocsEnum in, boolean withFreqs,
      final Sorter.DocMap docMap, FixedBitSet newdoclist,
      GrowableWriter olddocToFreq) throws IOException {

    ….

    …

    while (true) {
      // Instead of Tim-sorting as in the existing code
      doc = in.nextDoc();
      if (doc == NO_MORE_DOCS) { // assumed termination check once the wrapped enum is exhausted
        break;
      }
      int newdoc = docMap.oldToNew(doc);
      newdoclist.set(newdoc);
      if (withFreqs) {
        olddocToFreq.set(doc, in.freq());
      }
    }
  }


  @Override
  public int nextDoc() throws IOException {
    if (++docIt >= upto) {
      return NO_MORE_DOCS;
    }
    currDoc = newdoclist.nextSetBit(++currDoc);
    if (currDoc == -1) {
      return NO_MORE_DOCS;
    }
    // clear the set bit here before returning...
    newdoclist.clear(currDoc);
    return currDoc;
  }


  @Override
  public int freq() throws IOException {
    if (withFreqs && docIt < upto) {
      return (int) olddocToFreq.getMutable()
                               .get(docMap.newToOld(currDoc));
    }
    return 1;
  }
}
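
A minimal sketch (not from the original mail) of how the two reusable
structures could be allocated once at merge start and handed to the enum
above, assuming the Lucene 4.x utility classes FixedBitSet and GrowableWriter;
the MergeScratch class and its names are purely illustrative:

import org.apache.lucene.util.FixedBitSet;
import org.apache.lucene.util.packed.GrowableWriter;
import org.apache.lucene.util.packed.PackedInts;

class MergeScratch {
  final FixedBitSet newdoclist;      // one bit per doc of the merged segment
  final GrowableWriter olddocToFreq; // freq per old docID, widens as needed

  MergeScratch(int maxDoc) {
    newdoclist = new FixedBitSet(maxDoc);
    // start at 1 bit per value; GrowableWriter grows automatically on set()
    olddocToFreq = new GrowableWriter(1, maxDoc, PackedInts.COMPACT);
  }
}

Since nextDoc() above clears each bit before returning it, the same
FixedBitSet should be re-usable for the next term without a full reset, which
seems to be the point of the re-use comment in the constructor.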


Span near query with payloads

2015-04-14 Thread Shay Hummel
Hi

Why did you add this note to SpanPayloadCheckQuery?

  "Do not use this with a SpanQuery that contains a SpanNearQuery
  (http://lucene.apache.org/core/4_10_2/core/org/apache/lucene/search/spans/SpanNearQuery.html).
  Instead, use SpanNearPayloadCheckQuery
  (http://lucene.apache.org/core/4_10_2/core/org/apache/lucene/search/spans/SpanNearPayloadCheckQuery.html),
  since it properly handles the fact that payloads aren't ordered by
  SpanNearQuery."

I used SpanNearQuery and provided, as its clauses[], an array of
SpanPayloadCheckQuery, and it seems to work OK.

What I am trying to do is search for terms within a certain window, where each
term has its own payload. I use the payload to encode part of speech.

For example:
Doc1: I booked a flight. (booked is a verb, flight is a noun)
Doc2: I read the book the Hobbit during the flight to London. (book is a
noun, flight is a noun)

When I searched for "book flight" (with verb and noun as payloads,
respectively), it matched the correct window properly (depending on whether I
removed stopwords or not).
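
For concreteness, a minimal sketch (not from the original mail) of the query
construction described above, assuming the Lucene 4.10 span APIs and that the
part-of-speech payloads were indexed as the UTF-8 bytes VERB / NOUN (the
actual bytes depend on the payload filter used at index time):

import java.nio.charset.StandardCharsets;
import java.util.Collections;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanPayloadCheckQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class PosWindowQuery {

  // wrap a single term so it only matches when it carries the given payload
  static SpanQuery termWithPos(String field, String term, String pos) {
    return new SpanPayloadCheckQuery(
        new SpanTermQuery(new Term(field, term)),
        Collections.singletonList(pos.getBytes(StandardCharsets.UTF_8)));
  }

  public static void main(String[] args) {
    SpanQuery book = termWithPos("body", "book", "VERB");
    SpanQuery flight = termWithPos("body", "flight", "NOUN");
    // slop = 3: both terms must occur within a small window, in any order
    SpanNearQuery query = new SpanNearQuery(
        new SpanQuery[] { book, flight }, 3, false);
    System.out.println(query);
  }
}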

So what did you mean by this note?

Thank you,

Shay Hummel


[ANNOUNCE] Apache Lucene 5.1.0 released

2015-04-14 Thread Timothy Potter
14 April 2015 - The Lucene PMC is pleased to announce the release of
Apache Lucene 5.1.0

The release is available for immediate download at:
http://www.apache.org/dyn/closer.cgi/lucene/java/5.1.0

Lucene 5.1.0 includes 9 new features, 10 bug fixes, and 24 optimizations / other
changes from 18 unique contributors.

For detailed information about what is included in the 5.1.0 release,
please see: http://lucene.apache.org/core/5_1_0/changes/Changes.html

Enjoy!




Text dependent analyzer

2015-04-14 Thread Shay Hummel
Hi
I would like to create a text-dependent analyzer.
That is, *given a string*, the analyzer will:
1. Read the entire text and break it into sentences.
2. Tokenize each sentence, remove possessives, lowercase, mark terms, and
stem.

The second part is essentially what happens in the English analyzer
(createComponents). However, that part is not dependent on the text it
receives, which is the first part of what I am trying to do.

So ... How can it be achieved?

Thank you,

Shay Hummel
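
A minimal sketch (not from the original mail) of the second part described
above: a custom Analyzer whose createComponents tokenizes, removes
possessives, lowercases, marks protected terms, and stems, roughly mirroring
EnglishAnalyzer. It assumes the Lucene 5.x analysis APIs, and the keyword set
is a made-up example:

import java.util.Arrays;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.en.EnglishPossessiveFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.SetKeywordMarkerFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.util.CharArraySet;

public class SentenceBodyAnalyzer extends Analyzer {

  // terms that should be "marked" and protected from stemming (example only)
  private final CharArraySet keywords =
      new CharArraySet(Arrays.asList("hobbit"), true);

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new StandardTokenizer();
    TokenStream stream = new EnglishPossessiveFilter(source);
    stream = new LowerCaseFilter(stream);
    stream = new SetKeywordMarkerFilter(stream, keywords); // mark terms
    stream = new PorterStemFilter(stream);                 // stem the rest
    return new TokenStreamComponents(source, stream);
  }
}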


Re: Text dependent analyzer

2015-04-14 Thread Ahmet Arslan
Hi Hummel,

You can perform sentence detection outside of Solr, using OpenNLP for
instance, and then feed the sentences to Solr:
https://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.sentdetect

Ahmet
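
A minimal sketch (not from the original thread) of the two-stage flow Ahmet
suggests: detect sentences with OpenNLP outside the analyzer, then run each
sentence through a stock Lucene analyzer. It assumes OpenNLP 1.5.x, Lucene
5.x, and a pre-trained en-sent.bin sentence model on disk (the path is only an
example):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class SentenceAwareAnalysis {
  public static void main(String[] args) throws IOException {
    // 1. Sentence detection, done outside the analyzer with a pre-trained model.
    try (InputStream modelIn = new FileInputStream("en-sent.bin")) {
      SentenceDetectorME detector =
          new SentenceDetectorME(new SentenceModel(modelIn));
      String[] sentences = detector.sentDetect(
          "I read the book the Hobbit during the flight to London. I booked a flight.");

      // 2. Per-sentence tokenization, lowercasing, stemming etc. with the
      //    stock English analyzer.
      try (EnglishAnalyzer analyzer = new EnglishAnalyzer()) {
        for (String sentence : sentences) {
          try (TokenStream ts = analyzer.tokenStream("body", sentence)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
              System.out.println(term.toString());
            }
            ts.end();
          }
        }
      }
    }
  }
}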



