Re: Text dependent analyzer

2015-04-15 Thread Shay Hummel
Hi Ahmet,
Thank you for the reply,
That's exactly what I am doing. At the moment, to index a document, I break
it into sentences, and each sentence is analyzed (lemmatization, stopword
removal, etc.).
Now, what I am looking for is a way to create an analyzer (a class which
extends Lucene's Analyzer). This analyzer will be used for index and query
processing. Like the English analyzer, it will receive the text and
produce tokens.
The Analyzer API requires implementing createComponents, which does not
depend on the text being analyzed. This is problematic since, as you know,
OpenNLP sentence breaking depends on the text it gets (OpenNLP uses the
model files to provide the spans of the sentences and then breaks them out).
Is there a way around it?
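
One possible way around it (a minimal sketch, assuming the Lucene 5.x
createComponents signature): the text-dependent work can live in the
Tokenizer rather than in createComponents. createComponents only wires up
the chain; the Tokenizer itself receives the full field text through its
Reader, so an OpenNLP-backed sentence-splitting Tokenizer can do the
model-dependent work per document. StandardTokenizer below is a placeholder
for that (yet to be written) Tokenizer; the filter chain mirrors
EnglishAnalyzer.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.en.EnglishPossessiveFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public final class SentenceAwareAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    // Placeholder: swap in an OpenNLP-backed sentence-splitting Tokenizer.
    // A Tokenizer sees the entire field text via its Reader, so sentence
    // detection that depends on the text belongs at this stage.
    Tokenizer source = new StandardTokenizer();
    // The rest mirrors EnglishAnalyzer's chain: possessive removal,
    // lowercasing, stopword removal, stemming.
    TokenStream sink = new EnglishPossessiveFilter(source);
    sink = new LowerCaseFilter(sink);
    sink = new StopFilter(sink, EnglishAnalyzer.getDefaultStopSet());
    sink = new PorterStemFilter(sink);
    return new TokenStreamComponents(source, sink);
  }
}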

Shay

On Wed, Apr 15, 2015 at 3:50 AM Ahmet Arslan 
wrote:

> Hi Hummel,
>
> You can perform sentence detection outside of the solr, using opennlp for
> instance, and then feed them to solr.
>
> https://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.sentdetect
>
> Ahmet
>
>
>
>
> On Tuesday, April 14, 2015 8:12 PM, Shay Hummel 
> wrote:
> Hi
> I would like to create a text dependent analyzer.
> That is, *given a string*, the analyzer will:
> 1. Read the entire text and break it into sentences.
> 2. Each sentence will then be tokenized, possessives removed, lowercased,
> terms marked, and stemmed.
>
> The second part is essentially what happens in the English analyzer
> (createComponents). However, this does not depend on the text it receives -
> which is the first part of what I am trying to do.
>
> So ... How can it be achieved?
>
> Thank you,
>
> Shay Hummel
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>
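
Ahmet's suggestion amounts to only a few lines of OpenNLP. A minimal
sketch (en-sent.bin is the pre-trained English sentence model described
in the linked manual):

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class SentenceDetectDemo {
  public static void main(String[] args) throws Exception {
    // Load the pre-trained sentence model (downloaded separately).
    try (InputStream in = new FileInputStream("en-sent.bin")) {
      SentenceModel model = new SentenceModel(in);
      SentenceDetectorME detector = new SentenceDetectorME(model);
      // sentDetect() returns one string per detected sentence.
      String[] sentences = detector.sentDetect(
          "This is the first sentence. This is the second one.");
      for (String sentence : sentences) {
        System.out.println(sentence);
      }
    }
  }
}

Each detected sentence can then be fed to Solr/Lucene as a separate value.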


Re: Text dependent analyzer

2015-04-15 Thread Jack Krupansky
Currently, how are you indexing sentence boundaries? Are you placing
sentences in distinct fields, leaving a position gap, or... what?

Ultimately it comes down to how you intend to query the data in a way that
respects sentence boundaries. To put it simply, why exactly do you care
where the sentence boundaries are? Be specific, because that determines
what your queries should look like, which determines what the indexed text
should look like, which determines how the text should be analyzed.
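
One common pattern behind these questions (a sketch; the field name and
gap value are illustrative): index each sentence as a separate value of a
multi-valued field and insert a large position increment gap between
values. A span or phrase query whose slop is smaller than the gap then
cannot match across a sentence boundary.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// Analyzer that inserts a position gap between field values (sentences).
Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new StandardTokenizer();
    return new TokenStreamComponents(source, new LowerCaseFilter(source));
  }

  @Override
  public int getPositionIncrementGap(String fieldName) {
    return 1000; // larger than any plausible sentence length
  }
};

// At query time, slop 10 stays well below the 1000-position gap, so the
// two terms must co-occur within a single sentence.
SpanQuery q = new SpanNearQuery(new SpanQuery[] {
    new SpanTermQuery(new Term("body", "sentence")),
    new SpanTermQuery(new Term("body", "boundary"))
}, 10, false);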

-- Jack Krupansky



FieldMaskingSpanQuery and statistics

2015-04-15 Thread Stephen Wu
In the documentation for FieldMaskingSpanQuery, it says:

"Note: as getField() returns the masked field, scoring will be done using the 
Similarity and collection statistics of the field name supplied, but with the 
term statistics of the real field. This may lead to exceptions, poor 
performance, and unexpected scoring behavior."

I assume this was implemented as such because the hypothetical use case was 
with very short fields, and collection statistics/idf are not so important when 
you're basically doing boolean queries.

However, we've given a lot of thought to how we could include linguistic 
annotations alongside the original text, and we're looking at separate fields + 
FieldMaskingSpanQuery to do the trick. (The idea is to create "annotation" 
fields with token offsets set by the tokenized text. Then FieldMaskingSpanQuery 
allows us to search both text and annotations as if they are in the same token 
position in the same field. We've considered payloads, synonyms, and a few 
other things, but haven't really been satisfied.)
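
For concreteness, the masking trick looks roughly like this (field names
"text" and "pos" are illustrative; the unordered SpanNearQuery with slop
-1 follows the FieldMaskingSpanQuery javadoc's example for matching two
spans at the same position):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.FieldMaskingSpanQuery;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// "walk" in the tokenized text field...
SpanQuery text = new SpanTermQuery(new Term("text", "walk"));
// ...aligned with a noun tag at the same token position in the
// annotation field, masked so it pretends to come from "text".
SpanQuery anno = new FieldMaskingSpanQuery(
    new SpanTermQuery(new Term("pos", "NN")), "text");
// Unordered, slop -1: the two spans must overlap, i.e. same position.
SpanQuery q = new SpanNearQuery(new SpanQuery[] { text, anno }, -1, false);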

In order for this to be scientifically interesting, though, we need the 
collection statistics to remain consistent with the original "annotation" 
field; we would also like to ensure that all of these stats/SpanQuery 
descendants work with LMDirichletSimilarity.

Any idea how to implement a FieldMaskingSpanQuery that gets collection 
statistics right?

Many thanks for any help on the issue.

stephen
P.S. Has anyone made progress on allowing indexes to store word lattices, 
preserving the graphs that are produced with TokenFilters?