Re: Search across a specified number of boundaries

2013-01-15 Thread Mike Ree
Mikhail,

Yeah, I considered that originally, but after analyzing the data I
noticed it was not possible. Some of the content we analyze contains
large tables that, after OCR, get turned into long run-on sentences
containing 500k+ words per sentence. Overall there are probably around 10k
of those anomalies. They stop the ranges from working because we run out of
positions below the max value an integer can hold, and we run the risk of a
future document breaking it.
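
A rough sketch of the arithmetic behind this, under assumptions: positions
are taken to be signed 32-bit integers, every sentence is given a
fixed-size block of positions, and the block has to be at least as large
as the longest sentence. The name sentence_budget is made up for
illustration and is not part of Lucene.

```python
INT_MAX = 2**31 - 1  # largest value of a signed 32-bit integer


def sentence_budget(gap, int_max=INT_MAX):
    """How many blocks of `gap` positions fit in the range 0..int_max."""
    return (int_max + 1) // gap


# A gap of 100 leaves room for millions of sentences per document.
small_gap_budget = sentence_budget(100)

# A gap sized for a 500k-word "sentence" leaves room for only a few thousand.
large_gap_budget = sentence_budget(500_000)

print(small_gap_budget, large_gap_budget)
```

Under these assumptions, a gap of 500,000 allows 4294 sentences per
document, versus 21474836 with a gap of 100.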

I found a Jira on what I'm looking for. Going to look into it and see if I
can get it to work for my situation.

https://issues.apache.org/jira/browse/LUCENE-777

Thanks for the help.

Mike

On Mon, Jan 14, 2013 at 11:48 AM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:

 Mike,

 When Lucene's Analyzer indexes the text, it adds positions to the index,
 which are later used by SpanQueries. Have you considered the idea of a
 position increment gap? E.g. the first sentence is indexed with word
 positions 0,1,2,3,..., the second sentence with 100,101,102,103,..., the
 third with 200,201,202,... Then applying some span constraint allows you
 to search across/inside of the sentences.
 WDYT?


 On Sun, Jan 6, 2013 at 6:50 PM, Erick Erickson erickerick...@gmail.com wrote:

 Mike:

 I'm _really_ stretching here, but you might be able to do something
 interesting with payloads. Say each word had a payload with the sentence
 number and you _somehow_ made use of that information in a custom scorer.
 But like I said, I really have no good idea how to accomplish that...

 BTW, in future this kind of question is better asked on the user's list
 (either Lucene or Solr); this list is intended for discussing development
 work.

 Best
 Erick


 On Fri, Jan 4, 2013 at 1:02 PM, Mike Ree mike.ad...@olytech.net wrote:

 d terms that are in nearby sentences.

 IE:
 TermA NEAR3 TermB would find all TermA's that are within 3 sentences
 of TermB.

 Have found ways to find TermA within same sentence





 --
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics

 http://www.griddynamics.com
  mkhlud...@griddynamics.com



Re: Search across a specified number of boundaries

2013-01-14 Thread Mikhail Khludnev
Mike,

When Lucene's Analyzer indexes the text, it adds positions to the index,
which are later used by SpanQueries. Have you considered the idea of a
position increment gap? E.g. the first sentence is indexed with word
positions 0,1,2,3,..., the second sentence with 100,101,102,103,..., the
third with 200,201,202,... Then applying some span constraint allows you
to search across/inside of the sentences.
WDYT?
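
A minimal sketch of the numbering described above, written in plain Python
rather than against the Lucene API. It assumes a fixed stride of 100
positions per sentence, as in the example, and that no sentence reaches
100 words. The names assign_positions and near_sentences are hypothetical
helpers, not Lucene calls.

```python
GAP = 100  # assumed stride; every sentence must have fewer than GAP words


def assign_positions(sentences, gap=GAP):
    """Map each term to its positions, starting sentence k at k * gap."""
    positions = {}
    for k, sentence in enumerate(sentences):
        words = sentence.split()
        if len(words) >= gap:
            raise ValueError(f"sentence {k} has {len(words)} words, gap is {gap}")
        for offset, word in enumerate(words):
            positions.setdefault(word, []).append(k * gap + offset)
    return positions


def near_sentences(positions, term_a, term_b, n, gap=GAP):
    """Return (pos_a, pos_b) pairs whose sentences are at most n apart."""
    return [
        (pa, pb)
        for pa in positions.get(term_a, [])
        for pb in positions.get(term_b, [])
        if abs(pa // gap - pb // gap) <= n
    ]


sentences = [
    "TermA opens the document",
    "filler text here",
    "more filler text",
    "TermB shows up",
    "another filler line",
    "and one more",
    "TermB again far away",
]
index = assign_positions(sentences)
print(near_sentences(index, "TermA", "TermB", 3))
```

Because each sentence owns its own block of positions, integer division by
the gap recovers the sentence number, so "TermA NEAR3 TermB" reduces to
comparing block numbers. In this sample only the TermB in the fourth
sentence is within 3 sentences of TermA. The scheme breaks down once a
sentence is longer than the gap.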


On Sun, Jan 6, 2013 at 6:50 PM, Erick Erickson erickerick...@gmail.com wrote:

 Mike:

 I'm _really_ stretching here, but you might be able to do something
 interesting with payloads. Say each word had a payload with the sentence
 number and you _somehow_ made use of that information in a custom scorer.
 But like I said, I really have no good idea how to accomplish that...

 BTW, in future this kind of question is better asked on the user's list
 (either Lucene or Solr); this list is intended for discussing development
 work.

 Best
 Erick


 On Fri, Jan 4, 2013 at 1:02 PM, Mike Ree mike.ad...@olytech.net wrote:

 d terms that are in nearby sentences.

 IE:
 TermA NEAR3 TermB would find all TermA's that are within 3 sentences of
 TermB.

 Have found ways to find TermA within same sentence





-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com