Re: Proximity query

2015-02-12 Thread Maisnam Ns
Hi,

I googled it but could not find the jars for these classes. Can someone help me
with where to get the jars?

import org.apache.lucene.corpus.stats.IDFCalc;
import org.apache.lucene.corpus.stats.TFIDFPriorityQueue;
import org.apache.lucene.corpus.stats.TermIDF;

Thanks

On Thu, Feb 12, 2015 at 11:01 PM, Maisnam Ns maisnam...@gmail.com wrote:

 Hi Allison and Sujit,

 Thanks so much for your links I am so happy I am looking at  exactly the
 links that almost covers my use case.

 Allison, sure will get back to you if I have some more questions.

 Regards
 NS





 On Thu, Feb 12, 2015 at 10:49 PM, Sujit Pal sujit@comcast.net wrote:

 I did something like this sometime back. The objective was to find
 patterns
 surrounding some keywords of interest so I could find keywords similar to
 the ones I was looking for, sort of like a poor man's word2vec. It uses
 SpanQuery as Jigar said, and you can find the code here (I believe it was
 written against Lucene 3.x so you may have to upgrade it if you are using
 Lucene 4.x):


 http://sujitpal.blogspot.com/2011/08/implementing-concordance-with-lucene.html

 -sujit


 On Thu, Feb 12, 2015 at 8:57 AM, Maisnam Ns maisnam...@gmail.com wrote:

  Hi Shah,
 
  Thanks for your reply. Will try to google SpanQuery meanwhile if you
 have
  some links can you please share
 
  Thanks
 
  On Thu, Feb 12, 2015 at 10:17 PM, Jigar Shah jigaronl...@gmail.com
  wrote:
 
   This concept is called Proximity Search in general.
  
   In Lucene they are achieved using SpanQuery.
  
   On Thu, Feb 12, 2015 at 10:10 PM, Maisnam Ns maisnam...@gmail.com
  wrote:
  
Hi,
   
Can someone help me if this use case is possible or not with lucene
   
Use case: I have a string say 'Japan' appearing in 10 documents and
 I
   want
to get back , say some results which contain two words before
 'Japan'
  and
two words after 'Japan' may be something like this ' Economy of
 Japan
  is
growing' etc.
   
 If it is not possible where should I look for such queries
   
Thanks
   
  
 





Re: A codec moment or pickle

2015-02-12 Thread Benson Margulies
Based on reading the same comments you read, I'm pretty doubtful that
Codec.getDefault() is going to work. It seems to me that this
situation renders the FilterCodec a bit hard to use, at least given
the 'every release deprecates a codec' sort of pattern.



On Thu, Feb 12, 2015 at 3:20 AM, Uwe Schindler u...@thetaphi.de wrote:
 Hi,

 How about Codec.getDefault()? It does indeed not necessarily return the 
 newest one (if somebody changes the default using Codec.setDefault()), but 
 for your use case wrapping the current default one, it should be fine?

 I have not tried this yet, but there might be a chicken-egg problem:
 - Your codec will have a separate name and be listed in META-INF as service 
 (I assume this). So it gets discovered by the Codec discovery process and is 
 instantiated by that.
 - On loading the Codec framework the call to codec.getDefault() might get in 
 at a time where the codecs are not yet fully initialized (because it will 
 instantiate your codec while loading the META-INF). This happens before the 
 Codec class is itself fully statically initialized, so the default codec 
 might be null...
 So relying on Codec.getDefault() in constructors of filter codecs may not 
 work as expected!

 Maybe try it out, was just an idea :-)

 Uwe

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de


 -Original Message-
 From: Benson Margulies [mailto:bimargul...@gmail.com]
 Sent: Thursday, February 12, 2015 2:11 AM
 To: java-user@lucene.apache.org
 Subject: A codec moment or pickle

 I have a class that extends FilterCodec. Written against Lucene 4.9, it uses 
 the
 Lucene49Codec.

 Dropped into a copy of Solr with Lucene 4.10, it discovers that this codec is
 read-only in 4.10. Is there some way to code one of these to get 'the default
 codec' and not have to chase versions?

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: A codec moment or pickle

2015-02-12 Thread Benson Margulies
Robert,

Let me lay out the scenario.

Hardware has .5T of Index is relatively small. Application profiling
shows a significant amount of time spent codec-ing.

Options as I see them:

1. Use DPF complete with the irritation of having to have this
spurious codec name in the on-disk format that has nothing to do with
the on-disk format.
2. 'Officially' use the standard codec, and then use something like
AOP to intercept and encapsulate it with the DPF or something else
like it -- essentially, a do-it-myself alternative to convincing the
community here that this is a use case worthy of support.
3. Find some way to move a significant amount of the data in question
out of Lucene altogether into something else which fits nicely
together with filling memory with a cache so that the amount of
codec-ing drops below the threshold of interest.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: A codec moment or pickle

2015-02-12 Thread Benson Margulies
WHOOPS.

First sentence was, until just before I clicked 'send',

Hardware has .5T of RAM. Index is relatively small  (20g) ...


On Thu, Feb 12, 2015 at 4:51 PM, Benson Margulies ben...@basistech.com wrote:
 Robert,

 Let me lay out the scenario.

 Hardware has .5T of Index is relatively small. Application profiling
 shows a significant amount of time spent codec-ing.

 Options as I see them:

 1. Use DPF complete with the irritation of having to have this
 spurious codec name in the on-disk format that has nothing to do with
 the on-disk format.
 2. 'Officially' use the standard codec, and then use something like
 AOP to intercept and encapsulate it with the DPF or something else
 like it -- essentially, a do-it-myself alternative to convincing the
 community here that this is a use case worthy of support.
 3. Find some way to move a significant amount of the data in question
 out of Lucene altogether into something else which fits nicely
 together with filling memory with a cache so that the amount of
 codeccing drops below the threshold of interest.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: occurrence of two terms with the highest frequency

2015-02-12 Thread Ian Lea
I think you can do it with 4 simple queries:

1) +flying +shooting

2) +flying +fighting

etc.

or BooleanQuery equivalents with MUST clauses.  Use
org.apache.lucene.search.TotalHitCountCollector and it should be blazingly fast,
even if you have more than 100 docs.
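
A minimal sketch of that approach against a Lucene 4.x index (the field name
"body" and the index path are assumptions, not from the original question):

import java.io.File;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TotalHitCountCollector;
import org.apache.lucene.store.FSDirectory;

public class CooccurrenceCounts {
    public static void main(String[] args) throws Exception {
        IndexReader reader =
            DirectoryReader.open(FSDirectory.open(new File("/path/to/index")));
        IndexSearcher searcher = new IndexSearcher(reader);

        for (String other : new String[] { "shooting", "fighting", "looking" }) {
            // +flying +other: both terms MUST occur in the document
            BooleanQuery bq = new BooleanQuery();
            bq.add(new TermQuery(new Term("body", "flying")), Occur.MUST);
            bq.add(new TermQuery(new Term("body", other)), Occur.MUST);

            // TotalHitCountCollector only counts matches, no scoring or sorting
            TotalHitCountCollector counter = new TotalHitCountCollector();
            searcher.search(bq, counter);
            System.out.println("flying + " + other + " -> " + counter.getTotalHits());
        }
        reader.close();
    }
}

Collect the counts, sort them descending, and that is the list for the web page.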


--
Ian.


On Thu, Feb 12, 2015 at 5:42 PM, Maisnam Ns maisnam...@gmail.com wrote:
 Hi,

 Can someone help me with this use case.

 Use case: Say there are 4 key words 'Flying', 'Shooting', 'fighting' and
 'looking' in100 documents to search for.

 Consider 'Flying' and 'Shooting' co- occurs (together) in 70 documents
 where as

 'Flying and 'fighting' co- occurs in 14 documents

 'Flying' and 'looking' co-occurs in 2 documents and so on.

 I have to list them in order or rather show them on a web page
 1. Flying , Shooting -70
 2. Flying , fighting - 14
 3 Flying , looking -2

 How to achieve this and please tell me what kind of query is this
 co-occurrence frequency.
 Is this possible in Lucene.And how to proceed .

 Please help and thanks in advance.

 Regards

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: A codec moment or pickle

2015-02-12 Thread Uwe Schindler
Hi,

How about Codec.getDefault()? It does not necessarily return the newest one (if 
somebody changes the default using Codec.setDefault()), but for your use case of 
wrapping the current default one, it should be fine?

I have not tried this yet, but there might be a chicken-and-egg problem:
- Your codec will have a separate name and be listed in META-INF as a service (I 
assume this). So it gets discovered by the Codec discovery process and is 
instantiated by that.
- On loading the Codec framework, the call to Codec.getDefault() might come at 
a time when the codecs are not yet fully initialized (because it will 
instantiate your codec while loading the META-INF). This happens before the 
Codec class itself is fully statically initialized, so the default codec might 
be null...
So relying on Codec.getDefault() in constructors of filter codecs may not work 
as expected!
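
Just to make the failure mode concrete, the risky pattern would look roughly
like this (the class and codec names are made up for illustration):

import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.FilterCodec;

public class WrapDefaultCodec extends FilterCodec {
    public WrapDefaultCodec() {
        // If this constructor runs while the codec SPI is still loading the
        // available codecs (which is exactly when discovery instantiates this
        // class), Codec.getDefault() may still be null at this point, and the
        // wrapper ends up delegating to nothing.
        super("WrapDefaultCodec", Codec.getDefault());
    }
}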

Maybe try it out, was just an idea :-)

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


 -Original Message-
 From: Benson Margulies [mailto:bimargul...@gmail.com]
 Sent: Thursday, February 12, 2015 2:11 AM
 To: java-user@lucene.apache.org
 Subject: A codec moment or pickle
 
 I have a class that extends FilterCodec. Written against Lucene 4.9, it uses 
 the
 Lucene49Codec.
 
 Dropped into a copy of Solr with Lucene 4.10, it discovers that this codec is
 read-only in 4.10. Is there some way to code one of these to get 'the default
 codec' and not have to chase versions?
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: A codec moment or pickle

2015-02-12 Thread Robert Muir
Honestly, I don't agree. I don't know what you are trying to do, but if
you want file format backwards compat to keep working, then you need a
different FilterCodec to match each Lucene codec.

Otherwise your codec is broken from a back-compat standpoint. Wrapping
the latest is an antipattern here.
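
For illustration, such a per-version wrapper would look roughly like this (the
class name, codec name and the DirectPostingsFormat choice are just examples,
not anything from this thread; a 4.9 build would delegate to Lucene49Codec
instead, and each name still has to be registered in
META-INF/services/org.apache.lucene.codecs.Codec):

import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene410.Lucene410Codec;
import org.apache.lucene.codecs.memory.DirectPostingsFormat;

// One such class per Lucene codec generation, each with its own name,
// so indexes written with an older generation keep reading with the
// codec they were written with.
public class MyDirect410Codec extends FilterCodec {
    private final PostingsFormat postings = new DirectPostingsFormat();

    public MyDirect410Codec() {
        super("MyDirect410Codec", new Lucene410Codec());
    }

    @Override
    public PostingsFormat postingsFormat() {
        return postings;
    }
}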


On Thu, Feb 12, 2015 at 5:33 AM, Benson Margulies ben...@basistech.com wrote:
 Based on reading the same comments you read, I'm pretty doubtful that
 Codec.getDefault() is going to work. It seems to me that this
 situation renders the FilterCodec a bit hard to to use, at least given
 the 'every release deprecates a codec' sort of pattern.



 On Thu, Feb 12, 2015 at 3:20 AM, Uwe Schindler u...@thetaphi.de wrote:
 Hi,

 How about Codec.getDefault()? It does indeed not necessarily return the 
 newest one (if somebody changes the default using Codec.setDefault()), but 
 for your use case wrapping the current default one, it should be fine?

 I have not tried this yet, but there might be a chicken-egg problem:
 - Your codec will have a separate name and be listed in META-INF as service 
 (I assume this). So it gets discovered by the Codec discovery process and is 
 instantiated by that.
 - On loading the Codec framework the call to codec.getDefault() might get in 
 at a time where the codecs are not yet fully initialized (because it will 
 instantiate your codec while loading the META-INF). This happens before the 
 Codec class is itself fully statically initialized, so the default codec 
 might be null...
 So relying on Codec.getDefault() in constructors of filter codecs may not 
 work as expected!

 Maybe try it out, was just an idea :-)

 Uwe

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de


 -Original Message-
 From: Benson Margulies [mailto:bimargul...@gmail.com]
 Sent: Thursday, February 12, 2015 2:11 AM
 To: java-user@lucene.apache.org
 Subject: A codec moment or pickle

 I have a class that extends FilterCodec. Written against Lucene 4.9, it 
 uses the
 Lucene49Codec.

 Dropped into a copy of Solr with Lucene 4.10, it discovers that this codec 
 is
 read-only in 4.10. Is there some way to code one of these to get 'the 
 default
 codec' and not have to chase versions?

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: A codec moment or pickle

2015-02-12 Thread Robert Muir
On Thu, Feb 12, 2015 at 8:51 AM, Benson Margulies ben...@basistech.com wrote:
 On Thu, Feb 12, 2015 at 8:43 AM, Robert Muir rcm...@gmail.com wrote:

 Honestly i dont agree. I don't know what you are trying to do, but if
 you want file format backwards compat working, then you need a
 different FilterCodec to match each lucene codec.

 Otherwise your codec is broken from a back compat standpoint. Wrapping
 the latest is an antipattern here.


 I understand this logic. It leaves me wandering between:

 1: My old desire to convince you that there should be a way to do
 DirectPostingFormat's caching without being a codec at all. Unfortunately,
 I got dragged away from the benchmarking that might have been persuasive.

Honestly, benchmarking won't persuade me. I think this is a trap and I
don't want more of these traps.
We already have RAMDirectory(Directory other) which is this exact same
trap. We don't need more duplicates of it.
But this Direct, man oh man is it even worse by far, because it uses
32 and 64 bits for things that really should typically only be like 8
bits with compression, so it just hogs up RAM.

There isn't a benchmark on this planet that can convince me it should
get any higher status. On the contrary, I want to send it into a deep
dark dungeon in siberia.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: A codec moment or pickle

2015-02-12 Thread Uwe Schindler
Hi,

FYI, this is the same issue that Locales have/had in ICU! If you try to render 
an error message in Locale's constructor, this breaks with an NPE - because the 
default Locale is not yet there... I think they implemented some fallback 
that is guaranteed to be there.

But this would not help you either - you need the default Codec to be available 
at the time your custom codec is loaded... Same issue, no idea how to solve this.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


 -Original Message-
 From: Benson Margulies [mailto:ben...@basistech.com]
 Sent: Thursday, February 12, 2015 11:34 AM
 To: java-user@lucene.apache.org
 Subject: Re: A codec moment or pickle
 
 Based on reading the same comments you read, I'm pretty doubtful that
 Codec.getDefault() is going to work. It seems to me that this situation
 renders the FilterCodec a bit hard to to use, at least given the 'every 
 release
 deprecates a codec' sort of pattern.
 
 
 
 On Thu, Feb 12, 2015 at 3:20 AM, Uwe Schindler u...@thetaphi.de wrote:
  Hi,
 
  How about Codec.getDefault()? It does indeed not necessarily return the
 newest one (if somebody changes the default using Codec.setDefault()), but
 for your use case wrapping the current default one, it should be fine?
 
  I have not tried this yet, but there might be a chicken-egg problem:
  - Your codec will have a separate name and be listed in META-INF as service
 (I assume this). So it gets discovered by the Codec discovery process and is
 instantiated by that.
  - On loading the Codec framework the call to codec.getDefault() might get
 in at a time where the codecs are not yet fully initialized (because it will
 instantiate your codec while loading the META-INF). This happens before the
 Codec class is itself fully statically initialized, so the default codec 
 might be
 null...
  So relying on Codec.getDefault() in constructors of filter codecs may not
 work as expected!
 
  Maybe try it out, was just an idea :-)
 
  Uwe
 
  -
  Uwe Schindler
  H.-H.-Meier-Allee 63, D-28213 Bremen
  http://www.thetaphi.de
  eMail: u...@thetaphi.de
 
 
  -Original Message-
  From: Benson Margulies [mailto:bimargul...@gmail.com]
  Sent: Thursday, February 12, 2015 2:11 AM
  To: java-user@lucene.apache.org
  Subject: A codec moment or pickle
 
  I have a class that extends FilterCodec. Written against Lucene 4.9,
  it uses the Lucene49Codec.
 
  Dropped into a copy of Solr with Lucene 4.10, it discovers that this
  codec is read-only in 4.10. Is there some way to code one of these to
  get 'the default codec' and not have to chase versions?
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: A codec moment or pickle

2015-02-12 Thread Benson Margulies
On Thu, Feb 12, 2015 at 8:43 AM, Robert Muir rcm...@gmail.com wrote:

 Honestly i dont agree. I don't know what you are trying to do, but if
 you want file format backwards compat working, then you need a
 different FilterCodec to match each lucene codec.

 Otherwise your codec is broken from a back compat standpoint. Wrapping
 the latest is an antipattern here.


I understand this logic. It leaves me wandering between:

1: My old desire to convince you that there should be a way to do
DirectPostingsFormat's caching without being a codec at all. Unfortunately,
I got dragged away from the benchmarking that might have been persuasive.

2: The problem of deprecation. I give someone a jar-of-code that works fine
with Lucene 4.9. It does not work with 4.10. Now, maybe the answer here is
that the codec deprecation is fundamental to the definition of moving from
4.9 to 4.10, so having a codec means that I'm really married to a process
of making releases that mirror Lucene releases.






 On Thu, Feb 12, 2015 at 5:33 AM, Benson Margulies ben...@basistech.com
 wrote:
  Based on reading the same comments you read, I'm pretty doubtful that
  Codec.getDefault() is going to work. It seems to me that this
  situation renders the FilterCodec a bit hard to to use, at least given
  the 'every release deprecates a codec' sort of pattern.
 
 
 
  On Thu, Feb 12, 2015 at 3:20 AM, Uwe Schindler u...@thetaphi.de wrote:
  Hi,
 
  How about Codec.getDefault()? It does indeed not necessarily return the
 newest one (if somebody changes the default using Codec.setDefault()), but
 for your use case wrapping the current default one, it should be fine?
 
  I have not tried this yet, but there might be a chicken-egg problem:
  - Your codec will have a separate name and be listed in META-INF as
 service (I assume this). So it gets discovered by the Codec discovery
 process and is instantiated by that.
  - On loading the Codec framework the call to codec.getDefault() might
 get in at a time where the codecs are not yet fully initialized (because it
 will instantiate your codec while loading the META-INF). This happens
 before the Codec class is itself fully statically initialized, so the
 default codec might be null...
  So relying on Codec.getDefault() in constructors of filter codecs may
 not work as expected!
 
  Maybe try it out, was just an idea :-)
 
  Uwe
 
  -
  Uwe Schindler
  H.-H.-Meier-Allee 63, D-28213 Bremen
  http://www.thetaphi.de
  eMail: u...@thetaphi.de
 
 
  -Original Message-
  From: Benson Margulies [mailto:bimargul...@gmail.com]
  Sent: Thursday, February 12, 2015 2:11 AM
  To: java-user@lucene.apache.org
  Subject: A codec moment or pickle
 
  I have a class that extends FilterCodec. Written against Lucene 4.9,
 it uses the
  Lucene49Codec.
 
  Dropped into a copy of Solr with Lucene 4.10, it discovers that this
 codec is
  read-only in 4.10. Is there some way to code one of these to get 'the
 default
  codec' and not have to chase versions?
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




RE: Proximity query

2015-02-12 Thread Allison, Timothy B.
Might also look at concordance code on LUCENE-5317 and here:

https://github.com/tballison/lucene-addons/tree/master/lucene-5317

Let me know if you have any questions.

-Original Message-
From: Maisnam Ns [mailto:maisnam...@gmail.com] 
Sent: Thursday, February 12, 2015 11:57 AM
To: java-user@lucene.apache.org
Subject: Re: Proximity query

Hi Shah,

Thanks for your reply. Will try to google SpanQuery meanwhile if you have
some links can you please share

Thanks

On Thu, Feb 12, 2015 at 10:17 PM, Jigar Shah jigaronl...@gmail.com wrote:

 This concept is called Proximity Search in general.

 In Lucene they are achieved using SpanQuery.

 On Thu, Feb 12, 2015 at 10:10 PM, Maisnam Ns maisnam...@gmail.com wrote:

  Hi,
 
  Can someone help me if this use case is possible or not with lucene
 
  Use case: I have a string say 'Japan' appearing in 10 documents and I
 want
  to get back , say some results which contain two words before 'Japan' and
  two words after 'Japan' may be something like this ' Economy of Japan is
  growing' etc.
 
   If it is not possible where should I look for such queries
 
  Thanks
 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene Version Upgrade (3-4) and Java JVM Versions(6-8)

2015-02-12 Thread Robert Muir
On Thu, Feb 12, 2015 at 11:58 AM, McKinley, James T
james.mckin...@cengage.com wrote:
 Hi Robert,

 Thanks for responding to my message.  Are you saying that you or others have 
 encountered problems running Lucene 4.8+ on the 64-bit Java SE 1.7 JVM with 
 G1 and was it on Windows or on Linux?  If so, where can I find out more?  I 
 only looked into the one bug because that was the only bug I saw on the 
 https://wiki.apache.org/lucene-java/JavaBugs page that was related to G1.  If 
 there are other Lucene on Java 1.7 with G1 related bugs how can I find them?  
 Also, are these failures something that would be triggered by running the 
 standard Lucene 4.8.1 test suite or are there other tests I should run in 
 order to reproduce these bugs?

You can't reproduce them easily. That is the nature of such bugs. When
I see the crashes, I generally try to confirm it's not a Lucene bug.
E.g. I'll run it a thousand times with/without G1 and if only G1 fails,
I move on with life. There just isn't time.

Occasionally G1 frustrates me enough that I'll go and open an issue, like
this one: https://issues.apache.org/jira/browse/LUCENE-6098

That's a perfect example of what these bugs look like: horribly scary
failures that can cause bad things, and that reproduce like 1/1000 times
with G1, essentially impossible to debug. They happen quite often on
our various Jenkins servers, on both 32-bit and 64-bit, and even with
the most recent (e.g. 1.8.0_25 or 1.8.0_40-ea) JVMs.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Proximity query

2015-02-12 Thread Sujit Pal
I did something like this sometime back. The objective was to find patterns
surrounding some keywords of interest so I could find keywords similar to
the ones I was looking for, sort of like a poor man's word2vec. It uses
SpanQuery as Jigar said, and you can find the code here (I believe it was
written against Lucene 3.x so you may have to upgrade it if you are using
Lucene 4.x):

http://sujitpal.blogspot.com/2011/08/implementing-concordance-with-lucene.html
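
The heart of it is just walking span positions; against the Lucene 3.x API that
part looks roughly like this (the field name, term and two-word window are
assumptions here, and the post above has the complete code):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;

public class SpanWindowDemo {
    // Print the token-position window around each occurrence of "japan".
    public static void showWindows(IndexReader reader) throws Exception {
        SpanTermQuery query = new SpanTermQuery(new Term("body", "japan"));
        Spans spans = query.getSpans(reader);  // Lucene 3.x signature
        while (spans.next()) {
            int doc = spans.doc();
            int start = spans.start();  // position of the first matched token
            int end = spans.end();      // one past the last matched position
            // The two words on either side sit at positions [start - 2, start)
            // and [end, end + 2); pull them from term vectors or by
            // re-analyzing the stored field.
            System.out.println("doc=" + doc
                + " window=[" + (start - 2) + "," + (end + 2) + ")");
        }
    }
}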

-sujit


On Thu, Feb 12, 2015 at 8:57 AM, Maisnam Ns maisnam...@gmail.com wrote:

 Hi Shah,

 Thanks for your reply. Will try to google SpanQuery meanwhile if you have
 some links can you please share

 Thanks

 On Thu, Feb 12, 2015 at 10:17 PM, Jigar Shah jigaronl...@gmail.com
 wrote:

  This concept is called Proximity Search in general.
 
  In Lucene they are achieved using SpanQuery.
 
  On Thu, Feb 12, 2015 at 10:10 PM, Maisnam Ns maisnam...@gmail.com
 wrote:
 
   Hi,
  
   Can someone help me if this use case is possible or not with lucene
  
   Use case: I have a string say 'Japan' appearing in 10 documents and I
  want
   to get back , say some results which contain two words before 'Japan'
 and
   two words after 'Japan' may be something like this ' Economy of Japan
 is
   growing' etc.
  
If it is not possible where should I look for such queries
  
   Thanks
  
 



Proximity query

2015-02-12 Thread Maisnam Ns
Hi,

Can someone tell me whether this use case is possible or not with Lucene.

Use case: I have a string, say 'Japan', appearing in 10 documents and I want
to get back some results which contain two words before 'Japan' and
two words after 'Japan', maybe something like 'Economy of Japan is
growing', etc.

If it is not possible, where should I look for such queries?

Thanks


RE: Lucene Version Upgrade (3-4) and Java JVM Versions(6-8)

2015-02-12 Thread McKinley, James T
Hi Robert,

Thanks for responding to my message.  Are you saying that you or others have 
encountered problems running Lucene 4.8+ on the 64-bit Java SE 1.7 JVM with G1 
and was it on Windows or on Linux?  If so, where can I find out more?  I only 
looked into the one bug because that was the only bug I saw on the 
https://wiki.apache.org/lucene-java/JavaBugs page that was related to G1.  If 
there are other Lucene on Java 1.7 with G1 related bugs how can I find them?  
Also, are these failures something that would be triggered by running the 
standard Lucene 4.8.1 test suite or are there other tests I should run in order 
to reproduce these bugs?

We have been running the user facing runtime portion of our search engine using 
Java SE 1.7.0_04 with the G1 garbage collector for almost two years now and I 
was not aware of these JVM bugs with Lucene.  However, the indexing workflow 
portion of our system uses Parallel GC since it is a batch system and is not 
constrained by user facing response time requirements.  From what I understood 
from the JDK-8038348 bug comments, it is a compiler bug that can be tripped 
when using G1 and if the compiler is producing incorrect code I guess any 
behaviour is possible.  

We have experienced index corruption 3 times so far since upgrading to Lucene 
4.8.1 from Lucene 4.4 (I don't recall any corruption prior to moving to 4.8) 
but as I said we are using Parallel GC (-XX:+UseParallelGC 
-XX:+UseParallelOldGC) in the indexing workflow that writes the indexes, we 
only use G1 in the runtime system that does no index writing.  We have twice 
encountered index corruption during the index creation workflow (the runtime 
system never opened the indexes) and once found the index to be corrupt when we 
restarted the runtime on it.  So this may just be JVM bugs that can be 
triggered regardless of which garbage collector is used (which is of course 
even worse).  We do have relatively large indexes (530M+ docs total across 30 
partitions), so maybe we're more likely to see corruption even when using 
Parallel GC?  We haven't seen any corruption since the end of September 2014, 
but we have now added an index checking step to our workflow to ensure we don't 
ever point the runtime at a bad batch.  When we've encountered index corruption 
in the past we've just deleted the bad batch and re-ran the workflow and the 
subsequent runs have succeeded.  We've never figured out what caused the 
corruption.  Thanks for any further help.

Jim

From: Robert Muir [rcm...@gmail.com]
Sent: Wednesday, February 11, 2015 5:05 PM
To: java-user
Subject: Re: Lucene Version Upgrade (3-4) and Java JVM Versions(6-8)

No, because you only looked into one bug. We have seen and do so see
many G1 related test failures, including latest 1.8.0 update 40 early
access editions. These include things like corruption.

I added this message with *every intention* to scare away users,
because I don't want them having index corruption.

I am sick of people asking but isn't it fine on the latest version
and so on. It is not.

On Wed, Feb 11, 2015 at 11:41 AM, McKinley, James T
james.mckin...@cengage.com wrote:
 Hi,

 A couple mailing list members have brought the following paragraph from the 
 https://wiki.apache.org/lucene-java/JavaBugs page to my attention:

 Do not, under any circumstances, run Lucene with the G1 garbage collector. 
 Lucene's test suite fails with the G1 garbage collector on a regular basis, 
 including bugs that cause index corruption. There is no person on this planet 
 that seems to understand such bugs (see 
 https://bugs.openjdk.java.net/browse/JDK-8038348, open for over a year), so 
 don't count on the situation changing soon. This information is not out of 
 date, and don't think that the next oracle java release will fix the 
 situation.

 Since we run Lucene 4.8.1 on Java(TM) SE Runtime Environment (build 
 1.7.0_04-b20) Java HotSpot(TM) 64-Bit Server VM (build 23.0-b21, mixed mode) 
 using G1GC in production I felt I should look into the issue and see if it is 
 reproducible in our environment.  First I read the bug linked in the above 
 paragraph as well as https://issues.apache.org/jira/browse/LUCENE-5168 and it 
 appears quite a bit of work in trying to track down this bug has already been 
 done by Dawid Weiss and Vladmir Kozlov but it seems it is limited to the 
 32-bit JVM (maybe even only on Windows), to quote Dawid Weiss from the Jira 
 bug:

 My quest continues

 I thought it'd be interesting to see how far back I can trace this
 issue. I fetched the official binaries for jdk17 (windows, 32-bit) and
 did a binary search with the failing Lucene test command. The results
 show that, in short:

 ...
 jdk1.7.0_03: PASSES
 jdk1.7.0_04: FAILS
 ...

 and are consistent before and after. jdk1.7.0_04, 64-bit does *NOT*
 exhibit the issue (and neither does any version afterwards, it only
 happens on 32-bit; perhaps it's because of smaller number of 

Re: Proximity query

2015-02-12 Thread Jigar Shah
This concept is called Proximity Search in general.

In Lucene they are achieved using SpanQuery.
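
For example, a SpanNearQuery that matches 'economy' within two positions of
'japan', in either order (a minimal sketch; the field name is an assumption,
not something from this thread):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class ProximityQueryDemo {
    // Build "economy" within two positions of "japan", in either order.
    public static SpanQuery economyNearJapan() {
        SpanQuery economy = new SpanTermQuery(new Term("body", "economy"));
        SpanQuery japan = new SpanTermQuery(new Term("body", "japan"));
        return new SpanNearQuery(new SpanQuery[] { economy, japan },
                                 2,       // slop: at most two positions apart
                                 false);  // inOrder = false: either order
    }
}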

On Thu, Feb 12, 2015 at 10:10 PM, Maisnam Ns maisnam...@gmail.com wrote:

 Hi,

 Can someone help me if this use case is possible or not with lucene

 Use case: I have a string say 'Japan' appearing in 10 documents and I want
 to get back , say some results which contain two words before 'Japan' and
 two words after 'Japan' may be something like this ' Economy of Japan is
 growing' etc.

  If it is not possible where should I look for such queries

 Thanks



Re: Proximity query

2015-02-12 Thread Maisnam Ns
Hi Shah,

Thanks for your reply. Will try to google SpanQuery; meanwhile, if you have
some links, can you please share them?

Thanks

On Thu, Feb 12, 2015 at 10:17 PM, Jigar Shah jigaronl...@gmail.com wrote:

 This concept is called Proximity Search in general.

 In Lucene they are achieved using SpanQuery.

 On Thu, Feb 12, 2015 at 10:10 PM, Maisnam Ns maisnam...@gmail.com wrote:

  Hi,
 
  Can someone help me if this use case is possible or not with lucene
 
  Use case: I have a string say 'Japan' appearing in 10 documents and I
 want
  to get back , say some results which contain two words before 'Japan' and
  two words after 'Japan' may be something like this ' Economy of Japan is
  growing' etc.
 
   If it is not possible where should I look for such queries
 
  Thanks
 



Re: Proximity query

2015-02-12 Thread Maisnam Ns
Hi Allison and Sujit,

Thanks so much for your links. I am so happy that I am looking at exactly the
links that almost cover my use case.

Allison, sure will get back to you if I have some more questions.

Regards
NS





On Thu, Feb 12, 2015 at 10:49 PM, Sujit Pal sujit@comcast.net wrote:

 I did something like this sometime back. The objective was to find patterns
 surrounding some keywords of interest so I could find keywords similar to
 the ones I was looking for, sort of like a poor man's word2vec. It uses
 SpanQuery as Jigar said, and you can find the code here (I believe it was
 written against Lucene 3.x so you may have to upgrade it if you are using
 Lucene 4.x):


 http://sujitpal.blogspot.com/2011/08/implementing-concordance-with-lucene.html

 -sujit


 On Thu, Feb 12, 2015 at 8:57 AM, Maisnam Ns maisnam...@gmail.com wrote:

  Hi Shah,
 
  Thanks for your reply. Will try to google SpanQuery meanwhile if you have
  some links can you please share
 
  Thanks
 
  On Thu, Feb 12, 2015 at 10:17 PM, Jigar Shah jigaronl...@gmail.com
  wrote:
 
   This concept is called Proximity Search in general.
  
   In Lucene they are achieved using SpanQuery.
  
   On Thu, Feb 12, 2015 at 10:10 PM, Maisnam Ns maisnam...@gmail.com
  wrote:
  
Hi,
   
Can someone help me if this use case is possible or not with lucene
   
Use case: I have a string say 'Japan' appearing in 10 documents and I
   want
to get back , say some results which contain two words before 'Japan'
  and
two words after 'Japan' may be something like this ' Economy of Japan
  is
growing' etc.
   
 If it is not possible where should I look for such queries
   
Thanks
   
  
 



occurrence of two terms with the highest frequency

2015-02-12 Thread Maisnam Ns
Hi,

Can someone help me with this use case.

Use case: Say there are 4 keywords, 'Flying', 'Shooting', 'fighting' and
'looking', in 100 documents to search for.

Consider that 'Flying' and 'Shooting' co-occur (together) in 70 documents,
whereas

'Flying' and 'fighting' co-occur in 14 documents

'Flying' and 'looking' co-occur in 2 documents, and so on.

I have to list them in order, or rather show them on a web page:
1. Flying, Shooting - 70
2. Flying, fighting - 14
3. Flying, looking - 2

How do I achieve this, and please tell me what kind of query this
co-occurrence frequency is.
Is this possible in Lucene, and how do I proceed?

Please help and thanks in advance.

Regards