Re: Proximity query
Hi,

I googled it but could not find the jars for these classes. Can someone help me with where to get the jars?

import org.apache.lucene.corpus.stats.IDFCalc;
import org.apache.lucene.corpus.stats.TFIDFPriorityQueue;
import org.apache.lucene.corpus.stats.TermIDF;

Thanks

On Thu, Feb 12, 2015 at 11:01 PM, Maisnam Ns maisnam...@gmail.com wrote:

Hi Allison and Sujit, thanks so much for your links. I am so happy, I am looking at exactly the links that almost cover my use case. Allison, sure, will get back to you if I have some more questions. Regards, NS

On Thu, Feb 12, 2015 at 10:49 PM, Sujit Pal sujit@comcast.net wrote:

I did something like this some time back. The objective was to find patterns surrounding some keywords of interest so I could find keywords similar to the ones I was looking for, sort of like a poor man's word2vec. It uses SpanQuery as Jigar said, and you can find the code here (I believe it was written against Lucene 3.x, so you may have to upgrade it if you are using Lucene 4.x): http://sujitpal.blogspot.com/2011/08/implementing-concordance-with-lucene.html

-sujit

On Thu, Feb 12, 2015 at 8:57 AM, Maisnam Ns maisnam...@gmail.com wrote:

Hi Shah, thanks for your reply. Will try to google SpanQuery; meanwhile, if you have some links can you please share. Thanks

On Thu, Feb 12, 2015 at 10:17 PM, Jigar Shah jigaronl...@gmail.com wrote:

This concept is called Proximity Search in general. In Lucene it is achieved using SpanQuery.

On Thu, Feb 12, 2015 at 10:10 PM, Maisnam Ns maisnam...@gmail.com wrote:

Hi, can someone help me with whether this use case is possible with Lucene? Use case: I have a string, say 'Japan', appearing in 10 documents, and I want to get back results which contain two words before 'Japan' and two words after 'Japan', maybe something like 'Economy of Japan is growing', etc. If it is not possible, where should I look for such queries? Thanks
Re: A codec moment or pickle
Based on reading the same comments you read, I'm pretty doubtful that Codec.getDefault() is going to work. It seems to me that this situation renders FilterCodec a bit hard to use, at least given the 'every release deprecates a codec' sort of pattern.

On Thu, Feb 12, 2015 at 3:20 AM, Uwe Schindler u...@thetaphi.de wrote:

Hi,

How about Codec.getDefault()? It does indeed not necessarily return the newest one (if somebody changes the default using Codec.setDefault()), but for your use case of wrapping the current default one, it should be fine?

I have not tried this yet, but there might be a chicken-and-egg problem:

- Your codec will have a separate name and be listed in META-INF as a service (I assume this). So it gets discovered by the Codec discovery process and is instantiated by that.
- On loading the Codec framework, the call to Codec.getDefault() might get in at a time when the codecs are not yet fully initialized (because it will instantiate your codec while loading the META-INF). This happens before the Codec class is itself fully statically initialized, so the default codec might be null...

So relying on Codec.getDefault() in constructors of filter codecs may not work as expected! Maybe try it out, it was just an idea :-)

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-----Original Message-----
From: Benson Margulies [mailto:bimargul...@gmail.com]
Sent: Thursday, February 12, 2015 2:11 AM
To: java-user@lucene.apache.org
Subject: A codec moment or pickle

I have a class that extends FilterCodec. Written against Lucene 4.9, it uses the Lucene49Codec. Dropped into a copy of Solr with Lucene 4.10, it discovers that this codec is read-only in 4.10. Is there some way to code one of these to get 'the default codec' and not have to chase versions?
-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: A codec moment or pickle
Robert,

Let me lay out the scenario. Hardware has .5T of RAM. Index is relatively small (20g). Application profiling shows a significant amount of time spent codec-ing. Options as I see them:

1. Use DPF, complete with the irritation of having to have this spurious codec name in the on-disk format that has nothing to do with the on-disk format.
2. 'Officially' use the standard codec, and then use something like AOP to intercept and encapsulate it with the DPF or something else like it -- essentially, a do-it-myself alternative to convincing the community here that this is a use case worthy of support.
3. Find some way to move a significant amount of the data in question out of Lucene altogether into something else which fits nicely together with filling memory with a cache, so that the amount of codec-ing drops below the threshold of interest.
Re: A codec moment or pickle
WHOOPS. First sentence was, until just before I clicked 'send': Hardware has .5T of RAM. Index is relatively small (20g) ...

On Thu, Feb 12, 2015 at 4:51 PM, Benson Margulies ben...@basistech.com wrote:

Robert,

Let me lay out the scenario. Hardware has .5T of Index is relatively small. Application profiling shows a significant amount of time spent codec-ing. Options as I see them:

1. Use DPF, complete with the irritation of having to have this spurious codec name in the on-disk format that has nothing to do with the on-disk format.
2. 'Officially' use the standard codec, and then use something like AOP to intercept and encapsulate it with the DPF or something else like it -- essentially, a do-it-myself alternative to convincing the community here that this is a use case worthy of support.
3. Find some way to move a significant amount of the data in question out of Lucene altogether into something else which fits nicely together with filling memory with a cache, so that the amount of codec-ing drops below the threshold of interest.
Re: occurrence of two terms with the highest frequency
I think you can do it with 4 simple queries:

1) +flying +shooting
2) +flying +fighting

etc., or BooleanQuery equivalents with MUST clauses. Use org.apache.lucene.search.TotalHitCountCollector and it should be blazingly fast, even if you have more than 100 docs.

--
Ian.

On Thu, Feb 12, 2015 at 5:42 PM, Maisnam Ns maisnam...@gmail.com wrote:

Hi, can someone help me with this use case.

Use case: Say there are 4 keywords, 'Flying', 'Shooting', 'fighting' and 'looking', in 100 documents to search for. Consider 'Flying' and 'Shooting' co-occur (together) in 70 documents, whereas 'Flying' and 'fighting' co-occur in 14 documents, and 'Flying' and 'looking' co-occur in 2 documents, and so on. I have to list them in order, or rather show them on a web page:

1. Flying, Shooting - 70
2. Flying, fighting - 14
3. Flying, looking - 2

How to achieve this, and please tell me what kind of query this co-occurrence frequency is. Is this possible in Lucene? And how to proceed? Please help, and thanks in advance.

Regards
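To make the counting concrete, here is a plain-Java sketch (no Lucene involved; the corpus and terms are invented for illustration) of what each +termA +termB conjunction counts: the number of documents in which both terms occur.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java sketch of what a +termA +termB conjunction query counts:
// the number of documents containing both terms. In Lucene you would
// run a BooleanQuery with two MUST clauses through a
// TotalHitCountCollector instead of scanning documents yourself.
class CooccurrenceSketch {
    // Each "document" is modeled as the set of its terms.
    static int hitCount(List<Set<String>> docs, String a, String b) {
        int n = 0;
        for (Set<String> doc : docs) {
            if (doc.contains(a) && doc.contains(b)) n++; // both terms MUST match
        }
        return n;
    }

    public static void main(String[] args) {
        List<Set<String>> docs = Arrays.asList(
            new HashSet<>(Arrays.asList("flying", "shooting")),
            new HashSet<>(Arrays.asList("flying", "fighting")),
            new HashSet<>(Arrays.asList("flying", "shooting", "looking")));
        System.out.println(hitCount(docs, "flying", "shooting")); // counts docs with both
    }
}
```

Running the four pairs and sorting the counts in descending order gives exactly the ranked list in the question; TotalHitCountCollector is fast because it only counts hits and never scores or collects the documents themselves.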
RE: A codec moment or pickle
Hi,

How about Codec.getDefault()? It does indeed not necessarily return the newest one (if somebody changes the default using Codec.setDefault()), but for your use case of wrapping the current default one, it should be fine?

I have not tried this yet, but there might be a chicken-and-egg problem:

- Your codec will have a separate name and be listed in META-INF as a service (I assume this). So it gets discovered by the Codec discovery process and is instantiated by that.
- On loading the Codec framework, the call to Codec.getDefault() might get in at a time when the codecs are not yet fully initialized (because it will instantiate your codec while loading the META-INF). This happens before the Codec class is itself fully statically initialized, so the default codec might be null...

So relying on Codec.getDefault() in constructors of filter codecs may not work as expected! Maybe try it out, it was just an idea :-)

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-----Original Message-----
From: Benson Margulies [mailto:bimargul...@gmail.com]
Sent: Thursday, February 12, 2015 2:11 AM
To: java-user@lucene.apache.org
Subject: A codec moment or pickle

I have a class that extends FilterCodec. Written against Lucene 4.9, it uses the Lucene49Codec. Dropped into a copy of Solr with Lucene 4.10, it discovers that this codec is read-only in 4.10. Is there some way to code one of these to get 'the default codec' and not have to chase versions?
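The initialization-order hazard described here can be reproduced without Lucene at all. In this hypothetical sketch (all names invented), a registry's static initializer instantiates a "codec" SPI-style before the registry's own default field has been assigned, so the constructor observes null; this is the same window in which a FilterCodec constructor calling Codec.getDefault() could see a null default.

```java
// Hypothetical sketch of the chicken-and-egg problem: Registry's
// static initializer runs top to bottom, so DISCOVERED is created
// (mimicking SPI discovery) before defaultInstance is assigned.
// Any constructor that reads the default during that window sees null.
class Registry {
    static final MyCodec DISCOVERED = new MyCodec();          // runs first
    static final Registry defaultInstance = new Registry();   // assigned second
    static Registry getDefault() { return defaultInstance; }
}

class MyCodec {
    final Registry delegate;
    MyCodec() {
        // Registry's static init is still in progress on this thread,
        // so this call does not re-trigger initialization; it simply
        // observes the not-yet-assigned (null) default.
        delegate = Registry.getDefault();
    }
}
```

The practical consequence matches the advice in this thread: pass an explicit, concrete delegate to the FilterCodec constructor rather than consulting the default at construction time.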
Re: A codec moment or pickle
Honestly, I don't agree. I don't know what you are trying to do, but if you want file-format backwards compat working, then you need a different FilterCodec to match each Lucene codec. Otherwise your codec is broken from a back-compat standpoint. Wrapping the latest is an antipattern here.

On Thu, Feb 12, 2015 at 5:33 AM, Benson Margulies ben...@basistech.com wrote:

Based on reading the same comments you read, I'm pretty doubtful that Codec.getDefault() is going to work. It seems to me that this situation renders FilterCodec a bit hard to use, at least given the 'every release deprecates a codec' sort of pattern.

On Thu, Feb 12, 2015 at 3:20 AM, Uwe Schindler u...@thetaphi.de wrote:

Hi,

How about Codec.getDefault()? It does indeed not necessarily return the newest one (if somebody changes the default using Codec.setDefault()), but for your use case of wrapping the current default one, it should be fine?

I have not tried this yet, but there might be a chicken-and-egg problem:

- Your codec will have a separate name and be listed in META-INF as a service (I assume this). So it gets discovered by the Codec discovery process and is instantiated by that.
- On loading the Codec framework, the call to Codec.getDefault() might get in at a time when the codecs are not yet fully initialized (because it will instantiate your codec while loading the META-INF). This happens before the Codec class is itself fully statically initialized, so the default codec might be null...

So relying on Codec.getDefault() in constructors of filter codecs may not work as expected! Maybe try it out, it was just an idea :-)

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-----Original Message-----
From: Benson Margulies [mailto:bimargul...@gmail.com]
Sent: Thursday, February 12, 2015 2:11 AM
To: java-user@lucene.apache.org
Subject: A codec moment or pickle

I have a class that extends FilterCodec. Written against Lucene 4.9, it uses the Lucene49Codec. Dropped into a copy of Solr with Lucene 4.10, it discovers that this codec is read-only in 4.10. Is there some way to code one of these to get 'the default codec' and not have to chase versions?
Re: A codec moment or pickle
On Thu, Feb 12, 2015 at 8:51 AM, Benson Margulies ben...@basistech.com wrote:

On Thu, Feb 12, 2015 at 8:43 AM, Robert Muir rcm...@gmail.com wrote:

Honestly, I don't agree. I don't know what you are trying to do, but if you want file-format backwards compat working, then you need a different FilterCodec to match each Lucene codec. Otherwise your codec is broken from a back-compat standpoint. Wrapping the latest is an antipattern here.

I understand this logic. It leaves me wandering between:

1: My old desire to convince you that there should be a way to do DirectPostingFormat's caching without being a codec at all. Unfortunately, I got dragged away from the benchmarking that might have been persuasive.

Honestly, benchmarking won't persuade me. I think this is a trap and I don't want more of these traps. We already have RAMDirectory(Directory other), which is this exact same trap. We don't need more duplicates of it. But this Direct, man oh man, is it even worse by far, because it uses 32 and 64 bits for things that really should typically only be like 8 bits with compression, so it just hogs up RAM. There isn't a benchmark on this planet that can convince me it should get any higher status. On the contrary, I want to send it into a deep dark dungeon in Siberia.
RE: A codec moment or pickle
Hi,

FYI, this is the same issue as Locales have/had in ICU! If you try to render an error message in Locale's constructors, this breaks with NPE, because the default Locale is not yet there... I think they implemented some fallback that is guaranteed to be there. But this would not help you either: you need the default Codec to be available at the time your custom codec is loaded... Same issue, no idea how to solve this.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-----Original Message-----
From: Benson Margulies [mailto:ben...@basistech.com]
Sent: Thursday, February 12, 2015 11:34 AM
To: java-user@lucene.apache.org
Subject: Re: A codec moment or pickle

Based on reading the same comments you read, I'm pretty doubtful that Codec.getDefault() is going to work. It seems to me that this situation renders FilterCodec a bit hard to use, at least given the 'every release deprecates a codec' sort of pattern.

On Thu, Feb 12, 2015 at 3:20 AM, Uwe Schindler u...@thetaphi.de wrote:

Hi,

How about Codec.getDefault()? It does indeed not necessarily return the newest one (if somebody changes the default using Codec.setDefault()), but for your use case of wrapping the current default one, it should be fine?

I have not tried this yet, but there might be a chicken-and-egg problem:

- Your codec will have a separate name and be listed in META-INF as a service (I assume this). So it gets discovered by the Codec discovery process and is instantiated by that.
- On loading the Codec framework, the call to Codec.getDefault() might get in at a time when the codecs are not yet fully initialized (because it will instantiate your codec while loading the META-INF). This happens before the Codec class is itself fully statically initialized, so the default codec might be null...

So relying on Codec.getDefault() in constructors of filter codecs may not work as expected! Maybe try it out, it was just an idea :-)

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-----Original Message-----
From: Benson Margulies [mailto:bimargul...@gmail.com]
Sent: Thursday, February 12, 2015 2:11 AM
To: java-user@lucene.apache.org
Subject: A codec moment or pickle

I have a class that extends FilterCodec. Written against Lucene 4.9, it uses the Lucene49Codec. Dropped into a copy of Solr with Lucene 4.10, it discovers that this codec is read-only in 4.10. Is there some way to code one of these to get 'the default codec' and not have to chase versions?
Re: A codec moment or pickle
On Thu, Feb 12, 2015 at 8:43 AM, Robert Muir rcm...@gmail.com wrote:

Honestly, I don't agree. I don't know what you are trying to do, but if you want file-format backwards compat working, then you need a different FilterCodec to match each Lucene codec. Otherwise your codec is broken from a back-compat standpoint. Wrapping the latest is an antipattern here.

I understand this logic. It leaves me wandering between:

1: My old desire to convince you that there should be a way to do DirectPostingFormat's caching without being a codec at all. Unfortunately, I got dragged away from the benchmarking that might have been persuasive.

2: The problem of deprecation. I give someone a jar-of-code that works fine with Lucene 4.9. It does not work with 4.10. Now, maybe the answer here is that the codec deprecation is fundamental to the definition of moving from 4.9 to 4.10, so having a codec means that I'm really married to a process of making releases that mirror Lucene releases.

On Thu, Feb 12, 2015 at 5:33 AM, Benson Margulies ben...@basistech.com wrote:

Based on reading the same comments you read, I'm pretty doubtful that Codec.getDefault() is going to work. It seems to me that this situation renders FilterCodec a bit hard to use, at least given the 'every release deprecates a codec' sort of pattern.

On Thu, Feb 12, 2015 at 3:20 AM, Uwe Schindler u...@thetaphi.de wrote:

Hi,

How about Codec.getDefault()? It does indeed not necessarily return the newest one (if somebody changes the default using Codec.setDefault()), but for your use case of wrapping the current default one, it should be fine?

I have not tried this yet, but there might be a chicken-and-egg problem:

- Your codec will have a separate name and be listed in META-INF as a service (I assume this). So it gets discovered by the Codec discovery process and is instantiated by that.
- On loading the Codec framework, the call to Codec.getDefault() might get in at a time when the codecs are not yet fully initialized (because it will instantiate your codec while loading the META-INF). This happens before the Codec class is itself fully statically initialized, so the default codec might be null...

So relying on Codec.getDefault() in constructors of filter codecs may not work as expected! Maybe try it out, it was just an idea :-)

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-----Original Message-----
From: Benson Margulies [mailto:bimargul...@gmail.com]
Sent: Thursday, February 12, 2015 2:11 AM
To: java-user@lucene.apache.org
Subject: A codec moment or pickle

I have a class that extends FilterCodec. Written against Lucene 4.9, it uses the Lucene49Codec. Dropped into a copy of Solr with Lucene 4.10, it discovers that this codec is read-only in 4.10. Is there some way to code one of these to get 'the default codec' and not have to chase versions?
RE: Proximity query
Might also look at the concordance code on LUCENE-5317 and here: https://github.com/tballison/lucene-addons/tree/master/lucene-5317

Let me know if you have any questions.

-----Original Message-----
From: Maisnam Ns [mailto:maisnam...@gmail.com]
Sent: Thursday, February 12, 2015 11:57 AM
To: java-user@lucene.apache.org
Subject: Re: Proximity query

Hi Shah, thanks for your reply. Will try to google SpanQuery; meanwhile, if you have some links can you please share. Thanks

On Thu, Feb 12, 2015 at 10:17 PM, Jigar Shah jigaronl...@gmail.com wrote:

This concept is called Proximity Search in general. In Lucene it is achieved using SpanQuery.

On Thu, Feb 12, 2015 at 10:10 PM, Maisnam Ns maisnam...@gmail.com wrote:

Hi, can someone help me with whether this use case is possible with Lucene? Use case: I have a string, say 'Japan', appearing in 10 documents, and I want to get back results which contain two words before 'Japan' and two words after 'Japan', maybe something like 'Economy of Japan is growing', etc. If it is not possible, where should I look for such queries? Thanks
Re: Lucene Version Upgrade (3-4) and Java JVM Versions(6-8)
On Thu, Feb 12, 2015 at 11:58 AM, McKinley, James T james.mckin...@cengage.com wrote:

Hi Robert, thanks for responding to my message. Are you saying that you or others have encountered problems running Lucene 4.8+ on the 64-bit Java SE 1.7 JVM with G1, and was it on Windows or on Linux? If so, where can I find out more? I only looked into the one bug because that was the only bug I saw on the https://wiki.apache.org/lucene-java/JavaBugs page that was related to G1. If there are other Lucene on Java 1.7 with G1 related bugs, how can I find them? Also, are these failures something that would be triggered by running the standard Lucene 4.8.1 test suite, or are there other tests I should run in order to reproduce these bugs?

You can't reproduce them easily. That is the nature of such bugs. When I see the crashes, I generally try to confirm it's not a Lucene bug. E.g. I'll run it a thousand times with/without G1, and if only G1 fails, I move on with life. There just isn't time. Occasionally G1 frustrates me enough that I'll go and open an issue, like this one: https://issues.apache.org/jira/browse/LUCENE-6098

That's a perfect example of what these bugs look like: horribly scary failures that can cause bad things, and reproduce like 1/1000 times with G1, essentially impossible to debug. They happen quite often in our various jenkins servers, on both 32-bit and 64-bit, and even with the most recent (e.g. 1.8.0_25 or 1.8.0_40-ea) JVMs.
Re: Proximity query
I did something like this some time back. The objective was to find patterns surrounding some keywords of interest so I could find keywords similar to the ones I was looking for, sort of like a poor man's word2vec. It uses SpanQuery as Jigar said, and you can find the code here (I believe it was written against Lucene 3.x, so you may have to upgrade it if you are using Lucene 4.x): http://sujitpal.blogspot.com/2011/08/implementing-concordance-with-lucene.html

-sujit

On Thu, Feb 12, 2015 at 8:57 AM, Maisnam Ns maisnam...@gmail.com wrote:

Hi Shah, thanks for your reply. Will try to google SpanQuery; meanwhile, if you have some links can you please share. Thanks

On Thu, Feb 12, 2015 at 10:17 PM, Jigar Shah jigaronl...@gmail.com wrote:

This concept is called Proximity Search in general. In Lucene it is achieved using SpanQuery.

On Thu, Feb 12, 2015 at 10:10 PM, Maisnam Ns maisnam...@gmail.com wrote:

Hi, can someone help me with whether this use case is possible with Lucene? Use case: I have a string, say 'Japan', appearing in 10 documents, and I want to get back results which contain two words before 'Japan' and two words after 'Japan', maybe something like 'Economy of Japan is growing', etc. If it is not possible, where should I look for such queries? Thanks
Proximity query
Hi, can someone help me with whether this use case is possible with Lucene?

Use case: I have a string, say 'Japan', appearing in 10 documents, and I want to get back results which contain two words before 'Japan' and two words after 'Japan', maybe something like 'Economy of Japan is growing', etc.

If it is not possible, where should I look for such queries? Thanks
RE: Lucene Version Upgrade (3-4) and Java JVM Versions(6-8)
Hi Robert,

Thanks for responding to my message. Are you saying that you or others have encountered problems running Lucene 4.8+ on the 64-bit Java SE 1.7 JVM with G1, and was it on Windows or on Linux? If so, where can I find out more? I only looked into the one bug because that was the only bug I saw on the https://wiki.apache.org/lucene-java/JavaBugs page that was related to G1. If there are other Lucene on Java 1.7 with G1 related bugs, how can I find them? Also, are these failures something that would be triggered by running the standard Lucene 4.8.1 test suite, or are there other tests I should run in order to reproduce these bugs?

We have been running the user-facing runtime portion of our search engine using Java SE 1.7.0_04 with the G1 garbage collector for almost two years now, and I was not aware of these JVM bugs with Lucene. However, the indexing workflow portion of our system uses Parallel GC, since it is a batch system and is not constrained by user-facing response time requirements. From what I understood from the JDK-8038348 bug comments, it is a compiler bug that can be tripped when using G1, and if the compiler is producing incorrect code I guess any behaviour is possible.

We have experienced index corruption 3 times so far since upgrading to Lucene 4.8.1 from Lucene 4.4 (I don't recall any corruption prior to moving to 4.8), but as I said we are using Parallel GC (-XX:+UseParallelGC -XX:+UseParallelOldGC) in the indexing workflow that writes the indexes; we only use G1 in the runtime system that does no index writing. We have twice encountered index corruption during the index creation workflow (the runtime system never opened the indexes) and once found the index to be corrupt when we restarted the runtime on it. So this may just be JVM bugs that can be triggered regardless of which garbage collector is used (which is of course even worse).

We do have relatively large indexes (530M+ docs total across 30 partitions), so maybe we're more likely to see corruption even when using Parallel GC? We haven't seen any corruption since the end of September 2014, but we have now added an index-checking step to our workflow to ensure we don't ever point the runtime at a bad batch. When we've encountered index corruption in the past, we've just deleted the bad batch and re-run the workflow, and the subsequent runs have succeeded. We've never figured out what caused the corruption.

Thanks for any further help.

Jim

From: Robert Muir [rcm...@gmail.com]
Sent: Wednesday, February 11, 2015 5:05 PM
To: java-user
Subject: Re: Lucene Version Upgrade (3-4) and Java JVM Versions(6-8)

No, because you only looked into one bug. We have seen and do see many G1-related test failures, including the latest 1.8.0 update 40 early access editions. These include things like corruption. I added this message with *every intention* to scare away users, because I don't want them having index corruption. I am sick of people asking "but isn't it fine on the latest version" and so on. It is not.

On Wed, Feb 11, 2015 at 11:41 AM, McKinley, James T james.mckin...@cengage.com wrote:

Hi, a couple of mailing list members have brought the following paragraph from the https://wiki.apache.org/lucene-java/JavaBugs page to my attention:

"Do not, under any circumstances, run Lucene with the G1 garbage collector. Lucene's test suite fails with the G1 garbage collector on a regular basis, including bugs that cause index corruption. There is no person on this planet that seems to understand such bugs (see https://bugs.openjdk.java.net/browse/JDK-8038348, open for over a year), so don't count on the situation changing soon. This information is not out of date, and don't think that the next oracle java release will fix the situation."

Since we run Lucene 4.8.1 on Java(TM) SE Runtime Environment (build 1.7.0_04-b20), Java HotSpot(TM) 64-Bit Server VM (build 23.0-b21, mixed mode) using G1GC in production, I felt I should look into the issue and see if it is reproducible in our environment. First I read the bug linked in the above paragraph as well as https://issues.apache.org/jira/browse/LUCENE-5168, and it appears quite a bit of work in trying to track down this bug has already been done by Dawid Weiss and Vladimir Kozlov, but it seems it is limited to the 32-bit JVM (maybe even only on Windows). To quote Dawid Weiss from the Jira bug:

"My quest continues. I thought it'd be interesting to see how far back I can trace this issue. I fetched the official binaries for jdk17 (windows, 32-bit) and did a binary search with the failing Lucene test command. The results show that, in short: ... jdk1.7.0_03: PASSES jdk1.7.0_04: FAILS ... and are consistent before and after. jdk1.7.0_04, 64-bit does *NOT* exhibit the issue (and neither does any version afterwards, it only happens on 32-bit; perhaps it's because of smaller number of
Re: Proximity query
This concept is called Proximity Search in general. In Lucene it is achieved using SpanQuery.

On Thu, Feb 12, 2015 at 10:10 PM, Maisnam Ns maisnam...@gmail.com wrote:

Hi, can someone help me with whether this use case is possible with Lucene? Use case: I have a string, say 'Japan', appearing in 10 documents, and I want to get back results which contain two words before 'Japan' and two words after 'Japan', maybe something like 'Economy of Japan is growing', etc. If it is not possible, where should I look for such queries? Thanks
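As a rough, Lucene-free illustration of the window the question asks for (class name and example text invented), the sketch below pulls up to two tokens before and after each occurrence of a keyword; with a SpanQuery you would instead get match positions from the index and rebuild the same window from the stored or re-analyzed text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch: for each occurrence of `keyword`, return up to
// `width` tokens of context on each side. For width=2 on
// "Economy of Japan is growing fast" the window around "Japan" is
// "Economy of Japan is growing".
class WindowSketch {
    static List<String> windows(String text, String keyword, int width) {
        String[] tokens = text.split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(keyword)) {
                int from = Math.max(0, i - width);              // clamp at start
                int to = Math.min(tokens.length, i + width + 1); // clamp at end
                out.add(String.join(" ", Arrays.copyOfRange(tokens, from, to)));
            }
        }
        return out;
    }
}
```

This is only the windowing step; the point of doing it through SpanQuery is that the index finds the matching documents and positions for you instead of scanning every document's text.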
Re: Proximity query
Hi Shah, thanks for your reply. Will try to google SpanQuery; meanwhile, if you have some links can you please share. Thanks

On Thu, Feb 12, 2015 at 10:17 PM, Jigar Shah jigaronl...@gmail.com wrote:

This concept is called Proximity Search in general. In Lucene it is achieved using SpanQuery.

On Thu, Feb 12, 2015 at 10:10 PM, Maisnam Ns maisnam...@gmail.com wrote:

Hi, can someone help me with whether this use case is possible with Lucene? Use case: I have a string, say 'Japan', appearing in 10 documents, and I want to get back results which contain two words before 'Japan' and two words after 'Japan', maybe something like 'Economy of Japan is growing', etc. If it is not possible, where should I look for such queries? Thanks
Re: Proximity query
Hi Allison and Sujit, thanks so much for your links. I am so happy, I am looking at exactly the links that almost cover my use case. Allison, sure, will get back to you if I have some more questions.

Regards, NS

On Thu, Feb 12, 2015 at 10:49 PM, Sujit Pal sujit@comcast.net wrote:

I did something like this some time back. The objective was to find patterns surrounding some keywords of interest so I could find keywords similar to the ones I was looking for, sort of like a poor man's word2vec. It uses SpanQuery as Jigar said, and you can find the code here (I believe it was written against Lucene 3.x, so you may have to upgrade it if you are using Lucene 4.x): http://sujitpal.blogspot.com/2011/08/implementing-concordance-with-lucene.html

-sujit

On Thu, Feb 12, 2015 at 8:57 AM, Maisnam Ns maisnam...@gmail.com wrote:

Hi Shah, thanks for your reply. Will try to google SpanQuery; meanwhile, if you have some links can you please share. Thanks

On Thu, Feb 12, 2015 at 10:17 PM, Jigar Shah jigaronl...@gmail.com wrote:

This concept is called Proximity Search in general. In Lucene it is achieved using SpanQuery.

On Thu, Feb 12, 2015 at 10:10 PM, Maisnam Ns maisnam...@gmail.com wrote:

Hi, can someone help me with whether this use case is possible with Lucene? Use case: I have a string, say 'Japan', appearing in 10 documents, and I want to get back results which contain two words before 'Japan' and two words after 'Japan', maybe something like 'Economy of Japan is growing', etc. If it is not possible, where should I look for such queries? Thanks
occurrence of two terms with the highest frequency
Hi, can someone help me with this use case.

Use case: Say there are 4 keywords, 'Flying', 'Shooting', 'fighting' and 'looking', in 100 documents to search for. Consider 'Flying' and 'Shooting' co-occur (together) in 70 documents, whereas 'Flying' and 'fighting' co-occur in 14 documents, and 'Flying' and 'looking' co-occur in 2 documents, and so on. I have to list them in order, or rather show them on a web page:

1. Flying, Shooting - 70
2. Flying, fighting - 14
3. Flying, looking - 2

How to achieve this, and please tell me what kind of query this co-occurrence frequency is. Is this possible in Lucene? And how to proceed? Please help, and thanks in advance.

Regards