Re: First cut at web-based Luke for contrib
The 17 MB bundle I provided is essentially the source plus dependencies, the bulk of which is jars, mainly the compile-time dependency gwt-dev-windows.jar weighing in at 10 MB. The built WAR file is only 1.5 MB. The WAR file bundled with Jetty (as a convenience) is 8 MB. It may be possible to use ProGuard or something like that to try to slim down gwt-dev-windows.jar.

Cheers,
Mark

Doug Cutting wrote:
> Mark Miller wrote:
>> My only concern is with the size increase this will give to the Lucene jar. Another 17 meg - yikes!
>
> You mean the release tar file, not the jar, right? Is the size of the release really an issue for folks?
>
> Doug

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Potential bug in StandardTokenizerImpl
I agree that being backward compatible is important. But ... I also work at a company that delivers search solutions to many customers. Sometimes customers are told that a specific fix will require them to rebuild their indexes. Customers can then choose whether or not to install the fix.

However, from your statement below I gather that once Lucene 3.0 is out, we won't have to be backward compatible, and that fix can go into that release ... if I'm right, then someone can mark that issue for 3.0 and not 2.3 (I'm not sure I have the permissions to do so).

Isn't there a way to include a fix that you can choose whether or not to install? For example, I may want to download 2.3 (when it's out) and apply this patch only. I'm sure there's a way to do it. If there is, we could publish this as official in 3.0 and as an available patch for 2.3 (I fixed it only in jflex, but can easily produce a patch for the .jj file, so it will fix the 2.2 version as well). My only concern is that this patch will get lost if we don't mark it for any release ...

Shai

On Nov 28, 2007 9:18 PM, Chris Hostetter <[EMAIL PROTECTED]> wrote:
>
> : Thanks to Shai Erera for translating the discussion into the developers'
> : list. I am surprised about Chris Hostetter's response, as this issue was
>
> to clarify: i'm not saying that the current behavior is ideal, or even
> correct -- i'm saying the current behavior is the current behavior, and
> changing it could easily break existing indexes -- something that the
> Lucene upgrade contract does not allow...
>
> http://wiki.apache.org/lucene-java/BackwardsCompatibility
>
> specifically: if someone built an index with 2.2, that index needs to work
> when queried by an app running 2.3 .. if we change the StandardTokenizer
> to treat this differently, that won't work.
>
> In some cases, being backwards compatible is more important than being
> "correct" ... i'm not 100% certain that this is one of those cases, i'm
> just pointing out that there is more to this issue than just a one-line
> patch to some code.
>
> -Hoss

--
Regards,
Shai Erera
[jira] Resolved: (LUCENE-1071) SegmentMerger doesn't set payload bit in new optimized code
[ https://issues.apache.org/jira/browse/LUCENE-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch resolved LUCENE-1071.
-----------------------------------
    Resolution: Fixed

Committed.

> SegmentMerger doesn't set payload bit in new optimized code
> -----------------------------------------------------------
>
>                 Key: LUCENE-1071
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1071
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 2.3
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>             Fix For: 2.3
>
>         Attachments: lucene-1071.patch
>
> In the new optimized code in SegmentMerger the payload bit is not set correctly
> in the merged segment. This means that we lose all payloads during a merge!
> The Payloads unit test doesn't catch this. Now that we have the new
> DocumentsWriter we buffer many more docs by default than before. This means
> that the test cases can't assume anymore that the DocsWriter flushes after 10
> docs by default. TestPayloads however falsely assumed this, which means that no
> merges happen anymore in TestPayloads. We should check whether there are
> other test cases that rely on this.
> The fixes for TestPayloads and SegmentMerger are very simple; I'll attach a patch
> soon.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
[jira] Updated: (LUCENE-1071) SegmentMerger doesn't set payload bit in new optimized code
[ https://issues.apache.org/jira/browse/LUCENE-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-1071:
----------------------------------
    Attachment: lucene-1071.patch

I'm going to commit this very soon.

> SegmentMerger doesn't set payload bit in new optimized code
>
>                 Key: LUCENE-1071
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1071
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 2.3
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>             Fix For: 2.3
>
>         Attachments: lucene-1071.patch
[jira] Created: (LUCENE-1071) SegmentMerger doesn't set payload bit in new optimized code
SegmentMerger doesn't set payload bit in new optimized code
-----------------------------------------------------------

                Key: LUCENE-1071
                URL: https://issues.apache.org/jira/browse/LUCENE-1071
            Project: Lucene - Java
         Issue Type: Bug
   Affects Versions: 2.3
           Reporter: Michael Busch
           Assignee: Michael Busch
            Fix For: 2.3

In the new optimized code in SegmentMerger the payload bit is not set correctly in the merged segment. This means that we lose all payloads during a merge! The Payloads unit test doesn't catch this. Now that we have the new DocumentsWriter we buffer many more docs by default than before. This means that the test cases can't assume anymore that the DocsWriter flushes after 10 docs by default. TestPayloads however falsely assumed this, which means that no merges happen anymore in TestPayloads. We should check whether there are other test cases that rely on this. The fixes for TestPayloads and SegmentMerger are very simple; I'll attach a patch soon.
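The invariant the fix has to restore can be shown with a small self-contained sketch (the names below are illustrative, not the patch itself): after a merge, the merged field must store payloads if any input segment stored payloads for that field.

```java
// Hedged sketch of the payload-bit merge invariant described in LUCENE-1071.
// This is not Lucene's SegmentMerger code; it only illustrates the rule that
// the merged segment's payload bit is the OR of the input segments' bits.
public class MergePayloadBit {
    static boolean mergedStorePayloads(boolean[] segmentPayloadBits) {
        boolean store = false;
        for (boolean b : segmentPayloadBits) {
            store |= b;  // forgetting this OR is exactly how payloads get lost
        }
        return store;
    }
}
```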
Payload Loading and Reloading
In working on LUCENE-1001, things are getting a bit complicated with loading payloads in overlapping spans (which causes the dreaded "Can't load payload more than once" error). This got me thinking about why we need the rule that payloads can only be loaded once. I forget the reasoning behind this.

Can we just store the current position before we load the payload and then seek back to that point if we need to load the payload again? I suppose in the case of really large payloads the seek on the IndexInput could be expensive, but in reality most payloads aren't likely to be more than a few bytes, right? There also seem to be some interactions with the lazy skipping that I haven't quite pinned down yet. What else am I forgetting?

The other alternative I can think of is that I could cache the payloads, but that seems unwieldy too.

-Grant
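The caching alternative mentioned above could look roughly like the following. This is a hedged sketch with a hypothetical PayloadSource abstraction, not Lucene's actual TermPositions API: read the payload from the underlying input once, keep the bytes, and serve later calls from the copy instead of re-seeking the IndexInput.

```java
// Illustrative sketch of caching a payload so it can be handed out more than
// once. PayloadSource is a made-up stand-in for whatever reads the bytes off
// disk; Lucene's real API differs.
public class CachedPayload {
    interface PayloadSource {
        byte[] readPayloadOnce();  // hypothetical: may only be called once
    }

    private final PayloadSource source;
    private byte[] cached;  // null until first load

    CachedPayload(PayloadSource source) {
        this.source = source;
    }

    // Safe to call any number of times; the underlying read happens only once.
    byte[] getPayload() {
        if (cached == null) {
            cached = source.readPayloadOnce();
        }
        return cached.clone();  // defensive copy so callers can't mutate the cache
    }
}
```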
Re: First cut at web-based Luke for contrib
Yes, I didn't mean jar, I meant zip ... or tar. I guess this may not be a sticking point for most people. For me, I just get a knee-jerk reaction seeing the dist size quadruple for a feature that is, in reality, a very small part of the dist. I am not arguing against adding it, just noting my stomach drop. Take it for what you will. It wouldn't stop me from downloading Lucene.

- Mark

Doug Cutting wrote:
> Mark Miller wrote:
>> My only concern is with the size increase this will give to the Lucene jar. Another 17 meg - yikes!
>
> You mean the release tar file, not the jar, right? Is the size of the release really an issue for folks?
>
> Doug
Re: First cut at web-based Luke for contrib
Mark Miller wrote:
> My only concern is with the size increase this will give to the Lucene jar. Another 17 meg - yikes!

You mean the release tar file, not the jar, right? Is the size of the release really an issue for folks?

Doug
Re: First cut at web-based Luke for contrib
Compiled and ran it on Vista. Very cool. I am also a huge GWT fan. This is a great start. The only issue I ran into was also the scrolling issue when selecting the drive ... but based on the TODO: comments, it appears you have seen that.

Grant: It doesn't ask permission to read your hard drive because a webapp running in Jetty is reading the server hard drive and sending the info to GWT with Ajax RPC calls.

My only concern is with the size increase this will give to the Lucene jar. Another 17 meg - yikes!

- Mark

Grant Ingersoll wrote:
> Seriously cool!
>
> On Nov 28, 2007, at 6:13 PM, markharw00d wrote:
>> Any takers to test this contrib layout before I commit it?
>> http://www.inperspective.com/lucene/webluke.zip
>> This is a (17MB) zip file which you can unzip to a new "webluke" directory under your copy of lucene/contrib and then run the usual Lucene Ant build (or at least "ant build-contrib"). You should then find build/contrib/webluke/WebLuke.war plus a Jetty-based server in build/contrib/webluke/dist which can be started using java -jar start.jar. A Luke webapp should then be available on http://localhost:8080/WebLuke
>
> The zip doesn't have a parent dir, so I would put it under contrib/webluke
>
>> I've tested building here on XP but want to check that the GWT compile task works OK for others as the Google compiler is packaged in a platform-specific windows jar. I've tested this ant build with the windows jar on a Linux box and all was OK so I'm guessing the platform-specific bits of the google dev tools are related to the browser choices used in their "hosted" dev mode rather than the Java-to-JavaScript compiler.
>
> Works for me on OS X 10.5 using Java 1.5. The only minor annoyance is that the expansion of directories when choosing the index directory always scrolls back to the top, but heh, that is just a nit. Also, doesn't it have to ask me for permission to access my hard drive? Don't know about GWT, so maybe just my ignorance.
>
> At any rate, I had it up and running and browsing my index in minutes. And the visualization of Zipf's law is really cool too, as is the vocab growth graph! Very nice work and I imagine it will only get better. +1 for adding it to contrib!
>
> -Grant
Re: First cut at web-based Luke for contrib
Seriously cool!

On Nov 28, 2007, at 6:13 PM, markharw00d wrote:
> Any takers to test this contrib layout before I commit it?
> http://www.inperspective.com/lucene/webluke.zip
> This is a (17MB) zip file which you can unzip to a new "webluke" directory under your copy of lucene/contrib and then run the usual Lucene Ant build (or at least "ant build-contrib"). You should then find build/contrib/webluke/WebLuke.war plus a Jetty-based server in build/contrib/webluke/dist which can be started using java -jar start.jar. A Luke webapp should then be available on http://localhost:8080/WebLuke

The zip doesn't have a parent dir, so I would put it under contrib/webluke

> I've tested building here on XP but want to check that the GWT compile task works OK for others as the Google compiler is packaged in a platform-specific windows jar. I've tested this ant build with the windows jar on a Linux box and all was OK so I'm guessing the platform-specific bits of the google dev tools are related to the browser choices used in their "hosted" dev mode rather than the Java-to-JavaScript compiler.

Works for me on OS X 10.5 using Java 1.5. The only minor annoyance is that the expansion of directories when choosing the index directory always scrolls back to the top, but heh, that is just a nit. Also, doesn't it have to ask me for permission to access my hard drive? Don't know about GWT, so maybe just my ignorance.

At any rate, I had it up and running and browsing my index in minutes. And the visualization of Zipf's law is really cool too, as is the vocab growth graph! Very nice work and I imagine it will only get better. +1 for adding it to contrib!

-Grant
[jira] Commented: (LUCENE-588) Escaped wildcard character in wildcard term not handled correctly
[ https://issues.apache.org/jira/browse/LUCENE-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546485 ]

Michael Busch commented on LUCENE-588:
--------------------------------------

True... a solution might be to have the query parser map escaped chars to some unused unicode codepoints. Then the WildcardQuery could distinguish escaped chars. I'd guess that other classes, like FuzzyQuery, might have the same problem? The advantage of such a char mapping is that we can keep the String API and don't have to add special APIs to the Query objects for the query parser.

> Escaped wildcard character in wildcard term not handled correctly
> -----------------------------------------------------------------
>
>                 Key: LUCENE-588
>                 URL: https://issues.apache.org/jira/browse/LUCENE-588
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: QueryParser
>    Affects Versions: 2.0.0
>         Environment: Windows XP SP2
>            Reporter: Sunil Kamath
>
> If an escaped wildcard character is specified in a wildcard query, it is
> treated as a wildcard instead of a literal.
> e.g., t\??t is converted by the QueryParser to t??t - the escape character is
> discarded.
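The codepoint-mapping idea can be sketched in plain Java. The specific private-use codepoints and method names below are illustrative assumptions, not anything from an actual patch: the parser would replace an escaped '?' or '*' with a private-use character so WildcardQuery can tell literals from wildcards, then map it back before comparing against index terms.

```java
// Illustrative sketch of mapping escaped wildcard chars to unused (private-use
// area) codepoints, so escape information survives the parser/query boundary.
public class EscapedWildcards {
    // Assumed stand-ins for escaped wildcard chars (U+E000/U+E001 are in the
    // Unicode private-use area, so they never occur in ordinary indexed text).
    static final char ESCAPED_STAR = '\uE000';
    static final char ESCAPED_QMARK = '\uE001';

    // What the query parser would do: keep real wildcards, encode escaped ones.
    static String encode(String raw) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '\\' && i + 1 < raw.length()) {
                char next = raw.charAt(++i);
                if (next == '*') out.append(ESCAPED_STAR);
                else if (next == '?') out.append(ESCAPED_QMARK);
                else out.append(next);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // What WildcardQuery would do just before matching terms as literals.
    static String decode(String encoded) {
        return encoded.replace(ESCAPED_STAR, '*').replace(ESCAPED_QMARK, '?');
    }
}
```

With this, "t\??t" encodes to "t" + ESCAPED_QMARK + "?t", so the second question mark is still recognizably a wildcard while the first is a literal, and the String-based API is unchanged.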
[jira] Issue Comment Edited: (LUCENE-588) Escaped wildcard character in wildcard term not handled correctly
[ https://issues.apache.org/jira/browse/LUCENE-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546478 ]

[EMAIL PROTECTED] edited comment on LUCENE-588 at 11/28/07 3:27 PM:
-------------------------------------------------------------------

The problem is that the WildcardQuery itself doesn't have a concept of escaped characters. The escape characters are removed in QueryParser. This means "t?\?t" will arrive as "t??t" in WildcardQuery and the second question mark is also interpreted as a wildcard.

was (Author: [EMAIL PROTECTED]):
The problem is that the WildcardQuery itself doesn't have a concept of escaped characters. The escape characters are removed in QueryParser. This means "t?\\?t" will arrive as "t??t" in WildcardQuery and the second question mark is also interpreted as a wildcard.
[jira] Commented: (LUCENE-588) Escaped wildcard character in wildcard term not handled correctly
[ https://issues.apache.org/jira/browse/LUCENE-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546479 ]

Daniel Naber commented on LUCENE-588:
-------------------------------------

Also, the original report and my comment look confusing because Jira removes the backslash. Imagine a backslash in front of *one* of the question marks.
[jira] Issue Comment Edited: (LUCENE-588) Escaped wildcard character in wildcard term not handled correctly
[ https://issues.apache.org/jira/browse/LUCENE-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546478 ]

[EMAIL PROTECTED] edited comment on LUCENE-588 at 11/28/07 3:27 PM:
-------------------------------------------------------------------

The problem is that the WildcardQuery itself doesn't have a concept of escaped characters. The escape characters are removed in QueryParser. This means "t?\\?t" will arrive as "t??t" in WildcardQuery and the second question mark is also interpreted as a wildcard.

was (Author: [EMAIL PROTECTED]):
The problem is that the WildcardQuery itself doesn't have a concept of escaped characters. The escape characters are removed in QueryParser. This means "t?\?t" will arrive as "t??t" in WildcardQuery and the second question mark is also interpreted as a wildcard.
[jira] Commented: (LUCENE-588) Escaped wildcard character in wildcard term not handled correctly
[ https://issues.apache.org/jira/browse/LUCENE-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546478 ]

Daniel Naber commented on LUCENE-588:
-------------------------------------

The problem is that the WildcardQuery itself doesn't have a concept of escaped characters. The escape characters are removed in QueryParser. This means "t?\?t" will arrive as "t??t" in WildcardQuery and the second question mark is also interpreted as a wildcard.
First cut at web-based Luke for contrib
Any takers to test this contrib layout before I commit it?

http://www.inperspective.com/lucene/webluke.zip

This is a (17MB) zip file which you can unzip to a new "webluke" directory under your copy of lucene/contrib and then run the usual Lucene Ant build (or at least "ant build-contrib"). You should then find build/contrib/webluke/WebLuke.war plus a Jetty-based server in build/contrib/webluke/dist which can be started using java -jar start.jar. A Luke webapp should then be available on http://localhost:8080/WebLuke

I've tested building here on XP but want to check that the GWT compile task works OK for others, as the Google compiler is packaged in a platform-specific windows jar. I've tested this ant build with the windows jar on a Linux box and all was OK, so I'm guessing the platform-specific bits of the google dev tools are related to the browser choices used in their "hosted" dev mode rather than the Java-to-JavaScript compiler.

I still need to add Apache licenses to all the source and tidy some superfluous files, but otherwise it feels just about ready to contribute.

Cheers
Mark
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546420 ]

Michael Busch commented on LUCENE-584:
--------------------------------------

Yes you're right, I ran the tests w/ code coverage analysis enabled, and the BitSetMatcher is fully covered. Good!

> Decouple Filter from BitSet
> ---------------------------
>
>                 Key: LUCENE-584
>                 URL: https://issues.apache.org/jira/browse/LUCENE-584
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.0.1
>            Reporter: Peter Schäfer
>            Assignee: Michael Busch
>            Priority: Minor
>         Attachments: bench-diff.txt, bench-diff.txt, lucene-584.patch, Matcher-20070905-2default.patch, Matcher-20070905-3core.patch, Matcher-20071122-1ground.patch, Some Matchers.zip
>
> {code}
> package org.apache.lucene.search;
>
> public abstract class Filter implements java.io.Serializable {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
>
> public interface AbstractBitSet {
>   public boolean get(int index);
> }
> {code}
>
> It would be useful if the method =Filter.bits()= returned an abstract
> interface, instead of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's
> privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of
> memory. It would be desirable to have an alternative BitSet implementation
> with smaller memory footprint.
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was
> obviously not designed for that purpose. That's why I propose to use an
> interface instead. The default implementation could still delegate to
> =java.util.BitSet=.
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546393 ]

Paul Elschot commented on LUCENE-584:
-------------------------------------

With the full patch applied, the following test cases use a BitSetMatcher: TestQueryParser, TestComplexExplanations, TestComplexExplanationsOfNonMatches, TestConstantScoreRangeQuery, TestDateFilter, TestFilteredQuery, TestMultiSearcherRanking, TestPrefixFilter, TestRangeFilter, TestRemoteCachingWrapperFilter, TestRemoteSearchable, TestScorerPerf, TestSimpleExplanations, TestSimpleExplanationsOfNonMatches, TestSort. So I don't think it is necessary to provide separate test cases.
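To illustrate the use case in the issue description, here is a minimal sparse implementation of the proposed AbstractBitSet interface. The HashBitSet name and its backing structure are assumptions for illustration, not part of any attached patch:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of a sparse bit set behind the proposed AbstractBitSet interface.
// Memory scales with the number of set bits (visible documents) rather than
// with the total number of documents in the index.
public class SparseBitSet {
    public interface AbstractBitSet {
        boolean get(int index);
    }

    public static class HashBitSet implements AbstractBitSet {
        private final Set<Integer> bits = new HashSet<Integer>();

        public void set(int index) {
            bits.add(index);
        }

        public boolean get(int index) {
            return bits.contains(index);
        }
    }
}
```

A hash-backed set is only one possible choice; sorted int arrays or compressed bitmaps would serve the same interface, which is exactly the point of decoupling Filter from java.util.BitSet.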
Re: Potential bug in StandardTokenizerImpl
: Thanks to Shai Erera for translating the discussion into the developers'
: list. I am surprised about Chris Hostetter's response, as this issue was

to clarify: i'm not saying that the current behavior is ideal, or even correct -- i'm saying the current behavior is the current behavior, and changing it could easily break existing indexes -- something that the Lucene upgrade contract does not allow...

http://wiki.apache.org/lucene-java/BackwardsCompatibility

specifically: if someone built an index with 2.2, that index needs to work when queried by an app running 2.3 .. if we change the StandardTokenizer to treat this differently, that won't work.

In some cases, being backwards compatible is more important than being "correct" ... i'm not 100% certain that this is one of those cases, i'm just pointing out that there is more to this issue than just a one-line patch to some code.

-Hoss
[jira] Commented: (LUCENE-1061) Adding a factory to QueryParser to instantiate query instances
[ https://issues.apache.org/jira/browse/LUCENE-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546353 ]

Michael Busch commented on LUCENE-1061:
---------------------------------------

Yonik, I remember that we talked briefly about a QueryFactory in Atlanta and you had some cool ideas. Maybe you could mention them here?

> Adding a factory to QueryParser to instantiate query instances
> --------------------------------------------------------------
>
>                 Key: LUCENE-1061
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1061
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: QueryParser
>    Affects Versions: 2.3
>            Reporter: John Wang
>             Fix For: 2.3
>         Attachments: lucene_patch.txt
>
> With the new efforts with Payload and scoring functions, it would be nice to
> plug in custom query implementations while using the same QueryParser.
> Included is a patch with some refactoring of the QueryParser to take a
> factory that produces query instances.
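The factory idea can be sketched with stand-in types. All names here are illustrative, not the API from lucene_patch.txt: the parser delegates query construction to a factory, so an application can substitute payload-aware or otherwise custom Query subclasses without forking the parser.

```java
// Hedged sketch of a QueryFactory plugged into a parser. Query and TermQuery
// are local stand-ins, not Lucene's classes.
public class QueryFactoryExample {
    interface Query { }

    static class TermQuery implements Query {
        final String field, text;
        TermQuery(String field, String text) {
            this.field = field;
            this.text = text;
        }
    }

    interface QueryFactory {
        Query newTermQuery(String field, String text);
    }

    // Default behavior: plain TermQuery, as a stock parser would build.
    static final QueryFactory DEFAULT = new QueryFactory() {
        public Query newTermQuery(String field, String text) {
            return new TermQuery(field, text);
        }
    };

    // The parser calls the factory instead of instantiating queries directly,
    // so swapping the factory swaps the concrete Query types produced.
    static Query parseTerm(QueryFactory factory, String field, String text) {
        return factory.newTermQuery(field, text);
    }
}
```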
[jira] Commented: (LUCENE-1044) Behavior on hard power shutdown
[ https://issues.apache.org/jira/browse/LUCENE-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546352 ]

Doug Cutting commented on LUCENE-1044:
--------------------------------------

> I think deprecating flush(), renaming it to commit()

+1

That's clearer, since flushes are internal optimizations, while commits are important events to clients.

> Behavior on hard power shutdown
> -------------------------------
>
>                 Key: LUCENE-1044
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1044
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>         Environment: Windows Server 2003, Standard Edition, Sun Hotspot Java 1.5
>            Reporter: venkat rangan
>            Assignee: Michael McCandless
>             Fix For: 2.3
>         Attachments: FSyncPerfTest.java, LUCENE-1044.patch, LUCENE-1044.take2.patch, LUCENE-1044.take3.patch, LUCENE-1044.take4.patch
>
> When indexing a large number of documents, upon a hard power failure (e.g.
> pull the power cord), the index seems to get corrupted. We start a Java
> application as a Windows Service, and feed it documents. In some cases
> (after an index size of 1.7GB, with 30-40 index segment .cfs files), the
> following is observed.
> The 'segments' file contains only zeros. Its size is 265 bytes - all bytes are zeros.
> The 'deleted' file also contains only zeros. Its size is 85 bytes - all bytes are zeros.
> Before corruption, the segments file and deleted file appear to be correct.
> After this corruption, the index is corrupted and lost.
> This is a problem observed in Lucene 1.4.3. We are not able to upgrade our
> customer deployments to 1.9 or later version, but would be happy to back-port
> a patch, if the patch is small enough and if this problem is already solved.
[jira] Commented: (LUCENE-1044) Behavior on hard power shutdown
[ https://issues.apache.org/jira/browse/LUCENE-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546312 ] Michael McCandless commented on LUCENE-1044: {quote} When autoCommit is true, then we should periodically commit automatically. When autoCommit is false, then nothing should be committed until the IndexWriter is closed. The ambiguous case is flush(). I think the reason for exposing flush() was to permit folks to commit without closing, so I think flush() should commit too, but we could add a separate commit() method that flushes and commits. {quote} I think deprecating flush(), renaming it to commit(), and clarifying the semantics to mean that commit() flushes pending docs/deletes, commits a new segments_N, syncs all files referenced by this commit, and blocks until the sync is complete, would make sense? And, commit() would in fact commit even when autoCommit is false (flush() doesn't commit now when autoCommit=false, which is indeed confusing). {quote} Perhaps the semantics of autoCommit=true should be altered so that it commits less than every flush. Is that what you were proposing? If so, then I think it's a good solution. Prior to 2.2 the commit semantics were poorly defined. Folks were encouraged to close() their IndexWriter to persist changes, and that's about all we said. 2.2's docs say that things are committed at every flush, but there was no sync, so I don't think changing this could break any applications. So I'm +1 for changing autoCommit=true to sync less than every flush, e.g., only after merges. I'd also argue that we should be vague in the documentation about precisely when autoCommit=true commits. If someone needs to know exactly when things are committed then they should be encouraged to explicitly flush(), not to rely on autoCommit. {quote} OK, I will test the "sync only when committing a merge" approach for performance. 
Hopefully a foreground sync() is fine given that with ConcurrentMergePolicy that's already in a background thread. This would be a nice simplification. And I agree we should be vague about, and users should never rely on, precisely when Lucene has really committed (sync'd) the changes to disk. I'll fix the javadocs. > Behavior on hard power shutdown > --- > > Key: LUCENE-1044 > URL: https://issues.apache.org/jira/browse/LUCENE-1044 > Project: Lucene - Java > Issue Type: Bug > Components: Index > Environment: Windows Server 2003, Standard Edition, Sun Hotspot Java > 1.5 >Reporter: venkat rangan >Assignee: Michael McCandless > Fix For: 2.3 > > Attachments: FSyncPerfTest.java, LUCENE-1044.patch, > LUCENE-1044.take2.patch, LUCENE-1044.take3.patch, LUCENE-1044.take4.patch > > > When indexing a large number of documents, upon a hard power failure (e.g. > pull the power cord), the index seems to get corrupted. We start a Java > application as an Windows Service, and feed it documents. In some cases > (after an index size of 1.7GB, with 30-40 index segment .cfs files) , the > following is observed. > The 'segments' file contains only zeros. Its size is 265 bytes - all bytes > are zeros. > The 'deleted' file also contains only zeros. Its size is 85 bytes - all bytes > are zeros. > Before corruption, the segments file and deleted file appear to be correct. > After this corruption, the index is corrupted and lost. > This is a problem observed in Lucene 1.4.3. We are not able to upgrade our > customer deployments to 1.9 or later version, but would be happy to back-port > a patch, if the patch is small enough and if this problem is already solved. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
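The durability question being debated in this thread comes down to one system call: write() and close() only hand bytes to the OS cache, and until they are fsync'd a hard power cut can leave the zero-filled files the reporter saw. A minimal Java sketch of that distinction (illustration only, not Lucene's actual commit code; class and method names are mine):

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

// Illustration only: the sync step, not the write, is what makes a
// commit "permanent". Lucene's real commit logic is more involved.
public class SyncWrite {
    public static void writeDurably(File f, byte[] data) throws IOException {
        FileOutputStream out = new FileOutputStream(f);
        try {
            out.write(data);       // bytes now in the OS cache, not yet safe
            out.getFD().sync();    // block until the device actually has them
        } finally {
            out.close();
        }
    }
}
```

The benchmark numbers later in this thread measure exactly the cost of that extra sync() call on different filesystems.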
[jira] Commented: (LUCENE-1044) Behavior on hard power shutdown
[ https://issues.apache.org/jira/browse/LUCENE-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546309 ] Michael McCandless commented on LUCENE-1044: I modified the CFS sync case to NOT bother syncing the files that go into the CFS. I also turned off syncing of segments.gen. I also tested on a Windows Server 2003 box. New patch attached (still a hack, just to test performance!) and new results. All tests are with the "sync every commit" policy: ||IO System||CFS sync||CFS nosync||CFS % slower||non-CFS sync||non-CFS nosync||non-CFS % slower|| |2-drive RAID0 Windows 2003 Server R2 Enterprise x64|250|244|2.6%|241|241|0.1%| |ReiserFS 6-drive RAID5 array Linux (2.6.22.1)|186|166|11.9%|145|142|2.0%| |EXT3 single internal drive Linux (2.6.22.1)|160|158|0.9%|142|135|4.8%| |4-drive RAID0 array Mac Pro (10.4 Tiger)|152|155|-2.4%|149|147|1.3%| |Win XP Pro laptop, single drive|408|398|2.6%|343|346|-1.1%| |Mac Pro single external drive|211|209|1.0%|167|149|12.4%| > Behavior on hard power shutdown > --- > > Key: LUCENE-1044 > URL: https://issues.apache.org/jira/browse/LUCENE-1044 > Project: Lucene - Java > Issue Type: Bug > Components: Index > Environment: Windows Server 2003, Standard Edition, Sun Hotspot Java > 1.5 >Reporter: venkat rangan >Assignee: Michael McCandless > Fix For: 2.3 > > Attachments: FSyncPerfTest.java, LUCENE-1044.patch, > LUCENE-1044.take2.patch, LUCENE-1044.take3.patch, LUCENE-1044.take4.patch > > > When indexing a large number of documents, upon a hard power failure (e.g. > pulling the power cord), the index seems to get corrupted. We start a Java > application as a Windows Service and feed it documents. In some cases > (after an index size of 1.7GB, with 30-40 index segment .cfs files), the > following is observed. > The 'segments' file contains only zeros. Its size is 265 bytes - all bytes > are zeros. > The 'deleted' file also contains only zeros. Its size is 85 bytes - all bytes > are zeros. 
> Before corruption, the segments file and deleted file appear to be correct. > After this corruption, the index is corrupted and lost. > This is a problem observed in Lucene 1.4.3. We are not able to upgrade our > customer deployments to 1.9 or later version, but would be happy to back-port > a patch, if the patch is small enough and if this problem is already solved. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1044) Behavior on hard power shutdown
[ https://issues.apache.org/jira/browse/LUCENE-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546306 ] Doug Cutting commented on LUCENE-1044: -- > But must every "automatic buffer flush" by IndexWriter really be a "permanent commit"? When autoCommit is true, then we should periodically commit automatically. When autoCommit is false, then nothing should be committed until the IndexWriter is closed. The ambiguous case is flush(). I think the reason for exposing flush() was to permit folks to commit without closing, so I think flush() should commit too, but we could add a separate commit() method that flushes and commits. > People who upgrade will suddenly get much worse performance. Yes, that would be bad. Perhaps the semantics of autoCommit=true should be altered so that it commits less than every flush. Is that what you were proposing? If so, then I think it's a good solution. Prior to 2.2 the commit semantics were poorly defined. Folks were encouraged to close() their IndexWriter to persist changes, and that's about all we said. 2.2's docs say that things are committed at every flush, but there was no sync, so I don't think changing this could break any applications. So I'm +1 for changing autoCommit=true to sync less than every flush, e.g., only after merges. I'd also argue that we should be vague in the documentation about precisely when autoCommit=true commits. If someone needs to know exactly when things are committed then they should be encouraged to explicitly flush(), not to rely on autoCommit. 
> Behavior on hard power shutdown > --- > > Key: LUCENE-1044 > URL: https://issues.apache.org/jira/browse/LUCENE-1044 > Project: Lucene - Java > Issue Type: Bug > Components: Index > Environment: Windows Server 2003, Standard Edition, Sun Hotspot Java > 1.5 >Reporter: venkat rangan >Assignee: Michael McCandless > Fix For: 2.3 > > Attachments: FSyncPerfTest.java, LUCENE-1044.patch, > LUCENE-1044.take2.patch, LUCENE-1044.take3.patch, LUCENE-1044.take4.patch > > > When indexing a large number of documents, upon a hard power failure (e.g. > pull the power cord), the index seems to get corrupted. We start a Java > application as an Windows Service, and feed it documents. In some cases > (after an index size of 1.7GB, with 30-40 index segment .cfs files) , the > following is observed. > The 'segments' file contains only zeros. Its size is 265 bytes - all bytes > are zeros. > The 'deleted' file also contains only zeros. Its size is 85 bytes - all bytes > are zeros. > Before corruption, the segments file and deleted file appear to be correct. > After this corruption, the index is corrupted and lost. > This is a problem observed in Lucene 1.4.3. We are not able to upgrade our > customer deployments to 1.9 or later version, but would be happy to back-port > a patch, if the patch is small enough and if this problem is already solved. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1026) Provide a simple way to concurrently access a Lucene index from multiple threads
[ https://issues.apache.org/jira/browse/LUCENE-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546304 ] Mark Miller commented on LUCENE-1026: - Hey Shai, These fixes are great and I will incorporate them all. I worked this up very quickly based on other, less general code I am using. While I have not yet used this code for a project, it will be the framework that I migrate to for future projects. It should see much more development then. I am very eager to add some Searcher warming, for one. Also, the tests were whipped together quite quickly. I appreciate your efforts at cleaning them up. Buffing up the SearchServer code to production level will also be on my list. Thanks for your improvements -- if you do any more work, keep me posted. - Mark > Provide a simple way to concurrently access a Lucene index from multiple > threads > > > Key: LUCENE-1026 > URL: https://issues.apache.org/jira/browse/LUCENE-1026 > Project: Lucene - Java > Issue Type: New Feature > Components: Index, Search >Reporter: Mark Miller >Priority: Minor > Attachments: DefaultIndexAccessor.java, > DefaultMultiIndexAccessor.java, IndexAccessor.java, > IndexAccessorFactory.java, MultiIndexAccessor.java, shai-IndexAccessor.zip, > SimpleSearchServer.java, StopWatch.java, TestIndexAccessor.java > > > For building interactive indexes accessed through a network/internet > (multiple threads). > This builds upon the LuceneIndexAccessor patch. That patch was not very > newbie-friendly and did not properly handle MultiSearchers (or at the least > made it easy to get into trouble). > This patch simplifies things and provides out-of-the-box support for sharing > the IndexAccessors across threads. There is also a simple test class and an > example SearchServer to get you started. > Future revisions will be zipped. > Works pretty solidly as is, but could use the ability to warm new Searchers. -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1070) DateTools with DAY resolution doesn't work depending on your timezone
[ https://issues.apache.org/jira/browse/LUCENE-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546301 ] Mike Baroukh commented on LUCENE-1070: -- I agree that nobody is forced to use DateTools. I used my own version, of course. But the report is not for *me*; it's just because I thought it was a bug. I also know that for two identical dates, DateTools will return the same string. My case is this: I have Dates to index. When indexing, my Date objects contain hours and minutes. When searching, dates are typed by users without a time; they are parsed with a dd/MM/yyyy pattern. Because of the round() documentation, I thought there would be no problem since I use DAY resolution. RTFM is not always a good option. Finally, maybe it's not a bug but an architectural issue: when a long is used for the date, the timezone is lost. I continue to think that dateToString() must take a Date as its parameter; this way, there would be no more ambiguity. > DateTools with DAY resolution doesn't work depending on your timezone > --- > > Key: LUCENE-1070 > URL: https://issues.apache.org/jira/browse/LUCENE-1070 > Project: Lucene - Java > Issue Type: Bug >Affects Versions: 2.2 >Reporter: Mike Baroukh > > Hi. > There is another issue, closed, that introduced a bug: > https://issues.apache.org/jira/browse/LUCENE-491 > Here is a simple TestCase: > DateFormat df = new SimpleDateFormat("dd/MM/yyyy HH:mm"); > Date d1 = df.parse("10/10/2008 10:00"); > System.err.println(DateTools.dateToString(d1, Resolution.DAY)); > Date d2 = df.parse("10/10/2008 00:00"); > System.err.println(DateTools.dateToString(d2, Resolution.DAY)); > this outputs: > 20081010 > 20081009 > So the days are the same, but with DAY resolution the indexed values don't > refer to the same day. > This is because of DateTools.round(): using a Calendar initialised to GMT > can make the given Date fall on the previous day, depending on my timezone. 
> The part I don't understand is why we take a Date as input, then convert > it to a Calendar, then convert it again before printing. > This operation is supposed to "round" the date, but simply using a DateFormat to > format the date and print only the wanted fields does the same work, doesn't it? > The problem is: I see absolutely no solution at the moment. We could have a > workaround if dateToString() took a Date as its input, but with a long, the > timezone is lost. > I also suppose that the correction made on the other issue > (https://issues.apache.org/jira/browse/LUCENE-491) is worse than the bug, > because it corrects things only for those who use dates with a timezone different from > the local timezone of the JVM. > So, my solution: add a DateTools.dateToString() that takes a Date as a > parameter and deprecate the version that uses a long. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
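The behaviour Mike describes can be reproduced without DateTools at all: truncating an instant to a day using a GMT calendar can land local midnight on the previous GMT day. A self-contained sketch (helper names are mine, and a fixed GMT+2 offset stands in for a local timezone east of UTC):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class DayRounding {
    // Format the instant's day in GMT, mimicking a round(date, DAY)
    // implemented with a GMT-initialised Calendar.
    static String dayInGmt(Date d) {
        SimpleDateFormat f = new SimpleDateFormat("yyyyMMdd");
        f.setTimeZone(TimeZone.getTimeZone("GMT"));
        return f.format(d);
    }

    // Format the same instant's day in the user's own timezone.
    static String dayInZone(Date d, TimeZone tz) {
        SimpleDateFormat f = new SimpleDateFormat("yyyyMMdd");
        f.setTimeZone(tz);
        return f.format(d);
    }

    public static void main(String[] args) throws ParseException {
        TimeZone local = TimeZone.getTimeZone("GMT+2"); // east of UTC
        SimpleDateFormat parser = new SimpleDateFormat("dd/MM/yyyy HH:mm");
        parser.setTimeZone(local);
        Date midnight = parser.parse("10/10/2008 00:00"); // local midnight
        // 10/10 00:00 at GMT+2 is 09/10 22:00 in GMT, so the GMT day shifts:
        System.out.println(dayInGmt(midnight));         // 20081009
        System.out.println(dayInZone(midnight, local)); // 20081010
    }
}
```

This is the core of the disagreement in the thread: the indexed day string is computed in GMT, while the user parsed the date in a local zone.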
[jira] Updated: (LUCENE-1026) Provide a simple way to concurrently access a Lucene index from multiple threads
[ https://issues.apache.org/jira/browse/LUCENE-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1026: --- Attachment: IndexAccessorFactory.java > Provide a simple way to concurrently access a Lucene index from multiple > threads > > > Key: LUCENE-1026 > URL: https://issues.apache.org/jira/browse/LUCENE-1026 > Project: Lucene - Java > Issue Type: New Feature > Components: Index, Search >Reporter: Mark Miller >Priority: Minor > Attachments: DefaultIndexAccessor.java, > DefaultMultiIndexAccessor.java, IndexAccessor.java, > IndexAccessorFactory.java, MultiIndexAccessor.java, shai-IndexAccessor.zip, > SimpleSearchServer.java, StopWatch.java, TestIndexAccessor.java > > > For building interactive indexes accessed through a network/internet > (multiple threads). > This builds upon the LuceneIndexAccessor patch. That patch was not very > newbie friendly and did not properly handle MultiSearchers (or at the least > made it easy to get into trouble). > This patch simplifies things and provides out of the box support for sharing > the IndexAccessors across threads. There is also a simple test class and > example SearchServer to get you started. > Future revisions will be zipped. > Works pretty solid as is, but could use the ability to warm new Searchers. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1026) Provide a simple way to concurrently access a Lucene index from multiple threads
[ https://issues.apache.org/jira/browse/LUCENE-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1026: --- Attachment: shai-IndexAccessor.zip > Provide a simple way to concurrently access a Lucene index from multiple > threads > > > Key: LUCENE-1026 > URL: https://issues.apache.org/jira/browse/LUCENE-1026 > Project: Lucene - Java > Issue Type: New Feature > Components: Index, Search >Reporter: Mark Miller >Priority: Minor > Attachments: DefaultIndexAccessor.java, > DefaultMultiIndexAccessor.java, IndexAccessor.java, > IndexAccessorFactory.java, MultiIndexAccessor.java, shai-IndexAccessor.zip, > SimpleSearchServer.java, StopWatch.java, TestIndexAccessor.java > > > For building interactive indexes accessed through a network/internet > (multiple threads). > This builds upon the LuceneIndexAccessor patch. That patch was not very > newbie friendly and did not properly handle MultiSearchers (or at the least > made it easy to get into trouble). > This patch simplifies things and provides out of the box support for sharing > the IndexAccessors across threads. There is also a simple test class and > example SearchServer to get you started. > Future revisions will be zipped. > Works pretty solid as is, but could use the ability to warm new Searchers. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1026) Provide a simple way to concurrently access a Lucene index from multiple threads
[ https://issues.apache.org/jira/browse/LUCENE-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1026: --- Attachment: (was: IndexAccessorFactory.java) > Provide a simple way to concurrently access a Lucene index from multiple > threads > > > Key: LUCENE-1026 > URL: https://issues.apache.org/jira/browse/LUCENE-1026 > Project: Lucene - Java > Issue Type: New Feature > Components: Index, Search >Reporter: Mark Miller >Priority: Minor > Attachments: DefaultIndexAccessor.java, > DefaultMultiIndexAccessor.java, IndexAccessor.java, > IndexAccessorFactory.java, MultiIndexAccessor.java, shai-IndexAccessor.zip, > SimpleSearchServer.java, StopWatch.java, TestIndexAccessor.java > > > For building interactive indexes accessed through a network/internet > (multiple threads). > This builds upon the LuceneIndexAccessor patch. That patch was not very > newbie friendly and did not properly handle MultiSearchers (or at the least > made it easy to get into trouble). > This patch simplifies things and provides out of the box support for sharing > the IndexAccessors across threads. There is also a simple test class and > example SearchServer to get you started. > Future revisions will be zipped. > Works pretty solid as is, but could use the ability to warm new Searchers. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1026) Provide a simple way to concurrently access a Lucene index from multiple threads
[ https://issues.apache.org/jira/browse/LUCENE-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546288 ] Shai Erera commented on LUCENE-1026: Hi, I've downloaded the code and tried to run the tests, but I think there are some problems: 1. The delete() method in the test attempts to delete the directory, and not the underlying files, so in effect it does not do anything. 2. Some of the tests that start new threads don't wait for them (by calling join()). That of course causes some Accessors to be removed (after you call closeAllAccessors()) while those threads are still running. I've fixed those issues in the test; I'd appreciate it if you could take a look. Also, in IndexAccessorFactory I've found some issues: 1. I guess you wanted to have it as a Singleton - so I defined a private default constructor to prevent applications from instantiating it. 2. I modified the code of createAccessor to first look up whether an accessor for that directory already exists; this should save the allocation of a DefaultIndexAccessor. 3. I modified the implementation of the other methods to access the HashMap of accessors more efficiently. I'd appreciate it if you could review my fixes. I'll attach them separately. > Provide a simple way to concurrently access a Lucene index from multiple > threads > > > Key: LUCENE-1026 > URL: https://issues.apache.org/jira/browse/LUCENE-1026 > Project: Lucene - Java > Issue Type: New Feature > Components: Index, Search >Reporter: Mark Miller >Priority: Minor > Attachments: DefaultIndexAccessor.java, > DefaultMultiIndexAccessor.java, IndexAccessor.java, > IndexAccessorFactory.java, MultiIndexAccessor.java, SimpleSearchServer.java, > StopWatch.java, TestIndexAccessor.java > > > For building interactive indexes accessed through a network/internet > (multiple threads). > This builds upon the LuceneIndexAccessor patch. 
That patch was not very > newbie friendly and did not properly handle MultiSearchers (or at the least > made it easy to get into trouble). > This patch simplifies things and provides out of the box support for sharing > the IndexAccessors across threads. There is also a simple test class and > example SearchServer to get you started. > Future revisions will be zipped. > Works pretty solid as is, but could use the ability to warm new Searchers. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
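Shai's two factory fixes (a private constructor to enforce the singleton, and a lookup before allocating a new accessor) have a standard shape; the sketch below is a placeholder illustration with invented names, not the IndexAccessorFactory attached to the issue:

```java
import java.util.HashMap;
import java.util.Map;

// Placeholder sketch of the singleton-factory pattern discussed above.
public class AccessorFactory {
    private static final AccessorFactory INSTANCE = new AccessorFactory();
    private final Map<String, Object> accessors = new HashMap<String, Object>();

    private AccessorFactory() {} // private: callers must use getInstance()

    public static AccessorFactory getInstance() {
        return INSTANCE;
    }

    // Check for an existing accessor first, so construction cost is paid
    // only the first time a directory is seen; synchronized so concurrent
    // callers for the same directory share one instance.
    public synchronized Object getAccessor(String dirPath) {
        Object a = accessors.get(dirPath);
        if (a == null) {
            a = new Object(); // stand-in for new DefaultIndexAccessor(dirPath)
            accessors.put(dirPath, a);
        }
        return a;
    }
}
```

The same lookup-before-allocate idea is what Shai's createAccessor change applies to the real factory.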
[jira] Commented: (LUCENE-1070) DateTools with DAY resolution doesn't work depending on your timezone
[ https://issues.apache.org/jira/browse/LUCENE-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546284 ] Alexei Dets commented on LUCENE-1070: - I'm not a Lucene developer but just wanted to comment from a user perspective - I find the current Lucene behavior 100% correct and this bug report mistaken. First of all, AFAIK this doesn't have anything to do with DAY precision - with any higher precision one can also get another day (and another hour); this is just how timezones work: conversion from one timezone to another changes the time. But then during the search one should also use DateTools.dateToString, and the search will work correctly. And after applying DateTools.stringToDate on the search results you'll get the correct dates. Search with DAY precision searches for the given day in the UTC timezone, not in a local one; if that is not sufficient for your purposes then you should use HOUR precision during indexing and search - DAY is simply not precise enough for your purposes. Another alternative that should probably work (I never tried it) is to create the Date (that you pass to DateTools.dateToString) in the UTC timezone; in this case no timezone conversion should be applied, unless there is some bug in DateTools, and you'll get exactly the same day indexed (but then, on retrieving results, DateTools.stringToDate will change your day because it will apply the local timezone). And after all, nobody is forced to use DateTools; anyone can implement their own way to store dates if no timezone conversions are required - it is probably the best way for this _specific_ case. > DateTools with DAY resolution doesn't work depending on your timezone > --- > > Key: LUCENE-1070 > URL: https://issues.apache.org/jira/browse/LUCENE-1070 > Project: Lucene - Java > Issue Type: Bug >Affects Versions: 2.2 >Reporter: Mike Baroukh > > Hi. 
> There is another issue, closed, that introduced a bug: > https://issues.apache.org/jira/browse/LUCENE-491 > Here is a simple TestCase: > DateFormat df = new SimpleDateFormat("dd/MM/yyyy HH:mm"); > Date d1 = df.parse("10/10/2008 10:00"); > System.err.println(DateTools.dateToString(d1, Resolution.DAY)); > Date d2 = df.parse("10/10/2008 00:00"); > System.err.println(DateTools.dateToString(d2, Resolution.DAY)); > this outputs: > 20081010 > 20081009 > So the days are the same, but with DAY resolution the indexed values don't > refer to the same day. > This is because of DateTools.round(): using a Calendar initialised to GMT > can make the given Date fall on the previous day, depending on my timezone. > The part I don't understand is why we take a Date as input, then convert > it to a Calendar, then convert it again before printing. > This operation is supposed to "round" the date, but simply using a DateFormat to > format the date and print only the wanted fields does the same work, doesn't it? > The problem is: I see absolutely no solution at the moment. We could have a > workaround if dateToString() took a Date as its input, but with a long, the > timezone is lost. > I also suppose that the correction made on the other issue > (https://issues.apache.org/jira/browse/LUCENE-491) is worse than the bug, > because it corrects things only for those who use dates with a timezone different from > the local timezone of the JVM. > So, my solution: add a DateTools.dateToString() that takes a Date as a > parameter and deprecate the version that uses a long. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (LUCENE-1058) New Analyzer for buffering tokens
[ https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated LUCENE-1058: Attachment: LUCENE-1058.patch Tee it is. And here I just thought you liked golf! I guess I have never used the tee command in UNIX. > New Analyzer for buffering tokens > - > > Key: LUCENE-1058 > URL: https://issues.apache.org/jira/browse/LUCENE-1058 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > Fix For: 2.3 > > Attachments: LUCENE-1058.patch, LUCENE-1058.patch, LUCENE-1058.patch, > LUCENE-1058.patch, LUCENE-1058.patch, LUCENE-1058.patch, LUCENE-1058.patch > > > In some cases, it would be handy to have Analyzer/Tokenizer/TokenFilters that > could siphon off certain tokens and store them in a buffer to be used later > in the processing pipeline. > For example, if you want to have two fields, one lowercased and one not, but > all the other analysis is the same, then you could save off the tokens to be > output for a different field. > Patch to follow, but I am still not sure about a couple of things, mostly how > it plays with the new reuse API. > See > http://www.gossamer-threads.com/lists/lucene/java-dev/54397?search_string=BufferingAnalyzer;#54397 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1044) Behavior on hard power shutdown
[ https://issues.apache.org/jira/browse/LUCENE-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546277 ] Michael McCandless commented on LUCENE-1044: {quote} I'm confused. The semantics of commit should be that all changes prior are made permanent, and no subsequent changes are permanent until the next commit. So syncs, if any, should map 1:1 to commits, no? Folks can make indexing faster by committing/syncing less often. {quote} But must every "automatic buffer flush" by IndexWriter really be a "permanent commit"? I do agree that when you close an IndexWriter, we should do a "permanent commit" (and block until it's done). Even if we use that policy, the BG sync thread can still fall behind such that the last few/many flushes are still in the process of being made permanent (e.g. I see this happening while a merge is running). In fact, I'll have to block further flushes if syncing falls "too far" behind, by some metric. So we already won't have any "guarantee" on when a given flush actually becomes permanent, even if we adopt this policy. I think "merge finished" should be made a "permanent commit" because otherwise we are temporarily tying up potentially a lot of disk space. But for a flush there's only a tiny amount of space (the old segments_N files) being tied up. Maybe we could make some flushes permanent but not all, depending on how far behind the sync thread is. E.g. if you do a flush, but the sync thread is still trying to make the last flush permanent, don't force the new flush to be permanent? In general, the longer we can wait after flushing before forcing the OS to make those writes "permanent", the better the chance that the OS has in fact already sync'd those files anyway, and so the sync cost should be lower. So maybe we could make every flush permanent, but wait a little while before doing so? 
Regardless of what policy we choose here (which commits must be made "permanent", and, when) I think the approach requires that IndexFileDeleter query the Directory so that it's only allowed to delete older commit points once a newer commit point has successfully become permanent. I also worry about those applications that are accidentally flushing too often now. Say your app now sets maxBufferedDocs=100. Right now, that gives you poor performance but not disastrous, but I fear if we do the "every commit is permanent" policy then performance could easily become disastrous. People who upgrade will suddenly get much worse performance. > Behavior on hard power shutdown > --- > > Key: LUCENE-1044 > URL: https://issues.apache.org/jira/browse/LUCENE-1044 > Project: Lucene - Java > Issue Type: Bug > Components: Index > Environment: Windows Server 2003, Standard Edition, Sun Hotspot Java > 1.5 >Reporter: venkat rangan >Assignee: Michael McCandless > Fix For: 2.3 > > Attachments: FSyncPerfTest.java, LUCENE-1044.patch, > LUCENE-1044.take2.patch, LUCENE-1044.take3.patch, LUCENE-1044.take4.patch > > > When indexing a large number of documents, upon a hard power failure (e.g. > pull the power cord), the index seems to get corrupted. We start a Java > application as an Windows Service, and feed it documents. In some cases > (after an index size of 1.7GB, with 30-40 index segment .cfs files) , the > following is observed. > The 'segments' file contains only zeros. Its size is 265 bytes - all bytes > are zeros. > The 'deleted' file also contains only zeros. Its size is 85 bytes - all bytes > are zeros. > Before corruption, the segments file and deleted file appear to be correct. > After this corruption, the index is corrupted and lost. > This is a problem observed in Lucene 1.4.3. We are not able to upgrade our > customer deployments to 1.9 or later version, but would be happy to back-port > a patch, if the patch is small enough and if this problem is already solved. 
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
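Michael's idea of blocking further flushes once the background sync falls "too far" behind maps naturally onto a bounded queue. This sketch is hypothetical (class name and capacity are my invention, not Lucene code), but it shows the throttling shape:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: pending commit points queue up for a background
// sync thread; once the backlog reaches CAPACITY, enqueueCommit() blocks,
// throttling flushes until the syncer catches up.
public class SyncBacklog {
    static final int CAPACITY = 4;
    private final BlockingQueue<String> pending =
            new ArrayBlockingQueue<String>(CAPACITY);

    public void enqueueCommit(String segmentsFile) throws InterruptedException {
        pending.put(segmentsFile); // blocks when the syncer is too far behind
    }

    public String takeForSync() throws InterruptedException {
        return pending.take(); // the background thread drains and fsyncs each commit
    }

    public int backlog() {
        return pending.size();
    }
}
```

The bounded capacity is the "some metric" mentioned above: producers (flushes) simply cannot outrun the consumer (the sync thread) by more than a fixed number of commit points.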
[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens
[ https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546274 ] Yonik Seeley commented on LUCENE-1058: -- The SinkTokenizer name could make sense, but I think TeeTokenFilter makes more sense than SourceTokenFilter (it is a tee, it splits a single token stream into two, just like the UNIX tee command). > New Analyzer for buffering tokens > - > > Key: LUCENE-1058 > URL: https://issues.apache.org/jira/browse/LUCENE-1058 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > Fix For: 2.3 > > Attachments: LUCENE-1058.patch, LUCENE-1058.patch, LUCENE-1058.patch, > LUCENE-1058.patch, LUCENE-1058.patch, LUCENE-1058.patch > > > In some cases, it would be handy to have Analyzer/Tokenizer/TokenFilters that > could siphon off certain tokens and store them in a buffer to be used later > in the processing pipeline. > For example, if you want to have two fields, one lowercased and one not, but > all the other analysis is the same, then you could save off the tokens to be > output for a different field. > Patch to follow, but I am still not sure about a couple of things, mostly how > it plays with the new reuse API. > See > http://www.gossamer-threads.com/lists/lucene/java-dev/54397?search_string=BufferingAnalyzer;#54397 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
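The tee analogy works exactly like the UNIX command: one incoming token stream, two consumers. A Lucene-free sketch of the idea (class names and the uppercase-token condition are illustrative only, not the patch's API):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Illustrative tee: every token flows downstream unchanged, and tokens
// matching a condition are additionally copied into a "sink" buffer for
// later use (e.g. feeding a second field).
public class TeeDemo {
    interface TokenFilter { boolean accept(String token); }

    static List<String> tee(Iterator<String> tokens, TokenFilter siphon,
                            List<String> sink) {
        List<String> downstream = new ArrayList<String>();
        while (tokens.hasNext()) {
            String t = tokens.next();
            if (siphon.accept(t)) {
                sink.add(t);     // copy into the sink buffer...
            }
            downstream.add(t);   // ...but still pass the token along
        }
        return downstream;
    }
}
```

This is the source/sink split the patch names SinkTokenizer and TeeTokenFilter: the tee does the copying, the sink buffers the siphoned tokens for a later pipeline stage.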
[jira] Updated: (LUCENE-1058) New Analyzer for buffering tokens
[ https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated LUCENE-1058: Attachment: LUCENE-1058.patch Whew. I think we are there and I like it! I renamed Yonik's suggestions to be SinkTokenizer and SourceTokenFilter to model the whole source/sink notion. Hopefully people won't think the SourceTokenFilter is for processing code. :-) I will commit tomorrow if there are no objections. > New Analyzer for buffering tokens > - > > Key: LUCENE-1058 > URL: https://issues.apache.org/jira/browse/LUCENE-1058 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > Fix For: 2.3 > > Attachments: LUCENE-1058.patch, LUCENE-1058.patch, LUCENE-1058.patch, > LUCENE-1058.patch, LUCENE-1058.patch, LUCENE-1058.patch > > > In some cases, it would be handy to have Analyzer/Tokenizer/TokenFilters that > could siphon off certain tokens and store them in a buffer to be used later > in the processing pipeline. > For example, if you want to have two fields, one lowercased and one not, but > all the other analysis is the same, then you could save off the tokens to be > output for a different field. > Patch to follow, but I am still not sure about a couple of things, mostly how > it plays with the new reuse API. > See > http://www.gossamer-threads.com/lists/lucene/java-dev/54397?search_string=BufferingAnalyzer;#54397 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens
[ https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546255 ]

Grant Ingersoll commented on LUCENE-1058:
-----------------------------------------

Will do. Patch to follow shortly.
[jira] Commented: (LUCENE-794) Extend contrib Highlighter to properly support phrase queries and span queries
[ https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546236 ]

Mark Miller commented on LUCENE-794:
------------------------------------

Michael: I would love to take a look. I've got the code you sent me and I will
go through it soon.

Mark: That is an issue that should probably be cleaned up. A lot of tests are
shared; the new SpanScorer just requires some different, odd setup that made it
easier to copy and change the test file. I will spend some time trying to
combine them into one test file to avoid the overlap.

> Extend contrib Highlighter to properly support phrase queries and span queries
> ------------------------------------------------------------------------------
>
> Key: LUCENE-794
> URL: https://issues.apache.org/jira/browse/LUCENE-794
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Other
> Reporter: Mark Miller
> Priority: Minor
> Attachments: spanhighlighter.patch, spanhighlighter10.patch,
> spanhighlighter11.patch, spanhighlighter12.patch, spanhighlighter2.patch,
> spanhighlighter3.patch, spanhighlighter5.patch, spanhighlighter6.patch,
> spanhighlighter7.patch, spanhighlighter8.patch, spanhighlighter9.patch,
> spanhighlighter_patch_4.zip
>
> This patch adds a new Scorer class (SpanQueryScorer) to the Highlighter
> package that scores just like QueryScorer, but scores a 0 for Terms that did
> not cause the Query hit. This gives 'actual' hit highlighting for the range
> of SpanQuerys and PhraseQuery. There is also a new Fragmenter that attempts
> to fragment without breaking up Spans.
> See http://issues.apache.org/jira/browse/LUCENE-403 for some background.
> There is a dependency on MemoryIndex.
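The behavior the SpanQueryScorer adds can be illustrated with a toy,
Lucene-free sketch: for a phrase query, only terms sitting inside an actual
occurrence of the phrase are highlighted, not every standalone occurrence of a
query term. This is a hand-rolled illustration of the idea, not the patch's
code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SpanHighlightSketch {
    /**
     * Wraps in <b>...</b> only the tokens that fall inside an actual
     * occurrence of the phrase; a term that merely matches a query word
     * gets a "score of 0", i.e. no markup.
     */
    public static List<String> highlightPhrase(List<String> tokens, List<String> phrase) {
        boolean[] hit = new boolean[tokens.size()];
        for (int i = 0; i + phrase.size() <= tokens.size(); i++) {
            if (tokens.subList(i, i + phrase.size()).equals(phrase)) {
                for (int j = i; j < i + phrase.size(); j++) hit[j] = true;
            }
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            out.add(hit[i] ? "<b>" + tokens.get(i) + "</b>" : tokens.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("lucene", "in", "action", "is", "in", "print");
        // Only the "in action" occurrence is highlighted; the lone "in" later is not.
        System.out.println(highlightPhrase(tokens, Arrays.asList("in", "action")));
        // [lucene, <b>in</b>, <b>action</b>, is, in, print]
    }
}
```

A plain term-level scorer (the old QueryScorer behavior) would also have
marked the second "in", which is exactly what the patch avoids.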
[jira] Commented: (LUCENE-794) Extend contrib Highlighter to properly support phrase queries and span queries
[ https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546181 ]

Mark Harwood commented on LUCENE-794:
-------------------------------------

Committing it makes sense to me. I want to spend some time reviewing this in
more detail once I'm through with contributing the new web-based version of
Luke. At a quick glance, does the new JUnit test in this patch encompass both
the old and new Highlighter tests? If so, should we remove the old JUnit test
where they overlap?
[jira] Updated: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-584:
---------------------------------

Attachment: lucene-584.patch

OK, here's a patch that compiles cleanly on current trunk, and all tests pass.
It includes:
- all changes from Matcher-20071122-1ground.patch
- util/BitSetMatcher.java from Matcher-20070905-2default.patch
- Hits.java changes from Matcher-20070905-3core.patch
- Filter#getMatcher() returns the BitSetMatcher

Would you be up for providing testcases? As I said, I haven't fully reviewed
the patch, but I'm planning to do that soon. I can vouch that all tests pass
after applying the patch.

> Decouple Filter from BitSet
> ---------------------------
>
> Key: LUCENE-584
> URL: https://issues.apache.org/jira/browse/LUCENE-584
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.0.1
> Reporter: Peter Schäfer
> Assignee: Michael Busch
> Priority: Minor
> Attachments: bench-diff.txt, bench-diff.txt, lucene-584.patch,
> Matcher-20070905-2default.patch, Matcher-20070905-3core.patch,
> Matcher-20071122-1ground.patch, Some Matchers.zip
>
> {code}
> package org.apache.lucene.search;
> public abstract class Filter implements java.io.Serializable
> {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
> public interface AbstractBitSet
> {
>   public boolean get(int index);
> }
> {code}
> It would be useful if the method =Filter.bits()= returned an abstract
> interface, instead of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's
> privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of
> memory. It would be desirable to have an alternative BitSet implementation
> with a smaller memory footprint.
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was
> obviously not designed for that purpose.
> That's why I propose to use an interface instead. The default implementation
> could still delegate to =java.util.BitSet=.
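The decoupling proposed in the issue description is easy to sketch: hide the
bit test behind an interface, and let a sparse implementation back it when few
documents match. The names below (Bits, SortedIntBits) are hypothetical,
chosen only to show the memory win for a restrictive security filter:

```java
import java.util.Arrays;

public class SparseBitsSketch {
    /** Interface in the spirit of the proposed AbstractBitSet. */
    public interface Bits {
        boolean get(int index);
    }

    /** Sparse implementation: stores only the set document ids. */
    public static class SortedIntBits implements Bits {
        private final int[] docs; // kept sorted for binary search

        public SortedIntBits(int... docs) {
            this.docs = docs.clone();
            Arrays.sort(this.docs);
        }

        public boolean get(int index) {
            return Arrays.binarySearch(docs, index) >= 0;
        }
    }

    public static void main(String[] args) {
        // A filter over a 10M-doc index that admits only three documents
        // costs three ints here, versus ~1.25 MB of words in java.util.BitSet.
        Bits bits = new SortedIntBits(7, 4_000_000, 9_999_999);
        System.out.println(bits.get(7)); // true
        System.out.println(bits.get(8)); // false
    }
}
```

The default implementation could still delegate to java.util.BitSet, exactly
as the description suggests, since callers only ever see the interface.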
[jira] Assigned: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch reassigned LUCENE-584:
------------------------------------

Assignee: Michael Busch
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546166 ]

Paul Elschot commented on LUCENE-584:
-------------------------------------

The patch is backwards compatible, except for current subclasses of Filter
that already have a getMatcher method. The fact that no changes are needed to
contrib confirms the compatibility.

I have made no performance tests on BitSetMatcher, for two reasons. The first
is that OpenBitSet is actually faster than BitSet (have a look at the graph in
the Some Matchers.zip attachment by Eks Dev), so it seems better to go in that
direction. The second is that it is easy to do the skipping in IndexSearcher
on a BitSet directly by using nextSetBit on the BitSet instead of skipTo on
the BitSetMatcher. For this it would only be necessary to check whether the
given MatchFilter is a Filter. Anyway, I prefer to see where the real
performance bottlenecks are before optimizing for performance.

DefaultMatcher should be in the ...2default... patch. The change in Hits to
use MatchFilter should be in the ...3core... patch. So far I have never tried
to use these patches on their own; I have only split them for a better
overview. Splitting the combined patches to iterate would need a different
split, as you found out. It might even be necessary to split within a single
class, but I'll gladly do that.
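Paul's point about skipping on the BitSet directly uses java.util.BitSet's
real nextSetBit method, which jumps from one set bit to the next without
testing every index in between. A minimal sketch of that iteration style:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class NextSetBitSketch {
    /** Collects matching doc ids by skipping from one set bit to the next. */
    public static List<Integer> matchingDocs(BitSet bits) {
        List<Integer> docs = new ArrayList<>();
        // The skipTo-style iteration done on the BitSet itself: nextSetBit
        // jumps over runs of clear bits instead of probing every index.
        for (int doc = bits.nextSetBit(0); doc >= 0; doc = bits.nextSetBit(doc + 1)) {
            docs.add(doc);
        }
        return docs;
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet();
        bits.set(3);
        bits.set(17);
        bits.set(42);
        System.out.println(matchingDocs(bits)); // [3, 17, 42]
    }
}
```

This is why a BitSet-backed filter can be special-cased in IndexSearcher, as
Paul suggests, without going through a Matcher wrapper at all.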