[jira] Commented: (LUCENE-888) Improve indexing performance by increasing internal buffer sizes
[ https://issues.apache.org/jira/browse/LUCENE-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499044 ]

John Haxby commented on LUCENE-888:
-----------------------------------

> Net/net it's between 10-18% performance gain overall. It is
> interesting that the system with the "weakest" IO system (one drive on
> Windows XP vs RAID 0/5 on the others) has the best gains.

Actually, it's not that surprising. Linux and BSD (MacOS) kernels work hard to do good I/O without the user having to do much to take it into account. The improvement you're seeing on those systems has as much to do with the fact that you're dealing with complete file system block sizes (4x4k) and complete VM page sizes (4x4k). You'd probably see similar gains just going from 1k to 4k though: even "cp" benefits from using a 4k block size rather than 1k. I'd guess that a 4k or 8k buffer would be best on Linux/MacOS and that you wouldn't see much difference going to 16k. In fact, in the MacOS tests the big jump seems to be from 1k to 4k, with smaller improvements thereafter.

I'm not that surprised by the WinXP changes: the I/O subsystem on a laptop is usually dire and anything that cuts down on the I/O is going to be a big help. I would expect the difference to be more dramatic with a FAT32 file system than with NTFS though.

> Improve indexing performance by increasing internal buffer sizes
>
> Key: LUCENE-888
> URL: https://issues.apache.org/jira/browse/LUCENE-888
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.1
> Reporter: Michael McCandless
> Assigned To: Michael McCandless
> Priority: Minor
>
> In working on LUCENE-843, I noticed that two buffer sizes have a
> substantial impact on overall indexing performance.
> First is BufferedIndexOutput.BUFFER_SIZE (also used by
> BufferedIndexInput). Second is CompoundFileWriter's buffer used to
> actually build the compound file. Both are now 1 KB (1024 bytes).
> I ran the same indexing test I'm using for LUCENE-843. I'm indexing > ~5,500 byte plain text docs derived from the Europarl corpus > (English). I index 200,000 docs with compound file enabled and term > vector positions & offsets stored plus stored fields. I flush > documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to > not hit LUCENE-845. The resulting index is 1.7 GB. The index is not > optimized in the end and I left mergeFactor @ 10. > I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO > system. > At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if > I increase both buffers to 8 KB it takes 554 sec to build the index, > which is an 11% overall gain! > I will run more tests to see if there is a natural knee in the curve > (buffer size above which we don't really gain much more performance). > I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE > at 1024, at least for now. During searching there can be quite a few > of this class instantiated, and likely a larger buffer size for the > freq/prox streams could actually hurt search performance for those > searches that use skipping. > The CompoundFileWriter buffer is created only briefly, so I think we > can use a fairly large (32 KB?) buffer there. And there should not be > too many BufferedIndexOutputs alive at once so I think a large-ish > buffer (16 KB?) should be OK. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
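A rough way to see why the buffer size matters is to count how many read calls it takes to move the same amount of data through a copy loop. The sketch below is not Lucene code: `BufferSizeDemo` and `countReads` are hypothetical names, and an in-memory stream stands in for the index file, so it shows only the number of calls, not real I/O timings.

```java
import java.io.ByteArrayInputStream;

// Hypothetical illustration, not part of Lucene: counts the read calls
// needed to drain a stream with a given buffer size.
class BufferSizeDemo {

    // Returns the number of read calls that actually delivered data.
    static int countReads(ByteArrayInputStream in, int bufferSize) {
        byte[] buffer = new byte[bufferSize];
        int calls = 0;
        // read() returns -1 once the stream is exhausted
        while (in.read(buffer, 0, buffer.length) != -1) {
            calls++;
        }
        return calls;
    }

    public static void main(String[] args) {
        byte[] data = new byte[64 * 1024]; // 64 KB of zeros stands in for index data
        for (int size : new int[] {1024, 4096, 8192, 16384}) {
            int calls = countReads(new ByteArrayInputStream(data), size);
            System.out.println(size + " byte buffer: " + calls + " reads");
        }
    }
}
```

Each call into the operating system has a fixed cost, so fewer and larger transfers tend to help until the buffer matches the block or page size; beyond that point the gain from a bigger buffer is likely to flatten, which is the "knee" the issue is looking for.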
Re: [att: pmc] [off topic] ezmlm and reply-to
Steven Rowe wrote:

>> If you do want to add a reply-to list header, put ``reply-to'' into
>> DIR/headerremove, and ``Reply-To: [EMAIL PROTECTED]'' into
>> DIR/headeradd.
>
> My guess, given that the ``Reply-To: [EMAIL PROTECTED]'' header is
> already inserted into the header, is that putting ``reply-to'' into
> DIR/headerremove will fix the problem for you.

I think that that is necessary because ...

    Mailing-List: contact [EMAIL PROTECTED]; run by ezmlm
    List-Id:
    Reply-To: java-dev@lucene.apache.org
    Date: Sun, 23 Jul 2006 19:10:47 +0200
    From: "Simon Willnauer" <[EMAIL PROTECTED]>
    Reply-To: [EMAIL PROTECTED]
    To: java-dev@lucene.apache.org
    Subject: Gdata - opening/closing index

... that's not a legal header. RFC2822 says that you can only have one Reply-To: header. If the mailing list manager isn't deleting the original then it really should be merging them (you can have more than one address). The fact that some mailers choose the first Reply-To: and some choose the second (or last) is not the problem -- if you don't have a legal header then any interpretation is reasonable.

jch
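The "merge them" option could look roughly like the sketch below. It is only an illustration: `ReplyToMerger` and `mergeReplyTo` are hypothetical names, the example.org addresses are placeholders, and splitting on commas is a simplification that ignores display names and folded header lines, so it is not a full RFC 2822 parser and not what ezmlm does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: collapse several Reply-To header lines into one.
class ReplyToMerger {

    private static final String NAME = "Reply-To:";

    // Returns the headers with all Reply-To lines replaced by a single line
    // (placed where the first one was) listing every distinct address once.
    static List<String> mergeReplyTo(List<String> headers) {
        Set<String> addresses = new LinkedHashSet<>();
        List<String> result = new ArrayList<>();
        int insertAt = -1;
        for (String header : headers) {
            // header names are case-insensitive
            if (header.regionMatches(true, 0, NAME, 0, NAME.length())) {
                if (insertAt < 0) {
                    insertAt = result.size();
                }
                // naive split: assumes plain addresses without quoted commas
                for (String address : header.substring(NAME.length()).split(",")) {
                    String trimmed = address.trim();
                    if (!trimmed.isEmpty()) {
                        addresses.add(trimmed);
                    }
                }
            } else {
                result.add(header);
            }
        }
        if (insertAt >= 0 && !addresses.isEmpty()) {
            result.add(insertAt, NAME + " " + String.join(", ", addresses));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> headers = Arrays.asList(
                "Reply-To: list@example.org",      // placeholder list address
                "From: sender@example.org",
                "Reply-To: sender@example.org",    // placeholder sender address
                "Subject: Gdata - opening/closing index");
        for (String line : mergeReplyTo(headers)) {
            System.out.println(line);
        }
    }
}
```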
Re: Results (Re: Survey: Lucene and Java 1.4 vs. 1.5)
DM Smith wrote:

> I simply meant that the change that is being made should be done in
> such a way that one applying the patch can readily see what is being
> changed. The most common case of unnecessary change is that of
> whitespace. Changing indentation, changing the placement of curly
> braces, reordering methods and variables and so forth are all
> unnecessary. [snip] Such a change is most likely unnecessary.

Others, probably including me, would disagree. Changes to make the source have a consistent style and a consistent layout are not uncommon. Look through the Linux kernel change logs for "whitespace clean up" (or "white space" and "cleanup", spaces are optional :-)). The GNU glibc maintainers will reject patches that do not conform to the coding style for glibc -- and that includes stylistic choices like the ones you mentioned (that I cut in the interests of brevity).

Style may make no functional difference to the code but it does affect maintainability. It may well also affect correctness. You could declare all your variables as "Object" and simply cast to the right type to get the method you want. There would be no functional difference (one could argue that eliminating run-time type checking is merely an optimisation) but would you seriously want to code this way?

Similarly, and I'm struggling to keep vaguely on-topic here, the Java 1.5 iteration constructs are functionally no different to their 1.4 equivalents. But to dismiss the 1.5 changes as "syntactic sugar" or "fluff" is to denigrate their importance to the reliability and maintenance of software. If you declared all your variables as "Object" would your code be more reliable, about the same or less? (That's a rhetorical question, I hope.)

jch
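To make the comparison concrete, here is a small sketch of the same loop written both ways. The class and method names are made up for illustration and nothing here is taken from the Lucene source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical example contrasting 1.4-style and 1.5-style iteration.
class IterationStyles {

    // Java 1.4 style: raw collection, explicit iterator, cast checked at run time.
    @SuppressWarnings("rawtypes")
    static int totalLengthRaw(List words) {
        int total = 0;
        for (Iterator it = words.iterator(); it.hasNext();) {
            String word = (String) it.next(); // ClassCastException if a non-String slipped in
            total += word.length();
        }
        return total;
    }

    // Java 1.5 style: the element type is part of the signature, so the
    // compiler rejects callers that pass a list of anything but String.
    static int totalLengthTyped(List<String> words) {
        int total = 0;
        for (String word : words) {
            total += word.length();
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("eggs", "oranges");
        System.out.println(totalLengthRaw(words));
        System.out.println(totalLengthTyped(words));

        List<Object> mixed = new ArrayList<>();
        mixed.add("eggs");
        mixed.add(Integer.valueOf(7));
        try {
            totalLengthRaw(mixed); // compiles, fails only when it runs
        } catch (ClassCastException e) {
            System.out.println("raw version failed at run time");
        }
        // totalLengthTyped(mixed) would not compile at all
    }
}
```

Both methods do the same work on well-formed input; the difference is when a mistake is found, which is the maintainability argument being made above.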
Re: Results (Re: Survey: Lucene and Java 1.4 vs. 1.5)
DM Smith wrote:

> Generally open source projects have a policy to change as little of
> the file as possible, only changing what is necessary.

Hmmm. Necessary by what criterion? Necessary to make, say, Lucene exploit the new iterator constructs to avoid run-time type-checking? Necessary to make the code more readable? Necessary to prevent use with Java 1.4? :-)

I'm not sure I've ever seen a policy expressed in that way -- patches generally should be clear, concise and do what they're intended to do, but that doesn't necessarily mean minimising the size of the patch and it doesn't necessarily mean keeping the source compatible with some old compiler or environment.

Indeed, to be hypothetical and not entirely off-topic, it is easy to imagine two patches that are better than one. For the two patches, the first reorganises a class so that it exploits Java 1.5 features and the second, "real" patch uses that new structure to cleanly and elegantly implement some new feature. For the one patch, the code stays compatible with 1.4, but the functional patch is now much larger, more complex and harder to verify.

It's possible that such a hypothetical patch (or pair of patches) is at the core of the question of this thread. Does one embrace new language features because they provide some tangible benefit to the maintainability, functionality or complexity of the code? If not, for how long? Should Lucene development freeze at 1.4 until there's no working hardware that runs 1.4? Would that also preclude changes to Lucene that make it work dramatically better on a machine with the current de-facto "standard" memory? What happens when that's, say, 4Gb and some old hardware simply won't let you install that much memory? Would it not be better to freeze application development that needs an old environment and simply back-port bug fixes and, where it makes sense, functionality to the version of Lucene that is used in that environment?
The approach taken by Red Hat with their Enterprise Linux series is that they'll support a version of the platform for several years, back-porting bug fixes, adding small, incremental functional changes and so on. That means that this antique computer that happily runs RHEL3 will be able to carry on running an OS and applications that work on that hardware until it finally gives up the ghost.

jch
Re: Results (Re: Survey: Lucene and Java 1.4 vs. 1.5)
Robert Engels wrote:

> To set the record straight, I think the Lucene product and community
> are fantastic. Period.

Ditto.

[snip]

> After almost 2 years I now back the move. Why? Several reasons:
>
> 1. Sun is very slow, if at all, to fix bugs in 1.4 (of which there are
>    many). For example, the current problems in Lucene regarding
>    ThreadLocals. Although this is not a bug per se, it is probably not
>    intuitive or desired behavior. The Lucene developers have been
>    forced to both diagnose and create workarounds for "problems"
>    already fixed in 1.5. The licensing of Java does not allow for the
>    easy fixing of bugs by non-Sun developers.
> 2. The type safe collections are far more efficient to program/debug
>    with.
> 3. The standardized concurrent facilities can be of great benefit to
>    multithreaded programs.
> 4. It is what students graduating from college understand and use.
> 5. It is what the currently available books explain and use.

For my money (2) is the most important reason for moving to 1.5. My early years (!) involved a lot of work with programming languages, and finally having type-safe collections and the syntactic conventions that go along with them immediately struck me as a good step forward. That first impression was borne out when I converted some of my Java code to use the 1.5 type-safe constructs. Not only was the code shorter and more understandable (aka more maintainable) but it brought to light some bugs that had lain there dormant for quite a while.

Lucene is well-tested and stable and written by people better at writing Java than me, so it's unlikely that there'll be any bugs like that lurking in dark corners. On the other hand, new code stands a better chance of being bug free precisely because of the improvements in the language and the improved type-safety.

(3) comes second for me though -- I'm a big fan of Doug Lea's util.concurrent classes and having them well integrated in Java 1.5 makes them even better, but that's the operating system personality talking.
jch
Re: wildcard search with variable length
Doug Cutting wrote:

> DM Smith wrote:
>
>> Personally, I don't want an either/or. I want a both/and. Modern unix
>> shells provide both/and, albeit with different syntax. I see this more
>> as a feature request than an argument as to the usefulness or
>> properness of either. Both are useful. Both are proper. Both are
>> intuitive. Both are counterintuitive. It all depends on your
>> "tradition".
>
> +1
>
> Doug

Doesn't the RegexQuery do this for you?

jch
Re: wildcard search with variable length
Andrzej Bialecki wrote:

> Tiago Silveira wrote:
>
>> IMHO, using "cat cat?" or even "cat cat? cat??" is so simple that it
>> doesn't justify keeping the old, undocumented, arguably incorrect
>> behavior.
>
> I have a different view on this issue - IMHO treating "?" as "exactly
> one character" is counterintuitive for people familiar with the use of
> wildcards: in all popular regular expression languages, and also in
> DTD/XML world, a single "?" metacharacter means "zero or one", which is
> probably why the original behavior was introduced (or at least it was
> more compatible with the use of "?" in other contexts).

Ahh. Well. If "cat?" is a regular expression then it will match "ca" and "cat". "cat??" is probably not a valid regular expression: the final ? means "one or zero occurrences of t?" which means that it too matches "ca" and "cat".

However, the javadoc defines "?" and its definition matches the shell glob definition, and it's quite clear just from the docs that WildcardQuery is not a RegexQuery.

I can't comment on the wildcard character in a DTD/XML context; I'm not that familiar with it.

jch
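The two readings of "?" can be compared directly with java.util.regex. This is only a sketch of the semantics under discussion, not how WildcardQuery or RegexQuery are implemented; `WildcardVsRegex`, `globToRegex` and `globMatches` are hypothetical names. One detail specific to java.util.regex: there "??" is a reluctant quantifier, so "cat??" does compile, and as a whole-string match it accepts the same two strings as "cat?".

```java
import java.util.regex.Pattern;

// Hypothetical sketch comparing shell-glob "?" with regex "?".
class WildcardVsRegex {

    // Translates a shell-style glob into a regex: '?' is exactly one
    // character, '*' is any run of characters, everything else is literal.
    static Pattern globToRegex(String glob) {
        StringBuilder regex = new StringBuilder();
        for (char c : glob.toCharArray()) {
            if (c == '?') {
                regex.append('.');
            } else if (c == '*') {
                regex.append(".*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    static boolean globMatches(String glob, String term) {
        return globToRegex(glob).matcher(term).matches();
    }

    public static void main(String[] args) {
        // Glob reading: "cat?" needs exactly one character after "cat".
        System.out.println(globMatches("cat?", "cat"));      // false
        System.out.println(globMatches("cat?", "cats"));     // true
        // Regex reading: "cat?" makes the final 't' optional.
        System.out.println(Pattern.matches("cat?", "ca"));   // true
        System.out.println(Pattern.matches("cat?", "cats")); // false
        // Regex for "cat plus zero or one character" is "cat.?".
        System.out.println(Pattern.matches("cat.?", "cat"));  // true
        System.out.println(Pattern.matches("cat.?", "cats")); // true
    }
}
```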
Re: wildcard search with variable length
Tiago Silveira wrote:

> IMHO, using "cat cat?" or even "cat cat? cat??" is so simple that it
> doesn't justify keeping the old, undocumented, arguably incorrect
> behavior.

I don't think there's any question of the old behaviour being incorrect -- the javadoc says that ? matches a single character: not zero or one characters, a single character. On the other hand, does Erik's new RegexQuery support "cat.?" (the ".?" does match zero or one characters)? (Where's the javadoc for that? I don't see any comments in the source, let alone anything else :-))

jch
[jira] Commented: (LUCENE-497) update copyright (and licence) prior to release of 1.9
[ http://issues.apache.org/jira/browse/LUCENE-497?page=comments#action_12366465 ]

John Haxby commented on LUCENE-497:
-----------------------------------

It's not as if I'm a lawyer either or what I say is likely to carry much weight, but what Yonik and Erik say matches what the legal people at my previous employer (HP) said -- don't change the copyright date unless the file has changed. The underlying reason, it was explained to me, is that if you blindly claim copyright for years when you didn't do anything then a judge (if it came down to that) is going to take the view that your copyright notices don't actually have much value. It's not, they said, the actual form of the copyright notice that matters; it's convincing a judge that you do hold the copyright. To that end, I asked, a good copyright notice is a help and a bad copyright notice is a hindrance? Yes, came the answer.

> update copyright (and licence) prior to release of 1.9
>
> Key: LUCENE-497
> URL: http://issues.apache.org/jira/browse/LUCENE-497
> Project: Lucene - Java
> Type: New Feature
> Reporter: Hoss Man
> Priority: Minor
>
> As discussed in email earlier today, it wouldn't hurt to update the Copyright
> on all of the source files before release 1.9.
> Rather than try to submit a patch with all the changes, here's a oneliner that
> should work on any unix box to update in mass. If it sees a Copyright string
> it recognizes, it preserves the start year and adds/replaces the end year...
> find -name \*.java | xargs perl -pi -e 's/Copyright (\(c\) )?(200[0-5])(-\d+)? (The )?Apache Software Foundation/Copyright ${2}-2006 The Apache Software Foundation/;'
> ...it would make sense for someone with commit permissions to run that
> themselves.
> It also cleans up a few that have a " (c) " in them that doesn't seem
> standard across the rest of the files, and makes sure that the ASF is referred
> to as "The" ASF.
> Even after all that, there are a few that may need cleaned up by hand...
> ./src/test/org/apache/lucene/store/TestLock.java: * Copyright (c) 2001,2004 > The Apache Software Foundation. All rights > ./src/test-deprecated/org/apache/lucene/index/DocHelper.java: * Copyright > 2004. Center For Natural Language Processing > ./contrib/analyzers/src/java/org/apache/lucene/analysis/cn/ChineseAnalyzer.java: > * Copyright: Copyright (c) 2001 > ./contrib/analyzers/src/java/org/apache/lucene/analysis/cn/ChineseFilter.java: > * Copyright:Copyright (c) 2001 > ./contrib/analyzers/src/java/org/apache/lucene/analysis/cn/ChineseTokenizer.java: > * Copyright: Copyright (c) 2001 > ...the first is just an anoying format, the rest either have non ASF > copyrights, or dual copyrights (!?) > It also may be a good time to take a look at all the (non-JavaCC generated) > java files that don't mention the Apache License, Version 2.0 ... > @asimov:~/svn/lucene/java$ find src -name \*.java | xargs grep -L "Generated > By:JavaCC" | xargs grep -L LICENSE-2.0 > src/java/org/apache/lucene/search/SortComparatorSource.java > src/java/org/apache/lucene/search/SortComparator.java > src/test/org/apache/lucene/index/TestTermVectorsReader.java > src/test/org/apache/lucene/index/TestSegmentTermEnum.java > src/test/org/apache/lucene/index/TestFieldInfos.java > src/test/org/apache/lucene/index/TestIndexWriter.java > src/test/org/apache/lucene/store/TestLock.java > src/test/org/apache/lucene/store/_TestHelper.java > src/test/org/apache/lucene/search/TestRangeQuery.java > src/test/org/apache/lucene/TestHitIterator.java > src/test/org/apache/lucene/document/TestBigBinary.java > src/test/org/apache/lucene/analysis/TestISOLatin1AccentFilter.java > src/test-deprecated/org/apache/lucene/index/TestTermVectorsReader.java > src/test-deprecated/org/apache/lucene/index/store/FSDirectoryTestCase.java > src/test-deprecated/org/apache/lucene/index/TestSegmentTermEnum.java > src/test-deprecated/org/apache/lucene/index/DocHelper.java > 
src/test-deprecated/org/apache/lucene/index/TestIndexWriter.java > src/test-deprecated/org/apache/lucene/search/TestRangeQuery.java -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: svn commit: r375070
Daniel Naber wrote:

> On Sunday 05 February 2006 19:45, Pasha Bizhan wrote:
>
>> Does this patch require to reindex all data?
>>
>>> URL: http://svn.apache.org/viewcvs?rev=375070&view=rev
>>> Log: DateTools needs to use UTC for correct collation (LUCENE-491),
>>> patch by John Haxby
>
> If your timezone is not UTC and your dates need to be accurate to the
> hour, then yes.

That's true, but be aware that, for example, Monday, 6pm PST is Tuesday, 2am GMT. Even if you have Resolution.YEAR, the last few hours of 2005 in California are actually 2006 GMT. If you're worried about events crossing a date boundary then you'll need to re-index.

jch
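The year-boundary case can be checked with a few lines of Java. This sketch uses the modern java.time API purely to illustrate the arithmetic; it is not the DateTools code, and `YearBoundaryDemo` and `yearIn` are hypothetical names.

```java
import java.time.Instant;
import java.time.ZoneId;

// Hypothetical illustration of an instant whose calendar year depends on
// the timezone used to interpret it.
class YearBoundaryDemo {

    // Calendar year of the given instant as seen in the given zone.
    static int yearIn(Instant instant, String zoneId) {
        return instant.atZone(ZoneId.of(zoneId)).getYear();
    }

    public static void main(String[] args) {
        // 02:00 UTC on 1 Jan 2006 is 6pm PST on 31 Dec 2005.
        Instant instant = Instant.parse("2006-01-01T02:00:00Z");
        System.out.println("California: " + yearIn(instant, "America/Los_Angeles"));
        System.out.println("UTC:        " + yearIn(instant, "UTC"));
    }
}
```

A document indexed under local time with year resolution would carry one year, and the same document indexed under UTC would carry the next, which is why terms written before the change may not line up with terms written after it.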
[jira] Updated: (LUCENE-491) DateTools needs to use UTC for correct collation,
[ http://issues.apache.org/jira/browse/LUCENE-491?page=all ]

John Haxby updated LUCENE-491:
-----------------------------

Attachment: patch

Patch for problem. Basically, wherever a timezone can be used, we use GMT.

> DateTools needs to use UTC for correct collation,
>
> Key: LUCENE-491
> URL: http://issues.apache.org/jira/browse/LUCENE-491
> Project: Lucene - Java
> Type: Bug
> Versions: CVS Nightly - Specify date in submission
> Environment: svn trunk at 02-Feb-2005, noon GMT. OS independent.
> Reporter: John Haxby
> Attachments: patch, testcase.java
>
> If your local timezone is Europe/London then the times Sun, 30 Oct 2005
> 00:00:00 +0000 and exactly one hour later are both converted to 200510300100
> by DateTools.dateToString() with minute resolution. The Linux date command
> is useful in seeing why:
> $ date --date "Sun, 30 Oct 2005 00:00:00 +0000"
> Sun Oct 30 01:00:00 BST 2005
> $ date --date "Sun, 30 Oct 2005 01:00:00 +0000"
> Sun Oct 30 01:00:00 GMT 2005
> Both times are 1am in the morning, but one is when DST is in force, the other
> isn't. Of course, these are actually different times!
> Of course, if dates are stored in the index with implicit timezone
> information then not only do we get problems when the clocks go back at the
> end of summer, but we also have problems crossing timezones. If a database
> is created in California and used in Paris then the times are going to be
> badly skewed (there's a nine hour time difference most of the year).
[jira] Updated: (LUCENE-491) DateTools needs to use UTC for correct collation,
[ http://issues.apache.org/jira/browse/LUCENE-491?page=all ]

John Haxby updated LUCENE-491:
-----------------------------

Attachment: testcase.java

TestCase for problem.

> DateTools needs to use UTC for correct collation,
>
> Key: LUCENE-491
> URL: http://issues.apache.org/jira/browse/LUCENE-491
> Project: Lucene - Java
> Type: Bug
> Versions: CVS Nightly - Specify date in submission
> Environment: svn trunk at 02-Feb-2005, noon GMT. OS independent.
> Reporter: John Haxby
> Attachments: testcase.java
>
> If your local timezone is Europe/London then the times Sun, 30 Oct 2005
> 00:00:00 +0000 and exactly one hour later are both converted to 200510300100
> by DateTools.dateToString() with minute resolution. The Linux date command
> is useful in seeing why:
> $ date --date "Sun, 30 Oct 2005 00:00:00 +0000"
> Sun Oct 30 01:00:00 BST 2005
> $ date --date "Sun, 30 Oct 2005 01:00:00 +0000"
> Sun Oct 30 01:00:00 GMT 2005
> Both times are 1am in the morning, but one is when DST is in force, the other
> isn't. Of course, these are actually different times!
> Of course, if dates are stored in the index with implicit timezone
> information then not only do we get problems when the clocks go back at the
> end of summer, but we also have problems crossing timezones. If a database
> is created in California and used in Paris then the times are going to be
> badly skewed (there's a nine hour time difference most of the year).
[jira] Created: (LUCENE-491) DateTools needs to use UTC for correct collation,
DateTools needs to use UTC for correct collation,
-------------------------------------------------

Key: LUCENE-491
URL: http://issues.apache.org/jira/browse/LUCENE-491
Project: Lucene - Java
Type: Bug
Versions: CVS Nightly - Specify date in submission
Environment: svn trunk at 02-Feb-2005, noon GMT. OS independent.
Reporter: John Haxby

If your local timezone is Europe/London then the times Sun, 30 Oct 2005 00:00:00 +0000 and exactly one hour later are both converted to 200510300100 by DateTools.dateToString() with minute resolution. The Linux date command is useful in seeing why:

    $ date --date "Sun, 30 Oct 2005 00:00:00 +0000"
    Sun Oct 30 01:00:00 BST 2005
    $ date --date "Sun, 30 Oct 2005 01:00:00 +0000"
    Sun Oct 30 01:00:00 GMT 2005

Both times are 1am in the morning, but one is when DST is in force and the other isn't. Of course, these are actually different times!

Of course, if dates are stored in the index with implicit timezone information then not only do we get problems when the clocks go back at the end of summer, but we also have problems crossing timezones. If a database is created in California and used in Paris then the times are going to be badly skewed (there's a nine hour time difference most of the year).
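The collision described in the report can be reproduced outside Lucene. The sketch below uses java.time rather than the classes DateTools is built on, and `DstCollisionDemo` and `toMinuteString` are hypothetical names; it only demonstrates that formatting in a local zone can map two different instants to one string while formatting in UTC cannot.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical reproduction of the end-of-DST collision at minute resolution.
class DstCollisionDemo {

    private static final DateTimeFormatter MINUTE =
            DateTimeFormatter.ofPattern("yyyyMMddHHmm", Locale.ROOT);

    // Formats the instant as yyyyMMddHHmm in the given zone.
    static String toMinuteString(Instant instant, String zoneId) {
        return MINUTE.withZone(ZoneId.of(zoneId)).format(instant);
    }

    public static void main(String[] args) {
        Instant first = Instant.parse("2005-10-30T00:00:00Z");  // 01:00 BST
        Instant second = Instant.parse("2005-10-30T01:00:00Z"); // 01:00 GMT
        for (String zone : new String[] {"Europe/London", "UTC"}) {
            String a = toMinuteString(first, zone);
            String b = toMinuteString(second, zone);
            System.out.println(zone + ": " + a + " / " + b
                    + (a.equals(b) ? "  (collision)" : ""));
        }
    }
}
```

Because UTC has no daylight saving transitions, strings produced in UTC sort in the same order as the instants they came from, which is the property the index needs for range queries and sorting.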
Re: access rights
I'd suggest using a QueryFilter.

What happens when the access rights for a document or a collection of documents change? How do you deal with new users?

I'm asking because I'm looking at the same problem, but rather than attempting to keep the access rights in the index consistent with the access rights in the original store, I'm looking at using a QueryFilter that checks the original document for access. It's slow, but it can be cached.

This question probably belongs on java-user though.

jch

Maros Ivanco wrote:

> Hi,
>
> I am trying to implement an access rights mechanism on top of Lucene.
> My situation looks like this: indexed documents have associated access
> rights information. When I construct the query, I append a part which
> matches the actual user identity against the access rights in the
> documents. This way the user gets only the documents s/he can really
> access, and the number of hits is really the number of documents s/he
> can potentially access.
>
> The approach works (it respects access rights), but the access rights
> (AR) query part also affects the score of the documents. I tried two
> approaches to avoid the effect of the AR query part. First, I tried to
> set the boost factor of the AR fields to zero during document indexing.
> This way I was unable to get any results. Next, I set the boost factor
> of the AR fields to a small number (0.001). This way I get the results
> but the computed score is really small (less than 1%) for the first
> document in the results.
>
> So, is there any possibility to effectively exclude certain fields from
> score computation? Any idea regarding the access rights issue,
> suggestion for a better approach, ... is welcome.
>
> Maros.
>
> P.S. I found this post on the user list:
> http://www.gossamer-threads.com/lists/lucene/java-user/14973?do=post_view_threaded
> My approach is number 3 in the post, but unfortunately no reply deals
> with it.
[jira] Commented: (LUCENE-489) Wildcard Queries with leading "*"
[ http://issues.apache.org/jira/browse/LUCENE-489?page=comments#action_12363822 ] John Haxby commented on LUCENE-489: --- I'm sure someone mentioned on one of the lists a while back, but there's a technique that we used for an LDAP server that's applicable here. It's a bit like injecting synonyms: you'd have, say, a SubwordFilter that given "brown" would emit "rown" and "own" at the same position. A "*own" query would then simply drop the leading wildcard and look for the word. We stopped at three letters in the LDAP server. An alternative is to use a ReverseAlternativeFilter (say) that emits "brown" and "nworb" at the same position, but that only deals with prefix or postfix wildcards, but not both. I'm not sure how you'd stop "own" matching "brown" though. If someone could come up with some example code I don't suppose I'd be the only one who would be interested! > Wildcard Queries with leading "*" > - > > Key: LUCENE-489 > URL: http://issues.apache.org/jira/browse/LUCENE-489 > Project: Lucene - Java > Type: Wish > Components: QueryParser > Reporter: Peter Schäfer > > It would be nice to have wildcard queries with a leading wildcard ("?" or > "*"). > I'm aware that this is a well-known issue, and I do understand the reasons > behind it, > but try explaining that to our end-users ... :-( -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
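Since the comment asks for example code, here is a minimal sketch of the string handling behind both ideas. It deliberately avoids the Lucene TokenFilter API: `SubwordSketch`, `suffixes`, `reverse` and `stripLeadingWildcard` are hypothetical names, and a real filter would also have to emit the extra terms at the same position as the original token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the index-time expansions described above.
class SubwordSketch {

    // The term itself followed by each of its suffixes down to minLength
    // characters: "brown" with minLength 3 gives brown, rown, own.
    static List<String> suffixes(String term, int minLength) {
        List<String> out = new ArrayList<>();
        out.add(term);
        for (int start = 1; term.length() - start >= minLength; start++) {
            out.add(term.substring(start));
        }
        return out;
    }

    // The reversed form used by the alternative approach: a leading
    // wildcard on the original becomes a trailing wildcard on this.
    static String reverse(String term) {
        return new StringBuilder(term).reverse().toString();
    }

    // Query side of the suffix approach: drop the leading "*" and look the
    // remainder up as an ordinary term.
    static String stripLeadingWildcard(String query) {
        return query.startsWith("*") ? query.substring(1) : query;
    }

    public static void main(String[] args) {
        System.out.println(suffixes("brown", 3));
        System.out.println(reverse("brown"));
        System.out.println(suffixes("brown", 3).contains(stripLeadingWildcard("*own")));
    }
}
```

The sketch also makes the open question visible: a plain query for "own" finds the injected suffix just as "*own" does, so the suffix terms would need to be distinguished from whole words in some way that is not shown here.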
Re: [jira] Created: (LUCENE-489) Wildcard Queries with leading "*"
Peter Schäfer (JIRA) wrote: It would be nice to have wildcard queries with a leading wildcard ("?" or "*"). I'm aware that this is a well-known issue, and I do understand the reasons behind it, but try explaining that to our end-users ... :-( I'm sure someone mentioned this a while back, but there's a technique that we used for an LDAP server that's applicable here. It's a bit like injecting synonyms: you'd have, say, a SubwordFilter that given "brown" would emit "rown" and "own" at the same position. A "*own" query would then simply drop the leading wildcard and look for the word. We stopped at three letters in the LDAP server. An alternative is to use a ReverseAlternativeFilter (say) that emits "brown" and "nworb" at the same position, but that only deals with prefix or postfix wildcards, but not both. I'm not sure how you'd stop "own" matching "brown" though. If someone could come up with some example code I don't suppose I'd be the only one who would be interested! jch - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Inconsistency between MultiFieldQueryParser and QueryParser
Daniel Naber wrote:

> These are just simple convenience methods that create a BooleanQuery.
> Making them non-static would create a different set of problems, e.g.
> that you need to pass them an array with the same number of elements as
> the constructor was given. So I don't know if this is something that
> should be changed.

Ahh. Sorry, I think I misunderstood. I think I wanted to do

    qp = new MultiFieldQueryParser(new String[]{"subject", "text"},
                                   new BooleanClause.Occur[]{MUST, SHOULD},
                                   analyzer);
    qp.setLocale(locale);
    qp.setSlop(slop);

and I'm no longer sure that that has any worthwhile benefit. I think I was also thinking that other combinations of occurrences would be useful but, this morning and for the life of me, I can't see why. I think having both fields required was possibly what I was thinking of.

Maybe I've just lost my marbles :-)

jch
Inconsistency between MultiFieldQueryParser and QueryParser
Hello All,

Not sure if this should be user, dev or a bug report. Apologies if this is the wrong message to the wrong place! Happy to correct it if needed.

QueryParser's static parse() method is deprecated, but MultiFieldQueryParser has three static parse() methods; moreover there's no constructor that takes a BooleanClause.Occur[] and no non-static method that takes a String[] queries.

Am I missing something here? I have to admit I haven't looked at the code to see what's going on.

jch
Re: NioFile cache performance
Robert Engels wrote: Using a 4mb file (so I could be "guarantee" the disk data would be in the OS cache as well), the test shows the following results. Which OS? If it's Linux, what kernel version and distro? What hardware (disk type, controller etc). It's important to know: I/O (and caching) is very different between Linux 2.4 and 2.6. The choice of I/O scheduler can also make a significant difference on 2.6, depending on the workload. The type of disk and its controller is also important -- and when you get really picky, the mobo model number. I don't dispute your finding for a second, but it would be good to run the same test on other platforms to get comparative data: not least because you can get the kind of I/O time improvement you're seeing on some workloads on different versions of the Linux kernel. jch - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: "Advanced" query language
Yonik Seeley wrote:

> I looked into this a year ago... most scripting languages have an
> emphasis on script execution speed, not script parsing speed (which is
> what we would need). The scripting languages I tried were horribly slow
> at parsing a small script. The only one that could parse at a
> reasonable speed was rhino (javascript) in interp mode.

I've always found the lisp syntax very easy to parse. In this case, it's just prefix with the name of the operator being first in the list, e.g. (and "eggs" "oranges"). There are wrinkles for named and optional parameters, but the basic syntax is a doddle.

jch
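As a rough indication of how little parsing the prefix syntax needs, here is a small recursive-descent reader. It is a sketch under simplifying assumptions: `PrefixQueryParser` is a hypothetical name, quoted strings have no escape sequences, atoms and quoted strings both come back as plain String, and turning the resulting lists into actual Lucene queries is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader for expressions such as (and "eggs" "oranges").
class PrefixQueryParser {

    private final String input;
    private int pos = 0;

    private PrefixQueryParser(String input) {
        this.input = input;
    }

    // Parses one expression: a list becomes List<Object>, anything else a String.
    static Object parse(String text) {
        PrefixQueryParser parser = new PrefixQueryParser(text);
        Object result = parser.readExpression();
        parser.skipSpaces();
        if (parser.pos != text.length()) {
            throw new IllegalArgumentException("trailing input at " + parser.pos);
        }
        return result;
    }

    private void skipSpaces() {
        while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) {
            pos++;
        }
    }

    private Object readExpression() {
        skipSpaces();
        if (pos >= input.length()) {
            throw new IllegalArgumentException("unexpected end of input");
        }
        char c = input.charAt(pos);
        if (c == '(') {
            pos++; // consume '('
            List<Object> list = new ArrayList<>();
            while (true) {
                skipSpaces();
                if (pos >= input.length()) {
                    throw new IllegalArgumentException("missing ')'");
                }
                if (input.charAt(pos) == ')') {
                    pos++; // consume ')'
                    return list;
                }
                list.add(readExpression());
            }
        }
        if (c == ')') {
            throw new IllegalArgumentException("unexpected ')' at " + pos);
        }
        if (c == '"') {
            int end = input.indexOf('"', pos + 1);
            if (end < 0) {
                throw new IllegalArgumentException("unterminated string at " + pos);
            }
            String value = input.substring(pos + 1, end);
            pos = end + 1;
            return value;
        }
        // bare atom, e.g. an operator name: runs to whitespace, paren or quote
        int start = pos;
        while (pos < input.length()
                && !Character.isWhitespace(input.charAt(pos))
                && "()\"".indexOf(input.charAt(pos)) < 0) {
            pos++;
        }
        return input.substring(start, pos);
    }

    public static void main(String[] args) {
        System.out.println(parse("(and \"eggs\" (or \"oranges\" \"lemons\"))"));
    }
}
```

The first element of each list is the operator name, so a query builder would only need to switch on that element and recurse into the rest.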
Re: Version 1.9
John Haxby wrote: [...] compiled with gcj that I believe is compiled with gcj [...] It's only compiled once with gcj, if at all :-) You can get it from http://download.fedora.redhat.com/pub/fedora/linux/core/updates/4/SRPMS/lucene-1.4.3-1jpp_3fc.src.rpm A quick inspection of the .spec file suggests that it's compiled with gcj and, indeed, the compiled RPM (http://download.fedora.redhat.com/pub/fedora/linux/core/updates/4/i386/lucene-1.4.3-1jpp_3fc.i386.rpm) has both shared libraries and a jar. jch - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Version 1.9
Jeff Breidenbach wrote: 2) Is anyone testing against kaffe or other non-sun compilers? This is important to Debian as any software that can only be built from a closed-source JDK is considered a second class citizen. As you can see, we've been poking at this issue on Lucene 1.4.3 for quite some time [1] and it is tricky. Support from upstream is always appreciated. I don't know much about it, but Fedora Core 4 ships a version of Lucene compiled with gcj that I believe is compiled with gcj. It'd be easy enough to see what they do in the .src.rpm though and as Debian maintainer I'm sure you're well versed in getting ideas from other builds! jch - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]