Re: ParallelReader
Doug Cutting wrote: Please find attached something I wrote today. It has not yet been tested extensively, and the documentation could be improved, but I thought it would be good to get comments sooner rather than later. Would folks find this useful? My answer: "Is the Pope German?" Very useful for something we're about to start, where we have millions of construction documents, with meta-data and content. Most searching is done against the meta-data (and must be _fast_), but occasionally people want to be able to look inside the file contents, so I can see perhaps 1% of searches hitting the content index in this case. Should it go into the core or in contrib? +1 to core... (non-binding of course). Paul Smith
Re: [Performanc]
At this rate I'm only getting on average 300-400-ish items/second added to the index. I think that's realistic for typical uses of Lucene on common hardware. Thanks Daniel, it's comforting to know that this is at least expected. Can you or anyone else comment on the CPU profile I sent in? If there were a way of optimizing that loop, it could mean a reasonable improvement in indexing speed. cheers, Paul Smith
[Performance]: IndexWriter again...
Ok, I'm just following up on my email from 29th April titled '[Performanc]' (don't you love it when you send before you've typed your subject line completely). The thread is here: http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200504.mbox/[EMAIL PROTECTED] In summary, I still firmly believe that IndexWriter.maybeMergeSegments() is chewing a lot more CPU than would be ideal. So I ran a simple test. I ran the same test I've done before, using mergeFactor(1000), maxBufferedDocs(1), useCompoundFile(false), indexing 5 fields (user first name/last name/email address). As a baseline using the latest SVN source code, I'm getting an indexing rate of between 490-515 items/second over a number of runs. By applying the attached simple patch to IndexWriter, I'm getting between 945-970 items/second over a number of test runs. That's a significant speed-up. All the patch is doing is deferring the call to maybeMergeSegments() so that it only happens every 2000 iterations (2000 is totally arbitrary on my part). I've verified with Luke that the index generated contains the same # documents and same # terms, but I have not had a chance to properly set up my local environment to run the test cases. Obviously the attached patch is a dirty hack of the highest order. In my case I'm re-indexing from scratch every time, so there may be a reason why we shouldn't be doing this sort of deferring of method calls. Perhaps the source code is optimized around incremental/batch updates to _existing_ indexes, at the penalty that creating a new index performs slower than one would like. Perhaps IndexWriter could benefit from another setting that lets one configure how often to call maybeMergeSegments()? That could of course confuse more people than it helps. I would really appreciate anyone's thoughts on this; I'll be very happy to be proven wrong, because it will just help me understand more of Lucene. I would hope that speeding up indexing would benefit everyone, particularly the large scale sites out there. cheers, Paul Smith
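To make the shape of the hack concrete, here is a minimal self-contained sketch of the idea; it is an illustration only, not the real IndexWriter and not the attached patch:

    import java.io.IOException;

    // Hedged sketch of the patch's idea, NOT Lucene's IndexWriter: count
    // document adds and only run the (expensive) merge check every N adds.
    abstract class DeferredMergeWriter {
        private static final int MERGE_CHECK_INTERVAL = 2000; // arbitrary, as noted above
        private int addCount = 0;

        final void addDocument(Object doc) throws IOException {
            bufferDocument(doc);                      // analyze + buffer a new segment, as before
            if (++addCount % MERGE_CHECK_INTERVAL == 0) {
                maybeMergeSegments();                 // previously invoked on every single add
            }
        }

        abstract void bufferDocument(Object doc) throws IOException; // stands in for the real work
        abstract void maybeMergeSegments() throws IOException;       // the method being deferred
    }

The trade-off is that up to 2000 documents' worth of small segments can accumulate between checks, which is exactly why a configurable interval is floated above.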
Re: [Performance]: IndexWriter again...
Silly me, here's the patch with the extra code NOT commented out... Oh my, how embarrassing... :) Paul On 16/05/2005, at 4:15 PM, Paul Smith wrote:
Re: [Performance]: IndexWriter again...
I'm not even going to say anything this time :-$ On 16/05/2005, at 4:17 PM, Paul Smith wrote: Silly me, here's the patch with the extra code NOT commented out... Oh my, how embarrassing... :) Paul On 16/05/2005, at 4:15 PM, Paul Smith wrote:
Re: [Performance]: IndexWriter again...
something very odd is going on with my attachments... sorry for the spam. On 16/05/2005, at 4:22 PM, Paul Smith wrote: I'm not even going to say anything this time :-$ On 16/05/2005, at 4:17 PM, Paul Smith wrote: Silly me, here's the patch with the extra code NOT commented out... Oh my, how embarrassing... :) Paul On 16/05/2005, at 4:15 PM, Paul Smith wrote:
Re: [Performance]: IndexWriter again...
On 16/05/2005, at 5:00 PM, Paul Elschot wrote: On Monday 16 May 2005 08:24, Paul Smith wrote: something very odd is going on with my attachments... sorry for the spam. It's usually easier to open a bug in Bugzilla and post the code and the concerns there. The only disadvantage of Bugzilla is that you can only add attachments after the bug has been opened for the first time: http://issues.apache.org/bugzilla/enter_bug.cgi Thanks Paul, I'm not sure why subsequent attempts are still stripping the attachment. I'll go ahead and file something in Bugzilla, and cross my fingers I don't look any sillier than I do now. cheers, Paul Smith
Map-Reduce
I've been reading the Nutch MapReduce stuff [1], and the original Google paper [2]. I know there's a mapreduce branch in the nutch project, but is there any plan/talk of perhaps integrating something like that directly into the Lucene API? For projects that need a lower-level API like Lucene, rather than the crawl-like nature of Nutch, the potential to index lots of information in an efficient manner is very appealing indeed. I'm not suggesting this is _easy_, just curious what folks on the Lucene side of things think. Perhaps a chance to refactor a shared library out of Nutch? I would love to hear anyone's thoughts on the matter. cheers, Paul Smith [1] http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf [2] http://labs.google.com/papers/mapreduce-osdi04.pdf
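For readers who haven't read the papers, here is a minimal, single-process sketch of the MapReduce programming model (a word count), with none of the distribution, NDFS storage, or fault tolerance that make the real thing interesting; it only illustrates the map (emit key/value) and reduce (fold per key) phases:

    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;

    // Toy word count in the MapReduce style, collapsed into one process.
    public class TinyMapReduce {
        public static void main(String[] args) {
            String[] docs = { "lucene indexes text", "nutch crawls text" };
            Map counts = new HashMap(); // word -> Integer (pre-generics, 1.4-era style)
            for (int i = 0; i < docs.length; i++) {             // "map": emit (word, 1) pairs
                String[] words = docs[i].split(" ");
                for (int j = 0; j < words.length; j++) {
                    Integer c = (Integer) counts.get(words[j]); // "reduce": sum values per key
                    counts.put(words[j], new Integer(c == null ? 1 : c.intValue() + 1));
                }
            }
            for (Iterator it = counts.entrySet().iterator(); it.hasNext();) {
                Map.Entry e = (Map.Entry) it.next();
                System.out.println(e.getKey() + "\t" + e.getValue());
            }
        }
    }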
Re: Map-Reduce
On 05/08/2005, at 4:10 AM, Doug Cutting wrote: Doug Cutting wrote: Perhaps we need to factor Nutch into two projects, one with NDFS and MapReduce and the other with the search-specific code. This falls almost exactly on package lines. The packages org.apache.nutch.{io,ipc,fs,ndfs,mapred} are not dependent on the rest of Nutch. FYI, over on the nutch-dev list, I just proposed that we split these packages into a new project that Nutch then depends on, since there seems to be interest in using them independently of Nutch. Such a split probably wouldn't happen for at least a month. http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00312.html Awesome, thanks Doug! I really believe that having this out as a separate project will be more useful for everyone. This will also give more exposure to Nutch and Lucene as a whole, because people will experiment with the NDFS/MapReduce stuff first (it's a smaller thing to comprehend). cheers, Paul
Re: Considering lucene
This requirement is almost exactly the same as my requirement for the log4j project I work on, where I wanted to be able to index every row in a text log file as its own Document. It works fine, but treating each line as a Document turns out to take a while to index (searching is fantastic though, I have to say) due to the cost of adding a Document to an index. I don't think Lucene is currently tuned (or tunable) to that level of Document granularity, so it'll depend on your requirement for timeliness of the indexing. I was hoping (of course it's a big ask) to be able to index a million rows of relatively short lines of text (as log files tend to be) in a "few moments", no more than 1 minute, but even with pretty grunty hardware you run up against the bottleneck of the tokenization process (the StandardAnalyzer is not optimal at all in this case because of the way it 'signals' EOF with an exception). There was someone (apologies, I've forgotten his name, I blame the holiday I just came back from) who could take a relatively small file, such as an XML file, and very quickly index it for on-the-fly XPath-like queries using Lucene, which apparently works very well, but I'm not sure it scales to massive documents such as log files (and your requirements). cheers, Paul Smith On 30/09/2005, at 3:17 PM, <[EMAIL PROTECTED]> <[EMAIL PROTECTED]> wrote: Hi, My name is Palmik Bijani and I have recently started a new software company in India. After initial research, Lucene has surfaced as a leading contender for our needs. We have also purchased the Lucene book, which we are expecting in a couple of weeks. However, I was hoping to get an answer to the following, as we are unable to find this information in everything we have read so far on Lucene. We don't know if the book covers this requirement of ours. Our requirement is for row-based keyword search in a single very large text file which can potentially hold millions of rows (with delimited fields per row). In other words, we would like Lucene to filter and return only the row numbers within a file for the respective rows that hold the keywords we query for a particular field in each row. From everything we have seen so far, Lucene can handle a large set of files, tokenizes the keywords within each file, and returns the matching file name per keyword – but I have not seen anything about segmenting and searching by rows. From Lucene's context, one can think of each row as a separate file, field data within each row as document content, and each row number as the unique file name. From what I have read about how Lookoutsoft used Lucene for Outlook email searches, it seems to me that it should be possible, as fundamentally even email searching is row based. Is our requirement something that Lucene can inherently handle well, or would it require extensive tweaking and code changes on our end? Your response is greatly appreciated. Thank you, Palmik
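For concreteness, here is a hedged sketch of the line-per-Document approach discussed above, written against the 1.4-era Lucene API that appears elsewhere in this archive (Field.Keyword/Field.Text); the index path and field names are illustrative only:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    // One Document per log line; tokenization dominates the indexing cost.
    public class LogLineIndexer {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter("/tmp/logindex", new StandardAnalyzer(), true);
            writer.setMergeFactor(1000); // merge less often, batch-indexing style
            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            int lineNo = 0;
            String line;
            while ((line = in.readLine()) != null) {
                Document doc = new Document();
                doc.add(Field.Keyword("lineNo", String.valueOf(++lineNo))); // row number as the "file name"
                doc.add(Field.Text("text", line));                          // the row's content
                writer.addDocument(doc);
            }
            in.close();
            writer.optimize();
            writer.close();
        }
    }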
Re: Considering lucene
On 01/10/2005, at 6:30 AM, Erik Hatcher wrote: On Sep 30, 2005, at 1:26 AM, Paul Smith wrote: This requirement is almost exactly the same as my requirement for the log4j project I work on, where I wanted to be able to index every row in a text log file as its own Document. It works fine, but treating each line as a Document turns out to take a while to index (searching is fantastic though, I have to say) due to the cost of adding a Document to an index. I don't think Lucene is currently tuned (or tunable) to that level of Document granularity, so it'll depend on your requirement for timeliness of the indexing. There are several tunable indexing parameters that can help with batch indexing. By default it is mostly tuned for incremental indexing, but for rapid batch indexing you may need to tune it to merge less often. Yep, mergeFactor et al. We currently have it at 1000 (with 8 concurrent threads creating Project-based indices, so that could be 8000 open files during search, unless I'm mistaken), plus we've increased the value of maxBufferedDocs as per standard practice. I was hoping (of course it's a big ask) to be able to index a million rows of relatively short lines of text (as log files tend to be) in a "few moments", no more than 1 minute, but even with pretty grunty hardware you run up against the bottleneck of the tokenization process (the StandardAnalyzer is not optimal at all in this case because of the way it 'signals' EOF with an exception). Signals EOF with an exception? I'm not following that. Where does that occur? See our recent YourKit "sampling" profile export here: http://people.apache.org/~psmith/For%20Lucene%20list/IOExceptionProfiling.html This is a full production test run over 5 hours indexing 6.5 million records (approx 30 fields), running on dual P4 Xeon servers with 10K SCSI disks. You'll note that a good chunk (35%) of the time of the indexing thread is spent in 2 methods of the StandardTokenizerManager. When you look at the source code for these 2 methods, you will see that they rely on FastCharStream's use of an IOException to 'flag' EOF: if (charsRead == -1) throw new IOException("read past eof"); (line 72-ish) Of course, we _could_ always write our own analyzer, but it would be real nice if the out-of-the-box one was even better. There was someone (apologies, I've forgotten his name, I blame the holiday I just came back from) who could take a relatively small file, such as an XML file, and very quickly index it for on-the-fly XPath-like queries using Lucene, which apparently works very well, but I'm not sure it scales to massive documents such as log files (and your requirements). Wolfgang Hoschek and the NUX project may be what you're referring to. He contributed the MemoryIndex feature found under contrib/memory. I'm not sure that feature is a good fit for the log file or indexing files line-by-line though. Yes, Wolfgang's code is very cool, but would only work on small texts. cheers, Paul Smith
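For contrast, here is a hedged sketch, not Lucene's actual FastCharStream, of a buffered reader that signals EOF with a sentinel return value instead of constructing and throwing an IOException on every exhausted stream:

    import java.io.IOException;
    import java.io.Reader;

    // Illustration only: EOF reported as -1, so no exception object is built
    // and no stack trace is filled in once per tokenized Document.
    class SentinelCharStream {
        private final Reader input;
        private final char[] buffer = new char[2048];
        private int bufferLength = 0;   // number of valid chars in buffer
        private int bufferPosition = 0; // next char to hand out

        SentinelCharStream(Reader input) { this.input = input; }

        int readChar() throws IOException {
            if (bufferPosition >= bufferLength) {
                int charsRead = input.read(buffer, 0, buffer.length);
                if (charsRead <= 0) return -1; // EOF sentinel instead of throw
                bufferLength = charsRead;
                bufferPosition = 0;
            }
            return buffer[bufferPosition++];
        }
    }

Note that the real CharStream interface returns char, which is why the JavaCC-generated clients lean on the exception; a sentinel needs an int-returning API, which is part of why the change is invasive.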
Re: Float.floatToRawIntBits
I can confirm this takes ~20% of an overall indexing operation (see the attached link from YourKit): http://people.apache.org/~psmith/luceneYourkit.jpg Mind you, the whole "signalling via IOException" business in FastCharStream is a way bigger overhead, although I agree it is much harder to fix. Paul Smith On 17/11/2005, at 7:21 AM, Yonik Seeley wrote: Float.floatToRawIntBits (in Java 1.4) gives the raw float bits without normalization (like *(int*)&floatvar would in C). Since it doesn't do normalization of NaN values, it's faster (and hopefully optimized to a simple inline machine instruction by the JVM). On my Pentium4, using floatToRawIntBits is over 5 times as fast as floatToIntBits. That can really add up in something like Similarity.floatToByte() for encoding norms, especially if used as a way to compress an array of floats during query time, as suggested by Doug. -Yonik
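Yonik's claim is easy to sanity-check. Here is a hedged micro-benchmark sketch; the loop is deliberately naive, and absolute timings will vary by JVM and hardware:

    // Compares Float.floatToIntBits (normalizes NaNs) against
    // Float.floatToRawIntBits (raw bit pattern, Java 1.4+).
    public class FloatBitsBench {
        public static void main(String[] args) {
            float[] values = new float[1 << 20];
            for (int i = 0; i < values.length; i++) values[i] = i * 0.001f;
            int acc = 0; // accumulate so the JIT can't discard the loops
            long t0 = System.currentTimeMillis();
            for (int pass = 0; pass < 100; pass++)
                for (int i = 0; i < values.length; i++)
                    acc += Float.floatToIntBits(values[i]);
            long t1 = System.currentTimeMillis();
            for (int pass = 0; pass < 100; pass++)
                for (int i = 0; i < values.length; i++)
                    acc += Float.floatToRawIntBits(values[i]);
            long t2 = System.currentTimeMillis();
            System.out.println("floatToIntBits: " + (t1 - t0) + "ms, "
                + "floatToRawIntBits: " + (t2 - t1) + "ms (acc=" + acc + ")");
        }
    }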
Re: Float.floatToRawIntBits
On 17/11/2005, at 9:24 AM, Doug Cutting wrote: In general I would not take this sort of profiler output too literally. If floatToRawIntBits is 5x faster, then you'd expect a 16% improvement from using it, but my guess is you'll see far less. Still, it's probably worth switching & measuring as it might be significant. Yes, I don't think we'll get a 5x speed-up, as it will likely move the bottleneck back to the IO layer, but still... If you can reduce CPU usage, then multithreaded indexing operations can gain better CPU utilization (doing other stuff while waiting for IO). Seems like an easy win, and dead easy to unit test? I've been meaning to have a crack at reworking FastCharStream, but every time I start thinking about it I realise there is a bit of a dependency on this IOException signalling EOF, and I'm pretty sure it's going to be a much harder task. The JavaCC stuff is really designed for compiling trees, which is usually a 'once off' type usage, but Lucene's usage of it (large indexing operations) means the flaws in it are exacerbated. Paul
Re: Float.floatToRawIntBits
On 17/11/2005, at 10:21 AM, Chris Lamprecht wrote: 1. Run profiler 2. Sort methods by CPU time spent 3. Optimize 4. Repeat :) Umm, well I know I could make it quicker; it's just whether it still _works_ as expected. Maintaining the contract means I'll need to develop some good JUnit tests that I feel confident cover the current workings before making changes. That's the hard bit. Paul
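As an example of the kind of contract-pinning test meant here, a hedged JUnit 3 sketch, assuming the public FastCharStream in org.apache.lucene.analysis.standard behaves as the snippet quoted earlier in this thread suggests:

    import java.io.IOException;
    import java.io.StringReader;
    import junit.framework.TestCase;
    import org.apache.lucene.analysis.standard.FastCharStream;

    // Pins down the current EOF-by-exception behaviour before optimizing it away.
    public class FastCharStreamEofTest extends TestCase {
        public void testEofSignalledByIOException() throws Exception {
            FastCharStream in = new FastCharStream(new StringReader("ab"));
            assertEquals('a', in.readChar());
            assertEquals('b', in.readChar());
            try {
                in.readChar();
                fail("expected IOException at EOF");
            } catch (IOException expected) {
                // current contract: EOF is signalled by throwing
            }
        }
    }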
Re: NioFile cache performance
Most of the CPU time is actually used during the synchronization with multiple threads. I hacked together a version of MemoryLRUCache that used a ConcurrentHashMap from JDK 1.5, and it was another 50% faster! At a minimum, if the ReadWriteLock class was modified to use the 1.5 facilities, some significant additional performance gains should be realized. Would you be able to run the same test on JDK 1.4 but using the util.concurrent compatibility pack (supposedly the same classes as in Java 5)? It would be nice to verify whether the gain is the result of the different ConcurrentHashMap vs the different JDK itself. Paul Smith
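As a rough illustration of the experiment described, not the actual NioFile/MemoryLRUCache code, here is a hedged sketch of a ConcurrentHashMap-based block cache; note that it deliberately gives up strict LRU ordering (and uses a crude eviction), which is part of why the read path gets so much cheaper under many threads:

    import java.util.concurrent.ConcurrentHashMap;

    // JDK 1.5 sketch: reads never block on a global lock.
    class ConcurrentBlockCache {
        private final ConcurrentHashMap<Long, byte[]> cache = new ConcurrentHashMap<Long, byte[]>();
        private final int maxEntries;

        ConcurrentBlockCache(int maxEntries) { this.maxEntries = maxEntries; }

        byte[] get(long blockId) {
            return cache.get(blockId); // lock-free read path
        }

        void put(long blockId, byte[] block) {
            if (cache.size() >= maxEntries) cache.clear(); // crude eviction, illustration only
            cache.put(blockId, block);
        }
    }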
Re: "Advanced" query language
Hey all, I haven't been paying real close attention to this thread, but if any of you are looking for something that has _easy_ Object->XML->Object, you should seriously try XStream (http://xstream.codehaus.org). Simplest/easiest API I've seen. BSD licensed too (Apache friendly). One can register a Converter class to assist with anything the built-in converters don't handle well. The Converter code is nice and elegant. Just something to think about maybe? cheers, Paul On 22/12/2005, at 11:20 AM, Chris Hostetter wrote: I finally got a chance to look at this code today (the best part about the last day before vacation is no one expects you to get anything done, so you can ignore your "real work" and spend time on things that are more important in the long run), and while I still haven't wrapped my head around all of it, I wanted to share my thoughts so far on the API... 1) I applaud the pluggable nature of your solution. Looking at the Test Case, it is easy to see exactly how a service provider could do things like override the behavior of a given tag to be implemented as a SpanQuery without their clients being affected at all. Kudos. 2) Digging into what was involved in writing an ObjectBuilder, I found the API somewhat confusing. I was reminded of this exchange you had with Yonik... : > While SAX is fast, I've found callback interfaces : > more difficult to : > deal with while generating nested object graphs... : > it normally : > requires one to maintain state in stack(s). : : I've gone to some trouble to avoid the effects of this : on the programming model. As someone who feels very comfortable with Lucene, but has no practical experience with SAX, I have to say that I don't really feel like the API has a very clean separation from SAX. I think that the ideal API wouldn't require people writing ObjectBuilders to know anything about SAX, or to ever need to import anything from org.xml.** or javax.xml.** 3) While the *need* to maintain/pass state information should be avoided, I can definitely think of uses for this framework that may *want* to pass state information -- both down to the ObjectBuilders that get used in inner nodes, as well as up to wrapping nodes, and there doesn't seem to be an easy way to do that. (it could just be my lack of SAX knowledge though) The best example I can give is if someone (ie: me) wanted to use this framework to allow boolean queries to be written like this... "a phrase" fuzzy~ ...I want to be able to write a "BooleanClauseWrapperObjectBuilder" that can be wrapped around any other ObjectBuilder and will return whatever object it does, but will also check for an "occurs" attribute, and put that in a state bucket somewhere that the BooleanQuery has access to when adding the Query it gets back. Going the opposite direction, I'd like to be able to have tags that set state which is accessible to descendent tags (even if the tags in the middle don't know anything about that bit of state), for example: specifying how much slop should be used by default in phrase queries... ... How Now Brown Cow? ... I haven't had a chance to try implementing this, but at a high level, it seems like all of this should be possible and still easy to use. Here's a real rough cut at what I've had floating around in the back of my head (I'm doing this straight into email, pardon any typos or pseudo code) ...
/** could be implemented with SAX, or DOM, or Pull */
public interface LuceneXmlParser {
    /** this method will call setParser(this) on each handler */
    public void registerHandler(String tag, LuceneXmlHandler h);

    /** primary method for clients, parses the xml and calls processNode on the root node */
    public Query parse(InputStream xml);

    /** dispatches to the appropriate handler's process method based on the node name;
        may be called by handlers for recursion into children nodes */
    public Query processNode(LuceneXmlNode n, State s);
}

public interface LuceneXmlHandler {
    public void setParser(LuceneXmlParser p);

    /** should return a Query that corresponds to the specified node. may read/modify
        state in any way it wants ... it is recommended that all implementing methods
        wrap their state before passing it on when processing children. */
    public Query process(LuceneXmlNode n, State s);
}

/** A State is a stack frame that can delegate read operations to another State it
    wraps (if there is one), but it cannot delegate modifying operations. Classes
    implementing State should provide a constructor that takes another State to wrap. */
public interface State extends Map {
    /** for callers that want to know what's in the immediate stack frame without any delegation */
    public Map getOuterFrame();

    /* should return a new state that
Re: "Advanced" query language
On 03/01/2006, at 11:08 AM, markharw00d wrote: I thought you said you "didn't really want to have to design a general API for parsing XML as part of this project"? :) Having grown tired of messing with my own solution, I tried using Commons Digester with my example XML but ran into issues, so I'm back looking at a custom solution. Seriously... did you try out XStream? Digester is just too hard; XStream will work so easily you'll be pleasantly surprised. Paul
Re: JIRA html problems?
looks a bit b0rk3n to me as well. Maybe some text being displayed isn't being escaped properly, causing HTML mayhem? Paul Smith On 27/01/2006, at 8:12 AM, Yonik Seeley wrote: I've been getting bad HTML out of JIRA lately: http://issues.apache.org/jira/browse/LUCENE Anyone else? -Yonik
Re: Tree based BitSet (aka IntegerSet, DocSet...)
Unfortunately, the license distributed with the JAR (which we must assume takes precedence over whatever is stated on the web pages) is much more restrictive: it's the Java Research License, which specifically disallows any commercial use. So, short of reimplementing it from scratch, it's of no use except for academic study. Pity. Would that preclude re-implementing the same algorithm in new source code? I'm not clear on whether that violates the license. cheers, Paul Smith
Re: Tree based BitSet (aka IntegerSet, DocSet...)
No, I'm pretty sure it wouldn't, so long as you don't look at this code, lest you become "tainted" ... ;-) Isn't that where the phrase "I have no recollection of that, Senator" comes in handy? :) Paul
Re: Lucene and Java 1.5
On 31/05/2006, at 7:45 AM, Robert Engels wrote: Log4j can be configured to delegate to the standard 1.5 logging. In fact this is preferred, so that you have STANDARDIZED logging support (and not a different logger for every library). All NEW code should use the 1.5 logging for simplicity of configuration and for future ease of integration. Now, first off, I'm a log4j developer, so I thought I'd state that up front. Java's own logging is 'ok' - it does a job - but it is nowhere close to the production-quality control over logging that you really need. I would recommend you do not bind yourself to Java logging, but decide between JCL or log4j itself. I have a distinct anti-JCL feeling because of classloader-hell issues in the past, but I understand the latest version is a lot better, and many Apache projects are already linked against it. Before you make any decision, I'd sit down and plan what events you'll actually want to log and at what level. Good planning here will make logging in the Lucene library very useful. You can then decide how you're going to log them. cheers, Paul Smith
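To make the facade point concrete, here is a minimal hedged sketch of what library-side logging through Jakarta Commons Logging looks like; the class and the event being logged are hypothetical, chosen only to show the pattern of binding to the facade while deployers pick log4j, java.util.logging, etc. at runtime:

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;

    // Hypothetical example class; Lucene does not actually contain this.
    class SegmentMergeLoggingExample {
        private static final Log LOG = LogFactory.getLog(SegmentMergeLoggingExample.class);

        void merge(int segmentCount) {
            if (LOG.isDebugEnabled()) { // guard so the message string isn't built when disabled
                LOG.debug("merging " + segmentCount + " segments");
            }
        }
    }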
Re: svn commit: r437897 [1/2] - in /lucene/java/trunk: ./ src/java/org/apache/lucene/index/ src/java/org/apache/lucene/store/ src/test/org/apache/lucene/store/
Added: lucene/java/trunk/src/java/org/apache/lucene/index/IndexWriter.java.orig lucene/java/trunk/src/java/org/apache/lucene/index/doron_2_IndexWriter.patch (with props) Just in case this was accidental: were the .orig and .patch files meant to be added to the repo? cheers, Paul Smith
Re: Is it safe to remove the throw from FastCharStream.refill() ?
The throwing of an exception by this class is still being done on the Java side at this stage IIRC, and throwing exceptions is extremely bad for performance in Java. However, I think the client of the class (one of the Filters, I think) is expecting the EOF exception as a signal that it has reached the end of the stream, from a tokenization point of view. I would love to get rid of it, but I think it will break a lot of behaviour. cheers, Paul Smith On 04/10/2006, at 11:48 AM, George Aroush wrote: Hi folks, Over at Lucene.Net, we are trying to determine if it's safe to do the following change: http://issues.apache.org/jira/browse/LUCENENET-8 Can you tell us, if this change is done on the Java Lucene code, how it will affect Lucene? Do you expect it to run faster, but more importantly, is it safe? Thanks. -- George Aroush
Re: Is it safe to remove the throw from FastCharStream.refill() ?
On 05/10/2006, at 3:34 PM, Doron Cohen wrote: If I read the JIRA issue right, it looks as if this is fixed in Lucene 2.0.1. Is it? If so, where can I download 2.0.1? No 2.0.1 has been released (yet). This issue is fixed in the svn head. Nightly builds that include this (and other things) are found in http://people.apache.org/dist/lucene/java/nightly/ Be aware that these are not announced releases, just nightly builds. But that issue addresses a performance problem with IndexWriter, not the underlying FastCharStream+IOException==bad problem; I'm pretty sure that's still there. cheers, Paul
Re: ThreadLocal leak (was Re: Leaking org.apache.lucene.index.* objects)
On 16/12/2006, at 6:15 PM, Otis Gospodnetic wrote: Moving to java-dev, I think this belongs here. I've been looking at this problem some more today and reading about ThreadLocals. It's easy to misuse them and end up with memory leaks, apparently... and I think we may have this problem here. The problem here is that ThreadLocals are tied to Threads, and I think the assumption in TermInfosReader and SegmentReader is that (search) Threads are short-lived: they come in, scan the index, do the search, return and die. In this scenario, their ThreadLocals go to heaven with them, too, and memory is freed up. Otis, we have an index server running inside Tomcat, where an Application instance makes a search request via a vanilla HTTP post, so our connector threads definitely do stay alive for quite a while. We're using Lucene 2.0, and our index server is THE most stable of all our components, up for over a month (before being taken down for updates), searching hundreds of variously sized indexes of up to 7Gb, serving 1-2 requests/second during peak usage. No memory leak spotted at our end, but I'm watching this thread with interest! :) cheers, Paul Smith
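For readers unfamiliar with the lifetime issue Otis describes, a minimal self-contained illustration (not Lucene's code): the per-thread value stays strongly reachable for as long as the thread itself lives, which with pooled server threads can be effectively forever:

    // Pre-generics (1.4-era) style to match the code under discussion.
    class ThreadLocalHolder {
        private static final ThreadLocal BUFFERS = new ThreadLocal() {
            protected Object initialValue() {
                return new byte[1024 * 1024]; // reachable until this Thread dies
            }
        };

        static byte[] buffer() {
            return (byte[]) BUFFERS.get(); // same array every time on the same thread
        }
    }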
Large scale sorting
rtHitQueue et al, and I realize that the use of NIO immediately pins Lucene to Java 1.4, so I'm sure this is controversial. But, if we wish Lucene to go beyond where it is now, I think we need to start thinking about this particular problem sooner rather than later. Happy Easter to all, Paul Smith
Re: Large scale sorting
On 10/04/2007, at 4:18 AM, Doug Cutting wrote: Paul Smith wrote: Disadvantages to this approach: * It's a lot more I/O intensive I think this would be prohibitive. Queries matching more than a few hundred documents will take several seconds to sort, since random disk accesses are required per matching document. Such an approach is only practical if you can guarantee that queries match fewer than a hundred documents, which is not generally the case, especially with large collections. I don't disagree with the premise that it involves substantial I/O and would increase the time taken to sort, and that's why this approach shouldn't be the default mechanism, but it's not too difficult to build a disk I/O subsystem that can allocate many spindles to service this and to allow the underlying OS to use its buffer cache (yes, this is sounding like a database server now, isn't it). I'm working on the basis that it's a LOT harder/more expensive to simply allocate more heap size to cover the current sorting infrastructure. One hits memory limits faster. Not everyone can afford 64-bit hardware with many Gb of RAM to allocate to a heap. It _is_ cheaper/easier to build a disk subsystem to tune this I/O approach, and one can still use any RAM as buffer cache for the memory-mapped file anyway. In my experience, raw search time starts to climb towards one second per query as collections grow to around 10M documents (in round figures and with lots of assumptions). Thus, searching on a single CPU is less practical as collections grow substantially larger than 10M documents, and distributed solutions are required. So it would be convenient if sorting is also practical for ~10M document collections on standard hardware. If 10M strings with 20 characters are required in memory for efficient search, this requires 400MB. This is a lot, but not an unusual amount on today's machines. However, if you have a large number of fields, then this approach may be problematic and force you to consider a distributed solution earlier than you might otherwise. 400MB is not a lot in and of itself, but when one has many of these types of indexes, with many sorting fields in many locales on the same host, it becomes problematic. I'm sure there's a point where distributing doesn't work over really large collections, because even if one partitioned an index across many hosts, one still needs to merge-sort the results together. It would be disappointing if Lucene's innate design limited itself to 10M document collections before needing to consider distributed solutions. 10M is not that many. It would be better if the sorting mechanism in Lucene was a little more decoupled, such that more customised designs could be utilised for specific scenarios. Right now it's a one-size-fits-all approach that can't be changed without substantial gutting of the code. cheers, Paul
Re: Large scale sorting
Now, if we could use integers to represent the sort field values, which is typically the case for most applications, maybe we can afford to have the sort field values stored on disk and do a disk lookup for each document matched? The lookup of the sort field value will be as simple as docNo * 4 + offset. This way, we use the same approach as constructing the norms (proper merging for incremental indexing), but at search time we don't load the sort field values into memory; instead, we just store them on disk. Will this approach be good enough? While a nifty idea, I think this only works for a single sort locale. I initially came up with a similar idea: the terms are already stored in 'sorted' order, so one might be able to use a term's position for sorting; it's just that a term's ordering position is different in different locales. Paul
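To make the proposal concrete, here is a hedged sketch of the fixed-width, disk-resident lookup being described, using an NIO memory-mapped file so the OS buffer cache does the caching; the file name, layout, and class are assumptions for illustration only:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    // One 4-byte int sort key per document, addressed as docNo * 4.
    class MappedSortKeys {
        private final MappedByteBuffer buf;

        MappedSortKeys(String path, int numDocs) throws IOException {
            RandomAccessFile raf = new RandomAccessFile(path, "r");
            buf = raf.getChannel().map(FileChannel.MapMode.READ_ONLY, 0, 4L * numDocs);
        }

        int sortKey(int docNo) {
            return buf.getInt(docNo * 4); // fixed-width lookup, no per-document object
        }
    }

As the reply above notes, a separate file like this would be needed per sort locale, since a term's collation order changes with the locale.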
Re: Large scale sorting
In our application, we have to sync up the index pretty frequently, and the warm-up of the index is killing it. Yep, it speeds up the first sort, but at the cost of making all the others slower (maybe significantly so). That's obviously not ideal, but it could make the use of sorts on larger indexes practical. To address your concern about the single sort locale, what about creating a sort field for each sort locale? So, if you have, say, 10 locales, you will have 10 sort fields, each utilizing the mechanism of constructing the norms. I really don't understand norms properly, so I'm not sure exactly how that would help. I'll have to go over your original email again to understand. My main goal is to get some discussion going amongst the community, which hopefully we've kicked along. Paul
Re: Large scale sorting
A memory-saving optimization would be to not load the corresponding String[] in the string index (as discussed previously), but there is currently no way to tell the FieldCache that the strings are unneeded. The String values are only needed for merging results in a MultiSearcher. Yep, which happens all the time for us specifically, because we have an 'archive' and a 'week' index. The week index is merged once per week, so the search is always a merged sort across the two. (The week index is reloaded every 5 seconds or so; the archive index is kept in memory once loaded.)
Re: Tests, Contribs, and Releases
Does Lucene have a Gump run descriptor? That's quite useful for tracking this sort of thing too. It's very good at nagging! :) The standard Maven assembly packaging runs the unit tests by default too. Changing the Lucene build system to Maven is not something you'd want to jump at without careful thought, but it might be worth considering. I used to be anti-Maven, but since version 2, and since Curt Arnold has been setting up the log4j build environment for Maven, I've been quite impressed with its capability. cheers, Paul Smith On 17/05/2007, at 8:02 AM, Chris Hostetter wrote: Hey everybody, this thread has been sitting in my inbox for a while waiting for me to have a few minutes to look into it... http://www.nabble.com/Packaging-Lucene-2.1.0-for-Debian--found-2-junit-errors-tf3571676.html In a nutshell, when a guy from Debian went looking to package Lucene, he noticed that the official 2.1.0 release contained 2 test failures -- one each in the highlighter and spellchecker contribs. The specifics of the test failures don't really interest me as much as the question: how did we manage to have a release with test failures? A few things have jumped out at me while looking into this... 1) the task "build-contrib" can be used to walk the contrib directory building each contrib; the task "test-contrib" can be used to walk the contrib directory testing each contrib. 2) the "test" task only tests the lucene-core ... it does not depend on (or call) "test-contrib". 3) The "nightly" build task depends on the "test" and "package-tgz" tasks (the latter depends on "build-contrib"), but at no point is "test-contrib" run. 4) The steps for creating an official release... http://wiki.apache.org/lucene-java/ReleaseTodo ...specify using the "dist" and "dist-src" tasks -- neither of which depends on *ANY* tests being run (let alone contrib tests). This seems very strange to me ... I would think that we would want: a) nightly builds to run the tests for all contribs, ala... b) the release instructions to make it clear that all unit tests (core and contrib) should be run prior to creating the distribution. Does anyone see any reason not to make these changes? -Hoss
Re: Tests, Contribs, and Releases
To answer your question, though, I don't see any reason not to make the changes to make the current process more repeatable. Yeah, modifying the ant process now is going to be simpler for catching the current problem. Still, I'd check the Gump stuff for Lucene, because I'd be surprised if that wouldn't have caught this and continuously nagged you all to get it fixed.. :) Paul
Re: How to handle servlet-api.jar in build?
On 12/06/2007, at 5:09 PM, markharw00d wrote: As part of the documentation push, I was considering putting together an updated demo web app which showed a number of things (indexing, search, highlighting, XML Query templates etc) and was wondering what that might mean for the build system if I was dependent on the servlet API. Are there any licence concerns around handling servlet-api.jar that I should be aware of? I know the Apache foundation does not like linking to non-Apache code. You should be fine on this, since many Apache apps need to reference the servlet-api and jsp-api jars (JSTL for a start..). I just don't think you can 'package' up a distribution that includes these jars. That is, a downloaded unit from Apache can't include that jar in the distribution. The log4j projects I work on reference quite a few non-ASL-licensed things, and as long as you can build a distribution environment that requires the user to download those (and agree to any licensing bits and bobs), you should be fine. This is where Maven is cool... cheers, Paul
Re: How to handle servlet-api.jar in build?
On 12/06/2007, at 7:07 PM, mark harwood wrote: Thanks for the pointers, Paul. I just don't think you can 'package' up a distribution that includes these jars in your distribution. Clearly the binary distribution need not bundle servlet-api.jar - a demo.war file is all that is needed. However, is the source distribution exempt from this restriction? It would be convenient if the build.xml "just worked", referencing our included copy of servlet-api.jar rather than requiring the user to configure build.properties etc. to point to their copy of the API. If this bundling was an issue, would an acceptable solution be to have an ANT task to download the servlet-api.jar from a Sun server? From other people's posts it sounds like the servlet spec jars have a compatible license (quite odd of Sun to do that! :) ), so perhaps my comments are more relevant to other licensed jars, such as LGPL projects. The ant idea of automatically getting the file if not provided is a good one; I've done that before. Having said that, this is exactly what Maven can do for you with its dependency management. (I was anti-Maven for quite a while, but now I'm converted.) Paul
Re: Lucene upload to Maven 2 repository
On 19/06/2007, at 6:14 AM, Michael Busch wrote: Hello, looking at JIRA and the email archives, I find several people asking us to upload Lucene to the Maven 2 repository. Currently there are only the artifacts from Lucene core 1.9.1 and 2.0.0 in the repository. 1.9.1 is even incomplete, as LUCENE-867 indicates. Therefore I ported the maven patch (LUCENE-622) back to the releases 1.9.1, 2.0.0, and 2.1.0, added LICENSE.txt and NOTICE.txt to the jars, and generated all maven artifacts for those releases (core + contribs). I uploaded everything for review to the staging area: http://people.apache.org/~buschmi/staging_area/maven/. I made a few tests and the artifacts seem to be fine. I intend to upload everything to the maven repository when I officially release 2.2, unless there are objections. Any chance of adding source jars as artifacts too? It makes the Maven Eclipse plugin rather nice. I appreciate the effort in organizing the artifacts (particularly the older versions). cheers, Paul
Re: Lucene upload to Maven 2 repository
I'm just kidding, of course! I'll try to take a look at that. However, making these artifacts was already a lot of work, and I'm not sure how soon I can work on the source artifacts. I might try and grab the trunk and see if I can work out what's needed to do that.. Paul
Re: Lucene upload to Maven 2 repository
quick check: I haven't tried the maven build system for lucene yet, but getting a clean trunk and doing this: mvn -f lucene-parent-pom.xml -Dversion=2.2 install It appears to be ignoring the version property: Installing /workspace/lucene-svn/lucene-parent-pom.xml to /Users/paulsmith/.m2/repository/org/apache/lucene/lucene-parent/@version@/lucene-parent-@version@.pom Am I missing something? Paul On 19/06/2007, at 10:15 AM, Michael Busch wrote: Paul Smith wrote: I might try and grab the trunk and see if I can work out what's needed to do that.. Paul That'd be great! In particular we need to figure out which changes to the pom.xml files are necessary. - Michael
Re: Lucene upload to Maven 2 repository
Attached is a quick patch for the lucene-core pom so that it does compile and package successfully: mvn -f lucene-core.pom.xml package ends up with a binary jar in the target/ sub-folder, and mvn assembly:assembly creates a source distribution in the target folder too. I'm assuming Lucene requires 1.4-compiled code (the Maven default is 1.3). We possibly need to refactor some of this pom, eventually, into the parent pom, but it does build cleanly, other than the weird @version@ still appearing. I think that is because you are using ant as the primary build mechanism and forking some 'mavenness'. We've been mavenizing the log4j project, so I'm gaining some experience with this sort of stuff. cheers, Paul On 19/06/2007, at 10:40 AM, Michael Busch wrote: Paul Smith wrote: quick check, I haven't tried the maven build system for lucene yet, but getting a clean trunk, and doing this: mvn -f lucene-parent-pom.xml -Dversion=2.2 install It appears to be ignoring the version property: Installing /workspace/lucene-svn/lucene-parent-pom.xml to /Users/paulsmith/.m2/repository/org/apache/lucene/lucene-parent/@version@/lucene-parent-@version@.pom Am I missing something? Yes... there is no maven build system for Lucene yet ;-) We actually build with ant and use the maven-ant-tasks to deploy the m2 artifacts. The pom.xml files in trunk are templates. The ant target "generate-maven-artifacts" takes those templates, replaces the @version@ with the actual version number, and creates a maven dist directory where it deploys all the artifacts for the ibiblio upload. - Michael
Re: Lucene upload to Maven 2 repository
Enhanced version of the previous patch. It now compiles and executes all unit tests (although some of them are failing for me): mvn -f lucene-core.pom.xml test You can still do a package (including the source distro) and skip the tests: mvn -f lucene-core-pom.xml -Dmaven.test.skip=true package assembly:assembly cheers, Paul
Re: Lucene upload to Maven 2 repository
*sigh*, with attachment this time: On 19/06/2007, at 11:42 AM, Paul Smith wrote: Enhanced version of the previous patch. It now compiles and executes all unit tests (although some of them are failing for me): mvn -f lucene-core.pom.xml test You can still do a package (including the source distro) and skip the tests: mvn -f lucene-core-pom.xml -Dmaven.test.skip=true package assembly:assembly cheers, Paul
Re: Lucene upload to Maven 2 repository
On 19/06/2007, at 9:58 AM, Michael Busch wrote: Paul Smith wrote: Any chance of adding source jars as artifacts too? Makes the Maven Eclipse plugin rather nice. I appreciate the effort in organizing the artifacts (particularly the older versions). cheers, Paul In German we have a saying, something like "Offer them your pinky, and they rip off your whole hand." ;-) I'm just kidding, of course! I'll try to take a look at that. However, making these artifacts was already a lot of work and I'm not sure how soon I can work on the source artifacts. Incidentally, for those wanting some links to refer to, the maven-assembly-plugin docs are here: http://maven.apache.org/plugins/maven-assembly-plugin/ Probably the best reference in this case is this one: http://maven.apache.org/plugins/maven-assembly-plugin/descriptor-refs.html cheers, Paul
maven snapshots available for 2.3?
I'm thinking no, but just in case: are Lucene 2.3 snapshots published anywhere, or should I build one locally? More broadly, is there any plan to fully mavenize the Lucene trunk? I'm not sure if anyone has had a chance to look at the patch I supplied a while back with a pom.xml that appeared to build and test OK. I'm happy to pitch in here. cheers, Paul Smith
[jira] Commented: (LUCENE-467) Use Float.floatToRawIntBits over Float.floatToIntBits
[ http://issues.apache.org/jira/browse/LUCENE-467?page=comments#action_12357839 ]

Paul Smith commented on LUCENE-467:
-----------------------------------

I probably didn't make my testing framework as clear as I should have. YourKit was set up to use method sampling (waking up every X milliseconds). I wouldn't use the 20% as an 'accurate' figure, but suffice to say that improving this method would 'certainly' improve things. Only testing the way you have will flush out the correct numbers. We don't use -server (due to some Linux vagaries we've been careful with -server because of some stability problems).

> Use Float.floatToRawIntBits over Float.floatToIntBits
> ------------------------------------------------------
>
> Key: LUCENE-467
> URL: http://issues.apache.org/jira/browse/LUCENE-467
> Project: Lucene - Java
> Type: Improvement
> Components: Other
> Versions: 1.9
> Reporter: Yonik Seeley
> Priority: Minor
>
> Copied From my Email:
> Float.floatToRawIntBits (in Java 1.4) gives the raw float bits without
> normalization (like *(int*)&floatvar would in C). Since it doesn't do
> normalization of NaN values, it's faster (and hopefully optimized to a
> simple inline machine instruction by the JVM).
> On my Pentium4, using floatToRawIntBits is over 5 times as fast as
> floatToIntBits.
> That can really add up in something like Similarity.floatToByte() for
> encoding norms, especially if used as a way to compress an array of
> float during query time as suggested by Doug.
[jira] Commented: (LUCENE-467) Use Float.floatToRawIntBits over Float.floatToIntBits
[ http://issues.apache.org/jira/browse/LUCENE-467?page=comments#action_12357925 ]

Paul Smith commented on LUCENE-467:
-----------------------------------

If you can create a patch against 1.4.3, there is a reasonable possibility that I could create a 1.4.3 Lucene+ThisPatch jar and re-index in our test environment that was the source of the YourKit graph I provided earlier. This should reflect how useful the change might be against a decent baseline.

> Use Float.floatToRawIntBits over Float.floatToIntBits
> ------------------------------------------------------
>
> Key: LUCENE-467
> URL: http://issues.apache.org/jira/browse/LUCENE-467
> Project: Lucene - Java
> Type: Improvement
> Components: Other
> Versions: 1.9
> Reporter: Yonik Seeley
> Priority: Minor
[jira] Commented: (LUCENE-388) [PATCH] IndexWriter.maybeMergeSegments() takes lots of CPU resources
[ http://issues.apache.org/jira/browse/LUCENE-388?page=comments#action_12427818 ]

Paul Smith commented on LUCENE-388:
-----------------------------------

geez, yep, definitely don't put this in; my patch was only a 'suggestion' to highlight how it fixes the root cause of the problem. It is interesting that, originally, all the test cases still pass, yet the problems Yonik highlights are real. Might warrant some extra test cases to cover exactly those situations, even if this problem is not addressed. Be great if this could be fixed completely though, but I haven't got any headspace left to continue research on this one.. sorry :(

> [PATCH] IndexWriter.maybeMergeSegments() takes lots of CPU resources
> ---------------------------------------------------------------------
>
> Key: LUCENE-388
> URL: http://issues.apache.org/jira/browse/LUCENE-388
> Project: Lucene - Java
> Issue Type: Bug
> Components: Index
> Affects Versions: CVS Nightly - Specify date in submission
> Environment: Operating System: Mac OS X 10.3
> Platform: Macintosh
> Reporter: Paul Smith
> Attachments: IndexWriter.patch, log-compound.txt, log.optimized.deep.txt, log.optimized.txt, Lucene Performance Test - with & without hack.xls, lucene.34930.patch
>
> Note: I believe this to be the same situation with 1.4.3 as with SVN HEAD.
> Analysis using the hprof utility during index creation with many documents highlights that the CPU spends a large portion of its time in IndexWriter.maybeMergeSegments(), which seems to be a 'waste' compared with other valuable CPU-intensive operations such as tokenization etc.
> Using the following test snippet to retrieve some rows from the db and create an index:
>
>     Analyzer a = new StandardAnalyzer();
>     writer = new IndexWriter(indexDir, a, true);
>     writer.setMergeFactor(1000);
>     writer.setMaxBufferedDocs(1);
>     writer.setUseCompoundFile(false);
>     connection = DriverManager.getConnection("jdbc:inetdae7:tower.aconex.com?database=", "secret", "squirrel");
>     String sql = "select userid, userfirstname, userlastname, email from userx";
>     LOG.info("sql=" + sql);
>     Statement statement = connection.createStatement();
>     statement.setFetchSize(5000);
>     LOG.info("Executing sql");
>     ResultSet rs = statement.executeQuery(sql);
>     LOG.info("ResultSet retrieved");
>     int row = 0;
>     LOG.info("Indexing users");
>     long begin = System.currentTimeMillis();
>     while (rs.next()) {
>         int userid = rs.getInt(1);
>         String firstname = rs.getString(2);
>         String lastname = rs.getString(3);
>         String email = rs.getString(4);
>         String fullName = firstname + " " + lastname;
>         Document doc = new Document();
>         doc.add(Field.Keyword("userid", userid + ""));
>         doc.add(Field.Keyword("firstname", firstname.toLowerCase()));
>         doc.add(Field.Keyword("lastname", lastname.toLowerCase()));
>         doc.add(Field.Text("name", fullName.toLowerCase()));
>         doc.add(Field.Keyword("email", email.toLowerCase()));
>         writer.addDocument(doc);
>         row++;
>         if ((row % 100) == 0) {
>             LOG.info(row + " indexed");
>         }
>     }
>     double end = System.currentTimeMillis();
>     double diff = (end - begin) / 1000;
>     double rate = row / diff;
>     LOG.info("rate:" + rate);
>
> On my 1.5GHz PowerBook with 1.5Gb RAM and a 5400 RPM drive, my CPU is maxed out, and I end up getting a rate of indexing between 490-515 documents/second, run over 10 times in succession.
> By applying a simple patch to IndexWriter (see attached shortly), which defers the calling of maybeMergeSegments() so that it is only called every 2000 times (an arbitrary figure), I appear to get a new rate of between 945-970 documents/second.
> Using Luke to look inside each index created between these 2, there does not appear to be any difference. Same number of Documents, same number of Terms.
> I'm not suggesting one should apply this patch, I'm just highlighting the difference in performance that this sort of change gives you.
> We are about to use Lucene to index 4 million construction document records, and so speedi
[jira] Commented: (LUCENE-388) [PATCH] IndexWriter.maybeMergeSegments() takes lots of CPU resources
[ http://issues.apache.org/jira/browse/LUCENE-388?page=comments#action_12427975 ] Paul Smith commented on LUCENE-388: --- This is where some tracing logging code would be useful. Maybe a YourKit memory snapshot to see what's going on.. ? I can't see how Yonik's patch could influence the memory profile. It's just delaying the check for merging until an appropriate time, and should not be removing opportunities to merge segments. I can't see why checking less often uses more memory. Obviously something strange is happening. > [PATCH] IndexWriter.maybeMergeSegments() takes lots of CPU resources > > > Key: LUCENE-388 > URL: http://issues.apache.org/jira/browse/LUCENE-388 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: CVS Nightly - Specify date in submission > Environment: Operating System: Mac OS X 10.3 > Platform: Macintosh >Reporter: Paul Smith > Assigned To: Yonik Seeley > Attachments: IndexWriter.patch, log-compound.txt, > log.optimized.deep.txt, log.optimized.txt, Lucene Performance Test - with & > without hack.xls, lucene.34930.patch, yonik_indexwriter.diff > > > Note: I believe this to be the same situation with 1.4.3 as with SVN HEAD. > Analysis using the hprof utility during index creation with many documents > highlights that the CPU spends a large portion of its time in > IndexWriter.maybeMergeSegments(), which seems to be a 'waste' compared with > other valuable CPU-intensive operations such as tokenization etc. > Using the following test snippet to retrieve some rows from the db and create > an index: > Analyzer a = new StandardAnalyzer(); > writer = new IndexWriter(indexDir, a, true); > writer.setMergeFactor(1000); > writer.setMaxBufferedDocs(1); > writer.setUseCompoundFile(false); > connection = DriverManager.getConnection( > "jdbc:inetdae7:tower.aconex.com?database=", "secret", > "squirrel"); > String sql = "select userid, userfirstname, userlastname, email from > userx"; > LOG.info("sql=" + sql); > Statement statement = connection.createStatement(); > statement.setFetchSize(5000); > LOG.info("Executing sql"); > ResultSet rs = statement.executeQuery(sql); > LOG.info("ResultSet retrieved"); > int row = 0; > LOG.info("Indexing users"); > long begin = System.currentTimeMillis(); > while (rs.next()) { > int userid = rs.getInt(1); > String firstname = rs.getString(2); > String lastname = rs.getString(3); > String email = rs.getString(4); > String fullName = firstname + " " + lastname; > Document doc = new Document(); > doc.add(Field.Keyword("userid", userid+"")); > doc.add(Field.Keyword("firstname", firstname.toLowerCase())); > doc.add(Field.Keyword("lastname", lastname.toLowerCase())); > doc.add(Field.Text("name", fullName.toLowerCase())); > doc.add(Field.Keyword("email", email.toLowerCase())); > writer.addDocument(doc); > row++; > if((row % 100)==0){ > LOG.info(row + " indexed"); > } > } > double end = System.currentTimeMillis(); > double diff = (end-begin)/1000; > double rate = row/diff; > LOG.info("rate:" +rate); > On my 1.5GHz PowerBook with 1.5GB RAM and a 5400 RPM drive, my CPU is maxed > out, and I end up getting an indexing rate of between 490-515 documents/second, > run over 10 times in succession. > By applying a simple patch to IndexWriter (see attached shortly), which defers > the calling of maybeMergeSegments() so that it is only called every 2000 > times (an arbitrary figure), I appear to get a new rate of between 945-970 > documents/second. 
Using Luke to look inside each index created between these 2, > there does not appear to be any difference. Same number of Documents, same > number of Terms. > I'm not suggesting one should apply this patch, I'm just highlighting the > difference in performance that this sort of change gives you. > We are about to use Lucene to index 4 million construction document records, > and so speed is important to us.
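For anyone skimming the issue, this is roughly what the deferral amounts to. A hedged sketch only, not the attached diff: the addCount counter, the MERGE_CHECK_INTERVAL constant and the bufferDocument() helper are hypothetical stand-ins for the real IndexWriter internals.
{code}
// Sketch only (names hypothetical): defer the per-add merge check so the
// linear segment scan inside maybeMergeSegments() runs once every N
// additions instead of on every addDocument() call.
private int addCount = 0;
private static final int MERGE_CHECK_INTERVAL = 2000; // arbitrary, as noted above

public void addDocument(Document doc) throws IOException {
  bufferDocument(doc);                    // stand-in for the real add path
  if (++addCount % MERGE_CHECK_INTERVAL == 0) {
    maybeMergeSegments();                 // previously invoked on every add
  }
}
{code}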
[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene
[ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436437 ] Paul Smith commented on LUCENE-675: --- If you're looking for freely available text in bulk, what about: http://www.gutenberg.org/wiki/Main_Page > Lucene benchmark: objective performance test for Lucene > --- > > Key: LUCENE-675 > URL: http://issues.apache.org/jira/browse/LUCENE-675 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Andrzej Bialecki > Attachments: LuceneBenchmark.java > > > We need an objective way to measure the performance of Lucene, both indexing > and querying, on a known corpus. This issue is intended to collect comments > and patches implementing a suite of such benchmarking tests. > Regarding the corpus: one of the widely used and freely available corpora is > the original Reuters collection, available from > http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz > or > http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. > I propose to use this corpus as a base for benchmarks. The benchmarking > suite could automatically retrieve it from known locations, and cache it > locally. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene
[ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436443 ] Paul Smith commented on LUCENE-675: --- From a strict performance point of view, a standard set of documents is important, but don't forget other languages. From a tokenization point of view (separate to this issue), perhaps the Gutenberg project would be useful for testing the correctness of the analysis phase. > Lucene benchmark: objective performance test for Lucene > --- > > Key: LUCENE-675 > URL: http://issues.apache.org/jira/browse/LUCENE-675 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Andrzej Bialecki > Attachments: LuceneBenchmark.java > > > We need an objective way to measure the performance of Lucene, both indexing > and querying, on a known corpus. This issue is intended to collect comments > and patches implementing a suite of such benchmarking tests. > Regarding the corpus: one of the widely used and freely available corpora is > the original Reuters collection, available from > http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz > or > http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. > I propose to use this corpus as a base for benchmarks. The benchmarking > suite could automatically retrieve it from known locations, and cache it > locally. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
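To make the corpus suggestion concrete, here is a rough sketch of the kind of timing loop such a benchmark could run over a directory of plain-text Gutenberg dumps. It is a hedged illustration, not the attached LuceneBenchmark.java; the paths are hypothetical.
{code}
import java.io.File;
import java.io.FileReader;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class CorpusIndexTimer {
  public static void main(String[] args) throws Exception {
    // Hypothetical paths; assumes a directory of plain-text corpus files.
    IndexWriter writer =
        new IndexWriter("/tmp/bench-index", new StandardAnalyzer(), true);
    long begin = System.currentTimeMillis();
    int count = 0;
    for (File f : new File("/corpus/gutenberg").listFiles()) {
      Document doc = new Document();
      doc.add(new Field("contents", new FileReader(f))); // tokenized, unstored
      writer.addDocument(doc);
      count++;
    }
    writer.close();
    double secs = (System.currentTimeMillis() - begin) / 1000.0;
    System.out.println("rate: " + (count / secs) + " docs/sec");
  }
}
{code}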
[jira] Commented: (LUCENE-833) Indexing of Subversion Repositories.
[ https://issues.apache.org/jira/browse/LUCENE-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12481324 ] Paul Smith commented on LUCENE-833: --- You should try Fisheye! It uses Lucene internally. http://www.cenqua.com/fisheye > Indexing of Subversion Repositories. > > > Key: LUCENE-833 > URL: https://issues.apache.org/jira/browse/LUCENE-833 > Project: Lucene - Java > Issue Type: New Feature > Components: Index >Reporter: Matt Inger > > It would be a big help if Lucene had the ability to index Subversion (or CVS, > or whatever) repositories, including revision history. > Searches (beyond basic text of the source code) might include: > path:/branches/mybranch Foo > history:Foo -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1741) Make MMapDirectory.MAX_BBUF user configureable to support chunking the index files in smaller parts
[ https://issues.apache.org/jira/browse/LUCENE-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730551#action_12730551 ] Paul Smith commented on LUCENE-1741: A self-tuning algorithm is nice when no specific settings have been given, but in an environment where large indexes are opened more frequently than in the common use cases, the memory layer hits OOM conditions too often, forcing excessive GC activity to complete the operation. I'd vote for checking whether settings have been requested and using them, and if none are set, relying on a self-tuning algorithm. In a really long-running application, the process address space may become more and more fragmented, and the malloc library may not be able to defragment it, so the auto-tuning is nice, but it may not be great for everyone's needs. For example, our specific use case (crazy as this may be) is to have many different indexes open at any one time, closing and opening them frequently (the Realtime Search stuff we are following very closely indeed.. :) ). I'm just thinking that our VM (64-bit) may find it difficult to find the contiguous non-heap space for the MMap operation after many days/weeks in operation. Maybe I'm just paranoid. But for operational purposes, it'd be nice to know we could change the setting based on our observations. thanks! > Make MMapDirectory.MAX_BBUF user configureable to support chunking the index > files in smaller parts > --- > > Key: LUCENE-1741 > URL: https://issues.apache.org/jira/browse/LUCENE-1741 > Project: Lucene - Java > Issue Type: Improvement >Affects Versions: 2.9 >Reporter: Uwe Schindler >Assignee: Uwe Schindler >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1741.patch, LUCENE-1741.patch > > > This is a followup for the java-user thread: > http://www.lucidimagination.com/search/document/9ba9137bb5d8cb78/oom_with_2_9#9bf3b5b8f3b1fb9b > It is easy to implement, just add a setter method for this parameter to > MMapDir. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
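For illustration, this is roughly how an operator-facing setting might be used once the patch lands. A hedged sketch only: the setMaxChunkSize() name and the 256 MB figure are assumptions based on the discussion, not the committed API.
{code}
import java.io.File;
import java.io.IOException;
import org.apache.lucene.store.MMapDirectory;

public class ChunkedMMapSetup {
  public static MMapDirectory open(File indexDir) throws IOException {
    MMapDirectory dir = new MMapDirectory(indexDir);
    // Map index files in 256 MB chunks rather than one buffer per file, so a
    // fragmented 64-bit address space only has to satisfy smaller contiguous
    // mmap requests. Setter name assumed from the patch under discussion.
    dir.setMaxChunkSize(256 * 1024 * 1024);
    return dir;
  }
}
{code}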
[jira] Commented: (LUCENE-1749) FieldCache introspection API
[ https://issues.apache.org/jira/browse/LUCENE-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735242#action_12735242 ] Paul Smith commented on LUCENE-1749: You know what would be the absolute icing on the cake here: some way, using this introspection, for code to find large sort fields that can be discarded/unloaded as needed (programmatically). What I'm thinking of here is a use case we've run into where we have had to sort by subject. Well, the number of unique subjects gets pretty large, and while we still need to support the use case, it'd be nice to be able to periodically 'toss' sort fields like this so they don't hog memory permanently while the IndexReader is still in memory (sorting by subject is used, just not often, so it's a good candidate for tossing). Because we have multiple large IndexReaders open concurrently, it'd be nice to be able to scan periodically and kick out any unneeded ones. It's nice to be able to inspect and print these out, but even better if one can make changes based on what one finds. > FieldCache introspection API > > > Key: LUCENE-1749 > URL: https://issues.apache.org/jira/browse/LUCENE-1749 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Hoss Man >Priority: Minor > Fix For: 2.9 > > Attachments: fieldcache-introspection.patch, LUCENE-1749.patch, > LUCENE-1749.patch, LUCENE-1749.patch, LUCENE-1749.patch > > > FieldCache should expose an Expert level API for runtime introspection of the > FieldCache to provide info about what is in the FieldCache at any given > moment. We should also provide utility methods for sanity checking that the > FieldCache doesn't contain anything "odd"... >* entries for the same reader/field with different types/parsers >* entries for the same field/type/parser in a reader and its subreader(s) >* etc... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
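Something like the following is what I mean. Heavily hedged: it assumes the API ends up shaped like FieldCache.DEFAULT.getCacheEntries() returning entries with a getFieldName(), and the per-entry purge call is exactly the part that does not exist yet.
{code}
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.FieldCache.CacheEntry;

public class SortFieldJanitor {
  // Walk the cache and pick out large, rarely-used sort fields to evict.
  public static void evictRarelyUsedSortFields() {
    for (CacheEntry entry : FieldCache.DEFAULT.getCacheEntries()) {
      if ("subject".equals(entry.getFieldName())) {
        // Hypothetical call: being able to do something like this,
        // per entry, is the feature request in this comment.
        // FieldCache.DEFAULT.purge(entry);
      }
    }
  }
}
{code}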
[jira] Commented: (LUCENE-1935) Generify PriorityQueue
[ https://issues.apache.org/jira/browse/LUCENE-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761395#action_12761395 ] Paul Smith commented on LUCENE-1935: I shall perhaps regret asking this, but is there any reason not to use java.util.PriorityQueue instead? Seems like reinventing the wheel a bit there (I understand historically why Lucene has this class). (is Lucene 2.9+ now Java 5, or is that a different discussion altogether?) > Generify PriorityQueue > -- > > Key: LUCENE-1935 > URL: https://issues.apache.org/jira/browse/LUCENE-1935 > Project: Lucene - Java > Issue Type: Task > Components: Other >Affects Versions: 2.9 >Reporter: Uwe Schindler >Assignee: Uwe Schindler > Fix For: 3.0 > > Attachments: LUCENE-1935.patch > > > Priority Queue should use generics like all other Java 5 Collection API > classes. This very simple, but makes code more readable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1935) Generify PriorityQueue
[ https://issues.apache.org/jira/browse/LUCENE-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761408#action_12761408 ] Paul Smith commented on LUCENE-1935: Thanks Uwe, I thought I would regret asking; good points there. It's a shame the JDK doesn't have a fixed-size PriorityQueue implementation; that seems a bit of a glaring omission. > Generify PriorityQueue > -- > > Key: LUCENE-1935 > URL: https://issues.apache.org/jira/browse/LUCENE-1935 > Project: Lucene - Java > Issue Type: Task > Components: Other >Affects Versions: 2.9 >Reporter: Uwe Schindler >Assignee: Uwe Schindler > Fix For: 3.0 > > Attachments: LUCENE-1935.patch > > > Priority Queue should use generics like all other Java 5 Collection API > classes. This very simple, but makes code more readable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
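For the record, the kind of fixed-size queue meant here can be faked on top of java.util.PriorityQueue, though not as cleanly or as cheaply as a purpose-built class. A hedged sketch, assuming natural ordering and borrowing the insertWithOverflow() name from Lucene's own class; this is not Lucene's implementation.
{code}
import java.util.PriorityQueue;

// Keep only the largest maxSize elements; the head of the JDK queue is the
// smallest element, so it is the eviction candidate once capacity is reached.
public class BoundedPriorityQueue<E extends Comparable<E>> {
  private final PriorityQueue<E> queue = new PriorityQueue<E>();
  private final int maxSize;

  public BoundedPriorityQueue(int maxSize) {
    this.maxSize = maxSize;
  }

  public void insertWithOverflow(E element) {
    if (queue.size() < maxSize) {
      queue.offer(element);
    } else if (element.compareTo(queue.peek()) > 0) {
      queue.poll();          // drop the current smallest
      queue.offer(element);
    }
  }
}
{code}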
[jira] Commented: (LUCENE-1282) Sun hotspot compiler bug in 1.6.0_04/05 affects Lucene
[ https://issues.apache.org/jira/browse/LUCENE-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12595946#action_12595946 ] Paul Smith commented on LUCENE-1282: Another workaround might be to use '-client' instead of the default '-server' (for server-class machines). This affects a few things, not least the -XX:CompileThreshold switch, the number of method invocations/branches before compiling, which defaults to 1,500 for -client and 10,000 for -server. I have personally observed similar problems to the above with -server, and usually -client ends up 'solving' them. I'm sure there was also a way to mark a method to not be JIT compiled (rather than resorting to -Xint, which disables it for everything), but now I can't find what that syntax is at all. > Sun hotspot compiler bug in 1.6.0_04/05 affects Lucene > -- > > Key: LUCENE-1282 > URL: https://issues.apache.org/jira/browse/LUCENE-1282 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 2.3, 2.3.1 >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.4 > > > This is not a Lucene bug. It's an as-yet not fully characterized Sun > JRE bug, as best I can tell. I'm opening this to gather all things we > know, and to work around it in Lucene if possible, and maybe open an > issue with Sun if we can reduce it to a compact test case. > It's hit at least 3 users: > > http://mail-archives.apache.org/mod_mbox/lucene-java-user/200803.mbox/[EMAIL > PROTECTED] > > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200804.mbox/[EMAIL > PROTECTED] > > http://mail-archives.apache.org/mod_mbox/lucene-java-user/200805.mbox/[EMAIL > PROTECTED] > It's specific to at least JRE 1.6.0_04 and 1.6.0_05, that affects > Lucene. Whereas 1.6.0_03 works OK and it's unknown whether 1.6.0_06 > shows it. > The bug affects bulk merging of stored fields. When it strikes, the > segment produced by a merge is corrupt because its fdx file (stored > fields index file) is missing one document. After iterating many > times with the first user that hit this, adding diagnostics & > assertions, it seems that a call to fieldsWriter.addDocument somehow > either fails to run entirely, or, fails to invoke its call to > indexStream.writeLong. It's as if when hotspot compiles a method, > there's some sort of race condition in cutting over to the compiled > code whereby a single method call fails to be invoked (speculation). > Unfortunately, this corruption is silent when it occurs and only later > detected when a merge tries to merge the bad segment, or an > IndexReader tries to open it. 
Here's a typical merge exception: > {code} > Exception in thread "Thread-10" > org.apache.lucene.index.MergePolicy$MergeException: > org.apache.lucene.index.CorruptIndexException: > doc counts differ for segment _3gh: fieldsReader shows 15999 but > segmentInfo shows 16000 > at > org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:271) > Caused by: org.apache.lucene.index.CorruptIndexException: doc counts differ > for segment _3gh: fieldsReader shows 15999 but segmentInfo shows 16000 > at > org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:313) > at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:262) > at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:221) > at > org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3099) > at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2834) > at > org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:240) > {code} > and here's a typical exception hit when opening a searcher: > {code} > org.apache.lucene.index.CorruptIndexException: doc counts differ for segment > _kk: fieldsReader shows 72670 but segmentInfo shows 72671 > at > org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:313) > at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:262) > at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:230) > at > org.apache.lucene.index.DirectoryIndexReader$1.doBody(DirectoryIndexReader.java:73) > at > org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:636) > at > or
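To make the suggested workaround concrete, the flags would be applied along these lines. The jar and class names are placeholders; the flags themselves are standard HotSpot options.
{code}
# Force the client compiler, to test whether the corruption tracks -server:
java -client -cp lucene-core.jar:. MyIndexer

# Or stay on -server but raise the compile threshold so the suspect method
# is compiled much later (or effectively never, for short runs):
java -server -XX:CompileThreshold=100000 -cp lucene-core.jar:. MyIndexer
{code}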
[jira] Commented: (LUCENE-1282) Sun hotspot compiler bug in 1.6.0_04/05 affects Lucene
[ https://issues.apache.org/jira/browse/LUCENE-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12596964#action_12596964 ] Paul Smith commented on LUCENE-1282: Throwing up an idea here for consideration. I'm sure it could be shot down, but I thought I'd raise it just in case it hasn't already been considered and discarded.. One of the _classic_ problems between -client and -server mode is the way the CPU registers are used. Is it possible that some of the fields are suffering from concurrency issues? I was wondering if, say, BufferedIndexOutput.buffer* may need to be marked volatile? One easy way to test if this makes a difference is to just try switching between explicit use of '-client' and '-server'. Most newer machines (even desktops & laptops) appear to qualify for Sun's 'am I a server-class machine' check. If these problems disappear by switching to -client, this to me would smell more and more like 'volatile'-like behaviour, because AIUI -server is more aggressive with some of its register optimizations, and I've seen behaviour just like this where variables that have clearly been written have changes that do not 'appear' on the other side. Even the same thread making the change can be switched across to a different CPU right in the middle, and could see different results. Of course those people with lots of concurrency experience can probably dismiss this theory in a second, but that's fine. > Sun hotspot compiler bug in 1.6.0_04/05 affects Lucene > -- > > Key: LUCENE-1282 > URL: https://issues.apache.org/jira/browse/LUCENE-1282 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 2.3, 2.3.1 >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.4 > > Attachments: corrupt_merge_out15.txt > > > This is not a Lucene bug. It's an as-yet not fully characterized Sun > JRE bug, as best I can tell. I'm opening this to gather all things we > know, and to work around it in Lucene if possible, and maybe open an > issue with Sun if we can reduce it to a compact test case. > It's hit at least 3 users: > > http://mail-archives.apache.org/mod_mbox/lucene-java-user/200803.mbox/[EMAIL > PROTECTED] > > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200804.mbox/[EMAIL > PROTECTED] > > http://mail-archives.apache.org/mod_mbox/lucene-java-user/200805.mbox/[EMAIL > PROTECTED] > It's specific to at least JRE 1.6.0_04 and 1.6.0_05, that affects > Lucene. Whereas 1.6.0_03 works OK and it's unknown whether 1.6.0_06 > shows it. > The bug affects bulk merging of stored fields. When it strikes, the > segment produced by a merge is corrupt because its fdx file (stored > fields index file) is missing one document. After iterating many > times with the first user that hit this, adding diagnostics & > assertions, it seems that a call to fieldsWriter.addDocument somehow > either fails to run entirely, or, fails to invoke its call to > indexStream.writeLong. It's as if when hotspot compiles a method, > there's some sort of race condition in cutting over to the compiled > code whereby a single method call fails to be invoked (speculation). > Unfortunately, this corruption is silent when it occurs and only later > detected when a merge tries to merge the bad segment, or an > IndexReader tries to open it. 
Here's a typical merge exception: > {code} > Exception in thread "Thread-10" > org.apache.lucene.index.MergePolicy$MergeException: > org.apache.lucene.index.CorruptIndexException: > doc counts differ for segment _3gh: fieldsReader shows 15999 but > segmentInfo shows 16000 > at > org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:271) > Caused by: org.apache.lucene.index.CorruptIndexException: doc counts differ > for segment _3gh: fieldsReader shows 15999 but segmentInfo shows 16000 > at > org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:313) > at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:262) > at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:221) > at > org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3099) > at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2834) > at > org.apache.lucene.index.ConcurrentMergeSch
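For anyone unfamiliar with the failure mode being speculated about, this is the textbook visibility hazard. Illustrative only; it is not a claim about what BufferedIndexOutput actually does.
{code}
// Without volatile, a JIT-compiled reader thread may keep 'flag' cached in
// a register and never observe the writer thread's update.
public class VisibilityExample {
  private volatile boolean flag;   // remove 'volatile' to invite the hazard

  public void writer() {
    flag = true;
  }

  public void reader() {
    while (!flag) {
      // With a plain (non-volatile) field, an aggressive compiler may turn
      // this loop into a spin that never rereads the flag from memory.
    }
  }
}
{code}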
[jira] Commented: (LUCENE-1282) Sun hotspot compiler bug in 1.6.0_04/05 affects Lucene
[ https://issues.apache.org/jira/browse/LUCENE-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12613208#action_12613208 ] Paul Smith commented on LUCENE-1282: Can anyone comment as to whether the JRE 1.6.04+ bug affects any _earlier_ versions of Lucene? (say, 2.0, which we're still using). I was just reviewing this issue and noticed Michael mentioned this behaviour shows up in both the ConcurrentMergeScheduler and the SerialMergeScheduler. AIUI, the SerialMergeScheduler is effectively the 'old' way of previous versions of Lucene, so I'm just starting to think about what effect 1.6.04 might have on earlier versions (this bug is only marked as affecting 2.3+). The reason I ask is that we're just about to upgrade to 1.6.04 -server on some of our production machines (the reason we're not going to 1.6.06 is that we started our development test cycle months ago and stuck with .04 until the next cycle). > Sun hotspot compiler bug in 1.6.0_04/05 affects Lucene > -- > > Key: LUCENE-1282 > URL: https://issues.apache.org/jira/browse/LUCENE-1282 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 2.3, 2.3.1 >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.4 > > Attachments: corrupt_merge_out15.txt, crashtest, crashtest.log, > hs_err_pid27359.log > > > This is not a Lucene bug. It's an as-yet not fully characterized Sun > JRE bug, as best I can tell. I'm opening this to gather all things we > know, and to work around it in Lucene if possible, and maybe open an > issue with Sun if we can reduce it to a compact test case. > It's hit at least 3 users: > > http://mail-archives.apache.org/mod_mbox/lucene-java-user/200803.mbox/[EMAIL > PROTECTED] > > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200804.mbox/[EMAIL > PROTECTED] > > http://mail-archives.apache.org/mod_mbox/lucene-java-user/200805.mbox/[EMAIL > PROTECTED] > It's specific to at least JRE 1.6.0_04 and 1.6.0_05, that affects > Lucene. Whereas 1.6.0_03 works OK and it's unknown whether 1.6.0_06 > shows it. > The bug affects bulk merging of stored fields. When it strikes, the > segment produced by a merge is corrupt because its fdx file (stored > fields index file) is missing one document. After iterating many > times with the first user that hit this, adding diagnostics & > assertions, it seems that a call to fieldsWriter.addDocument somehow > either fails to run entirely, or, fails to invoke its call to > indexStream.writeLong. It's as if when hotspot compiles a method, > there's some sort of race condition in cutting over to the compiled > code whereby a single method call fails to be invoked (speculation). > Unfortunately, this corruption is silent when it occurs and only later > detected when a merge tries to merge the bad segment, or an > IndexReader tries to open it. 
Here's a typical merge exception: > {code} > Exception in thread "Thread-10" > org.apache.lucene.index.MergePolicy$MergeException: > org.apache.lucene.index.CorruptIndexException: > doc counts differ for segment _3gh: fieldsReader shows 15999 but > segmentInfo shows 16000 > at > org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:271) > Caused by: org.apache.lucene.index.CorruptIndexException: doc counts differ > for segment _3gh: fieldsReader shows 15999 but segmentInfo shows 16000 > at > org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:313) > at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:262) > at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:221) > at > org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3099) > at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2834) > at > org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:240) > {code} > and here's a typical exception hit when opening a searcher: > {code} > org.apache.lucene.index.CorruptIndexException: doc counts differ for segment > _kk: fieldsReader shows 72670 but segmentInfo shows 72671 > at > org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:313) > at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:262) > at org.apache.lucene.index.SegmentReader.get(SegmentReader.ja
[jira] Commented: (LUCENE-1372) Proposal: introduce more sensible sorting when a doc has multiple values for a term
[ https://issues.apache.org/jira/browse/LUCENE-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628480#action_12628480 ] Paul Smith commented on LUCENE-1372: Having a Document sorted last because it has "zebra", even though it also has "apple", seems way incorrect. Yes, it would be ideal if Lucene _could_ perform the multi-term sort properly, but in the absence of an effective fix in the short term, having the lexicographically earlier term 'picked' as the primary sort candidate is likely to generate results that match what users would expect (even if it's not quite perfect). Right now it looks blatantly silly at the presentation layer when one presents the search results with their data and shows "apple,zebra" appearing last in the list.. > Proposal: introduce more sensible sorting when a doc has multiple values for > a term > --- > > Key: LUCENE-1372 > URL: https://issues.apache.org/jira/browse/LUCENE-1372 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.3.2 >Reporter: Paul Cowan >Priority: Minor > Attachments: lucene-multisort.patch > > > At the moment, FieldCacheImpl has somewhat disconcerting values when sorting > on a field for which multiple values exist for one document. For example, > imagine a field "fruit" which is added to a document multiple times, with the > values as follows: > doc 1: {"apple"} > doc 2: {"banana"} > doc 3: {"apple", "banana"} > doc 4: {"apple", "zebra"} > if one sorts on the field "fruit", the loop in > FieldCacheImpl.stringsIndexCache.createValue() (and similarly for the other > methods in the various FieldCacheImpl caches) does the following: > while (termDocs.next()) { > retArray[termDocs.doc()] = t; > } > which means that we loop over the terms in their natural order and, on each > one, overwrite retArray[doc] with the value for each document with that term. > Effectively, this overwriting means that a string sort in this circumstance > will sort by the LAST term lexicographically, so the docs above will > effectively be sorted as if they had the single values ("apple", "banana", > "banana", "zebra") which is nonintuitive. To change this to sort on the first > term in the TermEnum seems relatively trivial and low-overhead; while it's > not perfect (it's not locale-aware, for example) the behaviour seems much more > sensible to me. Interested to see what people think. > Patch to follow. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
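The shape of the proposed change, as the issue description suggests it: a hedged sketch against the String[] variant of the cache, where an unset slot is null (the int-ord variants would need an equivalent "only set once" guard).
{code}
// First-term-wins: because TermEnum walks terms in lexicographic order,
// writing a doc's slot only the first time we see it keeps the earliest term.
while (termDocs.next()) {
  int doc = termDocs.doc();
  if (retArray[doc] == null) {   // untouched so far: earliest term wins
    retArray[doc] = t;
  }
}
{code}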
[jira] Commented: (LUCENE-1372) Proposal: introduce more sensible sorting when a doc has multiple values for a term
[ https://issues.apache.org/jira/browse/LUCENE-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628513#action_12628513 ] Paul Smith commented on LUCENE-1372: bq. I'm not following this argument. Will it be less silly when {zebra,apple} sorts before {banana} ? Well, at the presentation layer I don't think you'd present it like that (we don't). We'd sort the list of attributes so that it would appear as "apple,zebra". > Proposal: introduce more sensible sorting when a doc has multiple values for > a term > --- > > Key: LUCENE-1372 > URL: https://issues.apache.org/jira/browse/LUCENE-1372 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.3.2 >Reporter: Paul Cowan >Priority: Minor > Attachments: lucene-multisort.patch > > > At the moment, FieldCacheImpl has somewhat disconcerting values when sorting > on a field for which multiple values exist for one document. For example, > imagine a field "fruit" which is added to a document multiple times, with the > values as follows: > doc 1: {"apple"} > doc 2: {"banana"} > doc 3: {"apple", "banana"} > doc 4: {"apple", "zebra"} > if one sorts on the field "fruit", the loop in > FieldCacheImpl.stringsIndexCache.createValue() (and similarly for the other > methods in the various FieldCacheImpl caches) does the following: > while (termDocs.next()) { > retArray[termDocs.doc()] = t; > } > which means that we loop over the terms in their natural order and, on each > one, overwrite retArray[doc] with the value for each document with that term. > Effectively, this overwriting means that a string sort in this circumstance > will sort by the LAST term lexicographically, so the docs above will > effectively be sorted as if they had the single values ("apple", "banana", > "banana", "zebra") which is nonintuitive. To change this to sort on the first > term in the TermEnum seems relatively trivial and low-overhead; while it's > not perfect (it's not locale-aware, for example) the behaviour seems much more > sensible to me. Interested to see what people think. > Patch to follow. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1342) 64bit JVM crashes on Linux
[ https://issues.apache.org/jira/browse/LUCENE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648768#action_12648768 ] Paul Smith commented on LUCENE-1342: java version "1.6.0_10" Java(TM) SE Runtime Environment (build 1.6.0_10-b33) Java HotSpot(TM) 64-Bit Server VM (build 11.0-b15, mixed mode) For clarity, there are two Pauls (myself included) and Alison here on the discussion thread, all from Aconex (we're all talking about the same problem at the same company, but are sharing in the discussion based on the different analyses we're each doing). We've recently upgraded to using Lucene 2.2 from 2.0 (yes, way behind, but we're cautious here..), and we are about 4 days from going into production with it. First off, an observation. The original bug report here was reported against Lucene 2.0, which we've been using in production for nearly 2 years against a few different JVMs (Java 1.5, plus a few builds of Java 1.6 up to and including 1.6.04). We've never encountered this in production or in our load test area using Lucene 2.0. However, as soon as we switched to Lucene 2.2, using the same JRE as production (1.6.04), we started seeing these problems. After reviewing another HotSpot crash bug (LUCENE-1282) we decided to see if JRE 1.6.0_10 made a difference. Initially it did: we didn't find a problem over several load testing runs and we thought we were fine. Then a few weeks later, we started to see it occurring more frequently, yet none of the code changes in our application since the initial 1.6.0_10 switch could logically be connected to the indexing system at all (our application is split between an App and an Index/Search server, and the SVN diff between the load testing tag runs didn't have any code change that was Indexer/Search related). At the same time we had a strange network problem going on in the load testing area that was causing problems with the App talking to the Indexer, which was caused by a local DNS problem. Since that was resolved, the JRE crash inexplicably hasn't happened that I'm aware of; how that could be related to the JRE hotspot compilation of Lucene byte-code, I have no idea.. BUT, since we had several weeks of stability and then several crashes, this is purely anecdotal/coincidental. I'm still rubbing my rabbit's foot here. I need to chat with Alison & Paul Cowan about this to get more specific details about if/when the crash has occurred since the DNS problem was resolved, because it could purely be a statistical anomaly (we simply may not have done enough runs to flush it out), and frankly I could be mistaken about the # crashes in the load testing env. For incremental indexing (which is what is happening during the load test that crashes) we are using compound file format, mergeFactor=default (10), minMergeDocs=200, maxMergeDocs=default (MAX_INT). It's pretty vanilla really.. (the reason for a low mergeFactor is that we have several hundred indexes open at the same time for different projects, so open file handles become a problem). I'll let Alison/Paul Cowan comment further; this is just my 5 Aussie cents' worth. > 64bit JVM crashes on Linux > -- > > Key: LUCENE-1342 > URL: https://issues.apache.org/jira/browse/LUCENE-1342 > Project: Lucene - Java > Issue Type: Bug >Affects Versions: 2.0.0 > Environment: 2.6.18-53.el5 x86_64 GNU/Linux > Java(TM) SE Runtime Environment (build 1.6.0_04-b12) >Reporter: Kevin Richards > > Whilst running lucene in our QA environment we received the following > exception. 
This problem was also reported here : > http://confluence.atlassian.com/display/KB/JSP-20240+-+POSSIBLE+64+bit+JDK+1.6+update+4+may+have+HotSpot+problems. > Is this a JVM problem or a problem in Lucene. > # > # An unexpected error has been detected by Java Runtime Environment: > # > # SIGSEGV (0xb) at pc=0x2adb9e3f, pid=2275, tid=1085356352 > # > # Java VM: Java HotSpot(TM) 64-Bit Server VM (10.0-b19 mixed mode linux-amd64) > # Problematic frame: > # V [libjvm.so+0x1fce3f] > # > # If you would like to submit a bug report, please visit: > # http://java.sun.com/webapps/bugreport/crash.jsp > # > --- T H R E A D --- > Current thread (0x2aab0007f000): JavaThread "CompilerThread0" daemon > [_thread_in_vm, id=2301, stack(0x40a13000,0x40b14000)] > siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), > si_addr=0x > Registers: > RAX=0x, RBX=0x2aab0007f000, RCX=0x, > RDX=0x2aab00309aa0 > RSP=0x40b10f60, RBP=0x40b10fb0, RSI=0x2
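For reference, the incremental-indexing configuration described in the comment above, spelled out against the Lucene 2.2-era API. This is an illustration of those settings, not a recommendation; the directory path and analyzer are placeholders.
{code}
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexWriter;

public class IncrementalIndexerSetup {
  public static IndexWriter open(String indexDir, Analyzer analyzer)
      throws IOException {
    IndexWriter writer = new IndexWriter(indexDir, analyzer, false);
    writer.setUseCompoundFile(true); // compound format: fewer open file handles
    writer.setMergeFactor(10);       // default; kept low across hundreds of indexes
    writer.setMaxBufferedDocs(200);  // the 2.x name for the old minMergeDocs knob
    // maxMergeDocs is left at its default (Integer.MAX_VALUE)
    return writer;
  }
}
{code}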
[jira] Updated: (LUCENE-1342) 64bit JVM crashes on Linux
[ https://issues.apache.org/jira/browse/LUCENE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Smith updated LUCENE-1342: --- Attachment: hs_err_pid27882.log hs_err_pid21301.log 2 crash dumps attached. > 64bit JVM crashes on Linux > -- > > Key: LUCENE-1342 > URL: https://issues.apache.org/jira/browse/LUCENE-1342 > Project: Lucene - Java > Issue Type: Bug >Affects Versions: 2.0.0 > Environment: 2.6.18-53.el5 x86_64 GNU/Linux > Java(TM) SE Runtime Environment (build 1.6.0_04-b12) >Reporter: Kevin Richards > Attachments: hs_err_pid10565.log, hs_err_pid21301.log, > hs_err_pid27882.log > > > Whilst running lucene in our QA environment we received the following > exception. This problem was also reported here : > http://confluence.atlassian.com/display/KB/JSP-20240+-+POSSIBLE+64+bit+JDK+1.6+update+4+may+have+HotSpot+problems. > Is this a JVM problem or a problem in Lucene. > # > # An unexpected error has been detected by Java Runtime Environment: > # > # SIGSEGV (0xb) at pc=0x2adb9e3f, pid=2275, tid=1085356352 > # > # Java VM: Java HotSpot(TM) 64-Bit Server VM (10.0-b19 mixed mode linux-amd64) > # Problematic frame: > # V [libjvm.so+0x1fce3f] > # > # If you would like to submit a bug report, please visit: > # http://java.sun.com/webapps/bugreport/crash.jsp > # > --- T H R E A D --- > Current thread (0x2aab0007f000): JavaThread "CompilerThread0" daemon > [_thread_in_vm, id=2301, stack(0x40a13000,0x40b14000)] > siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), > si_addr=0x > Registers: > RAX=0x, RBX=0x2aab0007f000, RCX=0x, > RDX=0x2aab00309aa0 > RSP=0x40b10f60, RBP=0x40b10fb0, RSI=0x2aaab37d1ce8, > RDI=0x2aaad000 > R8 =0x2b40cd88, R9 =0x0ffc, R10=0x2b40cd90, > R11=0x2b410810 > R12=0x2aab00ae60b0, R13=0x2aab0a19cc30, R14=0x40b112f0, > R15=0x2aab00ae60b0 > RIP=0x2adb9e3f, EFL=0x00010246, CSGSFS=0x0033, > ERR=0x0004 > TRAPNO=0x000e > Top of Stack: (sp=0x40b10f60) > 0x40b10f60: 2aab0007f000 > 0x40b10f70: 2aab0a19cc30 0001 > 0x40b10f80: 2aab0007f000 > 0x40b10f90: 40b10fe0 2aab0a19cc30 > 0x40b10fa0: 2aab0a19cc30 2aab00ae60b0 > 0x40b10fb0: 40b10fe0 2ae9c2e4 > 0x40b10fc0: 2b413210 2b413350 > 0x40b10fd0: 40b112f0 2aab09796260 > 0x40b10fe0: 40b110e0 2ae9d7d8 > 0x40b10ff0: 2b40f3d0 2aab08c2a4c8 > 0x40b11000: 40b11940 2aab09796260 > 0x40b11010: 2aab09795b28 > 0x40b11020: 2aab08c2a4c8 2aab009b9750 > 0x40b11030: 2aab09796260 40b11940 > 0x40b11040: 2b40f3d0 2023 > 0x40b11050: 40b11940 2aab09796260 > 0x40b11060: 40b11090 2b0f199e > 0x40b11070: 40b11978 2aab08c2a458 > 0x40b11080: 2b413210 2023 > 0x40b11090: 40b110e0 2b0f1fcf > 0x40b110a0: 2023 2aab09796260 > 0x40b110b0: 2aab08c2a3c8 40b123b0 > 0x40b110c0: 2aab08c2a458 40b112f0 > 0x40b110d0: 2b40f3d0 2aab00043670 > 0x40b110e0: 40b11160 2b0e808d > 0x40b110f0: 2aab000417c0 2aab009b66a8 > 0x40b11100: 2aab009b9750 > 0x40b0: 40b112f0 2aab009bb360 > 0x40b11120: 0003 40b113d0 > 0x40b11130: 01002aab0052d0c0 40b113d0 > 0x40b11140: 00b3 40b112f0 > 0x40b11150: 40b113d0 2aab08c2a108 > Instructions: (pc=0x2adb9e3f) > 0x2adb9e2f: 48 89 5d b0 49 8b 55 08 49 8b 4c 24 08 48 8b 32 > 0x2adb9e3f: 4c 8b 21 8b 4e 1c 49 8d 7c 24 10 89 cb 4a 39 34 > Stack: [0x40a13000,0x40b14000], sp=0x40b10f60, free > space=1015k > Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native > code) > V [libjvm.so+0x1fce3f] > V [libjvm.so+0x2df2e4] > V [libjvm.so+0x2e07d8] > V [libjvm.so+0x52b08d] > V [libjvm
[jira] Commented: (LUCENE-1342) 64bit JVM crashes on Linux
[ https://issues.apache.org/jira/browse/LUCENE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650051#action_12650051 ] Paul Smith commented on LUCENE-1342: yeah, it's definitely a Sun bug, not a Lucene one, but like the other recent JVM crash issue it sort of 'affects' Lucene specifically. Must be something about that byte code. No idea why it does/does not trigger it. We've raised a Sun bug, but it hasn't 'appeared' online yet (Paul Cowan raised it). Will post the cross link to it once we have confirmation that Sun has deemed it 'worthy' to accept it. > 64bit JVM crashes on Linux > -- > > Key: LUCENE-1342 > URL: https://issues.apache.org/jira/browse/LUCENE-1342 > Project: Lucene - Java > Issue Type: Bug >Affects Versions: 2.0.0 > Environment: 2.6.18-53.el5 x86_64 GNU/Linux > Java(TM) SE Runtime Environment (build 1.6.0_04-b12) >Reporter: Kevin Richards > Attachments: hs_err_pid10565.log, hs_err_pid21301.log, > hs_err_pid27882.log > > > Whilst running lucene in our QA environment we received the following > exception. This problem was also reported here : > http://confluence.atlassian.com/display/KB/JSP-20240+-+POSSIBLE+64+bit+JDK+1.6+update+4+may+have+HotSpot+problems. > Is this a JVM problem or a problem in Lucene. > # > # An unexpected error has been detected by Java Runtime Environment: > # > # SIGSEGV (0xb) at pc=0x2adb9e3f, pid=2275, tid=1085356352 > # > # Java VM: Java HotSpot(TM) 64-Bit Server VM (10.0-b19 mixed mode linux-amd64) > # Problematic frame: > # V [libjvm.so+0x1fce3f] > # > # If you would like to submit a bug report, please visit: > # http://java.sun.com/webapps/bugreport/crash.jsp > # > --- T H R E A D --- > Current thread (0x2aab0007f000): JavaThread "CompilerThread0" daemon > [_thread_in_vm, id=2301, stack(0x40a13000,0x40b14000)] > siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), > si_addr=0x > Registers: > RAX=0x, RBX=0x2aab0007f000, RCX=0x, > RDX=0x2aab00309aa0 > RSP=0x40b10f60, RBP=0x40b10fb0, RSI=0x2aaab37d1ce8, > RDI=0x2aaad000 > R8 =0x2b40cd88, R9 =0x0ffc, R10=0x2b40cd90, > R11=0x2b410810 > R12=0x2aab00ae60b0, R13=0x2aab0a19cc30, R14=0x40b112f0, > R15=0x2aab00ae60b0 > RIP=0x2adb9e3f, EFL=0x00010246, CSGSFS=0x0033, > ERR=0x0004 > TRAPNO=0x000e > Top of Stack: (sp=0x40b10f60) > 0x40b10f60: 2aab0007f000 > 0x40b10f70: 2aab0a19cc30 0001 > 0x40b10f80: 2aab0007f000 > 0x40b10f90: 40b10fe0 2aab0a19cc30 > 0x40b10fa0: 2aab0a19cc30 2aab00ae60b0 > 0x40b10fb0: 40b10fe0 2ae9c2e4 > 0x40b10fc0: 2b413210 2b413350 > 0x40b10fd0: 40b112f0 2aab09796260 > 0x40b10fe0: 40b110e0 2ae9d7d8 > 0x40b10ff0: 2b40f3d0 2aab08c2a4c8 > 0x40b11000: 40b11940 2aab09796260 > 0x40b11010: 2aab09795b28 > 0x40b11020: 2aab08c2a4c8 2aab009b9750 > 0x40b11030: 2aab09796260 40b11940 > 0x40b11040: 2b40f3d0 2023 > 0x40b11050: 40b11940 2aab09796260 > 0x40b11060: 40b11090 2b0f199e > 0x40b11070: 40b11978 2aab08c2a458 > 0x40b11080: 2b413210 2023 > 0x40b11090: 40b110e0 2b0f1fcf > 0x40b110a0: 2023 2aab09796260 > 0x40b110b0: 2aab08c2a3c8 40b123b0 > 0x40b110c0: 2aab08c2a458 40b112f0 > 0x40b110d0: 2b40f3d0 2aab00043670 > 0x40b110e0: 40b11160 2b0e808d > 0x40b110f0: 2aab000417c0 2aab009b66a8 > 0x40b11100: 2aab009b9750 > 0x40b0: 40b112f0 2aab009bb360 > 0x40b11120: 0003 40b113d0 > 0x40b11130: 01002aab0052d0c0 40b113d0 > 0x40b11140: 00b3 40b112f0 > 0x40b11150: 40b113d0 2aab08c2a108 > Instructions: (pc=0x2adb9e3f) > 0x2adb9e2f: 48 89 5d b0 49 8b 55 0
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads
[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779641#action_12779641 ] Paul Smith commented on LUCENE-2075: bq. This cache impl should be able to support 1B operations per second for almost 300 years (i.e. the time it would take to overflow a long). Hopefully Sun has released Java 7 by then. :) > Share the Term -> TermInfo cache across threads > --- > > Key: LUCENE-2075 > URL: https://issues.apache.org/jira/browse/LUCENE-2075 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael McCandless >Priority: Minor > Fix For: 3.1 > > Attachments: ConcurrentLRUCache.java, LUCENE-2075.patch, > LUCENE-2075.patch, LUCENE-2075.patch > > > Right now each thread creates its own (thread private) SimpleLRUCache, > holding up to 1024 terms. > This is rather wasteful, since if there are a high number of threads > that come through Lucene, you're multiplying the RAM usage. You're > also cutting way back on likelihood of a cache hit (except the known > multiple times we lookup a term within-query, which uses one thread). > In NRT search we open new SegmentReaders (on tiny segments) often > which each thread must then spend CPU/RAM creating & populating. > Now that we are on 1.5 we can use java.util.concurrent.*, eg > ConcurrentHashMap. One simple approach could be a double-barrel LRU > cache, using 2 maps (primary, secondary). You check the cache by > first checking primary; if that's a miss, you check secondary and if > you get a hit you promote it to primary. Once primary is full you > clear secondary and swap them. > Or... any other suggested approach? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
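The double-barrel idea from the issue description is simple enough to sketch. A hedged illustration of the approach only, not the patch attached here.
{code}
import java.util.concurrent.ConcurrentHashMap;

// Two ConcurrentHashMap "barrels": misses in the primary check the secondary
// and promote on a hit; when the primary fills, the barrels swap and the
// stale one is replaced, giving cheap, approximate LRU across threads.
public class DoubleBarrelLRUCache<K, V> {
  private final int maxSize;
  private volatile ConcurrentHashMap<K, V> primary =
      new ConcurrentHashMap<K, V>();
  private volatile ConcurrentHashMap<K, V> secondary =
      new ConcurrentHashMap<K, V>();

  public DoubleBarrelLRUCache(int maxSize) {
    this.maxSize = maxSize;
  }

  public V get(K key) {
    V value = primary.get(key);
    if (value == null) {
      value = secondary.get(key);
      if (value != null) {
        primary.put(key, value);  // promote to the hot barrel
      }
    }
    return value;
  }

  public synchronized void put(K key, V value) {
    primary.put(key, value);
    if (primary.size() >= maxSize) {  // swap: old primary becomes secondary
      secondary = primary;
      primary = new ConcurrentHashMap<K, V>();
    }
  }
}
{code}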
[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer
[ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515882 ] Paul Smith commented on LUCENE-966: --- We did pretty much the same thing here at Aconex. The tokenization mechanism in the old JavaCC-based analyser is woeful compared to what JFlex outputs. Nice work! > A faster JFlex-based replacement for StandardAnalyzer > - > > Key: LUCENE-966 > URL: https://issues.apache.org/jira/browse/LUCENE-966 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Stanislaw Osinski > Fix For: 2.3 > > Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt > > > JFlex (http://www.jflex.de/) can be used to generate a faster (up to several > times) replacement for StandardAnalyzer. Will add a patch and a simple > benchmark code in a while. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]