Lucene Taglib
Hi,

I've worked a bit on the taglib and added an index and field tag for basic indexing capability, though I don't think it's really useful apart from, in my case, quick prototyping of web applications. What do you guys think? I'm new to Lucene and taglibs so I may have missed lots of things.

For the curious, you can see the 'in progress' examples and docs at http://www.javaxp.net/lucene-examples/ and http://www.javaxp.net/lucene-doc/ respectively, or download the distribution from http://www.javaxp.net/lucene-taglib/lucene-taglib.zip or http://www.javaxp.net/lucene-taglib/lucene-taglib.tar.gz

Erik, are there any requirements for the Java package names? e.g. ... to be named org.apache.lucene.taglib etc. BTW, I've included the ASL 2.0 license in the source files.

Regards,
Iskandar

----- Original Message -----
From: Erik Hatcher [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Sunday, March 07, 2004 11:18 AM
Subject: Re: Lucene Search Taglib

I, too, gave up on the sandbox taglib. I apologize for even committing it without giving it more of a workout. I gave a good effort to fix it up a couple of months ago, but there was more work to do than I was willing to put in. I have not heard from the original contributor, and I specifically asked on the list for assistance with getting it cleaned up. I would gladly throw away what is in the sandbox for your code. If your code is designated as ASL 2.0 on all the files per the Apache licensing guidelines and you wish to donate it to the sandbox, just say the word.

Erik

On Mar 6, 2004, at 9:32 PM, Iskandar Salim wrote:

Hi, I've written a taglib for querying Lucene indices and have uploaded it at http://blog.javaxp.net/files/lucene-taglib.zip for anyone wanting to check it out. It's a hefty 903 KB as it includes the Lucene libraries and a sample index :P . There's a demo at http://www.javaxp.net/lucene-taglib/

Anyway, I could not get the current Lucene taglib from CVS to work as expected and gave up trying to modify it, so I wrote a new one, my very first taglib :P, with ideas and code borrowed/copied from the JSTL taglib. I've tested the taglib on Tomcat 4.1.18 and Tomcat 5.1.19 on JRE 1.4.2. I'll be making a few enhancements/cleanups/docs these next few days and would greatly appreciate any feedback/ideas on features that the taglib should have and whether the taglib is done right at all. Thanks.

Regards,
Iskandar Salim

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
RE: Sys properties Was: java.io.tmpdir as lock dir .... once again
Hi,

Here is the indexing performance testing result for the two index formats.

Test machine:
- 1000 MHz Intel Pentium III (2 installed)
- 32 KB primary memory cache, 256 KB secondary memory cache
- SCSI hard drive, 145.45 GB
- RAM: 3 GB
- Windows 2000 Advanced Server, Service Pack 2
- JDK 1.4.0, JVM memory 512 MB

Indexed files: 66100 local text files, around 400 MB.

Index time: compound format is 89 seconds slower.
compound format: 1389507 total milliseconds
non-compound format: 1300534 total milliseconds

The index size is 85 MB with 4 fields only. The files are stored in the index. The compound format has only 3 files and the other has 13 files.

Search time (with only top 10 retrieved, no indexing at the same time, only one search thread, indices optimized and opened): I do not see much consistent difference for this simple situation.

compound format:
Query: iraq, 4275 total, within(ms) 110
Query: war, 5728 total, within(ms) 0
Query: iraq AND war, 3182 total, within(ms) 16

non-compound format:
Query: war, 5728 total, within(ms) 125
Query: iraq war, 6821 total, within(ms) 31
Query: iraq AND war, 3182 total, within(ms) 0

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]]
Sent: Thursday, March 04, 2004 11:54 AM
To: Lucene Users List
Subject: Re: Sys properties Was: java.io.tmpdir as lock dir once again

hui wrote:
Not yet. For the compound file format, when the files get bigger, if I add a few new files frequently, the bigger files have to be updated. Will that affect search much and produce heavier disk I/O compared with the traditional index format? It seems the OS cache makes quite a difference when the files are not changed.

The compound format slows indexing performance slightly, but should not affect search performance much. It radically reduces the number of file handles used when searching, by a factor of eight or more, depending on how many indexed fields you have. Perhaps the compound format should be the default format in 1.4. Can folks provide any benchmarks for how it affects performance?

Doug
Re: Sys properties Was: java.io.tmpdir as lock dir .... once again
hui wrote:
Hi, Here is the indexing performance testing result for the two index formats.

A shameless plug: you can use Luke (http://www.getopt.org/luke) to convert the same index between compound/non-compound formats, which could be useful to rule out any possible differences in the indexing/inserting process between the runs. Luke also provides a simple time measurement for query execution. Just FYI.

--
Best regards,
Andrzej Bialecki

Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
FreeBSD developer (http://www.freebsd.org)
RE: Sys properties Was: java.io.tmpdir as lock dir .... once again
Thank you, the converting option from Luke is really helpful for migrating existing user indexes.

Regards,
Hui

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 08, 2004 10:57 AM
To: Lucene Users List
Subject: Re: Sys properties Was: java.io.tmpdir as lock dir once again

A shameless plug: you can use Luke (http://www.getopt.org/luke) to convert the same index between compound/non-compound formats.
Re: Storing numbers
Erik Hatcher wrote:
private static final DecimalFormat formatter =
    new DecimalFormat("0000000000"); // make this as wide as you need

For ints, ten digits is probably safest. Since Lucene uses prefix compression on the term dictionary, you don't pay a penalty at search time for long shared prefixes.

Doug
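The reason for the fixed-width padding above is that Lucene compares terms lexicographically, so numbers must be zero-padded for string order to match numeric order. A minimal stdlib sketch of the idea (the class name here is just for illustration):

```java
import java.text.DecimalFormat;
import java.util.Arrays;

public class PaddedNumbers {
    // Ten digits covers any non-negative Java int (max 2147483647).
    private static final DecimalFormat FORMATTER = new DecimalFormat("0000000000");

    public static String pad(int n) {
        return FORMATTER.format(n);
    }

    public static void main(String[] args) {
        // Unpadded strings sort lexicographically: "100" < "20" < "3" -- wrong numerically.
        String[] unpadded = {"100", "20", "3"};
        Arrays.sort(unpadded);
        System.out.println(Arrays.toString(unpadded));

        // Padded strings sort correctly: 0000000003 < 0000000020 < 0000000100.
        String[] padded = {pad(100), pad(20), pad(3)};
        Arrays.sort(padded);
        System.out.println(Arrays.toString(padded));
    }
}
```

Indexing the padded form as a keyword field would then make range queries over numbers behave as expected.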
Re: Sys properties Was: java.io.tmpdir as lock dir .... once again
hui wrote:
Index time: compound format is 89 seconds slower.
compound format: 1389507 total milliseconds
non-compound format: 1300534 total milliseconds
The index size is 85 MB with 4 fields only. The files are stored in the index. The compound format has only 3 files and the other has 13 files.

Thanks for performing this benchmark! It looks like the compound format is around 7% slower when indexing. To my thinking that's acceptable, given the dramatic reduction in file handles. If folks really need maximal indexing performance, they can explicitly disable the compound format. Would anyone object to making compound format the default for Lucene 1.4? This is an incompatible change, but I don't think it should break applications.

Doug
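The ~7% figure follows directly from hui's two timings; a quick check of the arithmetic:

```java
public class CompoundSlowdown {
    public static void main(String[] args) {
        long compoundMs = 1389507L;     // compound format index time (from hui's benchmark)
        long nonCompoundMs = 1300534L;  // non-compound format index time

        long deltaMs = compoundMs - nonCompoundMs;          // 88973 ms, i.e. the "89 seconds slower"
        double slowdownPct = 100.0 * deltaMs / nonCompoundMs; // roughly 6.8%, rounded to "around 7%"

        System.out.println(deltaMs + " ms slower, " + slowdownPct + "% slowdown");
    }
}
```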
Caching and paging search results
Hi all, could someone describe their experience implementing caching, sorting and paging of search results? Is a stateful session bean appropriate for this? My wish is to obtain all search hits only in the first call, and after that to iterate through the hit collection and display cached results. I have checked the SearchBean in the contribution section, but it does not provide real caching and paging.

Regards and thanks in advance!
Milan
Re: Caching and paging search results
In the RealWorld... many applications actually just re-run a search and jump to the appropriate page within the hits; searching is generally plenty fast enough to alleviate concerns about caching. However, if you need to cache Hits, you need to be sure to keep the originating IndexSearcher around as well. A stateful session bean could be used, but I'd opt for a much simpler solution as a first pass, such as just re-running the search from scratch.

Erik

On Mar 8, 2004, at 2:14 PM, Clandes Tino wrote:
Hi all, could someone describe their experience implementing caching, sorting and paging of search results? Is a stateful session bean appropriate for this?
Re: Sys properties Was: java.io.tmpdir as lock dir .... once again
I tend to agree (but with the same uncertainty as to why I feel that way).

Regards,
Terry

----- Original Message -----
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, March 08, 2004 2:34 PM
Subject: Re: Sys properties Was: java.io.tmpdir as lock dir once again

I can't explain why, but I feel like the old index format should stay the default. I feel like I'd rather have a (slightly) faster index, and switch to the compound one when/IF I encounter problems, than have a safer but slower index and never realize that there is a faster option available. Weak argument, I know, but some instinct in me thinks that the current mode should remain.

Otis

--- Doug Cutting [EMAIL PROTECTED] wrote:
Thanks for performing this benchmark! It looks like the compound format is around 7% slower when indexing. To my thinking that's acceptable, given the dramatic reduction in file handles. Would anyone object to making compound format the default for Lucene 1.4?
Filtering out duplicate documents...
I'm looking for a way to filter out duplicate documents in an index (either while indexing, or after the fact). It seems like there should be an approach that compares the terms of two documents, but I'm wondering if other folks (e.g. Nutch) have come up with a solution to this problem. Obviously you can compute the Levenshtein distance on the text, but that is way too computationally intensive to scale. So the goal is to find something that would be workable in a production system. For example, a given NYT article and its printer-friendly version should be deemed to be the same.

-Mike
RE: Filtering out duplicate documents...
That kind of fuzzy equality is an area of open research. You need to define what an acceptable rate of Type I and Type II errors is before you can think about implementations that scale better. Approaches range from identifying document vocabulary and statistics to raw hashing of the input text.

Herb...

-----Original Message-----
From: Michael Giles [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 08, 2004 4:38 PM
To: Lucene Users List
Subject: Filtering out duplicate documents...

Obviously you can compute the Levenshtein distance on the text, but that is way too computationally intensive to scale. So the goal is to find something that would be workable in a production system. For example, a given NYT article and its printer-friendly version should be deemed to be the same.
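One concrete form of "raw hashing of the input text" is w-shingling: hash every overlapping run of w words and compare documents by the Jaccard similarity of their shingle sets. This is only a sketch of the idea (the shingle width and thresholds here are illustrative, not tuned against any real corpus):

```java
import java.util.HashSet;
import java.util.Set;

public class ShingleSimilarity {
    // Collect hashes of all overlapping w-word shingles in the text.
    static Set<Integer> shingles(String text, int w) {
        String[] words = text.toLowerCase().split("\\W+");
        Set<Integer> out = new HashSet<>();
        for (int i = 0; i + w <= words.length; i++) {
            int h = 0;
            for (int j = i; j < i + w; j++) {
                h = 31 * h + words[j].hashCode(); // combine word hashes positionally
            }
            out.add(h);
        }
        return out;
    }

    // Jaccard similarity of two shingle sets: |A intersect B| / |A union B|.
    static double jaccard(Set<Integer> a, Set<Integer> b) {
        Set<Integer> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<Integer> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        String article = "the quick brown fox jumps over the lazy dog near the river bank";
        String printerFriendly = article + " today"; // near-duplicate: one word appended
        String unrelated = "completely different text about lucene index formats and file handles";

        double near = jaccard(shingles(article, 3), shingles(printerFriendly, 3));
        double far = jaccard(shingles(article, 3), shingles(unrelated, 3));
        System.out.println("near-duplicate similarity: " + near);
        System.out.println("unrelated similarity: " + far);
    }
}
```

An article and its printer-friendly version share almost all shingles, so they score near 1.0, while unrelated texts score near 0; a production system would decide "duplicate" against a chosen threshold, which is exactly where Herb's Type I/Type II trade-off comes in.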
Re: Filtering out duplicate documents...
My impression is that the new term vector support should at least make this type of comparison feasible in some manner. I'd be interested to see what you come up with if you give this a try. You will need the latest CVS codebase.

Erik

On Mar 8, 2004, at 4:37 PM, Michael Giles wrote:
I'm looking for a way to filter out duplicate documents in an index (either while indexing, or after the fact).
which query matched in a Boolean query
I have a BooleanQuery that takes 3 TermQueries, for example (title:colombo OR txt:colombo OR city:colombo). I would like to mark hits that match in the field title in red on display, txt in blue, and city in green, and maybe those that match in 2 fields in another color. Is this possible?

Thanks
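One common approach (not discussed further in this thread) is to run each single-field TermQuery separately, record which fields matched a given document, and let the display layer pick the color from that set. A stdlib sketch of just that color-selection step, with the field names from the question:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class HitColor {
    // Pick a highlight color from the set of fields the hit matched in.
    // Matching in two or more fields wins its own color.
    static String colorFor(Set<String> matchedFields) {
        if (matchedFields.size() >= 2) return "purple";
        if (matchedFields.contains("title")) return "red";
        if (matchedFields.contains("txt"))   return "blue";
        if (matchedFields.contains("city"))  return "green";
        return "black"; // shouldn't happen for an actual hit
    }

    public static void main(String[] args) {
        System.out.println(colorFor(new HashSet<>(Arrays.asList("title"))));
        System.out.println(colorFor(new HashSet<>(Arrays.asList("txt", "city"))));
    }
}
```

The per-field match information itself could come from running each TermQuery on its own and intersecting the results with the combined query's hits; that costs extra searches but keeps the ranking of the original BooleanQuery.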
Re: Caching and paging search results
On Monday 08 March 2004 12:34, Erik Hatcher wrote:
In the RealWorld... many applications actually just re-run a search and jump to the appropriate page within the hits; searching is generally plenty fast enough to alleviate concerns about caching. However, if you need to cache Hits, you need to be sure to keep the originating IndexSearcher around as well.

Further, often the search index only contains a key to the actual content indexed (which itself is stored as a file, in a database, or so)... so it's enough to cache just the set of such ids, not the actual search result objects. And assuming ids are simple (an int id, a short String), such information can be stored in, say, the user session. In the system I'm working on, we store up to 500 hits, storing only the document id (int) and hit quality (byte), kept in the session.

-+ Tatu +-
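Tatu's scheme (cache only lightweight (id, quality) pairs, capped at some maximum, and slice them per page) can be sketched with the stdlib alone; the class and method names here are illustrative, not from any of the systems mentioned:

```java
import java.util.ArrayList;
import java.util.List;

public class CachedHits {
    // One cached hit: just the document key and a coarse quality score.
    static final class Entry {
        final int id;
        final byte quality;
        Entry(int id, byte quality) { this.id = id; this.quality = quality; }
    }

    private static final int MAX_CACHED = 500; // cap like Tatu's 500-hit limit
    private final List<Entry> entries = new ArrayList<>();

    // Called once per hit, right after running the search.
    void addHit(int id, byte quality) {
        if (entries.size() < MAX_CACHED) {
            entries.add(new Entry(id, quality));
        }
    }

    // Later page requests slice the cached ids; actual content is
    // re-fetched by id from wherever it lives (file, database, ...).
    List<Entry> page(int pageIndex, int pageSize) {
        int from = pageIndex * pageSize;
        int to = Math.min(from + pageSize, entries.size());
        if (from >= to) return new ArrayList<>();
        return new ArrayList<>(entries.subList(from, to));
    }

    int size() { return entries.size(); }
}
```

An object like this is small enough to live in the HTTP session, which is exactly why caching ids beats caching Hits (and the IndexSearcher they depend on).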
DocumentWriter, StopFilter should use HashMap... (patch)
I'm looking at StopFilter.java right now... I did a kill -3 on the JVM and a number of my threads were blocked here:

"ksa-task-thread-34" prio=1 tid=0xad89fbe8 nid=0x1c6e waiting for monitor entry [b9bff000..b9bff8d0]
    at java.util.Hashtable.get(Hashtable.java:332)
    - waiting to lock 0x61569720 (a java.util.Hashtable)
    at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:94)
    at org.apache.lucene.index.DocumentWriter.invertDocument(DocumentWriter.java:170)
    at org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:111)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:257)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:244)
    at ksa.index.AdvancedIndexWriter.addDocument(AdvancedIndexWriter.java:136)
    at ksa.robot.FeedTaskParserListener.onItemEnd(FeedTaskParserListener.java:331)

Is there ANY reason to keep this as a Hashtable? It's just preventing inversion across multiple threads; they all have to lock on this hashtable. Note that this table is initialized ONCE and no more puts take place, so I don't see why not: it's read-only after the StopFilter is created. I think this might really end up speeding up indexing a bit. No hard benchmarks yet, though. Right now it's just an inefficiency that should be removed. I've attached a quick implementation.

Kevin

--
Please reply using PGP: http://peerfear.org/pubkey.asc
NewsMonster - http://www.newsmonster.org/
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
AIM/YIM - sfburtonator, Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

package org.apache.lucene.analysis;

/*
 * The Apache Software License, Version 1.1
 *
 * Copyright (c) 2001 The Apache Software Foundation. All rights
 * reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 *
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in
 *    the documentation and/or other materials provided with the
 *    distribution.
 *
 * 3. The end-user documentation included with the redistribution,
 *    if any, must include the following acknowledgment:
 *       "This product includes software developed by the
 *        Apache Software Foundation (http://www.apache.org/)."
 *    Alternately, this acknowledgment may appear in the software itself,
 *    if and wherever such third-party acknowledgments normally appear.
 *
 * 4. The names "Apache" and "Apache Software Foundation" and
 *    "Apache Lucene" must not be used to endorse or promote products
 *    derived from this software without prior written permission. For
 *    written permission, please contact [EMAIL PROTECTED]
 *
 * 5. Products derived from this software may not be called "Apache",
 *    "Apache Lucene", nor may "Apache" appear in their name, without
 *    prior written permission of the Apache Software Foundation.
 *
 * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
 * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
 * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
 * DISCLAIMED. IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
 * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
 * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
 * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
 * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 *
 * This software consists of voluntary contributions made by many
 * individuals on behalf of the Apache Software Foundation. For more
 * information on the Apache Software Foundation, please see
 * http://www.apache.org/.
 */

import java.io.IOException;
import java.util.*;

/** Removes stop words from a token stream. */
public final class StopFilter extends TokenFilter {

  // Note: this could migrate to using a HashSet
  private HashMap table;

  /** Constructs a filter which removes words from the input TokenStream
   * that are named in the array of words. */
  public StopFilter(TokenStream in, String[] stopWords) {
    super(in);
    table = makeStopTable(stopWords);
  }

  /** Constructs a filter which removes words from the input
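Kevin's underlying point is that an immutable lookup table needs no synchronization once it is safely published, so the synchronized Hashtable.get() is pure overhead. A minimal stdlib sketch of the same idea, using the HashSet the patch's own comment suggests (class and method names are illustrative, not the committed Lucene code):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class StopTable {
    // Build the table once; wrapping it unmodifiable makes the read-only
    // intent explicit, so concurrent readers need no lock at all.
    public static Set<String> makeStopTable(String[] stopWords) {
        return Collections.unmodifiableSet(new HashSet<>(Arrays.asList(stopWords)));
    }

    public static void main(String[] args) throws InterruptedException {
        final Set<String> stopTable = makeStopTable(new String[] {"a", "an", "the"});

        // Many threads can probe the set concurrently without contending
        // on a monitor, unlike Hashtable.get(), which is synchronized.
        Runnable reader = () -> {
            for (int i = 0; i < 100000; i++) {
                if (!stopTable.contains("the") || stopTable.contains("lucene")) {
                    throw new AssertionError("unexpected lookup result");
                }
            }
        };
        Thread t1 = new Thread(reader);
        Thread t2 = new Thread(reader);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println("concurrent reads completed");
    }
}
```

This is exactly the pattern Kevin describes: all puts happen before the filter is shared, after which every access is a read.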
Re: DocumentWriter, StopFilter should use HashMap... (patch)
I don't see any reason for this to be a Hashtable. It seems an acceptable alternative not to share analyzer/filter instances across threads; they don't really take up much space, so is there a reason to share them? Or I'm guessing you're sharing it implicitly through an IndexWriter, huh? I'll await further feedback before committing this change, but it seems reasonable to me.

Erik

On Mar 8, 2004, at 8:50 PM, Kevin A. Burton wrote:
I'm looking at StopFilter.java right now... I did a kill -3 on the JVM and a number of my threads were blocked here:
    at java.util.Hashtable.get(Hashtable.java:332)
    - waiting to lock 0x61569720 (a java.util.Hashtable)
    at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:94)
Is there ANY reason to keep this as a Hashtable? It's just preventing inversion across multiple threads; they all have to lock on this hashtable. I've attached a quick implementation.
Re: Lucene Taglib
Thanks for the tips and comments.

Regards,
Iskandar

----- Original Message -----
From: Erik Hatcher [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, March 08, 2004 7:48 PM
Subject: Re: Lucene Taglib

On Mar 8, 2004, at 3:46 AM, Iskandar Salim wrote:
I've worked a bit on the taglib and added an index and field tag for basic indexing capability, though I don't think it's really useful apart from, in my case, quick prototyping of web applications. What do you guys think? I'm new to Lucene and taglibs so I may have missed lots of things.

I don't think a taglib is a useful place to put indexing code. Your mileage may vary, but there are so many flags to control (field type, analyzer, boost, etc.) that it is more cleanly done directly with the Lucene API.

For the curious, you can see the 'in progress' examples and docs at http://www.javaxp.net/lucene-examples/ and http://www.javaxp.net/lucene-doc/

Nice work fleshing out documentation!

Erik, are there any requirements for the Java package names? e.g. ... to be named org.apache.lucene.taglib etc.

Yes, that package name is probably the best one.

BTW, I've included the ASL 2.0 license in the source files.

Thanks!

A few comments/suggestions:

- What if I wanted an index to live in a RAMDirectory and have it live in application scope? My suggestion here is, instead of using a path for the index, use a Directory. This allows greater freedom for the developer, and it should be pretty easy to craft a JSTL expression to wrap a string path into an FSDirectory (I don't know JSTL, but if it cannot do this then I'm disappointed - I'm in the Tapestry/OGNL world myself, where it would be trivial).

- Or, perhaps you may want a long-lived IndexSearcher, so that a Directory is only needed to construct the IndexSearcher?

- I haven't looked at your code, but is 'keywords' passed directly to QueryParser? If so, perhaps it should be renamed 'query' instead, since 'keywords' is more domain-specific and has sort of a special meaning in Lucene as Field.Keyword.

- What about allowing specification of an Analyzer? Look at how this is done in the sandbox contributions/ant area in IndexTask. I allowed the user to specify high-level strings like 'whitespace', 'stop', 'standard', etc., as well as a fully-qualified classname. I can only assume you have it hardcoded to use a particular analyzer, which is not going to be generally useful.

- It would also be nice if you allowed an optional filter to be specified - in this case I think it would probably suffice to just allow a Filter instance to be passed in rather than the taglib itself constructing one. This allows capabilities like search-within-search and more.

- What is the 'content' attribute for the search tag? Is that the default field? If so, again, I think it would be best to name it similarly to the Lucene terminology - just call it 'field' or 'defaultField'.

- SortedMap? What are you sorting on? Is 'count' necessary, since you can just ask the map what its size is?

In general it looks fine though, although I cringe seeing the amount of code your examples have in them with all the scriptlet junk. It seems quite yucky to me given that I'm now in the elegant Tapestry world where I could hide the *entire* tag in an HTML template with something like: table jwcid=results/ - and no, I'm not kidding, and yes, there would be more behind the scenes, but separate from the view. And the example includes all the paging controls.

Erik
Re: Lucene Taglib
On Mar 8, 2004, at 10:21 PM, Iskandar Salim wrote:
Thanks for the tips and comments.

Also, there was a big smiley implicit in my JSP taglib rantings below. Certainly no offense intended. I've paid my Struts/taglib dues and am now deep into a completely different web development paradigm that I find quite enjoyable and refreshing. Your taglib is nicely done.

Erik
Re: Lucene Taglib
----- Original Message -----
From: Erik Hatcher [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Tuesday, March 09, 2004 11:51 AM
Subject: Re: Lucene Taglib

Also, there was a big smiley implicit in my JSP taglib rantings below. Certainly no offense intended.

None taken. :)

I've paid my Struts/taglib dues and am now deep into a completely different web development paradigm that I find quite enjoyable and refreshing.

Heard too many good things about Tapestry. Have to learn it some day ;)

Regards,
Iskandar