Re: Lucene vs. in-DB-full-text-searching
On Fri, Feb 18, 2005 at 04:45:50PM -0500, Mike Rose wrote:

> I can comment on this since I'm in the middle of excising Oracle text searching and replacing it with Lucene in one of my projects.

Interesting, particularly as it's from somebody who's already tried an existing in-DB fulltext search feature.

> All in all, I don't think that a JDBC wrapper is going to do what you want.

I wasn't thinking about trying to do the whole thing under the JDBC driver. Mainly I was thinking that one key point is that you need to treat the Lucene index somewhat like a cache. This also means that you have to watch database writes and make sure you update your cache, which means you have to have some sort of single point of data access to monitor. Well, we already have that - it's called the JDBC driver.

The general design I was eyeing speculatively is basically that the driver would be set up with a reference to an object that implements a CacheManager interface. This interface gives the driver a way to notify the cache manager when certain tables and columns are being edited. Exactly how is another question. I don't know enough of the innards of, say, a PreparedStatement, to say more. It could be as simple as sending the CacheManager a copy of every SQL query string and letting the CacheManager figure out the rest. Ideally I'd like it to be a little more structured.

From there, it's the CacheManager's job to decide what to do about it, and how to do it. This leaves the tricky issue of mapping from a specific database to a specific Lucene index up to the developer.

-- Steven J. Owens [EMAIL PROTECTED]

I'm going to make broad, sweeping generalizations and strong, declarative statements, because otherwise I'll be here all night and this document will be four times longer and much less fun to read. Take it all with a grain of salt. - http://darksleep.com/notablog -

To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
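A minimal sketch of the CacheManager idea above. The names (CacheManager, DirtyTableTracker) are hypothetical, and the implementation takes the "simple" option from the post - a naive regex over raw SQL strings rather than structured notifications from the driver internals:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical interface: the wrapped JDBC driver would call sqlExecuted()
// for every statement it passes through.
interface CacheManager {
    void sqlExecuted(String sql);
}

// Naive implementation: extract the table name from INSERT/UPDATE/DELETE
// statements and record that table as needing Lucene reindexing.
class DirtyTableTracker implements CacheManager {
    private static final Pattern WRITE = Pattern.compile(
        "^\\s*(?:insert\\s+into|update|delete\\s+from)\\s+([A-Za-z_][A-Za-z0-9_]*)",
        Pattern.CASE_INSENSITIVE);

    private final Set<String> dirty = new HashSet<String>();

    public void sqlExecuted(String sql) {
        Matcher m = WRITE.matcher(sql);
        if (m.find()) {
            dirty.add(m.group(1).toLowerCase());
        }
    }

    public Set<String> dirtyTables() {
        return dirty;
    }
}
```

A real CacheManager would then reindex (or delete) the affected Lucene Documents; regex-parsing SQL is obviously fragile, which is exactly why the post suggests a more structured notification from the driver would be preferable.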
Lucene vs. in-DB-full-text-searching
Hi,

I was rambling to some friends about an idea to build a cache-aware JDBC driver wrapper, to make it easier to keep a Lucene index of a database up to date. They asked me a question that I have to take seriously: most RDBMSes provide some built-in fulltext searching - postgres, mysql, even oracle - so why not use that instead of adding another layer of caching?

It's an especially fair question since it reminds me a lot of what Doug has often said to folks contemplating doing similar things (caching query results, etc.) with Lucene. Has anybody done some serious investigation into this, and could summarize the pros and cons?

-- Steven J. Owens [EMAIL PROTECTED]
Re: Collaborative Filtering API
On Tue, Nov 25, 2003 at 01:18:19PM -0500, Michael Giles wrote:

> Yes, he was the lead Ph.D. student on the GroupLens project at Minnesota.

I've actually worked on a system that bundled GroupLens. I think it was Vignette StoryServer. The Vignette docs were incredibly dense with MarketingNewSpeak, so I could never quite figure out what they said GroupLens actually *did* (not at a web-capable terminal right now, or I'd just google it).

Collaborative filtering in general is a topic I'm interested in, and is why I first got into Lucene. I wanted, and still want, to build a collaborative filtering search engine for mailing lists and the like. I do remember that FireFly's engine was supposed to graph all of the users' ratings on a topic in an N-dimensional space, then find users close to the current user in that space, and suggest topics that they'd liked but that the current user hadn't rated.

I'm interested in more of a free-market sort of approach than in statistical analysis; I want to build a system that helps users express their opinions, then nurture an emerging consensus. My experience has been that systems and technologies that try to facilitate the way users already do things, instead of replacing them with new ways of doing things, tend to work better.

-- Steven J. Owens [EMAIL PROTECTED]
Re: Lucene demo ideas?
On Wed, Sep 17, 2003 at 08:00:42AM -0400, Erik Hatcher wrote:

> I'm about to start some refactorings on the web application demo that ships with Lucene to show off its features and be usable more easily and cleanly out of the box - i.e. just drop into Tomcat's webapps directory and go. Does anyone have any suggestions on what they'd like to see in the demo app?

One odd thought (may be out of scope) is to put together a google-flavored query language, since most users are going to be unfamiliar with the default Lucene query language. Lucene doesn't really match google, but something google-flavored might be better at showing off Lucene's features in the demo.

-- Steven J. Owens [EMAIL PROTECTED]
Re: Lucene features
On Wed, Sep 03, 2003 at 02:42:48PM -0400, Chris Sibert wrote:

> I am wondering if Lucene is the way to go for my project.

Probably. Tell us a little about your project.

> It's pretty basic. I'm just indexing 4 large text files, ranging up to 100MB in size. They don't ever change, and are on a CD-ROM. Each file contains a bunch of small documents. I just create one index for all 4 of them. These documents are for an association that I belong to - they contain a history of the association's documents - and my application allows you to search them.

Well, aside from your concerns about the second list, Lucene seems perfect for your needs. You'd parse apart the four big files into a bunch of small documents, then parse those small documents to create Lucene Documents containing Fields, and add them to the index.

> They are actually currently indexed by an application called 'Sonar', by Virginia Systems. But I REALLY didn't like using their user interface - blech - so I decided to write a new interface for my own use. But Sonar costs some real bucks to be able to develop against their search API, so I found Lucene, and decided to go with it. Here are the search features that 'Sonar' has: Boolean Searching, Proximity Searching, Wild Card Searching, Field/Block Searching

I'm not sure what Field/Block means. Boolean, Proximity, and WildCard searches are pretty typical in Lucene. You should probably take a look at the Query Parser syntax docs: http://jakarta.apache.org/lucene/docs/queryparsersyntax.html

> Relevancy Ranking / Date Ranking

Lucene search results are typically ranked by relevance, and you can tweak the search to adjust this (there's a fair bit of discussion of this in the lucene-user archives; good keywords to look for are slop and boost). Sorting output by date might take some finesse. I haven't played with sorting by date, but I'd expect to handle that by directly instantiating a QueryTerm to indicate the date issues.
> List of Occurrences in Context

I assume here that you mean displaying the results with a little snapshot of the text around each hit. There have been discussions about how best to do this (often focused on highlighting the search terms in the displayed text) on the lucene-users list. Check the list archive.

> Phonetic Searching

I'd guess you need to build this one yourself, perhaps by using a soundex algorithm when indexing the original data files.

> Synonyms/Concepts

Likewise... you'd need to come up with some sort of ontology of synonyms and concepts, then parse the fields you're indexing and generate a synonym/concept field that you'd add to the Lucene Document.

> Relational Searching, Associated Words, Drill Down / Search Narrowing

I'm not sure what these three mean.

> I think that Lucene has all the features in the first group. How does it stack up against the second group?

I'm afraid I haven't been too helpful here. Perhaps if you clarify what the above mean, folks can post about how to implement them in Lucene.

> I'm writing the whole thing in Swing, which has been time consuming, and so have invested quite a bit of time into this project. But I'm seeing the end of the tunnel, and want to make sure that I'm going down the right path before I spend too much more time on it.

It sounds like you ought to at least seriously consider using Lucene, if you can find or implement equivalent features, or decide you can live without them.

-- Steven J. Owens [EMAIL PROTECTED]
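The soundex suggestion above can be sketched with the classic American Soundex algorithm (the class and method names here are just for illustration). You would index the soundex code of each token in a separate field alongside the original text, and run phonetic queries against that field:

```java
// Classic American Soundex: encodes a word as a letter plus three digits,
// so that similar-sounding names (Smith/Smyth) collapse to the same code.
class Soundex {
    // One code digit per letter a..z ('0' = vowel-ish, never emitted).
    private static final String CODES = "01230120022455012623010202";

    static String soundex(String name) {
        String s = name.toUpperCase().replaceAll("[^A-Z]", "");
        if (s.isEmpty()) return "";
        StringBuilder out = new StringBuilder();
        out.append(s.charAt(0));
        char prev = CODES.charAt(s.charAt(0) - 'A');
        for (int i = 1; i < s.length() && out.length() < 4; i++) {
            char c = s.charAt(i);
            char code = CODES.charAt(c - 'A');
            if (c == 'H' || c == 'W') continue; // H and W don't separate duplicate codes
            if (code != '0' && code != prev) out.append(code);
            prev = code;
        }
        while (out.length() < 4) out.append('0'); // pad to four characters
        return out.toString();
    }
}
```

At search time you soundex the user's query terms the same way, so "Smyth" finds documents that contain "Smith".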
Re: Lucene features
On Wed, Sep 03, 2003 at 12:45:25AM -0400, Chris Sibert wrote:

> I am wondering if Lucene is the way to go for my project.

Probably. Tell us a little about your project.

> I don't know what other search engines are available out there,

Lucene isn't a search engine _application_, it's a search engine _API_. Lucene gives you what you need in order to build the search engine you want, instead of spending gobs of time trying to figure out the 10,000 options available for a search engine application, or trying to warp somebody else's idea of what you need to match what you really need.

> and how Lucene stacks up against them.

Pretty well, if you're willing to put a (very) little time and energy into building the application you need. I know. I've done it.

> I am wondering if Lucene has a full set of searching features, comparable to what I might find in a reasonably priced commercial package.

There is no comparison :-). Lucene is a fundamentally decent piece of technology. This puts it head and shoulders above most commercial packages. Specifically, the Lucene search engine API is blindingly fast at searching and at indexing, and comes with several built-in packages that provide commonly needed functions (like a web-search-engine-style query language parser). Additionally, a wide variety of people have been down this road and done a wide variety of things with Lucene, so you're likely to find examples, in the Lucene sandbox or in the lucene-user archives, of how to do whatever it is you want to do.

> Anyone with a solid knowledge of Lucene care to make me feel warm and fuzzy about my decision so far to use Lucene?

Tell us a little more about your project requirements and I'll tell you enough specifics to give you a warm and fuzzy feeling. Lucene isn't perfect for _everything_ (and anybody who claims that a given technology *is* perfect for _everything_ is lying). But it's quite good for a number of things.

-- Steven J. Owens [EMAIL PROTECTED]
Apache Wiki at nagoya.apache.org
Hey folks,

They've put up an Apache wiki at nagoya. I took the liberty of writing a paragraph about Lucene; please feel free to completely revise it :-).

http://nagoya.apache.org/wiki/apachewiki.cgi?LuceneProjectPages

Steven J. Owens [EMAIL PROTECTED]
Re: Let me get started
On Wed, Nov 13, 2002 at 05:14:25PM +0530, Uma Maheswar wrote:

> Thanks for the messages. Yes I wanted to index .jsp files also. Is it possible?

It's possible, but you'll need to write code to select and parse the JSP files. There may be code in the sandbox area at jakarta.apache.org/lucene for doing this, though I don't see it.

> I thought we need a database to store some values and then retrieve them back. Don't we need a database for it?

Nope, Lucene stores search data in its own files. You can easily use Lucene to build a search engine for data that's stored in a database, but you don't need a database.

Steven J. Owens [EMAIL PROTECTED]
Re: Homogeneous vs Heterogeneous indexes (was: FileNotFoundException)
petite,

On Mon, Apr 29, 2002 at 07:54:43PM +0200, petite_abeille wrote:

> As a final note, several people suggested to increase the number of file descriptors per process with something like ulimit...

Just be glad you aren't doing this on Solaris with JDK 1.1.6, where I first ran into ulimit issues - back when I encountered this problem, the Solaris default ulimit setting was 24 files, and JDK 1.1.6 reported the problem as an OutOfMemory error! Looks like things are improving :-).

> From what I learned today, I think it's a *bad* idea to have to change some system parameters just because your/my app is written in such a way that it may run out of some system resources. Your/my app has to fit in the system. Hacking ulimit and/or other system parameters is just a quick patch that will - at best - delay dealing with the real problem, which is usually one of design.

Yes and no. Setting ulimit to a reasonable number of open files is not only not a patch, it's the right way to do it. I understand where you're coming from, really, and in a certain way it makes sense, BUT... sometimes the impulse for clean, good design takes you too far down a blind alley. Sometimes there is no elegant solution. Sometimes there is no best way, only one of a limited set of options with different tradeoffs.

By definition, Lucene is an application that trades off up-front CPU (for indexing) and file resources (for storage) for request-time speed. The OS's job is to manage resources, and open files are one of those resources. That's the tradeoff here, and it's reasonable and expected. Most serious applications have to have some sort of OS variable tweaking; you're just used to having it done invisibly and painlessly.

That said, since you're working on a client/desktop application, not a server application, you need to think about ways to handle this:

- You could figure out the right way to set the system configuration on install or launch.
- You could look at the alternative techniques for indexing in Lucene, and see if any approaches there can help - for example, maybe doing a lot of the more intense indexing work in a RAMDirectory, then merging it into a normal file-based Directory.

- You could look more closely at what your application is doing, and see if there's anything you're doing wrong (perhaps opening files and not closing them, and leaving them for the garbage collector to eventually get around to closing?) or if you have a pessimal usage pattern that exacerbates the situation.

- You could take a closer look at the Lucene indexing and file management code, and see if you can come up with a scheme to run Lucene indexing with modified code for keeping track of file resources.

I'll bet Doug and the other developers would rather not add open-file management as a main, permanent part of Lucene, since it would add overhead to all uses of Lucene just to deal with an anomalous situation (use on a client/desktop machine). But they might be interested in a way to offer it as an optional feature, where people using Lucene in a constrained environment could configure Lucene to be careful about how many files it keeps open at any given time.

Steven J. Owens [EMAIL PROTECTED]
Re: rc4 and FileNotFoundException: an update
On Fri, Apr 26, 2002 at 07:05:23PM +0200, petite_abeille wrote:

> I guess it's really not my day... [...] Well, it's pretty ugly. Whatever I'm doing with Lucene in the previous package (com.lucene) is magnified many fold in rc4. After processing a paltry 16 objects I got: SZFinder.findObjectsWithSpecificationInStore: java.io.FileNotFoundException: _2.f14 (Too many open files)

Sounds like a pretty nasty situation. One suggestion I have for you is that Doug is usually very helpful with problems like this IF you can first narrow down what is happening to the point that you can post a clear, specific, isolated test that consistently causes the problem. This makes sense - any effort to solve the problem will first involve isolating the bug, and that's a task you're best suited for, since you know your system best. So maybe your best approach would be to take a copy of your system as above, and start gradually stripping stuff out, testing between each run, until you have most of the application-specific code removed but the problem still occurs consistently. Then post your code and ask if some of the more Lucene-knowledgeable folks can take a look.

Re: index integrity, I agree that it would be really, really nice to have some sort of sanity check. I have yet to actually get into the internals of the index, but I'd guess that there must be some sort of at least superficial check, maybe some sort of format check. If I were going to kludge something together, the first approach I'd take would be to just open the index and roll through all of the Documents in it, accessing all of the fields (or maybe just a few main fields per Document). I'm not sure what I'd *do* with the field values (printing them out to the screen might take a while), other than perhaps checking for nulls. But I suspect that if the code gets through that without causing an exception or getting null values, then at least the index's internal format is intact.
Maybe the test code could save the number of Lucene Document objects in the index between checks (and, of course, update this number when you add or remove documents), and make sure it still has the right number of documents.

As for repairing an index, I think that's working against the grain of Lucene. In your case, it sounds like rebuilding the index is important, because you're using Lucene as a data store. I have some similar issues myself in some things I want to build (I end up wanting both a data store and a search index; ultimately I've chosen to have a separate data store for the extra data). But Lucene is a search index, meant to be used in a more cache-like style, so there's an underlying assumption that the original data is always around to reindex. Thus, repairing an index is less important, since it is assumed you can always rebuild it.

I don't know much of the theory behind data store systems. It occurs to me that using Lucene as a data store, you'll always be working against the grain, always swimming upstream. Maybe it'd be a better idea to figure out some way to use Lucene as the indexing technology in a data store, the way traditional RDBMSes use indexes for speeding access. Or possibly you should look at Xindice (http://xml.apache.org/xindice/), which is an XML database. You might find it easier to adapt that to your needs. I'm kind of curious how fast Xindice's XPath execution is, and what their indexing is based on - there might be a use for Lucene there.

Steven J. Owens [EMAIL PROTECTED]
Re: Relevance Feedback
On Fri, Mar 29, 2002 at 12:11:03PM -0800, Joshua O'Madadhain wrote:

> On Fri, 29 Mar 2002, Nathan G. Freier wrote: I'm just beginning to plan out some mechanisms for query expansion and relevance feedback.

> I've also been doing research in IR using the Lucene API. [...] If you'd like to discuss this offline (since we may be getting off the list topic), feel free to email me.

I'm curious about this topic, although I have absolutely no familiarity with it (beyond reading, many years ago, about the real estate browsing UI experiment where they let users click on inappropriate listings and refined the search based on that - a feature I've often wished for with web search engines). If you could either include me in the CC list, or send me a summary, or possibly (if others are also interested) continue the discussion here, I'd appreciate it.

Steven J. Owens [EMAIL PROTECTED]
Re: Question on the FAQ list with filters
On Wed, Mar 27, 2002 at 03:52:21PM -0600, Armbrust, Daniel C. wrote:

> From the FAQ: 16. What is filtering and how is it performed?
>
> * Search Query - in this approach, provide your custom filter object when you call the search() method. This filter will be called exactly once to evaluate every document that resulted in a non-zero score.
>
> * Selective Collection - in this approach you perform the regular search and, when you get back the hit list, collect only those that match your filtering criteria. In this approach, your filter is called only for hits that are returned by the search method, which may be only a subset of the non-zero matches (useful when evaluating your search filter is expensive).
>
> I don't see why the second way is useful. Yes, your filter is called only for hits that got returned by the search method, but aren't those the same hits that the search() method would run through the filter? Maybe I'm just not reading it closely enough. Is my assumption correct that it is faster to provide a filter to the search() method than to do a selective collection?

It Depends. That's more or less the point of the FAQ answer, though it could be more clearly expressed. The gist of the FAQ seems to be that you can either do the filtering BEFORE you do the search, or AFTER you do the search. Obviously the question is, which is more expensive: filtering out inappropriate documents, or searching for the possible hits? If filtering is cheaper, you do the filtering first, then the search. If filtering is expensive, you do the search first, then the filtering.

You should also factor in which is more restrictive - will either the filter or the search drop out a large number of the documents? If you can arrange it so one is both cheaper and drops out the majority of the documents, you win. In either case, you implement some sort of object which you can hand an org.apache.lucene.TermDocs and get back a yes or no as to whether it's a valid possible search result.
From looking at the source for org.apache.lucene.search.Filter, org.apache.lucene.search.DateFilter, and org.apache.lucene.search.IndexSearcher, it appears that you instantiate your Filter subclass; then, for filtering BEFORE the search, you pass YourFilter an IndexReader and get back a BitSet. Or more to the point, when you invoke IndexSearcher.search(), you pass it YourFilter and a HitCollector, and IndexSearcher.search() gets the BitSet from YourFilter.

A BitSet, from the JDK API, is a vector of bit values (i.e. 1 or 0, corresponding to the Java boolean values true and false). It appears, from looking at the source, that each bit in the BitSet corresponds to the TermDoc at the same sequential location in the index. IndexSearcher.search() has an inner class (this is a bit ambiguous and it's been a year since I've looked at inner classes, so I'm going to just handwave and move along :-) with a collect() method that loops through the termDocs, skipping the ones for which BitSet.get() returns false.

I'm not sure exactly how you would use an org.apache.lucene.search.Filter to do the filtering AFTER, but presumably that would involve just handing it the TermDocs in question, or maybe IndexReader and Hits both implement a common interface... uhm, no, that's not it. Well, I guess you use your own class for the filter. That's what I ended up doing anyway, in my ignorance of the Filter abstract class. I ended up doing my filtering AFTER, btw, because it involved some expensive lookups in other documents.

There's actually a third option: figure out a way to implement your filter as an additional boolean clause on your search. However, that may or may not be feasible, or the Lucene Filter mechanism may not have been intended to address such cases. To be honest, the design of the Filter seems less well-thought-out than the rest of Lucene, like it's an afterthought.
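The BitSet mechanics described above can be demonstrated with plain java.util.BitSet, independent of Lucene. The doc array and substring matching here are crude stand-ins for the real index and scoring; the point is only how bit i gates document i before it can become a hit:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Stand-in for pre-search filtering: bit i of "allowed" says whether
// document i may appear in the results at all.
class BitSetFilterDemo {
    static List<Integer> search(String[] docs, String term, BitSet allowed) {
        List<Integer> hits = new ArrayList<Integer>();
        for (int i = 0; i < docs.length; i++) {
            if (!allowed.get(i)) continue;           // filtered out before matching
            if (docs[i].contains(term)) hits.add(i); // crude stand-in for a term match
        }
        return hits;
    }
}
```

This mirrors the collect() loop described above: the searcher walks the candidate docs and consults the filter's BitSet to skip disallowed ones.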
I really oughta join the developers list, I guess, so I can put my money where my mouth is, and submit changes to clarify the docs, etc., when I go roaming through the source.

Steven J. Owens [EMAIL PROTECTED]
Re: corrupted index
Otis,

> You can remove the .lock file and try re-indexing or continuing indexing where you left off. I am not sure about the corrupt index. I have never seen it happen, and I believe I recall reading some messages from Doug Cutting saying that the index should never be left in an inconsistent state.

Obviously it never should be, but if something's pulling the rug out from under his JRE, changes could be only partially written, right? Or is the writing format in some sense transactionally safe? I've never worked directly on something like this, but I worked at a database software company where they used transaction semantics and a journaling scheme to fake a bulletproof file system. Is this how the index-writing code is implemented?

In general, I can guess Doug's response - just torch the old index directory and rebuild it; Lucene's indexing is fast enough that you don't need to get clever. This seems to be Doug's stance in general (i.e. don't get fancy, I already put all the fanciness you'll need into extremely fast indexing and searching). So far, it seems to work :-).

> I could be making this up, though, so I suggest you search through the lucene-user and lucene-dev archives on www.mail-archive.com. A search for corrupt should do it. Once you figure things out maybe you can post a summary here.

I got a little curious, so I went and did the searches. There is exactly one message in each list archive (dev and users) with the keyword corrupt in it. The lucene-users instance is irrelevant: http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00557.html

The lucene-dev instance is more useful: http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg00157.html

It's a post from Doug, dated Sept 27, 2001, about adding not just thread-safety but process-safety:

> It should be impossible to corrupt an index through the Lucene API. However if a Lucene process exits unexpectedly it can leave the index locked.
> The remedy is simply to, at a time when it is certain that no processes are accessing the index, remove all lock files.

So it sounds like it's worth trying just removing the lock files. Hmm, is there a way to come up with a sanity check you can run on an index to make sure it's not corrupted? This might be an excellent thing to reassure yourself with: something went wrong? Run a sanity check; if it fails, just reindex.

Steven J. Owens [EMAIL PROTECTED]
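A minimal sketch of the remove-all-lock-files remedy quoted above, assuming (as in the Lucene versions discussed here) that lock files are simply files ending in .lock in the index directory. The class name and helpers are hypothetical, and per Doug's caveat this should only run when no process is using the index:

```java
import java.io.File;
import java.io.IOException;

class LockCleaner {
    // Deletes every *.lock file in the given index directory and returns
    // how many were removed. Run only while the index is idle.
    static int removeLockFiles(File indexDir) {
        int removed = 0;
        File[] files = indexDir.listFiles();
        if (files == null) return 0; // not a directory
        for (File f : files) {
            if (f.getName().endsWith(".lock") && f.delete()) {
                removed++;
            }
        }
        return removed;
    }

    // Small helpers for trying this out (not part of any Lucene API).
    static File tempDir(String prefix) {
        File d = new File(System.getProperty("java.io.tmpdir"), prefix + System.nanoTime());
        d.mkdirs();
        return d;
    }

    static void touch(File f) {
        try {
            f.createNewFile();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Real index segment files (like the _2.f14 mentioned earlier in the thread) are left alone; only the lock files go.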
Re: Indexing Dynamically Generated Web Pages
Paul Friedman [EMAIL PROTECTED] asks:

> I am a novice developer researching Lucene for use on a web site that primarily uses JSPs. How do you index dynamically generated web pages with Lucene? Or is it even possible? The JSPs themselves don't have searchable data, only methods to get that data. When parsing these files for indexing, the necessary data for the search would not yet be in the page.

Lucene doesn't do any crawling for you - it's your responsibility to write the code that crawls your website, i.e. selects the data sources to be indexed, chooses how to convert them to Lucene Documents, and adds them to the indexes. I suspect you're assuming you would use the demo application. That would not be appropriate for your project. Instead, what you should do is:

1) Write some Java code that walks through your data source - i.e. if your data source is a database, it would select each row in the appropriate table - composes a Lucene Document, and adds it to your index.

2) Write some servlets/JSPs that do the search, and when it gets to the point where you would redirect the user's browser off to the appropriate HTML page, you either:

a) invoke the appropriate JSP in your site with appropriate arguments, or

b) compose a page dynamically, roughly the same way the JSP page would compose it, and return it to the user, or

c) redirect the user to a new JSP page, which you will have to write, which looks much like your old JSP page but is designed to be invoked with an argument specifying what data to load.

You may find http://darksleep.com/puff/lucene/lucene.html useful in further clarifying this. There is also a getting-started article at http://jakarta.apache.org/lucene/docs/index.html that may be of some use.

Steven J. Owens [EMAIL PROTECTED]
Re: excluding files / refining search
Brian Rook [EMAIL PROTECTED] writes:

> The site I'm working on has a lot of small html files that are used for page construction (nav bars, footers, etc) and they're being returned high in the results because they contain the search term(s) I'm looking for and are small, so they rank higher than larger documents. I want to exclude them from the index and I've come up with two ideas: 1) move them to a directory, which I will exclude from the index, but I'll have to change a bunch of links 2) detect them with some sort of flag and exclude them from the index. We were thinking that we could have a fake tag that lucene would detect and not index those pages.

Why not just have an exclude list of some sort? In the code you wrote to select files for indexing, just have it check against a list of files you want to exclude. In the demo application, you would edit jakarta-lucene/src/demo/org/apache/lucene/demo/IndexFiles.java. The quick and dirty method would be to change this section of code:

    public static void indexDocs(IndexWriter writer, File file) throws Exception {
        if (file.isDirectory()) {
            String[] files = file.list();
            for (int i = 0; i < files.length; i++)
                indexDocs(writer, new File(file, files[i]));
        } else {
            System.out.println("adding " + file);
            writer.addDocument(FileDocument.Document(file));
        }
    }

To something like this (checkFileName() returns true for files that should be skipped):

    public static void indexDocs(IndexWriter writer, File file) throws Exception {
        if (file.isDirectory()) {
            String[] files = file.list();
            for (int i = 0; i < files.length; i++)
                indexDocs(writer, new File(file, files[i]));
        } else {
            if (checkFileName(file)) {
                System.out.println("skipping " + file);
            } else {
                System.out.println("adding " + file);
                writer.addDocument(FileDocument.Document(file));
            }
        }
    }

    public static boolean checkFileName(File file) {
        String name = file.getName();
        return name.equals("footer.html")
            || name.equals("header.html")
            || name.equals("menu.html")
            || name.equals("navbar.html");
    }

A more realistic implementation would use an exclude file of filenames
to ignore, load them into a collection (probably a HashSet) and keep that collection around as an instance variable. Then checkFileName() just returns !excludedSet.contains(name). Steven J. Owens [EMAIL PROTECTED] -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
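The exclude-set variant described above might look like the following minimal sketch. The class name is mine, and the hard-coded filenames stand in for loading an exclude file, which a real implementation would do once at startup:

```java
import java.io.File;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ExcludeList {
    // In a real indexer these names would be loaded from an exclude file
    // and kept as an instance variable; they are hard-coded here only to
    // keep the sketch self-contained.
    private static final Set<String> EXCLUDED = new HashSet<>(Arrays.asList(
            "footer.html", "header.html", "menu.html", "navbar.html"));

    // Mirrors checkFileName() above: true means "OK to index".
    public static boolean checkFileName(File file) {
        return !EXCLUDED.contains(file.getName());
    }

    public static void main(String[] args) {
        System.out.println(checkFileName(new File("docs/footer.html"))); // false
        System.out.println(checkFileName(new File("docs/index.html")));  // true
    }
}
```

Lookups in the HashSet are O(1), so this stays cheap no matter how long the exclude list grows.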
eyebrowse database dependency?
Hi folks, I've been looking at Eyebrowse, an email archive indexing program built on top of Lucene. Eyebrowse also uses MySQL to store database tables about messages (tables containing mbox filenames, message locations, authors, subjects, threads, date ranges, etc.). When I was thinking about designing a similar store, I expected that most of the details could be stored as Lucene fields. Some of them might be a bit of a stretch, and certainly there would have to be at least a few non-Lucene details stored elsewhere (like the most recent mbox file and the location of the end of the most recently indexed message), but those seem, to my inexperienced eye, tolerable in order to avoid the overhead of an entire database added to the system.

I believe one of the Eyebrowse developers is a member here. I'm curious as to the design factors that influenced the choice to use a database in addition to Lucene itself.

Steven J. Owens [EMAIL PROTECTED]
Re: OutOfMemoryError
Chantal, From what I know now, I had a bug in my own code. Still, I don't understand where these OutOfMemoryErrors came from. I will try to index again in one thread without RAMDirectory, just to check that the program is sane.

Java often has misleading error messages. For example, on Solaris machines the default ulimit used to be 24 - that's 24 open file handles! Yeesh. This can cause an OutOfMemoryError. So don't assume it's actually a memory problem, particularly if a memory problem doesn't make sense. Just a thought.

Steven J. Owens [EMAIL PROTECTED]
Re: OutOfMemoryError
I wrote: Java often has misleading error messages. For example, on Solaris machines the default ulimit used to be 24 - that's 24 open file handles! Yeesh. This will cause an OutOfMemoryError. So don't...

Jeff Trent replied: Wow. I did not know that! I also don't see an option to increase that limit from java -X. Do you know how to increase that limit?

That's "used to be"; I think it's larger on newer machines. I don't think there's a java command line option to set this; it's a system limit. The shell command to check it is ulimit. To set the open-file limit for a given login process (assuming sufficient privileges), use ulimit -n number (e.g., ulimit -n 128). ulimit -a prints out all limits.

Steven J. Owens [EMAIL PROTECTED]
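A quick sketch of the ulimit commands discussed above, using the -n flag for the open-file-descriptor limit (in some older shells a bare "ulimit number" sets the file-size limit instead, so -n is the portable form):

```shell
# Show the current soft limit on open file descriptors.
ulimit -n

# Show all resource limits for this shell.
ulimit -a

# Set the descriptor limit for this shell and its children
# (raising it above the hard limit requires privileges).
ulimit -n 256
```

The change only lasts for the current shell session; system-wide defaults live in the OS configuration, not in the JVM.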
Re: Tutorial
Is anyone else out there creating a tutorial? I would be willing to compile and coordinate all the different parts for the people who want to contribute.

I wrote something up a few months ago and posted it to the list. I've been meaning to get back to it: edit it again, spiff it up a bit, maybe come up with some updated code for the examples written on my own time (instead of my employer's) so I can publish it, etc. But right now, the current draft is at: http://darksleep.com/puff/lucene.html

Steven J. Owens [EMAIL PROTECTED]
Re: Context specific summary with the search term
Lee Mallabone wrote: Okay, I'm now not entirely certain how useful a generic solution will be to me, given the non-generic nature of the content I'm indexing. I think there are a lot of optimizations I can make that wouldn't be generic.

Premature optimization is the root of all evil. Seriously, though, one thing I see Doug say often is that Lucene's indexing and searching are designed to be extremely fast. He often responds to questions about odd details - for example, the classic "do a search and cache the search results for paging across multiple web pages" - by saying to just use the brute-force approach and rely on the speed of the Lucene index.

I like to say that I assume there are people out there with a lot more on the ball than me about things like optimization, and I try to use their brains as much as possible :-). For example, with compilers, I assume the compiler writer knew a lot more about optimization than I do. People talk about the compiler not having the human judgement to know what's best. That's true, but the way to deal with that is not to try to hand-optimize my code and outguess the compiler (which will only confuse the compiler and prevent it from doing what it was designed to do). The compiler can best optimize the program if I focus on making it clear what my intent is - what the program is meant to do - in the structure of the code first.

This leads to another optimization slogan that I remember reading: algorithmic optimization is much better than spot optimization. In other words, before you try to figure out a faster way to do something, figure out whether you're doing the thing that accomplishes your true goal in the fastest way - and figure out how important that thing is in the grand scheme of things.

Steven J. Owens [EMAIL PROTECTED]
Re: new Lucene release: 1.2 RC2
Sunil, Two weeks back I did have the problem which I stated, but I am unable to reproduce the results currently. I tested and retested but couldn't repeat the same. Doug, have you guys fixed the issue long back itself? (The only thing I have done fresh is to download the latest lucene-1.2-rc1.zip file and re-install lucene, since it came along with source code.)

I believe what Sunil was trying to describe was: a) successfully created index; b) did searches on index; c) started to update index; d) clicked exit from app before updating completed; e) ran app again; f) could not repeat searches from step b.

Doug has mentioned several improvements in the past month or so. I'm looking forward to moving over to the new version, which is, among other things, thread safe. This sounds like the area where you were probably having problems, which means it's likely that Doug's changes fixed it. This is, by the way, why it's important to report problems early, ideally along with test data and code to reproduce them, or at least detailed descriptions of how to reproduce them...

Steven J. Owens [EMAIL PROTECTED]
Re: Index Optimization: Which is Better?
Doug wrote: I'm having trouble getting a clear picture of your indexing scheme.

I've been doing a lot of thinking about this same problem, so I may be a little more in tune with what Elliot's saying. By the way, Elliot, I'm very interested in your results. I considered the basic approach you're using, but I thought it was a bit extreme in terms of having zillions of tiny lucene Documents. I'm working on a quick kludge that may serve my immediate purposes (if it does, I'm planning to post the details here).

Could you provide some simple examples, e.g., for the xml:

    <tag1>this is some text
      <tag2>and some other text</tag2>
    </tag1>

would you have something like the following?

    doc1
      node_type: tag1
      contents: this is some text
    doc2
      node_type: tag2
      contents: and some other text
    doc3
      node_type: all_contents
      contents: this is some text and some other text

I think that's exactly what Elliot is intending.

My first instinct would be to have something like:

    doc1
      tag1: this is some text
      tag2: and some other text
      all-tags: this is some text and some other text

What do you need that that does not achieve?

Name collision - you can have multiple elements at different levels, and you may have attributes and tags with the same name. Obviously one way around this is "don't do that", but that could get really tiresome, quickly. If you just conflate the elements and attributes under the same name (i.e. field blah contains a concatenated set of values from all occurrences of both elements and attributes) then your searches become much more limited in what you can specify. This is, by the way, the approach I'm trying out, with a second stage to refine the results and drop out false positives. But I'll have to wait on saying any more about that. All of this, of course, is in the context of having arbitrary XML documents. If you have predefined XML schemas then you can hand-code the mappings from elements to lucene document fields. 
But then you trade a heck of a lot of flexibility for a lot of maintenance. Steven J. Owens [EMAIL PROTECTED]
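One way to sidestep the name collision problem described above (my own illustration, not anything proposed in the thread) is to qualify each field name with the element's full path before handing it to Lucene. The helper below uses only the JDK's DOM parser; each resulting map entry would become one Lucene field (name = path, value = text):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class PathFields {
    // Collect text content keyed by slash-separated element path, so a
    // <tag2> nested inside <tag1> becomes field "tag1/tag2" rather than
    // colliding with a top-level <tag2>.
    static void collect(Node node, String path, Map<String, StringBuilder> fields) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                fields.computeIfAbsent(path, k -> new StringBuilder())
                      .append(text).append(' ');
            }
        }
        String childPath = node.getNodeType() == Node.ELEMENT_NODE
                ? (path.isEmpty() ? node.getNodeName() : path + "/" + node.getNodeName())
                : path;
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), childPath, fields);
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<tag1>this is some text<tag2>and some other text</tag2></tag1>";
        org.w3c.dom.Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, StringBuilder> fields = new LinkedHashMap<>();
        collect(dom.getDocumentElement(), "", fields);
        // Each entry would become one Lucene field: name = path, value = text.
        for (Map.Entry<String, StringBuilder> e : fields.entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue().toString().trim());
        }
    }
}
```

For the sample XML this prints "tag1: this is some text" and "tag1/tag2: and some other text". The trade-off is that queries must now know the path, which is exactly the kind of schema knowledge the arbitrary-XML case was trying to avoid.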