Re: Index Locking Issues Resolved...I hope
I was thinking that perhaps I can pre-stem words before sticking them in a search field in the database, perhaps using Lucene stemming code, then try to use the Natural Language Search found in MySQL 4.1.1. I am confident the MySQL product can't keep up with Lucene yet, but at least they have improved it some. I'm not even sure my hosting company will upgrade to 4.1.1, though. Still looking for a lot of solutions to make Lucene stay in sync more nicely with MySQL as the main database... aka an easy to use way of handling

- Original Message -
From: Chris Lamprecht [EMAIL PROTECTED]
Date: Wednesday, November 17, 2004 1:38 am
Subject: Re: Index Locking Issues Resolved...I hope

MySQL does offer a basic fulltext search (with MyISAM tables), but it doesn't really approach the functionality of Lucene, such as pluggable tokenizers, stemming, etc. I think MS SQL Server has fulltext search as well, but I have no idea if it's any good. See http://www.google.com/search?hl=en&lr=&safe=off&c2coff=1&q=mysql+fulltext

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
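For reference, the MyISAM fulltext search Chris mentions looks roughly like the sketch below. This is only an illustration from memory - the table and column names are invented - and, if memory serves, the default minimum indexed word length of 4 and the lack of stemming are part of why it does not approach Lucene:

```sql
-- MyISAM table with a FULLTEXT index (the only table type supporting it in MySQL 4.x)
CREATE TABLE articles (
    id   INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    body TEXT,
    FULLTEXT (body)
) TYPE=MyISAM;

-- MATCH ... AGAINST defaults to natural-language mode and returns a relevance score
SELECT id, MATCH(body) AGAINST('lucene stemming') AS score
FROM articles
WHERE MATCH(body) AGAINST('lucene stemming')
ORDER BY score DESC;
```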
Considering intermediary solution before Lucene question
Is there a way to use Lucene stemming and stop word removal without using the rest of the tool? I am downloading the code now, but I imagine the answer might be deeply buried. I would like to be able to send in a phrase and get back a collection of keywords if possible.

I am thinking of using an intermediary solution before moving fully to Lucene. I don't have time to spend a month making a carefully tested, administrable Lucene solution for my site yet, but I intend to do so over time. Funny thing is, the Lucene code itself would likely only take up a couple hundred lines, but integration and administration would take me much more time.

In the meantime, I am thinking I could perhaps use Lucene's stemming and parsing of words, then stick each search word along with the associated primary key in an indexed MySQL table. Each record I would need to do this to is small, with maybe only 15 useful words on average. I would then have an in-database solution, though ranking, etc. would not exist. This is better than the exact-word searching I have currently, which is really bad.

By the way, MySQL 4.1.1 has some Lucene-type handling, but it too does not have stemming, and I am sure it is very slow compared to Lucene. Cpanel is still stuck on MySQL 4.0.*, so many people would not have access to even this basic ability in production systems for some time yet.

JohnE
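The stem-and-store idea above can be sketched in plain Java. This is purely an illustration - the class name, stop list, and all details are invented - with the stemming step left as a comment, since that is exactly the piece Lucene's analyzers (e.g. its PorterStemFilter) would supply:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class KeywordExtractor {
    // tiny stop list for illustration; a real one would be much larger
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "the", "is", "of", "to", "and"));

    /** Splits text on non-letters, lowercases, drops stop words and duplicates. */
    public static List<String> extractKeywords(String text) {
        List<String> keywords = new ArrayList<String>();
        Set<String> seen = new HashSet<String>();
        for (String word : text.toLowerCase().split("[^a-z]+")) {
            if (word.length() == 0 || STOP_WORDS.contains(word)) continue;
            // a real version would stem here, which is what Lucene would provide
            if (seen.add(word)) keywords.add(word);
        }
        return keywords;
    }

    public static void main(String[] args) {
        // each keyword would then be inserted into the indexed MySQL table
        // alongside the record's primary key
        System.out.println(extractKeywords("The quick brown fox is quick"));
    }
}
```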
Re: Considering intermediary solution before Lucene question
This is so cool, Otis. I was just about to write this off of something in the FAQ, but this is better than what I was doing. This rocks!!! Thank you.

JohnE

P.S.: I am assuming you use org.apache.lucene.analysis.Token? There are three classes named Token under Lucene.

- Original Message -
From: Otis Gospodnetic [EMAIL PROTECTED]
Date: Wednesday, November 17, 2004 7:17 pm
Subject: Re: Considering intermediary solution before Lucene question

Yes, you can use just the Analysis part. For instance, I use this for http://www.simpy.com and I believe we also have this in the Lucene book as part of the source code package:

    /**
     * Gets Tokens extracted from the given text, using the specified Analyzer.
     *
     * @param analyzer the <code>Analyzer</code> to use
     * @param text the text to analyze
     * @param field the field to pass to the Analyzer for tokenization
     * @return an array of <code>Token</code>s
     * @exception IOException if an error occurs
     */
    public static Token[] getTokens(Analyzer analyzer, String text, String field)
            throws IOException {
        TokenStream stream = analyzer.tokenStream(field, new StringReader(text));
        ArrayList tokenList = new ArrayList();
        while (true) {
            Token token = stream.next();
            if (token == null) break;
            tokenList.add(token);
        }
        return (Token[]) tokenList.toArray(new Token[0]);
    }

Otis

--- [EMAIL PROTECTED] wrote:

Is there a way to use Lucene stemming and stop word removal without using the rest of the tool? I am downloading the code now, but I imagine the answer might be deeply buried. I would like to be able to send in a phrase and get back a collection of keywords if possible. I am thinking of using an intermediary solution before moving fully to Lucene. I don't have time to spend a month making a carefully tested, administrable Lucene solution for my site yet, but I intend to do so over time. Funny thing is, the Lucene code likely would only take up a couple hundred lines, but integration and administration would take me much more time.
Re: Considering intermediary solution before Lucene question
I thank you both. I have it already partly implemented here. It seems easy. At least this should carry my product through until I can really get to use Lucene. I am not sure how far I can take MySQL with stemmed, indexed keywords, but it should give me maybe 6 months at least of something useful, as opposed to impossible searching. I need time, and this might just be the trick.

I always fight for simplicity, but it is hard when you have 2 databases that have to stay in sync. If accuracy is important (people paying money), then handling all of the edge cases (such as the question that was just asked about the machine going down) is so important. I understand this is beyond the scope of Lucene. Thank you for the help. This really is an interesting project.

JohnE

- Original Message -
From: Chris Lamprecht [EMAIL PROTECTED]
Date: Wednesday, November 17, 2004 7:08 pm
Subject: Re: Considering intermediary solution before Lucene question

John,

It actually should be pretty easy to use just the parts of Lucene you want (the analyzers, etc.) without using the rest. See the example of the PorterStemmer from this article: http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html?page=2

You could feed a Reader to the tokenStream() method of PorterStemAnalyzer, and get back a TokenStream, from which you pull the tokens using the next() method.

On Wed, 17 Nov 2004 18:54:07 -0500, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

Is there a way to use Lucene stemming and stop word removal without using the rest of the tool? I am downloading the code now, but I imagine the answer might be deeply buried. I would like to be able to send in a phrase and get back a collection of keywords if possible. I am thinking of using an intermediary solution before moving fully to Lucene. I don't have time to spend a month making a carefully tested, administrable Lucene solution for my site yet, but I intend to do so over time.
Re: Lucene : avoiding locking (incremental indexing)
I am interested in pursuing experienced people's understanding, as I have half the queue approach developed already. I am not following why you don't like the queue approach, Sergiu. From what I gathered from this board, if you do lots of updates, the opening of the IndexWriter is very intensive and should be used in a batch orientation rather than a one-at-a-time incremental approach. In some cases on this board they talk about it being so overwhelming that people are putting in forced delays so the Java engine can catch up. Using a queueing approach, you may get a hit every 30 seconds or a minute or... whatever you choose as your timeframe, but it should be enough of a delay to keep the Java engine from being overwhelmed.

I would like this not to be necessary with Lucene and would like to be able to update every time a change occurs, but this does not seem the right approach right now. As I said before, this seems like a wish item for Lucene. I don't really know if the wish is feasible. So far the biggest problem I was facing with this approach, however, was having feedback from the archiving process to the main database that the archiving change actually has happened, and correctly, even if the server goes down.

JohnE

Personally I don't like the queue approach... because I already implemented multithreading in our application to improve its performance. In our application indexing is not a high priority, but it's happening quite often. Search is a priority. Lucene allows more searches at one time. When you have a big index and many users, then... the queue approach can slow down your application too much. I think it will be a bottleneck.

I know that the lock problem is annoying, but I also think that the right way is to identify the source of the locking. Our application is a web-based application based on Turbine, and when we want to restart Tomcat, we just kill the process (otherwise we need to restart 2 times because of some log4j initialization problem), so...
the index is locked after the Tomcat restart. In my case it makes sense to check if the index is locked one time at startup. I'm also logging all the errors that I get in the system; this helps me find their source more easily.

All the best,
Sergiu
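The queue approach being debated might be sketched like this. It is a hypothetical illustration (all names invented): web threads only ever touch the queue, and a single background thread drains it in batches, so the applyBatch stub - which stands in for opening one IndexWriter, applying the changes, and closing it - is never entered concurrently:

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class IndexUpdateQueue {
    private final LinkedList<String> queue = new LinkedList<String>();
    private final List<String> indexed = new ArrayList<String>(); // stands in for the Lucene index

    /** Called by web threads; cheap, holds the monitor only briefly. */
    public synchronized void enqueue(String docId) {
        queue.add(docId);
    }

    /** Called periodically (e.g. every 30-60 s) by the single background thread. */
    public void drain() {
        List<String> batch;
        synchronized (this) {
            batch = new ArrayList<String>(queue);
            queue.clear();
        }
        if (!batch.isEmpty()) applyBatch(batch);
    }

    /** Stub: a real version would open one IndexWriter, apply the batch, then close it. */
    private void applyBatch(List<String> batch) {
        indexed.addAll(batch);
    }

    public List<String> indexedDocs() {
        return indexed;
    }
}
```

Because only the drain thread ever writes, there is no "unknown" lock contention to forcibly clear; the cost is that changes become searchable only after the next drain.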
Re: Index Locking Issues Resolved...I hope
Very cool, Luke. I am not quite there yet. I am halfway through implementing the queue approach, but I have hit walls that are making me sit back and figure out my strategy. I have a struts/tomcat/ojb/mysql project that can potentially have a million records, growing over time, with updates occurring perhaps 100,000/day. This is not today, but what I am building for. My concerns are not just Lucene itself, but its surrounding effects, as follows. I am finding that edge-case scenarios are making things difficult due to having two databases instead of one.

- How to know the index on this huge database is always in sync.
- What happens if the server crashes or is brought down (a solution might be a db last-modified date).
- Backups of the database and the index, handled in an efficient, safe manner on a live system.
- How to reindex while the system is in place (a solution might be building the new index in a different location as a separate tool).
- How to handle the fact that the IndexWriter is not very good in incremental cases in a high-volume update/query system (a solution might be to query the database every 45 seconds or so for records that have changed and apply the changes).
- How the IndexWriter solution above might frequently cause bad lag on queries (no solution).
- How to get Tomcat to start up a thread at startup to run this updater and not have a problem with memory management.
- How to make this all work in my startup business and allow me to feel I can sleep at night.

In general, things just got much more complicated than I was hoping for, though I don't know how I can do without Lucene or something like it. This has been done so many times before that I would have suspected it would be easy, but I have not seen it clearly yet because it is all new. I wish a database Text field could have this sort of mechanism built into it. MySQL (what I am using) does not do this, but I am going to check into other databases now.
OJB will work with almost all of them, so that would help if there is a database-type solution that will allow that sleep-at-night thing to happen!!! If you have input on these things: I found some answers in the mailing list, but not really a concept of how to manage the whole thing. Is there a big incremental open source project out there that uses Lucene and a database? I don't think so. If you have any code or ideas, I would appreciate both!!! Also, having a FAQ that handles lots of these common problems, though a bit off-topic they are, might really help people choose to use Lucene.

Thanks,
JohnE

- Original Message -
From: Luke Shannon [EMAIL PROTECTED]
Date: Tuesday, November 16, 2004 10:51 pm
Subject: Index Locking Issues Resolved...I hope

Hello;

I think I have solved my locking issues. I just made it through the set of test cases that previously resulted in index locking errors. I just removed the method from my code that checks for an index lock and forcefully removes it after 1 minute. Hopefully it never needs to be put back in.

Here is what I changed: I moved all my Indexer logic into a class called Index.java that implemented Runnable. Index's start() called a method named go(), which was static and synchronized. go() kicks off all the logic to update the index (the reader, writer, and other members involved with incremental updates are also static). I put logging in place that logs when a thread has executed the method and what the thread's name is. Every time a client class changes the content, it can create a thread reference and pass it the runnable Index. The convention I have requested for naming the thread is a toString() of the current date. Then they start the thread.

How it worked: A few users just tested the system; half added documents to the system while another half deleted documents at the same time.
No locking issues were seen, and the index was current with the changes made a short time after the last operation (in my previous code this test resulted in an index locking issue). I was able to go through the log file and find the start of the synchronized go() method and the successful completion of the indexing operations for every request made.

The only performance issue I noticed was that if someone added a very large PDF, it took a while before the thread handling the request could finish. If this is the first operation of many, it means the operations following this large file take that much longer. Luckily for me, search results don't need to be instant. Things are looking much better. For now...

Thanks to all that helped me up till now.

Luke

- Original Message -
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Tuesday, November 16, 2004 4:01 PM
Subject: Re: _4c.fnm missing

'Concurrent' and 'updates' in the same
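One of the concerns listed earlier in the thread - catching up after a crash by querying the main database for records changed since the last pass - could look roughly like the sketch below. This is purely hypothetical (the class and method names are invented); the two stubs stand in for the JDBC query against a last-modified column and the Lucene index update:

```java
import java.util.ArrayList;
import java.util.List;

public class CatchUpIndexer {
    // persisted in a real system, so a crash loses nothing: on restart the
    // next poll simply picks up everything modified since the saved value
    private long lastRun = 0L;

    /** Stub: a real version would run something like
     *  SELECT id FROM records WHERE modified > ? against the main database. */
    protected List<String> fetchChangedSince(long since) {
        return new ArrayList<String>();
    }

    /** Stub: a real version would delete and re-add these ids in the Lucene index. */
    protected void reindex(List<String> ids) {
    }

    /** One polling pass; a background thread would schedule this every ~45 s. */
    public void poll(long now) {
        List<String> changed = fetchChangedSince(lastRun);
        if (!changed.isEmpty()) reindex(changed);
        lastRun = now; // advanced only after the reindex call returns
    }

    public long lastRun() {
        return lastRun;
    }
}
```

The design point is that the database's last-modified column, not the index, is the source of truth, so index and database cannot drift apart permanently.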
Re: Lucene : avoiding locking
I am new to Lucene, but have a large project in production on the web using other Apache software, including Tomcat, Struts, OJB, and others. The database I need to support will hopefully grow to millions of records. Right now it only has thousands, but it is growing. These documents get updated by users regularly, but not frequently. When you have 100k users, though, "infrequently" means you still have to deal with lock types of issues. When they update their record, their search criteria will have to be updated, and they will expect to see results somewhat immediately.

In moving from exact matching, which is very poor for searches, to Lucene, this locking is the only thing that has me nervous. I would really like a well-thought-out scheme for incremental changes, as I won't generally need batch unless I have to delete/recreate the database for some reason. Thinking about most online forums, I think incremental is the way they would like to be able to go for searching.

I have lots to learn about this project, but I really like what I see besides that locking issue. If I get into this more and understand the details, maybe I will have something to offer later. Lots to learn first though.

Thank you for your hard work,
JohnE

I am curious, though, how many people on this list are using Lucene in the incremental update case. Most examples I've seen all assume batch indexing.

Regards,
Luke Francl
Re: Lucene : avoiding locking (incremental indexing)
It really seems like I am not the only person having this issue. So far I am seeing 2 solutions, and honestly I don't totally love either. I am thinking that, without changes to Lucene itself, the best general way to implement this might be to have a queue of changes and have Lucene work off this queue in a single thread using a time-settable batch method. This is similar to what you are using below, but I don't like that you forcibly unlock Lucene if it shows itself locked. Using the queue approach, only that one thread would be accessing Lucene for writes/deletes anyway, so there should be no unknown locking.

I can imagine this being a very good addition to Lucene - creating a high-level interface to Lucene that manages incremental updates in such a manner. If anybody has such a general piece of code, please post it!!! I would use it tonight rather than create my own. I am not sure if there is anything that can be done to Lucene itself to help with this need people seem to be having. I realize the likely reasons why Lucene might need to have only one index writer, and the additional load that might be caused by locking off pieces of the index rather than the whole index. I think I need to look in the developer archives.

JohnE

- Original Message -
From: Luke Shannon [EMAIL PROTECTED]
Date: Monday, November 15, 2004 5:14 pm
Subject: Re: Lucene : avoiding locking (incremental indexing)

Hi Luke;

I have a similar system (except people don't need to see results immediately). The approach I took is a little different. I made my Indexer a thread with the indexing operations occurring in the run method. When the IndexWriter is to be created or the IndexReader needs to execute a delete, I called the following method:

    private void manageIndexLock() {
        try {
            // check if the index is locked and deal with it if it is
            if (index.exists() && IndexReader.isLocked(indexFileLocation)) {
                System.out.println("INDEXING INFO: There is more than one process trying to write to the index folder. Will wait for index to become available.");
                // perform this loop until the lock is released or 3 mins has expired
                int indexChecks = 0;
                while (IndexReader.isLocked(indexFileLocation) && indexChecks < 6) {
                    // increment the number of times we check the index files
                    indexChecks++;
                    try {
                        // sleep for 30 seconds
                        Thread.sleep(30000L);
                    } catch (InterruptedException e2) {
                        System.out.println("INDEX ERROR: There was a problem waiting for the lock to release. " + e2.getMessage());
                    }
                } // closes the while loop for checking on the index directory
                // if we are still locked we need to do something about it
                if (IndexReader.isLocked(indexFileLocation)) {
                    System.out.println("INDEXING INFO: Index locked after 3 minutes of waiting. Forcefully releasing lock.");
                    IndexReader.unlock(FSDirectory.getDirectory(index, false));
                    System.out.println("INDEXING INFO: Index lock released");
                } // closes the if that actually releases the lock
            } // closes the if ensuring the file exists
        } // closes the try for all the above operations
        catch (IOException e1) {
            System.out.println("INDEX ERROR: There was a problem waiting for the lock to release. " + e1.getMessage());
        }
    } // closes the manageIndexLock method

Do you think this is a bad approach?

Luke

- Original Message -
From: Luke Francl [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, November 15, 2004 5:01 PM
Subject: Re: Lucene : avoiding locking (incremental indexing)

This is how I implemented incremental indexing. If anyone sees anything wrong, please let me know. Our motivation is similar to John Eichel's. We have a digital asset management system, and when users update, delete or create a new asset, they need to see their results immediately.

The most important thing to know about incremental indexing is that multiple threads cannot share the same IndexWriter, and only one IndexWriter can be open on an index at a time.
Therefore, what I did was control access to the IndexWriter through a singleton wrapper class that synchronizes access to the IndexWriter and IndexReader (for deletes). After finishing writing to the index, you must close the IndexWriter to flush the changes to the index. If you do this you will be fine.

However, opening and closing the index takes time, so we had to look for some ways to speed up the indexing. The most obvious thing is that you should do as much work as possible outside of the synchronized block. For example, in my application, the creation of Lucene Document objects is not synchronized. Only the part of the code that is between your IndexWriter.open() and IndexWriter.close() needs to be synchronized. The other easy thing I did to improve performance was batch changes in a transaction together
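The singleton-wrapper-plus-batching approach described above might be sketched like this. All names are invented, and a plain list stands in for the real index, so the locking shape is visible without Lucene itself; in a real version the comments mark where the IndexWriter would be opened and closed:

```java
import java.util.ArrayList;
import java.util.List;

/** Singleton that serializes all index writes, per the approach described above. */
public class IndexWriterGuard {
    private static final IndexWriterGuard INSTANCE = new IndexWriterGuard();

    private final List<String> index = new ArrayList<String>(); // stands in for the Lucene index
    private int openCloseCycles = 0;

    private IndexWriterGuard() {
    }

    public static IndexWriterGuard getInstance() {
        return INSTANCE;
    }

    /** Documents are built outside the lock; only the actual write is synchronized. */
    public synchronized void writeBatch(List<String> docs) {
        // a real version would open the IndexWriter here...
        openCloseCycles++;
        index.addAll(docs);
        // ...and close it here, flushing the changes to disk
    }

    public synchronized int cycles() {
        return openCloseCycles;
    }

    public synchronized int size() {
        return index.size();
    }
}
```

The batching pay-off shows up in the counter: writing N documents in one writeBatch call costs a single open/close cycle instead of N of them.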