unexpected results from query
Hi,

Assume a field has the following text: "Adenylate kinase (mitochondrial GTP:AMP phosphotransferase)". The following searches all return this document: AMP & &

Can someone explain this to me? I figured that only the first query would be successful.

Thanks,
Marc
Re: Document Clustering
Thanks everyone for the responses and links to resources. I was basically thinking of using Lucene to generate document vectors and writing my own similarity algorithms for measuring distance. I could then run this data through k-means or SOM algorithms to calculate clusters. Does this sound like I'm on the right track? I'm still just in the *thinking* stage.

Marc

----- Original Message -----
From: "Alex Aw Seat Kiong" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, November 11, 2003 5:47 PM
Subject: Re: Document Clustering

> Hi!
>
> I'm also interested in it. Kindly CC me the latest progress of your
> clustering project.
>
> Regards,
> AlexAw
>
> ----- Original Message -----
> From: "Eric Jain" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Tuesday, November 11, 2003 10:07 PM
> Subject: Re: Document Clustering
>
> > > I'm working on it. Classification and Clustering as well.
> >
> > Very interesting... if you get something working, please don't forget to
> > notify this list :-)
> >
> > --
> > Eric Jain

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
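The pipeline described above (document vectors from Lucene, a custom similarity, then k-means) can be sketched in plain Java. Lucene itself does not ship a clustering algorithm, so everything below is a hypothetical sketch: the vector contents are made up for illustration, and in real use the `double[]` vectors would come from term frequencies extracted from the index.

```java
// Sketch: once term frequencies have been turned into double[] vectors,
// clustering reduces to a distance measure plus an assignment step.
public class ClusterSketch {

    // Cosine similarity between two term-frequency vectors.
    public static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // One k-means assignment step: give a document the index of the
    // most similar centroid.
    public static int assign(double[] doc, double[][] centroids) {
        int best = 0;
        double bestSim = -1;
        for (int c = 0; c < centroids.length; c++) {
            double sim = cosine(doc, centroids[c]);
            if (sim > bestSim) { bestSim = sim; best = c; }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centroids = { {1, 0, 0}, {0, 1, 1} };
        double[] doc = {0, 2, 1};   // leans toward the second centroid
        System.out.println(assign(doc, centroids)); // prints 1
    }
}
```

The full k-means loop would alternate this assignment step with recomputing centroids as the mean of each cluster's vectors, until assignments stop changing.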
Document Clustering
Hi, does anyone have any sample code or documentation available for doing document-based clustering using Lucene? Thanks, Marc
Parallelizing index building
Hi,

I'm indexing 500 XML files, each ~150MB, on an 8-CPU machine, and I'm wondering what the best strategy is for making maximum use of resources. I have tweaked the single-process indexer to index 5000 records (not files) in memory before writing out to disk. Should I create an IndexThread and share the IndexWriter object across 5 threads, then monitor when one ends to start another, etc.? Or should I create separate indexes and then do a series of merges?

Any help would be appreciated. Thanks,

Marc Dumontier
Bioinformatics Application Developer
Blueprint Initiative
Mount Sinai Hospital
Toronto
http://www.bind.ca
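The first option above (one shared IndexWriter, N worker threads) can be sketched with a thread pool. This is a hypothetical sketch, not Lucene code: whether sharing one IndexWriter across threads is safe depends on the Lucene version in use and should be verified, so an AtomicInteger stands in below where the real code would call writer.addDocument(doc); the thread and document counts are illustrative.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the "one shared writer, N worker threads" option.
public class ParallelIndexSketch {

    public static int indexAll(int nDocs, int nThreads) throws Exception {
        AtomicInteger indexed = new AtomicInteger();  // stand-in for the shared writer
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        for (int i = 0; i < nDocs; i++) {
            // In real code this task would parse one record and call
            // writer.addDocument(doc) on the shared IndexWriter.
            pool.submit(indexed::incrementAndGet);
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return indexed.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(indexAll(500, 5)); // prints 500
    }
}
```

The second option (independent indexes merged at the end) trades lock contention on one writer for a final merge pass; with 8 CPUs and large files, it is worth benchmarking both.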
Release schedule?
Hi,

We are incorporating Lucene in a CMS. It does some quite fancy matching and searching of documents and uses Lucene as one of its components. We would like to influence the scoring of search terms for some fields. This is possible with the new Similarity class that was implemented after release 1.2.

Is there a next release scheduled? And if so, approximately when will that be? We would like to run the CMS with tested code and not with the code from the Lucene CVS...

Greetings, Marc Worrell
significant performance issues
Hi all,

I just started trying to use Lucene to index approximately 13,000 XML documents representing biological data; each document is approximately 20-30KB. I modified some code from Cocoon components to use SAX to parse my documents and create Lucene Documents. That process is very quick. The following code is where I started off to write the index to disk:

    writer = new IndexWriter(fsd, analyzer, true);
    Iterator myit = docList.iterator();
    while (myit.hasNext()) {
        writer.addDocument((Document) myit.next());
        System.out.println(++counter);
    }
    writer.close();

This is taking much more time than expected. I'm using the StandardAnalyzer. The indexing is taking approximately 2-3 seconds per document, and it gets significantly slower as the index grows. I'm running this on a 2.4GHz Linux machine with 1GB RAM. I tried a few different strategies, but I end up with "too many files open" exceptions.

I don't think it should progressively slow down in proportion to the size of the index. Is this assumption wrong? Am I doing something wrong? Is there a way to use memory more and the filesystem less, and just dump the index to disk periodically?

Any help would be appreciated. Thanks,

Marc Dumontier
Intermediate Developer
Blueprint Initiative
Mount Sinai Hospital
http://www.bind.ca
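The "use memory more, filesystem less" idea asked about above is essentially batching: buffer documents in memory and only touch disk every N documents. The sketch below shows that pattern in plain Java with a stand-in flush step (the class name, BATCH_SIZE value, and flush logic are all hypothetical; in real Lucene code the flush would write a batch of Documents to the index).

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of periodic batching: buffer items in memory and flush every
// BATCH_SIZE, instead of hitting the filesystem per document.
public class BatchSketch {
    static final int BATCH_SIZE = 1000;

    // Returns how many times the (stand-in) disk write happened.
    public static int index(List<String> docs) {
        List<String> buffer = new ArrayList<>();
        int flushes = 0;
        for (String doc : docs) {
            buffer.add(doc);
            if (buffer.size() >= BATCH_SIZE) {
                buffer.clear();   // stand-in for writing the batch to disk
                flushes++;
            }
        }
        if (!buffer.isEmpty()) flushes++;  // final partial batch
        return flushes;
    }
}
```

With 13,000 documents and a batch size of 1000, disk is touched 13 times instead of 13,000; the right batch size depends on available heap.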
Re: Help for german queries
Great, your stemmer does the job I expected for umlauts. Thanks.

Does anyone have an idea for compound words ("betreuung" is not found in a doc containing "Kundenbetreuung")?

Marc.

----- Original Message -----
From: "Clemens Marschner" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, October 02, 2002 2:36 PM
Subject: Re: Help for german queries

> Hm, sorry, I don't have the time right now, but I think it took me 10 minutes
> to discover the location where I had to make the changes.
> I thought ä=ae would already be included.
> I included my GermanStemmer version in this post. Sorry I can't do
> CVSing/diffing at the moment.
> The stemmer does ä->a and ae->a and doesn't distinguish between uppercase
> and lowercase. I'm not a linguist, so I can't say if it overstems. I
> commented out the expression below:
>
>     // "t" occurs only as suffix of verbs.
>     else if ( buffer.charAt( buffer.length() - 1 ) == 't' /*&&
>         !uppercase*/ ) {
>         buffer.deleteCharAt( buffer.length() - 1 );
>     }
>     else {
>         doMore = false;
>     }
>
> Hope that helps
>
> Clemens
>
> ----- Original Message -----
> From: "Marc Guillemot" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Wednesday, October 02, 2002 12:47 PM
> Subject: Re: Help for german queries
>
> > The problem/question is not about the first-letter case, but only about
> > the equivalence between "ä" and "ae", for instance.
> >
> > In my tests, searching for:
> > - Geschäft -> 13 results
> > - geschäft -> 0 results
> > - Geschaeft -> 0 results
> > - geschaeft -> 0 results
> >
> > Marc.
> >
> > ----- Original Message -----
> > From: "Clemens Marschner" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > Sent: Tuesday, October 01, 2002 1:16 PM
> > Subject: Re: Help for german queries
> >
> > > There's a "feature" in the German stemmer (I would call it a bug) that
> > > treats words ending with "t" differently if they start with a capital or
> > > non-capital letter. Are you sure you didn't type "geschäft" and
> > > "Geschaeft"? Cause that's supposedly stemmed differently.
> > >
> > > --Clemens
> > >
> > > ----- Original Message -----
> > > From: "Marc Guillemot" <[EMAIL PROTECTED]>
> > > To: <[EMAIL PROTECTED]>
> > > Sent: Tuesday, October 01, 2002 9:40 AM
> > > Subject: Help for german queries
> > >
> > > > Hi,
> > > >
> > > > I've performed some tests with Lucene for German indexing/search, but I
> > > > don't get the results I expected:
> > > >
> > > > - Umlaut:
> > > > searching for:
> > > > - "Geschäft" -> x results
> > > > - "Geschaeft" -> no results
> > > > Is there an option in the standard German classes to make the two
> > > > searches above equivalent?
> > > >
> > > > - Compound words:
> > > > "betreuung" is not found in a doc containing "Kundenbetreuung"
> > > >
> > > > Any suggestions?
> > > >
> > > > Marc.
Re: Help for german queries
The problem/question is not about the first-letter case, but only about the equivalence between "ä" and "ae", for instance.

In my tests, searching for:
- Geschäft -> 13 results
- geschäft -> 0 results
- Geschaeft -> 0 results
- geschaeft -> 0 results

Marc.

----- Original Message -----
From: "Clemens Marschner" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, October 01, 2002 1:16 PM
Subject: Re: Help for german queries

> There's a "feature" in the German stemmer (I would call it a bug) that
> treats words ending with "t" differently if they start with a capital or
> non-capital letter. Are you sure you didn't type "geschäft" and "Geschaeft"?
> Cause that's supposedly stemmed differently.
>
> --Clemens
>
> ----- Original Message -----
> From: "Marc Guillemot" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Tuesday, October 01, 2002 9:40 AM
> Subject: Help for german queries
>
> > Hi,
> >
> > I've performed some tests with Lucene for German indexing/search, but I
> > don't get the results I expected:
> >
> > - Umlaut:
> > searching for:
> > - "Geschäft" -> x results
> > - "Geschaeft" -> no results
> > Is there an option in the standard German classes to make the two searches
> > above equivalent?
> >
> > - Compound words:
> > "betreuung" is not found in a doc containing "Kundenbetreuung"
> >
> > Any suggestions?
> >
> > Marc.
Help for german queries
Hi,

I've performed some tests with Lucene for German indexing/search, but I don't get the results I expected:

- Umlaut:
searching for:
- "Geschäft" -> x results
- "Geschaeft" -> no results
Is there an option in the standard German classes to make the two searches above equivalent?

- Compound words:
"betreuung" is not found in a doc containing "Kundenbetreuung"

Any suggestions?

Marc.
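One common way to make "Geschäft" and "Geschaeft" equivalent is to normalize umlauts to their standard "ae"-style transcriptions before both indexing and querying, so the two spellings meet in the same index term. A minimal sketch, assuming a hypothetical normalization step applied in the analyzer (the class name below is not a Lucene class):

```java
// Sketch: map German umlauts and ß to their standard transcriptions,
// after lowercasing, so both spellings produce the same index term.
public class UmlautNormalizer {
    public static String normalize(String s) {
        return s.toLowerCase()
                .replace("ä", "ae")
                .replace("ö", "oe")
                .replace("ü", "ue")
                .replace("ß", "ss");
    }

    public static void main(String[] args) {
        // Both spellings normalize to "geschaeft".
        System.out.println(
            normalize("Geschäft").equals(normalize("Geschaeft"))); // prints true
    }
}
```

The compound-word problem ("Kundenbetreuung" not matching "betreuung") is harder: it needs decompounding, i.e. splitting compounds into their parts at index time, which simple stemming does not do.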
Re: SqlDirectory
Here is the current code:

java: SQLDirectory.java (should work with all SQL databases via JDBC)
sql:
- Oracle: see the appended scripts
- SQL Server: change varchar2 to varchar, integer to bigint, and raw to binary

It seems to be quite stable by now. The input and output stream methods should be reviewed. Indexing seems to be a little faster than with FSDirectory (both db and files on a remote server). Querying is slower, but the included cache increases performance over time for repeated queries (especially paging).

enjoy ;)

marc

----- Original Message -----
From: "Marc Kramis" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Monday, November 26, 2001 9:38 PM
Subject: SqlDirectory

> hi all
>
> some time ago, there was a short discussion about a database store. I also
> needed a persistence layer that was accessible via JDBC. It turned out
> that a BLOB implementation is strongly dependent on the RDBMS used and also
> performs poorly.
>
> I implemented a SqlDirectory, based on the idea of RAMDirectory and its
> buffers as the basic element.
>
> goals:
> 1. should work with all JDBC-compliant RDBMS (no adaptation required, no
> blobs!).
> 2. performance should be acceptable.
> 3. simple db schema.
>
> status:
> 1. tested on Oracle 8i (free Oracle JDBC driver type 4) and SQL Server 2000
> (free Microsoft JDBC beta driver type 4). works perfectly.
> 2. consists of 2 tables and 1 index. (one tablespace can have several
> indexes of course)
> 3. promising performance.
>
> todo:
> 1. test reliability, performance, concurrency (multiple readers/writers),
> test with MySQL
> 2. code review
> 3. introduce caching (maybe CacheDirectory)
>
> if someone has experience or just likes to test it, mail me. Anyway, could I
> simply attach the SqlDirectory.java file to my mails?
>
> marc

Attachments: SqlDirectory.java, create_lucene.sql, drop_lucene.sql
synchronization problem / bug?
hi

While testing the SqlDirectory, I found something really strange. The scenario is a concurrent writer and searcher:

1. an IndexWriter is started and creates a write.lock until the close method is called. This cleanly prevents other writers from accessing the index at the same time and is OK.
2. indexing goes on ...

But now, concurrently, the following process runs:

1. a Searcher is created with searcher = new IndexSearcher().
2. this process creates a commit.lock as expected and reads some files.
3. the commit.lock is released (immediately).
4. now the querying is done and hits.doc(i) is read. During this, no commit.lock is set, but again some files are accessed (the InputStream.readInternal method is called).
5. the searcher.close() method is called, which closes all open InputStreams (no commit.lock created or released).

Like that, from time to time, an exception occurs because a file has been changed by the IndexWriter process running at the same time. Any ideas about this? This should also occur with FSDirectory or RAMDirectory, but more rarely, because these are faster at reading results...

cheers
marc
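The race described above is that the reader's second file access (step 4) happens after the lock from step 3 was already dropped, so the writer can change files in between. A minimal sketch of the general fix, holding one lock across the *whole* read instead of releasing it early (this is a plain-Java illustration of the locking pattern, not Lucene's actual commit.lock implementation; the class and field names are hypothetical):

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch: if both reads in the reader happen under the same lock the
// writer takes, the reader always sees a consistent snapshot. An int
// pair stands in for the index files here.
public class ReadLockSketch {
    private final ReentrantLock commitLock = new ReentrantLock();
    private int[] index = {1, 1};

    // Writer: replaces the whole "index" atomically under the lock.
    public void write(int v) {
        commitLock.lock();
        try { index = new int[]{v, v}; }
        finally { commitLock.unlock(); }
    }

    // Reader: both accesses happen under the same lock, so the pair is
    // always consistent -- unlike releasing the lock between them.
    public boolean consistentRead() {
        commitLock.lock();
        try { return index[0] == index[1]; }
        finally { commitLock.unlock(); }
    }
}
```

The trade-off is that holding the lock for the full query-plus-fetch serializes readers against the writer, which is presumably why the early release was chosen in the first place.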
SqlDirectory
hi all

Some time ago, there was a short discussion about a database store. I also needed a persistence layer that was accessible via JDBC. It turned out that a BLOB implementation is strongly dependent on the RDBMS used and also performs poorly.

I implemented a SqlDirectory, based on the idea of RAMDirectory and its buffers as the basic element.

goals:
1. should work with all JDBC-compliant RDBMS (no adaptation required, no blobs!).
2. performance should be acceptable.
3. simple db schema.

status:
1. tested on Oracle 8i (free Oracle JDBC driver type 4) and SQL Server 2000 (free Microsoft JDBC beta driver type 4). works perfectly.
2. consists of 2 tables and 1 index. (one tablespace can have several indexes of course)
3. promising performance.

todo:
1. test reliability, performance, concurrency (multiple readers/writers), test with MySQL
2. code review
3. introduce caching (maybe CacheDirectory)

If someone has experience or just likes to test it, mail me. Anyway, could I simply attach the SqlDirectory.java file to my mails?

marc