unexpected results from query

2003-11-25 Thread marc
Hi,

assume a field has the following text

"Adenylate kinase (mitochondrial GTP:AMP phosphotransferase) "

the following searches all return this document

AMP
&
&

Can someone explain this to me? I figured that only the first query would be successful.

Thanks,
Marc


Re: Document Clustering

2003-11-11 Thread marc
Thanks, everyone, for the responses and links to resources.

I was basically thinking of using Lucene to generate document vectors, and
writing my custom similarity algorithms for measuring distance.

I could then run this data through k-means or SOM algorithms for calculating
clusters.

Does this sound like I'm on the right track? I'm still just in the
*thinking* stage.
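The distance/assignment part of that pipeline can be sketched in plain Java (all names here are hypothetical, and a real setup would extract term vectors from the Lucene index rather than re-tokenizing text):

```java
import java.util.*;

// Sketch: term-frequency vectors, a custom similarity (cosine), and a
// k-means-style assignment step. Illustration only; not Lucene API.
public class ClusterSketch {

    // Build a term-frequency vector from whitespace-tokenized text.
    static Map<String, Integer> termVector(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : text.toLowerCase().split("\\s+")) {
            tf.merge(t, 1, Integer::sum);
        }
        return tf;
    }

    // Cosine similarity between two sparse term-frequency vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer bv = b.get(e.getKey());
            if (bv != null) dot += e.getValue() * bv;
            na += e.getValue() * e.getValue();
        }
        for (int v : b.values()) nb += v * v;
        return dot == 0 ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // One k-means assignment step: index of the most similar centroid
    // (centroids are term vectors too, e.g. cluster means).
    static int assign(Map<String, Integer> doc, List<Map<String, Integer>> centroids) {
        int best = 0;
        double bestSim = -1;
        for (int i = 0; i < centroids.size(); i++) {
            double s = cosine(doc, centroids.get(i));
            if (s > bestSim) { bestSim = s; best = i; }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> centroids = List.of(
            termVector("adenylate kinase phosphotransferase enzyme"),
            termVector("index search query lucene"));
        System.out.println(assign(termVector("lucene query parsing"), centroids)); // -> 1
    }
}
```

The full k-means loop would alternate this assignment step with recomputing centroids until assignments stop changing.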

Marc


- Original Message - 
From: "Alex Aw Seat Kiong" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, November 11, 2003 5:47 PM
Subject: Re: Document Clustering


> Hi!
>
> I'm also interested in it. Kindly CC me the latest progress of your
> clustering project.
>
> Regards,
> AlexAw
>
>
> - Original Message - 
> From: "Eric Jain" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Tuesday, November 11, 2003 10:07 PM
> Subject: Re: Document Clustering
>
>
> > > I'm working on it. Classification and Clustering as well.
> >
> > Very interesting... if you get something working, please don't forget to
> > notify this list :-)
> >
> > --
> > Eric Jain
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
>
>
>





Document Clustering

2003-11-11 Thread marc
Hi,

Does anyone have any sample code/documentation available for doing document-based 
clustering using Lucene?

Thanks,
Marc



parallelizing index building

2003-06-26 Thread Marc Dumontier
Hi,

I'm indexing 500 XML files, each ~150 MB, on an 8-CPU machine.

I'm wondering what the best strategy for making maximum use of resources is. I have 
tweaked the single-process indexer to index 5000 records (not files) in memory 
before writing out to disk.

Should I create an IndexThread and share the IndexWriter object across 5 threads, then 
monitor when one ends to start another, etc.? Or should I create different indexes and 
then do a series of merges?
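The shared-writer strategy can be sketched as follows (plain Java; the `Writer` interface here is a stand-in for Lucene's `IndexWriter`, whose `addDocument` is synchronized and therefore safe to share across threads):

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: one shared writer object, a fixed pool of worker threads
// feeding it documents. Illustration only; Writer stands in for
// IndexWriter so the skeleton stays self-contained.
public class ParallelIndexSketch {

    interface Writer {
        void addDocument(String doc);
    }

    static int indexAll(Writer writer, Iterable<String> docs, int threads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger count = new AtomicInteger();
        for (String doc : docs) {
            pool.submit(() -> {
                writer.addDocument(doc);   // the single shared writer
                count.incrementAndGet();
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return count.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // A toy thread-safe writer that just counts documents.
        AtomicInteger sink = new AtomicInteger();
        Writer writer = doc -> sink.incrementAndGet();
        int n = indexAll(writer, java.util.List.of("a", "b", "c", "d", "e"), 4);
        System.out.println(n + " " + sink.get()); // -> 5 5
    }
}
```

The alternative (separate indexes per thread, merged at the end) avoids lock contention on the writer but pays the merge cost; which wins depends on how expensive document construction is relative to writing.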

Any help would be appreciated.

Thanks,
Marc Dumontier
Bioinformatics Application Developer
Blueprint Initiative
Mount Sinai Hospital
Toronto
http://www.bind.ca


Release schedule?

2003-03-04 Thread Marc Worrell
Hi,

We are incorporating Lucene in a CMS.  It does some quite fancy matching and searching 
of documents and uses Lucene as one of its components.  We would like to influence the 
scoring of search terms for some fields.  This is possible with the new Similarity 
class that is implemented after release 1.2.

Is there a next release scheduled? And if so, approximately when will that be?

We would like to run the CMS with tested code and not with the code from the Lucene 
CVS...

Greetings, 

Marc Worrell





significant performance issues

2003-01-07 Thread Marc Dumontier
Hi all,

I just started trying to use Lucene to index approximately 13,000 XML 
documents representing biological data. Each document is approximately 
20-30 KB.

I modified some code from cocoon components to use SAX to parse my 
documents and create Lucene Documents. This process is very quick.

The following code is where I started off writing the index to disk.

writer = new IndexWriter(fsd, analyzer, true);

Iterator myit = docList.iterator();
while (myit.hasNext()) {
    writer.addDocument((Document) myit.next());
    System.out.println(++counter);
}
writer.close();

This is taking much more time than expected. I'm using the 
StandardAnalyzer, and my XML data is about 20-30 KB per file. The 
indexing is taking approximately 2-3 seconds per document, and as the 
index grows it gets significantly slower. I'm running this on a 2.4 GHz 
Linux machine with 1 GB of RAM.

I tried a few different strategies, but I end up with "too many open 
files" exceptions.

I don't think it should progressively slow down in proportion to the 
size of the index. Is this assumption wrong?

Am I doing something wrong? Is there a way to utilize memory more 
and the filesystem less, and just dump the index to disk periodically?

Any help would be appreciated. Thanks,

Marc Dumontier
Intermediate Developer
Blueprint Initiative
Mount Sinai Hospital
http://www.bind.ca






Re: Help for german queries

2002-10-02 Thread Marc Guillemot

Great, your stemmer does the job I expected for the Umlaut. Thanks.

Does anyone have an idea for compound words ("betreuung" is not found in a doc
containing "Kundenbetreuung")?

Marc.
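One common approach to the compound-word problem is dictionary-based decompounding at index time, so that "Kundenbetreuung" also yields the token "betreuung". A greedy sketch, assuming a word list is available (illustration only; real German decompounding also has to handle linking elements such as "-s" and "-en"):

```java
import java.util.*;

// Hypothetical sketch: split a German compound into known dictionary
// parts by repeatedly taking the longest known prefix. Not a Lucene
// class; a real analyzer would emit the parts as extra tokens.
public class Decompounder {

    static List<String> split(String word, Set<String> lexicon) {
        String w = word.toLowerCase();
        List<String> parts = new ArrayList<>();
        int start = 0;
        while (start < w.length()) {
            int end = w.length();
            // shrink until the prefix starting at 'start' is a known word
            while (end > start && !lexicon.contains(w.substring(start, end))) {
                end--;
            }
            if (end == start) {            // no known part: keep the whole word
                return List.of(w);
            }
            parts.add(w.substring(start, end));
            start = end;
        }
        return parts;
    }

    public static void main(String[] args) {
        Set<String> lexicon = Set.of("kunden", "betreuung");
        System.out.println(split("Kundenbetreuung", lexicon)); // -> [kunden, betreuung]
    }
}
```

Indexing both the whole compound and its parts lets a query for either form match the document.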



- Original Message -
From: "Clemens Marschner" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, October 02, 2002 2:36 PM
Subject: Re: Help for german queries


> Hm, sorry, I don't have the time right now, but I think it took me 10
> minutes to discover the location where I had to do the changes.
> I thought ä=ae would already be included.
> I included my GermanStemmer version in this post. Sorry, I can't do
> CVSing/diffing at the moment.
> The stemmer does ä->a and ae->a and doesn't distinguish between uppercase
> and lowercase. I'm not a linguist, so I can't say if it does overstemming.
> I commented out the expression below
>
>  // "t" occurs only as suffix of verbs.
>  else if ( buffer.charAt( buffer.length() - 1 ) == 't' /*&&
> !uppercase*/ ) {
>   buffer.deleteCharAt( buffer.length() - 1 );
>  }
>  else {
>   doMore = false;
>  }
>
> Hope that helps
>
> Clemens
>
> - Original Message -
> From: "Marc Guillemot" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Wednesday, October 02, 2002 12:47 PM
> Subject: Re: Help for german queries
>
>
> > The problem/question is not about the first-letter case but only about the
> > equivalence between "ä" and "ae", for instance.
> >
> > in my tests, searching for:
> > - Geschäft -> 13 results
> > - geschäft -> 0 result
> > - Geschaeft -> 0 result
> > - geschaeft -> 0 result
> >
> > Marc.
> >
> >
> > - Original Message -
> > From: "Clemens Marschner" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > Sent: Tuesday, October 01, 2002 1:16 PM
> > Subject: Re: Help for german queries
> >
> >
> > > there's a "feature" in the German stemmer (I would call it a bug) that
> > > treats words ending with "t" differently if they start with a capital or
> > > non-capital letter. Are you sure you didn't type "geschäft" and
> > > "Geschaeft"? Cause that's supposedly stemmed differently.
> > >
> > > --Clemens
> > >
> > > - Original Message -
> > > From: "Marc Guillemot" <[EMAIL PROTECTED]>
> > > To: <[EMAIL PROTECTED]>
> > > Sent: Tuesday, October 01, 2002 9:40 AM
> > > Subject: Help for german queries
> > >
> > >
> > > > Hi,
> > > >
> > > > I've performed some tests with Lucene for German indexing/search, but I
> > > > don't get the results I expected:
> > > >
> > > > - Umlaut:
> > > > search for:
> > > > - "Geschäft" -> x results
> > > > - "Geschaeft" -> no result
> > > > Is there an option in the standard German classes to make the two
> > > > searches above equivalent?
> > > >
> > > > - Composed words:
> > > > "betreuung" is not found in a doc containing "Kundenbetreuung"
> > > >
> > > > Any suggestions?
> > > >
> > > > Marc.






Re: Help for german queries

2002-10-02 Thread Marc Guillemot

The problem/question is not about the first-letter case but only about the
equivalence between "ä" and "ae", for instance.

In my tests, searching for:
- Geschäft -> 13 results
- geschäft -> 0 results
- Geschaeft -> 0 results
- geschaeft -> 0 results

Marc.


- Original Message -
From: "Clemens Marschner" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, October 01, 2002 1:16 PM
Subject: Re: Help for german queries


> there's a "feature" in the German stemmer (I would call it a bug) that
> treats words ending with "t" differently if they start with a capital or
> non-capital letter. Are you sure you didn't type "geschäft" and
> "Geschaeft"? Cause that's supposedly stemmed differently.
>
> --Clemens
>
> - Original Message -
> From: "Marc Guillemot" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Tuesday, October 01, 2002 9:40 AM
> Subject: Help for german queries
>
>
> > Hi,
> >
> > I've performed some tests with Lucene for german indexation/search but I
> > don't get the results I expected:
> >
> > - Umlaut:
> > search for:
> > - "Geschäft" -> x results
> > - "Geschaeft" -> no result
> > Is there an option in the standard german classes to make the 2 searches
> > above equivalent?
> >
> > - Composed words:
> > "betreuung" is not found in a doc containing "Kundenbetreuung"
> >
> > Any suggestions?
> >
> > Marc.
> >
> >
> >
>
>
>
>






Help for german queries

2002-10-01 Thread Marc Guillemot

Hi,

I've performed some tests with Lucene for German indexing/search, but I
don't get the results I expected:

- Umlaut:
search for:
- "Geschäft" -> x results
- "Geschaeft" -> no result
Is there an option in the standard German classes to make the two searches
above equivalent?

- Compound words:
"betreuung" is not found in a doc containing "Kundenbetreuung"

Any suggestions?

Marc.
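For the Umlaut question, one option (not a standard Lucene class; a hypothetical normalization that could run in a custom token filter before stemming) is to fold umlauts to their two-letter forms so both spellings index to the same term:

```java
// Sketch of umlaut folding so that "Geschäft" and "Geschaeft" produce
// the same term. Applied to both indexed tokens and query tokens, the
// two spellings become equivalent. Illustration only.
public class UmlautFolder {

    static String fold(String term) {
        return term.toLowerCase()
                   .replace("ä", "ae")
                   .replace("ö", "oe")
                   .replace("ü", "ue")
                   .replace("ß", "ss");
    }

    public static void main(String[] args) {
        System.out.println(fold("Geschäft").equals(fold("Geschaeft"))); // -> true
    }
}
```

The same filter must run at both index and query time, or the two sides will disagree on the term form.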






Re: SqlDirectory

2001-12-02 Thread Marc Kramis

Here you find the current code:

Java:
SQLDirectory.java (should work with all SQL databases via JDBC)

SQL:
Oracle: see appended scripts
SQL Server: change varchar2 to varchar, integer to bigint, and raw to binary

It seems to be quite stable by now.
The InputStream and OutputStream methods should be reviewed.

Indexing seems to be a little faster than with FSDirectory (both DB and files on
a remote server).
Querying is slower, but the included cache increases performance over time for
repeated queries (especially paging).

enjoy ;)
marc



- Original Message -
From: "Marc Kramis" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Monday, November 26, 2001 9:38 PM
Subject: SqlDirectory


> hi all
>
> some time ago, there was a short discussion about a database store. I also
> needed some persistence layer that was accessible via JDBC. It turned out
> that a BLOB implementation is strongly dependent on the RDBMS used and also
> performs poorly.
>
> I implemented a SqlDirectory, based on the idea of RAMDirectory and its
> buffers as the basic element.
> Goals:
> 1. should work with all JDBC-compliant RDBMSs (no adaptation required, no
> BLOBs!).
> 2. performance should be acceptable.
> 3. simple DB schema.
>
> Status:
> 1. tested on Oracle 8i (free Oracle JDBC driver, type 4) and SQL Server 2000
> (free Microsoft JDBC beta driver, type 4). Works perfectly.
> 2. consists of 2 tables and 1 index. (one tablespace can have several
> indexes, of course)
> 3. promising performance.
>
> Todo:
> 1. test reliability, performance, concurrency (multiple readers/writers);
> test with MySQL
> 2. code review
> 3. introduce caching (maybe a CacheDirectory)
>
> if someone has experience or just likes to test it, mail me. Anyway, could I
> simply attach the SqlDirectory.java file to my mails?
>
> marc
>



SqlDirectory.java
Description: Java source


create_lucene.sql
Description: Binary data


drop_lucene.sql
Description: Binary data



synchronization problem / bug?

2001-11-27 Thread Marc Kramis

hi

While testing the SqlDirectory, I found something really strange. The scenario
is a concurrent writer and searcher:
1. An IndexWriter is started and creates a write.lock until the close method
is called. This cleanly prevents other writers from accessing the index at the
same time and is OK.
2. Indexing goes on ...

But now, concurrently, the following process runs:
1. A Searcher is created with searcher = new IndexSearcher().
2. This process creates a commit.lock as expected and reads some files.
3. The commit.lock is released (immediately).
4. Now the querying is done and hits.doc(i) is read. During this, no
commit.lock is set, but again, some files are accessed (the
InputStream.readInternal method is called).
5. The searcher.close() method is called, which closes all open InputStreams.
(no commit.lock created or released)

Like that, from time to time, an exception occurs because the file has been
changed by the IndexWriter process running at the same time.

Any ideas about this? This should also occur with FSDirectory or
RAMDirectory, but more rarely, because these are faster at reading
results...

cheers
marc







SqlDirectory

2001-11-26 Thread Marc Kramis

hi all

some time ago, there was a short discussion about a database store. I also
needed some persistence layer that was accessible via JDBC. It turned out
that a BLOB implementation is strongly dependent on the RDBMS used and also
performs poorly.

I implemented a SqlDirectory, based on the idea of RAMDirectory and its
buffers as the basic element.
Goals:
1. should work with all JDBC-compliant RDBMSs (no adaptation required, no
BLOBs!).
2. performance should be acceptable.
3. simple DB schema.

Status:
1. tested on Oracle 8i (free Oracle JDBC driver, type 4) and SQL Server 2000
(free Microsoft JDBC beta driver, type 4). Works perfectly.
2. consists of 2 tables and 1 index. (one tablespace can have several
indexes, of course)
3. promising performance.

Todo:
1. test reliability, performance, concurrency (multiple readers/writers);
test with MySQL
2. code review
3. introduce caching (maybe a CacheDirectory)

if someone has experience or just likes to test it, mail me. Anyway, could I
simply attach the SqlDirectory.java file to my mails?

marc


