Re: Search Performance

2005-02-18 Thread Stefan Groschupf
Try a singleton pattern or a static field. Stefan Michael Celona wrote: I am creating new IndexSearchers... how do I cache my IndexSearcher... Michael -Original Message- From: David Townsend [mailto:[EMAIL PROTECTED] Sent: Friday, February 18, 2005 11:00 AM To: Lucene Users List Subject:
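The singleton suggestion above can be sketched roughly as follows. This is only a minimal illustration, not code from the thread: the class name SharedInstance and the factory parameter are hypothetical, and in a Lucene application the cached type would presumably be the IndexSearcher itself. The sketch assumes a modern Java version for Supplier and lambdas.

```java
import java.util.function.Supplier;

/** Lazily creates one shared instance and hands the same object to every caller. */
final class SharedInstance<T> {
    // Hypothetical factory; in a Lucene application it would open the searcher.
    private final Supplier<T> factory;
    // volatile so that a fully constructed instance is visible to all threads.
    private volatile T instance;

    SharedInstance(Supplier<T> factory) {
        this.factory = factory;
    }

    /** Returns the cached instance, creating it on the first call only. */
    T get() {
        T local = instance;
        if (local == null) {
            synchronized (this) {
                local = instance;
                if (local == null) {
                    local = factory.get();
                    instance = local;
                }
            }
        }
        return local;
    }
}
```

Holding one SharedInstance in a static field gives every request the same searcher instead of opening a new one per query. A cached searcher keeps working on the index as it was when opened, so the cached object would likely need to be replaced after the index changes.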

Re: Search Engine review article/book

2005-01-26 Thread Stefan Groschupf
+ the Lucene in Action book. :-) + scholar.google.com + the acm.org IR group + ieee.org has an IR group as well. Maybe you will find http://searchenginewatch.com/ useful too. HTH Stefan On 26.01.2005 at 23:18, Xiaohong Yang (Sharon) wrote: Hi all, I am looking for good review articles or books regar

Re: Sort Performance Problems across large dataset

2005-01-24 Thread Stefan Groschupf
Hi, do you optimize the index? Have you tried to implement your own hit collector? Stefan On 25.01.2005 at 01:01, Peter Hollas wrote: I am working on a publicly accessible Struts-based species database project where the number of species names is currently at 2.3 million, and in the near future will be

Re: Visualization of Lucene search results with a treemap

2004-07-01 Thread Stefan Groschupf
Do you know: http://websom.hut.fi/websom/comp.ai.neural-nets-new/html/root.html ? Interesting - is there any code available to draw the maps? The algorithm is described here: http://www.cis.hut.fi/research/som-research/book/ A short summary and some sample code is available here: http://davis.wpi.edu/

Re: search multiple indexes

2004-07-01 Thread Stefan Groschupf
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/MultiSearcher.html 100% right. I personally find code samples more interesting than just Javadoc. That is why I gave my hint; here is the code snippet from Nutch: /** Construct given a number of indexed segments. */ public IndexSearcher(Fil

Re: search multiple indexes

2004-07-01 Thread Stefan Groschupf
Possibly a silly question - but how would I go about searching multiple indexes using lucene? Do I need to basically repeat the code I use to search one index for each one, or is there a better way to do it? Take a look at the nutch.org source code. It does what you are looking for. HTH Stefan -

Re: Visualization of Lucene search results with a treemap

2004-07-01 Thread Stefan Groschupf
Dave, cool stuff, think about contributing that to Nutch... ;-)! Do you know: http://websom.hut.fi/websom/comp.ai.neural-nets-new/html/root.html ? Cheers, Stefan On 01.07.2004 at 23:28, David Spencer wrote: Inspired by these guys who put results from Google into a treemap... http://google.hivegro

[bug?] term frequency and empty content

2004-06-26 Thread Stefan Groschupf
Hi, I noticed something strange (1.4-rc4) when I add an empty text to my index, where text is "" or null: IndexWriter indexWriter = getIndexWriter(); document.add(Field.Text(Corpus.TEXT, text, true)); indexWriter.addDocument(document); I see this on stdout: "No tvx file" Furthermore IndexReade

term frequency in hits

2004-06-26 Thread Stefan Groschupf
Hi, another question, but first many thanks for the last hint, the new term frequency functionality of Lucene is just GREAT! ;) I have indexed a set of documents with different metadata, Language = DE or Language = EN. Now I wish to get term frequencies for DE and EN. The easiest solution would b

term vector

2004-06-23 Thread Stefan Groschupf
Hi, sorry, a stupid question: is there a best practice for getting the term vector of a document? Does anyone have experience with any kind of feature selection for dimension reduction, like Zipf's law, or with getting the TF/IDF of a term for the complete corpus? Thanks for any hints. Stefan -

Re: hit score in 1.3 vs 1.4

2004-06-11 Thread Stefan Groschupf
Hi Erik, in case we meet one day, and I am sure we will since the world is small, I have to invite you for a beer! :-) Thanks, your suggestion works, recreating the index solved the problem... Stefan On 11.06.2004 at 12:12, Erik Hatcher wrote: On Jun 11, 2004, at 5:51 AM, Stefan Groschupf wrote: Hi

hit score in 1.3 vs 1.4

2004-06-11 Thread Stefan Groschupf
Hi, I'm having a strange problem after upgrading Lucene 1.3 to 1.4 rc4. I'm using a third-party component that includes the old Lucene 1.3, but I need to run the new 1.4 rc4 in the same VM. So I unpack the component jar, remove all Lucene 1.3 classes, repack it, and just add the new lucene

Re: similarity of two texts

2004-05-31 Thread Stefan Groschupf
Lucene can't help you. Search for text classification or text clustering. Browse the tools section at www.text-mining.org; there you may find tools that can help you with this task. In general, some keywords for your further search: Feature extraction from text. Data mining algorithms for

Re: Paid support for Lucene - SUMMARY

2004-01-30 Thread Stefan Groschupf
On 30.01.2004 at 22:11, Stefan Groschupf wrote: JBoss Group http://jboss.org/ Does jboss really support maven? Sorry, doing 2 things at the same time is not good. Should be: "Does jboss really support lucene?" Stefan open technology: www.media-style.com o

Re: Paid support for Lucene - SUMMARY

2004-01-30 Thread Stefan Groschupf
JBoss Group http://jboss.org/ Does jboss really support maven? I think they are more focused on their J2EE server. open technology: www.media-style.com open source: www.weta-group.net open discussion: www.text-mining.org --

Re: Paid support for Lucene

2004-01-29 Thread Stefan Groschupf
I will not, but I would work to get a degree from mit.edu. B-) Just kidding, I wouldn't do that. http://www.ai.mit.edu/research/sponsors/sponsors.shtml Peace! Stefan I am willing as well. Scott On Jan 29, 2004, at 12:04 PM, Boris Goldowsky wrote: Strangely, the web site does not seem to list

Re: HTML tag filter...

2004-01-10 Thread Stefan Groschupf
If you browse the CVS of nutch.org you will find an implementation. HTH Stefan On 10.01.2004 at 19:43, [EMAIL PROTECTED] wrote: Hi group, would it be possible to implement an Analyzer that filters HTML code out of an HTML page? As a result I would have only the text free of any tagging. Is is m

Re: boosting & StandardAnalyzer, stop words

2003-12-10 Thread Stefan Groschupf
Perhaps we'd better continue this on lucene-dev. OK, I will subscribe to that list and ask again. Thanks! Stefan

Re: boosting & StandardAnalyzer, stop words

2003-12-09 Thread Stefan Groschupf
Ype, It's a bug, and there is a fix for this in the latest CVS near the end of the QueryParser.jj file: // avoid boosting null queries, such as those caused by stop words if (q != null) { q.setBoost(f); } I have checked out the latest sources from public CVS. The posted cod

Re: term vector or document vector

2003-12-08 Thread Stefan Groschupf
Damian Gajda wrote: BTW, I may send you the partly working Lucene with Dmitry's code patched in. Yeah, that would be very helpful. Thanks!

boosting & StandardAnalyzer

2003-12-08 Thread Stefan Groschupf
Hi, I noticed something really strange. I just tried the "document to query" thing with term frequencies and term boosting based on the term frequency. The code itself took maybe 3 minutes, but I spent around 2 hours hunting a NullPointerException I got in this line. query = QueryParser.p

Re: term vector or document vector

2003-12-08 Thread Stefan Groschupf
Damian Gajda wrote: Hello I already have some experience with Dmitry's implementation. Can you point me to Dmitry's code, so that I can take a look? I have only read about it. Feel free to contact me. I will do! ;) Thanks! Stefan -- open technology: www.media-style.com open source: www.w

Re: term vector or document vector

2003-12-08 Thread Stefan Groschupf
A few people have asked, a few people have expressed interest. I have to do some work for Nutch, but since I need the feature vector stuff for a commercial project I will try to implement it. Does anyone wish to join me??? ;) Stefan -- open technology: www.media-style.com open source: w

Re: term vector or document vector

2003-12-08 Thread Stefan Groschupf
Just to be sure, since there was a lot of discussion on the lists: there is currently no solution available to get a term vector for a document or a TF/IDF feature vector for a document, is there? Has anyone worked on such things? Does anyone wish to work on such things? Stefan -
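For readers unfamiliar with the term, a TF/IDF feature vector of the kind asked about above can be sketched in plain Java. This is only an illustration under stated assumptions, not Lucene code and not anything posted in the thread: the class name TfIdfSketch is hypothetical, documents are assumed to be already tokenized, and the weighting formula (raw term count times log(N / (df + 1)) + 1) is just one common variant.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Computes a TF/IDF weight for every distinct term of one tokenized document. */
final class TfIdfSketch {

    static Map<String, Double> vector(List<String> doc, List<List<String>> corpus) {
        // Term frequency: how often each term occurs in this document.
        Map<String, Integer> termFreq = new HashMap<>();
        for (String term : doc) {
            termFreq.merge(term, 1, Integer::sum);
        }

        Map<String, Double> weights = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : termFreq.entrySet()) {
            // Document frequency: number of corpus documents containing the term.
            int docFreq = 0;
            for (List<String> other : corpus) {
                if (other.contains(entry.getKey())) {
                    docFreq++;
                }
            }
            // Assumed idf variant; rare terms get a higher weight than common ones.
            double idf = Math.log((double) corpus.size() / (docFreq + 1)) + 1.0;
            weights.put(entry.getKey(), entry.getValue() * idf);
        }
        return weights;
    }
}
```

The resulting map is sparse: it holds only the terms that occur in the document, which is usually what clustering or classification code wants as input.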

Document Similarity

2003-12-08 Thread Stefan Groschupf
Hi Jing, are you working on the task of document similarity? I see nobody has answered your question. Creating a query out of a document would be very easy, but would it give good results? Document term vectors would provide more possibilities to use different data mining algorithms for cluste

Re: term vector (Damian patch)

2003-12-08 Thread Stefan Groschupf
Otis, based on this discussion: http://www.mail-archive.com/[EMAIL PROTECTED]/msg03350.html Stefan - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]

term vector (Damian patch)

2003-12-08 Thread Stefan Groschupf
Hi there, is Damian's patch in CVS or in the latest Lucene release? Does this patch allow retrieving a term vector of a document? Thanks! Stefan

Re: SearchBlox J2EE Search Component Version 1.1 released

2003-12-02 Thread Stefan Groschupf
Tun Lin wrote: Does anyone know a search engine that supports XML formats? http://jakarta.apache.org/lucene/docs/lucene-sandbox/ see the SAX/DOM XML demo.

Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Stefan Groschupf
Herb, On Friday 14 November 2003 13:39, Chong, Herb wrote: you're describing ad-hoc solutions to a problem that have an effect, but not one that is easily predictable. one can concoct all sorts of combinations of the query operators that would have something of the effect that i am describing.

Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Stefan Groschupf
PA, But Lucene is a low-level indexing library. I'm sure most people here will agree that Lucene is much more than a _low level_ indexing library. Maybe it is just a library, but it is definitely the *highest level* search technology available on the web for free. You ride roughshod over the hard

Re: Document Clustering

2003-11-11 Thread Stefan Groschupf
really cool Stuff!!! maurits van wijland wrote: Hi All and Marc, There is the carrot project : http://www.cs.put.poznan.pl/dweiss/carrot/ The carrot system consists of webservices that can easily be fed by a lucene resultlist. You simply have to create a JSP that creates this XML file and create

Re: Document Clustering

2003-11-11 Thread Stefan Groschupf
Marcel Stor wrote: Stefan Groschupf wrote: Hi, How is document clustering different/related to text categorization? Clustering: you try to find categories on your own and put the matching documents into them. You group all documents with minimal distance together. Would I be correct to say

Re: Document Clustering

2003-11-11 Thread Stefan Groschupf
Hi, How is document clustering different/related to text categorization? Clustering: you try to find categories on your own and put the matching documents into them. You group all documents with minimal distance together. Classification: you already have categories and samples for them, which help you to match other
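The distinction drawn above can be made concrete with a small sketch. It is a deliberately naive illustration, not an algorithm from the thread: the class GroupingSketch, its method names, the Euclidean distance, and the greedy threshold grouping are all assumptions chosen for brevity, and documents are assumed to be already converted to numeric feature vectors of equal length.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class GroupingSketch {

    /** Euclidean distance between two feature vectors of equal length. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Classification: categories and one sample per category already exist. */
    static String classify(double[] doc, Map<String, double[]> labeledSamples) {
        String best = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, double[]> sample : labeledSamples.entrySet()) {
            double d = distance(doc, sample.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = sample.getKey();
            }
        }
        return best;
    }

    /** Clustering: no categories are given; nearby documents form their own groups. */
    static List<List<double[]>> cluster(List<double[]> docs, double maxDistance) {
        List<List<double[]>> groups = new ArrayList<>();
        for (double[] doc : docs) {
            List<double[]> home = null;
            for (List<double[]> group : groups) {
                // Compare against the first member of each existing group.
                if (distance(doc, group.get(0)) <= maxDistance) {
                    home = group;
                    break;
                }
            }
            if (home == null) {
                home = new ArrayList<>();
                groups.add(home);
            }
            home.add(doc);
        }
        return groups;
    }
}
```

classify needs labeled samples as input, while cluster discovers the groups from the distances alone, which is the difference the message describes.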

Re: Document Clustering

2003-11-11 Thread Stefan Groschupf
Hi Marc, I'm working on it. Classification and clustering as well. I was planning to do it for nutch.org, but some guys there broke some important basic work I had already done, so maybe I will not contribute it there. However, it will be open source and I can notify you if something

Re: Index entire filesystem

2003-11-05 Thread Stefan Groschupf
Wouldn't mind joining in a joint approach, only problem is timing - it would probably be late December before we could start putting the hours in. We all do this just for fun, so no rush! However, more people means less work for everybody and faster results. We only need a generic API, but I had done som

Re: Index entire filesystem

2003-11-05 Thread Stefan Groschupf
alternate solution for pdfs. I'd be interested in knowing whether anyone is working on a pure java solution that would give us a single method for handling ms office documents / pdfs / etc. Cheers Pete - Original Message - From: "Stefan Groschupf" <[EMAIL PROTECTED]> T

Re: Index entire filesystem

2003-11-05 Thread Stefan Groschupf
I wrote to this list some days ago to announce a way to parse 182 file formats. There was a tiny bug report some days ago; I hope I can fix it. Browse the archive to find out more. Cheers Stefan Marcel Stor wrote: Hi all, I'm thinkin' about writing a search tool for my filesyste

zipf law?

2003-11-02 Thread Stefan Groschupf
Hi, sorry, a very stupid question: does Lucene apply Zipf's law during indexing? Thanks Stefan

182 file formats for lucene!!! was: Re: Exotic format indexing?

2003-10-30 Thread Stefan Groschupf
Hi there, just to let you know, I have implemented a plugin for the Nutch project that can parse 182 file formats including MS Office. I simply use OpenOffice and its available Java API. It is really straightforward to use. You can find some info and a link to the open source code here: http://so

Re: Best practice

2003-10-28 Thread Stefan Groschupf
William W wrote: Hi Folks, Is there any Lucene best practice? www.nutch.org ;)

Re: Jmeter

2003-10-15 Thread Stefan Groschupf
Elsa Hernandez wrote: Hi, I would like to know if someone has used Jmeter to prove/test the performance of your web applications, or if someone could suggest a tool/application that they have used. Thank you. http://eclipsecolorer.sourceforge.net/index_profiler.html It is the best I have ever found.

pre analysing

2003-09-21 Thread Stefan Groschupf
Hi there, I wish to run a pre-analyzer that helps me choose the right analyzer to run on my stream. For instance, I wish to detect the language of my text and then choose a language-dependent stop word remover. Since it is a token stream and my language detection needs the whole text t