Re: Search Performance

2005-02-18 Thread Stefan Groschupf
Try a singleton pattern or a static field.
Stefan
Michael Celona wrote:
I am creating new IndexSearchers... how do I cache my IndexSearcher...
Michael
-Original Message-
From: David Townsend [mailto:[EMAIL PROTECTED] 
Sent: Friday, February 18, 2005 11:00 AM
To: Lucene Users List
Subject: RE: Search Performance

Are you creating new IndexSearchers or IndexReaders on each search?  Caching
your IndexSearchers has a dramatic effect on speed.
David Townsend
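The static-field caching suggested above can be sketched with the initialization-on-demand holder idiom. This is a hedged sketch with a hypothetical `SearcherCache` class; the `Object` stand-in marks where real code would construct the shared Lucene `IndexSearcher` (e.g. `new IndexSearcher("/path/to/index")`):

```java
// Lazy, thread-safe singleton cache for an expensive-to-build searcher.
// Holder.INSTANCE is created exactly once, on the first call to get(),
// by the JVM's class-initialization guarantees -- no explicit locking.
final class SearcherCache {
    private SearcherCache() {}

    private static final class Holder {
        // Real code would build the Lucene searcher here, e.g.:
        //   static final IndexSearcher INSTANCE = new IndexSearcher("/path/to/index");
        static final Object INSTANCE = new Object(); // stand-in for IndexSearcher
    }

    static Object get() {
        return Holder.INSTANCE; // every caller shares the same instance
    }
}
```

Every search request would then call `SearcherCache.get()` instead of constructing a new searcher, which is what gives the dramatic speedup mentioned above.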
-Original Message-
From: Michael Celona [mailto:[EMAIL PROTECTED]
Sent: 18 February 2005 15:55
To: Lucene Users List
Subject: Search Performance
What is single-handedly the best way to improve search performance?  I have
an index in the 2G range stored on the local file system of the searcher.
Under a load test of 5 simultaneous users my average search time is ~4700
ms.  Under a load test of 10 simultaneous users my average search time is
~1 ms.  I have given the JVM 2G of memory and am using dual 3GHz
Xeons.  Any ideas?


Michael
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

 




Re: Search Engine review article/book

2005-01-26 Thread Stefan Groschupf
+ the "Lucene in Action" book :-)
+ scholar.google.com
+ acm.org IR group
+ ieee.org has an IR group as well
Maybe you will find http://searchenginewatch.com/ useful as well.
HTH
Stefan
On 26.01.2005 at 23:18, Xiaohong Yang (Sharon) wrote:
Hi all,
I am looking for good review articles or books regarding the latest search 
engine development trends and practices.  Any suggestions would be very 
helpful.  Any comments not covered by articles are also welcome.

Thanks a lot,
Sharon
---
company:http://www.media-style.com
forum:  http://www.text-mining.org
blog:   http://www.find23.net


Re: Sort Performance Problems across large dataset

2005-01-24 Thread Stefan Groschupf
Hi,
do you optimize the index?
Did you try to implement your own HitCollector?
Stefan
On 25.01.2005 at 01:01, Peter Hollas wrote:
I am working on a public accessible Struts based species database 
project where the number of species names is currently at 2.3 million, 
and in the near future will be somewhere nearer 4 million (probably 
the largest there is). The species names are typically 1 to 7 words in 
length, and the broad requirement is to be able to do a fulltext 
search across them. It is also necessary to sort the results into 
alphabetical order by species name.

Currently we can issue a simple search query and expect a response 
back in about 0.2 seconds (~3,000 results) with the Lucene index that 
we have built. Lucene gives a much more predictable and faster average 
query time than using standard fulltext indexing with mySQL. This 
however returns result in score order, and not alphabetically.

To sort the resultset into alphabetical order, we added the species 
names as a separate keyword field, and sorted using it whilst 
querying. This solution works fine, but is unacceptable since a query 
that returns thousands of results can take upwards of 30 seconds to 
sort them.

My question is whether it is possible to somehow return the names in 
alphabetical order without using a String SortField. My last resort 
will be to perform a monthly index rebuild, and return results by 
index order (about a day to re-index!). But ideally there might be a 
way to modify the Lucene API to incorporate a scoring system in a way 
that scores by lexical order.

Any ideas are appreciated!
Many thanks, Peter.
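One hedged alternative to the slow String SortField: precompute an alphabetical ordinal per species name at (re)index time, store it as an extra field, and sort on the integer rather than the string. The ordinal assignment itself is cheap; a sketch with a hypothetical `SortOrdinals` helper (plain Java, not Lucene-specific):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SortOrdinals {
    // Map each name to its rank in alphabetical order; sorting hits by
    // this int field avoids materializing and comparing strings per hit.
    static Map<String, Integer> ordinals(List<String> names) {
        List<String> sorted = new ArrayList<String>(names);
        Collections.sort(sorted);
        Map<String, Integer> ord = new HashMap<String, Integer>();
        for (int i = 0; i < sorted.size(); i++) {
            ord.put(sorted.get(i), i);
        }
        return ord;
    }
}
```

The trade-off is that ordinals must be recomputed when names are added, so this suits periodic rebuilds better than continuous updates.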


---
company:http://www.media-style.com
forum:  http://www.text-mining.org
blog:   http://www.find23.net


Re: Visualization of Lucene search results with a treemap

2004-07-01 Thread Stefan Groschupf

Do you know:
http://websom.hut.fi/websom/comp.ai.neural-nets-new/html/root.html ?
Interesting - is there any code available to draw the maps?
The algorithm is described here;
http://www.cis.hut.fi/research/som-research/book/
A short summary and some sample code is available here:
http://davis.wpi.edu/~matt/courses/soms/
Some more interesting papers about visualization are available at the 
text-mining.org community page:
http://www.text-mining.org/index.jsp?folderPK=793

Happy hacking! :-)
Stefan
---
enterprise information technology consulting
open technology:   http://www.media-style.com
open source:   http://www.weta-group.net
open discussion:http://www.text-mining.org


Re: search multiple indexes

2004-07-01 Thread Stefan Groschupf
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/MultiSearcher.html

100% Right.
I personally find code samples more interesting than plain javadoc;
hence my hint. Here is the code snippet from Nutch:
/** Construct given a number of indexed segments. */
public IndexSearcher(File[] segmentDirs) throws IOException {
  NutchSimilarity sim = new NutchSimilarity();
  Searchable[] searchables = new Searchable[segmentDirs.length];
  segmentNames = new String[segmentDirs.length];
  for (int i = 0; i < segmentDirs.length; i++) {
    org.apache.lucene.search.Searcher searcher =
        new org.apache.lucene.search.IndexSearcher(
            new File(segmentDirs[i], "index").toString());
    searcher.setSimilarity(sim);
    searchables[i] = searcher;
    segmentNames[i] = segmentDirs[i].getName();
  }
  this.luceneSearcher = new MultiSearcher(searchables);
  this.luceneSearcher.setSimilarity(sim);
}
Kent Beck said: "Monkey see, Monkey do." ;-)
Cheers,
Stefan
---
enterprise information technology consulting
open technology:   http://www.media-style.com
open source:   http://www.weta-group.net
open discussion:http://www.text-mining.org


Re: search multiple indexes

2004-07-01 Thread Stefan Groschupf
Possibly a silly question - but how would I go about searching multiple
indexes using lucene?  Do I need to basically repeat the code I use to
search one index for each one, or is there a better way to do it?
Take a look at the nutch.org source code. It does what you are searching 
for.
HTH
Stefan
---
enterprise information technology consulting
open technology:   http://www.media-style.com
open source:   http://www.weta-group.net
open discussion:http://www.text-mining.org



Re: Visualization of Lucene search results with a treemap

2004-07-01 Thread Stefan Groschupf
Dave,
cool stuff, think about contributing that to Nutch! ;-)
Do you know:
http://websom.hut.fi/websom/comp.ai.neural-nets-new/html/root.html ?
Cheers,
Stefan
On 01.07.2004 at 23:28, David Spencer wrote:
Inspired by these guys who put results from Google into a treemap...
http://google.hivegroup.com/
I did up my own version running against my index of OSS/javadoc trees.
This query for "thread pool" shows it off nicely:
http://www.searchmorph.com/kat/tsearch.jsp?s=thread%20pool&side=300&goal=500

This is the empty search form:
http://www.searchmorph.com/kat/tsearch.jsp
And the weblog entry has a few more links, especially useful if you don't  
know what a treemap is:

http://searchmorph.com/weblog/index.php?id=18
Oh: as a start, a treemap is a visualization technique, not  
java.util.TreeMap. Bigger boxes show a higher score, and x,y location  
has no significance.

Enjoy,
  Dave


---
enterprise information technology consulting
open technology:   http://www.media-style.com
open source:   http://www.weta-group.net
open discussion:http://www.text-mining.org


[bug?] term frequency and empty content

2004-06-26 Thread Stefan Groschupf
Hi,
I noticed something strange (1.4-rc4):
When I add an empty text to my index,
where text is "" or null:
 IndexWriter indexWriter = getIndexWriter();
 document.add(Field.Text(Corpus.TEXT, text, true));
 indexWriter.addDocument(document);
I see this in stdout: "No tvx file".
Furthermore, IndexReader.terms() returns just null.
Stefan
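A hedged workaround, on the assumption that the "No tvx file" message comes from documents indexed with an empty or null field: skip such texts before they reach the IndexWriter. The guard predicate (hypothetical `IndexGuard` helper):

```java
final class IndexGuard {
    // True only for text worth indexing; null and whitespace-only
    // strings are rejected so no empty field reaches the IndexWriter.
    static boolean hasContent(String text) {
        return text != null && text.trim().length() > 0;
    }
}
```

The addDocument call above would then be wrapped in `if (IndexGuard.hasContent(text)) { ... }`.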
---
enterprise information technology consulting
open technology:   http://www.media-style.com
open source:   http://www.weta-group.net
open discussion:http://www.text-mining.org


term frequency in hits

2004-06-26 Thread Stefan Groschupf
Hi,
another question, but first many thanks for the last hint; the new term 
frequency functionality of Lucene is just GREAT! ;)
I have indexed a set of documents with different metadata: Language = DE 
or Language = EN.
Now I wish to get term frequencies for DE and EN separately. The easiest 
solution would be to create one index for DE and one index for EN, but that 
is very unhandy for the other things I wish to do with the corpus.
Alternatively, I could run a query, iterate over the documents, and count 
the terms myself, but this is very slow.

Is there another way, which maybe I didn't see, to get term frequencies for a 
defined set of documents?

Thanks for providing any light! ;)
Greetings,
Stefan
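The counting itself is simple once the document subset is fixed. A plain-Java sketch with hypothetical data shapes (with Lucene one would instead walk TermEnum/TermDocs and test each document against the Language field):

```java
import java.util.HashMap;
import java.util.Map;

final class LangTermFreqs {
    // Sum term frequencies over only those documents whose language matches.
    // docLang:  docId -> language code ("DE", "EN", ...)
    // docTerms: docId -> that document's terms (repeats count as occurrences)
    static Map<String, Integer> termFreqs(Map<String, String> docLang,
                                          Map<String, String[]> docTerms,
                                          String lang) {
        Map<String, Integer> freqs = new HashMap<String, Integer>();
        for (Map.Entry<String, String[]> e : docTerms.entrySet()) {
            if (!lang.equals(docLang.get(e.getKey()))) {
                continue; // document is outside the requested subset
            }
            for (String term : e.getValue()) {
                Integer old = freqs.get(term);
                freqs.put(term, old == null ? 1 : old + 1);
            }
        }
        return freqs;
    }
}
```

This keeps one index for everything and restricts the counting, which matches the poster's wish to avoid per-language indexes.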
---
enterprise information technology consulting
open technology:   http://www.media-style.com
open source:   http://www.weta-group.net
open discussion:http://www.text-mining.org


term vector

2004-06-23 Thread Stefan Groschupf
Hi,
sorry, a stupid question:
Is there a best practice to get the term vector of a document?
Is there any experience with doing any kind of feature selection for 
dimension reduction, like Zipf's law, or getting tf/idf of a term for the 
complete corpus?

Thanks for any hints.
Stefan
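For the tf/idf half of the question, the classic weighting is easy to compute once the raw counts are in hand. A hedged sketch (hypothetical `TfIdf` helper; the exact damping and smoothing vary by formulation):

```java
final class TfIdf {
    // Classic tf-idf: the term's frequency in the document times the log
    // of the inverse document frequency over the whole corpus. The +1
    // guards against division by zero for terms seen in no document.
    static double tfIdf(int termFreq, int docFreq, int numDocs) {
        return termFreq * Math.log((double) numDocs / (1 + docFreq));
    }
}
```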
---
enterprise information technology consulting
open technology:   http://www.media-style.com
open source:   http://www.weta-group.net
open discussion:http://www.text-mining.org


Re: hit score in 1.3 vs 1.4

2004-06-11 Thread Stefan Groschupf
Hi Erik,
in case we ever meet, and I'm sure we will since the world is small, I 
have to invite you to a beer! :-)
Thanks, your suggestion works; recreating the index solved the problem...

Stefan
On 11.06.2004 at 12:12, Erik Hatcher wrote:
On Jun 11, 2004, at 5:51 AM, Stefan Groschupf wrote:
Hi,
I'm having a strange problem since upgrading Lucene 1.3 to 1.4 rc4.
I'm using a third-party component that includes the old Lucene 1.3, but 
I need to run the new 1.4 rc4 in the same VM.
So I unpacked the component jar, removed all Lucene 1.3 classes, 
repacked it again, and just added the new Lucene to the classpath.
So far everything runs well, but the hits.score(i) method returns 
100% for each hit. ;-o

Does someone have any idea what the problem could be?
I'm not sure you'll have much luck with two versions playing well in 
the same VM together.  There are static variables used as well as JVM 
system properties that factor into configuration settings.  I don't 
have any specific recommendations, just a gut feeling that the issues 
are not going to be pleasant.

Erik


---
enterprise information technology consulting
open technology:   http://www.media-style.com
open source:   http://www.weta-group.net
open discussion:http://www.text-mining.org


hit score in 1.3 vs 1.4

2004-06-11 Thread Stefan Groschupf
Hi,
I'm having a strange problem since upgrading Lucene 1.3 to 1.4 rc4.
I'm using a third-party component that includes the old Lucene 1.3, but I 
need to run the new 1.4 rc4 in the same VM.
So I unpacked the component jar, removed all Lucene 1.3 classes, repacked 
it again, and just added the new Lucene to the classpath.
So far everything runs well, but the hits.score(i) method returns 
100% for each hit. ;-o

Does someone have any idea what the problem could be?
Thanks for any hints,
Stefan
---
enterprise information technology consulting
open technology:   http://www.media-style.com
open source:   http://www.weta-group.net
open discussion:http://www.text-mining.org


Re: similarity of two texts

2004-05-31 Thread Stefan Groschupf
Lucene can't help you here.
Search for text classification or text clustering.
Browse the tools section @ www.text-mining.org; there you may  
find tools that can help you with this task.
In general, some keywords for your further search:

Feature extraction from text.
Data mining algorithms for clustering or classification.
One algorithm you may find useful is the "Support Vector Machine".
HTH
Stefan
P.S:
Support your local book store and order:
http://www.amazon.com/exec/obidos/tg/detail/-/1558605525/qid=1086027371/sr=8-1/ref=sr_8_xs_ap_i1_xgl14/103-6852557-3809420?v=glance&s=books&n=507846
This book has an interesting section for you.
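Once each text is reduced to a term-frequency vector, the standard similarity measure is the cosine of the angle between the two vectors. A minimal sketch with a hypothetical `TextSimilarity` helper (naive tokenization, no stemming or stop words):

```java
import java.util.HashMap;
import java.util.Map;

final class TextSimilarity {
    // Cosine similarity over raw term counts: 1.0 for identical term
    // distributions, 0.0 for texts sharing no terms.
    static double cosine(String a, String b) {
        Map<String, Integer> va = counts(a);
        Map<String, Integer> vb = counts(b);
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : va.entrySet()) {
            Integer other = vb.get(e.getKey());
            if (other != null) dot += e.getValue() * (double) other;
            normA += e.getValue() * (double) e.getValue();
        }
        for (int v : vb.values()) normB += v * (double) v;
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Lowercase, split on non-word characters, count occurrences.
    static Map<String, Integer> counts(String text) {
        Map<String, Integer> m = new HashMap<String, Integer>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (t.length() == 0) continue;
            Integer old = m.get(t);
            m.put(t, old == null ? 1 : old + 1);
        }
        return m;
    }
}
```

Real systems would weight the counts with tf-idf before taking the cosine, which downweights terms common to every document.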


On 31.05.2004 at 20:10, uddam chukmol wrote:
Hi,
I'm a newbie to Lucene and heard that it helps in the information  
retrieval process. However, my problem is not really related to the  
information retrieval but to the comparison of two texts. I think  
Lucene may help resolving it.

I would like to have a clue on how to compare two given texts and  
finally say how much they are similar.

Has anyone had this kind of experience? I will be very grateful to  
hear your ideas and your recommendations.

Thanks before hand!
Uddam CHUKMOL


---
open technology:   http://www.media-style.com
open source:   http://www.weta-group.net
open discussion:http://www.text-mining.org


Re: Paid support for Lucene - SUMMARY

2004-01-30 Thread Stefan Groschupf
On 30.01.2004 at 22:11, Stefan Groschupf wrote:

JBoss Group  http://jboss.org/
Does jboss really support maven?
Sorry, doing 2 things at the same time is not good.
Should be:  "Does jboss really support lucene?"
Stefan

open technology:   www.media-style.com
open source:   www.weta-group.net
open discussion:www.text-mining.org


Re: Paid support for Lucene - SUMMARY

2004-01-30 Thread Stefan Groschupf
JBoss Group  http://jboss.org/
Does jboss really support maven?
I think they are more focused on their J2EE server.


open technology:   www.media-style.com
open source:   www.weta-group.net
open discussion:www.text-mining.org


Re: Paid support for Lucene

2004-01-29 Thread Stefan Groschupf
I will not, but I would work to get a degree from mit.edu. B-)
Just kidding, I wouldn't do that.
http://www.ai.mit.edu/research/sponsors/sponsors.shtml
Peace!
Stefan



I am willing as well.

Scott

On Jan 29, 2004, at 12:04 PM, Boris Goldowsky wrote:

Strangely, the web site does not seem to list any vendors who provide
incident support for Lucene.  That can't be right, can it?
Can anyone point me to organizations that would be willing to provide
support for Lucene issues?
Thanks,
Boris
--
Boris Goldowsky
[EMAIL PROTECTED]
www.goldowsky.com/consulting

open technology:   www.media-style.com
open source:   www.weta-group.net
open discussion:www.text-mining.org


Re: HTML tag filter...

2004-01-10 Thread Stefan Groschupf
If you browse the CVS of nutch.org you will find an implementation.

HTH
Stefan
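If the Nutch parser is overkill, a naive fallback is to strip tags with a regex. This is a hedged sketch with a hypothetical `TagStripper` helper: it breaks on scripts, comments, and malformed markup, so a real HTML parser is safer for production indexing.

```java
final class TagStripper {
    // Replace every <...> run with a space, then collapse whitespace.
    // Good enough for simple pages; not a substitute for a real parser.
    static String strip(String html) {
        return html.replaceAll("(?s)<[^>]*>", " ")
                   .replaceAll("\\s+", " ")
                   .trim();
    }
}
```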
On 10.01.2004 at 19:43, [EMAIL PROTECTED] wrote:

Hi group,

would it be possible to implement an Analyzer that filters HTML code out 
of an
HTML page? As a result I would have only the text, free of any tagging.

Is it maybe better to use other existing open source software for 
that? Has
somebody tried that here?

Cheers,
Ralf







Re: boosting & StandardAnalyzer, stop words

2003-12-10 Thread Stefan Groschupf

Perhaps we'd better continue this on lucene-dev.
 

Ok, I will subscribe to that list and ask again. 
Thanks!
Stefan
--
open technology: http://www.media-style.com
open source: http://www.weta-group.net
open discussion: http://www.text-mining.org





Re: boosting & StandardAnalyzer, stop words

2003-12-09 Thread Stefan Groschupf
Ype,

It's a bug, and there is a fix for this in the latest CVS
near the end of the QueryParser.jj file:
 // avoid boosting null queries, such as those caused by stop words
 if (q != null) {
   q.setBoost(f);
 }
 

I checked out the latest sources from the public CVS. The posted code 
lines above are in QueryParser.jj, but this does not fix the problem. ;/
What could it be?

java.lang.NullPointerException
   at org.apache.lucene.queryParser.QueryParser.Term(Unknown Source)
   at org.apache.lucene.queryParser.QueryParser.Clause(Unknown Source)
   at org.apache.lucene.queryParser.QueryParser.Query(Unknown Source)
   at org.apache.lucene.queryParser.QueryParser.Clause(Unknown Source)
   at org.apache.lucene.queryParser.QueryParser.Query(Unknown Source)
   at org.apache.lucene.queryParser.QueryParser.parse(Unknown Source)
   at org.apache.lucene.queryParser.QueryParser.parse(Unknown Source)
Stefan




Re: term vector or document vector

2003-12-08 Thread Stefan Groschupf


Damian Gajda wrote:

BTW, I may send you the partly working Lucene with Dmitry's code patched
in.
 

Yeah that would be very helpful.

Thanks!

--
open technology: http://www.media-style.com
open source: http://www.weta-group.net
open discussion: http://www.text-mining.org




boosting & StandardAnalyzer

2003-12-08 Thread Stefan Groschupf
Hi,

I noticed something really strange.

I just tried the "document to query" thing with term frequencies and 
term boosting based on the term frequency.
The code itself took maybe 3 minutes, but I spent around 2 hours 
chasing a NullPointerException I got in this line:

   query = QueryParser.parse(searchStringBuffer.toString(), 
IFieldNames.INSTANCE_VALUE, new StandardAnalyzer());

I figured out this happens when my search string contains "will"^13.
I replaced the StandardAnalyzer with my own implementation of Analyzer 
that just does nothing, and now it works.
Looks like the StandardAnalyzer has a problem with stop word removal when 
term boosting is used.

Is that a bug, or just a mistake by myself?

Stefan



  

--
open technology:  www.media-style.com
open source:  www.weta-group.net
open discussion:  www.text-mining.org




Re: term vector or document vector

2003-12-08 Thread Stefan Groschupf
Damian Gajda wrote:

Hello I already have some experience with Dmitry's implementation.

Can you point me to Dmitry's code, so that I can take a look? I had just 
read about it.

Feel free to contact me.
 

I will do! ;)
Thanks!
Stefan

--
open technology:  www.media-style.com
open source:  www.weta-group.net
open discussion:  www.text-mining.org




Re: term vector or document vector

2003-12-08 Thread Stefan Groschupf


A few people have asked, a few people have expressed interest.
 

I have to do some work for Nutch, but since I need the feature vector 
stuff for a commercial project, I will try to implement it.
Does someone wish to join me? ;)

Stefan



--
open technology:  www.media-style.com
open source:  www.weta-group.net
open discussion:  www.text-mining.org




Re: term vector or document vector

2003-12-08 Thread Stefan Groschupf
Just to be sure, since there was a lot of discussion on the lists:
there is actually no solution available yet to get a term vector for a 
document, or a TF/IDF feature vector for a document, is there?

Has someone worked on such things?
Does someone wish to work on such things?
Stefan





Document Similarity

2003-12-08 Thread Stefan Groschupf
Hi Jing,

do you work on the task of document similarity?
I see nobody has answered your question.
Creating a query out of a document would be very easy, but would it 
provide good results?
Document term vectors would provide more possibilities to use different 
data mining algorithms for clustering or classification.

Stefan

--
open technology:  www.media-style.com
open source:  www.weta-group.net
open discussion:  www.text-mining.org




Re: term vector (Damian patch)

2003-12-08 Thread Stefan Groschupf
Otis,

based on this discussion:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg03350.html
Stefan





term vector (Damian patch)

2003-12-08 Thread Stefan Groschupf
Hi there,

is Damian's patch in the CVS or the latest Lucene release?
Does this patch allow retrieving a term vector of a document?
Thanks!

Stefan

--
open technology:  www.media-style.com
open source:  www.weta-group.net
open discussion:  www.text-mining.org




Re: SearchBlox J2EE Search Component Version 1.1 released

2003-12-02 Thread Stefan Groschupf
Tun Lin wrote:

Does anyone know a search engine that supports XML formats? 
 

http://jakarta.apache.org/lucene/docs/lucene-sandbox/
see  SAX/ DOM XML demo.


--
open technology:  www.media-style.com
open source:  www.weta-group.net
open discussion:  www.text-mining.org




Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Stefan Groschupf
Herb,

On Friday 14 November 2003 13:39, Chong, Herb wrote:
 

you're describing ad-hoc solutions to a problem that have an effect, but
not one that is easily predictable. one can concoct all sorts of
combinations of the query operators that would have something of the effect
that i am describing. crossing sentence boundaries, however, can't be done
   

I'm not sure I understand you right. You wish to make language-based 
queries to an IR system?
If I'm right, maybe I can help you!
Here is a presentation by MIT that does something similar:
http://mitworld.mit.edu/stream/155/
That is a video > 1h, but the first talk contains the part that is 
interesting for you.
Leslie Pack Kaelbling, MIT Professor of Electrical Engineering and 
Computer Science, holds it.
(The sequence with the two guys with a phone is interesting for you.)
They just have 2 components around the "nl ir query": speech to 
text and text to speech.

What you can do is use a POS tagger (i.e. a maximum entropy model based 
one, or a Brill tagger if you just have English) and use a data mining 
algorithm to weight your terms.
Maybe you can use a hidden Markov model for that.

You can build this on top of Lucene; it shouldn't be that difficult.

But maybe I understood you wrong...

Cheers
Stefan
--
open technology: www.media-style.com
open source: www.weta-group.net
open discussion: www.text-mining.org


Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Stefan Groschupf
PA,

 But Lucene is a low-level indexing library. 
I'm sure most people here will agree that Lucene is much more than a 
_low level_ indexing library.

Maybe it is just a library, but it is definitely the *highest level* search 
technology available on the web for free.
You ride roughshod over the hard work of others.

You are just talking b***, guy! After 3 days I have around 90 (!!!) 
postings from you in my inbox from lucene and nutch.
All of them are at best a summary of the first 3 entries of a Google 
search, or just blah blah.

You definitely missed reading the mailing list guidelines.
You can find them here, and you should read them carefully!
http://jakarta.apache.org/site/mail.html This could be a useful link for 
you as well: http://www.dtcc.edu/cs/rfc1855.html

Since you don't wish to get personal mails, sorry, I have to post 
it here.
I'm sorry to all others as well, since this is normally a harmonious place 
for good IR conversation and the web is open for everyone.
But this only works with an etiquette, and this guy really missed 
reading it. I'm sure I'm not the only one that sees it like this.

PA, can you do me a personal favour?! Install an IRC client and go to 
#rss! There is a nice "send" aqua button as well.

Thanks!
Stefan
P.S. I'm really really sorry for this mail!!!

PP.S. It is not the first mailing list that PA is spamming; see first entry:

http://www.google.com/search?hl=en&lr=&ie=UTF-8&oe=UTF-8&safe=off&q=petite_abeille%40mac.com&btnG=Google+Search




Re: Document Clustering

2003-11-11 Thread Stefan Groschupf
really cool stuff!!!

maurits van wijland wrote:

Hi All and Marc,

There is the Carrot project:
http://www.cs.put.poznan.pl/dweiss/carrot/
The Carrot system consists of web services that can easily be fed by a Lucene
result list. You simply have to create a JSP that creates this XML file and
create a custom process and input component. The input component
for Lucene could look like:

http://www.dawidweiss.com/projects/carrot/componentDescriptor"; framework  =
"Carrot2">
   http://localhost/weblucene/c2.jsp";
  infoURL  = "http://localhost/weblucene/";
   />

The c2.jsp file simply has to translate a result list into an XML file such
as:

   
...
1.0
http://...
sum 1
snip 2
   
   
...
1.0
http://...
sum 2
snip 2
   

Feed this into the Carrot system, and you will get a nicely clustered
result list. The amazing part of this clustering mechanism is that
the cluster labels are incredible; they're great!
Then there is an open source project called Classifier4J that can
be used for classification, the opposite of clustering. These other
open source projects are a great addition to the Lucene system.
I hope this helps...

Marc, what are you building?? Maybe we can help!

Kind regards,

Maurits

- Original Message - 
From: "marc" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, November 11, 2003 5:15 PM
Subject: Document Clustering

Hi,

does anyone have any sample code/documentation available for doing document
based clustering using lucene?
Thanks,
Marc


 

--
day time: www.media-style.com
spare time: www.text-mining.org | www.weta-group.net




Re: Document Clustering

2003-11-11 Thread Stefan Groschupf


Marcel Stor wrote:

Stefan Groschupf wrote:
 

Hi,
   

How is document clustering different/related to text categorization?
 

Clustering: try to find own categories and put documents that match
in it. You group all documents with minimal distance together.
   

Would I be correct to say that you have to define a "distance threshold"
parameter in order to define when to build a new category for a certain
group?
 

I'm not sure. There are different data mining algorithms that could be
used; it depends on the algorithm. I prefer Support Vector Machines (SVM).
There you calculate distances of multi-dimensional vectors in a
multi-dimensional space.
One vector represents one document.

Stefan




Re: Document Clustering

2003-11-11 Thread Stefan Groschupf
Hi,
How is document clustering different/related to text categorization?
Clustering: you try to find your own categories and put matching documents
into them. You group all documents with minimal distance together.

Classification: you already have categories and samples for them, which help
you to match other documents.
You calculate document distances to the existing categories and put each
document in the category with the smallest distance.
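The classification step described above (compute distances to the existing categories, pick the smallest) can be sketched as nearest-centroid assignment. A hypothetical `NearestCentroid` helper, assuming documents and category centroids are already represented as vectors:

```java
final class NearestCentroid {
    // Return the index of the centroid closest to the document vector,
    // using squared Euclidean distance (monotone in the true distance,
    // so the square root can be skipped).
    static int classify(double[] doc, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < centroids.length; i++) {
            double d = 0;
            for (int j = 0; j < doc.length; j++) {
                double diff = doc[j] - centroids[i][j];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }
}
```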

Cheers
Stefan
--
day time: www.media-style.com
spare time: www.text-mining.org | www.weta-group.net




Re: Document Clustering

2003-11-11 Thread Stefan Groschupf
Hi Marc,

I'm working on it, classification and clustering as well.
I was planning to do it for nutch.org, but some guys there broke up some 
important basic work I had already done, so maybe I will not contribute 
it there.
However, it will be open source, and I can notify you if something useful 
is ready.
But it will take some weeks. I am currently working on radical minimization 
of feature selection.

Cheers
Stefan


marc wrote:

Hi,

does anyone have any sample code/documentation available for doing document based clustering using lucene?

Thanks,
Marc
 

--
day time: www.media-style.com
spare time: www.text-mining.org | www.weta-group.net




Re: Index entire filesystem

2003-11-05 Thread Stefan Groschupf

Wouldn't mind joining in a joint approach, only problem is timing - it would
probably be late December before we could start putting the hours in.
 

We all do this just for fun, so no rush! However, more people means less work 
for everybody and faster results.
We only need a generic API, but I have done some work already. ;)
See "Introducing the Nutch plug-in architecture" on 
http://www.media-style.com/nutch/
"How to write a parser" is not up to date anymore.

Cheers
Stefan


Re: Index entire filesystem

2003-11-05 Thread Stefan Groschupf
There is some ongoing work for nutch.org.
Maybe we can bundle all the work together?!
Nutch already has Java *.doc and *.pdf parsers as well.
Stefan

Pete Lewis wrote:

Hi Stefan

Using OpenOffice will enable you to parse 182 file formats, but it's not a
pure Java solution, and you still need an alternate solution for PDFs.
I'd be interested in knowing whether anyone is working on a pure Java
solution that would give us a single method for handling MS Office
documents / PDFs / etc.
Cheers

Pete

- Original Message - 
From: "Stefan Groschupf" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, November 05, 2003 10:26 AM
Subject: Re: Index entire filesystem

 

I wrote to this list some days ago to announce a possibility to
parse 182 file formats.
There was a tiny bug report some days ago; I hope I can fix it.
Browse the archive to figure out more.

Cheers
Stefan
Marcel Stor wrote:

   

Hi all,

I'm thinkin' about writing a search tool for my filesystem. I know such
things exist already but programming it myself is much more fun ;-)
So, I would have Lucene crawl through my filesystem and pass each file
to an appropriate indexer (PDF -> PDFbox, etc.). Yes, I run a Windows
system and would depend on the file ending to distinguish the file type.
Is this a good idea in general? Is there a list of available indexers for
the different file types? Any other comments are also welcome.
Regards,
Marcel


 



   



 







Re: Index entire filesystem

2003-11-05 Thread Stefan Groschupf
I wrote to this list some days ago to announce a way to
parse 182 file formats.
There was a tiny bug report a few days ago; I hope I can fix it.

Browse the archive to find out more.

Cheers
Stefan
Marcel Stor wrote:

Hi all,

I'm thinkin' about writing a search tool for my filesystem. I know such
things exist already but programming it myself is much more fun ;-)
So, I would have Lucene crawl through my filesystem and pass each file
to an appropriate indexer (PDF -> PDFbox, etc.). Yes, I run a Windows
system and would depend on the file ending to distinguish the file type.
Is this a good idea in general? Is there a list of available indexers for
the different file types? Any other comments are also welcome.
Regards,
Marcel
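For what it's worth, the dispatch-by-extension idea can be sketched in plain Java. This is a minimal sketch: the handler names are placeholders for where you would call PDFBox, a text reader, and so on; they are not real library calls.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;

public class ExtensionCrawler {

    /** Maps a lowercased file extension to a symbolic handler name. */
    static final Map<String, String> HANDLERS = new HashMap<String, String>();
    static {
        HANDLERS.put("pdf", "pdf-indexer");    // e.g. hand the file to PDFBox here
        HANDLERS.put("txt", "text-indexer");
        HANDLERS.put("doc", "office-indexer");
    }

    /** Returns the handler name for a file, or null if the type is unknown. */
    public static String handlerFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) return null;
        return HANDLERS.get(fileName.substring(dot + 1).toLowerCase());
    }

    /** Recursively walks a directory, reporting which handler would index each file. */
    public static void crawl(File dir) {
        File[] entries = dir.listFiles();
        if (entries == null) return;
        for (File f : entries) {
            if (f.isDirectory()) {
                crawl(f);
            } else {
                String handler = handlerFor(f.getName());
                System.out.println(f.getPath() + " -> "
                        + (handler == null ? "skipped" : handler));
            }
        }
    }

    public static void main(String[] args) {
        crawl(new File(args.length > 0 ? args[0] : "."));
    }
}
```

In a real indexer, each handler would extract text and add a Document to the Lucene index instead of printing.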
 







zipf law?

2003-11-02 Thread Stefan Groschupf
Hi,

sorry, a very stupid question: does Lucene apply Zipf's law during indexing?

Thanks
Stefan
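For context: Zipf's law is the empirical observation that, when terms are sorted by frequency, the term at rank r occurs roughly in proportion to 1/r. It is a property of natural-language text rather than an indexing step. A self-contained way to eyeball it on your own text (this is not Lucene code):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ZipfCheck {

    /** Counts term frequencies and returns them sorted in descending order. */
    public static List<Integer> rankedFrequencies(String text) {
        Map<String, Integer> freq = new HashMap<String, Integer>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            Integer n = freq.get(token);
            freq.put(token, n == null ? 1 : n + 1);
        }
        List<Integer> counts = new ArrayList<Integer>(freq.values());
        Collections.sort(counts, Collections.reverseOrder());
        return counts;
    }

    public static void main(String[] args) {
        String text = "the cat sat on the mat and the dog sat on the log";
        List<Integer> counts = rankedFrequencies(text);
        // Under Zipf's law, rank * frequency stays roughly constant.
        for (int rank = 1; rank <= counts.size(); rank++) {
            System.out.println(rank + " * " + counts.get(rank - 1)
                    + " = " + rank * counts.get(rank - 1));
        }
    }
}
```

On a real corpus the rank * frequency products flatten out after the first few ranks; toy texts like the one above only hint at the shape.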




182 file formats for lucene!!! was: Re: Exotic format indexing?

2003-10-30 Thread Stefan Groschupf
Hi there,

just to let you know, I have implemented a plugin for the nutch project
that can parse 182 file formats, including M$ Office.
I simply use OpenOffice through its available Java API.

It is really straightforward to use.

You'll find some info and a link to the open source code here:
http://sourceforge.net/tracker/index.php?func=detail&aid=828517&group_id=59548&atid=491356
Feel free to recycle the code and give me any feedback.
I hope it will help to free some information from some strange commercial
formats, since information should be free. ;)

Cheers
Stefan






Ben Litchfield wrote:

Unfortunately, it is not quite so easy.  I am not sure about Word
documents, but PDFs usually have their contents compressed, so raw
"fishing" around for text would be pointless.  Your best bet is to use a
package like the one from textmining.org that handles various formats for
you.
Ben

On Thu, 30 Oct 2003, petite_abeille wrote:

 

Hello,

Indexing a multitude of esoteric formats (MS Office, PDF, etc) is a
popular question on this list...
The traditional approach seems to be to find some kind of format-specific
reader to properly extract the textual part of such documents for
indexing. The drawback of such an approach is that it's complicated and
cumbersome: many different formats, and not that many Java libraries that
understand them all.
An alternative to such a mess could perhaps be to convert that multitude
of formats into something more or less standard and then extract the text
from that. But again, this doesn't seem to be such a straightforward
proposition. For example, one could imagine "printing" every document to
PDF and then converting the resulting PDF to text. Not a piece of cake in
Java.
Finally, a while back, somebody on this list mentioned quite a
different approach: simply read the raw binary document and go fishing
for what looks like text. I would like to try that :)
Does anyone remember this proposal? Has anyone tried such an approach?

Thanks for any pointers.

Cheers,

PA.
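The "fishing" idea amounts to the classic strings(1) heuristic: keep runs of printable bytes above a minimum length. A minimal sketch follows; the 4-character minimum is an arbitrary choice, and as Ben notes it yields nothing useful on compressed PDF streams.

```java
import java.util.ArrayList;
import java.util.List;

public class TextFisher {

    /** Extracts runs of printable ASCII of at least minLen characters,
     *  roughly what the Unix strings(1) tool does. */
    public static List<String> fish(byte[] raw, int minLen) {
        List<String> found = new ArrayList<String>();
        StringBuilder run = new StringBuilder();
        for (byte b : raw) {
            char c = (char) (b & 0xFF);
            if (c >= 0x20 && c < 0x7F) {   // printable ASCII range
                run.append(c);
            } else {
                if (run.length() >= minLen) found.add(run.toString());
                run.setLength(0);
            }
        }
        if (run.length() >= minLen) found.add(run.toString());
        return found;
    }

    public static void main(String[] args) {
        byte[] raw = { 0x00, 'H', 'e', 'l', 'l', 'o', 0x01, 0x02, 'h', 'i', 0x00 };
        // Keeps "Hello"; the two-character run "hi" is below the minimum length.
        System.out.println(fish(raw, 4));
    }
}
```

In practice you would feed the result through an analyzer and accept that a lot of binary noise survives the filter.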

   

 





Re: Best practice

2003-10-28 Thread Stefan Groschupf
William W wrote:

Hi Folks,
Is there any Lucene best practice ?
www.nutch.org ;)







Re: Jmeter

2003-10-15 Thread Stefan Groschupf
Elsa Hernandez wrote:

Hi, I would like to know if someone has used JMeter to test the
performance of their web applications, or if someone could suggest a
tool/application that they have used. Thank you.


http://eclipsecolorer.sourceforge.net/index_profiler.html

is the best I have ever found.

HTH
Stefan




pre analysing

2003-09-21 Thread Stefan Groschupf
Hi there,

I wish to run a pre-analyzer that helps me choose the right analyzer for
my stream.
For instance, I wish to detect the language of my text and then choose a
language-dependent stop-word remover.
Since an analyzer works on a token stream and my language detection needs
the whole text to identify the language, how do I implement such a
pre-analyzer?

I wish to do something like a pipeline.

Thanks for any hints.

Stefan
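One way to build such a pipeline: buffer the whole text first, run language detection on the buffered string, and only then choose the stop-word set for the token stream. Below is a self-contained sketch in which a naive stopword-voting detector stands in for a real language-identification library; none of this is Lucene API, and the tiny stopword lists are placeholders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class LanguageAwarePipeline {

    /** Placeholder stop-word sets; real lists would be much longer. */
    static final Map<String, Set<String>> STOPWORDS = new HashMap<String, Set<String>>();
    static {
        STOPWORDS.put("en", new HashSet<String>(Arrays.asList("the", "and", "of", "a")));
        STOPWORDS.put("de", new HashSet<String>(Arrays.asList("der", "und", "die", "das")));
    }

    /** Naive detector: the language whose stop words occur most often wins. */
    public static String detectLanguage(String text) {
        String best = "en";
        int bestHits = -1;
        for (Map.Entry<String, Set<String>> e : STOPWORDS.entrySet()) {
            int hits = 0;
            for (String token : text.toLowerCase().split("\\W+")) {
                if (e.getValue().contains(token)) hits++;
            }
            if (hits > bestHits) { bestHits = hits; best = e.getKey(); }
        }
        return best;
    }

    /** The pipeline: buffer, detect, then strip that language's stop words. */
    public static List<String> analyze(String text) {
        Set<String> stop = STOPWORDS.get(detectLanguage(text));
        List<String> out = new ArrayList<String>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty() && !stop.contains(token)) out.add(token);
        }
        return out;
    }
}
```

The key design point is the buffering step: once the full text is in memory, the "pre-analyzer" is just an ordinary function call before tokenization starts, and its result selects the downstream filter chain.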


