Field information from index
Hello, I have two questions which might be easy to answer for a Lucene expert: 1) I need to know which fields a collection of documents has (given the fact that not all documents necessarily use all fields). These documents are all stored in one index. Is there a way (with Lucene 1.2 or 1.3) to find this out without going through each document and retrieving it? 2) I need to know which analyzer was used to index a field. One important rule, as we all know, is to use the same analyzer for indexing and searching a field. Is this information stored in the index, or is it fully the responsibility of the application developer? Karl -- DSL Komplett von GMX +++ Supergünstig und stressfrei einsteigen! AKTION "Kein Einrichtungspreis" nutzen: http://www.gmx.net/de/go/dsl - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
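A hedged sketch for question 1, assuming the Lucene 1.x API: the index does not record which analyzer was used (that stays entirely the application's responsibility), but the field names can be collected without retrieving each document by walking the term dictionary with IndexReader.terms(). Note that fields which are stored but never indexed will not show up this way. The class name ListFields is mine for illustration.

```java
import java.util.Hashtable;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;

// Sketch against the Lucene 1.2/1.3 API: enumerate every term in the
// index and record the field it belongs to. This touches only the term
// dictionary, not the stored documents.
public class ListFields {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(args[0]);
        Hashtable fields = new Hashtable();   // used as a set (Java 1.1 has no HashSet)
        TermEnum terms = reader.terms();
        while (terms.next()) {
            fields.put(terms.term().field(), Boolean.TRUE);
        }
        terms.close();
        reader.close();
        System.out.println(fields.keys().hasMoreElements()
            ? fields.toString() : "no indexed fields");
    }
}
```

Later Lucene versions added IndexReader.getFieldNames() for exactly this, but I am not certain it is present in 1.2/1.3, hence the term-walk above.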
Re: Lucene on PersonalJava ?? HELP!
Hello, thank you for the tip. I have solved the problem in a different way. If anybody else wants to run Lucene on PJava, he/she might go for the same: I am using the cvm VM instead of the jeode VM. Then it works fine with Lucene 1.2 without any change in my code or in the Lucene code. Perhaps even with a newer version (but I haven't tested that yet). :-) Thank you anyway, Karl > On Tue, 2005-02-15 at 14:05 +0100, Karl Koch wrote: > > did anybody here run Lucene 1.3 or 1.2 under PersonalJava (equivalent to JDK 1.1)? [...] > > I get an error saying that the method java.io.File.createNewFile() is used in Lucene. I have checked Java 1.1.8 and indeed this method does not exist. [...] > > It might be the constructor of the IndexReader or IndexSearcher that you're using. You can pass in a string that points to the directory, or a file object instead. Lucene might be using java.io.File.createNewFile() if you pass in a string. > > A simple grep should find out where it's being used. > > -- > Miles Barr <[EMAIL PROTECTED]> > Runtime Collective Ltd.
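Miles's constructor suggestion can be sketched like this, assuming the Lucene 1.2/1.3 API: open the index through a java.io.File rather than a String path. Whether this really bypasses the code path that calls java.io.File.createNewFile() depends on the Lucene version; a grep through the sources, as he says, will confirm it.

```java
import java.io.File;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

// Sketch: pass a java.io.File to FSDirectory.getDirectory() instead of
// handing IndexSearcher a String path. The second argument (false)
// means "do not create the directory", so no createNewFile() should be
// needed for directory setup -- verify against your Lucene version.
public class OpenWithFile {
    public static void main(String[] args) throws Exception {
        File indexDir = new File(args[0]);
        FSDirectory dir = FSDirectory.getDirectory(indexDir, false);
        IndexSearcher searcher = new IndexSearcher(dir);
        System.out.println("index opened");
        searcher.close();
    }
}
```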
Lucene on PersonalJava ?? HELP!
Hi, did anybody here run Lucene 1.3 or 1.2 under PersonalJava (equivalent to JDK 1.1)? I have a friend who runs Lucene 1.3 under PersonalJava and it works. Mine doesn't. When comparing the code I cannot find any difference. I search the index for a Query. I get an error saying that the method java.io.File.createNewFile() is used in Lucene. I have checked Java 1.1.8 and indeed this method does not exist. Beside the question of how it can work on my friend's system with the same code, I am asking two more questions: 1) Did anybody here use Lucene on a PDA under PersonalJava and can share some experience? 2) Is there anything else I should try or something I have forgotten? Thanks for your help, Karl
Re: HELP! JIT error when searching... Lucene 1.3 on Java 1.1
I have a colleague who uses Lucene 1.3 on PersonalJava (equivalent to Java 1.1.8). I can't find a significant difference to his code (still searching), but he did not make any changes. He also did not recompile Lucene 1.3 on 1.1.8, etc. It must be something simple. I will look for that switch... In the meantime, I am thankful for any other help. Cheers, Karl > On Tuesday 08 February 2005 18:49, sergiu gordea wrote: > > Karl Koch wrote: > ... > > >>A nonfatal internal JIT (3.10.107(x)) error 'chgTarg: Conditional' has > > >>occurred in : > > >> 'org/apache/lucene/store/FSDirectory.getDirectory > > >>(Ljava/io/File;Z)Lorg/apache/lucene/store/FSDirectory;': Interpreting > > >>method. > > >> Please report this error in detail to > > >>http://java.sun.com/cgi-bin/bugreport.cgi > > IIRC, Java 1.1 had a switch to turn off JIT compilation. It did slow things down when I was using 1.1 (1.1.8?), but it might help you now... > > Regards, > Paul Elschot
HELP! JIT error when searching... Lucene 1.3 on Java 1.1
When I switch to Java 1.2, I cannot run it either, and I also cannot index anything. I have no idea why... Can somebody help me? Karl > Hello all, > > I have heard that Lucene 1.3 Final should run under Java 1.1. (I need that because I want to run a search with a PDA using Java 1.1.) > > However, when I run my code, I get the following error: > > A nonfatal internal JIT (3.10.107(x)) error 'chgTarg: Conditional' has occurred in : 'org/apache/lucene/store/FSDirectory.getDirectory (Ljava/io/File;Z)Lorg/apache/lucene/store/FSDirectory;': Interpreting method. [...] > > The error does not occur when I run it under Java 1.4. [...] > > Best Regards, > Karl
JIT error when searching... Lucene 1.3 on Java 1.1
Hello all, I have heard that Lucene 1.3 Final should run under Java 1.1. (I need that because I want to run a search with a PDA using Java 1.1.) However, when I run my code, I get the following error: -- A nonfatal internal JIT (3.10.107(x)) error 'chgTarg: Conditional' has occurred in : 'org/apache/lucene/store/FSDirectory.getDirectory (Ljava/io/File;Z)Lorg/apache/lucene/store/FSDirectory;': Interpreting method. Please report this error in detail to http://java.sun.com/cgi-bin/bugreport.cgi Exception occured in StandardSearch:search(String, String[], String)! java.lang.IllegalMonitorStateException: current thread not owner at org.apache.lucene.store.FSDirectory.makeLock(FSDirectory.java:312) at org.apache.lucene.index.IndexReader.open(IndexReader.java, Compiled Code) -- The error does not occur when I run it under Java 1.4. What am I doing wrong, and what do I need to change in order to make it work? It must be my code. Here is the code relevant to this error (the search method): public static Result search(String queryString, String[] searchFields, String indexDirectory) { // create access to index StandardAnalyzer analyser = new StandardAnalyzer(); Hits hits = null; Result result = null; try { fsDirectory = FSDirectory.getDirectory(StandardSearcher.indexDirectory, false); IndexSearcher searcher = new IndexSearcher(fsDirectory); ... } What is wrong here? Best Regards, Karl
Retrieve all documents - possible?
Hi, is it possible to retrieve ALL documents from a Lucene index? This would then actually not be a search... Karl
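It is indeed not a search: a sketch assuming the Lucene 1.x API, which exposes documents by their document number through IndexReader. The isDeleted() check skips documents that have been deleted but not yet merged away; the class name DumpAll is mine for illustration.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;

// Sketch: iterate over every document number in the index instead of
// running a query. maxDoc() is one greater than the largest document
// number, so the loop covers all slots including deleted ones.
public class DumpAll {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(args[0]);
        for (int i = 0; i < reader.maxDoc(); i++) {
            if (reader.isDeleted(i)) continue;   // skip deleted slots
            Document doc = reader.document(i);
            System.out.println(doc);
        }
        reader.close();
    }
}
```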
Re: Some questions about index...
Thank you for the fast, straight and useful comments. Keeping in mind what was said: did anybody actually think about implementing a kind of database layer on top of a Lucene index? A database would be an index, columns would be fields, and entries would be documents. At least everything which would only require a single table could be done. A SELECT would be a search... :-) Karl > On Feb 5, 2005, at 10:04 AM, Karl Koch wrote: > > 1) Can I store all the information of the text file, but also apply an analyser? E.g. I use the StopAnalyzer. After finding the document, I want to extract the original text also from the index. Does this require that I store the information twice in two different fields (one indexed and one unindexed)? > > You should use a single stored, tokenized, and indexed field for this purpose. Be cautious of how you construct the Field object to achieve this. > > 2) I would like to extract information from the index in a boolean way. For example, I have a field which contains an ID number and I want to use the search like a database operation (e.g. to find the document with id=1). I can solve the problem by searching with the query "id:1". However, this does not ensure that I will only get one result. Usually the first result is the document I want, but it could happen that this sometimes does not work. > > Why wouldn't it work? For ID-type fields, use a Field.Keyword (stored, indexed, but not tokenized). Search for a specific ID using a TermQuery (don't use QueryParser for this, please). If the ID values are unique, you'll either get zero or one result. > > What happens if I get no results? I guess if I search for id=5 and 5 did not exist, I would probably get 50, 51, ... just because they contain 5. Did somebody work with this and can suggest a stable solution? > > No, this would not be the case, unless you're analyzing the ID field with some strange character-by-character analyzer or doing a wildcard "*5*" type query. > > A good solution for these two questions would help me avoid a database which would need to replicate most of the data which I already have in my Lucene index... > > You're on the right track, and avoiding a database when it is overkill or duplicative is commendable :) > > Erik
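A sketch of the approach Erik describes, assuming the Lucene 1.x API: the ID goes in as a Field.Keyword (stored, indexed, not tokenized) and comes back out via a TermQuery, bypassing QueryParser entirely. The class and method names (IdLookup, findById) are mine for illustration.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

// Sketch: with an untokenized keyword field and unique IDs, a TermQuery
// matches the whole stored value exactly, so id "5" can never match
// documents whose id merely contains a 5 (50, 51, ...).
public class IdLookup {
    public static void addDocument(IndexWriter writer, String id, String text)
            throws Exception {
        Document doc = new Document();
        doc.add(Field.Keyword("id", id));       // stored, indexed, NOT tokenized
        doc.add(Field.Text("contents", text));  // stored, indexed, tokenized
        writer.addDocument(doc);
    }

    public static Document findById(IndexSearcher searcher, String id)
            throws Exception {
        Hits hits = searcher.search(new TermQuery(new Term("id", id)));
        return hits.length() == 0 ? null : hits.doc(0);  // zero or one hit
    }
}
```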
Some questions about index...
Hi all, for simplicity reasons I would like to use the index as my data storage, whilst using the advantage of the highly optimised Lucene index structure. 1) Can I store all the information of the text file, but also apply an analyser? E.g. I use the StopAnalyzer. After finding the document, I want to extract the original text also from the index. Does this require that I store the information twice in two different fields (one indexed and one unindexed)? 2) I would like to extract information from the index in a boolean way. I know that Lucene is a VSM which provides Boolean operators. This however does not change its functioning. For example, I have a field which contains an ID number and I want to use the search like a database operation (e.g. to find the document with id=1). I can solve the problem by searching with the query "id:1". However, this does not ensure that I will only get one result. Usually the first result is the document I want, but it could happen that this sometimes does not work. What happens if I get no results? I guess if I search for id=5 and 5 did not exist, I would probably get 50, 51, ... just because they contain 5. Did somebody work with this and can suggest a stable solution? A good solution for these two questions would help me avoid a database which would need to replicate most of the data which I already have in my Lucene index... Kind Regards, Karl
Re: which HTML parser is better?
The link does not work. > > One which we've been using can be found at: > http://www.ltg.ed.ac.uk/~richard/ftp-area/html-parser/ > > We absolutely need to be able to recover gracefully from malformed HTML and/or SGML. Most of the nicer SAX/DOM/TLA parsers out there failed this criterion when we started our effort. The above one is kind of SAX-y but doesn't fall over at the sight of a real web page ;-) > > Ian
Re: which HTML parser is better? - Thread closed
Thank you, I will do that. > Karl Koch wrote: > > >I apologise in advance if some of my writing here has been said before. The last three answers to my question have been suggesting pattern matching solutions and Swing. Pattern matching was introduced in Java 1.4, and Swing is something I cannot use since I work with Java 1.1 on a PDA. > > I see. In this case you can read your HTML file line by line and then write something like this: > > String line; > int startPos, endPos; > StringBuffer text = new StringBuffer(); > while ((line = reader.readLine()) != null) { > startPos = line.indexOf(">"); > endPos = line.indexOf("<", startPos); > if (startPos >= 0 && endPos > startPos) > text.append(line.substring(startPos + 1, endPos)); > } > > This is just sample code that should work if you have exactly one tag pair per line in the HTML file. It can be a starting point for you. > > Hope it helps, > > Best, > > Sergiu > > >I am wondering if somebody knows a piece of simple source code with low requirements which runs under this tight specification. > > > >Thank you all, > >Karl
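Since the whole thread keeps circling the same constraints (Java 1.1, no regex, no Swing, a few kB of code), here is a minimal character-level tag stripper along the lines being discussed. It is a sketch, not a robust parser: it assumes well-formed markup such as Karl's XSLT output, and it ignores comments, scripts, and entities. The class name TagStripper is mine.

```java
// Minimal HTML tag stripper using only the Java 1.1 API (no regex, no
// Swing): copy characters through, suppressing everything between '<'
// and the matching '>'. Works on a whole string, so multi-line tags and
// several tags per line are fine -- unlike a line-by-line indexOf scan.
public class TagStripper {
    public static String strip(String html) {
        StringBuffer out = new StringBuffer();
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;            // start suppressing
            } else if (c == '>') {
                inTag = false;           // tag closed, resume copying
            } else if (!inTag) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```

On well-formed input this stays well under the 5 kB budget mentioned above; a bare `<` in text (legal in sloppy HTML) would derail it, which is the usual price of skipping a real parser.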
Re: which HTML parser is better?
I am using Java 1.1 with a Sharp Zaurus PDA. I have very tight memory constraints; I do not think CPU performance is a big issue though. But I have other parts in my application which use quite a lot of memory, and something runs short. I therefore do not look into solutions which build up tag trees etc. I am more looking for a solution which reads a stream of HTML and transforms it into a stream of text. I see your point of using an external program; I am, however, not entirely sure one is available. Also, it would be much simpler to have a 3-5 kB solution in Java, perhaps encapsulated in a class which does the job without the need for advanced libraries which need 100-200 kB of my internal storage. I hope I could clarify my situation now. Cheers, Karl > Karl Koch wrote: > > >Hello Sergiu, thank you for your help so far. I appreciate it. I am working with Java 1.1, which does not include regular expressions. > > Why are you using Java 1.1? Are you so limited in resources? What operating system do you use? I assume that you just need to index the html files, and you need an html2txt conversion. If an external converter is a solution for you, you can use Runtime.getRuntime().exec(...) to run the converter that will extract the information from your HTMLs and generate a .txt file. Then you can use a reader to index the txt. As I told you before, the best solution depends on your constraints (time, effort, hardware, performance) and requirements :) > > Best, > > Sergiu
Re: which HTML parser is better?
I apologise in advance if some of my writing here has been said before. The last three answers to my question have been suggesting pattern matching solutions and Swing. Pattern matching was introduced in Java 1.4, and Swing is something I cannot use since I work with Java 1.1 on a PDA. I am wondering if somebody knows a piece of simple source code with low requirements which runs under this tight specification. Thank you all, Karl > No one has yet mentioned using ParserDelegator and ParserCallback that are part of HTMLEditorKit in Swing. I have been successfully using these classes to parse out the text of an HTML file. You just need to extend HTMLEditorKit.ParserCallback and override the various methods that are called when different tags are encountered. > > On Feb 1, 2005, at 3:14 AM, Jingkang Zhang wrote: > > Three HTML parsers (Lucene web application demo, CyberNeko HTML Parser, JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the best? Can it filter tags that are auto-created by MS Word's 'Save As HTML' function? > -- > Bill Tschumy > Otherwise -- Austin, TX > http://www.otherwise.com
Re: which HTML parser is better?
Unfortunately I am faithful ;-). Just for practical reasons I want to do that in a single class, or even a method, called by another part of my Java application. It should also run on Java 1.1, and it should be small and simple. As I said before, I am in control of the HTML and it will be well formatted, because I generate it from XML using XSLT. Karl > If you are not married to Java: > http://search.cpan.org/~kilinrax/HTML-Strip-1.04/Strip.pm > > Otis > > --- sergiu gordea <[EMAIL PROTECTED]> wrote: > > Karl Koch wrote: > > >I am in control of the html, which means it is well formated HTML. I use only HTML files which I have transformed from XML. No external HTML (e.g. the web). Are there any very-short solutions for that? > > > > if you are using only correctly formatted HTML pages and you are in control of these pages, you can use a regular expression to remove the tags, something like replaceAll("<[^>]*>", ""). This is the idea behind the operation. If you search on Google you will find a more robust regular expression. Using a simple regular expression is a very cheap solution that can cause you a lot of problems in the future. It's up to you to use it. > > Best, > > Sergiu
Re: which HTML parser is better?
Hello Sergiu, thank you for your help so far. I appreciate it. I am working with Java 1.1, which does not include regular expressions. Your turn ;-) Karl > Karl Koch wrote: > > >I am in control of the html, which means it is well formated HTML. I use only HTML files which I have transformed from XML. No external HTML (e.g. the web). Are there any very-short solutions for that? > > > > if you are using only correctly formatted HTML pages and you are in control of these pages, you can use a regular expression to remove the tags, something like replaceAll("<[^>]*>", ""). This is the idea behind the operation. If you search on Google you will find a more robust regular expression. Using a simple regular expression is a very cheap solution that can cause you a lot of problems in the future. It's up to you to use it. > > Best, > > Sergiu
Re: which HTML parser is better?
I am in control of the html, which means it is well formatted HTML. I use only HTML files which I have transformed from XML. No external HTML (e.g. from the web). Are there any very short solutions for that? Karl > Karl Koch wrote: > > >Hi, yes, but the library you are using is quite big. I was thinking that 5 kB of code could actually do that. That sourceforge project is doing much more than that, but I do not need it. > > > > you need just the htmlparser.jar, 200k. ... you know ... the functionality is strongly correlated with the size. You can use 3 lines of code with a good regular expression to eliminate the html tags, but this won't give you any guarantee that the text from badly formatted html files will be correctly extracted... > > Best, > > Sergiu
Re: which HTML parser is better?
Hi, yes, but the library you are using is quite big. I was thinking that about 5 kB of code could actually do that. That sourceforge project does much more than that, and I do not need the rest. Karl > Hi Karl, > > I already submitted a piece of code that removes the HTML tags. > Search for my previous answer in this thread. > > Best, > > Sergiu > > Karl Koch wrote: > > >Hello, > > > >I have been following this thread and have another question. > > > >Is there a piece of source code (preferably very short and simple > >(KISS)) which allows one to remove all HTML tags from HTML content? HTML 3.2 > >would be enough... also no frames, CSS, etc. > > > >I do not need the HTML structure tree or any other structure, but > >need a facility to clean up HTML into its normal underlying content before > >indexing that content as a whole. > > > >Karl > > > >>I think that depends on what you want to do. The Lucene demo parser does > >>simple mapping of HTML files into Lucene Documents; it does not give you a > >>parse tree for the HTML doc. CyberNeko is an extension of Xerces (uses the > >>same API; will likely become part of Xerces), and so maps an HTML document > >>into a full DOM that you can manipulate easily for a wide range of > >>purposes. I haven't used JTidy at an API level and so don't know it as well -- > >>based on its UI, it appears to be focused primarily on HTML validation and > >>error detection/correction. > >> > >>I use CyberNeko for a range of operations on HTML documents that go beyond > >>indexing them in Lucene, and really like it. It has been robust for me so > >>far. > >> > >>Chuck > >> > >> > -----Original Message----- > >> > From: Jingkang Zhang [mailto:[EMAIL PROTECTED] > >> > Sent: Tuesday, February 01, 2005 1:15 AM > >> > To: lucene-user@jakarta.apache.org > >> > Subject: which HTML parser is better?
> >> > Three HTML parsers (Lucene web application > >> > demo, CyberNeko HTML Parser, JTidy) are mentioned in > >> > Lucene FAQ > >> > 1.3.27. Which is the best? Can it filter tags that are > >> > auto-created by MS-Word's 'Save As HTML' function?
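A stripper of the size Karl has in mind fits in a few lines of plain Java. This is a hypothetical minimal sketch, not code from the thread: it simply drops everything between '<' and '>', does not decode entities, and does not special-case <script> or <style> content.

```java
/** Minimal HTML tag stripper: removes everything between '<' and '>'.
 *  Good enough for simple HTML 3.2; no entity decoding, no awareness
 *  of <script>/<style> blocks. */
public class TagStripper {
    public static String strip(String html) {
        StringBuffer out = new StringBuffer(html.length());
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>') {
                inTag = false;
                out.append(' ');    // keep a word boundary where the tag was
            } else if (!inTag) {
                out.append(c);
            }
        }
        return out.toString().trim();
    }
}
```

The resulting plain text can then be fed to an Analyzer and indexed as a whole, as described in the question.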
RE: which HTML parser is better?
Hello, I have been following this thread and have another question. Is there a piece of source code (preferably very short and simple (KISS)) which allows one to remove all HTML tags from HTML content? HTML 3.2 would be enough... also no frames, CSS, etc. I do not need the HTML structure tree or any other structure, but need a facility to clean up HTML into its normal underlying content before indexing that content as a whole. Karl > I think that depends on what you want to do. The Lucene demo parser does > simple mapping of HTML files into Lucene Documents; it does not give you a > parse tree for the HTML doc. CyberNeko is an extension of Xerces (uses the > same API; will likely become part of Xerces), and so maps an HTML document > into a full DOM that you can manipulate easily for a wide range of > purposes. I haven't used JTidy at an API level and so don't know it as well -- > based on its UI, it appears to be focused primarily on HTML validation and > error detection/correction. > > I use CyberNeko for a range of operations on HTML documents that go beyond > indexing them in Lucene, and really like it. It has been robust for me so > far. > > Chuck > > > -----Original Message----- > > From: Jingkang Zhang [mailto:[EMAIL PROTECTED] > > Sent: Tuesday, February 01, 2005 1:15 AM > > To: lucene-user@jakarta.apache.org > > Subject: which HTML parser is better? > > > > Three HTML parsers (Lucene web application demo, CyberNeko HTML Parser, JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the best? Can it filter tags that are auto-created by MS-Word's 'Save As HTML' function?
XML index
Hi, I want to use kXML with Lucene to index XML files. I think it is possible to dynamically assign node names as Document fields and node text as field text (after running it through an Analyzer). I have seen some XML indexing in the Sandbox. Is there anybody here who has done something with a thin pull parser (perhaps even kXML)? Does anybody know of a project or some source code which covers this topic? Karl
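A sketch of the idea described above, assuming the XmlPull API (which kXML 2 implements) and the Lucene 1.3 Document/Field classes; the class name XmlDocBuilder is invented for illustration:

```java
import java.io.Reader;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.xmlpull.v1.XmlPullParser;

/** Sketch: map each XML element name to a Lucene field name and the
 *  element's text content to the field value. The Analyzer is applied
 *  later, by the IndexWriter, when the Document is added. */
public class XmlDocBuilder {
    public static Document build(XmlPullParser parser, Reader in) throws Exception {
        parser.setInput(in);
        Document doc = new Document();
        String currentField = null;
        for (int ev = parser.getEventType(); ev != XmlPullParser.END_DOCUMENT; ev = parser.next()) {
            if (ev == XmlPullParser.START_TAG) {
                currentField = parser.getName();          // element name becomes the field name
            } else if (ev == XmlPullParser.TEXT && currentField != null) {
                String text = parser.getText().trim();
                if (text.length() > 0) {
                    doc.add(Field.Text(currentField, text)); // tokenized, indexed, stored
                }
            }
        }
        return doc;
    }
}
```

Nested elements would need a policy of their own (e.g. concatenating into the parent field); this sketch just takes the innermost enclosing element name.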
Different Documents (with fields) in one index?
Hello all, perhaps not such a sophisticated question: I would like to have a very diverse set of documents in one index. Depending on the content of the text documents, I would like to put parts of the text into different fields. This means that when searching a particular field, some of those documents will not be matched at all. Is it possible to have different kinds of Documents with different index fields in ONE index, or do I need one index for each set? Karl
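Lucene imposes no schema on an index, so documents with entirely different field sets can live side by side; a query against a field simply never matches documents that lack it. A sketch against the Lucene 1.3 API (the path and field names are invented):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

/** Sketch: two kinds of Documents with disjoint fields in ONE index. */
public class MixedIndex {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/tmp/mixed-index", new StandardAnalyzer(), true);

        Document letter = new Document();                  // first document kind
        letter.add(Field.Text("recipient", "Karl Koch"));
        letter.add(Field.Text("body", "Dear Karl, thank you for your mail."));
        writer.addDocument(letter);

        Document report = new Document();                  // second kind, different fields
        report.add(Field.Text("title", "Annual report"));
        report.add(Field.Text("summary", "Sales went up."));
        writer.addDocument(report);

        writer.close();
        // A search on the "recipient" field can never match the report,
        // because the report has no terms in that field.
    }
}
```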
Lucene on JSE 1.1.8...
Hello, does somebody here know which Lucene version runs on Java 1.1? Of course I would like to run the best (latest) version possible :-) Karl
VSpace Model Index <-> Prob. Model Index - Difference?
Hello group, coming back to the discussion about the probabilistic and the vector space model (which occurred here some time ago), I would like to ask something related. I only know the index structure Lucene offers. Does the index of an IR system based on the probabilistic model (e.g. Okapi) look different from that of a vector space system? If yes, why? I hope this question is not too stupid; I am mainly interested because of some theoretical background... Karl > Uh, there are lots of ways to construct an inverted index. > Citeseer will give you more than you can read on this topic. > > As for Lucene, see the File Formats section on the site. > > Otis > > --- Karl Koch <[EMAIL PROTECTED]> wrote: > > If I create a standard index, what does Lucene store in this index? > > > > What should be stored in an index at least? Just a link to the file > > and keywords? Or also word positions? What else? > > > > Does somebody know a paper which discusses this problem of "what to > > put in a good universal IR index"? > > > > Cheers, > > Karl
Lucene index - information
If I create a standard index, what does Lucene store in this index? What should be stored in an index at least? Just a link to the file and keywords? Or also word positions? What else? Does somebody know a paper which discusses this problem of "what to put in a good universal IR index"? Cheers, Karl
Search one keyword in two fields - How?
Hi, what do I need to search for one single keyword in two fields of one index? Can someone provide an easy example or tell me where to find one? Cheers, Karl
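One common way, sketched against the Lucene 1.3 BooleanQuery API (the field names "title" and "body" are invented): add one optional TermQuery per field, so a document matches if the keyword occurs in either field.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

/** Sketch: match one keyword in either of two fields.
 *  add(query, required, prohibited) with (false, false) makes
 *  each clause optional, i.e. a logical OR. */
public class TwoFieldQuery {
    public static BooleanQuery build(String keyword) {
        BooleanQuery query = new BooleanQuery();
        query.add(new TermQuery(new Term("title", keyword)), false, false);
        query.add(new TermQuery(new Term("body",  keyword)), false, false);
        return query;
    }
}
```

The query can then be passed to IndexSearcher.search() as usual; documents matching in both fields score higher than documents matching in only one.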
Re: setMaxClauseCount ??
Hello Doug, that sounds interesting to me. I refer to a paper by NIST about relevance feedback which ran tests with 20 to 200 words. This is why I thought it might be good to be able to use all non-stopwords of a document and see what happens. Do you know good papers about strategies for selecting keywords effectively, beyond stopword lists and stemming? Using the term frequencies of the document is not really possible, since Lucene does not provide access to a document vector, does it? By the way, could you send me Dmitry's code for the vector extension? I have asked in another thread but have not received it so far. I really would like to have a look... It would also be nice to know the status of integrating it into Lucene 1.3. Who is working on it, and how could I contribute? Cheers, Karl > Andrzej Bialecki wrote: > > Karl Koch wrote: > >> I actually wanted to add a large amount of text from an existing > >> document to find a closely related one. Can you suggest another good way of doing > >> this? > > > > You should try to reduce the dimensionality by reducing the number of > > unique features. In this case, you could for example use only keywords > > (or key phrases) instead of the full content of documents. > > Indeed, this is a good approach. In my experience, six or eight terms > are usually enough, and they needn't all be required. > > Doug
Re: setMaxClauseCount ??
Hi Doug, thank you for the answer so far. I actually want to add a large amount of text from an existing document in order to find a closely related one. Can you suggest another good way of doing this? A direct match will not occur anyway. How can I make a query that is as close as possible to the Vector Space Model (VSM) (each word a dimension value - find documents close to that)? You know as well as I do that the standard VSM does not have any Boolean logic inside... how do I need to formulate the query to make it as similar as possible to a vector, in order to find similar documents in the vector space of the Lucene index? Cheers, Karl > setMaxClauseCount determines the maximum number of clauses, which is not > your problem here. Your problem is with required clauses. There may > only be a total of 31 required (or prohibited) clauses in a single > BooleanQuery. If you need more, then create more BooleanQueries and > combine them with another BooleanQuery. Perhaps this could be done > automatically, but I've never heard anyone encounter this limit before. > Do you really mean for 32 different terms to be required? Do any > documents actually match this query? > > Doug > > Karl Koch wrote: > > Hi group, > > > > I ran into an IndexOutOfBoundsException: > > > > -> java.lang.IndexOutOfBoundsException: More than 32 required/prohibited > > clauses in query. > > > > The reason: I have more than 32 BooleanClauses. From the mailing list I got > > the info on how to raise the maximum number of clauses before a loop: > > > > ... > > myBooleanQuery.setMaxClauseCount(Integer.MAX_VALUE); > > while (true) { > >   Token token = tokenStream.next(); > >   if (token == null) break; > >   myBooleanQuery.add(new TermQuery(new Term("bla", token.termText())), true, false); > > } ... > > > > However the error still remains, why?
> > > > Karl > > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > -- +++ GMX - die erste Adresse für Mail, Message, More +++ Bis 31.1.: TopMail + Digicam für nur 29 EUR http://www.gmx.net/topmail - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
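One way to approximate the vector-style query described above, sketched against the Lucene 1.3 API: analyze the source document's text and add every token as an optional clause. With no clause required, the 31-required-clause limit does not apply, and Lucene's tf-idf plus coordination scoring rewards documents that share many terms with the source - a rough stand-in for cosine similarity in the vector space.

```java
import java.io.StringReader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

/** Sketch: turn a document's text into one big OR query over the
 *  "contents" field (field name is an assumption). Every clause is
 *  optional: add(query, false, false). */
public class SimilarityQuery {
    public static BooleanQuery build(String text) throws Exception {
        TokenStream tokens =
            new StandardAnalyzer().tokenStream("contents", new StringReader(text));
        BooleanQuery query = new BooleanQuery();
        for (Token t = tokens.next(); t != null; t = tokens.next()) {
            query.add(new TermQuery(new Term("contents", t.termText())), false, false);
        }
        return query;
    }
}
```

Note that BooleanQuery still enforces a separate overall clause maximum (adjustable via setMaxClauseCount), so for very long documents it helps to select the most informative terms first, as suggested in the quoted replies.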
Re: How to limit terms in the index
Hello, I think you have to write your own Analyzer that filters out all other words before handing the text to the indexer. Karl > Hi, > I'm using Lucene to index documents and I want to limit > the terms indexed to a list of terms provided by an ontology. > Can someone help me find out how to do that? > > Thanks, > GD
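A sketch of such an Analyzer against the Lucene 1.3 TokenFilter API; the class name and the plain java.util.Set used as the ontology vocabulary are illustrative assumptions:

```java
import java.io.IOException;
import java.io.Reader;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

/** Sketch: an Analyzer that keeps ONLY tokens found in a supplied
 *  vocabulary (e.g. the terms of an ontology) - the inverse of a
 *  stop-word filter. */
public class VocabularyAnalyzer extends Analyzer {
    private final Set vocabulary;   // lower-cased terms to keep

    public VocabularyAnalyzer(Set vocabulary) {
        this.vocabulary = vocabulary;
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream base = new StandardAnalyzer().tokenStream(fieldName, reader);
        return new TokenFilter(base) {
            public Token next() throws IOException {
                // skip tokens until one is in the vocabulary
                for (Token t = input.next(); t != null; t = input.next()) {
                    if (vocabulary.contains(t.termText().toLowerCase())) return t;
                }
                return null;
            }
        };
    }
}
```

Passing an instance of this class to the IndexWriter means only ontology terms ever reach the index; multi-word ontology phrases would need extra handling, since the base tokenizer emits single words.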
Re: difference in javadoc and faq similarity expression
I would rely on the JavaDoc, since that is up to date. The latest version, 1.3 final, is just a few weeks old; some entries in the FAQ, however, are still from 2001... Cheers, Karl > hy, > I have trouble finding the correspondence between the javadoc and the FAQ > similarity expression. > > In the Similarity javadoc: > > score(q,d) = Sum [ tf(t in d) * idf(t) * getBoost(t.field in d) * > lengthNorm(t.field in d) * coord(q,d) * queryNorm(q) ] > > In the FAQ: > > score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) * > coord_q_d > > In FAQ | In Javadoc > 1 / norm_q = queryNorm(q) > 1 / norm_d_t = lengthNorm(t.field in d) > coord_q_d = coord(q,d) > boost_t = getBoost(t.field in d) > idf_t = idf(t) > tf_d = tf(t in d) > > but where is the javadoc expression for the "tf_q" FAQ factor? > > nicolas
setMaxClauseCount ??
Hi group, I ran into an IndexOutOfBoundsException: -> java.lang.IndexOutOfBoundsException: More than 32 required/prohibited clauses in query. The reason: I have more than 32 BooleanClauses. From the mailing list I got the info on how to raise the maximum number of clauses before a loop:

myBooleanQuery.setMaxClauseCount(Integer.MAX_VALUE);
while (true) {
    Token token = tokenStream.next();
    if (token == null) {
        break;
    }
    myBooleanQuery.add(new TermQuery(new Term("bla", token.termText())), true, false);
}

However, the error still remains. Why? Karl
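The error remains because (as Doug's reply elsewhere in this thread explains) the limit of 31 required/prohibited clauses per BooleanQuery is separate from the overall clause maximum, and setMaxClauseCount() does not lift it. A hedged sketch of the suggested workaround - grouping required terms into nested sub-queries and requiring the sub-queries instead:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

/** Sketch: require more than 31 terms by nesting BooleanQueries.
 *  Each sub-query holds at most 31 required clauses; the top-level
 *  query then requires each sub-query. */
public class NestedRequiredQuery {
    public static BooleanQuery build(String field, String[] terms) {
        BooleanQuery top = new BooleanQuery();
        BooleanQuery group = new BooleanQuery();
        int clausesInGroup = 0;
        for (int i = 0; i < terms.length; i++) {
            group.add(new TermQuery(new Term(field, terms[i])), true, false);
            if (++clausesInGroup == 31) {       // stay under the limit
                top.add(group, true, false);
                group = new BooleanQuery();
                clausesInGroup = 0;
            }
        }
        if (clausesInGroup > 0) top.add(group, true, false);
        return top;
    }
}
```

Whether requiring dozens of terms is useful is a separate question - as noted in the replies, few documents will match such a query, and optional clauses are usually the better fit for similarity-style searches.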
Re: theorical informations
Actually, finding an answer to this question is not really important. More important is whether you can do what you want with it. Whether your result comes from a probabilistic model or a vector space model - who cares, if you just want to give a query and get back a hit list of results? Possibly some people here will strongly disagree... ;-) (?) Karl > Hello Nicolas, > > I am sure you mean IR (Information Retrieval) model. Lucene implements a > vector space model with an integrated Boolean model. This means the Boolean > model is integrated via a Boolean query language but mapped into the vector > space. Therefore you have ranking, even though the traditional Boolean model > does not support this. Cosine similarity is used to measure similarity between > documents and the query. You can find a very long discussion of this when you > search the archive... > > Karl > > > hy, > > I have 2 theoretical questions: > > > > I searched the mailing list for the IR model implemented in Lucene, > > but found no precise answer. > > > > 1) What is the IR model implemented in Lucene? (e.g. Boolean model, > > vector model, probabilistic model, etc.) > > > > 2) What is the theoretical similarity function implemented in Lucene? > > (Euclidean, cosine, Jaccard, Dice) > > > > (Why is this important information not on the Lucene web site or in the > > FAQ?)
Re: theorical informations
Hello Nicolas, I am sure you mean IR (Information Retrieval) model. Lucene implements a vector space model with an integrated Boolean model. This means the Boolean model is integrated via a Boolean query language but mapped into the vector space. Therefore you have ranking, even though the traditional Boolean model does not support this. Cosine similarity is used to measure similarity between documents and the query. You can find a very long discussion of this when you search the archive... Karl > hy, > I have 2 theoretical questions: > > I searched the mailing list for the IR model implemented in Lucene, > but found no precise answer. > > 1) What is the IR model implemented in Lucene? (e.g. Boolean model, > vector model, probabilistic model, etc.) > > 2) What is the theoretical similarity function implemented in Lucene? > (Euclidean, cosine, Jaccard, Dice) > > (Why is this important information not on the Lucene web site or in the FAQ?)
Re: Extracting particular document from index
Hi all, I have only the file's location and name. There is also an index made from such files, which contains the file I am looking for. I want to know two things: 1) first of all, all fields which can be searched within this index; 2) secondly, the content as it is represented in each field. As you can see, I am asking for the way back: from the file itself I cannot infer how it was indexed. The indexer has parsed and divided it somehow - something I do not know; this information should be in the index. I am basically looking for functionality like this:

Index index = new Index("X:/myindex");
Field[] fields = index.getAllFields();
String[] fieldNames = new String[fields.length];
for (int i = 0; i < fields.length; i++) {
    fieldNames[i] = fields[i].stringValue();
}

Using the field names I then know, I could perform a search and take the first entry of the Hits list, which would be my Document. :-) Does something like that exist in Lucene? Cheers mates, Karl > On Jan 18, 2004, at 11:15 AM, Karl Koch wrote: > > lets say I have an index with documents encoded in two fields > > "filename" and "data". Is it possible to extract a file whose filename I know > > directly from this index without performing any search, like random > > access in a filesystem? > > It is still technically a "search", but a TermQuery will be basically > direct access to the document(s) matching that term. > > Erik
Getting all index fields of an index
How can I get a list of all fields in an index from which I know only the directory string? Karl -- +++ GMX - die erste Adresse für Mail, Message, More +++ Bis 31.1.: TopMail + Digicam für nur 29 EUR http://www.gmx.net/topmail - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
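If your Lucene version provides IndexReader.getFieldNames() (check the javadoc of your release; older versions may not have it), the lookup is direct. A sketch, with the index path taken from the question above:

```java
import java.util.Collection;
import java.util.Iterator;

import org.apache.lucene.index.IndexReader;

/** Sketch: list every field name that occurs anywhere in an index,
 *  assuming IndexReader.getFieldNames() is available. */
public class ListFields {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("X:/myindex");
        Collection fieldNames = reader.getFieldNames();
        for (Iterator it = fieldNames.iterator(); it.hasNext();) {
            System.out.println((String) it.next());
        }
        reader.close();
    }
}
```

Note this returns field names for the whole index, not per document; individual documents may use only a subset of these fields.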
Extracting particular document from index
Hi all, let's say I have an index with documents encoded in two fields, "filename" and "data". Is it possible to extract a file whose filename I know directly from this index, without performing any search - like random access in a filesystem? Karl
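As Erik's reply in this thread notes, a TermQuery on the filename field is effectively direct access. A sketch against the Lucene 1.3 API, assuming "filename" was indexed untokenized (e.g. as a Field.Keyword) so the indexed term equals the stored value:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

/** Sketch: fetch a document by an exact, untokenized key field -
 *  a primary-key-style lookup. */
public class FetchByFilename {
    public static Document fetch(IndexSearcher searcher, String filename)
            throws Exception {
        Hits hits = searcher.search(new TermQuery(new Term("filename", filename)));
        return hits.length() > 0 ? hits.doc(0) : null;   // null if not indexed
    }
}
```

If the filename field was tokenized at index time, the stored value and the indexed terms differ, and the exact-match lookup above will miss; that is why key-like fields are usually indexed as keywords.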
Searching for similar documents in Lucene
Hello all, how can I find the most similar document to a given document? Do I need to perform one search for each field in the Document and merge the resulting Hit lists? Maybe somebody has already done that and can give me hints, or perhaps an example? Cheers and good night, Karl
Closing the IndexSearcher object
Hi all, I have a search method which is used by many programs with different queries. I therefore do not want to close the IndexSearcher object, so that other programs can reuse it. Does this have any side effects (e.g. does the IndexSearcher object contain state information)? Would it be better to always instantiate a new IndexSearcher object and close it after use? Cheers, Karl
Re: IndexReader.document(int i)
Hello back, according to the JavaDoc it means: "Returns the stored fields of the nth Document in this index." In your case it would mean: the greater n, the younger the doc. However, I am not sure how you can create such an index. I think you should have a look at the Luke project, which allows you to access and look into Lucene indices. If I did not answer your question, please explain a little further... Karl > hy, > I would like to know: > in IndexReader.document(int i), > what is this number i? > Is the first document the oldest document indexed > and the last the youngest (so we can sort by date easily)? > > Thanks in advance > > nico
Re: Relevance Feedback (2)
Hello all, oh, I just found a mail from Doug where he wrote that Dmitry Serebrennikov developed something which provides Document vector access: > Dmitry Serebrennikov [dmitrys@earthlink.net] has implemented a substantial > extension to Lucene which should help folks doing this sort of research. It > provides an explicit vector representation for documents. This way you can, > e.g., retrieve a number of documents, efficiently sum their vectors, then > derive a new query from the sum. This code was posted to the list a long > while back, but is now out of date. As soon as the 1.2 release is final, > and Dmitry has time, he intends to merge it into Lucene. Who has this code? Could somebody email it to me? I would highly appreciate it. Is there any attempt by Dmitry or somebody else to adapt it to Lucene 1.3? I wish you all a nice weekend, Karl
Relevance Feedback (2)
Hello group, I would like to implement relevance feedback functionality for my system. From the previous discussion in this group I know that this is not implemented in Lucene. As we all know, relevance feedback has two parts: 1) term reweighting, 2) query expansion. I am interested in doing both. My first thought was that term reweighting can be solved with term boosting, and query expansion basically by generating a new query. Looking closely at one of the classic term reweighting formulas (Rocchio), however, reveals that I need access to the term vector of the relevant documents as well as the term vector of the non-relevant documents. For Lucene this means I would need the weight of each term in the relevant and non-relevant documents to evaluate the reweighting formula. In other words, I would need to extract Documents from the Hits object after the search, and from these Documents I would need to get all terms and their weights. However, Lucene does not provide this: only Documents and their scores can be retrieved; there is no access to a document's terms and therefore no access to term weights. Does somebody have ideas for a workaround for term reweighting and query expansion without going through Hits? Has somebody produced such workarounds and can provide them to me? Thank you very much in advance, Karl
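For reference, the Rocchio reweighting itself is simple once term vectors are available; the hard part, as described above, is getting those vectors out of Lucene. A self-contained, hypothetical sketch in which vectors are plain term-to-weight maps (the class and parameter names are invented):

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Rocchio sketch: q' = alpha*q + beta*avg(relevant) - gamma*avg(nonRelevant).
 *  Vectors are Maps from term (String) to weight (Double); obtaining such
 *  vectors from a Lucene index is exactly the open problem discussed above. */
public class Rocchio {
    public static Map reweight(Map query, Map[] relevant, Map[] nonRelevant,
                               double alpha, double beta, double gamma) {
        Map result = new HashMap();
        addScaled(result, query, alpha);
        for (int i = 0; i < relevant.length; i++)
            addScaled(result, relevant[i], beta / relevant.length);
        for (int i = 0; i < nonRelevant.length; i++)
            addScaled(result, nonRelevant[i], -gamma / nonRelevant.length);
        return result;
    }

    /** Adds factor * vec to the accumulator, term by term. */
    private static void addScaled(Map into, Map vec, double factor) {
        for (Iterator it = vec.entrySet().iterator(); it.hasNext();) {
            Map.Entry e = (Map.Entry) it.next();
            double old = into.containsKey(e.getKey())
                    ? ((Double) into.get(e.getKey())).doubleValue() : 0.0;
            double add = factor * ((Double) e.getValue()).doubleValue();
            into.put(e.getKey(), new Double(old + add));
        }
    }
}
```

The reweighted map can then be turned back into a query, e.g. one boosted TermQuery per surviving positive-weight term.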
Re: Term weighting and Term boost
Hello Andrzej, sorry - I mistakenly ran it under Java 1.2.2, which cannot work :-) Then you get thread exceptions... Anyway, solved now. Thank you, Karl > Karl Koch wrote: > > > Hello, and thank you for this link. I think this is a very useful tool > > for analysing Lucene internals. > > > >>I realize this is not exactly the answer, but you may want to try one of > >>the new features of Luke (http://www.getopt.org/luke), namely the query > >>result explanation. > > > > When I start it according to the description on your web site and select the > > index directory, I get an error message "current threat no owner"... > > > > I.e. Java WebStart, or by getting the jars and starting it from > command-line? > > > What does it mean, and what am I doing wrong? > > Beats me... I've never seen something like that. Could you please turn > on the Java console, and see what kind of exception is thrown, and where? > > -- > Best regards, > Andrzej Bialecki
Re: Term weighting and Term boost
Hello, and thank you for this link. I think this is a very useful tool for analysing Lucene internals. > I realize this is not exactly the answer, but you may want to try one of > the new features of Luke (http://www.getopt.org/luke), namely the query > result explanation. When I start it according to the description on your web site and select the index directory, I get an error message "current threat no owner"... What does it mean, and what am I doing wrong? Kind Regards, Karl > > Currently the best way to start Luke is to use Java WebStart. Then open > an already existing index, go to the Search tab, enter a query (use > the "Update" button to see exactly what it is parsed into), press Search, > and then highlight one of the results and press "Explain". > > It was revealing for me to see how weights, boosts, normalizations etc. > are applied "under the hood", so to speak, especially for Fuzzy or > Phrase queries. > > After experimenting a little, you may want to consult the classes in > org.apache.lucene.search (e.g. Scorer and Similarity) to see the gory > details. > > -- > Best regards, > Andrzej Bialecki
BooleanQuery question
Hi all, why does the Boolean query have a "required" and a "prohibited" field (boolean values)? If something is required it cannot be prohibited, and vice versa. How does this match the Boolean model we know from theory? Are there differences between Lucene and the theoretical Boolean model? Kind Regards, Karl
Term weighting and Term boost
Hello all, I am new to the Lucene scene and have a few questions regarding the term boost philosophy: Is the term boost equal to a term weight? Example: if I boost a term with 0.2, does this mean the term then has a weight of 0.2? If not, how is the term weight of the query calculated? Is there a formula? Are there parts of it which I cannot influence? Does the formula depend on the type of Query, or is it independent? Maybe somebody can provide a small code example? Given the following code:

TermQuery termQuery1 = new TermQuery(new Term("contents", "house"));
TermQuery termQuery2 = new TermQuery(new Term("contents", "tree"));
termQuery2.setBoost( ? );
BooleanQuery finalQuery = new BooleanQuery();
finalQuery.add(termQuery1, true, false);
finalQuery.add(termQuery2, true, false);

How can I make the term "tree" twice as important for the search as "house"? Many questions, I know, but I am sure the experts here can answer them easily. Cheers, Karl
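For the concrete question at the end: boost is a score multiplier with a default of 1.0, not an absolute weight, so a sketch along these lines (hedged against the Lucene 1.3 API) would make "tree" count roughly twice as much as "house" - the final per-term weight still includes tf, idf, and normalization factors you do not set directly:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

/** Sketch: boost multiplies a clause's score contribution.
 *  tree.setBoost(2.0f) makes "tree" roughly twice as important
 *  as the unboosted (1.0) "house" clause. */
public class BoostedQuery {
    public static BooleanQuery build() {
        TermQuery house = new TermQuery(new Term("contents", "house"));
        TermQuery tree  = new TermQuery(new Term("contents", "tree"));
        tree.setBoost(2.0f);              // a multiplier, not an absolute weight
        BooleanQuery query = new BooleanQuery();
        query.add(house, true, false);    // required
        query.add(tree, true, false);     // required
        return query;
    }
}
```

Conversely, a boost of 0.2 would make a term count less than the others, not give it "a weight of 0.2" in the vector-space sense.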