Re: indexing incrementally concurrently
Erik Hatcher wrote:

On Jul 5, 2004, at 9:00 AM, Michael Wechner wrote:

If several users are saving documents on the server concurrently, and the index is to be updated incrementally on each save, do I have to make sure that this is thread-safe, or does Lucene take care of it?

Only a single IndexWriter instance can be used at a time, so you will need to coordinate things. Multiple threads can share a single IndexWriter though, so no worries there.

OK. Thanks very much for the info.

Michi

Erik

--
Michael Wechner
Wyona Inc. - Open Source Content Management - Apache Lenya
http://www.wyona.com http://cocoon.apache.org/lenya/
[EMAIL PROTECTED] [EMAIL PROTECTED]

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
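The advice above (one writer instance, shared by all threads) can be sketched roughly as follows. This is only an illustration of the sharing pattern: SharedWriter is a hypothetical stand-in, not Lucene's IndexWriter, and the class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the single writer the application keeps open.
// It is NOT Lucene's IndexWriter; it only mimics "add a document".
class SharedWriter {
    private final List<String> docs = new ArrayList<>();

    // synchronized: the stand-in guards its own state, so callers need no extra locking.
    synchronized void addDocument(String doc) {
        docs.add(doc);
    }

    synchronized int docCount() {
        return docs.size();
    }
}

class ConcurrentSaveDemo {
    // Simulates `users` concurrent saves that all go through the same writer instance.
    static int saveConcurrently(int users) {
        SharedWriter writer = new SharedWriter();          // created once, shared by all threads
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < users; i++) {
            final String doc = "document-" + i;
            pool.submit(() -> writer.addDocument(doc));    // every save reuses the shared writer
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);   // wait for all pending saves
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return writer.docCount();
    }

    public static void main(String[] args) {
        System.out.println(saveConcurrently(100));
    }
}
```

The point of the design is that the application never opens a second writer per request; the save handlers only ever see the one shared instance.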
incrementally indexing a million documents
I am trying to index around a million documents. The problem is that I run out of memory while sorting by uid when I go through the directory recursively. I could add more memory, but this wouldn't really solve my problem, because at some scale (e.g. 10 million documents) I would run out of memory again. Is there an approach other than sorting by uid?

Thanks

Michi
Re: what web crawler work best with Lucene?
Tuan Jean Tee wrote:

Has anyone implemented an open source web crawler with Lucene? I have a dynamic website and am looking at putting in a search tool. Your advice is very much appreciated.

There is a crawler included in Apache Lenya (http://cocoon.apache.org/lenya/), under src/java/org/apache/lenya/search/crawler/*, or you might try LARM:
http://jakarta.apache.org/lucene/docs/lucene-sandbox/larm/overview.html

HTH

Michi

Thank you.
sorting by date (XML)
My XML files contain something like

<date><year>2004</year><month>04</month><day>27</day>...</date>

and I would like to sort by this date. So I guess I need to modify the Documentparser and generate something like a millisecond field and then sort by it, correct? Has anyone done something like this yet?

Thanks

Michi
Re: sorting by date (XML)
Nader S. Henein wrote:

Here's my two cents on this: either way you will need to combine the date into one field, but if you use a millisecond representation you will not be able to use the FLOAT sort type and you'll have to use the STRING sort (slower), because the millisecond representation is longer than FLOAT allows. So you have three options:

1) Use MMDD and sort by FLOAT type

OK, I guess I will take the FLOAT type then.

2) Use the millisecond representation and sort by STRING type

3) If the date you're entering here is the date of indexing, then you can just sort by DOC type (which is the DOC ID) and save yourself the pain

Unfortunately this isn't possible.

Thanks a lot for your help

Michi

Hope this helps.

Nader Henein

-----Original Message-----
From: Michael Wechner [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, April 27, 2004 3:52 PM
To: Lucene Users List
Subject: sorting by date (XML)

My XML files contain something like

<date><year>2004</year><month>04</month><day>27</day>...</date>

and I would like to sort by this date. So I guess I need to modify the Documentparser and generate something like a millisecond field and then sort by it, correct? Has anyone done something like this yet?

Thanks

Michi
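As a rough sketch of "combining the date into one field", the helper below turns separate year, month and day values into a single key. The class name, the method names, the zero-padded yyyyMMdd layout and the UTC midnight convention are assumptions made for this illustration; none of them comes from the thread or from Lucene.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: builds one sortable value from separate year/month/day values.
class DateKeys {
    // Zero-padded yyyyMMdd, e.g. 2004-04-27 becomes "20040427".
    // Because every key has the same width, string order equals chronological order.
    static String compactKey(int year, int month, int day) {
        return LocalDate.of(year, month, day).format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    // Milliseconds since the epoch at midnight UTC (the time zone is an assumption).
    // This is a 13-digit number for current dates, far more digits than a float can hold exactly.
    static long epochMillis(int year, int month, int day) {
        return LocalDate.of(year, month, day)
                .atStartOfDay(ZoneOffset.UTC)
                .toInstant()
                .toEpochMilli();
    }

    public static void main(String[] args) {
        System.out.println(compactKey(2004, 4, 27));
        System.out.println(epochMillis(2004, 4, 27));
    }
}
```

Either value would be stored as the single date field at indexing time; which sort type fits it is the trade-off discussed in the options above.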
Re: sorting by date (XML)
Robert Koberg wrote:

Ah. Great - thanks! I see you added it to the wiki. Thanks again :)

I guess you mean http://wiki.apache.org/jakarta-lucene/IndexingDateFields

Thanks as well

Michi

This is perfect in my case since iso8601 is in the format: 2004-04-27T01:23:33

Luckily so far, from my logs, hardly anyone uses the date search. I guess I should have been doing this from the beginning, don't know why I didn't...

best,
-Rob

Erik
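One reason a fixed-width ISO 8601 format such as 2004-04-27T01:23:33 is convenient is that plain string order matches chronological order. The small check below illustrates this; the class and method names are made up for the example, and it assumes all timestamps use exactly this layout with no time zone suffix.

```java
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical demo class comparing two ways of ordering ISO 8601 timestamps.
class Iso8601Order {
    // Sorts the timestamps as plain strings (lexicographic order).
    static List<String> sortAsStrings(List<String> stamps) {
        List<String> copy = new ArrayList<>(stamps);
        Collections.sort(copy);
        return copy;
    }

    // Sorts the timestamps by their parsed date-time value (chronological order).
    static List<String> sortAsDates(List<String> stamps) {
        List<String> copy = new ArrayList<>(stamps);
        copy.sort((String a, String b) -> LocalDateTime.parse(a).compareTo(LocalDateTime.parse(b)));
        return copy;
    }

    public static void main(String[] args) {
        List<String> stamps = Arrays.asList(
                "2004-04-27T01:23:33", "2003-12-31T23:59:59", "2004-04-27T01:23:04");
        // Both orderings agree for same-layout ISO 8601 strings.
        System.out.println(sortAsStrings(stamps).equals(sortAsDates(stamps)));
    }
}
```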
Re: org.apache.lucene.demo.IndexHTML - parse JSP files?
John Bresnik wrote:

Does anyone know of a quick and easy way to get this demo [org.apache.lucene.demo.IndexHTML] to parse JSP files as well? I used a crawler to create a local [static] version of the site [i.e. they are no longer JSP files, just the HTML output from the original JSP files], but in the interest of keeping the URLs intact, I need to parse the JSP extensions. The short question is: does anyone know of a way to *not* ignore the *.jsp files?

Just modify IndexHTML: there is one line in there which decides which extensions it will index.

HTH

Michael

thanks.
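The kind of extension check meant here might look roughly like the sketch below. It does not reproduce the actual line in IndexHTML; the class name, the method name and the list of extensions are assumptions chosen to show how .jsp could be added to such a test.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not the demo's own code: decides by file extension
// whether a path should be handed to the indexer.
class IndexableFiles {
    // Assumed list of accepted extensions; ".jsp" is the addition asked about.
    static final List<String> EXTENSIONS = Arrays.asList(".html", ".htm", ".txt", ".jsp");

    static boolean shouldIndex(String path) {
        String lower = path.toLowerCase(Locale.ROOT);   // compare case-insensitively
        for (String ext : EXTENSIONS) {
            if (lower.endsWith(ext)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(shouldIndex("site/products/list.jsp"));
        System.out.println(shouldIndex("site/images/logo.gif"));
    }
}
```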
Re: xpdf parser usage for lucene
Pinky Iyer wrote:

Hi! I am trying to use xpdf as a PDF parser. The problem is that when I encounter a file with a .pdf extension, I call the pdftotext script to convert it to text, which in turn uses the file system and leaves the same file with a .txt extension in the same directory. How can I get this as a stream and not use the file system at all? Also, how do I access the summary and title info?

xpdf has an option to turn the PDF into HTML instead of plain text, which allows you to use an HTMLParser for populating the fields. Concerning the extension: when you create your Lucene document, you could replace the .txt extension with the .pdf extension in the uri field.

HTH

Michael

Anybody who has done this before, please help! Thanks!

Pinky Iyer
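A possible way to approach both points is sketched below. It assumes that the installed pdftotext accepts "-" as the output file name to write to standard output and offers a -htmlmeta switch for HTML output with meta information; both should be checked against the man page of the installed version. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper around the pdftotext command-line tool.
class PdfText {
    // Builds the command line. "-" as output file is ASSUMED to mean stdout,
    // and "-htmlmeta" is ASSUMED to produce simple HTML with meta headers.
    static List<String> command(String pdfPath, boolean htmlMeta) {
        List<String> cmd = new ArrayList<>();
        cmd.add("pdftotext");
        if (htmlMeta) {
            cmd.add("-htmlmeta");
        }
        cmd.add(pdfPath);
        cmd.add("-");
        return cmd;
    }

    // Starts pdftotext and returns its standard output as a stream,
    // so no intermediate .txt file is left on disk.
    static InputStream openTextStream(String pdfPath, boolean htmlMeta) throws IOException {
        Process process = new ProcessBuilder(command(pdfPath, htmlMeta))
                .redirectError(ProcessBuilder.Redirect.INHERIT)  // keep stderr from filling up unread
                .start();
        return process.getInputStream();
    }

    // Maps the URI of an extracted .txt file back to the original .pdf,
    // for use as the value of the uri field.
    static String pdfUri(String txtUri) {
        return txtUri.endsWith(".txt")
                ? txtUri.substring(0, txtUri.length() - ".txt".length()) + ".pdf"
                : txtUri;
    }
}
```

The pdfUri helper covers the other route mentioned in the reply: keep writing .txt files, but store the PDF's address in the index.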
Re: PLAN: WebLucene -- Lucene Web interface, use XML as a lightweight protocol.
That's very interesting. I have tried something similar by integrating Lucene into Wyona, which is a CMS based on Cocoon, and I also separated structure from layout. You can try it out at

HTML: http://195.226.6.70:8080/wyona-cms/oscom/search-oscom/lucene?publication-id=all&queryString=Cocoon+Wyona&fields=all&find=Search
XML: http://195.226.6.70:8080/wyona-cms/oscom/search-oscom/lucene.xml?publication-id=all&queryString=Cocoon+Wyona&fields=all&find=Search

I think XooMLe also did a pretty good job: http://www.dentedreality.com.au/xoomle/search/

Maybe we can find a way to join efforts.

Thanks

Michael

Che Dong wrote:

http://sourceforge.net/projects/weblucene/

WebLucene: Lucene Web interface, using XML as a lightweight protocol. Developers convert data sources (text, DB, MS Word, PDF, etc.) into a standard XML format, index it with the Lucene engine, and get full-text search results via HTTP as XML output. Users can easily integrate this with a JSP, ASP or PHP front end, or use XSLT on the server side to transform the output.

Developers can integrate the Lucene full-text search engine with older MSSQL + ASP, MySQL + PHP, or Oracle + JSP based web applications.

[Diagram: MySQL, Oracle, MSSQL (DB), MS Word and PDF sources -> XML -> (Lucene Index) -> XML -> JSP, ASP, PHP front ends, or via XSLT -> XHTML, text, XML; the middle part is labeled Web Lucene]

i18n issue: since Java is Unicode based, users can index data sources (XML) in different charsets into one Lucene index (in Unicode) and output results according to the languages the client browser supports.

[Diagram: GBK, BIG5, SJIS, ISO-8859-1 -> Unicode (XML) -> Unicode (XML) -> BIG5, GB2312, SJIS, ISO-8859-1]

Che, Dong
http://www.chedong.com/tech/
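The i18n idea above (decode each source from its own charset into Unicode, then encode results for the client) can be illustrated with plain Java strings. This is only a sketch: the class and method names are made up, it is unrelated to WebLucene's actual code, and it uses ISO-8859-1 and UTF-8 because every Java runtime supports them, whereas charsets such as GBK or Big5 may depend on the installed JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper showing charset -> Unicode -> charset conversion.
class CharsetBridge {
    // Decodes raw bytes from the named source charset into a (Unicode) Java String.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    // Encodes a Java String into the charset the client asked for.
    static byte[] encode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0x63, (byte) 0x61, (byte) 0x66, (byte) 0xE9};  // "café" in ISO-8859-1
        String unicode = decode(latin1, "ISO-8859-1");    // what would be handed to the indexer
        byte[] utf8 = encode(unicode, StandardCharsets.UTF_8.name());
        System.out.println(unicode.length() + " chars, " + utf8.length + " UTF-8 bytes");
    }
}
```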
no-index or index
Hi

I am looking for an HTMLParser which skips text tagged by no-index or something similar. This way I could exclude, for instance, a global navigation section within the HTML:

<no-index>
International<br>
Business<br>
Science<br>
...
</no-index>

It seems that the current demo/HTMLParser (http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q11) is not capable of doing something like that. Any pointers are very welcome.

Thanks a lot

Michael
Re: no-index or index
Erik Hatcher wrote:

If you look at the contributions/ant area of the Lucene sandbox in CVS you'll see my HtmlDocument class, which uses JTidy. Rather than making up some invalid HTML tag, I'd recommend you separate your navigation section with a div or span with a special class="navigation" or something like that. Then use JTidy to ignore such tags that have that class. Then you get valid, clean HTML and the ability to filter it for indexing.

Well, I haven't found out how to use JTidy to ignore tags that have such a class. So I just added some code to your class HtmlDocument within the getBodyText method:

    if (child.getNodeName().equals("span")) {
        org.w3c.dom.Attr attribute = ((Element) child).getAttributeNode("class");
        if (attribute != null) {
            if (attribute.getValue().equals("lucene-no-index")) {
                System.out.println("HtmlDocument.getBodyText(): ignore span!");
                break;
            }
        }
        System.out.println("HtmlDocument.getBodyText(): accept span!");
    }

This way text will be ignored within <span class="lucene-no-index">...</span>. It's not perfect, but it's working very well for the moment.

Two remarks:

1) I noticed that demo/HTMLDocument (resp. demo/html/HTMLParser) sets contents = title + body, whereas your class HtmlDocument sets contents = body.

2) I got two Javadoc warnings, because @return was empty within HtmlDocument (getDocument() and Document()).

Thanks very much for your help

Michael

Erik

On Thursday, January 30, 2003, at 04:56 AM, Michael Wechner wrote:

Hi

I am looking for an HTMLParser which skips text tagged by no-index or something similar. This way I could exclude, for instance, a global navigation section within the HTML:

<no-index>
International<br>
Business<br>
Science<br>
...
</no-index>

It seems that the current demo/HTMLParser (http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q11) is not capable of doing something like that. Any pointers are very welcome.

Thanks a lot

Michael
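For readers who want to try the idea outside of HtmlDocument, here is a minimal, self-contained sketch of the same technique: walk a DOM tree and collect text, but skip any element whose class attribute is lucene-no-index. It uses the JDK's XML parser on well-formed markup instead of JTidy, and the class and method names are hypothetical. Unlike the snippet above, it skips only the marked element and carries on with its siblings.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of the Lucene sandbox.
class NoIndexText {
    static final String SKIP_CLASS = "lucene-no-index";

    // Recursively appends the text below `node`, leaving out marked elements.
    static void collect(Node node, StringBuilder out) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                out.append(child.getNodeValue()).append(' ');
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                Element element = (Element) child;
                if (SKIP_CLASS.equals(element.getAttribute("class"))) {
                    continue;                      // skip this subtree, keep processing siblings
                }
                collect(element, out);
            }
        }
    }

    // Parses well-formed (X)HTML and returns the indexable text with normalized whitespace.
    static String bodyText(String xhtml) {
        try {
            Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            StringBuilder out = new StringBuilder();
            collect(root, out);
            return out.toString().trim().replaceAll("\\s+", " ");
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(bodyText(
                "<body><span class=\"lucene-no-index\">International Business</span>"
                + "<p>Article text</p><span>more</span></body>"));
    }
}
```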
Re: no-index or index
Ronnie Kolehmainen wrote:

Michael, the HtmlDocument class supports ignoring tags, i.e. all text inside the specified tag names is ignored. Look at the setIgnoreTags(String[] ignoredtags) method. Remember to also include script and style in this array along with your custom tag names.

I am not able to find the method setIgnoreTags() (I have updated my jakarta-lucene and jakarta-lucene-sandbox). Or would that have been in the attachment? I guess attachments are skipped by the mailing list server. I am now using Erik's code from the sandbox. Anyway, thanks a lot for your help.

Michael

Hope this is of some help to you. See below for the message from an old thread.

/Ronnie

Message sent on Dec 9, 2002:

Hi, these are the classes I use. I only use them to extract the text, so they don't have methods for getting the document title and such. However, text extraction has worked fine for me. The HtmlParser main method takes a file path as its argument and writes the contents to a file named html.txt, which is useful when testing.

/Ronnie

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: 7 December 2002 17:12
To: Lucene Users List
Subject: Re: SV: Indexing HTML

I have had good experiences with the nekoHTML parser.

Otis

--- Leo Galambos [EMAIL PROTECTED] wrote:

I'm not sure this is a solution to your problem. However, it seems that the HTMLParser used by the IndexHTML class has problems parsing the document (there is a test class included in the jar):

    java -cp C:\projects\lucene\jakarta-lucene\bin\lucene-demos.jar org.apache.lucene.demo.html.Test f01529.txt
    Title: Webcz.cz - Power of search
    Parse Aborted: Encountered \' at line 106, column 27.
    Was expecting one of: ArgName ... TagEnd ...

/Ronnie

Hi Ronnie! I know about it, and the exception is handled well (see log file below). I have found a better example than 1529; try this: http://com-os2.ms.mff.cuni.cz/bugs/f00034.txt

This file cannot get through the Lucene HTML parser (I have tried 1.2 and IBM JDK 1.3.1r3). The file is peculiar, i.e. it has two titles, two base tags, etc. I have no debugger here, so I cannot find the line where the bug is. If you try your magic, please let me know about the patch. :)

THX

-g-

    adding save/d00320/f01516.html
    Parse Aborted: Lexical error at line 68, column 11. Encountered: \u0178 (376), after : :
    adding save/d00320/f01527.html
    Parse Aborted: Encountered = at line 83, column 48. Was expecting one of: ArgName ... TagEnd ...
    adding save/d00320/f01528.html
Re: no-index or index
Kelvin Tan wrote:

My suggestion would be to modify HTMLParser to do the job. Don't think it's very difficult. I'm unaware of any existing HTML parsers which support that functionality...

Maybe Erik wants to include an improved version of my code snippet in CVS. I guess I am not the only one wanting to exclude certain parts of an HTML page ;-)

All the best

Michael

Regards,
Kelvin

The book giving manifesto - http://how.to/sharethisbook

On Thu, 30 Jan 2003 10:56:50 +0100, Michael Wechner said:

Hi

I am looking for an HTMLParser which skips text tagged by no-index or something similar. This way I could exclude, for instance, a global navigation section within the HTML:

<no-index>
International<br>
Business<br>
Science<br>
...
</no-index>

It seems that the current demo/HTMLParser (http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q11) is not capable of doing something like that. Any pointers are very welcome.

Thanks a lot

Michael
Re: no-index or index
Erik Hatcher wrote:

On Thursday, January 30, 2003, at 06:59 PM, Michael Wechner wrote:

<snip/>

2) I got two Javadoc warnings, because @return was empty within HtmlDocument (getDocument() and Document())

picky picky! :) But thanks - I'll correct those too.

Sorry for that, but ant resp. javadoc was picky :-)

I'm not ready to commit my changes - I'll do so in a few weeks when I get some refactoring done on IndexTask.

No problem

Thanks

Michael

Erik
Re: no-index or index
Erik Hatcher wrote:

On Thursday, January 30, 2003, at 07:07 PM, Michael Wechner wrote:

Maybe Erik wants to include an improved version of my code snippet into CVS.

Only if it can be made generic somehow - but that might be a bit tricky to implement depending on how crazy we wanted to get with it. The HtmlDocument class is really meant to be just an example of how to use the Ant index task I wrote along with the FileExtensionDocumentHandler that uses it. So its original purpose was not to be a robust HTML document indexer, but an example piece of a larger puzzle.

Sure, no problem. Actually I think it's good to have both small demo code and larger industrial-strength code.

I guess I am not the only one wanting to exclude certain parts from an HTML page ;-)

I've seen this request come up in the recent past, in fact. And it's a perfectly reasonable one, especially if you are in charge of the HTML.

Yeah, I am not sure if there is a standard way to do this. I just know from an Atomz demo that they are using something like this. It would be nice if there were a standard tag for this, or at least if the open source search engine projects could agree on one. Having it configurable would also be nice of course, but I think it wouldn't be necessary to begin with.

Thanks

Michael

Erik
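One way the "generic" and "configurable" variant discussed above could be shaped is to express each exclusion rule as a predicate over DOM elements and combine rules as needed. The sketch below is an assumption about such a design, not code from the sandbox; the class, the method names and the example tag and class names are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.Predicate;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;

// Hypothetical rule factory: each rule answers "should this element be left out of the index?"
class ExclusionRules {
    // Excludes elements by tag name, compared case-insensitively.
    static Predicate<Element> byTag(String... names) {
        Set<String> tags = new HashSet<>();
        for (String name : names) {
            tags.add(name.toLowerCase(Locale.ROOT));
        }
        return element -> tags.contains(element.getTagName().toLowerCase(Locale.ROOT));
    }

    // Excludes elements whose class attribute contains the given class name.
    static Predicate<Element> byClass(String cssClass) {
        return element -> Arrays.asList(element.getAttribute("class").trim().split("\\s+"))
                .contains(cssClass);
    }

    // Creates a detached element, only so the rules can be tried out without parsing a page.
    static Element element(String tag, String cssClass) {
        try {
            Element element = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .newDocument().createElement(tag);
            if (cssClass != null) {
                element.setAttribute("class", cssClass);
            }
            return element;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Rules are combined with Predicate.or, so the set of exclusions is pure configuration.
        Predicate<Element> exclude = byTag("script", "style").or(byClass("lucene-no-index"));
        System.out.println(exclude.test(element("span", "nav lucene-no-index")));
        System.out.println(exclude.test(element("p", null)));
    }
}
```

A text extractor would then take one such predicate as a parameter and skip every element for which it returns true.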
Re: Indexing other documents (.pdf et .doc)
Friaa Nafaa wrote:

Hello, I use Lucene with Tomcat and I can now index and search all HTML documents. But I would like to index other documents such as PDF or Word (.doc). I hope that someone can help me!

Concerning PDF: before indexing you should extract the text from the PDF and save it as .txt (then you can index the .txt, but reference the PDF's URI). To do this, have a look at http://www.foolabs.com/xpdf/download.html or http://www.pdfbox.org/

These links are listed at http://jakarta.apache.org/lucene/docs/contributions.html

Also take a look at the FAQ.

HTH

Michael
Score: Lucene 1.2 versus 1.3-dev1
Hi

I started to deploy Lucene 1.3-dev1 from CVS very recently and noticed that the scores are quite different. With Lucene 1.2 I received scores such as 3.45345234 * 10e-1. With Lucene 1.3-dev1 I am receiving scores such as 3.23232131 * 10e-8.

Is this correct, or do I have to change something in my Lucene implementation?

Thanks a lot in advance

Michael
Re: Score: Lucene 1.2 versus 1.3-dev1
Eric Isakson wrote:

Did you rebuild your index?

No, of course not ;-) Thanks a lot for the pointer

Michael

From CHANGES.TXT:

12. Added support for boosting the score of documents and fields via the new methods Document.setBoost(float) and Field.setBoost(float). Note: This changes the encoding of an indexed value. Indexes should be re-created from scratch in order for search scores to be correct. With the new code and an old index, searches will yield very large scores for shorter fields, and very small scores for longer fields. Once the index is re-created, scores will be as before. (cutting)

-----Original Message-----
From: Michael Wechner [mailto:[EMAIL PROTECTED]]
Sent: Monday, December 16, 2002 4:34 PM
To: [EMAIL PROTECTED]
Subject: Score: Lucene 1.2 versus 1.3-dev1

Hi

I started to deploy Lucene 1.3-dev1 from CVS very recently and noticed that the scores are quite different. With Lucene 1.2 I received scores such as 3.45345234 * 10e-1. With Lucene 1.3-dev1 I am receiving scores such as 3.23232131 * 10e-8.

Is this correct, or do I have to change something in my Lucene implementation?

Thanks a lot in advance

Michael