RE: How to index a Word document
I've been using the POI-scratchpad package for a while, with a slightly altered WordDocument class (I am only interested in the text). In my results, approximately 50% of Word documents can be parsed with this package. That is not very good, but in my opinion it is better than nothing, and as far as I know it is still the best Java solution.

/Ronnie

-----Original message-----
From: Nellai [mailto:[EMAIL PROTECTED]]
Sent: 31 January 2003 04:50
To: [EMAIL PROTECTED]
Subject: How to index a Word document

Hi! Can anyone tell me how to include a Word document for indexing? Is there any parser available for that? Thanks in advance.

Nellai

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
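Since only about half of the documents parse, an indexing run probably has to tolerate failures and keep going. The following is a minimal sketch of that idea, not the altered WordDocument class mentioned above. `TolerantExtraction`, `TextExtractor`, `Result` and `extractAll` are hypothetical names invented for illustration; the actual Word parsing would be plugged in behind the `TextExtractor` interface.

```java
import java.util.ArrayList;
import java.util.List;

class TolerantExtraction {
    /** Hypothetical stand-in for a real Word text extractor (for example one built on POI). */
    interface TextExtractor {
        String extract(String path) throws Exception;
    }

    /** Collects the extracted texts and the paths that could not be parsed. */
    static final class Result {
        final List<String> texts = new ArrayList<>();
        final List<String> failedPaths = new ArrayList<>();
    }

    /** Tries every path; a failure is recorded and the run continues with the next file. */
    static Result extractAll(List<String> paths, TextExtractor extractor) {
        Result result = new Result();
        for (String path : paths) {
            try {
                result.texts.add(extractor.extract(path));
            } catch (Exception e) {
                // Assumption: an unparsable document surfaces as an exception.
                result.failedPaths.add(path);
            }
        }
        return result;
    }
}
```

The list of failed paths makes it possible to measure the real success rate on your own documents and to retry those files with a different parser later.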
RE: no-index or index
..wonder what happened with the attachments... here they go again.

-----Original message-----
From: Ronnie Kolehmainen [mailto:[EMAIL PROTECTED]]
Sent: 30 January 2003 14:15
To: [EMAIL PROTECTED]
Subject: Re: no-index or index

Michael,

the HtmlDocument class supports ignoring tags, i.e. all text inside the specified tag names is ignored. Look at the setIgnoreTags(String[] ignoredtags) method. Remember to include script and style in this array along with your custom tag names. I hope this is of some help to you. See below for the message from an old thread.

/Ronnie

Hi,

I am looking for an HTMLParser which skips text tagged by no-index or something similar. This way I could exclude, for instance, a global navigation section within the HTML:

<no-index> International<br> Business<br> Science<br> ... </no-index>

It seems that the current demo/HTMLParser (http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexingtoc=faq#q11) is not capable of doing something like that. Any pointers are very welcome.

Thanks a lot
Michael

Message sent on 9 December 2002:

Hi, these are the classes I use. I only use them to extract the text, so they have no methods for getting the document title and such. However, text extraction has worked fine for me. The main method of HtmlParser takes a file path as its argument and writes the contents to a file named html.txt, which is useful when testing.

/Ronnie
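The tag-ignoring behaviour described above can be approximated with the standard library alone. The sketch below is not the HtmlDocument class from the attachments (its source is not part of this thread). `IgnoreTagsSketch` and `extractText` are hypothetical names, and regular expressions are a simplification: they will not cope with nested ignored tags or badly malformed markup the way a real parser would.

```java
import java.util.regex.Pattern;

class IgnoreTagsSketch {
    /**
     * Returns the visible text of an HTML string, dropping everything
     * between the opening and closing tag of each ignored tag name.
     */
    static String extractText(String html, String... ignoredTags) {
        String result = html;
        for (String tag : ignoredTags) {
            String name = Pattern.quote(tag);
            // (?is): case-insensitive, and '.' also matches line breaks.
            // .*? is reluctant, so each block ends at its first closing tag.
            String block = "(?is)<" + name + "\\b[^>]*>.*?</" + name + "\\s*>";
            result = result.replaceAll(block, " ");
        }
        // Remove the remaining tags, then collapse runs of whitespace.
        result = result.replaceAll("<[^>]*>", " ");
        return result.replaceAll("\\s+", " ").trim();
    }
}
```

Calling `IgnoreTagsSketch.extractText(html, "no-index", "script", "style")` mirrors the advice in the message: the custom tag is listed together with script and style, so navigation text, JavaScript and CSS stay out of the index.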
SV: SV: Indexing HTML
Hi, these are the classes I use. I only use them to extract the text, so they have no methods for getting the document title and such. However, text extraction has worked fine for me. The main method of HtmlParser takes a file path as its argument and writes the contents to a file named html.txt, which is useful when testing.

/Ronnie

-----Original message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: 7 December 2002 17:12
To: Lucene Users List
Subject: Re: SV: Indexing HTML

I have had good experiences with the nekoHTML parser.

Otis

--- Leo Galambos [EMAIL PROTECTED] wrote:

I'm not sure this is a solution to your problem. However, it seems that the HTMLParser used by the IndexHTML class has problems parsing the document (there is a test class included in the jar):

java -cp C:\projects\lucene\jakarta-lucene\bin\lucene-demos.jar org.apache.lucene.demo.html.Test f01529.txt
Title: Webcz.cz - Power of search
Parse Aborted: Encountered \' at line 106, column 27.
Was expecting one of: ArgName ... TagEnd ...

/Ronnie

Hi Ronnie!

I know about it, and the exception is handled well (see the log file below). I have found a better example than 1529; try this: http://com-os2.ms.mff.cuni.cz/bugs/f00034.txt

This file cannot get through the Lucene HTML parser (I have tried 1.2 and IBM JDK 1.3.1r3). The file is unusual, i.e. it has two titles, two base tags, etc. I have no debugger here, so I cannot find the line where the bug is. If you try your magic, please let me know about the patch. :)

THX
-g-

adding save/d00320/f01516.html
Parse Aborted: Lexical error at line 68, column 11. Encountered: \u0178 (376), after : :
adding save/d00320/f01527.html
Parse Aborted: Encountered = at line 83, column 48.
Was expecting one of: ArgName ... TagEnd ...
adding save/d00320/f01528.html

Attachments: HtmlDocument.java, HtmlParser.java (binary data)
SV: Indexing HTML
Dear Leo,

I'm not sure this is a solution to your problem. However, it seems that the HTMLParser used by the IndexHTML class has problems parsing the document (there is a test class included in the jar):

java -cp C:\projects\lucene\jakarta-lucene\bin\lucene-demos.jar org.apache.lucene.demo.html.Test f01529.txt
Title: Webcz.cz - Power of search
Parse Aborted: Encountered \' at line 106, column 27.
Was expecting one of: ArgName ... TagEnd ...

If you look at the source of that document, you can see there is a JavaScript block with this problematic line:

document.write('s' + 'cript src=http://ad.webcz.cz/adwebcz/adscript.asp?a=10t=0b=0x=468y=60nocache=' + nIndex + '');

It looks to me as if the HTMLParser does _not_ handle script tags correctly, i.e. by ignoring everything until </script>. If you check stdout, there should be error messages from the ParserThread class like the one above. I tried parsing the same document with another HTML parser class without any problems. Maybe try replacing the HTMLParser class used by HTMLDocument with your own? Or edit the HTMLParser.jj file if you know JavaCC.

/Ronnie

-----Original message-----
From: Leo Galambos [mailto:[EMAIL PROTECTED]]
Sent: 3 December 2002 20:32
To: [EMAIL PROTECTED]
Subject: Indexing HTML

I tried to use IndexHTML (demo) and Lucene 1.2 for indexing *.CZ, but Lucene often falls into a never-ending loop. I've analyzed my data, so I know which file(s) brought Lucene down. I don't see anything special in the file(s), so I think they can get through the parser to the main Lucene routines (and then the problem could be in the Merger). Could you help me, please?

One of the problematic files: http://com-os2.ms.mff.cuni.cz/bugs/f01529.txt
My program (based on the Lucene demo): http://com-os2.ms.mff.cuni.cz/bugs/IndexHTML.java

Thank you very much.
-g-
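Until the parser itself is fixed or replaced, one defensive option for the never-ending loop is to give each document a time budget and skip it when the budget is exceeded. This is a general Java technique, not something proposed in this thread, and it does not repair the parsing problem. `ParseWithTimeout` and `runWithLimit` are hypothetical names, and the `Callable` stands in for whatever code parses one file.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class ParseWithTimeout {
    /** Runs the task and returns its result, or null if it fails or exceeds the limit. */
    static String runWithLimit(Callable<String> parseTask, long limitMillis) {
        // Daemon thread, so a parser stuck in a loop cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parser");
            t.setDaemon(true);
            return t;
        });
        Future<String> future = executor.submit(parseTask);
        try {
            return future.get(limitMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the worker; a busy loop that never checks for
            // interruption keeps running until the JVM exits.
            future.cancel(true);
            return null;
        } catch (ExecutionException e) {
            return null; // the parse task threw an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A null result tells the indexing loop to log the file name and move on, which also yields the list of files worth reporting as parser bugs.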