Re: Using the highlighter from the sandbox with a prefix query.
See the highlighter's package.html for a description of how query.rewrite should be used to solve this.
Cheers, Mark
--- lucuser4851 [EMAIL PROTECTED] wrote:
Dear All, We have been using the highlighter from the lucene sandbox, which works very nicely most of the time. However when we try and use it with a prefix query (which is what you get having parsed a wild-card query), it doesn't return any highlighted sections. Has anyone else experienced this problem, or found a way around it? Thanks a lot for your suggestions!!
- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Using the highlighter from the sandbox with a prefix query.
On Thursday 17 February 2005 08:37, lucuser4851 wrote:
We have been using the highlighter from the lucene sandbox, which works very nicely most of the time. However when we try and use it with a prefix query (which is what you get having parsed a wild-card query), it doesn't return any highlighted sections. Has anyone else experienced this problem, or found a way around it?
You need to call rewrite() on the query before you pass it to the highlighter.
Regards
Daniel
ParallelMultiSearcher Question
Hello, I would like to use ParallelMultiSearcher with a few RemoteSearchables. If one of the remote servers is down, can I call close() on the ParallelMultiSearcher and create a new ParallelMultiSearcher with the remaining live RemoteSearchables? Thanks. Youngho
RE: Concurrent searching re-indexing
Otis,
Looking at your reply again, I have a couple of questions -
IndexSearcher (IndexReader, really) does take a snapshot of the index state when it is opened, so at that time the index segments listed in segments should be in a complete state. It also reads index files when searching, of course.
1. If IndexReader takes a snapshot of the index state when opened and then reads the files when searching, what would happen if the files it takes a snapshot of are deleted before the search is performed (as would happen with a reindexing in the period between opening an IndexSearcher and using it to search)?
2. Does a similar potential problem exist when optimising an index, if this combines all the segments into a single file?
Many thanks
Paul
-----Original Message-----
From: Paul Mellor [mailto:[EMAIL PROTECTED]]
Sent: 16 February 2005 17:37
To: 'Lucene Users List'
Subject: RE: Concurrent searching re-indexing
But all write access to the index is synchronized, so that although multiple threads are creating an IndexWriter for the same directory and using it to totally recreate that index, only one thread is doing this at once. I was concerned about the safety of using an IndexSearcher to perform queries on an index that is in the process of being recreated from scratch, but I guess that if the IndexSearcher takes a snapshot of the index when it is created (and in my code this creation is synchronized with the write operations as well so that the threads wait for the write operations to finish before instantiating an IndexSearcher, and vice versa) this can't be a problem.
-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: 16 February 2005 17:30
To: Lucene Users List
Subject: Re: Concurrent searching re-indexing
Hi Paul, If I understand your setup correctly, it looks like you are running multiple threads that create IndexWriter for the same directory. That's a no no.
This section (first hit) describes the various concurrency issues with regard to adds, updates, optimization, and searches: http://www.lucenebook.com/search?query=concurrent
IndexSearcher (IndexReader, really) does take a snapshot of the index state when it is opened, so at that time the index segments listed in segments should be in a complete state. It also reads index files when searching, of course.
Otis
--- Paul Mellor [EMAIL PROTECTED] wrote:
Hi, I've read from various sources on the Internet that it is perfectly safe to simultaneously search a Lucene index that is being updated from another Thread, as long as all write access to the index is synchronized. But does this apply only to updating the index (i.e. deleting and adding documents), or to a complete re-indexing (i.e. create a new IndexWriter with the 'create' argument true and then re-add all the documents)? I have a class which encapsulates all access to my index, so that writes can be synchronized. This class also exposes a method to obtain an IndexSearcher for the index. I'm running unit tests to test this which create many threads - each thread does a complete re-indexing and then obtains an IndexSearcher and does a query. I'm finding that with sufficiently high numbers of threads, I'm getting the occasional failure, with the following exception thrown when attempting to construct a new IndexWriter (during the reindexing) -
java.io.IOException: couldn't delete _a.f1
    at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:166)
    at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:135)
    at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:113)
    at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:151)
    ...
The exception occurs quite infrequently (usually for somewhere between 1-5% of the Threads). Does the IndexSearcher take a 'snapshot' of the index at creation? Or does it access the filesystem whilst searching?
I am also synchronizing creation of the IndexSearcher with the write lock, so that the IndexSearcher is not created whilst the index is being recreated (and vice versa). But do I need to ensure that the IndexSearcher cannot search whilst the index is being recreated as well? Note that a similar unit test where the threads update the index (rather than recreate it from scratch) works fine, as expected. This is running on Windows 2000. Any help would be much appreciated! Paul
Re: Strange Index problem
On Tue, 25 Jan 2005 13:54:00 +0100, Nestel, Frank IZ/HZA-IOL [EMAIL PROTECTED] wrote:
In one project we have a system which incrementally updates an index every night. This has been working fine. We upgraded to Lucene 1.4.2 when it was released, without noticing any difference at first. But now we regularly run into trouble. It seems our index has captured a very defunct document: as long as you work around this document the index still works, but as soon as you touch that particular document, you run into trouble:
java.lang.IndexOutOfBoundsException: Index: 114, Size: 19
    at java.util.ArrayList.RangeCheck(ArrayList.java:547)
    at java.util.ArrayList.get(ArrayList.java:322)
    at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:155)
    at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:66)
    at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:237)
    at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:185)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:92)
    at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:487)
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
I've just found a similar traceback in one of our deployed systems (using version 1.4.3):
java.lang.IndexOutOfBoundsException: Index: 104, Size: 11
    at java.util.ArrayList.RangeCheck(ArrayList.java:507)
    at java.util.ArrayList.get(ArrayList.java:324)
    at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:155)
    at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:66)
    at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:237)
    at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:185)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:92)
    at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:487)
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
After having gotten this error from the optimize() call, it is no longer possible to search:
java.io.IOException: read past EOF
    at org.apache.lucene.store.InputStream.refill(InputStream.java:154)
    at org.apache.lucene.store.InputStream.readByte(InputStream.java:43)
    at org.apache.lucene.store.InputStream.readBytes(InputStream.java:57)
    at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:356)
    at org.apache.lucene.index.MultiReader.norms(MultiReader.java:159)
    at org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:64)
    at org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:165)
    at org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:165)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:85)
    at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:64)
    at org.apache.lucene.search.Hits.<init>(Hits.java:43)
    at org.apache.lucene.search.Searcher.search(Searcher.java:33)
    at org.apache.lucene.search.Searcher.search(Searcher.java:27)
Any ideas?
-- Geir O.
RE: Concurrent searching re-indexing
Paul Mellor writes:
1. If IndexReader takes a snapshot of the index state when opened and then reads the files when searching, what would happen if the files it takes a snapshot of are deleted before the search is performed (as would happen with a reindexing in the period between opening an IndexSearcher and using it to search)?
On Unix, open files remain readable even after they are deleted (that is, there is no longer a link (filename) to the file, but the file's content still exists). On Windows you cannot delete open files, so Lucene, AFAIK (I don't use Windows), postpones the deletion until the file is closed.
2. Does a similar potential problem exist when optimising an index, if this combines all the segments into a single file?
AFAIK optimising creates new files. The only problem that might occur is opening a reader while the index is being changed, but that's handled by a lock.
HTH
Morus
reuse of TokenStream
Hi, is it thread safe to reuse the same TokenStream object for several fields of a document or does the IndexWriter try to parallelise tokenization of the fields of a single document? Similar question: Is it safe to reuse the same TokenStream object for several documents if I use IndexWriter.addDocument() in a loop? Or does addDocument only put the work into a queue where tasks are taken out for parallel indexing by several threads? Thanks, Harald. -- Harald Kirsch | [EMAIL PROTECTED] | +44 (0) 1223/49-2593 BioMed Information Extraction: http://www.ebi.ac.uk/Rebholz-srv/whatizit
Re: Storing info about the index in the index
you could use a special document in the index to do this.
I was thinking of doing it this way, but I find this solution very ugly :)
You could also keep a .properties or .xml file alongside the index.
Can I store such a file inside the index directory? Will Lucene delete my file at some point (at optimize, or whatever)?
Regards, Sanyi
Re: Query Question
Hello;
My manager is now totally stuck about being able to query data with * in it. Here are two queries:
TermQuery(new Term("type", "203"));
WildcardQuery(new Term("name", "*home\**"));
They are joined in a boolean query. That query gives this result when you call toString():
+(type:203) +(name:*home\**)
This looks right to me. Any theories as to why it would not match:
Document (relevant fields): Keyword<type:203> Keyword<name:marcipan + home*>
Is the \ escaping both * characters?
Thanks, Luke
----- Original Message -----
From: Luke Shannon [EMAIL PROTECTED]
To: Lucene Users List lucene-user@jakarta.apache.org
Sent: Thursday, February 17, 2005 2:44 PM
Subject: Query Question
Hello; Why won't this query find the document below?
Query: +(type:203) +(name:*home\**)
Document (relevant fields): Keyword<type:203> Keyword<name:marcipan + home*>
I was hoping that by escaping the * it would be treated as a literal character. What am I doing wrong?
Thanks, Luke
Lius
Hi, I've just released an indexing framework based on Lucene, named LIUS. LIUS is written in Java and adds indexing support to Lucene for many file formats: MS Word, MS Excel, MS PowerPoint, RTF, PDF, XML, HTML, TXT, the OpenOffice suite, and JavaBeans. The whole indexing process is driven by a configuration file. You can visit these links for more information about LIUS; documentation is available in English and French: www.bibl.ulaval.ca/lius/index.en.html www.sourceforge.net/projects/lius
Re: ParallelMultiSearcher Question
Hello, Are there any pointers on how to close an index, and on how the server deals with index updates, when using ParallelMultiSearcher with the built-in RemoteSearchable? Need your help. Thanks, Youngho
----- Original Message -----
From: Youngho Cho [EMAIL PROTECTED]
To: Lucene Users List lucene-user@jakarta.apache.org
Sent: Thursday, February 17, 2005 6:29 PM
Subject: ParallelMultiSearcher Question
Hello, I would like to use ParallelMultiSearcher with a few RemoteSearchables. If one of the remote servers is down, can I call close() on the ParallelMultiSearcher and create a new ParallelMultiSearcher with the remaining live RemoteSearchables? Thanks. Youngho
Re: Query Question
On Feb 17, 2005, at 5:51 PM, Luke Shannon wrote:
My manager is now totally stuck about being able to query data with * in it.
He's gonna have to wait a bit longer; you've got a slightly tricky situation on your hands.
WildcardQuery(new Term("name", "*home\**"));
The \* is the problem. WildcardQuery doesn't deal with escaping like you're trying. Your query is essentially this now: home\* Where backslash has no special meaning at all... you're literally looking for all terms that start with home followed by a backslash. Two asterisks at the end really collapse into a single one logically.
Any theories as to why it would not match: Document (relevant fields): Keyword<type:203> Keyword<name:marcipan + home*> Is the \ escaping both * characters?
So, again, no escaping is being done here. You're a bit stuck in this situation because * (and ?) are special to WildcardQuery, and it does no escaping. Two options I can think of:
- Build your own clone of WildcardQuery that does escaping, or perhaps change the wildcard characters to something you do not index and use those instead.
- Replace asterisks in the terms indexed with some other non-wildcard character, then replace it in your queries as appropriate.
Erik
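To make the first option more concrete, here is a minimal sketch of the matching logic an escaping-aware wildcard matcher might use. It is plain Java with no Lucene dependency and is not Lucene's WildcardQuery: the class name EscapingWildcardMatcher and the choice of backslash as the escape character are assumptions made for illustration. A real WildcardQuery clone would also have to enumerate the terms in the index, which is not shown.

```java
// Hypothetical helper, not part of Lucene: matches a term against a pattern in
// which '*' means "any run of characters", '?' means "exactly one character",
// and a backslash makes the following character literal.
final class EscapingWildcardMatcher {

    static boolean matches(String pattern, String text) {
        // Step 1: tokenize the pattern, resolving backslash escapes.
        char[] symbols = new char[pattern.length()];
        boolean[] literal = new boolean[pattern.length()];
        int count = 0;
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c == '\\' && i + 1 < pattern.length()) {
                symbols[count] = pattern.charAt(++i); // escaped: always literal
                literal[count] = true;
            } else {
                symbols[count] = c;
                literal[count] = c != '*' && c != '?';
            }
            count++;
        }

        // Step 2: reachable[j] is true when the tokens processed so far
        // can match the first j characters of the text.
        boolean[] reachable = new boolean[text.length() + 1];
        reachable[0] = true;
        for (int k = 0; k < count; k++) {
            boolean[] next = new boolean[text.length() + 1];
            boolean many = !literal[k] && symbols[k] == '*';
            for (int j = 0; j <= text.length(); j++) {
                if (many) {
                    // '*' matches nothing, or extends a shorter match by one char.
                    next[j] = reachable[j] || (j > 0 && next[j - 1]);
                } else if (j > 0 && reachable[j - 1]) {
                    // '?' matches any single char; a literal must be equal.
                    next[j] = !literal[k] || text.charAt(j - 1) == symbols[k];
                }
            }
            reachable = next;
        }
        return reachable[text.length()];
    }

    public static void main(String[] args) {
        // The pattern from this thread, written as a Java string literal.
        System.out.println(matches("*home\\**", "marcipan + home*"));
        System.out.println(matches("*home\\**", "marcipan + homes"));
    }
}
```

With this kind of matcher, the escaped asterisk in *home\** only matches a literal * in the term, so "marcipan + home*" is accepted while "marcipan + homes" is not.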
Re: ParallelMultiSearcher Question
If you close a Searcher that goes through a RemoteSearchable, you'll close the remote index. I learned this by experimentation for Lucene in Action and added a warning there: http://www.lucenebook.com/search?query=RemoteSearchable+close
On Feb 17, 2005, at 8:27 PM, Youngho Cho wrote:
Hello, Are there any pointers on how to close an index, and on how the server deals with index updates, when using ParallelMultiSearcher with the built-in RemoteSearchable? Need your help. Thanks, Youngho
----- Original Message -----
From: Youngho Cho [EMAIL PROTECTED]
To: Lucene Users List lucene-user@jakarta.apache.org
Sent: Thursday, February 17, 2005 6:29 PM
Subject: ParallelMultiSearcher Question
Hello, I would like to use ParallelMultiSearcher with a few RemoteSearchables. If one of the remote servers is down, can I call close() on the ParallelMultiSearcher and create a new ParallelMultiSearcher with the remaining live RemoteSearchables? Thanks. Youngho
select where from query type in lucene
Hi, I have a problem with my classes using Lucene. My index looks like:
type     | content
---------+--------
document | x
document | x
view     | x
view     | x
dbentry  | x
dbentry  | x
My question now: how can I search for content where type=document, or where (type=document OR type=view)? Currently I do it with: (type:document OR type:entry) AND queryText as the query string. Is there a better way to do this?
thx miro
how to get stored fields
Hello again, I'm indexing my content as unstored fields. Now I want to get the fields matching the query and copy them to a new index. Do I have to reconstruct this content, or can I copy it as a field to a new index, like this:
Field f = hits.doc(i).getField("content"); d.add(f);
miro
The problem of using Cyber Neko HTML Parser parse HTML files
When I use the CyberNeko HTML Parser to parse HTML files (created by Microsoft Word), if the file contains HTML built-in entity references (for example: &nbsp;), the node value may contain an unknown character. Like this:
source html:
<DIV>
<P class=MsoNormal style="MARGIN: 0cm 0cm 0pt 18pt"><SPAN lang=EN-US style="mso-bidi-font-size: 10.5pt"><FONT face="Times New Roman"><FONT size=3>-rw-r--r--<SPAN style="mso-spacerun: yes">&nbsp;&nbsp;&nbsp; </SPAN>1 root<SPAN style="mso-spacerun: yes">&nbsp;&nbsp;&nbsp;&nbsp; </SPAN>root<SPAN style="mso-spacerun: yes">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </SPAN>50 Jan 21 16:12 _1e.f6<o:p></o:p></FONT></FONT></SPAN></P>
</DIV>
after parsing html:
-rw-r--r--??1 root?? root? 50 Jan 21 16:12 _1e.f6
How can I avoid it?
Re: The problem of using Cyber Neko HTML Parser parse HTML files
This is not an unknown character; it is a non-breaking space (Unicode value 0x00A0).
----- Original Message -----
From: Jingkang Zhang [EMAIL PROTECTED]
To: lucene-user@jakarta.apache.org
Sent: Friday, February 18, 2005 5:12 PM
Subject: The problem of using Cyber Neko HTML Parser parse HTML files
When I was using Cyber Neko HTML Parser parse HTML files( created by Microsoft word ), if the file contains HTML built-in entity references(for example: &nbsp;) , node value may contain unknown character. Like this:
source html:
<DIV>
<P class=MsoNormal style="MARGIN: 0cm 0cm 0pt 18pt"><SPAN lang=EN-US style="mso-bidi-font-size: 10.5pt"><FONT face="Times New Roman"><FONT size=3>-rw-r--r--<SPAN style="mso-spacerun: yes">&nbsp;&nbsp;&nbsp; </SPAN>1 root<SPAN style="mso-spacerun: yes">&nbsp;&nbsp;&nbsp;&nbsp; </SPAN>root<SPAN style="mso-spacerun: yes">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </SPAN>50 Jan 21 16:12 _1e.f6<o:p></o:p></FONT></FONT></SPAN></P>
</DIV>
after parsing html:
-rw-r--r--?1 root root 50 Jan 21 16:12 _1e.f6
How can I avoid it?
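Since the parser is correctly turning each nbsp entity into U+00A0, one way to get plain spaces in the extracted text is to normalize that character after parsing. The snippet below is a small sketch in plain Java; the class name NbspNormalizer is made up for illustration, and it assumes the node value is already available as a String.

```java
// Hypothetical post-processing step, independent of the HTML parser used:
// replaces every non-breaking space (U+00A0) with an ordinary space.
final class NbspNormalizer {

    static final char NBSP = '\u00A0';

    static String normalize(String nodeValue) {
        return nodeValue.replace(NBSP, ' ');
    }

    public static void main(String[] args) {
        // A node value shaped like the one in this thread, with U+00A0 runs.
        String parsed = "-rw-r--r--\u00A0\u00A0\u00A0 1 root\u00A0\u00A0\u00A0\u00A0 root";
        System.out.println(normalize(parsed));
    }
}
```

The question marks in the printed output usually appear because the console or output encoding cannot represent U+00A0, not because the parsed text itself is wrong.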
Re: ParallelMultiSearcher Question
Hi, I found my problem: the remote server's index wasn't closed as expected. Also, after reopening the remote server's searcher, the client-side searcher has to be reconnected as well. Thanks. Youngho
----- Original Message -----
From: Youngho Cho [EMAIL PROTECTED]
To: Lucene Users List lucene-user@jakarta.apache.org
Sent: Friday, February 18, 2005 1:38 PM
Subject: Re: ParallelMultiSearcher Question
Hello Erik, Yes, I read it, and tried to close the remote index from both the remote server and the client. But when I search again, I receive an IOException: Bad file descriptor. Maybe I am doing something wrong. Is there a demo sample? Thanks. Youngho
----- Original Message -----
From: Erik Hatcher [EMAIL PROTECTED]
To: Lucene Users List lucene-user@jakarta.apache.org
Sent: Friday, February 18, 2005 11:47 AM
Subject: Re: ParallelMultiSearcher Question
If you close a Searcher that goes through a RemoteSearchable, you'll close the remote index. I learned this by experimentation for Lucene in Action and added a warning there: http://www.lucenebook.com/search?query=RemoteSearchable+close
On Feb 17, 2005, at 8:27 PM, Youngho Cho wrote:
Hello, Are there any pointers on how to close an index, and on how the server deals with index updates, when using ParallelMultiSearcher with the built-in RemoteSearchable? Need your help. Thanks, Youngho
----- Original Message -----
From: Youngho Cho [EMAIL PROTECTED]
To: Lucene Users List lucene-user@jakarta.apache.org
Sent: Thursday, February 17, 2005 6:29 PM
Subject: ParallelMultiSearcher Question
Hello, I would like to use ParallelMultiSearcher with a few RemoteSearchables. If one of the remote servers is down, can I call close() on the ParallelMultiSearcher and create a new ParallelMultiSearcher with the remaining live RemoteSearchables? Thanks. Youngho
Re: Re: The problem of using Cyber Neko HTML Parser parse HTML files
Thank you. But how can I view the correct output? If my HTML files use different encodings (such as UTF-8, ISO8859-1, GBK, JIS, etc.), how should I handle them?
--- Jason Polites [EMAIL PROTECTED] wrote:
This is not an unknown character.. it is a non breaking space (unicode value 0x00A0)
----- Original Message -----
From: Jingkang Zhang [EMAIL PROTECTED]
To: lucene-user@jakarta.apache.org
Sent: Friday, February 18, 2005 5:12 PM
Subject: The problem of using Cyber Neko HTML Parser parse HTML files
When I was using Cyber Neko HTML Parser parse HTML files( created by Microsoft word ), if the file contains HTML built-in entity references(for example: &nbsp;) , node value may contain unknown character. Like this:
source html:
<DIV>
<P class=MsoNormal style="MARGIN: 0cm 0cm 0pt 18pt"><SPAN lang=EN-US style="mso-bidi-font-size: 10.5pt"><FONT face="Times New Roman"><FONT size=3>-rw-r--r--<SPAN style="mso-spacerun: yes">&nbsp;&nbsp;&nbsp; </SPAN>1 root<SPAN style="mso-spacerun: yes">&nbsp;&nbsp;&nbsp;&nbsp; </SPAN>root<SPAN style="mso-spacerun: yes">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </SPAN>50 Jan 21 16:12 _1e.f6<o:p></o:p></FONT></FONT></SPAN></P>
</DIV>
after parsing html:
-rw-r--r--??1 root??? root50 Jan 21 16:12 _1e.f6
How can I avoid it?