Re: Zip Files
Thanks Ernesto. The issue I'm working with now (this is more lack of experience than anything) is getting an input I can index. All my indexing classes (doc, pdf, xml, ppt) take a File object as a parameter and return a Lucene Document containing all the fields I need. I'm struggling with how to work with an array of bytes instead of a Java File. It would be easier to unzip the archive to a temp directory, parse the files and then delete the directory, but that would greatly slow indexing and use up disk space. Luke - Original Message - From: "Ernesto De Santis" <[EMAIL PROTECTED]> To: "Lucene Users List" Sent: Tuesday, March 01, 2005 10:48 AM Subject: Re: Zip Files > Hello > > First, you need a parser for each file type: pdf, txt, word, etc., > and a Java API to iterate over the zip content. See: > > http://java.sun.com/j2se/1.4.2/docs/api/java/util/zip/ZipInputStream.html > > and use the getNextEntry() method. > > Little example (note that the assignment needs its own parentheses):
>
> ZipInputStream zis = new ZipInputStream(fileInputStream);
> ZipEntry zipEntry;
> while ((zipEntry = zis.getNextEntry()) != null) {
>     // use zipEntry to get the name, etc.
>     // pick the proper parser for the current entry
>     // use that parser with zis (the ZipInputStream)
> }
>
> Good luck > Ernesto > > Luke Shannon escribió: > > >Hello; > > > >Anyone have any ideas on how to index the contents within zip files? > > > >Thanks, > > > >Luke > > -- > Ernesto De Santis - Colaborativa.net > Córdoba 1147 Piso 6 Oficinas 3 y 4 > (S2000AWO) Rosario, SF, Argentina.
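[Editor's sketch] One possible way around the temp-directory problem discussed above: read each zip entry fully into memory and hand the bytes to the parsers as a stream, so nothing touches disk. This is only a sketch against the java.util.zip API; the DocumentParser interface is a hypothetical stand-in for stream-accepting overloads of the existing doc/pdf/xml/ppt indexing classes (they would need to be refactored from File to InputStream).

import java.io.*;
import java.util.zip.*;
import org.apache.lucene.document.Document;

public class ZipIndexer {

    // Hypothetical hook: the File-based indexing classes would need
    // an InputStream variant along these lines.
    public interface DocumentParser {
        Document parse(InputStream in, String entryName) throws IOException;
    }

    public static void indexZip(InputStream zipStream, DocumentParser parser)
            throws IOException {
        ZipInputStream zis = new ZipInputStream(zipStream);
        ZipEntry entry;
        while ((entry = zis.getNextEntry()) != null) {
            if (entry.isDirectory()) {
                continue;
            }
            // Copy the current entry into a byte array; read() returns -1
            // at the end of the entry, not at the end of the whole zip.
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = zis.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            // Wrap the bytes so the parser sees an ordinary InputStream.
            Document doc = parser.parse(
                    new ByteArrayInputStream(buf.toByteArray()), entry.getName());
            // ... hand doc to your IndexWriter here ...
        }
        zis.close();
    }
}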
Zip Files
Hello; Anyone have any ideas on how to index the contents within zip files? Thanks, Luke
Filtering Question
Hello; I'm trying to create a Filter that only retrieves documents with a path field containing a substring (or substrings). I can get the Filter to work if the BooleanQuery below (used to create the Filter) contains only TermQuerys (which requires me to know the exact path), but not if it contains WildcardQuerys. Here is the code that creates the filter:

// if the paths parameter is null we don't use the filter
boolean useFilter = false;
BooleanQuery filterParams = new BooleanQuery();
if (paths != null) {
    useFilter = true;
    Trace.DEBUG("The query will have a filter with " + paths.size() + " terms.");
    Iterator path = paths.iterator();
    while (path.hasNext()) {
        String strPath = "*" + (String) path.next() + "*";
        Trace.DEBUG(strPath + " is one of the params");
        filterParams.add(new WildcardQuery(new Term("path", strPath)), false, false);
    }
}
Trace.DEBUG("The filter is created using this: " + filterParams);
Filter pathFilter = new QueryFilter(filterParams);
Trace.DEBUG("The filter is " + pathFilter.toString());

When useFilter is true, the search is executed with pathFilter as the second parameter -> hits = searcher.search(query, pathFilter); Here is a small output. Without the filter this query returns all the documents in the index; with it, 6 should come back. *testing* is one of the params The filter is created using this: path:*testing* The filter is QueryFilter(path:*testing*) The query: olfaithfull:*stillhere* returned 0 Why won't this work? Thanks, Luke
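[Editor's sketch] For reference, here is a minimal, self-contained version of the same pattern (Lucene 1.4-era API) that does return a hit. If this passes but the real index doesn't, a likely suspect is that the stored path terms don't contain the substring in the exact case being matched, since WildcardQuery does no analysis of its term; the field values below are made up for the test.

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;
import org.apache.lucene.store.RAMDirectory;

public class PathFilterDemo {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(), true);

        Document doc = new Document();
        // Keyword fields are stored untokenized, so the wildcard pattern
        // must match the whole term, including case.
        doc.add(Field.Keyword("path", "/content/testing/docs/a.html"));
        doc.add(Field.Keyword("olfaithfull", "stillhere"));
        writer.addDocument(doc);
        writer.close();

        // Filter: the path term must contain "testing" somewhere.
        Filter pathFilter = new QueryFilter(
                new WildcardQuery(new Term("path", "*testing*")));

        IndexSearcher searcher = new IndexSearcher(dir);
        Query query = new WildcardQuery(new Term("olfaithfull", "*stillhere*"));
        Hits hits = searcher.search(query, pathFilter);
        System.out.println("returned " + hits.length()); // expect 1
    }
}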
Re: MultiField Queries without the QueryParser
Responding to my own post. Please disregard. Sorry. - Original Message - From: "Luke Shannon" <[EMAIL PROTECTED]> To: "Lucene Users List" Sent: Tuesday, February 22, 2005 5:16 PM Subject: MultiField Queries without the QueryParser > Hello; > > The book mentions the MultiFieldQueryParser as one way of dealing with > multifield queries. Can someone point me in the direction of other ways? > > Thanks, > > Luke
MultiField Queries without the QueryParser
Hello; The book mentions the MultiFieldQueryParser as one way of dealing with multifield queries. Can someone point me in the direction of other ways? Thanks, Luke
Re: Optional Terms in a single query
Hi Todd; Thanks for your help. I was able to do what you said, though in a much uglier way, using a BooleanQuery and adding WildcardQuerys. The end result looks like this: The query: +(type:138) +((-name:*tim* -name:*bill* -name:*harry* +olfaithfull:stillhere)) But this one works as expected. Thanks! Luke - Original Message - From: "Todd VanderVeen" <[EMAIL PROTECTED]> To: "Lucene Users List" Sent: Monday, February 21, 2005 6:26 PM Subject: Re: Optional Terms in a single query > Luke Shannon wrote: > > >The API I'm working with combines a series of queries into one larger one > >using a boolean query. > > > >Queries on the same field get OR'd into one big query. All remaining queries > >are AND'd with this big one. > > > >Working within this system I have: > > > >arg = (mario luigi bobby joe) // I do have control of how this list is created > > > >I pass this to the QueryParser: > > > >Query query1 = QueryParser.parse(arg, "name", new StandardAnalyzer()); > >Query query2 = QueryParser.parse("stillhere", "olfaithfull", new StandardAnalyzer()); > >BooleanQuery typeNegativeSearch = new BooleanQuery(); > >typeNegativeSearch.add(query1, false, true); > >typeNegativeSearch.add(query2, true, false); > > > >This is half the query. > > > >It gets AND'd with the other half, to create what you see below: > > > >+(type:181) +((-(name:tim name:harry name:bill) +olfaithfull:stillhere)) > > > >What I am having trouble with is getting the QueryParser to create > >this: -name:(tim bill harry) > > > >I feel like this is something simple, but for some reason I can't figure it > >out. > > > >Thanks, > > > >Luke > > Is the API something you control? > > Let's call the other half of your query query3. To avoid the extra nesting > you need to do the composition in a single boolean query. > > Query query1 = QueryParser.parse(arg, "name", new StandardAnalyzer()); > Query query2 = QueryParser.parse("stillhere", "olfaithfull", new StandardAnalyzer()); > Query query3 = > > BooleanQuery finalQuery = new BooleanQuery(); > finalQuery.add(query1, false, true); > finalQuery.add(query2, true, false); > finalQuery.add(query3, true, false); > > Cheers, > Todd VanderVeen
Re: Optional Terms in a single query
The API I'm working with combines a series of queries into one larger one using a boolean query. Queries on the same field get OR'd into one big query. All remaining queries are AND'd with this big one. Working within this system I have: arg = (mario luigi bobby joe) // I do have control of how this list is created. I pass this to the QueryParser: Query query1 = QueryParser.parse(arg, "name", new StandardAnalyzer()); Query query2 = QueryParser.parse("stillhere", "olfaithfull", new StandardAnalyzer()); BooleanQuery typeNegativeSearch = new BooleanQuery(); typeNegativeSearch.add(query1, false, true); typeNegativeSearch.add(query2, true, false); This is half the query. It gets AND'd with the other half, to create what you see below: +(type:181) +((-(name:tim name:harry name:bill) +olfaithfull:stillhere)) What I am having trouble with is getting the QueryParser to create this: -name:(tim bill harry) I feel like this is something simple, but for some reason I can't figure it out. Thanks, Luke - Original Message - From: "Todd VanderVeen" <[EMAIL PROTECTED]> To: "Lucene Users List" Sent: Monday, February 21, 2005 5:33 PM Subject: Re: Optional Terms in a single query > Luke Shannon wrote: > > >Hi; > > > >I'm trying to create a query that looks for a field containing type:181 and > >a name that doesn't contain tim, bill or harry. > > > >+(type: 181) +((-name: tim -name:bill -name:harry +oldfaith:stillHere)) > >+(type: 181) +((-name: tim OR bill OR harry +oldfaith:stillHere)) > >+(type: 181) +((-name:*(tim bill harry)* +olfaithfull:stillhere)) > >+(type:1 81) +((-name:*(tim OR bill OR harry)* +olfaithfull:stillhere)) > > > >I would really like to do this all in one Query. Is this even possible? > > > >Thanks, > > > >Luke > > Are all the queries listed attempts at the same thing? > > I'm guessing you want this: > > +type:181 -name:(tim bill harry) +oldfaith:stillHere
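[Editor's sketch] Picking up Todd's suggestion, here is what building the flat form directly might look like (Lucene 1.4-era API, with its BooleanQuery.add(query, required, prohibited) signature). The prohibited flag is applied once to the OR'd group of names, which is exactly what -name:(tim bill harry) parses to; field names and values are taken from the thread.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class FlatNegation {
    public static Query build() throws Exception {
        // OR the names together; the group as a whole is then prohibited once.
        Query names = QueryParser.parse("tim bill harry", "name",
                new StandardAnalyzer());

        BooleanQuery q = new BooleanQuery();
        q.add(new TermQuery(new Term("type", "181")), true, false);   // required
        q.add(names, false, true);                                    // prohibited
        q.add(new TermQuery(new Term("olfaithfull", "stillhere")),
                true, false);                                         // required
        // toString gives: +type:181 -(name:tim name:bill name:harry)
        //                 +olfaithfull:stillhere
        return q;
    }
}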
Re: Optional Terms in a single query
Sorry about the typos. What I would like is a document with a type field = 181, olfaithfull = stillHere and a name field not containing tim, bill or harry. Thanks, Luke - Original Message - From: "Paul Elschot" <[EMAIL PROTECTED]> To: Sent: Monday, February 21, 2005 5:31 PM Subject: Re: Optional Terms in a single query > On Monday 21 February 2005 23:23, Luke Shannon wrote: > > Hi; > > > > I'm trying to create a query that looks for a field containing type:181 and > > a name that doesn't contain tim, bill or harry. > > type: 181 -(name: tim name:bill name:harry) > > > +(type: 181) +((-name: tim -name:bill -name:harry +oldfaith:stillHere)) > > stillHere is normally lowercased before searching. Is that ok? > > > +(type: 181) +((-name: tim OR bill OR harry +oldfaith:stillHere)) > > +(type: 181) +((-name:*(tim bill harry)* +olfaithfull:stillhere)) > > typo? olfaithfull > > > +(type:1 81) +((-name:*(tim OR bill OR harry)* +olfaithfull:stillhere)) > > typo? (type:1 81) > > > I would really like to do this all in one Query. Is this even possible? > > How would you want to combine the results? > > Regards, > Paul Elschot
Optional Terms in a single query
Hi; I'm trying to create a query that looks for a field containing type:181 and a name that doesn't contain tim, bill or harry. +(type: 181) +((-name: tim -name:bill -name:harry +oldfaith:stillHere)) +(type: 181) +((-name: tim OR bill OR harry +oldfaith:stillHere)) +(type: 181) +((-name:*(tim bill harry)* +olfaithfull:stillhere)) +(type:1 81) +((-name:*(tim OR bill OR harry)* +olfaithfull:stillhere)) I would really like to do this all in one Query. Is this even possible? Thanks, Luke
Handling Synonyms
Hello; Does anyone see a problem with the following approach? For synonyms, rather than putting them in the index, I put the original term and all the synonyms in the query. Every time I create a query, I check if the term has any synonyms. If it does, I create a BooleanQuery, OR'ing one Query object for each synonym. So if I have a synonym list: red = colour, primary, stop And someone wants to search the desc field for red, I would end up with something like: ( (desc:*red*) (desc:*colour*) (desc:*stop*) ). The synonyms wouldn't be in the index; the Query would account for all the possible synonym terms. Luke
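[Editor's sketch] What that expansion might look like (Lucene 1.4-era API). The synonym list is assumed to already contain the original term, and the leading/trailing wildcards mirror the example above, with the usual caveat raised elsewhere in this archive that leading wildcards force a scan of every term in the field.

import java.util.Iterator;
import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

public class QueryTimeSynonyms {
    // e.g. expand("desc", [red, colour, primary, stop])
    //  ->  (desc:*red* desc:*colour* desc:*primary* desc:*stop*)
    public static Query expand(String field, List termAndSynonyms) {
        BooleanQuery q = new BooleanQuery();
        for (Iterator it = termAndSynonyms.iterator(); it.hasNext();) {
            String term = (String) it.next();
            // Optional clauses (required=false, prohibited=false) act as OR.
            q.add(new WildcardQuery(new Term(field, "*" + term + "*")),
                    false, false);
        }
        return q;
    }
}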
More Analyzer Question
I have created an Analyzer that I think should just be converting to lower case and adding synonyms in the index (it is at the end of this email). The problem is, after running it I get one more result than I was expecting (Document 1 should not be there): Running testNameCombination1, expecting: 1 result The query: +(type:138) +(name:mario*) returned 2 Start Listing documents: Document: 0 contains: Name: Text Desc: Text Document: 1 contains: Name: Text Desc: Text End Listing documents Those same 2 documents in Luke look like this: Document 0 Text Text Document 1 Text Text That looks correct to me. The query shouldn't match Document 1. The analyzer used on this field is below and is applied like so:

// set the default
PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new SynonymAnalyzer(new FBSynonymEngine()));
// the analyzer for the name field (only converts to lower case and adds synonyms)
analyzer.addAnalyzer("name", new KeywordSynonymAnalyzer(new FBSynonymEngine()));

Any help would be appreciated. Thanks, Luke

import org.apache.lucene.analysis.*;
import java.io.Reader;

public class KeywordSynonymAnalyzer extends Analyzer {
    private SynonymEngine engine;

    public KeywordSynonymAnalyzer(SynonymEngine engine) {
        this.engine = engine;
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new SynonymFilter(new LowerCaseTokenizer(reader), engine);
        return result;
    }
}

Luke Shannon | Software Developer FutureBrand Toronto 207 Queen's Quay, Suite 400 Toronto, ON, M5J 1A7 416 642 7935 (office)
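[Editor's sketch] One way to chase down the extra match is to dump exactly what the analyzer emits for the name field of both documents; if a synonym of "mario" sneaks into Document 1's token stream, the hit is explained. A small debugging helper using the Lucene 1.4-era TokenStream API:

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class AnalyzerDebugger {
    // Print every token the analyzer produces for the given field/text,
    // to verify what actually ends up in the index (or in the query).
    public static void dump(Analyzer analyzer, String field, String text)
            throws Exception {
        TokenStream stream = analyzer.tokenStream(field, new StringReader(text));
        Token token;
        while ((token = stream.next()) != null) {
            System.out.println(token.termText());
        }
        stream.close();
    }
}

Usage would be something like dump(new KeywordSynonymAnalyzer(new FBSynonymEngine()), "name", nameValueOfDocument1), with the value copied from the index.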
Re: Analyzing Advise
This is exactly what I was looking for. Thanks - Original Message - From: "Steven Rowe" <[EMAIL PROTECTED]> To: "Lucene Users List" Sent: Friday, February 18, 2005 4:41 PM Subject: Re: Analyzing Advice > Luke Shannon wrote: > > But now that I'm looking at the API I'm not sure I can specify a > > different analyzer when creating a field. > > Is PerFieldAnalyzerWrapper what you're looking for? > > http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html > > Steve
Analyzing Advice
Hi; I've run into a situation where my synonyms weren't working for a particular field. When I looked at the index I noticed the field was a Keyword, thus not tokenized. The problem is that when I switched that field to Text (now tokenized with my SynonymAnalyzer) a bunch of queries broke that were testing for fields starting with or ending with a specific string. My SynonymAnalyzer wraps a StandardAnalyzer, which acts as I would like for all fields but this one. I don't want to change the tokenizing behavior for every field; only this one field's data must remain unaltered. I was hoping to make an Analyzer that just applied the synonyms, which I could use on the one field when I added it to the Document. But now that I'm looking at the API I'm not sure I can specify a different analyzer when creating a field. Any tips? Thanks, Luke
Re: Lucene in the Humanities
Nice work, Erik. I would like to spend more time playing with it, but I saw a few things I really liked. When a specific query turns up no results you prompt the client to perform a free-form search. Less savvy search users will benefit from this strategy. I also like the display of information when you select a result. Everything is at your fingertips without clutter. I did get this error when a name search failed to turn up results and I clicked 'help' in the free-form search row (the second row). Here is my browser info: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0 Below are the details from the error: Page 'help-freeform.html' not found in application namespace. Stack Trace:

org.apache.tapestry.resolver.PageSpecificationResolver.resolve(PageSpecificationResolver.java:120)
org.apache.tapestry.pageload.PageSource.getPage(PageSource.java:144)
org.apache.tapestry.engine.RequestCycle.getPage(RequestCycle.java:195)
org.apache.tapestry.engine.PageService.service(PageService.java:73)
org.apache.tapestry.engine.AbstractEngine.service(AbstractEngine.java:872)
org.apache.tapestry.ApplicationServlet.doService(ApplicationServlet.java:197)
org.apache.tapestry.ApplicationServlet.doGet(ApplicationServlet.java:158)
javax.servlet.http.HttpServlet.service(HttpServlet.java:740)
javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:247)
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:193)
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:256)
org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
org.apache.catalina.core.StandardContext.invoke(StandardContext.java:2422)
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:180)
org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
org.apache.catalina.valves.ErrorDispatcherValve.invoke(ErrorDispatcherValve.java:171)
org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:641)
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:163)
org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:641)
org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:174)
org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
org.apache.ajp.tomcat4.Ajp13Processor.process(Ajp13Processor.java:457)
org.apache.ajp.tomcat4.Ajp13Processor.run(Ajp13Processor.java:576)
java.lang.Thread.run(Thread.java:534)

Luke - Original Message - From: "Erik Hatcher" <[EMAIL PROTECTED]> To: "Lucene User" Sent: Friday, February 18, 2005 2:46 PM Subject: Lucene in the Humanities > It's about time I actually did something real with Lucene :) > > I have been working with the Applied Research in Patacriticism group at > the University of Virginia for a few months and am finally ready to > present what I've been doing. The primary focus of my group is working > with the Rossetti Archive - poems, artwork, interpretations, > collections, and so on of Dante Gabriel Rossetti. I was initially > brought on to build a collection and exhibit system, though I got > detoured a bit as I got involved in applying Lucene to the archive to > replace their existing search system. The existing system used an old > version of Tamino with XPath queries. Tamino is not at fault here, at > least not entirely, because our data is in a very complicated set of > XML files with a lot of non-normalized and legacy metadata - getting at > things via XPath is challenging and practically impossible in many > cases. > > My work is now presentable at > > http://www.rossettiarchive.org/rose > > (rose is for ROsetti SEarch
Re: Query Question
Thanks Erik. Option 2 sounds like the path of least resistance. Luke - Original Message - From: "Erik Hatcher" <[EMAIL PROTECTED]> To: "Lucene Users List" Sent: Thursday, February 17, 2005 9:05 PM Subject: Re: Query Question > On Feb 17, 2005, at 5:51 PM, Luke Shannon wrote: > > My manager is now totally stuck on being able to query data with * > > in it. > > He's gonna have to wait a bit longer; you've got a slightly tricky > situation on your hands. > > > WildcardQuery(new Term("name", "*home\**")); > > The \* is the problem. WildcardQuery doesn't deal with escaping like > you're trying. Your query is essentially this now: > > home\* > > where the backslash has no special meaning at all... you're literally > looking for all terms that start with home followed by a backslash. > Two asterisks at the end really collapse into a single one logically. > > > Any theories as to why it would not match: > > > > Document (relevant fields): > > Keyword > > Keyword > > > > Is the \ escaping both * characters? > > So, again, no escaping is being done here. You're a bit stuck in this > situation because * (and ?) are special to WildcardQuery, and it does > no escaping. Two options I can think of: > > - Build your own clone of WildcardQuery that does escaping - or > perhaps change the wildcard characters to something you do not index > and use those instead. > > - Replace asterisks in the terms indexed with some other non-wildcard > character, then replace it in your queries as appropriate. > > Erik
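[Editor's sketch] A sketch of Erik's second option: substitute literal asterisks with a placeholder character at index time and again when building the query, so WildcardQuery never sees a literal *. The choice of placeholder is an assumption; it just has to be a character that never occurs in your data and isn't a wildcard.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

public class LiteralStarWorkaround {
    // Placeholder for literal '*' characters (assumption: never in the data).
    private static final char STAR_SUBSTITUTE = '\u0001';

    // Apply to the field value before indexing it as a Keyword.
    public static String encode(String raw) {
        return raw.replace('*', STAR_SUBSTITUTE);
    }

    // "Contains" query for a value holding a literal asterisk: user input
    // home* becomes *home<placeholder>* against the encoded terms.
    public static Query contains(String field, String literalValue) {
        return new WildcardQuery(new Term(field, "*" + encode(literalValue) + "*"));
    }
}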
Re: Query Question
Hello; My manager is now totally stuck on being able to query data with * in it. Here are two queries: TermQuery(new Term("type", "203")); WildcardQuery(new Term("name", "*home\**")); They are joined in a boolean query. That query gives this result when you call toString(): +(type:203) +(name:*home\**) This looks right to me. Any theories as to why it would not match: Document (relevant fields): Keyword Keyword Is the \ escaping both * characters? Thanks, Luke - Original Message - From: "Luke Shannon" <[EMAIL PROTECTED]> To: "Lucene Users List" Sent: Thursday, February 17, 2005 2:44 PM Subject: Query Question > Hello; > > Why won't this query find the document below? > > Query: > +(type:203) +(name:*home\**) > > Document (relevant fields): > Keyword > Keyword > > I was hoping that by escaping the * it would be treated as a literal. What am I > doing wrong? > > Thanks, > > Luke
Re: Query Question
That is the query's toString(). I created the query using a WildcardQuery object. Luke - Original Message - From: "Erik Hatcher" <[EMAIL PROTECTED]> To: "Lucene Users List" Sent: Thursday, February 17, 2005 3:00 PM Subject: Re: Query Question > > On Feb 17, 2005, at 2:44 PM, Luke Shannon wrote: > > > Hello; > > > > Why won't this query find the document below? > > > > Query: > > +(type:203) +(name:*home\**) > > Is that what the query toString is? Or is that what you handed to > QueryParser? > > Depending on your analyzer, 203 may go away. QueryParser doesn't > support leading asterisks, so "*home" would fail to parse. > > > Document (relevant fields): > > Keyword > > Keyword > > > > I was hoping that by escaping the * it would be treated as a literal. > > What am I doing wrong?
Query Question
Hello; Why won't this query find the document below? Query: +(type:203) +(name:*home\**) Document (relevant fields): Keyword Keyword I was hoping that by escaping the * it would be treated as a literal. What am I doing wrong? Thanks, Luke
Searches Containing Special Characters
Hi All; How could I handle doing a wildcard search on the input *mario? Basically I would be interested in finding all the Documents containing *mario. Here is an example of such a generated query: +(type:138) +(name:**mario*) How can I let Lucene know that the star closest to mario on the left is to be treated as a literal character, not a wildcard? Thanks, Luke
Re: Negative Match
Thanks Erik. This is indeed the way to go. - Original Message - From: "Erik Hatcher" <[EMAIL PROTECTED]> To: "Lucene Users List" Sent: Friday, February 11, 2005 10:25 AM Subject: Re: Negative Match > > On Feb 11, 2005, at 9:52 AM, Luke Shannon wrote: > > > Hey Erik; > > > > The problem with that approach is I get documents that don't have a > > kcfileupload field. This makes sense because these documents don't > > match the prohibited clause, but it doesn't fit with the requirements of the system. > > Ok, so instead of using the dummy field with a single dummy value, use > a dummy field to list the field names: > Field.Keyword("fields", "kcfileupload"), but only for the documents that > should have it, of course. Then use a query like (using QueryParser > syntax, but do it with the API as you have, since QueryParser doesn't > support leading wildcards): > > +fields:kcfileupload -kcfileupload:*jpg* > > Again, your approach is risky with term expansion. Get more than 1,024 > unique kcfileupload values and you'll see! > > Erik > > > > What I like best about this approach is it doesn't require a filter. > > The system I integrate with is presently designed to accept a query object. I > > wasn't looking forward to having to add the possibility that queries might > > require filters. I may have to still do this, but for now I would like to > > try this and see how it goes. > > > > Thanks, > > > > Luke > > > > - Original Message - > > From: "Erik Hatcher" <[EMAIL PROTECTED]> > > To: "Lucene Users List" > > Sent: Thursday, February 10, 2005 7:23 PM > > Subject: Re: Negative Match > > > >> On Feb 10, 2005, at 4:06 PM, Luke Shannon wrote: > >>> I think I found a pretty good way to do a negative match. > >>> > >>> In this query I am looking for all the Documents that have a > >>> kcfileupload field with any value except for jpg. > >>> > >>> Query negativeMatch = new WildcardQuery(new Term("kcfileupload", "*jpg*")); > >>> BooleanQuery typeNegAll = new BooleanQuery(); > >>> Query allResults = new WildcardQuery(new Term("kcfileupload", "*")); > >>> IndexSearcher searcher = new IndexSearcher(fsDir); > >>> BooleanClause clause = new BooleanClause(negativeMatch, false, true); > >>> typeNegAll.add(allResults, true, false); > >>> typeNegAll.add(clause); > >>> Hits hits = searcher.search(typeNegAll); > >>> > >>> With the little testing I have done this *seems* to work. Does anyone > >>> see a problem with this approach? > >> > >> Sure... do you realize what WildcardQuery does under the covers? It > >> literally expands to a BooleanQuery for all terms that match the > >> pattern. There is an adjustable limit built in of 1,024 clauses to > >> BooleanQuery. You obviously have not hit that limit... yet! > >> > >> You're better off using the advice offered on this thread > >> previously: create a single dummy field with a fixed value for all > >> documents. Combine a TermQuery for that dummy value with a prohibited > >> clause like your negativeMatch above. > >> > >> Erik
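[Editor's sketch] What Erik's "fields" approach might look like end to end: at index time, record the names of the optional fields each document actually carries; at search time, require presence via that bookkeeping field and prohibit the unwanted value. Field names and values come from the thread; the surrounding document-building code is an assumption.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.WildcardQuery;

public class FieldPresenceQuery {
    // Index time: note which optional fields this document carries.
    public static void addUpload(Document doc, String uploadValue) {
        if (uploadValue != null) {
            doc.add(Field.Keyword("kcfileupload", uploadValue));
            doc.add(Field.Keyword("fields", "kcfileupload"));
        }
    }

    // Search time: +fields:kcfileupload -kcfileupload:*jpg*
    public static Query hasUploadButNotJpg() {
        BooleanQuery q = new BooleanQuery();
        q.add(new TermQuery(new Term("fields", "kcfileupload")), true, false);
        q.add(new WildcardQuery(new Term("kcfileupload", "*jpg*")), false, true);
        return q;
    }
}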
Re: Negative Match
Hey Erik; The problem with that approach is I get documents that don't have a kcfileupload field. This makes sense because these documents don't match the prohibited clause, but it doesn't fit with the requirements of the system. What I like best about this approach is it doesn't require a filter. The system I integrate with is presently designed to accept a query object. I wasn't looking forward to having to add the possibility that queries might require filters. I may have to still do this, but for now I would like to try this and see how it goes. Thanks, Luke - Original Message - From: "Erik Hatcher" <[EMAIL PROTECTED]> To: "Lucene Users List" Sent: Thursday, February 10, 2005 7:23 PM Subject: Re: Negative Match > > On Feb 10, 2005, at 4:06 PM, Luke Shannon wrote: > > > I think I found a pretty good way to do a negative match. > > > > In this query I am looking for all the Documents that have a > > kcfileupload field with any value except for jpg. > > > > Query negativeMatch = new WildcardQuery(new Term("kcfileupload", "*jpg*")); > > BooleanQuery typeNegAll = new BooleanQuery(); > > Query allResults = new WildcardQuery(new Term("kcfileupload", "*")); > > IndexSearcher searcher = new IndexSearcher(fsDir); > > BooleanClause clause = new BooleanClause(negativeMatch, false, true); > > typeNegAll.add(allResults, true, false); > > typeNegAll.add(clause); > > Hits hits = searcher.search(typeNegAll); > > > > With the little testing I have done this *seems* to work. Does anyone > > see a problem with this approach? > > Sure... do you realize what WildcardQuery does under the covers? It > literally expands to a BooleanQuery for all terms that match the > pattern. There is an adjustable limit built in of 1,024 clauses to > BooleanQuery. You obviously have not hit that limit... yet! > > You're better off using the advice offered on this thread > previously: create a single dummy field with a fixed value for all > documents. Combine a TermQuery for that dummy value with a prohibited > clause like your negativeMatch above. > > Erik
Negative Match
I think I found a pretty good way to do a negative match. In this query I am looking for all the Documents that have a kcfileupload field with any value except for jpg. Query negativeMatch = new WildcardQuery(new Term("kcfileupload", "*jpg*")); BooleanQuery typeNegAll = new BooleanQuery(); Query allResults = new WildcardQuery(new Term("kcfileupload", "*")); IndexSearcher searcher = new IndexSearcher(fsDir); BooleanClause clause = new BooleanClause(negativeMatch, false, true); typeNegAll.add(allResults, true, false); typeNegAll.add(clause); Hits hits = searcher.search(typeNegAll); With the little testing I have done this *seems* to work. Does anyone see a problem with this approach? Thanks, Luke
Re: Problem searching Field.Keyword field
Are there any issues with having a bunch of boolean queries and then adding them to one big boolean query (making them all required)? Or should I be looking at Query.combine()? Thanks, Luke - Original Message - From: "Erik Hatcher" <[EMAIL PROTECTED]> To: "Lucene Users List" Sent: Tuesday, February 08, 2005 12:02 PM Subject: Re: Problem searching Field.Keyword field Kelvin - I respectfully disagree - could you elaborate on why this is not an appropriate use of Field.Keyword? If the category is "How To", Field.Text would split this (depending on the Analyzer) into "how" and "to". If the user is selecting a category from a drop-down, though, you shouldn't be using QueryParser on it, but instead aggregating a TermQuery("category", "How To") into a BooleanQuery with the rest of it. The rest may be other API-created clauses and likely a piece from QueryParser. Erik On Feb 8, 2005, at 11:28 AM, Kelvin Tan wrote: > As I posted previously, Field.Keyword is appropriate in only certain > situations. For your use-case, I believe Field.Text is more suitable. > > k > > On Tue, 8 Feb 2005 10:02:19 -0600, Mike Miller wrote: >> This may or may not be correct, but I am indexing it as a keyword >> because I provide a (required) radio button on the add screen for >> the user to determine which category the document should be >> assigned. Then in the search, provide a dropdown that can be used >> in the advanced search so that they can search only for a specific >> category of documents (like HowTo, Troubleshooting, etc). >> >> -Original Message- >> From: Kelvin Tan [mailto:[EMAIL PROTECTED]] Sent: Tuesday, >> February 08, 2005 9:32 AM To: Lucene Users List >> Subject: RE: Problem searching Field.Keyword field >> >> Mike, is there a reason why you're indexing "category" as keyword >> not text? >> >> k >> >> On Tue, 8 Feb 2005 08:26:13 -0600, Mike Miller wrote: >> >>> Thanks for the quick response. >>> >>> Sorry for my lack of understanding, but I am learning! Won't the >>> query parser still handle this query? My limited understanding >>> was that the search call provides the 'all' field as the default >>> field for query terms in the case where fields aren't specified. >>> Using the current code, searches like author:Mike and >>> title:Lucene work fine. >>> >>> -Original Message- >>> From: Miles Barr [mailto:[EMAIL PROTECTED]] Sent: >>> Tuesday, February 08, 2005 8:08 AM To: Lucene Users List Subject: >>> Re: Problem searching Field.Keyword field >>> >>> You're using the query parser with the standard analyser. You >>> should construct a term query manually instead. >>> >>> -- >>> Miles Barr <[EMAIL PROTECTED]> Runtime Collective Ltd.
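[Editor's sketch] On the question above: nesting BooleanQuerys inside one top-level BooleanQuery with every clause required works fine and is the usual approach; as far as I can tell, Query.combine() addresses a different problem (merging equivalent queries, e.g. during rewriting), not AND-composition. A short sketch using the 1.4-era add(query, required, prohibited) signature; the main practical limit is the 1,024-clause cap, which also applies to the BooleanQuerys that WildcardQuerys rewrite into.

import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

public class QueryComposer {
    // Combine several sub-queries so that every one must match.
    public static Query allRequired(Query[] parts) {
        BooleanQuery combined = new BooleanQuery();
        for (int i = 0; i < parts.length; i++) {
            combined.add(parts[i], true, false); // required, not prohibited
        }
        return combined;
    }
}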
Re: Starts With x and Ends With x Queries
I implemented this concept for my ends-with query. It works very well! - Original Message - From: "Chris Hostetter" <[EMAIL PROTECTED]> To: "Lucene Users List" Sent: Friday, February 04, 2005 9:37 PM Subject: Re: Starts With x and Ends With x Queries > > : Also keep in mind that QueryParser only allows a trailing asterisk, > : creating a PrefixQuery. However, if you use a WildcardQuery directly, > : you can use an asterisk as the starting character (at the risk of > : performance). > > On the issue of "ends with" wildcard queries, I wanted to throw out an > idea that I've seen used to deal with matches like this in other systems. > I've never actually tried this with Lucene, but I've seen it used > effectively with other systems where the goal is to "sort" strings by the > least significant (ie: right most) characters first. I think it could > apply nicely to people who have compelling needs for efficient 'ends with' > queries. > > Imagine you have a field called name, which you can already do efficient > prefix matching on using the PrefixQuery class. Your docs and query may > look something like this... > > D1> name:"Adam Smith" age:13 state:CA ... > D2> name:"Joe Bob" age:42 state:WA ... > D3> name:"John Adams" age:35 state:NV ... > D3> name:"Sue Smith" age:33 state:CA ... > > ...and your queries may look something like... > > Query q1 = new PrefixQuery(new Term("name","J")); > Query q2 = new PrefixQuery(new Term("name","Sue")); > > If you want to start doing suffix queries (ie: all names ending with > "s", or all names ending with "Smith"), one approach would be to use > WildcardQuery, which as Erik mentioned will allow you to use a query Term > that starts with a "*". ie... > > Query q3 = new WildcardQuery(new Term("name","*s")); > Query q4 = new WildcardQuery(new Term("name","*Smith")); > > (NOTE: Erik says you can do this, but the docs for WildcardQuery say you > can't; I'll assume the docs are wrong and Erik is correct.) > > The problem is that this is horrendously inefficient. In order to find > the docs that contain Terms which match your suffix, WildcardQuery must > first identify what all of those Terms are, by iterating over every Term > in your index to see if it matches the suffix. This is much slower than a > PrefixQuery, or even a WildcardQuery that has just 1 initial character > before a "*" (ie: "s*foobar"), because those can seek directly to the > first Term that starts with that character, and also stop iterating as > soon as they encounter a Term that no longer begins with that character. > > Which leads me to my point: if you denormalize your data so that you store > both the Term you want and the *reverse* of the term you want, then a > suffix query is just a prefix query on a reversed field -- by sacrificing > space, you can get all the speed efficiencies of a PrefixQuery when doing > a SuffixQuery... > > D1> name:"Adam Smith" rname:"htimS madA" age:13 state:CA ... > D2> name:"Joe Bob" rname:"boB oeJ" age:42 state:WA ... > D3> name:"John Adams" rname:"smadA nhoJ" age:35 state:NV ... > D3> name:"Sue Smith" rname:"htimS euS" age:33 state:CA ... > > Query q1 = new PrefixQuery(new Term("name","J")); > Query q2 = new PrefixQuery(new Term("name","Sue")); > Query q3 = new PrefixQuery(new Term("rname","s")); > Query q4 = new PrefixQuery(new Term("rname","htimS")); > > (If anyone sees a flaw in my theory, please chime in) > > -Hoss
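[Editor's sketch] Since Luke reports implementing the trick, here is what a compact version might look like, assuming the name field is indexed untokenized (Field.Keyword) so that prefixes run against whole values; java.lang.StringBuffer does the reversing.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;

public class SuffixSearch {
    public static String reverse(String s) {
        return new StringBuffer(s).reverse().toString();
    }

    // Index the value twice: once as-is, once reversed.
    public static void addNameFields(Document doc, String name) {
        doc.add(Field.Keyword("name", name));
        doc.add(Field.Keyword("rname", reverse(name)));
    }

    // "Starts with" stays a plain prefix query on the normal field...
    public static Query startsWith(String prefix) {
        return new PrefixQuery(new Term("name", prefix));
    }

    // ...and "ends with" becomes a cheap prefix match on the reversed field.
    public static Query endsWith(String suffix) {
        return new PrefixQuery(new Term("rname", reverse(suffix)));
    }
}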
Re: RangeQuery With Date
Bingo. Thanks! Luke - Original Message - From: "Chris Hostetter" <[EMAIL PROTECTED]> To: "Lucene Users List" Sent: Monday, February 07, 2005 5:10 PM Subject: Re: RangeQuery With Date > : Your dates need to be stored in lexicographical order for the RangeQuery > : to work. > : > : Index them using this date format: YYYYMMDD. > : > : Also, I'm not sure if the QueryParser can handle range queries with only > : one end point. You may need to create this query programmatically. > > And when creating them programmatically, you need to use the exact same > format they were indexed in. Assuming I've correctly guessed what your > indexing code looks like, you probably want... > > Query query = new RangeQuery(null, new Term("modified", "20041111"), false); > > -Hoss
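[Editor's sketch] Putting the thread's advice together: store the date as yyyyMMdd so lexicographic order equals chronological order, and an open-ended "before" range then needs no lower bound. A minimal sketch (1.4-era RangeQuery; the field name follows the thread).

import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.RangeQuery;

public class ModifiedBefore {
    private static final SimpleDateFormat FMT = new SimpleDateFormat("yyyyMMdd");

    // Index time: string order now matches date order.
    public static void addModified(Document doc, Date modified) {
        doc.add(Field.Keyword("modified", FMT.format(modified)));
    }

    // "Modified before the cutoff": null lower bound, exclusive upper bound.
    public static RangeQuery before(Date cutoff) {
        return new RangeQuery(null, new Term("modified", FMT.format(cutoff)), false);
    }
}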
RangeQuery With Date
Hi; I am working on a set of queries that allow you to find modification dates before, after and equal to a given date. Here are some of the "before" queries I have been playing with. I want a query that pulls up documents modified before Nov 11 2004: Query query = new RangeQuery(null, new Term("modified", "11/11/04"), false); This one doesn't work; it turns up all the documents in the index. Query query = QueryParser.parse("modified:[1/1/00 TO 11/11/04]", "subject", new StandardAnalyzer()); This works, but I don't like having to specify the begin date like this. Query query = QueryParser.parse("modified:[null TO 11/11/04]", "subject", new StandardAnalyzer()); This throws an exception. How are others doing a query like this? Thanks, Luke
Starts With x and Ends With x Queries
Hello; I have these two documents: Text Keyword Text Text Text Text Text Text Text Text Text Text Text Text Keyword Keyword Text Text Text Text Text Text Text Text Text Text I would like to be able to match name fields that start with testing (specifically) and those that end with it. I thought the code below would parse to a PrefixQuery that would satisfy my "starts with" requirement (maybe I don't understand what this query is for), but it matches both documents. Query query = QueryParser.parse("testing*", "name", new StandardAnalyzer()); Has anyone done this before? Any tips? Thanks, Luke
Re: Parsing The Query: Every document that doesn't have a field containing x (but still has the field)
Hello; I think Chris's approach might be helpful, but I can't seem to get it to work. Since I'm running out of time and I still need to figure out "starts with" and "ends with" queries, I have implemented a hacky solution for getting all documents with a kcfileupload field present that does not contain jpg: query1 = QueryParser.parse("jpg", "kcfileupload", new StandardAnalyzer()); query2 = QueryParser.parse("stillhere", "olfaithfull", new StandardAnalyzer()); // each document contains this BooleanQuery typeNegativeSearch = new BooleanQuery(); typeNegativeSearch.add(query1, false, true); typeNegativeSearch.add(query2, true, false); What gets returned are all the documents without a kcfileupload = jpg. This includes documents that don't even have a kcfileupload field. When I go through the results before displaying them I check to make sure there is a "kcfileupload" field. This is not a good solution, and I hope to replace it soon. If anyone has ideas please let me know. Luke - Original Message - From: "Chris Hostetter" <[EMAIL PROTECTED]> To: "Lucene Users List" Sent: Friday, February 04, 2005 3:03 PM Subject: Re: Parsing The Query: Every document that doesn't have a field containing x

Another approach... You can make a Filter that is the inverse of the output from another filter, which means you can make a QueryFilter on the search, then wrap it in your inverse Filter. You can't execute a query on a filter without having a Query object, but you can just apply the Filter directly to an IndexReader yourself, and get back a BitSet containing the docIds of every document that does not contain your term. Something like this should work (with the flip bounded by maxDoc(), the number of documents in the index)...

class NotFilter extends Filter {
    private Filter wrapped;
    public NotFilter(Filter w) {
        wrapped = w;
    }
    public BitSet bits(IndexReader r) throws IOException {
        BitSet b = wrapped.bits(r);
        b.flip(0, r.maxDoc());
        return b;
    }
}
...
BitSet results = (new NotFilter(new QueryFilter(new TermQuery(new Term("f", "x"))))).bits(reader);

: Date: Thu, 3 Feb 2005 19:51:36 +0100 : From: Kelvin Tan <[EMAIL PROTECTED]> : Reply-To: Lucene Users List : To: Lucene Users List : Subject: Re: Parsing The Query: Every document that doesn't have a field : containing x : : Alternatively, add a dummy field-value to all documents, like doc.add(Field.Keyword("foo", "bar")) : : Waste of space, but allows you to perform negated queries. : : On Thu, 03 Feb 2005 19:19:15 +0100, Maik Schreiber wrote: : >> Negating a term must be combined with at least one nonnegated : >> term to return documents; in other words, it isn't possible to : >> use a query like NOT term to find all documents that don't : >> contain a term. : >> : >> So does that mean the above example wouldn't work? : >> : > Exactly. You cannot search for "-kcfileupload:jpg", you need at : > least one clause that actually _includes_ documents. : > : > Do you by chance have a field with known contents? If so, you could : > misuse that one and include it in your query (perhaps by doing : > range or wildcard/prefix search). If not, try IndexReader.terms() : > for building a Query yourself, then use that one for search. -Hoss
Re: Parsing The Query: Every document that doesn't have a field containing x
Hi Chris; So the result would contain all documents that don't have a field f containing x? What I need to figure out is how to return all documents that have a field f where f does not contain x. Thanks for your post. Luke - Original Message - From: "Chris Hostetter" <[EMAIL PROTECTED]> To: "Lucene Users List" Sent: Friday, February 04, 2005 3:03 PM Subject: Re: Parsing The Query: Every document that doesn't have a field containing x

Another approach... You can make a Filter that is the inverse of the output from another filter, which means you can make a QueryFilter on the search, then wrap it in your inverse Filter. You can't execute a query on a filter without having a Query object, but you can just apply the Filter directly to an IndexReader yourself, and get back a BitSet containing the docIds of every document that does not contain your term. Something like this should work (with the flip bounded by maxDoc(), the number of documents in the index)...

class NotFilter extends Filter {
    private Filter wrapped;
    public NotFilter(Filter w) {
        wrapped = w;
    }
    public BitSet bits(IndexReader r) throws IOException {
        BitSet b = wrapped.bits(r);
        b.flip(0, r.maxDoc());
        return b;
    }
}
...
BitSet results = (new NotFilter(new QueryFilter(new TermQuery(new Term("f", "x"))))).bits(reader);

: Date: Thu, 3 Feb 2005 19:51:36 +0100 : From: Kelvin Tan <[EMAIL PROTECTED]> : Reply-To: Lucene Users List : To: Lucene Users List : Subject: Re: Parsing The Query: Every document that doesn't have a field : containing x : : Alternatively, add a dummy field-value to all documents, like doc.add(Field.Keyword("foo", "bar")) : : Waste of space, but allows you to perform negated queries. : : On Thu, 03 Feb 2005 19:19:15 +0100, Maik Schreiber wrote: : >> Negating a term must be combined with at least one nonnegated : >> term to return documents; in other words, it isn't possible to : >> use a query like NOT term to find all documents that don't : >> contain a term. : >> : >> So does that mean the above example wouldn't work? : >> : > Exactly. You cannot search for "-kcfileupload:jpg", you need at : > least one clause that actually _includes_ documents. : > : > Do you by chance have a field with known contents? If so, you could : > misuse that one and include it in your query (perhaps by doing : > range or wildcard/prefix search). If not, try IndexReader.terms() : > for building a Query yourself, then use that one for search. -Hoss
Re: Parsing The Query: Every document that doesn't have a field containing x
Very nice. Thanks! Luke - Original Message - From: <[EMAIL PROTECTED]> To: "Lucene Users List" Sent: Friday, February 04, 2005 2:12 AM Subject: Re: Parsing The Query: Every document that doesn't have a field containing x

I think you can use a filter to get the right result! See the example below.

package lia.advsearching;

import junit.framework.TestCase;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;

public class SecurityFilterTest extends TestCase {
    private RAMDirectory directory;

    protected void setUp() throws Exception {
        directory = new RAMDirectory();
        IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(), true);

        // Elwood
        Document document = new Document();
        document.add(Field.Keyword("owner", "elwood"));
        document.add(Field.Text("keywords", "elwoods sensitive info"));
        writer.addDocument(document);

        // Jake
        document = new Document();
        document.add(Field.Keyword("owner", "jake"));
        document.add(Field.Text("keywords", "jakes sensitive info"));
        writer.addDocument(document);

        writer.close();
    }

    public void testSecurityFilter() throws Exception {
        TermQuery query = new TermQuery(new Term("keywords", "info"));
        IndexSearcher searcher = new IndexSearcher(directory);
        Hits hits = searcher.search(query);
        assertEquals("Both documents match", 2, hits.length());

        QueryFilter jakeFilter = new QueryFilter(new TermQuery(new Term("owner", "jake")));
        hits = searcher.search(query, jakeFilter);
        assertEquals(1, hits.length());
        assertEquals("elwood is safe", "jakes sensitive info", hits.doc(0).get("keywords"));
    }
}

On Thu, 3 Feb 2005 13:04:50 -0500, Luke Shannon <[EMAIL PROTECTED]> wrote: > Hello; > > I have a query that finds documents that contain fields with a specific value. > > query1 = QueryParser.parse("jpg", "kcfileupload", new StandardAnalyzer()); > > This works well. > > I would like a query that finds documents containing all kcfileupload fields > that don't contain jpg. > > The example I found in the book that seems to relate shows me how to find > documents without a specific term: > > QueryParser parser = new QueryParser("contents", analyzer); > parser.setOperator(QueryParser.DEFAULT_OPERATOR_AND); > > But then it says: > > Negating a term must be combined with at least one nonnegated term to return > documents; in other words, it isn't possible to use a query like NOT term to > find all documents that don't contain a term. > > So does that mean the above example wouldn't work? > > The API says: > > a plus (+) or a minus (-) sign, indicating that the clause is required or > prohibited respectively; > > I have been playing around with using the minus character without much luck. > > Can someone point me in the right direction to figure this out? > > Thanks, > > Luke
Re: Parsing The Query: Every document that doesn't have a field containing x
Bingo! Nice catch. That was it. Made everything lowercase when I set the field. Works great now.

Thanks!

Luke

- Original Message -
From: "Kauler, Leto S" <[EMAIL PROTECTED]>
To: "Lucene Users List"
Sent: Thursday, February 03, 2005 6:48 PM
Subject: RE: Parsing The Query: Every document that doesn't have a field containing x

Because you are building from QueryParser rather than a TermQuery, all search terms in the query are being lowercased by StandardAnalyzer. So your query of "olFaithFull:stillhere" requires that there is an exact index term of "stillhere" in that field.

It depends on how you built the index (indexed and stored fields are different), but I would check on that. Also maybe try out TermQuery and see if that does anything for you.

> -Original Message-
> From: Luke Shannon [mailto:[EMAIL PROTECTED]
> Sent: Friday, 4 February 2005 10:47 AM
> To: Lucene Users List
> Subject: Re: Parsing The Query: Every document that doesn't
> have a field containing x
>
> "stillHere"
>
> Capital H.
>
> - Original Message -
> From: "Kauler, Leto S" <[EMAIL PROTECTED]>
> To: "Lucene Users List"
> Sent: Thursday, February 03, 2005 6:40 PM
> Subject: RE: Parsing The Query: Every document that doesn't
> have a field containing x
>
> First thing that jumps out is case-sensitivity. Does your
> olFaithFull field contain "stillHere" or "stillhere"?
>
> --Leto
>
> > -Original Message-
> > From: Luke Shannon [mailto:[EMAIL PROTECTED]
> > This works:
> >
> > query1 = QueryParser.parse("jpg", "kcfileupload", new StandardAnalyzer());
> > query2 = QueryParser.parse("stillHere", "olFaithFull", new StandardAnalyzer());
> > BooleanQuery typeNegativeSearch = new BooleanQuery();
> > typeNegativeSearch.add(query1, false, false);
> > typeNegativeSearch.add(query2, false, false);
> >
> > It returns 9 results. Its string form is: kcfileupload:jpg
> > olFaithFull:stillhere
> >
> > But this:
> >
> > query1 = QueryParser.parse("jpg", "kcfileupload", new StandardAnalyzer());
> > query2 = QueryParser.parse("stillHere", "olFaithFull", new StandardAnalyzer());
> > BooleanQuery typeNegativeSearch = new BooleanQuery();
> > typeNegativeSearch.add(query1, true, false);
> > typeNegativeSearch.add(query2, true, false);
> >
> > Returns 0 results and has the string form: +kcfileupload:jpg
> > +olFaithFull:stillhere
> >
> > If I do the query kcfileupload:jpg in Luke I get 9 docs, each doc
> > containing an olFaithFull:stillHere. Why would
> > +kcfileupload:jpg +olFaithFull:stillhere return no results?
> >
> > Thanks,
> >
> > Luke

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
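The mismatch Leto spotted is easy to reproduce. A minimal sketch (Lucene 1.4-era API; field and term names from this thread) showing why the parsed query missed while a TermQuery would not:

// QueryParser runs the term through StandardAnalyzer, which lowercases it:
Query parsed = QueryParser.parse("stillHere", "olFaithFull", new StandardAnalyzer());
System.out.println(parsed.toString("olFaithFull")); // prints: stillhere

// TermQuery bypasses analysis entirely, so the original case is preserved:
Query exact = new TermQuery(new Term("olFaithFull", "stillHere")); // matches "stillHere" only

The general rule: the terms a query asks for must look exactly like the terms the indexing analyzer produced, or exact-term queries silently return nothing.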
Re: Parsing The Query: Every document that doesn't have a field containing x
"stillHere" Capital H. - Original Message - From: "Kauler, Leto S" <[EMAIL PROTECTED]> To: "Lucene Users List" Sent: Thursday, February 03, 2005 6:40 PM Subject: RE: Parsing The Query: Every document that doesn't have a field containing x First thing that jumps out is case-sensitivity. Does your olFaithFull field contain "stillHere" or "stillhere"? --Leto > -Original Message- > From: Luke Shannon [mailto:[EMAIL PROTECTED] > This works: > > query1 = QueryParser.parse("jpg", "kcfileupload", new > StandardAnalyzer()); query2 = QueryParser.parse("stillHere", > "olFaithFull", new StandardAnalyzer()); BooleanQuery > typeNegativeSearch = new BooleanQuery(); > typeNegativeSearch.add(query1, false, false); > typeNegativeSearch.add(query2, false, false); > > It returns 9 results. And in string form is: kcfileupload:jpg > olFaithFull:stillhere > > But this: > > query1 = QueryParser.parse("jpg", "kcfileupload", new > StandardAnalyzer()); > query2 = QueryParser.parse("stillHere", > "olFaithFull", new StandardAnalyzer()); > BooleanQuery typeNegativeSearch = new BooleanQuery(); > typeNegativeSearch.add(query1, true, false); > typeNegativeSearch.add(query2, true, false); > > Reutrns 0 results and is in string form : +kcfileupload:jpg > +olFaithFull:stillhere > > If I do the query kcfileupload:jpg in Luke I get 9 docs, each > doc containing a olFaithFull:stillHere. Why would > +kcfileupload:jpg +olFaithFull:stillhere return no results? > > Thanks, > > Luke CONFIDENTIALITY NOTICE AND DISCLAIMER Information in this transmission is intended only for the person(s) to whom it is addressed and may contain privileged and/or confidential information. If you are not the intended recipient, any disclosure, copying or dissemination of the information is unauthorised and you should delete/destroy all copies and notify the sender. No liability is accepted for any unauthorised use of the information contained in this transmission. This disclaimer has been automatically added. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Parsing The Query: Every document that doesn't have a field containing x
This works:

query1 = QueryParser.parse("jpg", "kcfileupload", new StandardAnalyzer());
query2 = QueryParser.parse("stillHere", "olFaithFull", new StandardAnalyzer());
BooleanQuery typeNegativeSearch = new BooleanQuery();
typeNegativeSearch.add(query1, false, false);
typeNegativeSearch.add(query2, false, false);

It returns 9 results. Its string form is: kcfileupload:jpg olFaithFull:stillhere

But this:

query1 = QueryParser.parse("jpg", "kcfileupload", new StandardAnalyzer());
query2 = QueryParser.parse("stillHere", "olFaithFull", new StandardAnalyzer());
BooleanQuery typeNegativeSearch = new BooleanQuery();
typeNegativeSearch.add(query1, true, false);
typeNegativeSearch.add(query2, true, false);

returns 0 results and has the string form: +kcfileupload:jpg +olFaithFull:stillhere

If I do the query kcfileupload:jpg in Luke I get 9 docs, each doc containing an olFaithFull:stillHere. Why would +kcfileupload:jpg +olFaithFull:stillhere return no results?

Thanks,

Luke

- Original Message -
From: "Maik Schreiber" <[EMAIL PROTECTED]>
To: "Lucene Users List"
Sent: Thursday, February 03, 2005 4:55 PM
Subject: Re: Parsing The Query: Every document that doesn't have a field containing x

> > Yes. There should be 119 with stillHere,
>
> You have double-checked that, haven't you? :)
>
> > and if I run a query in Luke on
> > kcfileupload = ppt, it returns one result. I am thinking I should at least
> > get this result back with: -kcfileupload:jpg +olFaithFull:stillhere?
>
> You really should.
>
> --
> Maik Schreiber * http://www.blizzy.de <-- Get GMail invites here!
>
> GPG public key: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x1F11D713
> Key fingerprint: CF19 AFCE 6E3D 5443 9599 18B5 5640 1F11 D713

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Parsing The Query: Every document that doesn't have a field containing x
I did; I have run both queries in Luke.

kcfileupload:ppt returns 1
olfaithfull:stillhere returns 119

Luke

- Original Message -
From: "Maik Schreiber" <[EMAIL PROTECTED]>
To: "Lucene Users List"
Sent: Thursday, February 03, 2005 4:55 PM
Subject: Re: Parsing The Query: Every document that doesn't have a field containing x

> > Yes. There should be 119 with stillHere,
>
> You have double-checked that, haven't you? :)
>
> > and if I run a query in Luke on
> > kcfileupload = ppt, it returns one result. I am thinking I should at least
> > get this result back with: -kcfileupload:jpg +olFaithFull:stillhere?
>
> You really should.
>
> --
> Maik Schreiber * http://www.blizzy.de <-- Get GMail invites here!
>
> GPG public key: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x1F11D713
> Key fingerprint: CF19 AFCE 6E3D 5443 9599 18B5 5640 1F11 D713

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Parsing The Query: Every document that doesn't have a field containing x
Yes. There should be 119 with stillHere, and if I run a query in Luke on kcfileupload = ppt, it returns one result. I am thinking I should at least get this result back with: -kcfileupload:jpg +olFaithFull:stillhere? Luke - Original Message - From: "Maik Schreiber" <[EMAIL PROTECTED]> To: "Lucene Users List" Sent: Thursday, February 03, 2005 4:27 PM Subject: Re: Parsing The Query: Every document that doesn't have a field containing x > > -kcfileupload:jpg +olFaithFull:stillhere > > > > This looks right to me. Why the 0 results? > > Looks good to me, too. You sure all your documents have > olFaithFull:stillhere and there is at least a document with kcfileupload not > being "jpg"? > > -- > Maik Schreiber * http://www.blizzy.de <-- Get GMail invites here! > > GPG public key: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x1F11D713 > Key fingerprint: CF19 AFCE 6E3D 5443 9599 18B5 5640 1F11 D713 > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Parsing The Query: Every document that doesn't have a field containing x
Hello,

Still working on the same query; here is the code I am currently working with. I am thinking this should bring up all the documents that have olFaithFull=stillHere and kcfileupload!=jpg (so anything else):

query1 = QueryParser.parse("jpg", "kcfileupload", new StandardAnalyzer());
query2 = QueryParser.parse("stillHere", "olFaithFull", new StandardAnalyzer());
BooleanQuery typeNegativeSearch = new BooleanQuery();
typeNegativeSearch.add(query1, false, true);
typeNegativeSearch.add(query2, true, false);

The toString() on the query is: -kcfileupload:jpg +olFaithFull:stillhere

This looks right to me. Why the 0 results?

Thanks,

Luke

- Original Message -
From: "Maik Schreiber" <[EMAIL PROTECTED]>
To: "Lucene Users List"
Sent: Thursday, February 03, 2005 1:19 PM
Subject: Re: Parsing The Query: Every document that doesn't have a field containing x

> > Negating a term must be combined with at least one nonnegated term to return
> > documents; in other words, it isn't possible to use a query like NOT term to
> > find all documents that don't contain a term.
> >
> > So does that mean the above example wouldn't work?
>
> Exactly. You cannot search for "-kcfileupload:jpg", you need at least one
> clause that actually _includes_ documents.
>
> Do you by chance have a field with known contents? If so, you could misuse
> that one and include it in your query (perhaps by doing range or
> wildcard/prefix search). If not, try IndexReader.terms() for building a
> Query yourself, then use that one for search.
>
> --
> Maik Schreiber * http://www.blizzy.de
>
> GPG public key: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x1F11D713
> Key fingerprint: CF19 AFCE 6E3D 5443 9599 18B5 5640 1F11 D713

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Parsing The Query: Every document that doesn't have a field containing x
Ok. I have added the following to every document:

doc.add(Field.UnIndexed("olFaithfull", "stillHere"));

The plan is a query that says: olFaithfull = stillHere and kcfileupload != jpg. I have been experimenting with the MultiFieldQueryParser; this is not working out for me. Syntax-wise, how is this done? Does someone have an example of a query similar to the one I am trying?

Thanks,

Luke

- Original Message -
From: "Maik Schreiber" <[EMAIL PROTECTED]>
To: "Lucene Users List"
Sent: Thursday, February 03, 2005 1:19 PM
Subject: Re: Parsing The Query: Every document that doesn't have a field containing x

> > Negating a term must be combined with at least one nonnegated term to return
> > documents; in other words, it isn't possible to use a query like NOT term to
> > find all documents that don't contain a term.
> >
> > So does that mean the above example wouldn't work?
>
> Exactly. You cannot search for "-kcfileupload:jpg", you need at least one
> clause that actually _includes_ documents.
>
> Do you by chance have a field with known contents? If so, you could misuse
> that one and include it in your query (perhaps by doing range or
> wildcard/prefix search). If not, try IndexReader.terms() for building a
> Query yourself, then use that one for search.
>
> --
> Maik Schreiber * http://www.blizzy.de
>
> GPG public key: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x1F11D713
> Key fingerprint: CF19 AFCE 6E3D 5443 9599 18B5 5640 1F11 D713

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
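One detail worth flagging in the snippet above: Field.UnIndexed creates a stored-only field, so no query can ever match it. For the marker-field trick to work, the field has to be indexed. A minimal sketch of the two indexed alternatives in the 1.4-era API (field name and value from this thread):

// indexed but not analyzed -- the term is kept exactly as "stillHere",
// so a TermQuery must match the case exactly:
doc.add(Field.Keyword("olFaithfull", "stillHere"));

// indexed and analyzed -- StandardAnalyzer lowercases it to "stillhere",
// which then matches what QueryParser produces:
doc.add(Field.Text("olFaithfull", "stillHere"));

The case-sensitivity mix-up that surfaces later in this thread is exactly the difference between these two choices.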
Re: Synonyms Not Showing In The Index
Thanks! I can wait for the release. Luke - Original Message - From: "Andrzej Bialecki" <[EMAIL PROTECTED]> To: "Lucene Users List" Sent: Thursday, February 03, 2005 2:53 PM Subject: Re: Synonyms Not Showing In The Index > Andrzej Bialecki wrote: > > Luke Shannon wrote: > > > >> Hello; > >> > >> It seems my Synonym analyzer is working (based on some successful > >> queries). > >> But I can't see the synonyms in the index using Luke. Is this correct? > >> > > > > Did you use the combined JAR to run? It contains an oldish version of > > Lucene... Other than that, I'm not sure - if you can't find the reason > > you could send me a small test index... > > > > > > Got the bug. Your index is ok, and your synonym analyzer works as > expected. The Doc #16, field "name" has the content "luigi|mario test", > where tokens "luigi" and "mario" occupy the same position. > > This was a deficiency with the current version of Luke, where if you > press "Reconstruct" it tries to reconstruct only unstored fields, but > shows you the stored fields verbatim (without actually checking how > their content was tokenized, and what tokens ended up in the index). > > This is fixed in the new (yet unreleased) version of Luke. This new > version restores all fields (no matter if they are stored or only > indexed), and then displays both the stored content, and the restored > tokenized content. There was also a bug in GrowableStringsArray - the > values of tokens with the same position were being overwritten instead > of appended. This is also fixed now. > > You should expect a new release within a week or two. If you can't wait, > let me know and I'll send you the patches. > > -- > Best regards, > Andrzej Bialecki > ___. ___ ___ ___ _ _ __ > [__ || __|__/|__||\/| Information Retrieval, Semantic Web > ___|||__|| \| || | Embedded Unix, System Integration > http://www.sigram.com Contact: info at sigram dot com > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Parsing The Query: Every document that doesn't have a field containing x
Hello;

I have a query that finds documents that contain fields with a specific value.

query1 = QueryParser.parse("jpg", "kcfileupload", new StandardAnalyzer());

This works well.

I would like a query that finds documents containing all kcfileupload fields that don't contain jpg.

The example I found in the book that seems to relate shows me how to find documents without a specific term:

QueryParser parser = new QueryParser("contents", analyzer);
parser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);

But then it says:

Negating a term must be combined with at least one nonnegated term to return documents; in other words, it isn't possible to use a query like NOT term to find all documents that don't contain a term.

So does that mean the above example wouldn't work?

The API says:

a plus (+) or a minus (-) sign, indicating that the clause is required or prohibited respectively;

I have been playing around with using the minus character without much luck.

Can someone point me in the right direction to figure this out?

Thanks,

Luke

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lock failure recovery
The indexing process is totally synchronized in our system. Thus if an indexing thread starts up and the index exists but is locked, I know this to be the only indexing process running, so the lock must be from a process that got stopped before it could finish. So right before I begin writing to the index I have this check:

// If we have gotten to here, this is the only indexing process running.
// The index should not be locked. If it is, the lock is "stale"
// and must be released before we can continue.
try {
    if (index.exists() && IndexReader.isLocked(indexFileLocation)) {
        Trace.ERROR("INDEX INFO: Had to clear a stale index lock");
        IndexReader.unlock(FSDirectory.getDirectory(index, false));
    }
} catch (IOException e3) {
    Trace.ERROR("INDEX ERROR: Was unable to clear a stale index lock: " + e3);
}

Luke

- Original Message -
From: "Claes Holmerson" <[EMAIL PROTECTED]>
To: "Lucene Users List"
Sent: Thursday, February 03, 2005 12:02 PM
Subject: Lock failure recovery

> Hello
>
> A commit.lock can get left by a process that dies in the middle of
> reading the index, for example because of an OutOfMemoryError. How can I
> handle such a left lock gracefully the next time the process runs?
> Checking if there is a lock is straightforward - but how can I be sure
> that it is not just a current lock created by another thread? The only
> methods I find to deal with the lock are IndexReader.isLocked() and
> IndexReader.unlock(). I would like to know the lock age - if it is older
> than a certain age then I can remove it. How do other people deal with
> leftover locks?
>
> Claes
> --
>
> Claes Holmerson
> Polopoly - Cultivating the information garden
> Kungsgatan 88, SE-112 27 Stockholm, SWEDEN
> Direct: +46 8 506 782 59
> Mobile: +46 704 47 82 59
> Fax: +46 8 506 782 51
> [EMAIL PROTECTED], http://www.polopoly.com

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
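On Claes's lock-age question: Lucene itself does not expose a lock timestamp, but the lock is a plain file, so its age can be read off the filesystem. A hedged sketch only — the lock file's name and directory vary by Lucene version and configuration (1.4 puts hashed lock names in the system temp directory), so the path below is a placeholder you would need to adapt:

// hypothetical lock location; find the actual lock file for your setup
File lockFile = new File(System.getProperty("java.io.tmpdir"), "commit.lock");
long maxAgeMs = 60 * 60 * 1000L; // treat anything older than an hour as stale

if (lockFile.exists()
        && System.currentTimeMillis() - lockFile.lastModified() > maxAgeMs) {
    IndexReader.unlock(FSDirectory.getDirectory(indexDir, false));
}

The synchronized single-indexer approach above sidesteps the age check entirely, because any lock seen at startup is stale by construction.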
Synonyms Not Showing In The Index
Hello; It seems my Synonym analyzer is working (based on some successful queries). But I can't see the synonyms in the index using Luke. Is this correct? Thanks, Luke - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: QueryParser Help
Actually now that I am looking at it, I think I am already accomplishing it. I wanted all the documents with Mario in either field to show up. There are two, but one has them in both fields in the Document. This is correct. Thanks for the help. It would have taken me a while to catch that. Luke - Original Message - From: "Maik Schreiber" <[EMAIL PROTECTED]> To: "Lucene Users List" Sent: Wednesday, February 02, 2005 6:26 PM Subject: Re: QueryParser Help > > Not sure how to handle this yet, I still don't know enough about > > QueryParsing. > > What is it you're trying to accomplish? > > -- > Maik Schreiber * http://www.blizzy.de > > GPG public key: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x1F11D713 > Key fingerprint: CF19 AFCE 6E3D 5443 9599 18B5 5640 1F11 D713 > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: QueryParser Help
This is it. Thanks Maik. One of the docs had the result in both name and desc. Not sure how to handle this yet; I still don't know enough about query parsing.

Luke

- Original Message -
From: "Maik Schreiber" <[EMAIL PROTECTED]>
To: "Lucene Users List"
Sent: Wednesday, February 02, 2005 6:15 PM
Subject: Re: QueryParser Help

> > But, I know "name" contains 2 documents, I also know "desc" contains one.
> > This may be a dumb question but why does Hits not contain pointers to 3
> > results?
>
> Your search is an OR search, which is why you get a union of search hits.
> Consider these documents (which I think you have in your index):
>
> Document 1:
> - name=mario
> - desc=mario
>
> Document 2:
> - name=mario
> - desc=foo
>
> - Searching for "mario" in field "name" would return 2 hits.
> - Searching for "mario" in field "desc" would return 1 hit.
> - Searching for "mario" in both fields would return 2 hits (which is what
> you're seeing).
>
> --
> Maik Schreiber * http://www.blizzy.de
>
> GPG public key: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x1F11D713
> Key fingerprint: CF19 AFCE 6E3D 5443 9599 18B5 5640 1F11 D713

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
QueryParser Help
Hello;

Getting acquainted with query parsing. I have a question:

Query query = MultiFieldQueryParser.parse("mario",
    new String[] { "name", "desc" },
    new int[] { MultiFieldQueryParser.NORMAL_FIELD,
                MultiFieldQueryParser.NORMAL_FIELD },
    new StandardAnalyzer());
IndexSearcher searcher = new IndexSearcher(fsDir);
Hits hits = searcher.search(query);
System.out.println("Keywords : " + hits.length() + " " + query.toString());
assertEquals(2, hits.length());

This test is successful. But I know "name" contains 2 documents, and I also know "desc" contains one. This may be a dumb question, but why does Hits not contain pointers to 3 results (2 from name, 1 from desc)?

Thanks
Luke

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
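Maik's union explanation (in the reply above) can be made concrete with the thread's own numbers. A small sketch (1.4-era API; field names and counts from this exchange), reusing the "searcher" from the test:

// run the two single-field queries separately to see the per-field counts:
Query nameQ = QueryParser.parse("mario", "name", new StandardAnalyzer());
Query descQ = QueryParser.parse("mario", "desc", new StandardAnalyzer());
System.out.println(searcher.search(nameQ).length()); // 2
System.out.println(searcher.search(descQ).length()); // 1

// The multi-field query is the OR of these two, and Hits counts documents,
// not field matches: the doc matching in both fields is still one hit,
// so the union has 2 documents, not 3.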
Re: which HTML parser is better?
In our application I use regular expressions to strip all tags in one situation and specific ones in another situation. Here is sample code for both.

This strips all HTML 4.0 tags except , , , , , , :

html_source = Pattern.compile("", Pattern.CASE_INSENSITIVE).matcher(html_source).replaceAll("");

When I want to strip anything in a tag I use the following pattern with the code above:

String strPattern1 = "<\\s?(.|\n)*?\\s?>";

HTH

Luke

- Original Message -
From: "sergiu gordea" <[EMAIL PROTECTED]>
To: "Lucene Users List"
Sent: Wednesday, February 02, 2005 1:23 PM
Subject: Re: which HTML parser is better?

> Karl Koch wrote:
>
> >I am in control of the HTML, which means it is well-formatted HTML. I use
> >only HTML files which I have transformed from XML. No external HTML (e.g.
> >the web).
> >
> >Are there any very-short solutions for that?
>
> if you are using only correctly formatted HTML pages and you are in control
> of these pages, you can use a regular expression to remove the tags.
>
> something like
> replaceAll("<*>","");
>
> This is the idea behind the operation. If you search on Google you will
> find a more robust regular expression.
>
> Using a simple regular expression will be a very cheap solution, that
> can cause you a lot of problems in the future.
>
> It's up to you to use it
>
> Best,
>
> Sergiu
>
> >Karl
> >
> >>Karl Koch wrote:
> >>
> >>>Hi,
> >>>
> >>>yes, but the library you are using is quite big. I was thinking that a
> >>>5kB code could actually do that. That SourceForge project is doing much
> >>>more than that but I do not need it.
> >>>
> >>you need just the htmlparser.jar 200k.
> >>... you know ... the functionality is strongly correlated with the size.
> >>
> >>You can use 3 lines of code with a good regular expression to eliminate
> >>the html tags, but this won't give you any guarantee that the text from
> >>badly formatted html files will be correctly extracted...
> >>
> >>Best,
> >>
> >>Sergiu
> >>
> >>>Karl
> >>>
> >>>>Hi Karl,
> >>>>
> >>>>I already submitted a piece of code that removes the html tags.
> >>>>Search for my previous answer in this thread.
> >>>>
> >>>>Best,
> >>>>
> >>>>Sergiu
> >>>>
> >>>>Karl Koch wrote:
> >>>>
> >>>>>Hello,
> >>>>>
> >>>>>I have been following this thread and have another question.
> >>>>>
> >>>>>Is there a piece of sourcecode (which is preferably very short and
> >>>>>simple (KISS)) which allows to remove all HTML tags from HTML content?
> >>>>>HTML 3.2 would be enough... also no frames, CSS, etc.
> >>>>>
> >>>>>I do not need to have the HTML structure tree or any other structure,
> >>>>>but need a facility to clean up HTML into its normal underlying content
> >>>>>before indexing that content as a whole.
> >>>>>
> >>>>>Karl
> >>>>>
> >>>>>>I think that depends on what you want to do. The Lucene demo parser
> >>>>>>does simple mapping of HTML files into Lucene Documents; it does not
> >>>>>>give you a parse tree for the HTML doc. CyberNeko is an extension of
> >>>>>>Xerces (uses the same API; will likely become part of Xerces), and so
> >>>>>>maps an HTML document into a full DOM that you can manipulate easily
> >>>>>>for a wide range of purposes. I haven't used JTidy at an API level and
> >>>>>>so don't know it as well -- based on its UI, it appears to be focused
> >>>>>>primarily on HTML validation and error detection/correction.
> >>>>>>
> >>>>>>I use CyberNeko for a range of operations on HTML documents that go
> >>>>>>beyond indexing them in Lucene, and really like it. It has been robust
> >>>>>>for me so far.
> >>>>>>
> >>>>>>Chuck
> >>>>>>
> >>>>>>>-Original Message-
> >>>>>>>From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
> >>>>>>>Sent: Tuesday, February 01, 2005 1:15 AM
> >>>>>>>To: lucene-user@jakarta.apache.org
> >>>>>>>Subject: which HTML parser is better?
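To make the regex approach concrete, here is a self-contained sketch of the strip-everything case, built around the surviving pattern from the message above (the whitelist variant is not shown; the class name is made up for illustration):

import java.util.regex.Pattern;

public class TagStripper {
    // Non-greedy match of anything between '<' and '>'.
    // As discussed above, this is only safe for well-formed HTML.
    private static final Pattern TAG =
        Pattern.compile("<\\s?(.|\n)*?\\s?>", Pattern.CASE_INSENSITIVE);

    public static String strip(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("<p>Hello <b>world</b></p>")); // Hello world
    }
}

As Sergiu warns, this is the cheap solution: it will mangle unbalanced tags, comments containing '>', and script blocks, which is where a real parser earns its 200k.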
Combining Documents
Hello;

I have a situation where I need to combine the fields returned from one document with an existing document. Is there something in the API for this that I'm missing, or is this the best way:

// add the fields contained in the PDF document to the existing doc
Document attachedDoc = LucenePDFDocument.getDocument(attached);
Enumeration docFields = attachedDoc.fields();
while (docFields.hasMoreElements()) {
    doc.add((Field) docFields.nextElement());
}

Luke

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
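That loop is essentially it; the Document class in this era of the API has no merge method. The same idea as a small helper (pre-generics Java; the method name is made up):

// Copy every field of `source` into `target`. Both documents end up
// sharing the same Field instances, which is fine for indexing.
static void mergeFields(Document target, Document source) {
    Enumeration fields = source.fields();
    while (fields.hasMoreElements()) {
        target.add((Field) fields.nextElement());
    }
}

One caveat: if both documents carry a field with the same name, the target ends up with multiple values under that name, which Lucene indexes as if the texts had been appended to each other.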
Re: How to get document count?
Not sure if the API provides a method for this, but you could use Luke: http://www.getopt.org/luke/ It gives you a count and lets you step through each Doc looking at their fields. - Original Message - From: "Jim Lynch" <[EMAIL PROTECTED]> To: "Lucene Users List" Sent: Tuesday, February 01, 2005 11:28 AM Subject: How to get document count? > I've indexed a large set of documents and think that something may have > gone wrong somewhere in the middle. Is there a way I can display the > count of documents in the index? > > Thanks, > Jim. > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
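For the count itself, the API does expose it directly, which is cheaper than stepping through Luke. A minimal sketch (1.4-era API; the index path is a placeholder):

// numDocs() counts live documents; maxDoc() is an upper bound that
// still includes deleted-but-not-yet-merged-away documents.
IndexReader reader = IndexReader.open("/path/to/index");
System.out.println("documents: " + reader.numDocs());
reader.close();

After an optimize(), numDocs() and maxDoc() should agree, since deletions get purged during the merge.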
Re: Boosting Questions
Thanks Otis. - Original Message - From: "Otis Gospodnetic" <[EMAIL PROTECTED]> To: "Lucene Users List" Sent: Thursday, January 27, 2005 12:11 PM Subject: Re: Boosting Questions > Luke, > > Boosting is only one of the factors involved in Document/Query scoring. > Assuming that by applying your boosts to Document A or a single field > of Document A increases the total score enough, yes, that Document A > may have the highest score. But just because you boost a single > Document and not others, it does not mean it will emerge at the top. > You should check out the Explanation class, which can dump all scoring > factors in text or HTML format. > > Otis > > > --- Luke Shannon <[EMAIL PROTECTED]> wrote: > > > Hi All; > > > > I just want to make sure I have the right idea about boosting. > > > > So if I boost a document (Document A) after I index it (lets say a > > score of > > 2.0) Lucene will now consider this document relativly more important > > than > > other documents in the index with a boost factor less than 2.0. This > > boost > > factor will also be applied to all the fields in the Document A. > > Therefore, > > if I do a TermQuery on a field that all my documents share ("title"), > > in the > > returned Hits (assuming Document A was among the return documents), > > Document > > A will score higher than other documents with a lower boost factor > > because > > the "title" field in A would have been boosted with all its other > > fields. > > Correct? > > > > Now if at indexing time I decided to boost a particular field, lets > > say > > "address" in Document A (this is a field which all documents have) > > the boost > > factor is only applied to the "address" field of Document A. Nothing > > else is > > boosted by this operation. This means if a TermQuery on the "address" > > field > > returns Document A along with a collection of other documents, > > Document A > > will score higher than the others because of boosting. Correct? > > > > Thanks, > > > > Luke > > > > > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
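Otis's pointer to the Explanation class looks like this in practice — a minimal sketch (1.4-era API), assuming a "searcher", "query", and "hits" as in the surrounding threads:

// dump every scoring factor (tf, idf, boosts, norms) for the top hit:
Explanation explanation = searcher.explain(query, hits.id(0));
System.out.println(explanation.toString());

hits.id(0) maps the hit back to the internal document number that explain() expects; the text output shows exactly how much a boost contributed relative to the other factors.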
Boosting Questions
Hi All;

I just want to make sure I have the right idea about boosting.

So if I boost a document (Document A) after I index it (let's say a score of 2.0), Lucene will now consider this document relatively more important than other documents in the index with a boost factor less than 2.0. This boost factor will also be applied to all the fields in Document A. Therefore, if I do a TermQuery on a field that all my documents share ("title"), in the returned Hits (assuming Document A was among the returned documents), Document A will score higher than other documents with a lower boost factor because the "title" field in A would have been boosted with all its other fields. Correct?

Now if at indexing time I decided to boost a particular field, let's say "address" in Document A (this is a field which all documents have), the boost factor is only applied to the "address" field of Document A. Nothing else is boosted by this operation. This means if a TermQuery on the "address" field returns Document A along with a collection of other documents, Document A will score higher than the others because of boosting. Correct?

Thanks,

Luke

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
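The two cases look like this at indexing time — a minimal sketch (1.4-era API; field values are placeholders, and "writer" is an open IndexWriter):

Document docA = new Document();
Field title = Field.Text("title", "some title");
Field address = Field.Text("address", "some address");

// field-level boost: only "address" terms in this doc score higher
address.setBoost(2.0f);

docA.add(title);
docA.add(address);

// document-level boost: multiplied into the boost of every field in docA
docA.setBoost(2.0f);

writer.addDocument(docA);

Note that the boosts must be set before addDocument(); they are folded into the per-field norms at write time, so changing them later means re-indexing the document.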
Re: Getting Into Search
Thanks Otis. Haven't seen it in either store (at least the ones in downtown Toronto I usually shop at). Their website says it ships in 24 hrs. It was cheaper on Amazon.ca so I went that route for my printed version. Luke - Original Message - From: "Otis Gospodnetic" <[EMAIL PROTECTED]> To: "Lucene Users List" Sent: Wednesday, January 26, 2005 1:13 PM Subject: Re: Getting Into Search > Hi Luke, > > That's not hard with RangeQuery (supported by QueryParser), take a look > at this: > http://www.lucenebook.com/search?query=date+range > > The grayed-out text has the section name and page number, so you can > quickly locate this stuff in your ebook. > > Otis > P.S. > Do you know if Indigo/Chapters has Lucene in Action on their book > shelves yet? > > > --- Luke Shannon <[EMAIL PROTECTED]> wrote: > > > Hello; > > > > My lucene application has been performing well in our company's CMS > > application. The plan now is too offer "advanced searching". > > > > I just bought the eBook version of Lucene in Action to help with my > > research > > (it is taking Amazon for ever to ship the printed version to Canada). > > > > The book looks great and will certainly deepen my understanding. But > > I am > > suffering a bit of information over load. > > > > I was hoping I could post the rough requirments I was given this > > morning and > > perhaps some more experienced Luceners could help direct my research > > (this > > can even be pointing me to relevant sections of the book). > > > > 1. Documents in the system contain the following fields, > > ModificationDate, > > CreationDate. A query is required that allows users to search for > > documents > > created/modified on a certain date or within a certain date range. > > > > 2. Documents in the system also contains fields: Title, Path. A query > > is > > required that allows users to search for Titles or Path starting > > with, > > ending with, containing (this is all the system currently does) or > > matching > > specific term(s). > > > > Later today I will get more specific requirments. For now I am > > looking > > through Analysis section of the eBook for ideas on how to handle > > this. Any > > tips anyone can give would be appreciated. > > > > Thanks, > > > > Luke > > > > > > > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Getting Into Search
Hello;

My lucene application has been performing well in our company's CMS application. The plan now is to offer "advanced searching".

I just bought the eBook version of Lucene in Action to help with my research (it is taking Amazon forever to ship the printed version to Canada).

The book looks great and will certainly deepen my understanding. But I am suffering a bit of information overload.

I was hoping I could post the rough requirements I was given this morning and perhaps some more experienced Luceners could help direct my research (this can even be pointing me to relevant sections of the book).

1. Documents in the system contain the following fields: ModificationDate, CreationDate. A query is required that allows users to search for documents created/modified on a certain date or within a certain date range.

2. Documents in the system also contain the fields Title and Path. A query is required that allows users to search for Titles or Paths starting with, ending with, containing (this is all the system currently does) or matching specific term(s).

Later today I will get more specific requirements. For now I am looking through the Analysis section of the eBook for ideas on how to handle this. Any tips anyone can give would be appreciated.

Thanks,

Luke

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
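Both requirements map onto stock query types. A hedged sketch (1.4-era API; field names taken from the requirements above, while the DateField encoding and the example terms are assumptions about how the CMS might index things):

// 1. inclusive date range over DateField-encoded values:
Term start = new Term("ModificationDate", DateField.timeToString(startMillis));
Term end   = new Term("ModificationDate", DateField.timeToString(endMillis));
Query inRange = new RangeQuery(start, end, true); // true = inclusive

// 2. starts-with / ends-with / contains on a Title term:
Query startsWith = new PrefixQuery(new Term("Title", "intro"));
Query endsWith   = new WildcardQuery(new Term("Title", "*duction"));
Query contains   = new WildcardQuery(new Term("Title", "*trodu*"));

One caveat: wildcard terms with a leading * are allowed when built programmatically (QueryParser rejects them) but force a scan of the field's entire term dictionary, so "ending with" and "containing" get slow on large indexes.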
Re: FOP Generated PDF and PDFBox
Thanks Ben. I know now the issue is not Lucene-related. For the time being I will be using path. Once I get a chance I will try this on the command line as you have recommended.

Luke

- Original Message -
From: "Ben Litchfield" <[EMAIL PROTECTED]>
To: "Lucene Users List"
Sent: Friday, January 21, 2005 1:05 PM
Subject: Re: FOP Generated PDF and PDFBox

> Ya, when calling LucenePDFDocument.getDocument( File ) then it should be
> the same as the path.
>
> This is the code that the class uses to set those fields.
>
> document.add( Field.UnIndexed("path", file.getPath() ) );
> document.add(Field.UnIndexed("url", file.getPath().replace(FILE_SEPARATOR, '/')));
>
> I have no idea why an FOP PDF would be any different than another PDF.
>
> You can also run it from the command line for debugging purposes, like this:
>
> java org.pdfbox.searchengine.lucene.LucenePDFDocument
>
> and it should print out the fields of the lucene Document object. Is the
> url there and is it correct?
>
> Ben
>
> On Fri, 21 Jan 2005, Luke Shannon wrote:
>
> > That is correct. No difference with how other PDFs are handled.
> >
> > I am looking at the index in Luke now. The FOP generated documents have a
> > path but no URL? I would guess that these would be the same?
> >
> > Thanks for the speedy reply.
> >
> > Luke

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
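Until the root cause is found, the display code can fall back from one field to the other, since (per Ben's snippet above) LucenePDFDocument derives both "url" and "path" from file.getPath(). A small defensive sketch:

Document doc = hits.doc(i);
String path = doc.get("url");
if (path == null) {
    // FOP-generated PDFs in this setup end up with "path" but no "url"
    path = doc.get("path");
}

This only papers over the missing field, but it avoids the NullPointerException while the indexing side is investigated.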
Re: FOP Generated PDF and PDFBox
That is correct. No difference with how other PDFs are handled.

I am looking at the index in Luke now. The FOP generated documents have a path but no URL? I would guess that these would be the same?

Thanks for the speedy reply.

Luke

- Original Message -
From: "Ben Litchfield" <[EMAIL PROTECTED]>
To: "Lucene Users List"
Sent: Friday, January 21, 2005 12:34 PM
Subject: Re: FOP Generated PDF and PDFBox

> Are you indexing the FOP PDFs differently than other PDF documents?
>
> Can I assume that you are using PDFBox's LucenePDFDocument.getDocument()
> method?
>
> Ben
>
> On Fri, 21 Jan 2005, Luke Shannon wrote:
>
> > Hello;
> >
> > Our CMS now allows users to create PDF documents (using FOP) and then
> > search them.
> >
> > I seem to be able to index these documents ok. But when I am generating
> > the results to display I get a NullPointerException while trying to use a
> > variable that should contain the url keyword for one of these documents in
> > the index:
> >
> > Document doc = hits.doc(i);
> > String path = doc.get("url");
> >
> > path contains null.
> >
> > The interesting thing is this only happens with PDFs that are generated
> > with FOP. Other PDFs are fine.
> >
> > What I find weird is: shouldn't the "url" field just contain the path of
> > the file?
> >
> > Anyone else seen this before?
> >
> > Any ideas?
> >
> > Thanks,
> >
> > Luke

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
FOP Generated PDF and PDFBox
Hello;

Our CMS now allows users to create PDF documents (using FOP) and then search them.

I seem to be able to index these documents ok. But when I am generating the results to display, I get a NullPointerException while trying to use a variable that should contain the url keyword for one of these documents in the index:

Document doc = hits.doc(i);
String path = doc.get("url");

path contains null.

The interesting thing is this only happens with PDFs that are generated with FOP. Other PDFs are fine.

What I find weird is: shouldn't the "url" field just contain the path of the file?

Anyone else seen this before?

Any ideas?

Thanks,

Luke

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: where to place the index directory
Hello Philippe;

Things have gotten busy here for me. I can help you troubleshoot this a little later. Please email me directly if you still need help with this.

Luke

- Original Message -
From: "philippe" <[EMAIL PROTECTED]>
To: "Lucene Users List"
Sent: Friday, January 14, 2005 12:45 PM
Subject: Re: where to place the index directory

> Luke,
>
> the jsp is in :
> /var/www/html/capoeira
>
> and the index directory, called indexHtmlCapoeira,
> /home/quilombo/indexHtmlCapoeira/index
>
> in the jsp, i'm giving the path,
> String indexLocation = "/home/quilombo/indexHtmlCapoeira/index";
>
> thanks for your help
>
> philippe
>
> On Friday 14 January 2005 18:33, Luke Shannon wrote:
> > The jsp is having some trouble locating the index folder. It is probably
> > the path you are supplying when you create the File object for the index.
> > When you create the File object, what is the path you are passing in?
> >
> > - Original Message -
> > From: "philippe" <[EMAIL PROTECTED]>
> > To: "Lucene Users List"
> > Sent: Friday, January 14, 2005 12:17 PM
> > Subject: Re: where to place the index directory
> >
> > > yes Luke,
> > > thank you for your help,
> > >
> > > the message is :
> > > "indexHtml is not a directory"
> > >
> > > during some experimentations, the message has been
> > > "unable to open the directory"
> > >
> > > thanks
> > > philippe
> > >
> > > On Friday 14 January 2005 17:56, Luke Shannon wrote:
> > > > Does it give some sort of error message?
> > > >
> > > > Luke
> > > >
> > > > - Original Message -
> > > > From: "philippe" <[EMAIL PROTECTED]>
> > > > To:
> > > > Sent: Friday, January 14, 2005 11:39 AM
> > > > Subject: where to place the index directory
> > > >
> > > > > Hi everybody,
> > > > >
> > > > > can someone help me ?
> > > > >
> > > > > i have a problem with my index ?
> > > > >
> > > > > on my localhost, everything is ok,
> > > > > i can put my index directory in different places, it is accessed by
> > > > > my jsp.
> > > > >
> > > > > But on my hosting tomcat 4, my jsp can't open this directory
> > > > >
> > > > > have an idea ?
> > > > >
> > > > > thanks in advance
> > > > >
> > > > > philippe

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: where to place the index directory
The jsp is having some trouble locating the index folder. It is probably the path you are supplying when you create the File object for the index. When you create the File object, what is the path you are passing in?

- Original Message -
From: "philippe" <[EMAIL PROTECTED]>
To: "Lucene Users List"
Sent: Friday, January 14, 2005 12:17 PM
Subject: Re: where to place the index directory

> yes Luke,
> thank you for your help,
>
> the message is :
> "indexHtml is not a directory"
>
> during some experimentations, the message has been
> "unable to open the directory"
>
> thanks
> philippe
>
> On Friday 14 January 2005 17:56, Luke Shannon wrote:
> > Does it give some sort of error message?
> >
> > Luke
> >
> > - Original Message -
> > From: "philippe" <[EMAIL PROTECTED]>
> > To:
> > Sent: Friday, January 14, 2005 11:39 AM
> > Subject: where to place the index directory
> >
> > > Hi everybody,
> > >
> > > can someone help me ?
> > >
> > > i have a problem with my index ?
> > >
> > > on my localhost, everything is ok,
> > > i can put my index directory in different places, it is accessed by my jsp.
> > >
> > > But on my hosting tomcat 4, my jsp can't open this directory
> > >
> > > have an idea ?
> > >
> > > thanks in advance
> > >
> > > philippe

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
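A quick way to see what the JSP actually resolves — a tiny sketch using plain java.io.File, with the hard-coded path taken from philippe's message:

File indexDir = new File("/home/quilombo/indexHtmlCapoeira/index");
System.out.println("exists: " + indexDir.exists());
System.out.println("isDirectory: " + indexDir.isDirectory());
System.out.println("canRead: " + indexDir.canRead()); // the Tomcat user needs read access

A "not a directory" message on a hosted Tomcat usually means either the path is being resolved relative to the container's working directory, or the account Tomcat runs as cannot read into /home/quilombo.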
Re: where to place the index directory
Does it give some sort of error message? Luke - Original Message - From: "philippe" <[EMAIL PROTECTED]> To: Sent: Friday, January 14, 2005 11:39 AM Subject: where to place the index directory > Hi everybody, > > can someone help me ? > > i have a problem with my index ? > > on my localhost, everything is ok, > i can put my index directory in different places, it is accessed by my jsp. > > But on my hosting tomcat 4, my jsp can't open this directory > > have an idea ? > > thanks in advance > > philippe > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: what if the IndexReader crashes, after delete, before close.
Here is how I handle it. The Indexer is a Runnable. All the members it uses are static. The run() method calls a synchronized method called go(). This kicks off the indexing. Before you even get to here, the method in the CMS code that created the thread object and instantiated the index is also synchronized.

Here is the code that handles the potential lock file that may be left behind by a Reader or Writer.

Note: I found I had to check if the index existed before checking if it was locked. If I checked if it was locked and the index had not been created yet, I got an error.

// If we have gotten to here, this is the only indexing process running.
// The index should not be locked. If it is, the lock is "stale"
// and must be released before we can continue.
try {
    if (index.exists() && IndexReader.isLocked(indexFileLocation)) {
        Trace.ERROR("INDEX INFO: Had to clear a stale index lock");
        IndexReader.unlock(FSDirectory.getDirectory(index, false));
    }
} catch (IOException e3) {
    Trace.ERROR("INDEX ERROR: IMPORTANT. Was unable to clear a stale index lock: " + e3);
}

HTH

Luke

- Original Message -
From: "Peter Veentjer - Anchor Men" <[EMAIL PROTECTED]>
To: "Lucene Users List"
Sent: Tuesday, January 11, 2005 3:24 AM
Subject: RE: what if the IndexReader crashes, after delete, before close.

-Original Message-
From: Luke Shannon [mailto:[EMAIL PROTECTED]
Sent: Monday, January 10, 2005 15:46
To: Lucene Users List
Subject: Re: what if the IndexReader crashes, after delete, before close.

>> One thing that will happen is the lock file
>> will get left behind. This means when you start
>> back up and try to create another Reader you will
>> get a file lock error.

I figured that part out the hard way ;) Why can't I access my index anymore?? Ahh... the lock file.

>> Our system is threaded and synchronized.
>> Thus when a Reader is being created I know
>> it is the only one (the Writer comes after
>> the reader has been closed). Before creating
>> it I check if the Index is locked. If it is,
>> I forcefully clear it. This prevents the above
>> problem from happening.

You can have more than one reader open at any time, even while a delete or add is in progress. But you can't use a reader where documents are deleted (IndexReader) and added (IndexWriter) at the same time. If you don't have other threads doing delete/add, you won't have to synchronize anything. And how do you synchronize on it? I have applied the ReadWriteLock from Doug Lea's concurrency library, after I had built my own synchronization brick and somebody pointed out that I was implementing the ReadWriteLock. But at the moment I don't do any synchronization.

And I want to have a component that is executed when the system is started and knows what to do if there is rubbish in the index directory. I want that component to restore my index to a usable version (even a small loss of information is acceptable because everything is checked once in a while, and user-added information is going to be stored in the database, so nothing gets lost; the index can be rebuilt).

- Original Message -
From: "Peter Veentjer - Anchor Men" <[EMAIL PROTECTED]>
To:
Sent: Saturday, January 08, 2005 4:08 AM
Subject: what if the IndexReader crashes, after delete, before close.

What happens to the Index if the IndexReader crashes after I have deleted documents and before I have called close? Are the deletes ignored? Is the Index screwed up? Is the filesystem screwed up (when a document is deleted, new delete-files appear), so are the delete-files still there (and can these be ignored the next time)? Can I restore the index to the previous state just by removing those delete-files?

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: How do you handle dynamic html pages?
I run the indexer in our CMS every time a content change has occurred. It is an incremental update, so only documents that generate a different UID than the corresponding UID in the index get processed.

Luke

- Original Message -
From: "Kevin L. Cobb" <[EMAIL PROTECTED]>
To: "Lucene Users List"
Sent: Monday, January 10, 2005 11:26 AM
Subject: RE: How do you handle dynamic html pages?

I don't like to periodically re-index everything because 1) you can't be confident that your searches are as up to date as they could be, and 2) you are wasting cycles either checking for documents that may or may not need to be updated, or re-indexing documents that don't need updating.

Ideally, I think that you want an event-driven system where the content management system or the like indicates to your search engine when a page/document gets updated. That way, you know that documents are as up to date as possible in terms of searches, and you know that you aren't doing unnecessary work.

-Original Message-
From: Luke Francl [mailto:[EMAIL PROTECTED]
Sent: Monday, January 10, 2005 11:09 AM
To: Lucene Users List
Subject: Re: How do you handle dynamic html pages?

On Mon, 2005-01-10 at 10:03, Jim Lynch wrote:
> How is anyone managing reindexing of pages that change? Just
> periodically reindex everything or do you try to determine frequency of
> changes to each page and/or site?

If you are using a CMS, your best bet is to integrate Lucene with the CMS's content update mechanism. That way, your index will always be up to date.

Otherwise, I would say reindexing everything is easiest, provided it doesn't take too long. If it's ~15 minutes or less, you could schedule a process to do it at a low-activity period (2 AM or whenever) every day and that would probably handle your needs.

Regards,
Luke Francl

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: what if the IndexReader crashes, after delete, before close.
One thing that will happen is the lock file will get left behind. This means when you start back up and try to create another Reader you will get a file lock error. Our system is threaded and synchronized. Thus when a Reader is being created I know it is the only one (the Writer comes after the reader has been closed). Before creating it I check if the Index is locked. If it is, I forcefully clear it. This prevents the above problem from happening. Luke - Original Message - From: "Peter Veentjer - Anchor Men" <[EMAIL PROTECTED]> To: Sent: Saturday, January 08, 2005 4:08 AM Subject: what if the IndexReader crashes, after delete, before close. What happens to the Index if the IndexReader crashes, after I have deleted documents, and before I have called close. Are the deletes ignored? Is the Index screwed up? Is the filesystem screwed up (if a document is deleted new delete-files appear) so are the delete-files still there (and can these be ignored the next time?). Can I restore the index to the previous state, just by removing those delete-files? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Check to see if index is optimized
This may not be a simple way, but you could just do a quick check on the folder to see if there is more than one file containing the name segment. Luke - Original Message - From: "Crump, Michael" <[EMAIL PROTECTED]> To: Sent: Friday, January 07, 2005 2:24 PM Subject: Check to see if index is optimized Hello, Lucene is great! I just have a question. Is there a simple way to check and see if an index is already optimized? What happens if optimize is called on an already optimized index - does the call basically do a noop? Or is it still and expensive call? Regards, Michael - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: questions
Hello Jac;

If you have verified that the index folder is indeed being created and there are segments file(s) in it, check that the IndexSearcher in the demo is pointing to that location. This is an easy error to make and would account for the "no segments" error message.

Luke

- Original Message -
From: "jac jac" <[EMAIL PROTECTED]>
To: "Lucene Users List"
Sent: Friday, January 07, 2005 2:03 AM
Subject: questions

> Hi I am a newbie and i just installed Tomcat on my machine.
> May I know, when i placed the Luceneweb folder in the webapps folder of Tomcat, how come I couldn't conduct the search operation when i test the website? Did I miss out anything?
>
> It prompts me that there is no c:\opt\index\segment folder...
> I created it but i still couldn't get Lucene to work...
>
> At http://jakarta.apache.org/lucene/docs/demo.html:
> under the Indexing Files instruction, where should I do the following: "type "java org.apache.lucene.demo.IndexFiles {full-path-to-lucene}/src""???
> Is it a must to install ant?
>
> Please kindly help!!! Thanks very much in advance
>
> regards,
> jac

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: how to create a long lasting unique key?
This is taken from the example code written by Doug Cutting that ships with Lucene. It is the key our system uses. It also comes in handy when incrementally updating.

Luke

public static String uid(File f) {
    // Append path and date into a string in such a way that lexicographic
    // sorting gives the same results as a walk of the file hierarchy. Thus
    // null (\u0000) is used both to separate directory components and to
    // separate the path from the date.
    return f.getPath().replace(dirSep, '\u0000') + "\u0000"
        + DateField.timeToString(f.lastModified());
}

- Original Message -
From: "Peter Veentjer - Anchor Men" <[EMAIL PROTECTED]>
To:
Sent: Tuesday, January 04, 2005 2:43 PM
Subject: how to create a long lasting unique key?

What is the best way to create a key for a document? I know the id (from Hits) cannot be used, but what is a good way to create a key?

I need this key for a web application. At the moment every document can be identified with the file-location key, but I would rather have some kind of integer for the job (nobody needs to know the file location).

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
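The reason this uid doubles as an incremental-update key: stored as an indexed keyword, a changed file produces a new uid (the timestamp part differs), so the stale entry can be found and removed by term. A minimal sketch (1.4-era API; the "uid" field name follows the Lucene demo, and staleUid is a placeholder):

// at indexing time, index the uid so the doc can be found again:
doc.add(Field.Keyword("uid", uid(file)));

// on re-index, drop the previous version of a changed file by its old uid:
IndexReader reader = IndexReader.open(indexDir);
reader.delete(new Term("uid", staleUid));
reader.close();

Because the uid sorts the same way as a walk of the file hierarchy, the demo code can also iterate the "uid" terms in parallel with a directory walk to detect adds, deletes, and changes in one pass.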
Re: Problems...
I had a similar situation with the same problem. I found the previous system was creating all the objects (including the Searcher) and then updating the Index. The result was the Searcher was not able to find any of the data just added to the Index. The solution for me was to move the creation of the Searcher to after the Index had been updated and the Reader and Writer objects had been closed. Also ensure the Searcher uses the same Analyzer as the IndexWriter used to create the Index. This is a good tool for checking what is in your index. It may help with the troubleshooting: http://www.getopt.org/luke/ Luke - Original Message - From: "Ross Rankin" <[EMAIL PROTECTED]> To: Sent: Tuesday, January 04, 2005 10:53 AM Subject: Problems...

> (Bear with me; I have inherited this system from another developer who is no
> longer with the company. So I am not familiar with Lucene at all. I just
> have got the task of "Fixing the search".)
>
> I have a servlet that runs every 10 minutes and indexes, and I can see files
> being created in the index path on that interval (fdt, fdx, fnm, frq, etc.);
> however the search function is no longer working. I'm not getting anything
> in the log that I can point to that says what is not working, the search or
> the index. But since the index files seem to change size/date stamp as they
> have in the past, I'm leaning towards the search function.
>
> I'm not sure where or how to troubleshoot. Can I examine the indexes with
> anything to see what is there and that it's meaningful? Is there something
> simple I can do to track down what doesn't work in the process? Thanks.
>
> Ross
>
> Here's the search function:
>
> public Hits search(String searchString, String resellerId) {
>     int currentOffset = 0;
>     try {
>         currentOffset = Integer.parseInt(paramOffset);
>     } catch (NumberFormatException e) {}
>
>     System.out.println("\n\t\tSearch for " + searchString + " off = " + currentOffset);
>     if (currentOffset > 0) {
>         // if the user only requested the next n items from the search returns
>         return hits;
>     }
>
>     // performs a new search
>     try {
>         hits = null;
>         try {
>             searcher.close();
>         } catch (Exception e) {}
>
>         searcher = new IndexSearcher(pathToIndex);
>         Analyzer analyzer = new StandardAnalyzer();
>
>         String searchQuery = LuceneConstants.FIELD_RESELLER_IDS + ":"
>                 + resellerId
>                 + " AND "
>                 + LuceneConstants.FIELD_FULL_DESCRIPTION + ":" + searchString;
>
>         Query query = null;
>         try {
>             query = QueryParser.parse(searchQuery,
>                     LuceneConstants.FIELD_FULL_DESCRIPTION, analyzer);
>         } catch (ParseException e) {
>             // if an exception occurs parsing the search string entered by the user,
>             // escapes all the special lucene chars and tries to make the query again
>             searchQuery = LuceneConstants.FIELD_RESELLER_IDS + ":"
>                     + resellerId + " AND "
>                     + LuceneConstants.FIELD_FULL_DESCRIPTION + ":"
>                     + escape(searchString);
>             query = QueryParser.parse(searchQuery,
>                     LuceneConstants.FIELD_FULL_DESCRIPTION, analyzer);
>         }
>         System.out.println("Searching for: " + query.toString(LuceneConstants.FIELD_FULL_DESCRIPTION));
>
>         hits = searcher.search(query);
>         System.out.println(hits.length() + " total matching documents");
>         //searcher.close();
>
>     } catch (Exception e) {
>         e.printStackTrace();
>     }
>
>     return hits;

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
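To make the ordering concrete, a minimal sketch (the index path and field name are placeholders; the analyzer must be the same family used at index time):

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;

public class UpdateThenSearch {
    public static IndexSearcher rebuild(String indexPath) throws IOException {
        // 1. Do all the writing first.
        IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);
        Document doc = new Document();
        doc.add(Field.Text("contents", "some new content"));
        writer.addDocument(doc);
        writer.close(); // 2. Close the writer so everything is flushed to disk.

        // 3. Only now create the searcher; it will see the documents just added.
        return new IndexSearcher(indexPath);
    }
}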
Re: Deleting an index
If you opened an IndexReader, has it also been closed before you attempt to delete? - Original Message - From: "Scott Smith" <[EMAIL PROTECTED]> To: Sent: Monday, January 03, 2005 7:39 PM Subject: Deleting an index I'm writing some junit tests for my search code (which layers on top of Lucene). The tests all follow the same pattern: 1. setUp(): Create some directories; write some files to be indexed 2. someTest: Call the indexer to create an index on the generated files; do several searches and verify counts, expected hits, etc.; 3. tearDown(): Delete all of the directories and associated files including the just-created index. My problem is that I am unable to delete the index. I've narrowed it down to something in the search routine not letting go of the index file (i.e., if I do the indexing and comment out the search, then everything deletes fine). The search code is pretty straightforward. It creates a new IndexSearcher (which it caches and hence uses for all searches in the test). Each individual search simply creates several QueryParsers and then combines them to do a search using the cached IndexSearcher. After the last search, I close() the IndexSearcher. But something still seems to have hold of the index. I've tried nulling the hits object, but that didn't seem to affect anything. Any ideas? Scott - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
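A minimal sketch of a tearDown() order that lets the index directory be deleted. It assumes the cached IndexSearcher is the only open handle on the index; any separately opened IndexReader needs the same treatment:

import java.io.File;
import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;

public class IndexTearDown {
    // Close the searcher first: an open IndexSearcher holds file handles
    // that keep index files alive (deletes fail outright on Windows).
    public static void deleteIndex(IndexSearcher searcher, String indexPath)
            throws IOException {
        if (searcher != null) {
            searcher.close();
        }
        File dir = new File(indexPath);
        File[] files = dir.listFiles();
        for (int i = 0; i < files.length; i++) {
            files[i].delete();
        }
        dir.delete();
    }
}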
Re: LIMO problems
This is a good place to start for extracting the content from PowerPoint files: http://www.mail-archive.com/poi-user@jakarta.apache.org/msg04809.html Luke - Original Message - From: "Daniel Cortes" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Monday, December 13, 2004 10:46 AM Subject: Re: LIMO problems > > Hi, I want to know which library you use to search in PPT files. > Does POI support this? > thanks > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene in Action e-book now available!
Nice work! Congratulations, guys. - Original Message - From: "Erik Hatcher" <[EMAIL PROTECTED]> To: "Lucene User" <[EMAIL PROTECTED]>; "Lucene List" <[EMAIL PROTECTED]> Sent: Friday, December 10, 2004 3:52 AM Subject: Lucene in Action e-book now available! > The Lucene in Action e-book is now available at Manning's site: > > http://www.manning.com/hatcher2 > > Manning also put lots of other goodies there: the table of contents, > "about this book", the preface, the foreword from Doug Cutting himself > (thanks Doug!!!), and a couple of sample chapters. The complete source > code is there as well. > > Now comes the exciting part to find out what others think of the work > Otis and I spent 14+ months of our lives on. > > Erik > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: LIMO problems
I use "Luke". It is pretty good. http://www.getopt.org/luke/ Luke - Original Message - From: "Daniel Cortes" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Thursday, December 09, 2004 8:32 AM Subject: LIMO problems > Hi, I'm tying Limo (Index Monitor of Lucene) and I have a problem, > obviously it will be a silly problem but now I don't > have solution. > Someone can tell me how structure it have limo.properties file? > because I have any example thanks. > If you know another web-aplication for administration Lucenes Index say me. > Thanks for all, and excuse me for my silly questions. > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Help to remove document
Hi; The IndexReader has a delete method that can do this: public final void delete(int docNum) throws IOException Deletes the document numbered docNum. Once a document is deleted it will not appear in TermDocs or TermPositions enumerations. Attempts to read its field with the document(int) method will result in an error. The presence of this document may still be reflected in the docFreq(org.apache.lucene.index.Term) statistic, though this will be corrected eventually as the index is further modified. There is an example of how it can be used in the Lucene demo. Ensure you re-create the IndexSearcher for the change to be reflected in your search queries. Luke - Original Message - From: "Alex Kiselevski" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Wednesday, December 08, 2004 9:34 AM Subject: Help to remove document Hello, Help me pls, I want to know how to remove document from index Alex Kiselevsky Speech Technology Tel: 972-9-776-43-46 R&D, Amdocs - Israel Mobile: 972-53-63 50 38 mailto:[EMAIL PROTECTED] The information contained in this message is proprietary of Amdocs, protected from disclosure, and may be privileged. The information is intended to be conveyed only to the designated recipient(s) of the message. If the reader of this message is not the intended recipient, you are hereby notified that any dissemination, use, distribution or copying of this communication is strictly prohibited and may be unlawful. If you have received this communication in error, please notify us immediately by replying to the message and deleting it from your computer. Thank you. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
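Deleting by document number is awkward when all you know is a field value, so in practice the delete(Term) overload is often more useful. A minimal sketch (the "uid" field name is a placeholder for whatever key field you index):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;

public class DeleteByKey {
    // Delete every document whose key field matches, then reopen the
    // searcher so the change is visible to subsequent queries.
    public static IndexSearcher deleteAndRefresh(String indexPath, String uid)
            throws IOException {
        IndexReader reader = IndexReader.open(indexPath);
        int removed = reader.delete(new Term("uid", uid));
        reader.close(); // flushes the deletions to disk
        System.out.println("Removed " + removed + " document(s)");
        return new IndexSearcher(indexPath);
    }
}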
Re: Weird Behavior On Windows
Hey Otis; You're right again. It turned out there was an exception around the usage of the Digester class that wasn't being written to the log. This exception was being thrown as a result of a configuration issue with the server. Everything is back to normal. Thanks! Luke - Original Message - From: "Otis Gospodnetic" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Tuesday, December 07, 2004 6:27 PM Subject: Re: Weird Behavior On Windows > The index has been modified, so you need a new IndexSearcher. Could > there be logic in the flaw (swap that), or could you be catching an > Exception that is thrown only on Winblows due to Windows not letting > you do certain things with referenced files and dirs? > > Otis > > --- Luke Shannon <[EMAIL PROTECTED]> wrote: > > > Hello All; > > > > Things have been running smoothly on Linux for some time. We set up a > > version > > of the site on a Win2K machine, and this is when all the "fun" started. > > > > A pdf would be added to the system. The indexer would run, find the > > new > > file, index it and successfully complete the update of the index > > folder. No > > IO error, no errors of any kind. Just like on the Linux box. > > > > Now we would try to search for a term in the document. 0 results > > would be > > returned. To make matters worse, if I run a search on a term that > > shows up in > > a bunch of documents, on Windows it only finds 2 results, whereas on > > Linux it > > would find 50 (same content). > > > > Using "Luke" I was able to verify that the pdf in question is in the > > index. > > Why can't the searcher find it? > > > > Any ideas would be welcome. > > > > Luke > > > > > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Weird Behavior On Windows
Hi Otis; Each time a search request comes in I create a new searcher (same analyzer as used during indexing). The idea about catching an error somewhere is interesting, although in most of the cases where I catch an exception I write to a log file. Anyway, this is all I have to go on, so I am looking into exceptions now... Luke - Original Message - From: "Otis Gospodnetic" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Tuesday, December 07, 2004 6:27 PM Subject: Re: Weird Behavior On Windows > The index has been modified, so you need a new IndexSearcher. Could > there be logic in the flaw (swap that), or could you be catching an > Exception that is thrown only on Winblows due to Windows not letting > you do certain things with referenced files and dirs? > > Otis > > --- Luke Shannon <[EMAIL PROTECTED]> wrote: > > > Hello All; > > > > Things have been running smoothly on Linux for some time. We set up a > > version > > of the site on a Win2K machine, and this is when all the "fun" started. > > > > A pdf would be added to the system. The indexer would run, find the > > new > > file, index it and successfully complete the update of the index > > folder. No > > IO error, no errors of any kind. Just like on the Linux box. > > > > Now we would try to search for a term in the document. 0 results > > would be > > returned. To make matters worse, if I run a search on a term that > > shows up in > > a bunch of documents, on Windows it only finds 2 results, whereas on > > Linux it > > would find 50 (same content). > > > > Using "Luke" I was able to verify that the pdf in question is in the > > index. > > Why can't the searcher find it? > > > > Any ideas would be welcome. > > > > Luke > > > > > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Weird Behavior On Windows
Hello All; Things have been running smoothly on Linux for some time. We set up a version of the site on a Win2K machine, and this is when all the "fun" started. A pdf would be added to the system. The indexer would run, find the new file, index it and successfully complete the update of the index folder. No IO error, no errors of any kind. Just like on the Linux box. Now we would try to search for a term in the document. 0 results would be returned. To make matters worse, if I run a search on a term that shows up in a bunch of documents, on Windows it only finds 2 results, whereas on Linux it would find 50 (same content). Using "Luke" I was able to verify that the pdf in question is in the index. Why can't the searcher find it? Any ideas would be welcome. Luke - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Read locks on indexes
I think the read locks are preventing you from deleting from the index with your reader and writing to the index with a writer at the same time. If you never use a writer then I guess you don't need to worry about this. But how do you create the indexes? Luke - Original Message - From: "Shawn Konopinsky" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Tuesday, December 07, 2004 4:17 PM Subject: Read locks on indexes > Hi, > > I have a question regarding read locks on indexes. I have the situation > where I have n applications (separate JVMs) running queries. These > applications are read-only, and never use an IndexWriter. > > The index is only ever updated using rsync. The applications don't need > up-to-the-minute updates; the data from when the reader was created is > fine. > > My question is whether it's ok to disable read locks in this scenario? > What are read locks protecting? > > Best, > Shawn. > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: PDF Indexing Error
Hi Ben; Actually I think I did update PDFBox. I will put it back to the version I previously had. Luke - Original Message - From: "Ben Litchfield" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Thursday, December 02, 2004 8:20 PM Subject: Re: PDF Indexing Error > > This error is because of security settings that have been applied to the > PDF document which disallow text extraction. > > Not sure why you would all of a sudden get this error, unless you upgraded > recently. Older versions of PDFBox did not fully support PDF security. > > Ben > > On Thu, 2 Dec 2004, Luke Shannon wrote: > > > Hello All; > > > > Perhaps this should be on the PDFBox forum but I was curious if anyone has > > seen this error parsing PDF documents using packages other than PDFBox. > > > > /usr/tomcat/fb_hub/GM/Administration/Document/java/java_io.pdf > > java.io.IOException: You do not have permission to extract text > > > > The weird thing is it gave this error on a document I have indexed a million > > times over the last 3 weeks. > > > > Thanks, > > > > Luke > > > > > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
PDF Indexing Error
Hello All; Perhaps this should be on the PDFBox forum but I was curious if anyone has seen this error parsing PDF documents using packages other than PDFBox. /usr/tomcat/fb_hub/GM/Administration/Document/java/java_io.pdf java.io.IOException: You do not have permission to extract text The weird thing is it gave this error on a document I have indexed a million times over the last 3 weeks. Thanks, Luke - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Optimized??
As I understand it, optimization is when you merge several segments into one, allowing for faster queries. The FAQs and API have further details. http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q24 Luke - Original Message - From: "Miguel Angel" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Saturday, November 20, 2004 5:19 PM Subject: Optimized?? What does an optimized index mean in Lucene? -- Miguel Angel Angeles R. Asesoria en Conectividad y Servidores Telf. 97451277 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
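In code, optimizing is a single call on the writer; a minimal sketch (the index path is a placeholder, and since nothing is added here the analyzer choice shouldn't matter):

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class Optimize {
    // Merge all segments into one; typically run once after a batch of updates.
    public static void run(String indexPath) throws IOException {
        IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);
        writer.optimize();
        writer.close();
    }
}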
Re: How much time indexing doc ??
PDF(s) can definitely slow things down, depending on their size. If there are a few larger PDF documents, that time is definitely possible. Luke - Original Message - From: "Miguel Angel" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Saturday, November 20, 2004 11:25 AM Subject: How much time indexing doc ?? > Hi, I have 1000 docs (Word, PDF and HTML); those documents indexed > in 5 min. Is this correct, or do I have a problem with my Analyzer? I > used StandardAnalyzer > -- > Miguel Angel Angeles R. > Asesoria en Conectividad y Servidores > Telf. 97451277 > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
False Locking Conflict?
Hey All; Is it possible for there to be a situation where the lock file is in place after the reader has been closed? I have extra logging in place and have followed the code execution. The reader finishes deleting old content and closes (I know this for sure). This is the only reader instance I have for the class (it is a static member). The reader is not re-opened. I try to open the writer and I get my old friend: java.io.IOException: Lock obtain timed out: Lock@/usr/tomcat/jakarta-tomcat-5.0.19/temp/lucene-398fbd170a5457d05e2f4d43210f7fe8-write.lock This code is synchronized so I am sure there are no other processes trying to do the same thing. It looks to me like the reader is closing and the lock file is not being removed. Is this possible? Luke
Re: DOC, PPT index???
Check out: http://jakarta.apache.org/poi/ - Original Message - From: "Miguel Angel" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Thursday, November 18, 2004 4:49 PM Subject: DOC, PPT index??? > Hi!!! > Can Lucene index MS Office files (doc, ppt)? > How do you index these (doc, ppt)? > -- > Miguel Angel Angeles R. > Asesoria en Conectividad y Servidores > Telf. 97451277 > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: version documents
Thank you for the suggestion. I ended up biting the bullet and re-working my indexing logic. Luckily the system itself knows what the "current" version of a document is for any given folder (otherwise it wouldn't know which one to display to the user). I was able to get a static method I can call, passing in a folder name; the method returns the file name of the current version for that folder. Each time I am doing an incremental update, if I find that a document from a folder hasn't changed, I make sure it is the current version before moving on. If it isn't, I remove it from the index. Then when I am creating a new index or adding files to an existing one, for each file I have to check that the file I am adding is the current version for the folder before adding it. As you can imagine this slows down indexing (creating a new index or updating an existing one), but it ensures content from an old version will never show up in a query. Luke - Original Message - From: "Yonik Seeley" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]>; "Justin Swanhart" <[EMAIL PROTECTED]> Sent: Thursday, November 18, 2004 1:32 PM Subject: Re: version documents > This won't fully work. You still need to delete the > original out of the lucene index to avoid it showing > up in searches. > > Example: > myfile v1: "I want a cat" > myfile v2: "I want a dog" > > If you change "cat" to "dog" in myfile, and then do a > search for "cat", you will *only* get v1 and hence the > sort on version doesn't help. > > -Yonik > > > --- Justin Swanhart <[EMAIL PROTECTED]> wrote: > > Split the filename into "basefilename" and "version" > > and make each a keyword. > > > > Sort your query by version descending, and only use > > the first > > "basefile" you encounter. > > > __ > Do You Yahoo!? > Tired of spam? Yahoo! Mail has the best spam protection around > http://mail.yahoo.com > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
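A minimal sketch of the check described above. currentVersionFor() is a hypothetical stand-in for the system's static method, and "uid" for the key field used by the incremental update:

import java.io.File;
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class NoteVersionFilter {
    // Hypothetical hook into the CMS: returns the file name of the current
    // version for a folder, e.g. "default_3.xml".
    static String currentVersionFor(File folder) {
        throw new UnsupportedOperationException("wire this up to your system");
    }

    // During the incremental pass: if an unchanged note is no longer the
    // current version of its folder, remove it from the index by its key.
    static void dropIfStale(IndexReader reader, File noteFile, String uid)
            throws IOException {
        String current = currentVersionFor(noteFile.getParentFile());
        if (!noteFile.getName().equals(current)) {
            reader.delete(new Term("uid", uid));
        }
    }
}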
Re: PDF Index Time
Hi Ben; Thank you for creating such an easy-to-use package for indexing PDFs. I will keep PDFBox in the system and wait for the next release. Thanks for the update. Luke - Original Message - From: "Ben Litchfield" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Thursday, November 18, 2004 12:33 PM Subject: Re: PDF Index Time > > PDFBox is slow; there is an open issue for it on the SourceForge site, and > I am actively working on improving speed. You should see significant > improvements in the next release. > > I have not extensively tried the Snowtide package, but they have a trial > download and the docs show that it should be just as easy to integrate as > PDFBox is. They list pricing on their site as well; it is nice that it > is not hidden, as some software companies do. > > Ben > > On Thu, 18 Nov 2004, Luke Shannon wrote: > > > Hi; > > > > I am using PDFBox's getLuceneDocument method to parse my PDF > > documents. It returns good results and was very easy to integrate into > > the project. However, it is slow. > > > > Does anyone know of a faster package? Someone mentioned Snowtide on an > > earlier post. Anyone have experience with this package? > > > > Luke > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
PDF Index Time
Hi; I am using PDFBox's getLuceneDocument method to parse my PDF documents. It returns good results and was very easy to integrate into the project. However, it is slow. Does anyone know of a faster package? Someone mentioned Snowtide on an earlier post. Anyone have experience with this package? Luke
Re: urgent help needed
These are the ones, I think. They were the first things I read on Lucene and were very helpful. http://www.onjava.com/pub/a/onjava/2003/03/05/lucene.html http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html - Original Message - From: "Neelam Bhatnagar" <[EMAIL PROTECTED]> To: "Otis Gospodnetic" <[EMAIL PROTECTED]> Cc: <[EMAIL PROTECTED]> Sent: Thursday, November 18, 2004 10:45 AM Subject: RE: urgent help needed Hello, Thank you for your help. Could you tell us the URL of the online version of these articles? Thanks and regards Neelam Bhatnagar -Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: Thursday, November 18, 2004 9:12 PM To: [EMAIL PROTECTED] Cc: Neelam Bhatnagar Subject: Re: urgent help needed Redirecting to the more appropriate lucene-user list. Hello, About 2 years ago I wrote 2 articles for O'Reilly Network, where I believe I mentioned this issue and provided some context. Make sure your index is optimized. If that doesn't help, switch to the compound index format (1 set call on IndexWriter instance). You can also adjust your OS's limits - the articles I mentioned cover this for a few UNIX shells. Otis --- Neelam Bhatnagar <[EMAIL PROTECTED]> wrote: > Hi, > > I have posted this several times before but there has been no > response. > We really need to resolve this as soon as possible. Kindly help us. > > We have been using Lucene 3.1 version with Tomcat 4.0 and jdk1.4. > It seems that sometimes we see a "Too many files open" exception > which > completely garbles the whole index, and the whole search functionality > crashes on the web site. It has also been known to crash the complete > JSP container of tomcat. > > After looking at the bug list, we found out that it has been reported > as > a bug in the Lucene bug list as Bug#29774, #30049, #30452, which > are claimed > to have been resolved with the new version of Lucene. > > We have tried everything to reproduce the problem ourselves to figure > out the exact circumstances under which it occurs but without any > luck. > > > We would be installing the new version of Lucene but we need to be > able > to reproduce the problem to test it. > > We would really appreciate it if someone could point us to the root > cause behind this so we can devise a solution around that. > > Thanks and regards > Neelam Bhatnagar > > Technology| Sapient > Presidency Building > Mehrauli-Gurgaon Road > Sector-14, Gurgaon-122001 > Haryana, India > > Tel: 91.124.2826299 > Cell: 91.9899591054 > Email: [EMAIL PROTECTED] > > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: version documents
That is a good idea. Thanks! - Original Message - From: "Justin Swanhart" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Wednesday, November 17, 2004 3:38 PM Subject: Re: version documents > Split the filename into "basefilename" and "version" and make each a keyword. > > Sort your query by version descending, and only use the first > "basefile" you encounter. > > On Wed, 17 Nov 2004 15:05:19 -0500, Luke Shannon > <[EMAIL PROTECTED]> wrote: > > Hey all; > > > > I have ran into an interesting case. > > > > Our system has notes. These need to be indexed. They are xml files called default.xml and are easily parsed and indexed. No problem, have been doing it all week. > > > > The problem is if someone edits the note, the system doesn't update the default.xml. It creates a new file, default_1.xml (every edit creates a new file with an incremented number, the sytem only displays the content from the highest number). > > > > My problem is I index all the documents and end up with terms that were taken out of note several version ago still showing up in the query. From my point of view this makes sense because the files are still in the content. But to a user it is confusing because they have no idea every change they make to a note spans a new file and now the are seeing a term they removed from their note 2 weeks ago showing up in a query. > > > > I have started modifying my incremental update to be look for multiple version of the default.xml but it is more work than I thought and is going make things complex. > > > > Maybe there is an easier way? If I just let it run and create the index, can somebody suggest a way I could easily scan the index folder ensuring only the default.xml with the highest number in its filename remains (only for folders were there is more than one default.xml file)? Or is this wishful thinking? > > > > Thanks, > > > > Luke > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
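A sketch of Justin's suggestion (field names and the zero-padding width are assumptions; Sort and SortField need Lucene 1.4+):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

public class VersionedNote {
    // Index the note name split into two keyword fields, zero-padding the
    // version so lexicographic order matches numeric order.
    public static Document forNote(String baseName, int version) {
        Document doc = new Document();
        doc.add(Field.Keyword("basefilename", baseName)); // e.g. "default"
        doc.add(Field.Keyword("version", pad(version)));  // e.g. "0007"
        return doc;
    }

    private static String pad(int version) {
        String s = Integer.toString(version);
        while (s.length() < 4) {
            s = "0" + s;
        }
        return s;
    }

    // At query time: newest version first; keep only the first hit seen
    // per basefilename.
    public static Sort newestFirst() {
        return new Sort(new SortField("version", SortField.STRING, true));
    }
}

As Yonik points out in the follow-up quoted further up this thread, sorting alone doesn't stop stale terms from matching, so old versions still need to be deleted from the index.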
version documents
Hey all; I have run into an interesting case. Our system has notes. These need to be indexed. They are xml files called default.xml and are easily parsed and indexed. No problem, have been doing it all week. The problem is if someone edits the note, the system doesn't update the default.xml. It creates a new file, default_1.xml (every edit creates a new file with an incremented number; the system only displays the content from the highest number). My problem is I index all the documents and end up with terms that were taken out of a note several versions ago still showing up in the query. From my point of view this makes sense, because the files are still in the content. But to a user it is confusing, because they have no idea every change they make to a note spawns a new file, and now they are seeing a term they removed from their note 2 weeks ago showing up in a query. I have started modifying my incremental update to look for multiple versions of the default.xml, but it is more work than I thought and is going to make things complex. Maybe there is an easier way? If I just let it run and create the index, can somebody suggest a way I could easily scan the index folder ensuring only the default.xml with the highest number in its filename remains (only for folders where there is more than one default.xml file)? Or is this wishful thinking? Thanks, Luke
Re: index document pdf
Hello; Hopefully I understand the question. 1. Modify the indexDoc(file) method to consider the pdf file type:

else if (file.getPath().endsWith(".html") || file.getPath().endsWith(".pdf")) {

2. Create a specific branch of code to create the Lucene document from the file type and then add it to the index:

if (file.getPath().endsWith(".pdf")) {
  try {
    Document doc = LucenePDFDocument.getDocument(file);
    writer.addDocument(doc);
  } catch (Exception e) {
    System.out.println("INDEXING ERROR: Unable to index pdf document: " + file.getPath() + " " + e.getMessage());
  }
}

Note: Ensure you do step 2 both for the case when uidIter != null and when it is equal to null. That should do it. Concerning PDFBox, make sure you have all the jars required. I had a little trouble getting this going at first. It needs log4j.jar to run. If you have any problems with the appenders, I found this message thread helpful. http://java2.5341.com/msg/32909.html Luke - Original Message - From: "Miguel Angel" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Wednesday, November 17, 2004 12:28 PM Subject: index document pdf > Hi, I downloaded pdfbox 0.6.4. What do I add to the source code of > the Lucene demo? > -- > Miguel Angel Angeles R. > Asesoria en Conectividad y Servidores > Telf. 97451277 > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: tool to check the index field
Try this: http://www.getopt.org/luke/ Luke - Original Message - From: "lingaraju" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Wednesday, November 17, 2004 10:00 AM Subject: tool to check the index field > HI ALL > > I have an index file created by other people. > Now I want to know how many fields are in the index. > Is there any third party tool to do this? > I saw somewhere some GUI tool to do this but forgot the name. > > Regards > LingaRaju > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Index Locking Issues Resolved...I hope
Hello; I think I have solved my locking issues. I just made it through the set of test cases that previously resulted in Index Locking Errors. I just removed the method from my code that checks for an Index lock and forcefully removes it after 1 minute. Hopefully they never need to be put back in. Here is what I changed: I moved all my Indexer logic into a class called Index.java that implemented Runnable. Index's run() calls a method named go(), which is static and synchronized. go() kicks off all the logic to update the index (the reader, writer and other members involved with incremental updates are also static). I put logging in place that logs when a thread has executed the method and what the thread's name is. Every time a client class changes the content it can create a thread reference and pass it the runnable Index. The convention I have requested for naming the thread is a toString() of the current date. Then they start the thread. How it worked: A few users just tested the system, half added documents to the system while another half deleted documents at the same time. No locking issues were seen and the index was current with the changes made a short time after the last operation (in my previous code this test resulted in an issue with index locking). I was able to go through the log file and find the start of the synchronized go() method and the successful completion of the indexing operations for every request made. The only performance issue I noticed was if someone added a very large PDF it took a while before the thread handling the request could finish. If this is the first operation of many it means the operations following this large file take that much longer. Luckily for me search results don't need to be instant. Things are looking much better. For now... Thanks to all that helped me up till now. Luke - Original Message - From: "Otis Gospodnetic" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Tuesday, November 16, 2004 4:01 PM Subject: Re: _4c.fnm missing > 'Concurrent' and 'updates' in the same sentence sounds like a possible > source of the problem. You have to use a single IndexWriter and it > should not overlap with an IndexReader that is doing deletes. > > Otis > > --- Luke Shannon <[EMAIL PROTECTED]> wrote: > > > It consistently breaks when I run more than 10 concurrent incremental > > updates. > > > > I can post the code on Bugzilla (hopefully when I get to the site it > > will be > > obvious how I can post things). > > > > Luke > > > > - Original Message - > > From: "Otis Gospodnetic" <[EMAIL PROTECTED]> > > To: "Lucene Users List" <[EMAIL PROTECTED]> > > Sent: Tuesday, November 16, 2004 3:20 PM > > Subject: Re: _4c.fnm missing > > > > > > > Field names are stored in the field info file, with suffix .fnm. - see > > > http://jakarta.apache.org/lucene/docs/fileformats.html > > > > > > The .fnm should be inside the .cfs file (cfs files are compound files > > > that contain all index files described at the above URL). Maybe you > > > can provide the code that causes this error in Bugzilla for somebody to > > > look at. Does it consistently break? > > > > > > Otis > > > > > > > > > --- Luke Shannon <[EMAIL PROTECTED]> wrote: > > > > > > > I received the error below when I was attempting to overwhelm my > > > > system with incremental update requests. > > > > > > > > What is this file it is looking for? I checked the index.
It > > > > contains: > > > > > > > > _4c.del > > > > _4d.cfs > > > > deletable > > > > segments > > > > > > > > Where does _4c.fnm come from? > > > > > > > > Here is the error: > > > > > > > > Unable to create the create the writer and/or index new content > > > > /usr/tomcat/fb_hub/WEB-INF/index/_4c.fnm (No such file or > > directory). > > > > > > > > Thanks, > > > > > > > > Luke > > > > > > > > > > > - > > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > > For additional commands, e-mail: > > [EMAIL PROTECTED] > > > > > > > > > > > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: _4c.fnm missing
It doesn't have to be to the second. If things take a few minutes it's ok. It looks like the first lock issue I'm hitting in my program is when I try and delete from the Index for the first time. No writer has been created yet, only the reader, so I am not sure why it thinks it's locked. - Original Message - From: "Nader Henein" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Tuesday, November 16, 2004 4:18 PM Subject: Re: _4c.fnm missing > That's it, you need to batch your updates. It comes down to: do you need to give your users search accuracy to the second? Take your database, put an is_dirty flag on the master table of the object you're indexing, run a scheduled task every x minutes, and have your process read the objects that are set to dirty, then reset the flag once they've been indexed correctly. > > my two cents > > Nader > > > > Otis Gospodnetic wrote: > > >'Concurrent' and 'updates' in the same sentence sounds like a possible > >source of the problem. You have to use a single IndexWriter and it > >should not overlap with an IndexReader that is doing deletes. > > > >Otis > > > >--- Luke Shannon <[EMAIL PROTECTED]> wrote: > > > >>It consistently breaks when I run more than 10 concurrent incremental > >>updates. > >> > >>I can post the code on Bugzilla (hopefully when I get to the site it > >>will be obvious how I can post things). > >> > >>Luke > >> > >>- Original Message - > >>From: "Otis Gospodnetic" <[EMAIL PROTECTED]> > >>To: "Lucene Users List" <[EMAIL PROTECTED]> > >>Sent: Tuesday, November 16, 2004 3:20 PM > >>Subject: Re: _4c.fnm missing > >> > >>>Field names are stored in the field info file, with suffix .fnm. - see > >>>http://jakarta.apache.org/lucene/docs/fileformats.html > >>> > >>>The .fnm should be inside the .cfs file (cfs files are compound files > >>>that contain all index files described at the above URL). Maybe you > >>>can provide the code that causes this error in Bugzilla for somebody to > >>>look at. Does it consistently break? > >>> > >>>Otis > >>> > >>>--- Luke Shannon <[EMAIL PROTECTED]> wrote: > >>> > >>>>I received the error below when I was attempting to overwhelm my > >>>>system with incremental update requests. > >>>> > >>>>What is this file it is looking for? I checked the index. It > >>>>contains: > >>>> > >>>>_4c.del > >>>>_4d.cfs > >>>>deletable > >>>>segments > >>>> > >>>>Where does _4c.fnm come from? > >>>> > >>>>Here is the error: > >>>> > >>>>Unable to create the create the writer and/or index new content > >>>>/usr/tomcat/fb_hub/WEB-INF/index/_4c.fnm (No such file or directory). > >>>> > >>>>Thanks, > >>>> > >>>>Luke > >>> > >>- > >>To unsubscribe, e-mail: [EMAIL PROTECTED] > >>For additional commands, e-mail: [EMAIL PROTECTED] > > > >- > >To unsubscribe, e-mail: [EMAIL PROTECTED] > >For additional commands, e-mail: [EMAIL PROTECTED] > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: _4c.fnm missing
This is the latest error I have received: "IndexReader out of date and no longer valid for delete, undelete, or setNorm operations". I need to synchronize this process more carefully. I think this goes back to the point that during my incremental update I sometimes need to forcefully clear the lock on the Index. I am not managing the deleting and writing to the Index correctly. The first thing I am doing is tracking down the cause of this situation so I don't need to forcefully clear locks anymore. - Original Message - From: "Nader Henein" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Tuesday, November 16, 2004 3:39 PM Subject: Re: _4c.fnm missing > what kind of incremental updates are you doing? Because we update our index every 15 minutes with 100 ~ 200 documents and we're writing to a 6 GB memory-resident index, and the IndexWriter runs one instance at a time, so what kind of increments are we talking about? It takes a bit of doing to overwhelm Lucene. > > What's your update schedule, how big is the index, and after how many updates does the system crash? > > Nader Henein > > > > Luke Shannon wrote: > > >It consistently breaks when I run more than 10 concurrent incremental > >updates. > > > >I can post the code on Bugzilla (hopefully when I get to the site it will be > >obvious how I can post things). > > > >Luke > > > >- Original Message - > >From: "Otis Gospodnetic" <[EMAIL PROTECTED]> > >To: "Lucene Users List" <[EMAIL PROTECTED]> > >Sent: Tuesday, November 16, 2004 3:20 PM > >Subject: Re: _4c.fnm missing > > > >>Field names are stored in the field info file, with suffix .fnm. - see > >>http://jakarta.apache.org/lucene/docs/fileformats.html > >> > >>The .fnm should be inside the .cfs file (cfs files are compound files > >>that contain all index files described at the above URL). Maybe you > >>can provide the code that causes this error in Bugzilla for somebody to > >>look at. Does it consistently break? > >> > >>Otis > >> > >>--- Luke Shannon <[EMAIL PROTECTED]> wrote: > >> > >>>I received the error below when I was attempting to overwhelm my > >>>system with incremental update requests. > >>> > >>>What is this file it is looking for? I checked the index. It > >>>contains: > >>> > >>>_4c.del > >>>_4d.cfs > >>>deletable > >>>segments > >>> > >>>Where does _4c.fnm come from? > >>> > >>>Here is the error: > >>> > >>>Unable to create the create the writer and/or index new content > >>>/usr/tomcat/fb_hub/WEB-INF/index/_4c.fnm (No such file or directory). > >>> > >>>Thanks, > >>> > >>>Luke > >> > >- > >To unsubscribe, e-mail: [EMAIL PROTECTED] > >For additional commands, e-mail: [EMAIL PROTECTED] > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
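One way to avoid the stale-reader error is to open the reader just in time. A minimal sketch (the "uid" key field is an assumption, and synchronized stands in for whatever serializes your updates):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class FreshDelete {
    // Open the reader immediately before deleting and close it right after;
    // a reader opened earlier goes stale as soon as a writer commits.
    static synchronized void deleteByUid(String indexPath, String uid)
            throws IOException {
        IndexReader reader = IndexReader.open(indexPath);
        try {
            reader.delete(new Term("uid", uid));
        } finally {
            reader.close(); // also releases the lock the delete acquired
        }
    }
}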
Re: _4c.fnm missing
The schedule is determined by the users of the system. Basically, when the user(s) change the content (adding/deleting a folder or file, or modifying a file's content) through a web-based interface, a re-index of the content is required. This could happen 20 times in the span of a few seconds or once in an hour. I doubt I am overwhelming Lucene; I think the problem is in my code and how I am managing the deleting and writing to the Index. - Original Message - From: "Nader Henein" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Tuesday, November 16, 2004 3:39 PM Subject: Re: _4c.fnm missing > what kind of incremental updates are you doing? Because we update our index every 15 minutes with 100 ~ 200 documents and we're writing to a 6 GB memory-resident index, and the IndexWriter runs one instance at a time, so what kind of increments are we talking about? It takes a bit of doing to overwhelm Lucene. > > What's your update schedule, how big is the index, and after how many updates does the system crash? > > Nader Henein > > > > Luke Shannon wrote: > > >It consistently breaks when I run more than 10 concurrent incremental > >updates. > > > >I can post the code on Bugzilla (hopefully when I get to the site it will be > >obvious how I can post things). > > > >Luke > > > >- Original Message - > >From: "Otis Gospodnetic" <[EMAIL PROTECTED]> > >To: "Lucene Users List" <[EMAIL PROTECTED]> > >Sent: Tuesday, November 16, 2004 3:20 PM > >Subject: Re: _4c.fnm missing > > > >>Field names are stored in the field info file, with suffix .fnm. - see > >>http://jakarta.apache.org/lucene/docs/fileformats.html > >> > >>The .fnm should be inside the .cfs file (cfs files are compound files > >>that contain all index files described at the above URL). Maybe you > >>can provide the code that causes this error in Bugzilla for somebody to > >>look at. Does it consistently break? > >> > >>Otis > >> > >>--- Luke Shannon <[EMAIL PROTECTED]> wrote: > >> > >>>I received the error below when I was attempting to overwhelm my > >>>system with incremental update requests. > >>> > >>>What is this file it is looking for? I checked the index. It > >>>contains: > >>> > >>>_4c.del > >>>_4d.cfs > >>>deletable > >>>segments > >>> > >>>Where does _4c.fnm come from? > >>> > >>>Here is the error: > >>> > >>>Unable to create the create the writer and/or index new content > >>>/usr/tomcat/fb_hub/WEB-INF/index/_4c.fnm (No such file or directory). > >>> > >>>Thanks, > >>> > >>>Luke > >> > >- > >To unsubscribe, e-mail: [EMAIL PROTECTED] > >For additional commands, e-mail: [EMAIL PROTECTED] > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: _4c.fnm missing
It consistently breaks when I run more than 10 concurrent incremental updates. I can post the code on Bugzilla (hopefully when I get to the site it will be obvious how I can post things). Luke - Original Message - From: "Otis Gospodnetic" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Tuesday, November 16, 2004 3:20 PM Subject: Re: _4c.fnm missing > Field names are stored in the field info file, with suffix .fnm. - see > http://jakarta.apache.org/lucene/docs/fileformats.html > > The .fnm should be inside the .cfs file (cfs files are compound files > that contain all index files described at the above URL). Maybe you > can provide the code that causes this error in Bugzilla for somebody to > look at. Does it consistently break? > > Otis > > > --- Luke Shannon <[EMAIL PROTECTED]> wrote: > > > I received the error below when I was attempting to overwhelm my > > system with incremental update requests. > > > > What is this file it is looking for? I checked the index. It > > contains: > > > > _4c.del > > _4d.cfs > > deletable > > segments > > > > Where does _4c.fnm come from? > > > > Here is the error: > > > > Unable to create the create the writer and/or index new content > > /usr/tomcat/fb_hub/WEB-INF/index/_4c.fnm (No such file or directory). > > > > Thanks, > > > > Luke > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
_4c.fnm missing
I received the error below when I was attempting to overwhelm my system with incremental update requests. What is this file it is looking for? I checked the index. It contains: _4c.del _4d.cfs deletable segments Where does _4c.fnm come from? Here is the error: Unable to create the create the writer and/or index new content /usr/tomcat/fb_hub/WEB-INF/index/_4c.fnm (No such file or directory). Thanks, Luke
Re: IndexSearcher Refresh
Yes it will. Thanks. - Original Message - From: "Otis Gospodnetic" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Tuesday, November 16, 2004 10:28 AM Subject: Re: IndexSearcher Refresh > This will help: > > http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#getCurrentVersion(org.apache.lucene.store.Directory) > > Otis > > > --- Luke Shannon <[EMAIL PROTECTED]> wrote: > > > It would be nice if the IndexSearcher contained a method that could > > return > > the last modified date of the index folder it was created with. > > > > This would make it easier to know when you need to create a new > > Searcher. > > > > - Original Message - > > From: "Otis Gospodnetic" <[EMAIL PROTECTED]> > > To: "Lucene Users List" <[EMAIL PROTECTED]> > > Sent: Tuesday, November 16, 2004 8:23 AM > > Subject: Re: IndexSearcher Refresh > > > > > > > I don't think so, you have to forget or close the old one and > > create a > > > new instance. > > > > > > Otis > > > > > > --- Ravi <[EMAIL PROTECTED]> wrote: > > > > > > > Is there a way to refresh the IndexSearcher object with the newly > > > > added > > > > documents to the index instead of creating a new object? > > > > > > > > Thanks in advance, > > > > Ravi. > > > > > > > > > > > > > > - > > > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > > > > > > > > > > > > > - > > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > > > > > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
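getCurrentVersion makes the refresh check cheap; a minimal sketch (assumes the String-path overload of IndexReader.getCurrentVersion present in Lucene 1.4; otherwise pass a Directory):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

public class SearcherCache {
    private final String indexPath;
    private IndexSearcher searcher;
    private long version;

    public SearcherCache(String indexPath) throws IOException {
        this.indexPath = indexPath;
        this.searcher = new IndexSearcher(indexPath);
        this.version = IndexReader.getCurrentVersion(indexPath);
    }

    // Reopen the searcher only when the index version has actually changed.
    public synchronized IndexSearcher getSearcher() throws IOException {
        long current = IndexReader.getCurrentVersion(indexPath);
        if (current != version) {
            searcher.close();
            searcher = new IndexSearcher(indexPath);
            version = current;
        }
        return searcher;
    }
}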
Re: how do you work with PDF
www.pdfbox.org Once you get the package installed the code you can use is: Document doc = LucenePDFDocument.getDocument(file); writer.addDocument(doc); This method returns the PDF in Lucene document format. Luke - Original Message - From: "Miguel Angel" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Tuesday, November 16, 2004 10:19 AM Subject: how do you work with PDF > Hi, i need know how do you work with PDF, please give the process. > Thanks... > > -- > Miguel Angel Angeles R. > Asesoria en Conectividad y Servidores > Telf. 97451277 > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: IndexSearcher Refresh
It would be nice if the IndexSearcher contained a method that could return the last modified date of the index folder it was created with. This would make it easier to know when you need to create a new Searcher. - Original Message - From: "Otis Gospodnetic" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Tuesday, November 16, 2004 8:23 AM Subject: Re: IndexSearcher Refresh > I don't think so, you have to forget or close the old one and create a > new instance. > > Otis > > --- Ravi <[EMAIL PROTECTED]> wrote: > > > Is there a way to refresh the IndexSearcher object with the newly > > added > > documents to the index instead of creating a new object? > > > > Thanks in advance, > > Ravi. > > > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Is opening IndexReader multiple times safe?
Hi Satoshi; I troubleshot a problem similar to this by moving around an IndexReader.isLocked(indexFileLocation) call to determine exactly when the reader was closed. Note: the method throws an error if the index file you are checking on doesn't exist. Luke - Original Message - From: "Satoshi Hasegawa" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Monday, November 15, 2004 8:25 PM Subject: Is opening IndexReader multiple times safe?

> Hello,
>
> I need to handle IOExceptions that arise from index access
> (IndexReader#open, #delete, IndexWriter#optimize etc.), and I'm not sure if
> the IndexReader is open when the exception is thrown/caught. Specifically,
> my code is as follows.
>
> try {
>     indexReader.delete(term);
>     indexReader.close();
>     IndexWriter indexWriter = new IndexWriter(fsDirectory,
>             new JapaneseAnalyzer(), false);
>     indexWriter.optimize();
>     indexWriter.close();
> } catch (Exception e) {
>     // IndexReader may or may not be open
>     indexReader = IndexReader.open(path);
>     indexReader.undeleteAll();
> }
>
> Is the above code safe? IndexReader may already be open at the beginning of
> the catch clause if the exception was thrown before closing the IndexReader.
>
> - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED]

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene : avoiding locking (incremental indexing)
I like the sound of the Queue approach. I also don't like that I have to forcefully unlock the index. I'm not the most experienced programmer and am on a tight deadline. The approach I ended up with was the best I could do with the experience I've got and the time I had. My indexer works so far and doesn't have to forcefully release the lock on the Index too often (the case is most likely to occur when someone removes a content file(s) and the reader needs to delete from the existing index for the first time). We will see what happens as more people use the system with large content directories. As I learn more I plan to expand the functionality of my class. Luke S - Original Message - From: <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Monday, November 15, 2004 5:50 PM Subject: Re: Lucene : avoiding locking (incremental indexing) > It really seems like I am not the only person having this issue. > > So far I am seeing 2 solutions and honestly I don't love either totally. I am thinking that without changes to Lucene itself, the best "general" way to implement this might be to have a queue of changes and have Lucene work off this queue in a single thread using a time-settable batch method. This is similar to what you are using below, but I don't like that you forcibly unlock Lucene if it shows itself locked. Using the Queue approach, only that one thread could be accessing Lucene for writes/deletes anyway, so there should be no "unknown" locking. > > I can imagine this being a very good addition to Lucene - creating a high-level interface to Lucene that manages incremental updates in such a manner. If anybody has such a general piece of code, please post it!!! I would use it tonight rather than create my own. > > I am not sure if there is anything that can be done to Lucene itself to help with this need people seem to be having. I realize the likely reasons why Lucene might need to only have one Index writer, and the additional load that might be caused by locking off pieces of the database rather than the whole database. I think I need to look in the developer archives. > > JohnE > > > > - Original Message - > From: Luke Shannon <[EMAIL PROTECTED]> > Date: Monday, November 15, 2004 5:14 pm > Subject: Re: Lucene : avoiding locking (incremental indexing) > > > Hi Luke; > > > > I have a similar system (except people don't need to see results > > immediately). The approach I took is a little different. > > > > I made my Indexer a thread with the indexing operations occurring in the run
> > method. When the IndexWriter is to be created or the IndexReader needs to
> > execute a delete I call the following method:
> >
> > private void manageIndexLock() {
> >   try {
> >     // check if the index is locked and deal with it if it is
> >     if (index.exists() && IndexReader.isLocked(indexFileLocation)) {
> >       System.out.println("INDEXING INFO: There is more than one process trying to write to the index folder. Will wait for index to become available.");
> >       // perform this loop until the lock is released or 3 mins have expired
> >       int indexChecks = 0;
> >       while (IndexReader.isLocked(indexFileLocation) && indexChecks < 6) {
> >         // increment the number of times we check the index files
> >         indexChecks++;
> >         try {
> >           // sleep for 30 seconds
> >           Thread.sleep(30000L);
> >         } catch (InterruptedException e2) {
> >           System.out.println("INDEX ERROR: There was a problem waiting for the lock to release. " + e2.getMessage());
> >         }
> >       } // closes the while loop for checking on the index directory
> >       // if we are still locked we need to do something about it
> >       if (IndexReader.isLocked(indexFileLocation)) {
> >         System.out.println("INDEXING INFO: Index Locked After 3 minutes of waiting. Forcefully releasing lock.");
> >         IndexReader.unlock(FSDirectory.getDirectory(index, false));
> >         System.out.println("INDEXING INFO: Index lock released");
> >       } // close the if that actually releases the lock
> >     } // close the if ensuring the file exists
> >   } // closes the try for all the above operations
> >   catch (IOException e1) {
> >     System.out.println("INDEX ERROR: There was a problem waiting for the lock to release. " + e1.getMessage());
> >   }
> > } // close the manageIndexLock method
> > &