Re: Unexpected end in indexing HTML file

2004-01-19 Thread Erik Hatcher
On Jan 19, 2004, at 7:27 PM, Syrén Per wrote: Hi all, Have a question concerning indexing of HTML files. One of the files I'm trying to index has an input type=image ... tag that also contains a call to a javascript with a string argument that is about 1300 characters long. At this point Lucene

Re: Extracting particular document from index

2004-01-18 Thread Erik Hatcher
On Jan 18, 2004, at 11:15 AM, Karl Koch wrote: lets say I have an index with documents encoded in two fields filename and data. Is it possible to extract a file from which I know the filename directly from this index without performing any search. Like a random access like in a filesystem? It is

Re: lucene not indexing under apache 2.0/windows?

2004-01-15 Thread Erik Hatcher
You're missing something in your explanation. Lucene does not create XML files. On Jan 15, 2004, at 11:35 AM, Pierce, Tania wrote: Let me preface this by saying I am a total beginner to apache/java/tomcat/cocoon etc. I'm thankfully fluent in xml/xslt or this would be a nightmare. Anyway, I

Re: Philosophy(??) question

2004-01-14 Thread Erik Hatcher
- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Tuesday, January 13, 2004 3:19 AM To: Lucene Users List Subject: Re: Philosophy(??) question On Jan 12, 2004, at 7:59 PM, Scott Smith wrote: I have some documents I'm indexing which have multiple languages in them (i.e., some fields

Re: Philosophy(??) question

2004-01-13 Thread Erik Hatcher
On Jan 12, 2004, at 7:59 PM, Scott Smith wrote: I have some documents I'm indexing which have multiple languages in them (i.e., some fields in the document are always English; other fields may be other languages). Now, I understand why a query against a certain field must use the same analyzer

Re: Getting word freqency?

2004-01-13 Thread Erik Hatcher
On Jan 13, 2004, at 7:26 AM, [EMAIL PROTECTED] wrote: Example: I have a very long text. I parse this text with a WhitespaceAnalyzer. From this text I generate an index. From this index I get each word together with its absolute frequency / relative frequency. Can I do it without generating an

Re: Query question

2004-01-13 Thread Erik Hatcher
On Jan 12, 2004, at 7:49 PM, Scott Smith wrote: Does the following do that: BooleanQuery Query QA = new Boolean Query(); Query qa1 = QueryParser.parse(A1, FieldA, analyzer()); Query qa2 = QueryParser.parse(A2, FieldA, analyzer()); QA.add(qa1, false, false); //

Re: Query question

2004-01-13 Thread Erik Hatcher
On Jan 13, 2004, at 5:21 PM, Scott Smith wrote: I guess what is confusing me now is that the search code no longer references an analyzer???!!! How does it know how to tokenize, stem, etc. the search terms? It doesn't. A TermQuery is exactly as-is. If you need the analysis part, you can use

Re: StandardAnalyzer and numbers indexed as text

2004-01-13 Thread Erik Hatcher
On Jan 13, 2004, at 6:19 PM, Patrick Kates wrote: I have a text field called ACTIVE_YEAR that stores (of course) a year like 2003. When I index this field I can see the number in my index (using Luke) but I can't search it. If I add a text character to the end of the field and index it (200x)

Re: Lucene based projects...?

2004-01-12 Thread Erik Hatcher
On Jan 12, 2004, at 6:24 AM, [EMAIL PROTECTED] wrote: who knows other software projects (like Nutch) which are based and build around Lucene?? I think it can be quite interesting and helpful for new people to see and learn from examples... This is the purpose of the Powered by section on

Re: merged search of document

2004-01-12 Thread Erik Hatcher
On Jan 12, 2004, at 8:21 AM, Thomas Scheffler wrote: OK, I've looked inside QueryParser and it's seems to be the right place to do that. But it's rather complicated to transform a query to another, since QueryParserTokenManager as an extreme example is not quite understandable and needs a huge

Re: HTML tag filter...

2004-01-10 Thread Erik Hatcher
On Jan 10, 2004, at 1:43 PM, [EMAIL PROTECTED] wrote: would it be possible to implement a Analyser who filters HTML code out of a HTML page. As a result I would have only the text free of any tagging. The dilemma is that in a general sense there are multiple fields in HTML. At least title and

Re: merged search of document

2004-01-07 Thread Erik Hatcher
On Jan 7, 2004, at 4:18 PM, Dror Matalon wrote: Actually I would guess that performance should be fine. I would look at the code generated by the standard analyzer, http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/analysis/standard/package-summary.html which translates from (a AND b)

Re: Retrieving the content from hits...

2004-01-05 Thread Erik Hatcher
Actually, creating a Field with a Reader means the field data is unstored. It is indexed, but the original text is not retrievable as it is not in the index (yes, it is tokenized, but not kept as a unit, and is very unlikely to be the same as the original text) If you need the text to be

Re: IndexHTML example on Jakarta Site

2004-01-02 Thread Erik Hatcher
On Jan 2, 2004, at 11:49 AM, Colin McGuigan wrote: 1. How do you specify which directory is to be searched ( I assumed it was the current directory ie tomcat\webapps but when I put in more searchable content nothing comes up in the search I have also tried typing java

Re: Query Parser AND / OR

2003-12-30 Thread Erik Hatcher
On Dec 30, 2003, at 3:13 PM, Morus Walter wrote: Hmm. That's be up to the developers. Don't know how many of them are reading lucene-user. I suspect we're all here! QueryParser is Lucene's red-headed step-child. It works well enough, but it has more than its share of issues. It is almost a

Re: Search all instead of specified/default field

2003-12-29 Thread Erik Hatcher
On Dec 29, 2003, at 5:37 PM, Thomas Krämer wrote: with the apache commons digester i can read each record into a lucene document and push each tag as key-value pair, where the tag name (eg. creator) is the lucene field name and the text enclosed by it the corresponding string value. for a lot

Re: IndexWriter Problem

2003-12-23 Thread Erik Hatcher
On Dec 23, 2003, at 8:15 AM, Niall Gallagher wrote: I think I have resolved the problem. I was using Lucene to index several directories concurrently within the same JVM, and as far as I can tell Lucene cannot do concurrent indexing. Is this correct? You can do it concurrently, but you must use

Re: efficient refinement, order by and range queries

2003-12-23 Thread Erik Hatcher
Geoffrey, You've done quite a thorough analysis of Lucene. I'll reply below with a few tidbits of Lucene trivia in hopes that will help On Dec 22, 2003, at 3:15 PM, Geoffrey Peddle wrote: One of our applications is a catalog search application.In this application our documents are

Re: Lucene and JavaHelp

2003-12-21 Thread Erik Hatcher
On Dec 19, 2003, at 10:46 PM, Mark R. Diggory wrote: Has anyone thought about or used Lucene to build an indexed, searchable help system? Either Server or Application Based? While maybe not exactly a help system, the application we wrote for Java Development with Ant uses Lucene to index a

Re: DoubleMetaphoneQuery

2003-12-19 Thread Erik Hatcher
Interestingly, I used a MetaphoneAnalyzer as an example in our book in progress. I'm curious if you have measured performance with doing it at analysis time versus query time. Enumerating all terms at query time is basically the same as doing a WildcardQuery or FuzzyQuery and involves a

Re: syntax of queries.

2003-12-19 Thread Erik Hatcher
On Friday, December 19, 2003, at 05:42 PM, Ernesto De Santis wrote: I have news questions: - apiQuery.add(new TermQuery(new Term(contents, dot)), false, true); new Term(contents, dot) The Term class, work for only one word? Careful with terminology here. It works for only one term. What is

Re: Wildcard in Field

2003-12-18 Thread Erik Hatcher
During indexing, perhaps you could glue the text of all fields together into one special field used for searching? On Thursday, December 18, 2003, at 06:31 AM, Thijs Cadier wrote: I am using a QueryParser, looked at the MultiFieldQueryParser. But the issue is that I don't know which fields are in the
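
The "glue all fields together" suggestion can be sketched in plain Java. This is only an illustration under stated assumptions: the class name CatchAllField, the method combine, and the sample field names are hypothetical, and no Lucene classes are used. In a real indexer the combined string would presumably be added to each document as one extra searchable field next to the original fields.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of Lucene: builds the text for a catch-all field.
class CatchAllField {

    // Joins every non-empty field value into one space-separated string,
    // in the order the fields were added to the map.
    static String combine(Map<String, String> fields) {
        StringJoiner all = new StringJoiner(" ");
        for (String value : fields.values()) {
            if (value != null && !value.isEmpty()) {
                all.add(value);
            }
        }
        return all.toString();
    }

    public static void main(String[] args) {
        // Field names are made up for the example; real documents may carry any set.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", "Lucene in practice");
        fields.put("creator", "Jane Doe");
        fields.put("body", "Indexing and searching text");
        System.out.println(combine(fields));
    }
}
```

Because the helper only iterates over whatever fields a document happens to have, the search side no longer needs to know the field names in advance; it can query the single combined field.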

Re: Disabling modifiers?

2003-12-16 Thread Erik Hatcher
On Tuesday, December 16, 2003, at 05:46 AM, Iain Young wrote: Treating them as two separate words when quoted is indicative of your analyzer not being sufficient for your domain. What Analyzer are you using? Do you have knowledge of what it is tokenizing text into? I have created a custom

Re: Disabling modifiers?

2003-12-15 Thread Erik Hatcher
On Monday, December 15, 2003, at 12:12 PM, Iain Young wrote: A quick question. Is there any way to disable the - and + modifiers in the QueryParser? Not currently. I've had a bit of success by putting quotes around the offending names, (as suggested on this list), but the results are still

Re: syntax of queries.

2003-12-13 Thread Erik Hatcher
Try out the toString(fieldName) trick on your Query instances and pair them up with what you have below - this will be quite insightful for the issue - i promise! :) Look at my QueryParser article and search for toString on that page:

Re: Web Lucene Question.

2003-12-13 Thread Erik Hatcher
On Saturday, December 13, 2003, at 11:20 AM, Tun Lin wrote: Hi, I have tried to type the following at Windows command line at weblucene directory: ant build Everything seems to work fine except the following error: Everything works fine but it fails miserably?! :)

Re: Has anyone tried to implement a counter?

2003-12-12 Thread Erik Hatcher
display the contents of the hits object to a page, I am getting 57 or 58 results on the page. 5 or 6 more results than is shown from the length() method in the hits object. Shannon -Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Friday, December 12, 2003 9:45 AM

Re: Good and performance and fuzzy search

2003-12-10 Thread Erik Hatcher
On Wednesday, December 10, 2003, at 04:07 PM, julien gerard wrote: I'm attempting to optimize a fuzzy search on a big index with ~4.400.000 Documents (in Lucene's meaning) in 600.000 sub-categories (Simple Text.Keyword type a field ). My purpose is to limit the amount of documents on which the

Re: Good and performance and fuzzy search

2003-12-10 Thread Erik Hatcher
On Wednesday, December 10, 2003, at 05:27 PM, julien gerard wrote: But in this case the fuzzy is performed on the overall index? The QueryFilter do his job after ? I'm not sure to understand the QueryFilter meaning? But I test the QueryFilter also this way and the time to doing this search

Re: OR query return fewer result than AND query

2003-12-08 Thread Erik Hatcher
On Sunday, December 7, 2003, at 09:50 PM, Fitrio Pakana wrote: I have similar problems with him, which is query using multiple terms, and to make things worse, the hits returned is quite absurd. The score of hits using 'OR' (any words) query is lower than if using 'AND' (all words) query, thus

Re: TooManyBooleanClauses exception

2003-12-08 Thread Erik Hatcher
On Monday, December 8, 2003, at 05:47 PM, [EMAIL PROTECTED] wrote: If I generate a query using QueryParser and a standard analyzer, in some cases I'm getting a TooManyBooleanClauses exception, e.g.: [2003-12-08 14:39:23] [ debug1 ] query is +glucose -kog* always:1 [2003-12-08 14:39:23]

Re: FSDIrectory.create doesn't tolerate subdirectories

2003-12-07 Thread Erik Hatcher
On Sunday, December 7, 2003, at 06:17 PM, Esmond Pitt wrote: When creating an index, FSDirectory assumes that the directory has no subdirectories. If a non-empty subdirectory is present, FSDirectory.create fails to delete it and throws an IOException. As the subdirectory is not a Lucene index

Re: FSDIrectory.create doesn't tolerate subdirectories

2003-12-07 Thread Erik Hatcher
On Sunday, December 7, 2003, at 08:21 PM, Esmond Pitt wrote: I'm not clear whether this is a 'yes' or a 'no'. I think other committers would need to weigh in on it. I'm fine with making a change to check isDirectory as well and not deleting them since Lucene (currently) does not work with

Re: Returning one result

2003-12-05 Thread Erik Hatcher
for Field.Keyword. Please provide more details on the issue you encountered using Field.Keyword. -Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Thursday, December 04, 2003 6:18 PM To: Lucene Users List Subject: Re: Returning one result You really should use a TermQuery

Re: implementing a TokenFilter for aliases

2003-12-05 Thread Erik Hatcher
On Friday, December 5, 2003, at 11:59 AM, Allen Atamer wrote: Below are the results of a debug run on the piece of text that I want aliased. The token spitline must be recognized as splitline i.e. when I do a search for splitline, this record will come up. 1: [173] , start:1, end:2 1: [missing]

Re: Returning one result

2003-12-05 Thread Erik Hatcher
On Friday, December 5, 2003, at 01:25 PM, Pleasant, Tracy wrote: Say ID is Ar3453 .. well the user may want to search for Ar3453, so in order for it to be searchable then it would have to be indexed and not a keyword. *arg* - we're having a serious communication issue here. My advice to you is

Re: Returning one result

2003-12-05 Thread Erik Hatcher
On Friday, December 5, 2003, at 04:28 PM, Dror Matalon wrote: Then I'm out of ideas. The next thing is for you to post your search code so we can see why it's not searching the field. Giving up so easily, Dror?! :)) The problem is, when using any type of QueryParser with a Keyword field, you

Re: NPE when using explain

2003-12-04 Thread Erik Hatcher
On Wednesday, December 3, 2003, at 08:51 PM, Dror Matalon wrote: Hits hits = initSearch(queryString); Does initSearch close the IndexSearcher?

Searching XML

2003-12-04 Thread Erik Hatcher
Here's to all those that inquire about searching XML with Lucene: http://www.tbray.org/ongoing/When/200x/2003/11/30/SearchXML :))

Re: NPE when using explain

2003-12-04 Thread Erik Hatcher
On Thursday, December 4, 2003, at 02:46 PM, Dror Matalon wrote: Of course, now that I got explain to work I need to figure out what the following means :-) - Explanation:0.0 = product of: 0.0 = sum of: 0.0 = coord(0/5) - It means you have a bug in your code :))

Re: NPE when using explain

2003-12-04 Thread Erik Hatcher
On Thursday, December 4, 2003, at 03:07 PM, Dror Matalon wrote: By the way, all these fun things are going to be part of the CLI that I've been playing with. Anyone interested in helping test? Of course! Is it something you plan on donating to the Lucene project? LUKE and Limo and your CLI

Re: Returning one result

2003-12-04 Thread Erik Hatcher
You really should use a TermQuery in this case anyway, rather than using QueryParser. You wouldn't have to worry about the analyzer at that point anyway (and I assume you're using Field.Keyword during indexing). Erik On Thursday, December 4, 2003, at 05:01 PM, Pleasant, Tracy wrote: Ok I

Re: implementing a TokenFilter for aliases

2003-12-04 Thread Erik Hatcher
On Thursday, December 4, 2003, at 05:00 PM, Allen Atamer wrote: This is the code that I have so far for the next Method within AliasFilter. After reading some posts, I also got the idea to call setPositionIncrement(). Neither way works, because when I search for the alias, no search results

Re: Hits - how many documents?

2003-12-03 Thread Erik Hatcher
On Wednesday, December 3, 2003, at 09:36 AM, Ralph wrote: is there a maximum of documents Hits provide or is it unlimited (means limited to heap size of VM)? If there is a maximum, what is the number? Hits represents all documents that matched the query (and optionally filtered). But, Hits

Re: Hits - how many documents?

2003-12-03 Thread Erik Hatcher
On Wednesday, December 3, 2003, at 10:16 AM, Ralph wrote: Does this mean Hits points to ALL documents and the last one might have a score of 0.0 ? If it does not contain all documents, where is the threshold then? Or based on which condition it stops pointing to certain documents? I'm a bit

Re: Dates and others

2003-12-02 Thread Erik Hatcher
On Monday, December 1, 2003, at 11:55 PM, Tatu Saloranta wrote: On a related note, it would also be nice if there was a way to start categorizing general hot topics for Lucene developers; it seems like there are about half a dozen areas where there's lots of interest for improvements (most of

Re: New Lucene-powered Website

2003-12-02 Thread Erik Hatcher
On Tuesday, December 2, 2003, at 07:34 AM, Otis Gospodnetic wrote: Could you add a Lucene logo somewhere on your search results, as noted here: http://jakarta.apache.org/lucene/docs/powered.html ? I thought we were going to loosen up the requirement to have the logo on a search results page?

Re: New Lucene-powered Website

2003-12-02 Thread Erik Hatcher
On Tuesday, December 2, 2003, at 09:32 AM, Tate Avery wrote: Hello, This is the first time that I noticed this. Is the 'powered by Lucene' a legal requirement? Or just a suggestion? Does it apply to any system embedding Lucene (web pages, applications, etc)? That is not covered in the Apache

Re: Help with Searching indexes from a web app (Lucene 1.3 rc2)

2003-12-01 Thread Erik Hatcher
Also, reindex with the new API as well. There are likely incompatibilities in the index format. On Monday, December 1, 2003, at 11:21 AM, Iain Young wrote: Note, that I've just tried the example webapp supplied with Lucene, and I appear to be having exactly the same problem with that. The

Re: raw hit count

2003-11-30 Thread Erik Hatcher
On Sunday, November 30, 2003, at 11:13 AM, Kent Gibson wrote: as per Erik's idea I tried with the BitSet as follows: QueryFilter qf = new QueryFilter(query); IndexReader ir = IndexReader.open(indexPath); Searcher searcher2 = new IndexSearcher(ir); // get the bit set for the query BitSet bits =

Re: raw hit count

2003-11-29 Thread Erik Hatcher
I enjoy at least attempting to answer questions here, even if I'm half wrong, so by all means correct me if I misspeak On Saturday, November 29, 2003, at 06:37 PM, Kent Gibson wrote: All I would like to know is how many times a query was found in a particular document. I have no problems

Re: unexpected results from query

2003-11-26 Thread Erik Hatcher
On Tuesday, November 25, 2003, at 10:45 PM, marc wrote: Hi, assume a field has the following text Adenylate kinase (mitochondrial GTP:AMP phosphotransferase) the following searches all return this document AMP AMP AMP; can someone explain this to me..i figured that only the first query

Re: Tokenizing text custom way

2003-11-26 Thread Erik Hatcher
woah that seems like an awfully complex answer to the question of how to tokenize at a comma rather than a space! %-) On Tuesday, November 25, 2003, at 11:48 AM, MOYSE Gilles (Cetelem) wrote: Hi. You should define expressions. To define expressions, you first have to define an
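
For what "tokenize at a comma rather than a space" amounts to, here is a minimal plain-Java sketch. It is an assumption-laden illustration, not Lucene code: the class CommaTokens and its methods are hypothetical names. Inside Lucene the same boundary rule would live in a custom Tokenizer used by a custom Analyzer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not a Lucene Tokenizer: splits text at commas only,
// so spaces inside a token are preserved.
class CommaTokens {

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == ',') {
                flush(current, tokens);   // a comma ends the current token
            } else {
                current.append(c);        // everything else, spaces included, belongs to the token
            }
        }
        flush(current, tokens);           // emit the last token, if any
        return tokens;
    }

    // Trims surrounding whitespace, skips empty tokens, and resets the buffer.
    private static void flush(StringBuilder current, List<String> tokens) {
        String token = current.toString().trim();
        if (!token.isEmpty()) {
            tokens.add(token);
        }
        current.setLength(0);
    }

    public static void main(String[] args) {
        System.out.println(tokenize("red wine, white wine,,beer "));
    }
}
```

The only decision the sketch makes is which character separates tokens; trimming and dropping empty tokens are extra choices made here for readability.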

Re: Tokenizing text custom way

2003-11-26 Thread Erik Hatcher
On Wednesday, November 26, 2003, at 06:12 AM, Dragan Jotanovic wrote: You will need to write a custom analyzer. Don't worry, though it's quite straightforward. You will also need to write a Tokenizer, but Lucene helps you a lot here. Wouldn't I achieve the same result if I index time out

Re: Search Question - not returning desired results

2003-11-26 Thread Erik Hatcher
On Wednesday, November 26, 2003, at 11:33 AM, Pleasant, Tracy wrote: Your website says: org.apache.lucene.analysis.standard.StandardAnalyzer: [xyz] [corporation] [EMAIL PROTECTED] [com] When I run it it keeps the entire email '[EMAIL PROTECTED] but according to your website it

Re: 1.2 javadoc

2003-11-24 Thread Erik Hatcher
On Monday, November 24, 2003, at 12:57 PM, [EMAIL PROTECTED] wrote: Is there a url that will take me to the javadocs for Lucene 1.2, rather than 1.3-rc2? No, but the 1.2 binary distribution ships with the javadocs, I believe. And, of course, they would be easy to generate from the 1.2 source

Re: Similarity class

2003-11-24 Thread Erik Hatcher
On Monday, November 24, 2003, at 12:22 PM, Ralf B wrote: One question: The similarity class is abstract. Are there default implementations like in other parts of this API (Analysers for example) available and how can I use it i.e. to calculate weights? Are there some default implementations

Re: Dates and others

2003-11-23 Thread Erik Hatcher
On Saturday, November 22, 2003, at 06:33 PM, Dion Almaer wrote: 3. I have some fields such as title, owner, etc as well as the content blob which I index and use as the default search field. Is there an easy way to extend the QueryParser to merge it with a MultiTermQuery which can also search

Re: Dates and others

2003-11-23 Thread Erik Hatcher
On Sunday, November 23, 2003, at 03:33 PM, Dion Almaer wrote: This leads me to another issue actually. On certain range queries I get exceptions: Query: modifieddate:[1/1/03 TO 12/31/03] org.apache.lucene.search.BooleanQuery$TooManyClauses I'm guessing you're using Field.Keyword(String, Date)

Re: Dates and others

2003-11-23 Thread Erik Hatcher
On Sunday, November 23, 2003, at 03:33 PM, Dion Almaer wrote: 2. +field:foo and the QueryParser: I ran into some problems where using +field:foo was giving strange results. When I changed the queries to ... AND field:foo everything was fine. Am I missing something there? Which version of

Re: Dash Confusion in QueryParser - Bug? Feature?

2003-11-21 Thread Erik Hatcher
On Friday, November 21, 2003, at 02:34 PM, Jianshuo Niu wrote: I read your post on lucene bug list. However, I try the change you suggested, but it just changed t-shirts to shirt. What Analyzer are you using?

Re: Illegal seek error

2003-11-18 Thread Erik Hatcher
On Tuesday, November 18, 2003, at 04:32 PM, Dan Pelton wrote: Occasionally I get an Illegal seek error while loading a document into lucene. I am new to lucene so I am not sure what to look for. Does any one have an idea of what may cause this error. Can lucene handle multiple user inserting

Re: QueryParser Rules article (Erik Hatcher)

2003-11-16 Thread Erik Hatcher
On Sunday, November 16, 2003, at 06:23 PM, Tomcat Programmer wrote: Yes, I understand that now the QueryParser will trap the errors and convert to exceptions (with the version in CVS). I was just voicing my opinion regarding throwing TokenMgrError's in the first place when they should really be

Re: Slow response time with datefilter

2003-11-15 Thread Erik Hatcher
On Friday, November 14, 2003, at 07:16 PM, Dror Matalon wrote: We're seeing slow response time when we apply datefilter. A search that takes 7 msec with no datefilter takes 368 msec when I filter on the last fifteen days, and 632 msec on the last 30 days. Initially we saved doing

Re: AW: Slow response time with datefilter

2003-11-15 Thread Erik Hatcher
On Saturday, November 15, 2003, at 11:38 AM, Karsten Konrad wrote: If the number of different date terms causes this effect, why not round the date to the nearest or next midnight while indexing. Thus, filtering for the last 15 days would require walking over 15-17 different date terms. If
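
The rounding idea can be sketched in plain Java to show why it shrinks the number of distinct date terms. Assumptions are labeled in the code: the class DayGranularity and its methods are hypothetical, timestamps are epoch milliseconds, and the sketch rounds down to the preceding UTC midnight (the message suggests the nearest or next midnight, which would work the same way).

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration, not Lucene code: coarsens timestamps before indexing.
class DayGranularity {

    static final long DAY_MS = TimeUnit.DAYS.toMillis(1);

    // Rounds an epoch-millisecond timestamp down to the preceding UTC midnight.
    // Math.floorDiv keeps the result correct for timestamps before 1970.
    static long floorToDay(long epochMillis) {
        return Math.floorDiv(epochMillis, DAY_MS) * DAY_MS;
    }

    // Counts how many distinct values would be indexed after rounding.
    static int distinctDayTerms(long[] timestamps) {
        Set<Long> days = new HashSet<>();
        for (long t : timestamps) {
            days.add(floorToDay(t));
        }
        return days.size();
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long[] sample = new long[1000];
        for (int i = 0; i < sample.length; i++) {
            // One timestamp every 20 minutes, going back roughly two weeks.
            sample[i] = now - i * TimeUnit.MINUTES.toMillis(20);
        }
        System.out.println("raw values: " + sample.length
                + ", distinct day values: " + distinctDayTerms(sample));
    }
}
```

With millisecond values nearly every document contributes its own term, while day-rounded values give at most one term per day in the filtered range, which is the effect the message describes.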

Re: AW: Slow response time with datefilter

2003-11-15 Thread Erik Hatcher
On Saturday, November 15, 2003, at 12:03 PM, Dror Matalon wrote: After posting the original email, I started wondering if that's the issue, the fact that we store timestamp up to the millisecond rather than a more reasonable granularity. Dates are too high a granularity for us, but minutes, and

Re: Slow response time with datefilter

2003-11-15 Thread Erik Hatcher
On Saturday, November 15, 2003, at 11:59 AM, Dror Matalon wrote: If this date range is pretty static, you could (in Lucene's CVS codebase) wrap the DateFilter with a CachingWrapperFilter. Or you could construct a long-lived instance of an equivalent QueryFilter and reuse it across multiple

Re: Query Filters on term A in query A AND (B OR C OR D)

2003-11-14 Thread Erik Hatcher
On Thursday, November 13, 2003, at 04:32 PM, Jie Yang wrote: Well, not quite, User normally enters a search string A that normally returns 1000 out of 2 millions docs. I then append A with 500 OR conditions... A AND (B or C or ... or x500). I am trying to optimise the 500 OR terms so that it does

Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Erik Hatcher
On Friday, November 14, 2003, at 01:13 PM, Chong, Herb wrote: if you didn't have to change the index then you haven't got all the factors needed to do it well. terms can't cross sentence boundaries and the index doesn't store sentence boundaries. You mean if you have text like this: Hello Herb.

Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Erik Hatcher
On Friday, November 14, 2003, at 02:02 PM, Chong, Herb wrote: if i just run this query against a million document newswire index, i know i am going to get lots of hits. the phrase capital gains tax hits a lot fewer documents, but is overrestrictive. the fact that the three terms occur next to

Re: Vector Space Model in Lucene?

2003-11-14 Thread Erik Hatcher
On Friday, November 14, 2003, at 02:32 PM, Chong, Herb wrote: when people type in multiword queries, mostly they are interested in phrases in the linguistic sense. phrases don't cross sentence boundaries. you need certain features in the index and in the ranking algorithm to capture that

Re: Vector Space Model in Lucene?

2003-11-14 Thread Erik Hatcher
On Friday, November 14, 2003, at 02:54 PM, Chong, Herb wrote: it solves one part of the problem, but there are a lot of sentences in a typical document. you'll need to composite a rank of a document from its constituent sentences then. there are less drastic ways to solve the problem. the

Re: Index using URL

2003-11-14 Thread Erik Hatcher
You should write your own code that creates the Document objects with the fields you wish, with a Field.Keyword for the URL probably. Take what is useful from IndexHTML.java, but don't use it as-is. If you're speaking of pulling the document from a URL now you're talking of doing some HTTP

Re: Can use Lucene be used for this

2003-11-13 Thread Erik Hatcher
On Thursday, November 13, 2003, at 03:22 AM, Hackl, Rene wrote: documents contain very long strings for chemical substances, users are interested in certain parts of the string e.g. find all documents that comprise *foo* be it 1-foo-bar or rab-oof-13-foonyl-naphthalene). So you're saying you want

Re: QueryParser Rules article (Erik Hatcher)

2003-11-13 Thread Erik Hatcher
On Wednesday, November 12, 2003, at 11:52 PM, Tomcat Programmer wrote: When using the QueryParser class, the parse method will throw a TokenMgrError when there is a syntax error even as simple as a missing quote at the end of a phrase query. According to the javadoc, you should never see this

Re: Query Filters on term A in query A AND (B OR C OR D)

2003-11-13 Thread Erik Hatcher
On Thursday, November 13, 2003, at 03:28 PM, Dan Quaroni wrote: To my knowledge the answer is No, lucene performs each query separately and then performs the joins after it has all the results. This is actually a rather serious problem when it comes to searches in large indexes where a single

Re: Query Filters on term A in query A AND (B OR C OR D)

2003-11-13 Thread Erik Hatcher
On Thursday, November 13, 2003, at 04:07 PM, Jie Yang wrote: Erik, Just to make sure I understand you right, In an example query: ZipCode:CA10927 AND Gender:Male Are we talking about that query being entered by the user and you handing it just like that to QueryParser? If so, then QueryFilter

Re: Wildcard search and HOST tokens

2003-11-12 Thread Erik Hatcher
On Wednesday, November 12, 2003, at 05:55 AM, Pascal Nadal wrote: My lucene indexes contain fields with values like this www.xxx.yyy.zzz which are treated as HOST tokens. My problem is the following : search results never contain documents with such fields when doing a wildcard query or a fuzzy

Re: Wildcard search and HOST tokens

2003-11-12 Thread Erik Hatcher
On Wednesday, November 12, 2003, at 10:43 AM, Pascal Nadal wrote: the HostFilter I wrote (that tokenizes again HOST tokens) works wonderfully. I wonder if this has been fixed since Lucene 1.2 could you try the latest 1.3RC build available and see if it works without your HostFilter? Erik

Re: Boost in Query Parser

2003-11-12 Thread Erik Hatcher
On Wednesday, November 12, 2003, at 10:53 AM, MOYSE Gilles (Cetelem) wrote: Hello. I've made a Filter which recognizes special words and return them in a boosted form, in a QueryParser sense. For instance, when the filter receives special_word, it returns special_word^3, so as to boost it. The

Re: QueryParser Rules article (Erik Hatcher)

2003-11-12 Thread Erik Hatcher
On Wednesday, November 12, 2003, at 11:52 PM, Tomcat Programmer wrote: I thought Erik's article was great. There was one unanswered brainbender I had which I was hoping was in there, but... Maybe you can add this topic to the next one, Erik? Well, I'm not sure another article on QueryParser is

Re: fuzzy searches

2003-11-11 Thread Erik Hatcher
On Tuesday, November 11, 2003, at 02:37 PM, Thomas Krämer wrote: Is there an overview of the structure of the index of lucene despite of the javadoc or any other fast access to understanding what happens inside lucene? Here is what is inside a Lucene index:

Re: Can use Lucene be used for this

2003-11-11 Thread Erik Hatcher
On Tuesday, November 11, 2003, at 10:00 PM, Kumar Mettu wrote: The format of the file is as follows: Col1,col2,col3,Value abababc,xyzza,c,100 ababadx,xyz,adfdfd,101 I need to retrieve the value with simple queries on the data like: select value where col1

Re: crash in Lucene

2003-11-10 Thread Erik Hatcher
On Friday, November 7, 2003, at 08:38 AM, Chong, Herb wrote: i'm running in a single thread. the demo app is pretty vague on things and expects me to read the detailed documentation. not what i like in a sample application where someone is supposed to learn from it. taking the close() call out

Re: Rephrase My Question - How To Search Database With More Than One Pair of Property/Value as Parameters Using Lucene?

2003-11-07 Thread Erik Hatcher
On Friday, November 7, 2003, at 03:56 AM, Victor Hadianto wrote: Nonetheless, both creator and the name of the creator are variables. We depend on the user to give Of course, but you don't have unlimited fields right? So you know that creator field is the creator of a book. You can provide the

QueryParser Rules

2003-11-07 Thread Erik Hatcher
My latest article is now online at java.net: http://today.java.net/pub/a/today/2003/11/07/QueryParserRules.html Lots of gory details about how QueryParser works and issues to consider when using it are discussed. Feedback (on java.net's site preferably) is most welcome! Erik

Re: Applet, read-only mode index/code contribution

2003-11-06 Thread Erik Hatcher
On Thursday, November 6, 2003, at 01:55 PM, Thomas Fuchs wrote: Hi, I used lucene 1.2/it.unige.csita.lucene.RODirectory inside an applet on CD-ROM. In lucene 1.3 the system property 'disableLuceneLocks' was introduced to make it.unige.csita.lucene.RODirectory or something like that obsolete.

Re: crash in Lucene

2003-11-06 Thread Erik Hatcher
On Thursday, November 6, 2003, at 02:44 PM, Chong, Herb wrote: it's the line with the close(). so the remedy then is to make sure that it is called only once. what is the recommended way to process two folders worth of documents then? do i need to create a new IndexWriter object for each

Re: Rephrase My Question - How To Search Database With More Than One Pair of Property/Value as Parameters Using Lucene?

2003-11-06 Thread Erik Hatcher
On Thursday, November 6, 2003, at 07:53 PM, Caroline Jen wrote: Hi, let me see if I have got the idea. For example, if I want to search the database for articles written by Elizabeth Castro, we do what is shown below in Lucene: It sounds like you're asking a lot of hypothetical questions without

Re: Index entire filesystem

2003-11-05 Thread Erik Hatcher
On Wednesday, November 5, 2003, at 03:51 AM, Marcel Stor wrote: Hi all, I'm thinkin' about writing a search tool for my filesystem. I know such things exist already but programming it myself is much more fun ;-) So, I would have Lucene crawl through my filesystem and pass each file to an

Re: crash in Lucene

2003-11-04 Thread Erik Hatcher
Could you try the latest CVS version or 1.3 RC build and see if the problem has been resolved? On Tuesday, November 4, 2003, at 12:24 PM, Chong, Herb wrote: this is the release 1.2 code. the exception as reported by debug is java.lang.NullPointerException at

Re: crash in Lucene

2003-11-04 Thread Erik Hatcher
600 bytes in size. Herb -Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Tuesday, November 04, 2003 1:42 PM To: Lucene Users List Subject: Re: crash in Lucene Could you try the latest CVS version or 1.3 RC build and see if the problem has been resolved

Re: zipf law?

2003-11-02 Thread Erik Hatcher
On Sunday, November 2, 2003, at 09:38 AM, Stefan Groschupf wrote: sorry a very stupid question does lucene zipf laws until indexing? I had to look up Zipfs law to understand this. Lucene does include frequency information about terms indexed, yes. And Analyzers can remove common words if you

Re: Remove a token from a field

2003-10-31 Thread Erik Hatcher
On Friday, October 31, 2003, at 03:53 AM, Albert Vila Puig wrote: Hi, Is there a way to remove a token from a document field entry?. For example, I've got a UnStored field in my index and I want to remove a token from this field without doing the delete and add document (because I'm

Re: Indexing txt-files

2003-10-30 Thread Erik Hatcher
Field.Text(String, Reader) is an unstored field. It is indexed, but the contents are not stored in the index. If you want the contents stored, use Field.Text(String,String) Erik On Thursday, October 30, 2003, at 02:40 AM, Günter Kukies wrote: Hello, I want to add a Text field to a

Re: Indexing txt-files

2003-10-30 Thread Erik Hatcher
Also, referring to my article may help - the code is designed to index text files: http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html On Thursday, October 30, 2003, at 02:40 AM, Günter Kukies wrote: Hello, I want to add a Text field to a LUCENE Document. I checked the index

Re: Indexing txt-files

2003-10-30 Thread Erik Hatcher
,...) So my problem is that I don't get back the LUCENE-Document. Maybe I need a buffered reader or it is not allowed to close the reader. Günter - Original Message - From: Erik Hatcher [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Thursday, October 30, 2003 9:17 AM Subject

Re: MultiFieldQueryParser default operator

2003-10-30 Thread Erik Hatcher
It was posted on lucene-dev, not lucene-user. I've pasted it below. I will be fixing this at some point in the near future based on this fix and other related ones needed. Erik On Thursday, October 30, 2003, at 09:31 AM, Otis Gospodnetic wrote: I believe a person just sent an email with a

Re: Best practice

2003-10-28 Thread Erik Hatcher
On Tuesday, October 28, 2003, at 08:54 AM, William W wrote: Is there any Lucene best practice ? Is there anything in particular you're interested in knowing about? This list and its archives contain in conjunction with the jGuru FAQ are the best sources for such info currently as well as
