RE: Lock obtain timed out

2006-07-26 Thread Björn Ekengren
After some research on addShutdownHook it seems that Eclipse terminates the program rather brutally, giving neither finalize nor the shutdown hook any chance to run. This is a known bug in Eclipse. The application I'm writing is a server that keeps a reader and a writer open at all times. I realized last n

Re: Web search demo does not work

2006-07-26 Thread Otis Gospodnetic
Get code from SVN to get some demo fixes after 2.0. Otis - Original Message From: John john <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Wednesday, July 26, 2006 7:08:59 PM Subject: Web search demo does not work Hello, The web search demo does not work in lucene 2.0 bec

Web search demo does not work

2006-07-26 Thread John john
Hello, The web search demo does not work in Lucene 2.0 because it seems that the file results.jsp uses old methods. Is there a patch or something to fix this, or a file that works with Lucene 2.0? I'd like to run some tests and I'm not familiar with JSP. Thanks

Re: Index Rows as Documents? Help me design a solution

2006-07-26 Thread Doron Cohen
A document per row seems correct to me too. If search would be by msisdn / messageid, and if, as it seems, these are keywords, not free text that needs to be analyzed, they both should have Index.UNTOKENIZED. Also, since no search is to be done by the line content, the line should have Index.

Re: MultiFieldQueryParser.parse deprecated. What can I use?

2006-07-26 Thread Doron Cohen
Judging by the resulting query's toString(), the boolean query would not work correctly:
qtxt: a foo
[1] Multi Field Query (OR) : (title:a body:a) (title:foo body:foo)
[2] Multi Field Query (AND): +(title:a body:a) +(title:foo body:foo)
[3] Boolean Query : (title:a title:foo) (body:a body:foo) --> B

Re: Filter updating

2006-07-26 Thread Erick Erickson
Well, I *suppose* you could get the bitset from the pre-existing filter, copy it to the bitset for your new filter, and play with the bits at the end. I'm not sure how you get rid of your original filter if you use CachingWrapperFilter though. But as "the guys" have pointed out in oth
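The copy-and-extend idea in this message can be sketched with plain java.util.BitSet, which, as far as I know, is what a Lucene 2.0 Filter returns from bits(IndexReader). The names FilterBitsSketch and withExtraDocs are made up for illustration, and how the result gets back into a cached filter is left open, as it is in the message.

```java
import java.util.BitSet;

// Hypothetical helper, not part of Lucene: extends a copy of a filter's bits.
final class FilterBitsSketch {

    // Returns a copy of the cached bits with the extra document ids switched on.
    // The cached BitSet itself is left untouched, so an existing filter that
    // still holds it keeps working while the copy is being prepared.
    static BitSet withExtraDocs(BitSet cached, int[] extraDocIds) {
        BitSet copy = (BitSet) cached.clone();
        for (int docId : extraDocIds) {
            copy.set(docId);
        }
        return copy;
    }
}
```

Working on a copy rather than mutating the cached bits is a deliberate choice here: a search that is running against the old filter never sees a half-updated set.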

Re: Timestamps as milliseconds

2006-07-26 Thread Erick Erickson
As Miles said, use the DateTools (lucene) class with a DAY resolution. That'll give you a YYYYMMDD format, which won't blow your query with a "TooManyClauses" exception... Remember that Lucene deals with strings, so you want to store things in easily-manipulated string format, often one that'
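To make the day-resolution idea concrete without depending on Lucene, here is a minimal sketch that turns a millisecond timestamp into a day string using only the JDK. It is a stand-in for what DateTools with DAY resolution is described as doing, not a copy of it; the class name DayResolutionSketch and the choice of GMT are assumptions.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical stand-in for a day-resolution date string (not Lucene's DateTools).
final class DayResolutionSketch {

    // Formats epoch milliseconds as yyyyMMdd. GMT is an assumption made so that
    // the result does not depend on the machine's default time zone.
    static String toDayString(long epochMillis) {
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMdd", Locale.US);
        format.setTimeZone(TimeZone.getTimeZone("GMT"));
        return format.format(new Date(epochMillis));
    }
}
```

Because every timestamp within one day collapses to the same eight-character term, a range over a month touches about thirty distinct terms instead of one term per distinct millisecond value, which is why the clause limit stops being a problem.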

Filter updating

2006-07-26 Thread Paul Waite
I was wondering if there was a nice way to add documents to a cached filter 'manually' as it were. The reason would be to avoid a complete refresh of the filter, if you already knew the docids of the extra documents to add. An example would be if I had a filter based on datetime, which contained

Re: To Tokenize or Un_Tokenize?

2006-07-26 Thread Michael J. Prichard
:) I probably did ask...my mind is turning into mush! Hehe Ok...let me write me an email analyzer. Thanks! Michael Otis Gospodnetic wrote: You most certainly want to index the whole token, and likely portions of it (didn't you already ask this a few weeks ago?). You will want to write y

Re: To Tokenize or Un_Tokenize?

2006-07-26 Thread Otis Gospodnetic
You most certainly want to index the whole token, and likely portions of it (didn't you already ask this a few weeks ago?). You will want to write your own Analyzer + Tokenizer that's email-address-format-aware and does things like: emit the whole token, emit the username portion, emit the fully q
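The first two steps named in this message (the whole address, then the username portion) can be sketched as a plain string-splitting routine. This is not a Lucene Analyzer or Tokenizer, only the splitting logic such a class might wrap; the name EmailTokenSketch and the lowercasing step are assumptions, and the remaining steps are cut off in the message, so they are not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical splitting logic for an email-aware tokenizer (not a Lucene class).
final class EmailTokenSketch {

    // Emits the whole address first, then the part before '@' if there is one.
    // Lowercasing is an assumption, made so that matching is case-insensitive.
    static List<String> tokens(String address) {
        List<String> out = new ArrayList<String>();
        String whole = address.trim().toLowerCase(Locale.ROOT);
        out.add(whole);
        int at = whole.indexOf('@');
        if (at > 0) {
            out.add(whole.substring(0, at));
        }
        return out;
    }
}
```

Indexing both tokens is what lets a query match either the full address or just the username.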

Re: To Tokenize or Un_Tokenize?

2006-07-26 Thread Michael J. Prichard
karl wettin wrote: On Wed, 2006-07-26 at 16:33 -0400, Michael J. Prichard wrote: If I want to search an email address (i.e. [EMAIL PROTECTED]) do I need to Tokenize that field? Do you want to match on the full address only, or on parts too? If A, don't tokenize. If B, tokenize. And

Re: To Tokenize or Un_Tokenize?

2006-07-26 Thread karl wettin
On Wed, 2006-07-26 at 16:33 -0400, Michael J. Prichard wrote: > If I want to search an email address (i.e. [EMAIL PROTECTED]) do I need to > Tokenize that field? Do you want to match on the full address only, or on parts too? If A, don't tokenize. If B, tokenize. And write an analyzer that wil

To Tokenize or Un_Tokenize?

2006-07-26 Thread Michael J. Prichard
If I want to search an email address (i.e. [EMAIL PROTECTED]) do I need to Tokenize that field?
doc.add(new Field("from", (String) itemContent.get("from"), Field.Store.YES, Field.Index.TOKENIZED));
-OR-
doc.add(new Field("from", (String) itemContent.get("from"), Field.Store.YES, Field.Index

Matched part of query

2006-07-26 Thread Seeta Somagani
Hi all, Is there a way to return with a hit, the part of the query that matched in the corresponding document? I need to return combinations of documents that together contain the query or a relatively large part of the query. Thanks, Seeta

RE: Newbie synonyms question

2006-07-26 Thread Lee, Andrew J (CA - Toronto)
Thanks, Otis. I think the SynonymAnalyzer is the way to go, injecting the synonyms while removing the stop words. Andrew -Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: Wednesday, July 26, 2006 3:19 PM To: java-user@lucene.apache.org Subject: Re: Newbie synon

Re: Timestamps as milliseconds

2006-07-26 Thread Michael J. Prichard
Michael J. Prichard wrote: Miles Barr wrote: Michael J. Prichard wrote: I am working on indexing emails and have stored the data as milliseconds. I was thinking of using a filter w/ my search that would only return the email in that date range. I am currently indexing as follows: doc.a

Re: Index Rows as Documents? Help me design a solution

2006-07-26 Thread Namit Yadav
Thanks for the suggestions! I am still talking about indexing. The 'split the files' implementation was just an experiment, as I wasn't able to get the kind of performance I needed from making rows into Documents. Now back to using rows as Documents: As Erick and Jeremy suggested, the MaxBufferedD

Re: Newbie synonyms question

2006-07-26 Thread Otis Gospodnetic
Hi Andrew, There is nothing built into Lucene for synonyms, but you can grab the code from Lucene in Action to see how they can be handled (plus: http://www.lucenebook.com/search?query=synonyms for some context) Otis - Original Message From: "Lee, Andrew J (CA - Toronto)" <[EMAIL PROTE

Re: Lock obtain timed out

2006-07-26 Thread Michael McCandless
When I close my application containing index writers the lock files are left in the temp directory causing a "Lock obtain timed out" error upon the next restart. My guess is that you keep a writer open even though there is no activity involving adding new documents. Unless I have a massive neve

Re: email libraries

2006-07-26 Thread Andrzej Bialecki
John Haxby wrote: Suba Suresh wrote: Anyone know of good free email libraries I can use for lucene indexing for Windows Outlook Express and Unix emails?? javamail. Not sure how you get hold of the messages from Outlook Express, but getting hold of the MIME message in most Unix-based message

Re: email libraries

2006-07-26 Thread Suba Suresh
Ok. I will try it. I am a little stupid. When you said go down the POP or IMAP route, what did you mean? Is that path for Unix/Linux alone? thanks, suba suresh. John Haxby wrote: Suba Suresh wrote: Anyone know of good free email libraries I can use for lucene indexing for Windows Outlook Expre

Re: email libraries

2006-07-26 Thread John Haxby
Suba Suresh wrote: Anyone know of good free email libraries I can use for lucene indexing for Windows Outlook Express and Unix emails?? javamail. Not sure how you get hold of the messages from Outlook Express, but getting hold of the MIME message in most Unix-based message stores is relativel

Re: Can Field types affect search speed?

2006-07-26 Thread Erick Erickson
nope, haven't had to worry about it yet ... Erick

Re: Index Rows as Documents? Help me design a solution

2006-07-26 Thread Erick Erickson
It feels to me like your major problem might be file IO with all those files. There's no need to split the files up first and then index the files. Just read through the log and index each row. The code fragment you posted should allow you to get the line back from the "line" field of each docum

Re: Can Field types affect search speed?

2006-07-26 Thread Ryan O'Hara
I'm not using a Hits object. I'm using a HitCollector. I was just curious about whether Field settings could affect search performance. Any ideas? Ryan - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands,

Re: Lock obtain timed out

2006-07-26 Thread karl wettin
On Wed, 2006-07-26 at 16:24 +0200, Björn Ekengren wrote: > When I close my application containing index writers the > lock files are left in the temp directory causing a "Lock obtain > timed out" error upon the next restart. My guess is that you keep a writer open even though there is no activity

Re: Timestamps as milliseconds

2006-07-26 Thread Erick Erickson
two ideas: 1> store a second field that contains the time resolution you need, and sort by that. You can still search (quickly) by the day-resolution field. 2> If you KNOW that you are indexing the e-mails in time-order, then sorting by doc_id will preserve the time ordering. Erick
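The first idea can be sketched as deriving two string values from one timestamp: a coarse one to search on and a fine one to sort on. The names DateFieldsSketch, dayKey and sortKey are made up, GMT is an assumption, and the zero-padding is my own addition so that string order matches numeric order.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper producing one searchable and one sortable value per timestamp.
final class DateFieldsSketch {

    // Day-resolution value (yyyyMMdd, GMT assumed) for the field that is searched.
    static String dayKey(long epochMillis) {
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMdd", Locale.US);
        format.setTimeZone(TimeZone.getTimeZone("GMT"));
        return format.format(new Date(epochMillis));
    }

    // Full-resolution value for the field that is sorted on. Padding to 13 digits
    // keeps lexicographic order equal to numeric order for non-negative values.
    static String sortKey(long epochMillis) {
        if (epochMillis < 0) {
            throw new IllegalArgumentException("negative timestamps are not handled in this sketch");
        }
        return String.format(Locale.US, "%013d", Long.valueOf(epochMillis));
    }
}
```

Two emails from the same day share a dayKey but keep distinct sortKey values, so the search stays cheap while the ordering stays exact.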

Re: Out of memory error

2006-07-26 Thread Suba Suresh
Sorry for my late response. It took us some time to run it again. We increased the memory heap to 1G as you suggested and it works. The indexer is not crashing. (We are running into some other problem with a PowerPoint file. That is for another email). The code change with PDFTextStripper.wri

Re: Can Field types affect search speed?

2006-07-26 Thread Erick Erickson
Are you using a Hits object to iterate over the results? If so, you are re-executing the query every 100 docs or so under the covers, and if there are many results, this is very bad. If this is the case, you want to use a TopDocs or HitCollector to iterate through the entire result set. Of cours

email libraries

2006-07-26 Thread Suba Suresh
Anyone know of good free email libraries I can use for lucene indexing for Windows Outlook Express and Unix emails?? suba suresh.

Re: Index Rows as Documents? Help me design a solution

2006-07-26 Thread Jeremy Bensley
Just my 0.02, but I think you are correct in creating one document per line in your database in order to achieve your desired result. In my experience, there are a few things that you might do differently: The MaxBufferedDocs parameter has a huge impact on indexing speed. The default of 10 is v

Re: Index Rows as Documents? Help me design a solution

2006-07-26 Thread Namit Yadav
Thanks all for the responses. I am very pleasantly surprised at the helpful responses that I am getting. Okay, I think I still haven't understood Lucene well. I am sure that I am not solving the problem the right way. So I am explaining the problem at a very high level here .. please tell me what

Re: Timestamps as milliseconds

2006-07-26 Thread Miles Barr
Michael J. Prichard wrote: I guess the more I think about it I don't really care about the minutes in the initial. All that matters is the date (i.e. 2006-07-25). The only thing I would need the time for would be for sorting so I need to have that too. Ideas? Store as much detail as you

Can Field types affect search speed?

2006-07-26 Thread Ryan O'Hara
Currently, I have field "DOC" which is indexed, but not stored and not compressed. This is the field that users query. I also have a field "SYM" which is indexed and stored, but not compressed. For every document returned in a query, I need its symbol. Can field types (indexed vs. not i

Re: Timestamps as milliseconds

2006-07-26 Thread Michael J. Prichard
Miles Barr wrote: Michael J. Prichard wrote: I am working on indexing emails and have stored the data as milliseconds. I was thinking of using a filter w/ my search that would only return the email in that date range. I am currently indexing as follows: doc.add(new Field("date", (String)

Lock obtain timed out

2006-07-26 Thread Björn Ekengren
Ok, this might have been answered somewhere, but I can't find it so here goes: When I close my application containing index writers the lock files are left in the temp directory causing a "Lock obtain timed out" error upon the next restart. It works of course if I remove the locks manually inbe

Re: Timestamps as milliseconds

2006-07-26 Thread Miles Barr
Michael J. Prichard wrote: I am working on indexing emails and have stored the data as milliseconds. I was thinking of using a filter w/ my search that would only return the email in that date range. I am currently indexing as follows: doc.add(new Field("date", (String) itemContent.get("da

Timestamps as milliseconds

2006-07-26 Thread Michael J. Prichard
I am working on indexing emails and have stored the data as milliseconds. I was thinking of using a filter w/ my search that would only return the email in that date range. I am currently indexing as follows: doc.add(new Field("date", (String) itemContent.get("date").toString(), Field.Store

Re: RE : Re: index articles with groups

2006-07-26 Thread Erick Erickson
I think you're back to Karl's suggestion. Implement a HitCollector and ignore all hits on a group ID after the first one. You even get the most relevant article in the group that way ... Best Erick

Re: Limit number of search results

2006-07-26 Thread Miles Barr
headhunter wrote: I guess the recommended way to implement paging of results is to do your own query-results caching, right? Or does lucene also do this for me? The other guys have covered caching of results in a general way, so I won't go into that. For a search application I've written I

Method to speed up caching for faceted navigation

2006-07-26 Thread Johan Stuyts
Hi, I am working on faceted navigation. This is nothing new but I am anticipating an index that changes very frequently (every couple of seconds). After the index has been updated, I need to cache the bit sets of the facet values so I can do counting during searches later on. Because I need to get

RE : Re: index articles with groups

2006-07-26 Thread John john
Unfortunately this is not that easy, because I must be able to retrieve only one article, and if I index all the content in one document then the whole document will be retrieved instead of the single article. Chris Hostetter <[EMAIL PROTECTED]> wrote: : Then if I search for a word which is pr

Re: Limit number of search results

2006-07-26 Thread headhunter
Chris Hostetter wrote: > > [..] > > : In the first case: there is no unnecessary work. Lucene must look at > : every matching docId in order to determine which docs should be the > first > : 10. > [..] > Yes, you are right. Haven't thought of that :) 'Bout the second thing: You're right too.

RE: Index Rows as Documents? Help me design a solution

2006-07-26 Thread Mike Streeton
The only way you might get the performance you want is to have multiple IndexWriters writing to different indexes and then addAll at the end. You would obviously have to handle the multi threading and distribution of the parts of the log to each writer. Mike www.ardentia.com the home of NetSearc

Re: Limit number of search results

2006-07-26 Thread Chris Hostetter
: I'm still a little worried about doing unnecessary work - this is totally : different from what I know when working with DBMS. What are you describing as "unnecessary work": examining every document even though you only care about the first 10, or re-executing the search when you want results 11

RE: Copying documents

2006-07-26 Thread Mike Streeton
Chris, Thanks for this. I will have to do it the longhand way; we are trying to create "search marts" containing a smaller index from a much larger one, so cloning and deleting will not work. Thanks Mike www.ardentia.com the home of NetSearch -Original Message- From: Chris Hostetter

Re: Limit number of search results

2006-07-26 Thread headhunter
Hello Daniel, thank you for your answer. I'm still a little worried about doing unnecessary work - this is totally different from what I know when working with DBMS. Johannes

Re: Limit number of search results

2006-07-26 Thread Daniel Naber
On Wednesday 26 July 2006 08:24, headhunter wrote: > Is it recommended to do the search again - discarding the uninteresting > values - because lucene caches the results, or just because lucene is so > damn fast? Lucene is fast enough in 99% of the cases. Caching is only done by the operating sys