Re: Matching accented with non-accented characters

2006-07-25 Thread John Haxby
Rajan, Renuka wrote: I am trying to match accented characters with non-accented characters in French/Spanish and other Western European languages. The use case is that the users may type letters without accents in error and we still want to be able to retrieve valid matches. The one idea, albeit

Re: email libraries

2006-07-26 Thread John Haxby
Suba Suresh wrote: Anyone know of good free email libraries I can use for lucene indexing for Windows Outlook Express and Unix emails?? javamail. Not sure how you get hold of the messages from Outlook Express, but getting hold of the MIME message in most Unix-based message stores is relativel

Re: email libraries

2006-07-30 Thread John Haxby
Andrzej Bialecki wrote: Just for the record - I've been using javamail POP and IMAP providers in the past, and they were prone to hanging with some servers, and resource intensive. I've been also using Outlook (proper, not Outlook Express - this is AFAIK impossible to work with) via a Java-COM

Re: Best Practice: emails and file-attachments

2006-08-15 Thread John Haxby
lude wrote: does anybody have an idea what the best design approach is for realizing the following: The goal is to index emails and their corresponding file attachments. One email could contain for example: I put a fair amount of thought into this when I was doing the design for our mail server -

Re: Best Practice: emails and file-attachments

2006-08-16 Thread John Haxby
lude wrote: Hi John, thanks for the detailed answer. You wrote: If you're indexing a multipart/alternative bodypart then index all the MIME headers, but only index the content of the *first* bodypart. Does this mean you index just the first file-attachment? What do you advise, if you have to

Re: Best Practice: emails and file-attachments

2006-08-16 Thread John Haxby
lude wrote: You also mentioned indexing each bodypart ("attachment") separately. Why? To my mind, there is no use case where it makes sense to search a particular bodypart I will give you the use case: [snip] 3.) The result list would show this: 1. mail-1 'subject' 'Abstract of the messa

Re: Best Practice: emails and file-attachments

2006-08-16 Thread John Haxby
/gif application/msword the indenting indicates nesting. A message isn't just a bodypart followed by attachments, it has structure like a file system. Something which escapes most mail readers. Sigh. John Haxby wrote: lude wrote: You also mentioned indexing each bodypart ("

Re: DateTools again

2006-10-02 Thread John Haxby
Volodymyr Bychkoviak wrote: I'm using DateTools with Resolution.DAY. I know that dates internally are converted to GMT. Converting dates "2006-10-01 00:00" and "2006-10-01 15:00" from "Etc/GMT-2" timezone will give us "20060930" and "20061001" respectively. But these dates are identical with
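
The day shift described in the question can be reproduced without Lucene. The following is a minimal sketch using only java.time; the class and method names (DayResolutionDemo, utcDay) are hypothetical and this is not DateTools itself, only an illustration of converting a local time to UTC before truncating to day resolution. Note that "Etc/GMT-2" follows the POSIX sign convention and means UTC+2.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class DayResolutionDemo {
    private static final DateTimeFormatter INPUT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm");
    private static final DateTimeFormatter DAY =
            DateTimeFormatter.ofPattern("yyyyMMdd");

    // Hypothetical helper: interpret a local date-time in the given zone,
    // move it to UTC, then keep only the day part.
    static String utcDay(String localDateTime, String zoneId) {
        return LocalDateTime.parse(localDateTime, INPUT)
                .atZone(ZoneId.of(zoneId))
                .withZoneSameInstant(ZoneOffset.UTC)
                .format(DAY);
    }

    public static void main(String[] args) {
        // Midnight in UTC+2 is 22:00 on the previous day in UTC.
        System.out.println(utcDay("2006-10-01 00:00", "Etc/GMT-2")); // 20060930
        // 15:00 in UTC+2 is 13:00 on the same day in UTC.
        System.out.println(utcDay("2006-10-01 15:00", "Etc/GMT-2")); // 20061001
    }
}
```

Two times on the same local calendar day therefore land on different UTC days, which matches the "20060930" and "20061001" values quoted above.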

Re: DateTools again

2006-10-02 Thread John Haxby
John Haxby wrote: I ran across the problem with DateTools not using UTC when I tried to use an index created in California from the UK: I was looking for documents with a particular date stamp but I found documents with a date stamp from the wrong day. Even more interesting and bizarre

Re: DateTools again

2006-10-03 Thread John Haxby
Volodymyr Bychkoviak wrote: The user has an input (JavaScript calendar) on the page where he can choose some date to include in the search. Search resolution is day resolution. If the user enters the same date at a different time of day he will get different results (because calendar will also set current hour

Re: Spam filter for lucene project

2006-10-06 Thread John Haxby
Rajiv Roopan wrote: Hello, I'm currently running a site which allows users to post. Lately posts have been getting out of hand. I was wondering if anyone knows of an open source spam filter that I can add to my project to scan the posts (which are just plain text) for spam? spamassassin shoul

Re: Searching by bit masks

2006-11-10 Thread John Haxby
Larry Taylor wrote: What we need to do is to be able to store a bit mask specifying various filter flags for a document in the index and then search this field by specifying another bit mask with desired filters, returning documents that have any of the specified flags set. In other words, we are
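
One common way to approach an "any of these flags" search in a term-based index is to expand the mask into one token per set bit, so that the query becomes an OR over tokens. The reply above is truncated, so this is not necessarily what was suggested in the thread; it is a minimal sketch with hypothetical names (FlagTerms, terms, matchesAny, the "flag" token prefix) and it uses plain collections rather than any Lucene API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class FlagTerms {
    // Expand a 32-bit mask into one token per set bit, e.g. 0b101 -> [flag0, flag2].
    // Each token could be indexed as a separate term in a flags field (assumption).
    static List<String> terms(int mask) {
        List<String> out = new ArrayList<>();
        for (int bit = 0; bit < 32; bit++) {
            if ((mask & (1 << bit)) != 0) {
                out.add("flag" + bit);
            }
        }
        return out;
    }

    // In-memory equivalent of an OR query over the tokens:
    // true when the document shares at least one flag with the query mask.
    static boolean matchesAny(int docMask, int queryMask) {
        return !Collections.disjoint(terms(docMask), terms(queryMask));
    }

    public static void main(String[] args) {
        System.out.println(terms(0b101));                // [flag0, flag2]
        System.out.println(matchesAny(0b0110, 0b0100));  // true
        System.out.println(matchesAny(0b0110, 0b1001));  // false
    }
}
```

The trade-off is a few extra terms per document in exchange for letting the index do the filtering instead of post-processing every hit.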

Re: Websphere and Dark Matter

2007-01-16 Thread John Haxby
Rollo du Pre wrote: We have a scenario where a web search app using Lucene causes Websphere 5.1 allocated memory to grow but not shrink. JProfiler shows the heap shrinks back ok, leaving the JVM with over 1GB allocated to the jvm but only 400MB in use. Websphere does not perform a level 2 garbage

Re: Websphere and Dark Matter

2007-01-22 Thread John Haxby
Nadav Har'El wrote: On Tue, Jan 16, 2007, Rollo du Pre wrote about "Re: Websphere and Dark Matter": I was hoping it would, yes. Does websphere not release memory back to the OS when it no longer needs it? I'm concerned that if the memory spiked for some reason (indexing a large document) th

Slightly off-topic: using openoffice for conversions

2007-01-29 Thread John Haxby
Hello All, In LIA, Erik and Otis mention using the openoffice.org API for converting from various formats to something that can be used for indexing. Does anyone have any examples of doing this that they'd be willing to share? jch

Re: Building lucene index using 100 Gb Mobile HardDisk

2007-02-05 Thread John Haxby
maureen tanuwidjaja wrote: Oh, is it? I didn't know about that... so does it mean I can't use this Mobile HDD? Damien McCarthy <[EMAIL PROTECTED]> wrote: FAT32 imposes a lower file size limit than NTFS. Attempts to create files greater than 4 GB on FAT32 will throw the error you are seeing. No

Re: Clearing locks

2007-03-06 Thread John Haxby
MC Moisei wrote: Is there an easy way to clear locks? If I redeploy my war file and it happens that indexing is in progress, the lock is not cleared. I know I can tell the JVM to run the finalizers before it exits, but in this case the JVM is not exiting, this being a hot deploy. I'd do this by ha

Re: Index a source, but not store it... can it be done?

2007-03-09 Thread John Haxby
Chris Hostetter wrote: i'm not a crypto expert, but i imagine it would probably take the same amount of statistical guess work to reconstruct meaningful info from either approach (hashing the individual words compared to eliminating the positions) so i would think the trade off of supporting phrase

Re: index word files ( doc )

2007-03-26 Thread John Haxby
Sami Siren wrote: There's also antiword [1] which can convert your .doc to plain text or PS, not sure how good it is. antiword isn't very good. I use wvWare (http://wvware.sourceforge.net/) directly, but you may find that using abiword is better for you (abiword is an editor, but it also do

Re: index word files ( doc )

2007-03-26 Thread John Haxby
John Haxby wrote: Sami Siren wrote: There's also antiword [1] which can convert your .doc to plain text or PS, not sure how good it is. antiword isn't very good. I use wvWare (http://wvware.sourceforge.net/) directly, but you may find that using abiword is better for you (abi

Re: index word files ( doc )

2007-03-28 Thread John Haxby
Daniel Noll wrote: The only screenshots I can see look like plain text to me, and I'm currently working on something which needs to convert Word to HTML, which is why I ask. wvWare, which I mentioned earlier, can convert word to HTML and does a pretty good job of maintaining formatting. abiwor

Re: why Apache doesnt create a nice forum like the others???

2007-03-28 Thread John Haxby
karl wettin wrote: The way I see it (and probably many others) mailing lists are superior in many ways, especially when following multiple forums. It's true. Any forum that I need to subscribe to I find an RSS feed for so that I can get mail messages. Forums are a pain in the neck once you'

Re: why Apache doesnt create a nice forum like the others???

2007-03-28 Thread John Haxby
Grant Ingersoll wrote: I like the mailing list approach much better. With a good set of rules and folders in place (which takes about 15 minutes to setup), one can easily manage large volumes of mail w/o batting an eye, whereas forums require large amounts of navigation, IMO. Glad I'm not th

Re: why Apache doesnt create a nice forum like the others???

2007-03-29 Thread John Haxby
Mohammad Norouzi wrote: I registered in Nabble, but to post message you should subscribe to lucene mailing list and if you subscribe to mailing list your inbox will become full of messages. this is very bad!!! You're using gmail aren't you? Why don't you set up a filter to handle mail from th

Re: Searching an NTFS File Server

2005-04-14 Thread John Haxby
Maher Martin wrote: * The user's access rights would be read from Active Directory (i.e windows group membership, etc) * On the submission of a query to Lucene - the user / group access rights would be appended as required search criteria and Lucene would filter out all results that the user should

Re: Update performance/indexwriter.delete()?

2005-04-15 Thread John Haxby
Roy Klein wrote: Here's the scenario that I can't guarantee won't happen: There might be 3 transactions in a very short time span (for example, 1 second), here's what they are: 1) update doc1 (DEL doc1, ADD doc1) 2) update doc2 (DEL doc2, ADD doc2) 3) delete doc1 If I process these in order, then a

Re: Top most frequent words

2005-05-12 Thread John Haxby
Otis Gospodnetic wrote: Somebody asked about this today, and I just found this through Simpy: http://www.unine.ch/info/clef/ Scroll half-way through the page, look on the right side: 1,000 most frequent words for several languages. Hmm. I'm not sure how valuable that is. For English "los" a

Re: NFS

2005-05-18 Thread John Haxby
Otis Gospodnetic wrote: I haven't used Lucene with NFS. My understanding is that the problem is with lock files when they reside on the NFS server. Yes, you can change the location of lock files with a system property, but if you are using NFS to make the index accessible from multiple machines,

Re: NFS

2005-05-18 Thread John Haxby
Paul Libbrecht wrote: On 18 May 05, at 11:51, John Haxby wrote: I haven't tried this, but under Linux (at least), you can specify the "nolock" parameter to make file locking happen locally. Of course, this will make it impossible to use NFS to share the index among several ma

Re: Optimizing indexes with mulitiple processors?

2005-06-10 Thread John Haxby
Chris Collins wrote: Ok that part isnt surprising. However only about 1% of 30% of the merge was spent in the OS.flush call (not very IO bound at all with this controller). On Linux, at least, measuring the time taken in OS.flush is not a good way to determine if you're I/O bound -- all tha

Re: Search for documents where field does not exist?

2005-06-20 Thread John Haxby
Erik Hatcher wrote: On Jun 17, 2005, at 5:54 PM, [EMAIL PROTECTED] wrote: Please do not reply to a post on the list if your message actually isn't a reply. Post a new message instead. Sorry about that.. wasn't intentional.. clicked reply to get the reply address and then forgot to change

Re: performance: gcj, sun, ibm ?

2005-08-04 Thread John Haxby
Martin Rode wrote: hello all, lucene is already pretty fast, but i was wondering if you guys have experience with using gcj (on linux). how much faster is it for indexing? personally i have best performance with java-ibm, at least under linux. it would be interesting to hear how your exper

Re: lucene and UTF-8

2005-09-29 Thread John Haxby
John Cherouvim wrote: I'm having some problems indexing my UTF-8 html pages. I am running lucene on Linux and I cannot understand why the generated index depends on the locale of my operating system. If I do set | grep LANG I get: LANG=el_GR which is Greek. If I set this to en_US the inde

Re: Cache index in RAMDirectory and evict

2006-01-12 Thread John Haxby
Kan Deng wrote: 1. Performance. Since all the cached disk data resides outside JVM heap space, the access efficiency from Java object to those cached data cannot be too high. True, but you need to compare the relative speeds. If data has to be pulled from a file, then you're talking se

Re: Memory

2006-01-17 Thread John Haxby
Aigner, Thomas wrote: I did a man on top and sure enough there was a PPID command on Linux (f then B) for parent process. And yes, they always have the same parent command. Thanks for your help as I'm obviously still a noob on Unix. Nope, that doesn't tell you they're different thre

Re: Range queries

2006-01-24 Thread John Haxby
Erik Hatcher wrote: 2. How do I search for negative numbers in a range. For example field:[-3 TO 2] ? I don't mind hacking code such that my numbers are indexed as +0001 and -0001 and then I can override the query parser to change my query to [-003 TO +002]. However.. "+"
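
Sign-prefixed padding such as "+0001" and "-0001" does not sort numerically as plain text: "+" compares lower than "-", so positives sort before negatives, and "-001" compares lower than "-003" even though -3 is the smaller number. A different technique, swapped in here purely for illustration and not taken from the thread, is to shift every value by a fixed bias so that all encoded values are non-negative, then zero-pad. The class and method names (SortableInt, encode, decode) are hypothetical.

```java
import java.util.Locale;

public class SortableInt {
    // Shift the int range [-2^31, 2^31-1] to [0, 2^32-1] and zero-pad to
    // 10 digits, so that string order equals numeric order.
    static String encode(int n) {
        long shifted = (long) n - Integer.MIN_VALUE;
        return String.format(Locale.ROOT, "%010d", shifted);
    }

    // Reverse the shift to recover the original value.
    static int decode(String s) {
        return (int) (Long.parseLong(s) + Integer.MIN_VALUE);
    }

    public static void main(String[] args) {
        String lower = encode(-3);
        String upper = encode(2);
        // A range such as [-3 TO 2] would be rewritten to [lower TO upper].
        System.out.println(lower + " TO " + upper);
        System.out.println(lower.compareTo(upper) < 0); // true
        System.out.println(decode(lower));               // -3
    }
}
```

The same encoding has to be applied both when indexing the field and when rewriting range queries, otherwise the endpoints will not line up with the indexed terms.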

Re: encoding

2006-01-26 Thread John Haxby
arnaudbuffet wrote: For text files, data could be in different languages and so in different encodings. If data are in Turkish for example, all special characters and accents are not recognized in my lucene index. Is there a way to resolve the problem? How do I work with the encoding? I've been looking

Re: encoding

2006-01-26 Thread John Haxby
arnaudbuffet wrote: if I try to index a text file encoded in Western 1252 for exemple with the Turkish text "düzenlediğimiz kampanyamıza" the lucene index will contain re encoded data with �k�� ISOLatin1AccentFilter.removeAccents() converts that string to "duzenlediğimiz kampanyamıza"
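
As the reply above shows, ISOLatin1AccentFilter leaves "ğ" untouched because it lies outside Latin-1. A more general approach, sketched here with the JDK's java.text.Normalizer rather than any Lucene filter, is to decompose the text (NFD) and drop the combining marks. The class and method names (AccentFolding, fold) are hypothetical. Letters with no canonical decomposition, such as the Turkish dotless "ı" (U+0131), pass through unchanged, so this is not a complete folding table.

```java
import java.text.Normalizer;

public class AccentFolding {
    // Decompose accented letters into base letter + combining mark,
    // then remove every combining mark (Unicode category M).
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "düzenlediğimiz kampanyamıza", written with escapes so the source
        // file does not depend on the compiler's input encoding.
        String turkish = "d\u00fczenledi\u011fimiz kampanyam\u0131za";
        // u-umlaut and g-breve lose their marks; dotless i (U+0131) is kept.
        System.out.println(fold(turkish));
        System.out.println(fold("caf\u00e9")); // cafe
    }
}
```

For matching accented and unaccented input, the same folding would need to be applied at both index time and query time.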

Re: encoding

2006-01-27 Thread John Haxby
petite_abeille wrote: I would love to see this. I presently have a somewhat unwieldy conversion table [1] that I would love to get rid of :)) [snip] [1] http://dev.alt.textdrive.com/browser/lu/LUStringBasicLatin.txt I've attached the perl script -- feed http://www.unicode.org/Public/4.1.0/u

Re: Performance and FS block size

2006-02-12 Thread John Haxby
Otis Gospodnetic wrote: I'm somewhat familiar with ext3 vs. ReiserFS stuff, but that's not really what I'm after (finding a better/faster FS). What I'm wondering is about different block sizes on a single (ext3) FS. If I understand block sizes correctly, they represent a chunk of data that th

Re: Performance and FS block size

2006-02-13 Thread John Haxby
Andrzej Bialecki wrote: None of you mentioned yet the aspect that 4k is the memory page size on IA32 hardware. This in itself would favor any operations using multiple of this size, and penalize operations using amounts below this size. For normal I/O it will rarely make any difference at al

Re: Encryption

2006-05-06 Thread John Haxby
George Washington wrote: Is it possible to reconstruct a complete source document from the data stored in the index, even if the fields are only indexed but not stored? Because if the answer is "yes" there is no point in encrypting, unless the index itself can be encrypted. Is it feasible to e

Re: indexing emails

2006-06-19 Thread John Haxby
ng for "lucene.apache.org" to look for mail sent to the lucene lists, they search for me variously as "jch", "john haxby" and "haxby"; they even, occasionally, search for complete mail addresses. They all work. The RFC2047 syntax in the example above gives

Re: indexing emails

2006-06-19 Thread John Haxby
Michael J. Prichard wrote: We are actually grabbing emails by becoming part of the SMTP stream. This part is figured out and we have archived over 600k emails into a mysql database. The problem is that since we currently store the blobs in the DB this databases are getting large and searching