Re: highlighting and fragments

2007-09-21 Thread Michael J. Prichard
pike. Maybe looking at indexes 10x that. Thoughts? -Michael Erick Erickson wrote: Out of curiosity, how big is huge? And how many documents and fields? And a silly question, are you storing your fields or not (i.e. Field.Store.NO Erick On 9/20/07, Michael J. Prichard <[EMAIL PROTECTED

highlighting and fragments

2007-09-20 Thread Michael J. Prichard
Hello Folks, I wanted to stay away from storing text in the indexes in order to keep them smaller. I have a requirement now though to provide highlighting and, more so, fragments of the content so they will be displayed on the UI. Do you all prefer to store the text in the index to make this

Large Index Architecture

2007-08-29 Thread Michael J. Prichard
Hello All, I want to hear from those out there that have large (i.e. 50 GB+) indexes on how they have designed their architecture. I currently have an index for email that is 10 GB and growing. Right now there are no issues with it but I am about to get into an even bigger use for the softw

Re: Seeking Advice

2007-08-15 Thread Michael J. Prichard
I actually know from experience. Around 20% +/- 5% of emails will have attachments. If that helps. Again, I say index as much info as you can. Store what you think is necessary. Erick Erickson wrote: Rather than use efficiency arguments to drive the behavior of the app, I'd recommend that

Re: Seeking Advice

2007-08-15 Thread Michael J. Prichard
Hey Michael, Are you writing this software for yourself or for reselling? We built an email archiving service and we use Lucene as our search engine. We approach this a little differently. BUT, I don't think it is wasteful to index the header information with the attachment. Just don't st

Re: extracting non-english text from word, pdf, etc....??

2007-08-02 Thread Michael J. Prichard
Yea, I have seen those. I guess the question is what do you all use to extract text from Word, Excel, PPT and PDF? Can I use POI, PDFBox and so on? This is what I use now to extract English. Thanks, Michael testn wrote: If you can extract token stream from those files already, you can simp

Search Design Question

2007-03-23 Thread Michael J. Prichard
Hello All, We allow our users to search through our index with a simple textfield. The search phrase has "content" as its default value. This allows them to search quickly through content but then when they type "to:blah AND from:foo AND content:boogie" it will know to parse,etc. What I wa

Re: When to use HitCollector?

2007-01-07 Thread Michael J. Prichard
afoul of TooManyClauses exceptions. The default is 1,024 but you can make it as big as memory/time allows. And, as you say, this is temporary until you reconstruct your index. If this is totally irrelevant, perhaps you could add some more detail Best Erick On 1/7/07, Michael J. Prichard <

When to use HitCollector?

2007-01-07 Thread Michael J. Prichard
I have an index which has email and their attachments indexed. This is ok but the issue I am having is when I am trying to filter the searches. For example I can search the content of the email and the document (i.e. the attachment) and return the right results. Basically, if it is a documen

Re: DateTools oddity....

2006-10-18 Thread Michael J. Prichard
Dang it :) Anyway to set timezone? Emmanuel Bernard wrote: DateTools use GMT as a timezone Tue Aug 01 21:15:45 EDT 2006 Wed Aug 02 02:15:45 EDT 2006 Michael J. Prichard wrote: When I run this java code: Long dates = new Long("1154481345000"); Date dada

DateTools oddity....

2006-10-18 Thread Michael J. Prichard
When I run this java code: Long dates = new Long("1154481345000"); Date dada = new Date(dates.longValue()); System.out.println(dada.toString()); System.out.println(DateTools.dateToString(dada, DateTools.Resolution.DAY)); I get this output: Tue Aug 01 21:15:45 EDT 2006 200608
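
A minimal sketch of the effect discussed in this thread, assuming the day shift comes purely from formatting the timestamp in GMT instead of the local zone. It uses only java.text.SimpleDateFormat, not Lucene's DateTools, and the class and method names are hypothetical. The millisecond value is the one from the post.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class DayStampDemo {
    // Formats a timestamp as yyyyMMdd in the given time zone.
    static String dayStamp(long millis, String zoneId) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMdd");
        fmt.setTimeZone(TimeZone.getTimeZone(zoneId));
        return fmt.format(new Date(millis));
    }

    public static void main(String[] args) {
        long millis = 1154481345000L; // value quoted in the post
        // 21:15:45 EDT on Aug 1 is 01:15:45 GMT on Aug 2.
        System.out.println(dayStamp(millis, "America/New_York")); // 20060801
        System.out.println(dayStamp(millis, "GMT"));              // 20060802
    }
}
```

Any timestamp in the last four hours of an EDT day lands on the next calendar day once it is rendered in GMT, which matches the one-day difference reported above.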

java.io.IOException: term out of order --> HELP

2006-10-03 Thread Michael J. Prichard
We get this when trying to optimize index: Exception in thread "main" java.io.IOException: term out of order at org.apache.lucene.index.TermInfosWriter.add(TermInfosWriter.java:95) at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:305) at org.apache.lucene.index.SegmentM

"Greater than" equivalent?

2006-09-25 Thread Michael J. Prichard
I have a filtering process that checks my index for various things. I have an "itemid" field in this index and I keep track of the last itemid I search up to. I was wondering if there was an equivalent to doing a search with a "greater than" clause? Sort of like: to:[EMAIL PROTECTED] AND su

Search w/o looking at synonyms?

2006-08-06 Thread Michael J. Prichard
Howdy, I created some indexes that use a SynonymAnalyzer and now I want to be able to offer a choice as to search the synonyms or not. If I search now it will find all docs since the analyzer created tokens in the same position. How do I tell my IndexSearcher to not look at those tokens wit

Re: EMAIL ADDRESS: Tokenize (i.e. an EmailAnalyzer)

2006-08-04 Thread Michael J. Prichard
Chris Hostetter wrote: : Sure I would love to! Can you ping me at [EMAIL PROTECTED] and : let me know what I need to do? Do I just post it to JIRA? instructions on submitting code can be found in the wiki.. http://wiki.apache.org/jakarta-lucene/HowToContribute note in particular that since

Re: EMAIL ADDRESS: Tokenize (i.e. an EmailAnalyzer)

2006-07-31 Thread Michael J. Prichard
Steven Rowe wrote: Michael J. Prichard wrote: Hey Otis, Sure I would love to! Can you ping me at [EMAIL PROTECTED] and let me know what I need to do? Do I just post it to JIRA? Thanks, Michael Otis Gospodnetic wrote: A good place for that in JIRA. could you put it there? We

Filters or BooleanQuery

2006-07-31 Thread Michael J. Prichard
This is more of a design question. I have a ton of email that is indexed. I need to search based on a date range so I use a RangeQuery added to a BooleanQuery to search. This works. Now I need to include another clause that will narrow the result even more. AND on top of that I will need s

Re: EMAIL ADDRESS: Tokenize (i.e. an EmailAnalyzer)

2006-07-31 Thread Michael J. Prichard
so if you are okay with putting Apache license on top of the source code, we can include it there. Same for EmailAnalyzer. Otis - Original Message From: Michael J. Prichard <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Sunday, July 30, 2006 1:37:57 PM Subject: Re: EM

Re: PerFieldAnalyzerWrapper use? Analyzer's not being used as expected....

2006-07-31 Thread Michael J. Prichard
Awesome! Thanks! Otis Gospodnetic wrote: Or simpler: wr = new IndexWriter(indexDir, aWrapper, !IndexReader.indexExists(indexDir)); - Original Message From: Michael J. Prichard <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Sunday, July 30, 2006 1:35:29 PM Subje

Re: EMAIL ADDRESS: Tokenize (i.e. an EmailAnalyzer)

2006-07-30 Thread Michael J. Prichard
:) That JavaMail API is good for getting the whole email, but you then need to chop it up with your EmailAnalyzer, so you're doing the right thing. Otis - Original Message From: Michael J. Prichard <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Saturday, July 29,

Re: PerFieldAnalyzerWrapper use? Analyzer's not being used as expected....

2006-07-30 Thread Michael J. Prichard
Instead of catching the IOException, you may want to use !IndexReader.indexExists(...) in place of that boolean param to IndexWriter ctor. Otis - Original Message ---- From: Michael J. Prichard <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Saturday, July 29, 2006 4:04:2

Re: PerFieldAnalyzerWrapper use? Analyzer's not being used as expected....

2006-07-29 Thread Michael J. Prichard
Hey Erik, Will do. May I ask why? Out of curiosity. Thanks, Michael Erik Hatcher wrote: I think you should use a new instance of each analyzer for each field, not reuse instances. Other than that, your usage is fine. Erik On Jul 29, 2006, at 3:49 PM, Michael J. Prichard wrote

Re: PerFieldAnalyzerWrapper use? Analyzer's not being used as expected....

2006-07-29 Thread Michael J. Prichard
Oh my...disregard this question. It works...I was instantiating my IndexWriter before setting up my Analyzers!! Dangit...I feel a little dumb. I just switched the order and put the instantiated indexwriter last...it works. Thanks, Michael P.S. I feel somewhat silly! Michael J. Prichard

PerFieldAnalyzerWrapper use? Analyzer's not being used as expected....

2006-07-29 Thread Michael J. Prichard
So I have the following code... // let's get our SynonymAnalyzer SynonymAnalyzer synAnalyzer = getSynonymAnalyzer(); // let's get our EmailAnalyzer EmailAnalyzer emailAnalyzer = getEmailAnalyzer(); // set up perfieldanalyzer PerFieldAnalyzerWrapper aWrapper = new PerFieldAnalyzerWrapper(new Sta

Re: EMAIL ADDRESS: Tokenize (i.e. an EmailAnalyzer)

2006-07-29 Thread Michael J. Prichard
Hasan Diwan wrote: Michael: On 7/28/06, Michael J. Prichard <[EMAIL PROTECTED]> wrote: Howdy... not sure if anyone else wants this but here is my first attempt at writing an analyzer for an email address...modifications, updates, fixes welcome. Why reinvent the wheel? Se

EMAIL ADDRESS: Tokenize (i.e. an EmailAnalyzer)

2006-07-28 Thread Michael J. Prichard
Howdy... not sure if anyone else wants this but here is my first attempt at writing an analyzer for an email address...modifications, updates, fixes welcome. -- EmailAnalyzer import java.io.Reader; import org.apache.lucene.analysis.Analyzer; import org.apache.lucene.analysis.Lower

Indexing large sets of documents?

2006-07-27 Thread Michael J. Prichard
I built an indexer that runs through email and its attachments, rips out content and what not and then creates a Document and adds it to an index. It works w/ no problem. The issue is that it takes around 3-5 seconds per email and I have seen up to 10-15 seconds for email w/ attachments. I n

Re: To Tokenize or Un_Tokenize?

2006-07-26 Thread Michael J. Prichard
you'll want to also index [EMAIL PROTECTED] even if an email address looks like [EMAIL PROTECTED] Otis - Original Message ---- From: Michael J. Prichard <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Wednesday, July 26, 2006 4:33:10 PM Subject: To Tokenize or Un_Tokeniz

Re: To Tokenize or Un_Tokenize?

2006-07-26 Thread Michael J. Prichard
karl wettin wrote: On Wed, 2006-07-26 at 16:33 -0400, Michael J. Prichard wrote: If I want to search an email address (i.e. [EMAIL PROTECTED]) do I need to Tokenize that field? Do you want to match on the full address only, or on parts too? If A, don't tokenize. If B, tok
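
A rough sketch of the "full address or parts" choice raised in this reply, written in plain Java without any Lucene classes. It is not the EmailAnalyzer posted elsewhere in this list. The class name, method name, and sample address are hypothetical, and the choice of parts (whole address, local part, domain) is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class EmailParts {
    // Returns the lowercased full address, followed by the local part and
    // the domain when the input contains an '@' with text on both sides.
    static List<String> tokens(String address) {
        List<String> out = new ArrayList<>();
        String lower = address.toLowerCase(Locale.ROOT);
        out.add(lower); // the untokenized form, for exact matches
        int at = lower.indexOf('@');
        if (at > 0 && at < lower.length() - 1) {
            out.add(lower.substring(0, at));  // local part
            out.add(lower.substring(at + 1)); // domain
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [jane.doe@example.com, jane.doe, example.com]
        System.out.println(tokens("Jane.Doe@Example.com"));
    }
}
```

Indexing only the first token corresponds to option A (exact match on the whole address); indexing all three corresponds to option B.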

To Tokenize or Un_Tokenize?

2006-07-26 Thread Michael J. Prichard
If I want to search an email address (i.e. [EMAIL PROTECTED]) do I need to Tokenize that field? doc.add(new Field("from", (String) itemContent.get("from"), Field.Store.YES, Field.Index.TOKENIZED)); -OR- doc.add(new Field("from", (String) itemContent.get("from"), Field.Store.YES, Field.Index

Re: Timestamps as milliseconds

2006-07-26 Thread Michael J. Prichard
Michael J. Prichard wrote: Miles Barr wrote: Michael J. Prichard wrote: I am working on indexing emails and have stored the data as milliseconds. I was thinking of using a filter w/ my search that would only return the email in that data range. I am currently indexing as follows

Re: Timestamps as milliseconds

2006-07-26 Thread Michael J. Prichard
Miles Barr wrote: Michael J. Prichard wrote: I am working on indexing emails and have stored the data as milliseconds. I was thinking of using a filter w/ my search that would only return the email in that data range. I am currently indexing as follows: doc.add(new Field("date"

Timestamps as milliseconds

2006-07-26 Thread Michael J. Prichard
I am working on indexing emails and have stored the data as milliseconds. I was thinking of using a filter w/ my search that would only return the email in that data range. I am currently indexing as follows: doc.add(new Field("date", (String) itemContent.get("date").toString(), Field.Store
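
A small sketch of one common precaution when millisecond timestamps are indexed as strings, assuming the field is compared as text (as term ranges were in Lucene releases of this period). Zero-padding to a fixed width makes string order agree with numeric order. The class name and the width of 19 are illustrative choices, not taken from the thread.

```java
public class PaddedMillis {
    // Pads a non-negative timestamp to 19 digits, the length of Long.MAX_VALUE,
    // so that lexicographic comparison matches numeric comparison.
    static String pad(long millis) {
        if (millis < 0) {
            throw new IllegalArgumentException("negative timestamp");
        }
        return String.format("%019d", millis);
    }

    public static void main(String[] args) {
        // Unpadded strings sort incorrectly: "999" comes after "1154481345000".
        System.out.println("999".compareTo("1154481345000") > 0);       // true
        // Padded strings sort in numeric order.
        System.out.println(pad(999L).compareTo(pad(1154481345000L)) < 0); // true
    }
}
```

The same padding would have to be applied to the bounds of any range used at search time.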

Re: Building easy to use search guis? How to save queries...

2006-07-18 Thread Michael J. Prichard
That is really cool. But I am looking for something that I could save and then recreate. I am thinking of building an XML representation such as: or something similar. I just want to see if anyone has done something like this before even up to th

Re: Lucene index database

2006-07-12 Thread Michael J. Prichard
Ha Erick, we must have sent our responses at the same time :) What Erick said :) Erick Erickson wrote: This has been extensively discussed in the mail archive, I think a search of the archive would help you a lot. The short form is no. There's nothing built into Lucene to help you index a

Re: Lucene index database

2006-07-12 Thread Michael J. Prichard
Hey there Teresa. Short answer: Not directly. Long answer: Lucene is a set of libraries built for indexing text and then searching those indexes. Not sure what you mean by indexing a database per se. You could write some code to get the records you want from the database and then index tho

Re: indexing emails

2006-06-19 Thread Michael J. Prichard
nitials or first names and last name still need a PrefixQuery or WildcardQuery, if you want to search for last names, but it does make some queries possible which would otherwise blow up. -Original Message- From: karl wettin [mailto:[EMAIL PROTECTED] Sent: 16 June 20

Unique indexes?

2006-06-18 Thread Michael J. Prichard
Is there anything like a unique key for lucene indexes? For example, say I want to have unique ItemID's in my index...do I need to check for that before insert or can I lock it down with Lucene's API?

Re: indexing emails --> multiple "to" emails, setting position same

2006-06-18 Thread Michael J. Prichard
So I have emails with multiple recipients (of course, this is very common). I currently put them all on the same string separated by space and then tokenize them with Standard Analyzer. I was looking into SynonymAnalyzers and see that you can drop multiple tokens with the same position. Woul

Re: indexing emails

2006-06-18 Thread Michael J. Prichard
From: karl wettin [mailto:[EMAIL PROTECTED] Sent: 16 June 2006 21:13 To: java-user@lucene.apache.org Subject: Re: indexing emails On Fri, 2006-06-16 at 15:20 -0400, Michael J. Prichard wrote: I am working on indexing emails and want to have a "to" field. I am currently putting all the

indexing emails

2006-06-16 Thread Michael J. Prichard
I am working on indexing emails and want to have a "to" field. I am currently putting all the emails on one line separated w/ spaces...example: [EMAIL PROTECTED] [EMAIL PROTECTED] [EMAIL PROTECTED] Then I index that with a StandardAnalyzer as follows: doc.add(new Field("to", (String) itemCont

Re: Document design and analyzer questions?

2006-06-13 Thread Michael J. Prichard
Hey Chris, Thanks for the response. Chris Hostetter wrote: : Question is two fold. One, here is the layout I was thinking: my rule of thumb: if a field is going to contain less than a few dozen bytes (ie: a date, an email address, etc) you might as well store it ... it will make your life ea

Document design and analyzer questions?

2006-05-31 Thread Michael J. Prichard
Hello, I will try this again... I am working on a system that will index emails and their attachments. I have all the pieces working that parse the documents and I am now working on the actual indexing part. I would like to have synonym searching as well. Question is two fold. One, here

Document design, analyzer questions?

2006-05-30 Thread Michael J. Prichard
Hello, I am working on a system that will index emails and their attachments. I have all the pieces working that parse the documents and I am now working on the actual indexing part. I would like to have synonym searching as well. Question is two fold. One, here is the layout I was thinki