pike. Maybe looking at indexes 10x that.
Thoughts?
-Michael
Erick Erickson wrote:
Out of curiosity, how big is huge? And how many documents and
fields?
And a silly question, are you storing your fields or not (i.e.
Field.Store.NO
Erick
On 9/20/07, Michael J. Prichard <[EMAIL PROTECTED
Hello Folks,
I wanted to stay away from storing text in the indexes in order to keep
them smaller. I have a requirement now though to provide highlighting
and, more so, fragments of the content so they will be displayed on the UI.
Do you all prefer to store the text in the index to make this
Hello All,
I want to hear from those out there that have large (i.e. 50 GB+)
indexes on how they have designed their architecture. I currently have
an index for email that is 10 GB and growing. Right now there are no
issues with it but I am about to get into an even bigger use for the
softw
I actually know from experience. Around 20% +/- 5% of emails will have
attachments. If that helps. Again, I say index as much info as you
can. Store what you think is necessary.
Erick Erickson wrote:
Rather than use efficiency arguments to drive the behavior of the
app, I'd recommend that
Hey Michael,
Are you writing this software for yourself or for reselling? We built
an email archiving service and we use lucene as our search engine. We
approach this a little differently.
BUT, I don't think it is wasteful to index the header information with
the attachment. Just don't st
Yea, I have seen those. I guess the question is what do you all use to
extract text from Word, Excel, PPT and PDF? Can I use POI, PDFBox and
so on? This is what I use now to extract English.
Thanks,
Michael
testn wrote:
If you can extract token stream from those files already, you can simp
Hello All,
We allow our users to search through our index with a simple textfield.
The search phrase has "content" as its default value. This allows them
to search quickly through content but then when they type "to:blah AND
from:foo AND content:boogie" it will know to parse, etc.
What I wa
afoul of
TooManyClauses exceptions. The default is 1,024 but you can make it as
big
as memory/time allows. And, as you say, this is temporary until you
reconstruct your index.
If this is totally irrelevant, perhaps you could add some more detail
Best
Erick
On 1/7/07, Michael J. Prichard <
I have an index which has email and their attachments indexed. This is
ok but the issue I am having is when I am trying to filter the
searches. For example I can search the content of the email and the
document (i.e. the attachment) and return the right results. Basically,
if it is a documen
Dang it :)
Any way to set the timezone?
Emmanuel Bernard wrote:
DateTools uses GMT as a timezone
Tue Aug 01 21:15:45 EDT 2006
Wed Aug 02 02:15:45 EDT 2006
Michael J. Prichard wrote:
When I run this java code:
Long dates = new Long("1154481345000");
Date dada
When I run this java code:
Long dates = new Long("1154481345000");
Date dada = new Date(dates.longValue());
System.out.println(dada.toString());
System.out.println(DateTools.dateToString(dada,
DateTools.Resolution.DAY));
I get this output:
Tue Aug 01 21:15:45 EDT 2006
200608
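The reply quoted earlier on this page says DateTools works in GMT, which is why an evening timestamp in EDT can land on the following calendar day. The sketch below does not call Lucene at all: it uses the JDK's SimpleDateFormat with a yyyyMMdd pattern as a stand-in for day-resolution formatting, and the class and method names are made up for illustration. It only shows how the same millisecond value maps to different days depending on the time zone.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, not part of Lucene: formats a date at day
// resolution (yyyyMMdd) in an explicitly chosen time zone.
class DayResolutionDemo {

    static String dayString(Date date, String zoneId) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMdd", Locale.ROOT);
        fmt.setTimeZone(TimeZone.getTimeZone(zoneId));
        return fmt.format(date);
    }

    public static void main(String[] args) {
        // Same timestamp as in the message above.
        Date dada = new Date(1154481345000L);

        // 01:15:45 on Aug 2 in GMT
        System.out.println(dayString(dada, "GMT"));              // 20060802
        // 21:15:45 on Aug 1 in US Eastern (EDT)
        System.out.println(dayString(dada, "America/New_York")); // 20060801
    }
}
```

If local-day semantics matter for searching, one option is to do the zone conversion yourself, as above, and index the resulting string instead of relying on a GMT-based conversion.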
We get this when trying to optimize index:
Exception in thread "main" java.io.IOException: term out of order
at org.apache.lucene.index.TermInfosWriter.add(TermInfosWriter.java:95)
at
org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:305)
at
org.apache.lucene.index.SegmentM
I have a filtering process that checks my index for various things. I
have an "itemid" field in this index and I keep track of the last itemid
I search up to. I was wondering if there was an equivalent to doing a
search with a "greater than" clause? Sort of like:
to:[EMAIL PROTECTED] AND su
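One common way to get a "greater than" on a numeric field with term-based range queries is to index the number as a fixed-width, zero-padded string, because terms are compared as strings and "9" sorts after "10". The sketch below is JDK-only and rests on assumptions: that itemid is a non-negative long, that it is indexed untokenized in the padded form, and that the query parser's exclusive range syntax ({a TO b}) is used. The class and method names are hypothetical.

```java
import java.util.Locale;

// Hypothetical helper: makes string order agree with numeric order
// for non-negative long ids by zero-padding to a fixed width.
class PaddedIds {

    // 19 digits is enough for Long.MAX_VALUE.
    static String pad(long id) {
        if (id < 0) {
            throw new IllegalArgumentException("negative ids are not handled in this sketch");
        }
        return String.format(Locale.ROOT, "%019d", id);
    }

    // Builds an exclusive range clause covering ids strictly greater than lastSeen.
    // Note that the exclusive upper bound leaves out Long.MAX_VALUE itself.
    static String greaterThan(String field, long lastSeen) {
        return field + ":{" + pad(lastSeen) + " TO " + pad(Long.MAX_VALUE) + "}";
    }

    public static void main(String[] args) {
        // Unpadded strings sort incorrectly: "9" comes after "10".
        System.out.println("9".compareTo("10") > 0);
        // Padded strings sort in numeric order.
        System.out.println(pad(9).compareTo(pad(10)) < 0);
        // A clause that could be AND-ed onto the rest of the query.
        System.out.println(greaterThan("itemid", 42));
    }
}
```

The same padding has to be applied at index time and at query time, otherwise the range will not line up with the stored terms.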
Howdy,
I created some indexes that use a SynonymAnalyzer and now I want to be
able to offer a choice as to search the synonyms or not. If I search
now it will find all docs since the analyzer created tokens in the same
position. How do I tell my IndexSearcher to not look at those tokens
wit
Chris Hostetter wrote:
: Sure I would love to! Can you ping me at [EMAIL PROTECTED] and
: let me know what I need to do? Do I just post it to JIRA?
instructions on submitting code can be found in the wiki..
http://wiki.apache.org/jakarta-lucene/HowToContribute
note in particular that since
Steven Rowe wrote:
Michael J. Prichard wrote:
Hey Otis,
Sure I would love to! Can you ping me at [EMAIL PROTECTED] and
let me know what I need to do? Do I just post it to JIRA?
Thanks,
Michael
Otis Gospodnetic wrote:
A good place for that is JIRA. Could you put it there? We
This is more of a design question. I have a ton of email that is
indexed. I need to search based on a date range so I use a RangeQuery
added to a BooleanQuery to search. This works. Now I need to include
another clause that will narrow the result even more. AND on top of
that I will need s
so if you are okay with putting Apache license
on top of the source code, we can include it there. Same for EmailAnalyzer.
Otis
----- Original Message ----
From: Michael J. Prichard <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Sunday, July 30, 2006 1:37:57 PM
Subject: Re: EM
Awesome! Thanks!
Otis Gospodnetic wrote:
Or simpler:
wr = new IndexWriter(indexDir, aWrapper, !IndexReader.indexExists(indexDir));
----- Original Message ----
From: Michael J. Prichard <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Sunday, July 30, 2006 1:35:29 PM
Subje
:)
That JavaMail API is good for getting the whole email, but you then need to
chop it up with your EmailAnalyzer, so you're doing the right thing.
Otis
----- Original Message ----
From: Michael J. Prichard <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Saturday, July 29,
Instead of catching the
IOException, you may want to use !IndexReader.indexExists(...) in place of that
boolean param to IndexWriter ctor.
Otis
----- Original Message ----
From: Michael J. Prichard <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Saturday, July 29, 2006 4:04:2
Hey Erik,
Will do. May I ask why? Out of curiosity.
Thanks,
Michael
Erik Hatcher wrote:
I think you should use a new instance of each analyzer for each
field, not reuse instances. Other than that, your usage is fine.
Erik
On Jul 29, 2006, at 3:49 PM, Michael J. Prichard wrote
Oh my...disregard this question. It works...I was instantiating my
IndexWriter before setting up my Analyzers!! Dangit...I feel a little
dumb. I just switched the order and put the instantiated indexwriter
last...it works.
Thanks,
Michael
P.S. I feel somewhat silly!
Michael J. Prichard
So I have the following code...
// let's get our SynonymAnalyzer
SynonymAnalyzer synAnalyzer = getSynonymAnalyzer();
// let's get our EmailAnalyzer
EmailAnalyzer emailAnalyzer = getEmailAnalyzer();
// set up perfieldanalyzer
PerFieldAnalyzerWrapper aWrapper = new PerFieldAnalyzerWrapper(new
Sta
Hasan Diwan wrote:
Michael:
On 7/28/06, Michael J. Prichard <[EMAIL PROTECTED]> wrote:
Howdy...not sure if anyone else wants this but here is my first attempt
at writing an analyzer for an email address...modifications, updates,
fixes welcome.
Why reinvent the wheel? Se
Howdy...not sure if anyone else wants this but here is my first attempt
at writing an analyzer for an email address...modifications, updates,
fixes welcome.
-- EmailAnalyzer
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Lower
I built an indexer that runs through email and its attachments, rips out
content and what not and then creates a Document and adds it to an
index. It works w/ no problem. The issue is that it takes around 3-5
seconds per email and I have seen up to 10-15 seconds for email w/
attachments. I n
you'll want to also index [EMAIL PROTECTED] even if an email address looks like
[EMAIL PROTECTED]
Otis
- Original Message ----
From: Michael J. Prichard <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, July 26, 2006 4:33:10 PM
Subject: To Tokenize or Un_Tokeniz
karl wettin wrote:
On Wed, 2006-07-26 at 16:33 -0400, Michael J. Prichard wrote:
If I want to search an email address (i.e. [EMAIL PROTECTED]) do I need to
Tokenize that field?
Do you want to match on the full address only, or on parts too?
If A, don't tokenize.
If B, tok
If I want to search an email address (i.e. [EMAIL PROTECTED]) do I need to
Tokenize that field?
doc.add(new Field("from", (String) itemContent.get("from"),
Field.Store.YES, Field.Index.TOKENIZED));
-OR-
doc.add(new Field("from", (String) itemContent.get("from"),
Field.Store.YES, Field.Index
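The choice between the two Field.Index options comes down to whether a search must match only the full address or its parts as well. As a rough illustration of the "parts too" case, the sketch below breaks one address into the full address, the local part, the domain, and the domain labels. It is plain Java with no Lucene dependency, it is not the EmailAnalyzer posted on the list, and the class name, method name, and example address are all made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: lists the terms one might want searchable
// for a single email address.
class EmailTokens {

    static List<String> tokens(String address) {
        Set<String> out = new LinkedHashSet<>();   // keeps order, drops duplicates
        String a = address.trim().toLowerCase(Locale.ROOT);
        if (a.isEmpty()) {
            return new ArrayList<>(out);
        }
        out.add(a);                                // full address
        int at = a.indexOf('@');
        if (at > 0 && at < a.length() - 1) {
            out.add(a.substring(0, at));           // local part
            String domain = a.substring(at + 1);
            out.add(domain);                       // whole domain
            for (String label : domain.split("\\.")) {
                if (!label.isEmpty()) {
                    out.add(label);                // each domain label
                }
            }
        }
        return new ArrayList<>(out);
    }

    public static void main(String[] args) {
        System.out.println(tokens("John.Doe@Example.com"));
        // [john.doe@example.com, john.doe, example.com, example, com]
    }
}
```

Keeping the full address as one of the terms means an exact-address search still works even when the parts are searchable too.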
Michael J. Prichard wrote:
Miles Barr wrote:
Michael J. Prichard wrote:
I am working on indexing emails and have stored the date as
milliseconds. I was thinking of using a filter w/ my search that
would only return the email in that date range. I am currently
indexing as follows
Miles Barr wrote:
Michael J. Prichard wrote:
I am working on indexing emails and have stored the date as
milliseconds. I was thinking of using a filter w/ my search that
would only return the email in that date range. I am currently
indexing as follows:
doc.add(new Field("date"
I am working on indexing emails and have stored the date as
milliseconds. I was thinking of using a filter w/ my search that would
only return the email in that date range. I am currently indexing as
follows:
doc.add(new Field("date", (String) itemContent.get("date").toString(),
Field.Store
That is really cool. But I am looking for something that I could save
and then recreate. I am thinking of building an XML representation such as:
or something similar. I just want to see if anyone has done something
like this before
even up to th
Ha Erick,
we must have sent our responses at the same time :)
What Erick said :)
Erick Erickson wrote:
This has been extensively discussed in the mail archive, I think a
search of
the archive would help you a lot.
The short form is no. There's nothing built into Lucene to help you
index a
Hey there Teresa.
Short answer: Not directly.
Long answer: Lucene is a set of libraries built for indexing text and
then searching those indexes. Not sure what you mean by indexing a
database per se. You could write some code to get the records you want
from the database and then index tho
nitials
or first names and last name still need a PrefixQuery or
WildcardQuery, if
you want to search for last names, but it does make some queries
possible
which would otherwise blow up.
-Original Message-
From: karl wettin [mailto:[EMAIL PROTECTED]
Sent: 16 June 20
Is there anything like a unique key for lucene indexes? For example,
say I want to have unique ItemID's in my index...do I need to check for
that before insert or can I lock it down with Lucene's API?
-
To unsubscribe, e-mail:
So I have emails with multiple recipients (of course, this is very
common). I currently put them all on the same string separated by spaces
and then tokenize them with Standard Analyzer. I was looking into
SynonymAnalyzers and see that you can drop multiple tokens with the same
position. Woul
From: karl wettin [mailto:[EMAIL PROTECTED]
Sent: 16 June 2006 21:13
To: java-user@lucene.apache.org
Subject: Re: indexing emails
On Fri, 2006-06-16 at 15:20 -0400, Michael J. Prichard wrote:
I am working on indexing emails and want to have a "to" field. I am
currently putting all the
I am working on indexing emails and want to have a "to" field. I am
currently putting all the emails on one line separated w/ spaces...example:
[EMAIL PROTECTED] [EMAIL PROTECTED] [EMAIL PROTECTED]
Then I index that with a StandardAnalyzer as follows:
doc.add(new Field("to", (String) itemCont
Hey Chris,
Thanks for the response.
Chris Hostetter wrote:
: Question is two fold. One, here is the layout I was thinking:
my rule of thumb: if a field is going to contain less than a few dozen
bytes (ie: a date, an email address, etc) you might as well store it ...
it will make your life ea
Hello,
I will try this again
I am working on a system that will index emails and their attachments.
I have all the pieces working that parse the documents and I am now
working on the actual indexing part. I would like to have synonym
searching as well.
Question is two fold. One, here
Hello,
I am working on a system that will index emails and their attachments.
I have all the pieces working that parse the documents and I am now
working on the actual indexing part. I would like to have synonym
searching as well.
Question is two fold. One, here is the layout I was thinki
44 matches