I think you essentially have to do much of the same work either
way, so take whatever comes easiest. Personally, I think
that pre-processing the data (and using two fields) would be
easiest, but it's up to you.

Using a custom analyzer would involve collecting all the contents,
deciding what is "relevant" and emitting those tokens one by one.
The advantage here (and it's not very important) is that you'd only
need one field as Grant said.
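
As a rough illustration of that "collect everything, then emit tokens
one by one" flow, here is a sketch in Python. A real Solr analyzer
would be written in Java against Lucene's analysis API, so treat this
as pseudocode that happens to run. The function name, the trigger
words, the window size and the hit threshold are all assumptions made
up for the example.

```python
import re

# Hypothetical trigger words and thresholds; tune these against real data.
BANNER_WORDS = {"privileged", "confidential", "intended", "recipient", "corporate"}
WINDOW = 100   # size of the word window, roughly as described in the question
MIN_HITS = 3   # assumed number of trigger words that marks a window as a banner


def relevant_tokens(text, window=WINDOW, min_hits=MIN_HITS):
    """Buffer all tokens, mark banner-like windows, then yield the rest one by one."""
    tokens = re.findall(r"\w+", text.lower())
    drop = [False] * len(tokens)
    # Examine consecutive, non-overlapping windows of `window` tokens.
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        hits = sum(1 for t in chunk if t in BANNER_WORDS)
        if hits >= min_hits:
            for i in range(start, start + len(chunk)):
                drop[i] = True
    # Emit only the tokens that were not inside a banner-like window.
    for tok, dropped in zip(tokens, drop):
        if not dropped:
            yield tok
```

Note that the whole text has to be buffered before the first token
can be emitted, which is what makes this an awkward fit for a
streaming analyzer.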

The other approach would be to read the contents into a buffer,
apply whatever business logic you determine to remove the
irrelevant text, and then submit the result to the normal analyzers.
The advantage here is that it's a simpler flow. Analyzers are
usually just used for breaking up an incoming stream and doing
specific transformations (stop words, stemming, etc). These
transformations are pretty context-less. Extending that process
to handle complex rules about what's "relevant" is a bit of a stretch.
But if you do pre-process the data, the stored value of that field
will be the stripped text rather than the original, so you'll need to
store the original text in a separate field.
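
A minimal sketch of the pre-processing route, in Python and with
made-up names: "content" is assumed to be the indexed field holding
the cleaned text, and "content_original" is a hypothetical
stored-only field that keeps the full e-mail. The banner test is a
simple per-paragraph keyword count standing in for whatever business
logic you settle on.

```python
# Hypothetical trigger words; a stand-in for the real business logic.
BANNER_WORDS = {"privileged", "confidential", "intended", "recipient"}


def is_banner(paragraph, min_hits=2):
    """Treat a paragraph as a banner if it has enough distinct trigger words."""
    words = {w.strip(".,;:!?()\"'").lower() for w in paragraph.split()}
    return len(words & BANNER_WORDS) >= min_hits


def build_document(doc_id, raw_content):
    """Return a dict with the cleaned text to index and the original to store."""
    paragraphs = [p for p in raw_content.split("\n\n") if p.strip()]
    kept = [p for p in paragraphs if not is_banner(p)]
    return {
        "id": doc_id,
        # Cleaned text: this is what goes through the normal analyzers.
        "content": "\n\n".join(kept),
        # Assumed stored-only field: the full e-mail, returned to the user.
        "content_original": raw_content,
    }
```

The resulting dict would then be converted to whatever update format
you post to Solr; that step is left out here.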

Best
Erick



On Mon, Feb 16, 2009 at 10:05 AM, Johnny X <jonathanwel...@gmail.com> wrote:

>
> Basically I'm working on the Enron dataset, and I've already de-duplicated
> the collection and applied a spam filter. All the e-mails after this have
> been parsed to XML and each field (so To, From, Date etc) has been
> separated, along with one large field for the remaining e-mail content
> (called Content).
>
> So yes, to answer your question. Bearing in mind though this still
> represents around 240,000ish files to compute.
>
> I have no idea about Solr analyzers/search components, but my theory was
> that I'd need an analyzer to remove 'banner-like' content from being
> indexed and a search component to identify 'corporate-like' information
> in the content of the e-mails.
>
> What is a business logic solution and how would that work?
>
>
> Thanks.
>
>
>
> zayhen wrote:
> >
> > I would go for a business logic solution and not a Solr customization in
> > this case, as you need to filter information that you actually would like
> > to see in different fields on your index.
> >
> > Have you already tried to split the email into several fields like subject,
> > from, to, content, signature, etc.?
> >
> >
> > 2009/2/16 Johnny X <jonathanwel...@gmail.com>
> >
> >>
> >> Hi there,
> >>
> >>
> >> I was told before that I'd need to create a custom search component to do
> >> what I want to do, but I'm thinking it might actually be a custom
> >> analyzer.
> >>
> >> Basically, I'm indexing e-mail in XML in Solr and searching the 'content'
> >> field which is parsed as 'text'.
> >>
> >> I want to ignore certain elements of the e-mail (i.e. corporate banners),
> >> but also identify the actual content of those e-mails including corporate
> >> information.
> >>
> >> To identify the banners I need something a little more developed than a
> >> stop word list. I need to evaluate the frequency of certain words around
> >> words like 'privileged' and 'corporate' within a word window of about
> >> 100ish words to determine whether they're banners and then remove them
> >> from being indexed.
> >>
> >> I need to do the opposite during the same time to identify, in a similar
> >> manner, which e-mails include corporate information in their actual
> >> content.
> >>
> >> I suppose if I'm doing this I don't want what's processed to be indexed as
> >> what's returned in a search, because then presumably it won't be the full
> >> e-mail, so do I need to store some kind of copy field that keeps the full
> >> e-mail and is fully indexed to be returned instead?
> >>
> >> Can what I'm suggesting be done and can anyone direct me to a guide?
> >>
> >>
> >> On another note, is there an easy way to destroy an index...any custom
> >> code?
> >>
> >>
> >> Thanks for any help!
> >>
> >>
> >>
> >> --
> >> View this message in context:
> >> http://www.nabble.com/Word-Locations---Search-Components-tp22031139p22031139.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >>
> >>
> >
> >
> > --
> > Alexander Ramos Jardim
> >
> >
> > -----
> > RPG da Ilha
> >
>
> --
> View this message in context:
> http://www.nabble.com/Word-Locations---Search-Components-tp22031139p22038912.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>
