Re: complex XML structure problem

Saša Mutić Fri, 03 Oct 2008 11:47:02 -0700

Hi Otis,

Your assumption is correct if I only wanted to display highlighted results from
Lucene/Solr. However, if I want to display words highlighted on the image (hence
the coordinate attributes that I want to index), I need to know which instance
of a word should be highlighted, as there might be several instances of the same
word in the document (of which only the one matched by proximity would be correct).
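
To make the problem concrete, here is a minimal Python sketch of picking the right instance. It assumes each word is stored with its position in the article and its t, l, b, r coordinates. The field names `pos`, `text`, `t`, `l`, `b`, `r`, the helper `find_phrase_boxes` and the sample values are all hypothetical, not anything Solr or Lucene provides. It also assumes positions are contiguous, so neighbors in the sorted list are neighbors in the text.

```python
from typing import Dict, List

Word = Dict[str, object]  # hypothetical per-word record: pos, text, t, l, b, r


def find_phrase_boxes(words: List[Word], phrase: str) -> List[List[Word]]:
    """Return every run of consecutive words whose text matches the phrase."""
    terms = phrase.lower().split()
    if not terms:
        return []
    # Order by position so that list neighbors are text neighbors.
    ordered = sorted(words, key=lambda w: w["pos"])
    hits = []
    for i in range(len(ordered) - len(terms) + 1):
        window = ordered[i:i + len(terms)]
        if [str(w["text"]).lower() for w in window] == terms:
            hits.append(window)
    return hits


# Made-up sample data: "Une" occurs twice, but only once followed by "date".
words = [
    {"pos": 0, "text": "Une", "t": 10, "l": 10, "b": 30, "r": 60},
    {"pos": 1, "text": "date", "t": 10, "l": 70, "b": 30, "r": 120},
    {"pos": 2, "text": "importante", "t": 10, "l": 130, "b": 30, "r": 240},
    {"pos": 3, "text": "Une", "t": 40, "l": 10, "b": 60, "r": 60},
    {"pos": 4, "text": "autre", "t": 40, "l": 70, "b": 60, "r": 130},
]

hits = find_phrase_boxes(words, "Une date")
for hit in hits:
    for w in hit:
        print(w["text"], (w["t"], w["l"], w["b"], w["r"]))
```

Only the first "Une" is returned here, because the second one is not followed by "date". This is exactly the proximity logic I would prefer not to reimplement outside the search engine.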



On Fri, Oct 3, 2008 at 6:30 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:

> Hi Saša,
>
>
> You don't have to recreate logic for proximity (I assume that by that you
> mean proximity of words/terms for phrase queries), if you have a text field
> with all your content.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> ----- Original Message ----
> > From: Saša Mutić <[EMAIL PROTECTED]>
> > To: solr-user@lucene.apache.org
> > Sent: Thursday, October 2, 2008 3:43:33 PM
> > Subject: Re: complex XML structure problem
> >
> > Hi Otis,
> >
> > I was thinking about this approach, but was wondering if there is a more
> > elegant approach where I wouldn't have to recreate the logic for proximity
> > and quoted complex queries (identification of neighboring hits and quoted
> > queries for highlighting and positioning on the image).
> >
> > If nobody comes up with a better approach, I will use something similar to
> > what you described.
> >
> > Thanks for fast response :)
> >
> > Kind Regards,
> > Saša
> >
> >
> > On Thu, Oct 2, 2008 at 5:51 PM, Otis Gospodnetic wrote:
> >
> > > Hi Saša,
> > >
> > > It sounds like you need to keep per-word metadata, plus the raw content
> > > so you can full-text search it.
> > > If so, consider keeping the metadata elsewhere, e.g. a different index,
> > > an external DB, etc.
> > > For full-text search you probably want to index the full content,
> > > something like:
> > >
> > > article
> > > Une date..........
> > > 123
> > >
> > >
> > > You could create another index with words, where each word Document has
> > > the ID of its "parent" (e.g. the article's ID), so you run a query against
> > > the above index, get the IDs of the matches, and then get the words for
> > > those matches. Of course, you can also use an RDBMS or some other storage
> > > for the second part.
> > >
> > > Otis
> > > --
> > > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > >
> > >
> > >
> > > ----- Original Message ----
> > > > From: Saša Mutić
> > > > To: solr-user@lucene.apache.org
> > > > Sent: Thursday, October 2, 2008 6:14:14 AM
> > > > Subject: complex XML structure problem
> > > >
> > > > Hello,
> > > >
> > > > I would appreciate any suggestions on solving the following problem:
> > > >
> > > > I'm trying to index a newspaper. After processing the logical structure
> > > > and articles, I have a structure similar to this...
> > > >
> > > >
> > > > date="18560301">
> > > >
> > > > type="TEXT" cont="0"/>
> > > >
> > > > type="TEXT" cont="0"/>
> > > >
> > > > type="TEXT" cont="0"/>
> > > > ...
> > > >
> > > > date="18560301">
> > > >
> > > > type="ADVERTISEMENT" cont="0"/>
> > > > ...
> > > >
> > > > Obviously, I would like to have all the benefits of full-text search
> > > > with proximity and other advanced options.
> > > > After going through SCHEMA.XML and the docs, I can see that I should
> > > > split each "word" into something like this...
> > > >
> > > >         ARTICLE
> > > >         201
> > > >         5
> > > >         6
> > > >         18560301
> > > >         Une
> > > >         1137
> > > >         147
> > > >         1665
> > > >         951
> > > >         1
> > > >         TEXT
> > > >         0
> > > >
> > > >
> > > > However, if I use this approach, it seems like I lose some core
> > > > search functionality...
> > > >
> > > > - Multiword searching? For example, searching for "Une date", since
> > > > each word is treated as a standalone document?
> > > >
> > > > - Proximity search?
> > > >
> > > > ... and so on.
> > > >
> > > > So I guess this approach isn't a solution for my goal. Does anyone have
> > > > any recommendations on how to solve this?
> > > >
> > > > The goal would be to receive results that include the mentioned
> > > > "attributes" for each hit... so for the previous example "Une date", I
> > > > would receive hits with all the attributes that would allow me to
> > > > correctly position them on the image (t,l,b,r as coordinates, for
> > > > example).
> > > >
> > > > Kind Regards,
> > > >
> > > > Sasha
> > >
> > >
>
>
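
For reference, the two-store layout Otis suggests further down the thread could look roughly like the Python sketch below. It is only an illustration under stated assumptions: the dictionaries `ARTICLES` and `WORDS`, the functions `search_articles` and `words_for`, and all sample values are hypothetical, and the naive substring match merely stands in for a real Solr phrase query.

```python
from typing import Dict, List

# Store 1: full article text keyed by article ID (stand-in for the Solr index).
ARTICLES: Dict[str, str] = {
    "123": "Une date importante",
    "124": "Une autre page",
}

# Store 2: per-word metadata keyed by the parent article ID
# (stand-in for a second index or an RDBMS table). Coordinates are made up.
WORDS: Dict[str, List[dict]] = {
    "123": [
        {"text": "Une", "t": 10, "l": 10, "b": 30, "r": 60},
        {"text": "date", "t": 10, "l": 70, "b": 30, "r": 120},
    ],
    "124": [
        {"text": "Une", "t": 10, "l": 10, "b": 30, "r": 60},
    ],
}


def search_articles(phrase: str) -> List[str]:
    """Step 1: full-text search returns the IDs of matching articles."""
    needle = phrase.lower()
    return [aid for aid, text in ARTICLES.items() if needle in text.lower()]


def words_for(article_ids: List[str]) -> Dict[str, List[dict]]:
    """Step 2: fetch the per-word records for the matching articles."""
    return {aid: WORDS.get(aid, []) for aid in article_ids}


matches = search_articles("Une date")
print(matches)
print(words_for(matches))
```

Step 2 returns every word of a matching article, so deciding which instances to highlight still has to happen afterwards. That is the part the top message is asking about.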
