Re: Non-linear structure for search and index documents

Chris Hostetter Thu, 16 Apr 2009 15:02:59 -0700

: I need index/search words extracted from pdf files with coordinates and page
: number, so I have this  structure:
: 
:    - index the document id
:    - a document has many pages
:    - a page has many words
:    - a word has geometry[w,h,x,y](inside of page)
: 
: Is this possible with solr?   
: If yes, how the best way to do that? Is using field collapsing?


it's possible, but Solr doesn't currently have any features that make it 
*easy*.  

the main thing you have to ask yourself, before deciding on the best 
way to approach this problem, is: what do i want to be able to do with 
this data?


if you need to search for docs where "dog" appears inside a certain 
x1,y1,x2,y2 box then you have to structure your index much differently than 
if you just need to find all docs containing "dog" and then, as part of 
your result, get the w,h,x,y coordinates for each instance of the word.
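
To make the difference between those two cases concrete, here is a 
rough sketch in plain Python. It is not Solr or Lucene code: the Hit 
record, the sample data and both function names are made up for 
illustration, and the box test ignores the page number for brevity.

```python
from collections import namedtuple

# Hypothetical record: one entry per word occurrence on a page.
Hit = namedtuple("Hit", "doc_id page word x y w h")

hits = [
    Hit("doc1", 1, "dog", 10, 20, 30, 12),
    Hit("doc1", 2, "dog", 200, 300, 30, 12),
    Hit("doc2", 1, "cat", 15, 25, 28, 12),
    Hit("doc2", 3, "dog", 50, 60, 30, 12),
]

def find_with_coords(word):
    """Find docs containing the word; return the geometry of each instance."""
    result = {}
    for hit in hits:
        if hit.word == word:
            result.setdefault(hit.doc_id, []).append(
                (hit.page, hit.x, hit.y, hit.w, hit.h))
    return result

def find_in_box(word, x1, y1, x2, y2):
    """Find docs where the word lies fully inside the x1,y1,x2,y2 box."""
    return sorted({
        hit.doc_id for hit in hits
        if hit.word == word
        and hit.x >= x1 and hit.y >= y1
        and hit.x + hit.w <= x2 and hit.y + hit.h <= y2
    })
```

In the first function the geometry is only carried along with each 
match; in the second it takes part in the matching itself, which is 
why the two cases call for different index structures.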

The main Lucene feature that's probably going to be at the core of any 
work like this is "Payloads" ... but there's going to be a significant 
amount of Java coding needed to take advantage of it in any of the ways i 
can think of that you might be wanting.
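
A payload is just a small byte array stored alongside a term position, 
so one piece of that work is deciding how to pack the geometry into 
bytes. A minimal sketch of one possible layout, in Python for brevity 
(the real code would be Java inside a custom analysis chain); the 
layout and the function names are assumptions, not anything Lucene 
defines:

```python
import struct

# Assumed layout: page, x, y, w, h as big-endian unsigned 16-bit ints.
_FMT = ">5H"

def encode_payload(page, x, y, w, h):
    """Pack one word's page number and geometry into a 10-byte payload."""
    return struct.pack(_FMT, page, x, y, w, h)

def decode_payload(data):
    """Unpack a payload back into (page, x, y, w, h)."""
    return struct.unpack(_FMT, data)
```

This layout only holds whole numbers from 0 to 65535, so fractional 
PDF coordinates would need scaling or a float format instead.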


-Hoss
