: I need to index/search words extracted from pdf files with coordinates and page
: number, so I have this structure:
:
:- index the document id
:- a document has many pages
:- a page has many words
:- a word has geometry [w,h,x,y] (inside of the page)
:
: Is this possible with solr?
: If yes, what is the best way to do that? By using field collapsing?
it's possible, but Solr doesn't currently have any features that make it
*easy*.
the main thing you have to ask yourself, before deciding what the best
way to approach this problem is, is: what do i want to be able to do with
this data?
if you need to search for docs where "dog" appears inside a certain
x1,y1,x2,y2 box then you have to structure your index much differently than
if you just need to find all docs containing "dog" and then as part of
your result get the w,h,x,y coordinates for each instance of the word.
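
To make the difference between those two needs concrete, here is a rough
in-memory sketch in plain Java. It does not use Solr or Lucene at all; the
class, field, and method names (WordBoxSketch, WordHit, hitsFor, hitsInBox)
are made up for illustration, and it assumes x,y is the top-left corner of
the word's rectangle.

```java
import java.util.ArrayList;
import java.util.List;

public class WordBoxSketch {
    /** One occurrence of a word on a page (hypothetical model, not a Solr type). */
    static final class WordHit {
        final String docId;
        final int page;
        final String word;
        final int x, y, w, h;

        WordHit(String docId, int page, String word, int x, int y, int w, int h) {
            this.docId = docId; this.page = page; this.word = word;
            this.x = x; this.y = y; this.w = w; this.h = h;
        }
    }

    /** Simpler case: find every occurrence of a word and hand back its geometry. */
    static List<WordHit> hitsFor(List<WordHit> index, String word) {
        List<WordHit> out = new ArrayList<>();
        for (WordHit hit : index) {
            if (hit.word.equals(word)) out.add(hit);
        }
        return out;
    }

    /** Harder case: the geometry is part of the query, not just of the result. */
    static List<WordHit> hitsInBox(List<WordHit> index, String word,
                                   int x1, int y1, int x2, int y2) {
        List<WordHit> out = new ArrayList<>();
        for (WordHit hit : hitsFor(index, word)) {
            // Keep the hit only if its whole rectangle lies inside the query box.
            boolean inside = hit.x >= x1 && hit.y >= y1
                    && hit.x + hit.w <= x2 && hit.y + hit.h <= y2;
            if (inside) out.add(hit);
        }
        return out;
    }

    /** Tiny made-up data set used only to exercise the two methods. */
    static List<WordHit> sampleIndex() {
        List<WordHit> index = new ArrayList<>();
        index.add(new WordHit("doc1", 1, "dog", 10, 10, 30, 12));
        index.add(new WordHit("doc1", 2, "dog", 200, 300, 30, 12));
        index.add(new WordHit("doc2", 1, "cat", 15, 15, 28, 12));
        return index;
    }

    public static void main(String[] args) {
        List<WordHit> index = sampleIndex();
        System.out.println("dog anywhere: " + hitsFor(index, "dog").size());
        System.out.println("dog in box:   " + hitsInBox(index, "dog", 0, 0, 100, 100).size());
    }
}
```

In the first method the coordinates only ride along with the result; in the
second they decide what matches, which is why the index has to be laid out
differently for that case.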
The main Lucene feature that's probably going to be at the core of any
work like this is "Payloads" ... but there's going to be a significant
amount of Java coding needed to take advantage of it in any of the ways i
can think of that you might be wanting.
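
A payload is just a small byte array stored with each indexed position of a
term, so one piece of that coding is packing w,h,x,y into bytes and
unpacking them again. A minimal sketch of only that step, using nothing but
the JDK (GeometryPayload, encode and decode are hypothetical names, and the
four-ints layout is an assumption, not something Lucene prescribes):

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class GeometryPayload {
    /** Four 32-bit ints: w, h, x, y (an assumed layout). */
    static final int BYTES = 4 * Integer.BYTES;

    /** Pack the geometry into the byte array a payload would carry. */
    static byte[] encode(int w, int h, int x, int y) {
        return ByteBuffer.allocate(BYTES)
                .putInt(w).putInt(h).putInt(x).putInt(y)
                .array();
    }

    /** Unpack a payload back into {w, h, x, y}. */
    static int[] decode(byte[] payload) {
        if (payload.length != BYTES) {
            throw new IllegalArgumentException("expected " + BYTES + " bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(payload);
        return new int[] { buf.getInt(), buf.getInt(), buf.getInt(), buf.getInt() };
    }

    public static void main(String[] args) {
        byte[] payload = encode(30, 12, 10, 10);
        System.out.println(Arrays.toString(decode(payload)));
    }
}
```

Attaching those bytes to each token at index time and reading them back at
search time is the part that needs the custom code, and this sketch does not
cover it.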
-Hoss