Markus Jelsma-2 wrote:
> 
> ... i'd strongly suggest not to index 
> multiple entities into a single document.

Unfortunately that is not possible: other parties are involved, and I cannot
force them to put one entity per page. All I can do for now is use the
knowledge I have about the structure.

I thought it was fairly common, even when building a fulltext index with
Nutch/Solr, to want to preserve some information about the original
structure.

Looking at the available plugins, I found that the feed plugin should do
what I need, since its parser returns more than one document. This is what
I plan to implement: split the document being parsed into several
documents, one per entity. From each of them I can then read out the
desired values and fill the index fields. The search queries should then
also become simple, since each index entry contains information about
exactly one entity.
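To make the idea concrete, here is a minimal, self-contained sketch of the
splitting step (plain Java, no Nutch dependencies). The delimiter
"<!-- entity -->" and the fragment-URL scheme are assumptions standing in
for whatever structural markers your real pages use; in an actual Parser
implementation each entry would become one Parse in the ParseResult, keyed
by such a synthetic URL, the same way the feed plugin keys each feed item.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EntitySplitter {

    // Split one fetched page into per-entity "documents", keyed by a
    // synthetic fragment URL so each sub-document stays unique.
    // The "<!-- entity -->" marker is a placeholder assumption.
    public static Map<String, String> split(String pageUrl, String content) {
        Map<String, String> docs = new LinkedHashMap<>();
        int i = 0;
        for (String entity : content.split("<!-- entity -->")) {
            String trimmed = entity.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            docs.put(pageUrl + "#entity-" + i++, trimmed);
        }
        return docs;
    }

    public static void main(String[] args) {
        String page = "first entity<!-- entity -->second entity";
        Map<String, String> docs = split("http://example.com/page", page);
        for (Map.Entry<String, String> e : docs.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```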

Before I start changing my current plugin (it currently implements
HtmlParseFilter, but it seems I need to implement the Parser extension
point instead), I would like to ask whether this sounds like a workable
solution. Are there any pitfalls or tricks I should be aware of?

And another question:
FeedParser.java in the feed plugin contains a main() method, but how can I
execute it? During development it seems simpler to test via this method
than to build the plugin and crawl/index everything again. After building
the plugin with ant I cannot execute it, even if I manually change the
manifest to contain a Main-Class attribute. How can I run it with all
libraries and dependencies on the classpath?
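(For reference, a sketch of one way to do this, assuming a Nutch 1.x
checkout; the exact usage may differ in your version. The bin/nutch script
builds the full classpath, including a plugin's own libraries, and its
`plugin` subcommand runs the main() of a class inside a plugin. The feed
URL below is a made-up example argument.)

```shell
# Build first so build/plugins is populated
ant

# Run a plugin class's main() with Nutch's classpath set up for you:
#   bin/nutch plugin <pluginId> <className> [args ...]
bin/nutch plugin feed org.apache.nutch.parse.feed.FeedParser \
    http://example.com/some-feed.xml
```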




--
View this message in context: 
http://lucene.472066.n3.nabble.com/indexing-hierarchical-data-schema-design-tp3052894p3062775.html
Sent from the Nutch - User mailing list archive at Nabble.com.
