Indexing HTML document

György Frivolt Tue, 02 Mar 2010 08:07:30 -0800

Hi, how do I properly index HTML documents? All the documents are HTML, and
some contain characters encoded as numeric character references such as
&#x17E;&#xED; ... Is there a character filter for handling these codes? Is
there a way to strip the HTML tags out? Does Solr weight the terms in a
document based on where they appear? Words in headers (H1, H2, ...) should
describe the document better than words in paragraphs.
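
To make the question concrete, here is a minimal sketch of the kind of
preprocessing I mean, done outside Solr with Python's standard library only.
It is not Solr's own mechanism; the names HeadingAwareTextExtractor and
extract_fields are made up for illustration, and the "headings"/"body" split
just assumes the two parts could later be indexed as separate fields.

```python
from html.parser import HTMLParser


class HeadingAwareTextExtractor(HTMLParser):
    """Hypothetical helper: strips tags and keeps heading text apart from body text."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        # convert_charrefs=True decodes references such as &#x17E; before
        # the text reaches handle_data.
        super().__init__(convert_charrefs=True)
        self.heading_parts = []
        self.body_parts = []
        self._heading_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._heading_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._heading_depth > 0:
            self._heading_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Text seen while inside an H1..H6 element goes to the heading bucket.
        if self._heading_depth:
            self.heading_parts.append(text)
        else:
            self.body_parts.append(text)


def extract_fields(html_text):
    """Return tag-free text split into a 'headings' part and a 'body' part."""
    parser = HeadingAwareTextExtractor()
    parser.feed(html_text)
    parser.close()
    return {
        "headings": " ".join(parser.heading_parts),
        "body": " ".join(parser.body_parts),
    }


if __name__ == "__main__":
    sample = "<h1>&#x17E;&#xED;t</h1><p>Some <b>bold</b> text</p>"
    print(extract_fields(sample))
```

If the heading text ended up in its own field, it could presumably be given
more weight than the body text at query time, which is the behaviour I am
asking about.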


Thanks for the help,

   Georg
