Hi, how do I properly index HTML documents in Solr? All of the documents are HTML, and some contain characters encoded as entity codes (for example, the codes for ží ...). Is there a character filter that decodes these codes? Is there a way to strip out the HTML tags? Also, does Solr weight the terms in a document based on where they appear? Words in headings (H1, H2, ...) should describe the document better than words in paragraphs.
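To make the question concrete, here is a rough sketch of the preprocessing I have in mind, written in plain Python rather than against any Solr API. The names `TextExtractor`, `extract_fields`, and the `headings`/`body` field names are made up for this illustration. It decodes character references, drops the tags, and keeps heading text separate from the rest so the two could be weighted differently.

```python
from html.parser import HTMLParser

# Tags whose text is treated as "heading" text (an assumption for this sketch).
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects heading text and other text separately."""

    def __init__(self):
        # convert_charrefs=True decodes references such as &#382; or &iacute;
        # before the text reaches handle_data().
        super().__init__(convert_charrefs=True)
        self.heading_parts = []
        self.body_parts = []
        self._heading_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._heading_depth += 1

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._heading_depth > 0:
            self._heading_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Note: text inside <script> or <style> is not filtered out here.
        if self._heading_depth:
            self.heading_parts.append(text)
        else:
            self.body_parts.append(text)


def extract_fields(html_text):
    """Return tag-free, entity-decoded text split into two illustrative fields."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    return {
        "headings": " ".join(parser.heading_parts),
        "body": " ".join(parser.body_parts),
    }


if __name__ == "__main__":
    sample = "<h1>Title</h1><p>&#382;&#237;t <b>bold</b></p>"
    print(extract_fields(sample))
```

If Solr can do the equivalent at index time, I would rather use that than preprocess every document myself.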
Thanks for your help, Georg