Kumar,

You'll have to build your own documents after parsing the HTML yourself (e.g. with NekoHTML to a DOM). As for the weights of tokens, in addition to IDF, you can set them per field, i.e. when you add a field to the document.
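
To illustrate the parsing half, here is a minimal sketch that groups text by its innermost enclosing tag, so that each group could then be added to a Lucene document as its own field. It is only an assumption of how one might do it: it uses the HTML parser bundled with the JDK instead of NekoHTML so that it runs without extra jars, and the names TagTextCollector, collect and boostFor are made up for the example. The weights are placeholders. In the Lucene releases of that period a weight could be applied with Field.setBoost before adding the field; check whether your version still supports index-time boosts.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects text per HTML tag so each entry can become one index field.
final class TagTextCollector {

    /** Returns text grouped by the innermost enclosing tag name, e.g. "p" or "a". */
    static Map<String, String> collect(String html) {
        Map<String, StringBuilder> buffers = new LinkedHashMap<>();
        Deque<String> open = new ArrayDeque<>(); // currently open tags, innermost first

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                open.push(tag.toString());
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                String name = tag.toString();
                // Pop the closed tag and anything left open inside it, if it is on the stack.
                if (open.contains(name)) {
                    while (!open.pop().equals(name)) {
                        // keep unwinding
                    }
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Text outside any open tag is filed under "body" (an arbitrary choice).
                String tag = open.isEmpty() ? "body" : open.peek();
                buffers.computeIfAbsent(tag, k -> new StringBuilder())
                       .append(data).append(' ');
            }
        };

        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        Map<String, String> fields = new LinkedHashMap<>();
        buffers.forEach((tag, text) -> fields.put(tag, text.toString().trim()));
        return fields;
    }

    /** Placeholder per-field weights; the values are illustrative, not recommendations. */
    static float boostFor(String tag) {
        if (tag.equals("title")) {
            return 3.0f;
        }
        if (tag.equals("a")) {
            return 1.5f;
        }
        return 1.0f;
    }
}
```

Each entry of the returned map would then be turned into one field named after the tag, with boostFor(tag) as its weight.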

paul


On 28 May 2009, at 12:22, Gaurav Kumar wrote:

Hi everyone,

I am doing a project using Lucene where I need to index HTML files. I am using Tika to parse the HTML files, but I need to index them according to their tags, meaning that the text inside each HTML tag (like <p> or <a>) should be stored in a different field. Can I do that? If yes, how? Also, can I assign different weights to the tokens in different fields? If yes, how?
