I'm replying to myself because I found the mistake. The Italian stopwords
file that I found on the Apache site has a shell-style comment on the same
line as each stopword. The stopwords tokenizer is probably basic and doesn't
accept comments on the same line as a stopword. I dropped them and
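For anyone hitting the same problem, here is a minimal sketch of how the inline comments could be stripped before the file is copied into the conf directory. It assumes the comments start with `#` (my reading of "shell style"; pass a different `marker` if your file uses something else). The function names and the file paths you pass in are hypothetical, not part of Solr.

```python
from pathlib import Path


def strip_inline_comments(lines, marker="#"):
    """Return stopwords with trailing comments and blank lines removed."""
    cleaned = []
    for line in lines:
        # Keep only the text before the first comment marker, if any.
        word = line.split(marker, 1)[0].strip()
        if word:
            cleaned.append(word)
    return cleaned


def clean_stopwords_file(src, dst, marker="#", encoding="utf-8"):
    """Read src, drop comments, and write one stopword per line to dst."""
    text = Path(src).read_text(encoding=encoding)
    words = strip_inline_comments(text.splitlines(), marker)
    Path(dst).write_text("\n".join(words) + "\n", encoding=encoding)
    return len(words)
```

Calling `clean_stopwords_file("stopwords_it_raw.txt", "stopwords.txt")` (both names are placeholders) would write a plain one-word-per-line file, which is the form that worked in my case.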
I'm using the Lucid Imagination installation kit for Solr (the latest one,
with Solr 1.4).
I would like to use stopwords, and I installed the Italian version of the
file in LucidWorks/lucidworks/solr/conf/stopwords.txt.
Moreover, the field from which I want to remove stopwords is declared in
schema.xml as
remember if it's in the Solr 1.4 release.) With
this you can save the PDF binary in one field and save the extracted
text in another field. I'm doing this now with HTML.
On Tue, Feb 9, 2010 at 2:08 AM, alendo alessandra.donn...@uniroma2.it
wrote:
I understand that Tika is able to index PDF content: is that true? I tried to
post a PDF from my local machine and I saw another document in the solr/admin
schema browser, but when I search only the document id is available; the
document doesn't seem to be indexed. Do I need other products to index PDF
content?
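As a rough illustration of what a request to the extract handler looks like, here is a small sketch that only assembles the URL. The parameter names (`literal.id`, `stream.file`, `stream.contentType`) are the ones used in the curl URL elsewhere in this thread; the helper name `build_extract_url` and the host and port in the usage line are assumptions made for the example.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, doc_id, stream_file, content_type):
    """Build an /update/extract URL with properly separated query parameters."""
    params = [
        ("literal.id", doc_id),
        ("stream.file", stream_file),
        ("stream.contentType", content_type),
    ]
    # urlencode joins the pairs with "&" and percent-encodes reserved characters.
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)
```

For example, `build_extract_url("http://localhost:8983/solr", "8514", "files/attach-8514.pdf", "application/pdf")` gives a URL in which each parameter is separated by `&` and the slashes in the values are percent-encoded. When the URL is passed to curl on the command line it should be quoted, otherwise the shell treats `&` as a command separator.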
OK, I'm going ahead (maybe :).
I tried another curl command to send the file from remote:
http://mysolr:/solr/update/extract?literal.id=8514&stream.file=files/attach-8514.pdf&stream.contentType=application/pdf
and the behaviour has changed: now I get an error in the Solr log file:
HTTP