Your Persian PDF problem is a different one, and it has already been taken care of in PDFBox trunk:
https://issues.apache.org/jira/browse/PDFBOX-1127

On Tue, Oct 4, 2011 at 2:04 PM, ahmad ajiloo <ahmad.aji...@gmail.com> wrote:

> I have this problem too, when indexing some Persian PDF files.
>
> 2011/10/4 Héctor Trujillo <hecto...@gmail.com>
>
>> Hi all, I'm indexing PDF files with SolrJ, and most of them work. But
>> with some files I have problems because strange characters get stored.
>> This is the content that was stored:
>> +++++++
>>
>> Starting a Search Application
>>
>>
>> Abstract
>>
>> Starting
>> a Search Application A Lucid Imagination White Paper ¥ April 2009 Page i
>>
>>
>> Starting a Search Application A Lucid Imagination White Paper ¥ April 2009
>> Page ii Do You Need Full-text Search?
>>
>> ∞
>>
>> ∞
>> ∞
>>
>> Starting
>> a Search Application A Lucid Imagination White Paper ¥ April 2009 Page 1
>>
>> Identifying
>> Ideal Results
>>
>> Starting
>> a Search Application A Lucid Imagination White Paper ¥ April 2009 Page 2
>>
>> Starting
>> a Search Application A Lucid Imagination White Paper
>>
>>
>> +++++++
>>
>> But if I open the PDF file, I can see the content correctly.
>>
>> I think this is a charset encoding issue, but I don't know whether I
>> can avoid this behaviour by applying a different analyzer or tokenizer
>> at indexing time.
>>
>> I've got this problem with some documents downloaded from Lucid's
>> website.
>>
>> I don't know if anyone has had the same problem and knows how to solve
>> it.
>>
>> Thanks
>>
>> Best regards
>>
>

--
lucidimagination.com