RE: When parsing with Tika, some characters in the body are repeated. Re: With Apache Tika, characters in the PDF body text are repeated

Allison, Timothy B. Thu, 15 Sep 2016 05:17:33 -0700

I just tested this with PDFBox 2.0.3-rc1 (which should be released soon), and I 
got this:


物性目录的用法(6)  关于耐药品性, 耐热水性, 耐湿热性 DB


So, I think this problem will be fixed in the next version of Tika.  After we 
upgrade to 2.0.3 you can also get a nightly build.
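
For anyone who wants to repeat this kind of check, here is a minimal sketch of extracting text with PDFBox alone, without Tika. It assumes PDFBox 2.0.x is on the classpath; the class name `PdfBoxOnlyExtract` and the command-line argument are made up for illustration. It prints the text once with duplicate suppression off and once with it on, so the two outputs can be compared.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical helper class, not part of PDFBox or Tika.
public class PdfBoxOnlyExtract {

    /** Extracts all text from the PDF, optionally suppressing duplicate overlapping text. */
    static String extract(File pdf, boolean suppressDuplicates) throws IOException {
        // PDDocument.load(File) is the PDFBox 2.0.x way to open a document.
        try (PDDocument document = PDDocument.load(pdf)) {
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setSuppressDuplicateOverlappingText(suppressDuplicates);
            return stripper.getText(document);
        }
    }

    public static void main(String[] args) throws IOException {
        File pdf = new File(args[0]); // path to the PDF to inspect
        System.out.println("--- suppression off ---");
        System.out.println(extract(pdf, false));
        System.out.println("--- suppression on ---");
        System.out.println(extract(pdf, true));
    }
}
```

If the repeated characters show up here as well, the behaviour comes from PDFBox rather than from Tika, and the PDFBox version in use is the first thing to check.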


-----Original Message-----
From: question.answer...@gmail.com [mailto:question.answer...@gmail.com] 
Sent: Wednesday, September 14, 2016 12:15 PM
To: question.answer...@gmail.com
Cc: user@tika.apache.org
Subject: When parsing with Tika, some characters in the body are repeated. 
Re: With Apache Tika, characters in the PDF body text are repeated

This PDF has many places where characters are repeated.

The text in the PDF reads "(4)   ?于流?性DB", but

when I get the content with Tika, it becomes
  "4)  (4)  (4)  (4)  ? ?? ?于流 于流 于流 于流? ?? ?性 性性 性 DB DDBB DB".

Why do the characters multiply, with the same characters repeated in a row?


How can I solve this?


Is the cause that these characters are bold or underlined in the PDF?


■ I am attaching a file that reproduces the problem.

-- 
question.answer...@gmail.com <question.answer...@gmail.com>
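
One plausible reading of the bold-text question, which nothing in this thread confirms for the attached file: some PDFs simulate bold by drawing the same glyph several times with a tiny offset, and an extractor that reports every drawn glyph then emits repeated characters. The sketch below is a self-contained illustration of that idea and of position-based duplicate suppression. The `Glyph` class, `fakeBold`, the fixed advance of 10 units and the tolerance are all hypothetical, and this is not PDFBox's or Tika's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a toy model of glyph positions, not a real PDF parser.
public class OverlapSuppressionSketch {

    /** A glyph as an extractor might see it: the text plus where it was drawn. */
    static final class Glyph {
        final String text;
        final double x;
        final double y;

        Glyph(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Simulates "fake bold": each character is drawn `copies` times, shifted by 0.1 units. */
    static List<Glyph> fakeBold(String text, int copies) {
        List<Glyph> glyphs = new ArrayList<>();
        double x = 0;
        for (int i = 0; i < text.length(); i++) {
            String ch = String.valueOf(text.charAt(i));
            for (int c = 0; c < copies; c++) {
                glyphs.add(new Glyph(ch, x + 0.1 * c, 0));
            }
            x += 10; // assumed fixed advance per character
        }
        return glyphs;
    }

    /** Concatenates every glyph in drawing order, duplicates included. */
    static String naive(List<Glyph> glyphs) {
        StringBuilder out = new StringBuilder();
        for (Glyph g : glyphs) {
            out.append(g.text);
        }
        return out.toString();
    }

    /** Drops a glyph if the same text was already kept within `tolerance` on both axes. */
    static String suppressOverlaps(List<Glyph> glyphs, double tolerance) {
        List<Glyph> kept = new ArrayList<>();
        StringBuilder out = new StringBuilder();
        for (Glyph g : glyphs) {
            boolean duplicate = false;
            for (Glyph k : kept) {
                if (k.text.equals(g.text)
                        && Math.abs(k.x - g.x) < tolerance
                        && Math.abs(k.y - g.y) < tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(g);
                out.append(g.text);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = fakeBold("DB", 4);
        System.out.println(naive(glyphs));                 // prints DDDDBBBB
        System.out.println(suppressOverlaps(glyphs, 1.0)); // prints DB
    }
}
```

Glyphs that legitimately repeat further along the line are kept, because only copies drawn at nearly the same position count as duplicates.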



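> This is a PDF in which characters are repeated in many places.
> 
> The text in the PDF reads "(4)   ?于流?性DB", but
> 
> when I get the content with Tika, it becomes
>  "4)  (4)  (4)  (4)  ? ?? ?于流 于流 于流 于流? ?? ?性 性性 性 DB DDBB DB".
> 
> Why do the characters multiply, with the same characters repeated in a row?
> 
> 
> How can I solve this?
> 
> 
> Is the cause that these characters are bold or underlined in the PDF?
> 
> 
> ■ I am attaching a file that reproduces the problem.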
> 
> 
> -- 
> Tika beginner question.answer...@gmail.com 
> 
> 
> 
> > Again, relying on Google Translate.  Yes, I would think that suppressing 
> > overlapping characters should solve this problem.  Try pure PDFBox, and if 
> > the problem is there, try asking on the PDFBox list.
> > 
> > 
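> > Apologies for the abrupt message, but I would like to ask about the issue in the subject line.
> > 
> > I am parsing PDFs with Apache Tika in Java.
> > Most PDFs are read correctly, but some fail with a parse error, and for
> > others the parse succeeds but characters in the body text are repeated.
> > 
> > What I would like to ask about is the cause of these repeated characters in the body text and how to fix it.
> > Below is an excerpt of one such pattern from the long text extracted by the parser.
> > 
> > ⇒ 「(1)(1)(1)(1)風風風風林火林火林火林火山山山山用用用用DBDBDBDB」
> > 
> > My guess is that when Tika extracted the part of the PDF that reads
> > "(1)風林火山用DB",
> > it also picked up something embedded in the PDF that is not visible when the
> > file is opened normally, such as PDF comments or accessibility data. (This is only a guess.)
> > 
> > Source: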
> > -----------------------------------------------------
> > File document = new File("/usr/local/sample.pdf");
> > Parser parser = new AutoDetectParser();
> > ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
> > Metadata metadata = new Metadata();
> > parser.parse(new FileInputStream(document), handler, metadata, new ParseContext());
> > String plainText = handler.toString();
> > System.out.println(plainText);
> > -----------------------------------------------------
> > 
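> > What is the cause, and is there a workaround (a Tika setting, perhaps)?
> > 
> > 
> > 
> > Also, since the above did not work, and the places where characters
> > repeat appear to be bold or underlined,
> > I changed the code to the following, but the result is exactly the same.
> > Do you notice any problems, or have any suggestions for a fix?
> > 
> > Source: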
> > -----------------------------------------------------
> > File document = new File("/usr/local/sample.pdf");
> > PDFParser parser = new PDFParser();
> > PDFParserConfig config = new PDFParserConfig();
> > 
> > // Whether to ignore duplicate characters when bold etc. is rendered by
> > // overlaying characters = I want to ignore them
> > config.setSuppressDuplicateOverlappingText(true);
> > 
> > // Whether to ignore text underlines etc. = I want to ignore them
> > config.setExtractAnnotationText(false);
> > 
> > parser.parse(new FileInputStream(document), handler, metadata, new ParseContext());
> > 
> > String plainText = handler.toString();
> > System.out.println(plainText);
> > -----------------------------------------------------
> > 
> > 
> > Tika beginner
>
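
A note on the second quoted code sample: it builds a `PDFParserConfig`, but the `parse` call receives a fresh `ParseContext`, so the parser never sees those settings. That would explain why the output did not change. The sketch below shows one way to hand the config to the parser through the `ParseContext`. It assumes Tika 1.x with the PDF parser module on the classpath; the class name `TikaPdfConfigSketch` and its method names are made up for illustration. Whether duplicate suppression actually removes the repetition in the attached file is not confirmed in this thread.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

// Hypothetical helper class, not part of Tika.
public class TikaPdfConfigSketch {

    /** Builds a ParseContext that carries the PDF settings to the parser. */
    static ParseContext buildContext() {
        PDFParserConfig config = new PDFParserConfig();
        config.setSuppressDuplicateOverlappingText(true);
        config.setExtractAnnotationText(false);

        ParseContext context = new ParseContext();
        // Without this line the config above is never used by the parser.
        context.set(PDFParserConfig.class, config);
        return context;
    }

    /** Parses the PDF with the configured context and returns the body text. */
    static String extract(File pdf) throws Exception {
        ContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
        Metadata metadata = new Metadata();
        try (InputStream in = new FileInputStream(pdf)) {
            new PDFParser().parse(in, handler, metadata, buildContext());
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extract(new File(args[0]))); // path to the PDF
    }
}
```

The same `ParseContext` can be passed to an `AutoDetectParser`, which forwards it to the PDF parser it delegates to.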
