from:"question.answer...@gmail.com"

Apache Tika: characters in the PDF body text come out repeated

2016-09-14 Thread question.answer...@gmail.com

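Hello everyone, nice to meet you.

I am new to Tika.

I will get straight to the point and ask about the issue in the subject line.

I am parsing PDFs with Apache Tika from Java.
Most PDFs are read correctly, but some fail with a parse error, and
some parse successfully but come out with characters in the body text repeated.

What I would like to ask about here is the cause of, and a fix for, the "repeated characters in the body text" problem.
Below is an excerpt from the long extracted text showing one instance of the recurring pattern.

⇒ "(1)(1)(1)(1)林火林火林火林火DBDBDBDB"

Presumably this comes from the part of the PDF that reads "(1)風林火山用DB".
My guess (speculation only) is that when Tika extracted it, it also picked up
something embedded in the PDF that is not visible when the file is opened normally,
such as a PDF comment or accessibility data.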

Source:
-
File document = new File("/usr/local/sample.pdf");
Parser parser = new AutoDetectParser();
ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
Metadata metadata = new Metadata();
parser.parse(new FileInputStream(document), handler, metadata, new ParseContext());
String plainText = handler.toString();
System.out.println(plainText);
-

What is the cause, and is there a fix (a Tika setting, perhaps)?



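Also, since the above did not work, and the places where characters repeat
seem to be in bold or underlined,
I modified the code as shown below, but the result did not change at all.
Do you notice any problems, or know of a solution?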

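Source:
-
File document = new File("/usr/local/sample.pdf");
ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
Metadata metadata = new Metadata();
PDFParser parser = new PDFParser();
PDFParserConfig config = new PDFParserConfig();

// Whether to ignore duplicate characters when bold text is rendered by
// drawing the same characters on top of each other = I want to ignore them
config.setSuppressDuplicateOverlappingText(true);

// Whether to ignore things such as text underlines = I want to ignore them
config.setExtractAnnotationText(false);

// Note: config is never handed to the parser or to the ParseContext here.
parser.parse(new FileInputStream(document), handler, metadata, new ParseContext());

String plainText = handler.toString();
System.out.println(plainText);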
-
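
If the repeats survive whatever parser settings are tried, one blunt workaround is to post-process the extracted string. The following is a minimal sketch in plain Java, assuming the damage always appears as a unit repeated exactly four times in a row, as in the excerpt quoted earlier in this message. `RepeatCollapser` and `collapse` are hypothetical names, not part of Tika or PDFBox.

```java
import java.util.regex.Pattern;

public class RepeatCollapser {
    // Matches the shortest unit that is immediately followed by three more
    // copies of itself (four copies in total), e.g. "DBDBDBDB".
    private static final Pattern FOURFOLD = Pattern.compile("(.+?)\\1{3}");

    /** Replaces every four-fold run with a single copy of the repeated unit. */
    public static String collapse(String text) {
        return FOURFOLD.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        // The excerpt from the message, written with Unicode escapes so the
        // source file encoding does not matter.
        String extracted =
                "(1)(1)(1)(1)\u6797\u706B\u6797\u706B\u6797\u706B\u6797\u706BDBDBDBDB";
        System.out.println(collapse(extracted));
    }
}
```

This would also shorten legitimate text such as "0000" to "0", so it is only reasonable when the four-fold pattern is known to be an extraction artifact.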


Tika beginner

Apache Tika: importing a protected PDF yields entirely garbled text

2016-09-14 Thread question.answer...@gmail.com

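Hello everyone, nice to meet you.

I am new to Tika.

I will get straight to the point and ask about the issue in the subject line.

When I import a protected PDF with Apache Tika, the entire text comes out garbled.
Is this the intended behavior?
Is there a way, through settings or otherwise, to avoid this and import the text without garbling?
  Note: PDFs that are not protected are imported without any garbling.

What is the cause, and is there a fix (a Tika setting, perhaps)?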


Source:
-
File document = new File("/usr/local/sample.pdf");
Parser parser = new AutoDetectParser();
ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
Metadata metadata = new Metadata();
parser.parse(new FileInputStream(document), handler, metadata, new ParseContext());
String plainText = handler.toString();
System.out.println(plainText);
-


Additional note:
- Text cannot be copied manually from the protected PDF.


Tika beginner

Apache Tika: garbled text when reading HTML encoded in EUC or Shift-JIS

2016-09-14 Thread question.answer...@gmail.com

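Hello.

I am stuck.

When I read HTML encoded in EUC or Shift-JIS with Apache Tika, the text comes out garbled.

What is the cause, and is there a fix (a Tika setting, perhaps)?

* I am attaching an HTML file that gets garbled when read.
  Note: it is an EUC-encoded file (according to Hidemaru Editor's detection).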

Source:
-
File document = new File("/usr/local/sample.pdf");
Parser parser = new AutoDetectParser();
ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
Metadata metadata = new Metadata();
parser.parse(new FileInputStream(document), handler, metadata, new ParseContext());
String plainText = handler.toString();
System.out.println(plainText);
-


-- 
Tech beginner
Title: ÆüËÜ¸ì¤Ç¤âÊ¸»ú²½¤±¤·¤Æ¤·¤Þ¤¹¤Î¤Çº¤¤ë

	
	
ÆüËÜ¸ì¤À¤±¤É¡¢Ê¸»ú²½¤±¤¹¤ë¡£
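
The garbled title above begins with "ÆüËÜ¸ì", which is what the EUC-JP bytes of the Japanese word 日本語 look like when they are read as ISO-8859-1. That suggests the bytes themselves are intact and only the charset used for decoding is wrong. A minimal sketch in plain Java reproducing the effect follows; `MojibakeDemo` and `misdecode` are hypothetical names, and it assumes the JDK ships the EUC-JP charset, as full JDK distributions normally do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    /** Encodes text with its actual charset, then decodes the bytes with a wrongly assumed one. */
    public static String misdecode(String text, Charset actual, Charset assumed) {
        return new String(text.getBytes(actual), assumed);
    }

    public static void main(String[] args) {
        Charset eucJp = Charset.forName("EUC-JP");
        String original = "\u65E5\u672C\u8A9E"; // the word for "Japanese language"

        // Reading EUC-JP bytes as ISO-8859-1 produces the garbled form.
        String garbled = misdecode(original, eucJp, StandardCharsets.ISO_8859_1);
        System.out.println(garbled);

        // ISO-8859-1 maps every byte to one character, so the damage is reversible:
        // re-encode as ISO-8859-1 and decode with the real charset.
        String repaired = new String(garbled.getBytes(StandardCharsets.ISO_8859_1), eucJp);
        System.out.println(repaired.equals(original));
    }
}
```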

Correction: Apache Tika: garbled text when reading HTML encoded in EUC or Shift-JIS

2016-09-14 Thread question.answer...@gmail.com

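The file that gets garbled when read with Tika is the one attached to this email.

Note: the file attached to my previous email apparently had its character encoding
changed when I saved it in Hidemaru Editor, so it does not get garbled.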

-- 
Tech beginner
Title: ²ãÑ¹±¡Æ¬ÈÝÆ÷

	
	
²ãÑ¹±¡Æ¬ÈÝÆ÷

Re: Correction: Apache Tika: garbled text when reading HTML encoded in EUC or Shift-JIS

2016-09-14 Thread question.answer...@gmail.com

Thank you for your answer.

I cannot determine in advance whether a file's character encoding is EUC, Shift-JIS,
UTF-8, or something else.
I would like a Java library or Tika to determine it for me.
I would like to know how that determination can be done.


Tech beginner



> Again, relying on Google translate.
> 
> The problem with these files is that they don't self identify their encoding 
> via http metaheaders, and they contain very little content so Mozilla's 
> UniversalChardet and ICU4J don't have enough to work with.  IE, Chrome and 
> Firefox all fail on these files, too.
> 
> If you know that a file is EUC_JP, you can send a hint via the metadata 
> before the call to parse:
> 
> 
> Metadata metadata = new Metadata();
> metadata.set(Metadata.CONTENT_TYPE, "text/html; charset=EUC_JP");
> parser.parse(new FileInputStream(document), handler, metadata, new ParseContext());
> String plainText = handler.toString();
> 
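
Part of why little content gives the detectors "not enough to work with" is that a short byte sequence is often valid in several encodings at once, so validity alone cannot settle the question. The sketch below, in plain Java, lists which of a few candidate charsets accept a given byte sequence without a decoding error. `DecodableCharsets` and `accepting` are hypothetical names; this is not how Tika, UniversalChardet, or ICU4J detect encodings, only an illustration of the ambiguity.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.ArrayList;
import java.util.List;

public class DecodableCharsets {
    /** Returns the candidate charset names whose strict decoder accepts the bytes. */
    public static List<String> accepting(byte[] bytes, String... candidates) {
        List<String> accepted = new ArrayList<>();
        for (String name : candidates) {
            try {
                // REPORT makes the decoder throw instead of silently substituting characters.
                Charset.forName(name).newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(bytes));
                accepted.add(name);
            } catch (CharacterCodingException e) {
                // The bytes are not valid in this charset; leave it out.
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Three Japanese characters encoded as EUC-JP (six bytes).
        byte[] bytes = "\u65E5\u672C\u8A9E".getBytes(Charset.forName("EUC-JP"));
        System.out.println(accepting(bytes, "UTF-8", "EUC-JP", "ISO-8859-1"));
    }
}
```

More than one candidate survives for this input, which is why real detectors fall back on statistics over the content, and why those statistics are weak when the file is tiny.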

Re: Apache Tika: importing a protected PDF yields entirely garbled text

2016-09-14 Thread question.answer...@gmail.com

Are you saying that Tika cannot parse the text of protected PDF files?
If that is how Tika is specified to behave, I will give up on parsing them.

Is this Tika's intended behavior?


-- 
question.answer...@gmail.com 



> Relying on google translate...  I'm not sure how protection could lead to 
> garbled text; if the file is password protected, you shouldn't get any text.
> 
> 
> Try troubleshooting with pure PDFBox:
> 
> https://wiki.apache.org/tika/Troubleshooting%20Tika#PDF_Text_Problems

Text is garbled when I parse with Tika. Re: Correction: Apache Tika: garbled text when reading HTML encoded in EUC or Shift-JIS

2016-09-14 Thread question.answer...@gmail.com

Is there any way for Tika to read HTML and PDF files encoded in EUC, Shift-JIS, or UTF-8?
In addition, I want the result converted to UTF-8 and stored in a String without garbling.
Please tell me how to write the program.

When I parse with Tika, the text is garbled.

-- 
question.answer...@gmail.com 



> Sorry, can't tell what the question is?
> 
> -Original Message-
> From: question.answer...@gmail.com [mailto:question.answer...@gmail.com] 
> Sent: Wednesday, September 14, 2016 11:50 AM
> To: Allison, Timothy B. 
> Subject: Re: Correction: Apache Tika: garbled text when reading HTML encoded in EUC or Shift-JIS
> 
> Hi :)
> 
> How should I use the following in my Tika program?
> 
> -
> Tika applies the following detectors in this order:
> 
> org.apache.tika.parser.html.HtmlEncodingDetector
> org.apache.tika.parser.txt.UniversalEncodingDetector
> org.apache.tika.parser.txt.Icu4jEncodingDetector
> 
> These are specified in 
> META-INF/services/org.apache.tika.detect.EncodingDetector
> 
> Tika selects the first detector that returns a non-null value.
> -----
> 
> 
> -- 
> question.answer...@gmail.com 
> 
> 
> 
> > Ha, thank you for running google translate for me. :)
> > 
> > If the question is: "If I don't know the encoding before I send it to Tika, 
> > how does Tika determine the encoding?"
> > 
> > Tika applies the following detectors in this order:
> > 
> > org.apache.tika.parser.html.HtmlEncodingDetector
> > org.apache.tika.parser.txt.UniversalEncodingDetector
> > org.apache.tika.parser.txt.Icu4jEncodingDetector
> > 
> > These are specified in 
> > META-INF/services/org.apache.tika.detect.EncodingDetector
> > 
> > Tika selects the first detector that returns a non-null value.
> > 
> > You can modify the service loading file to run the encoders in a different 
> > order or to specify your own encoding detector.
> > 
> > If the question is, "Why can't Tika get it right?"  Well, there are limits 
> > to statistical inference on only a few observations (small amount of 
> > bytes). :)
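
The rule "select the first detector that returns a non-null value" can be illustrated without Tika on the classpath. The sketch below is plain Java with two hypothetical stand-in detectors (a byte-order-mark check and an always-answering fallback); `FirstNonNullChain` and its methods are illustrative names and are not Tika's `EncodingDetector` API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class FirstNonNullChain {
    /** Hypothetical detector: answers UTF-8 only when the data starts with a UTF-8 byte order mark. */
    public static Charset bomDetector(byte[] data) {
        boolean hasBom = data.length >= 3
                && data[0] == (byte) 0xEF && data[1] == (byte) 0xBB && data[2] == (byte) 0xBF;
        return hasBom ? StandardCharsets.UTF_8 : null;
    }

    /** Hypothetical last-resort detector: always answers ISO-8859-1. */
    public static Charset fallbackDetector(byte[] data) {
        return StandardCharsets.ISO_8859_1;
    }

    /** Builds the chain in priority order; reordering this list changes which answer wins. */
    public static List<Function<byte[], Charset>> defaultChain() {
        List<Function<byte[], Charset>> chain = new ArrayList<>();
        chain.add(FirstNonNullChain::bomDetector);
        chain.add(FirstNonNullChain::fallbackDetector);
        return chain;
    }

    /** Runs the detectors in order and returns the first non-null answer, or null if none answers. */
    public static Charset detect(byte[] data, List<Function<byte[], Charset>> detectors) {
        for (Function<byte[], Charset> detector : detectors) {
            Charset result = detector.apply(data);
            if (result != null) {
                return result;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x61};
        byte[] withoutBom = {0x61, 0x62};
        System.out.println(detect(withBom, defaultChain()));
        System.out.println(detect(withoutBom, defaultChain()));
    }
}
```

Because the first answer wins, an early detector that guesses wrongly on a tiny file hides whatever the later detectors would have said, which is why the order in the service file matters.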

Re: Text is garbled when I parse with Tika. Re: Correction: Apache Tika: garbled text when reading HTML encoded in EUC or Shift-JIS

2016-09-14 Thread question.answer...@gmail.com



files           do                         result

PDF   -->
HTML  -->   Tika does the analysis   -->   String (Java, UTF-8)
TXT   -->



-- 
question.answer...@gmail.com 



Re: Text is garbled when I parse with Tika. Re: Correction: Apache Tika: garbled text when reading HTML encoded in EUC or Shift-JIS

2016-09-14 Thread question.answer...@gmail.com


This is a picture of what I want to do.
Which part of my program do I need to fix?

files           do                         result

PDF   -->
HTML  -->   Tika does the analysis   -->   String (Java, UTF-8)
TXT   -->
 ^^                                          ^^
 ^^                                          ^^
EUC                                         UTF-8
Shift-JIS
GB2312
   :


> > > > -
> > > > File document = new File("/usr/local/sample.pdf");
> > > > Parser parser = new AutoDetectParser();
> > > > ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
> > > > Metadata metadata = new Metadata();
> > > > parser.parse(new FileInputStream(document), handler, metadata, new ParseContext());
> > > > String plainText = handler.toString();
> > > > System.out.println(plainText);
> > > > -


In other words:
I want to import various kinds of files with Tika.
However, depending on the character encoding of the file, the text is imported garbled.
What should I do?



-- 
question.answer...@gmail.com 
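
On the "String (Java, UTF-8)" end of the picture: a Java `String` holds Unicode text and has no byte encoding of its own, so an encoding only comes into play when bytes are decoded into a `String` and when a `String` is written back out as bytes. Assuming the source charset is known or has been detected correctly, converting to UTF-8 is a decode followed by an encode. A minimal sketch in plain Java follows; `Utf8Transcoder` and `toUtf8` are hypothetical names, not Tika API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf8Transcoder {
    /** Decodes the input with its source charset, then encodes the text as UTF-8 bytes. */
    public static byte[] toUtf8(byte[] input, Charset source) {
        String text = new String(input, source);       // bytes -> Unicode text
        return text.getBytes(StandardCharsets.UTF_8);  // Unicode text -> UTF-8 bytes
    }

    public static void main(String[] args) {
        Charset eucJp = Charset.forName("EUC-JP");
        byte[] eucBytes = "\u65E5\u672C\u8A9E".getBytes(eucJp);
        byte[] utf8Bytes = toUtf8(eucBytes, eucJp);
        System.out.println(eucBytes.length + " EUC-JP bytes -> " + utf8Bytes.length + " UTF-8 bytes");
    }
}
```

If the source charset passed in is wrong, the decode step already produces garbled text, and encoding it as UTF-8 afterwards does not repair it.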



Characters are garbled when I import a Chinese PDF.

2016-09-15 Thread question.answer...@gmail.com

Characters are garbled when I import a Chinese PDF (and likewise with EUC, Shift-JIS, and so on).
I want to read it as UTF-8.
How should I write the code?


Below is my current program.
-
File document = new File(strFile_fullpath);

ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
Metadata metadata = new Metadata();
PDFParser parser = new PDFParser();

parser.getPDFParserConfig().setSuppressDuplicateOverlappingText(true);
parser.getPDFParserConfig().setExtractAnnotationText(false);

parser.parse(new FileInputStream(document), handler, metadata, new ParseContext());

System.out.println(handler.toString());
-


-- 
Syoshin

[Tika] I have a question. --> "Exception : org.apache.pdfbox.cos.COSArray cannot be cast to org.apache.pdfbox.cos.COSDictionary"

2016-09-16 Thread question.answer...@gmail.com

An exception is raised at the line that begins "parser.parse(new Fil".

"Exception : org.apache.pdfbox.cos.COSArray cannot be cast to
org.apache.pdfbox.cos.COSDictionary"

Why does this exception occur?
It does not occur with dozens of other PDFs.



Below is my program.
-
try {
    File document = new File("/usr/local/sample.pdf");

    PDFParser parser = new PDFParser();
    ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
    Metadata metadata = new Metadata();
    parser.parse(new FileInputStream(document), handler, metadata, new ParseContext());

    String plainText = handler.toString();
    System.out.println(plainText);
}
catch (FileNotFoundException e) {
    e.printStackTrace();
    throw new RuntimeException(e.getMessage());
}
catch (IOException e) {
    e.printStackTrace();
    throw new RuntimeException(e.getMessage());
}
catch (SAXException e) {
    e.printStackTrace();
    throw new RuntimeException(e.getMessage());
}
catch (TikaException e) {
    e.printStackTrace();
    throw new RuntimeException(e.getMessage());
}
catch (Exception e) {
    e.printStackTrace();
    throw new RuntimeException(e.getMessage());
}
-
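
One detail of the catch blocks above makes a report like this harder to diagnose: `new RuntimeException(e.getMessage())` keeps only the message text, so the rethrown exception no longer carries the original exception type or its stack trace. A minimal sketch in plain Java of wrapping while keeping the original as the cause follows; `CausePreservingWrap` and `wrap` are hypothetical names, and the `ClassCastException` here is constructed by hand purely for illustration.

```java
public class CausePreservingWrap {
    /** Wraps an exception in a RuntimeException and keeps the original as its cause. */
    public static RuntimeException wrap(Exception e) {
        return new RuntimeException(e.getMessage(), e);
    }

    public static void main(String[] args) {
        // Stand-in for the failure reported in this thread.
        Exception original = new ClassCastException(
                "org.apache.pdfbox.cos.COSArray cannot be cast to org.apache.pdfbox.cos.COSDictionary");
        RuntimeException wrapped = wrap(original);

        // The cause (with its own stack trace) travels with the wrapper,
        // so whoever catches it can still see where the failure started.
        System.out.println(wrapped.getCause().getClass().getName());
    }
}
```

With the cause attached, a stack trace posted to a mailing list shows the frames inside the library where the cast failed, which is usually what the maintainers need.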

-- 
syosinnsya

Re: [Tika] I have a question. --> "Exception : org.apache.pdfbox.cos.COSArray cannot be cast to org.apache.pdfbox.cos.COSDictionary"

2016-09-16 Thread question.answer...@gmail.com

Thank you for your answer :)

By the way, if I wait, could I get an answer from you?
I do not know where I should ask questions about PDFBox.

-- 
syosinnsya



> Could be a bug in PDFBox. Might want to ask on the pdfbox users' list.
