[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2016-07-23 Thread Shabanali Faghani (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15390623#comment-15390623
 ] 

Shabanali Faghani edited comment on TIKA-2038 at 7/23/16 11:52 AM:
---

My preference is option 1).
In this case I can freely work on my library and then just announce the version 
number of the latest release to you. Although it works fine now, there are some 
optimizations that I intend to do in the future. For example, some 
considerations about UTF-16, SAX support for big HTML documents, and some 
performance-related issues are on my TODO list.

To get rid of the downside of this option, there may be a solution. For 
example, instead of depending on icu4j in my pom, I can either …
- copy/paste the few needed classes of icu4j into my source code, or
- add a {{tika-xxx}} dependency (a fixed current version that contains the icu4j 
classes) with {{provided}} scope to my pom. (I'm not sure whether {{provided}} 
can resolve the circular dependency or not.)

Please let me know what you think. Is this a reasonable solution?

Out of curiosity, I haven't traced the entire code of Tika, but it seems that you 
currently use icu4j 
[somewhere|https://github.com/apache/tika/blob/master/tika-core/src/main/java/org/apache/tika/utils/CharsetUtils.java#L90]
 in your code as an optional dependency! I've also previously seen a piece of 
code just like this one in Apache PDFBox, in which icu4j was used for text 
normalization.

Sure, but note that the charset in the HTTP header of a requested page was my 
criterion for putting that page in a category. As a validation check I visually 
inspected some of these documents and didn't see any problem. You know that 
although the charsets in HTTP headers are not completely foolproof, they are 
the only reasonable criterion available. If you add my HTML test set to 
your testing corpus, please let me know its address, thanks.



> A more accurate facility for detecting Charset Encoding of HTML documents
> -
>
> Key: TIKA-2038
> URL: https://issues.apache.org/jira/browse/TIKA-2038
> Project: Tika
>  Issue Type: Improvement
>  Components: core, detector
>Reporter: Shabanali Faghani
>Priority: Minor
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML documents 
> as well as other natural-text documents. But the accuracy of encoding 
> detection tools, including icu4j, on HTML documents is meaningfully lower than 
> on other text documents. Hence, in our 
> project I developed a library that works pretty well for HTML documents, 
> which is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as 
> Nutch, Lucene, Solr, etc., and these projects deal heavily with HTML documents, 
> it seems that having such a facility in Tika would also help them become 
> more accurate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2016-07-29 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15389466#comment-15389466
 ] 

Tim Allison edited comment on TIKA-2038 at 7/29/16 7:06 PM:


This is great!  -I've been wanting to add stripping of HTML markup because I 
also found that it confuses icu4j.- [EDIT: this is wrong, ICU4J already tries 
to do this for content between <...>]

See a comparison on our regression corpus 
[here|http://162.242.228.174/encoding_detection/].  ICU4J generally does better 
than Mozilla, but we were getting quite a few incorrect Big5 results from ICU4J 
when Mozilla had windows-1252/ISO-8859-1.

Our current algorithm is to run the following in order.  The first one with a 
non-null answer is the encoding we choose:
{noformat}
org.apache.tika.parser.html.HtmlEncodingDetector
org.apache.tika.parser.txt.UniversalEncodingDetector
org.apache.tika.parser.txt.Icu4jEncodingDetector
{noformat}
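
For illustration, a minimal sketch of this cascade (assuming Tika's 
{{org.apache.tika.detect.EncodingDetector}} interface with 
{{detect(InputStream, Metadata)}}; this is not Tika's actual composite 
implementation, and the class name is made up):
{noformat}
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.List;

import org.apache.tika.detect.EncodingDetector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.html.HtmlEncodingDetector;
import org.apache.tika.parser.txt.Icu4jEncodingDetector;
import org.apache.tika.parser.txt.UniversalEncodingDetector;

public class CascadingDetectorSketch {

    // the three detectors, in the order listed above
    private static final List<EncodingDetector> DETECTORS = Arrays.asList(
            new HtmlEncodingDetector(),       // 1) charset declared in <meta> tags
            new UniversalEncodingDetector(),  // 2) Mozilla universalchardet port
            new Icu4jEncodingDetector());     // 3) ICU4J statistical detection

    public static Charset detect(InputStream stream, Metadata metadata) throws IOException {
        // the detectors expect a stream that supports mark/reset
        InputStream buffered = new BufferedInputStream(stream);
        for (EncodingDetector detector : DETECTORS) {
            Charset charset = detector.detect(buffered, metadata);
            if (charset != null) {
                return charset;   // first non-null answer wins
            }
        }
        return null;              // nothing detected
    }
}
{noformat}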

It looks like you maintain this order: check for a charset meta header first, 
then detect if necessary.

Out of curiosity, did you compare the results of your algorithm against the 
metaheader info?  Do you have an estimate of how often that info is wrong?





[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2016-07-29 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15399886#comment-15399886
 ] 

Tim Allison edited comment on TIKA-2038 at 7/29/16 7:26 PM:


I'm attaching the raw results from running Tika against the corpus available on 
IUST-HTMLCharDet's github 
[site|https://github.com/shabanali-faghani/IUST-HTMLCharDet/tree/master/test-data/encoding-wise].




[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2016-07-29 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15399911#comment-15399911
 ] 

Tim Allison edited comment on TIKA-2038 at 7/29/16 7:48 PM:


||Subdirectory||Detected by Tika||Count||Percent||
|GBK|GBK|323|77.1%|
|GBK|GB2312|77| |
|GBK|GB18030|13| |
|GBK|UTF-8|3| |
|GBK|windows-1252|3| |
|Shift_JIS|Shift_JIS|639|99.8%|
|Shift_JIS|windows-1252|1| |
|UTF-8|UTF-8|642|97.7%|
|UTF-8|ISO-8859-1|11| |
|UTF-8|windows-1252|4| |
|Windows-1251|windows-1251|313|99.7%|
|Windows-1251|UTF-8|1| |
|Windows-1256|windows-1256|597|92.6%|
|Windows-1256|windows-1252|24| |
|Windows-1256|ISO-8859-1|10| |
|Windows-1256|UTF-8|7| |
|Windows-1256|x-MacCyrillic|5| |
|Windows-1256|IBM866|1| |
|Windows-1256|ISO-8859-5|1| |





[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2016-07-31 Thread Shabanali Faghani (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15401595#comment-15401595
 ] 

Shabanali Faghani edited comment on TIKA-2038 at 8/1/16 6:33 AM:
-

I was astonished by these results at first look! They are far better than what 
I had seen before, I mean when I tested Tika. Then I remembered that 
almost all of the test files in my corpus have charset information in their 
Meta tags… and according to the order of your algorithm, as you've stated in 
the first comment in this issue, it [looks for a charset in Meta 
tags|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlEncodingDetector.java#L103]
 before anything else. The same thing [is done in my 
algorithm|https://github.com/shabanali-faghani/IUST-HTMLCharDet/blob/master/src/main/java/ir/ac/iust/htmlchardet/HTMLCharsetDetector.java#L49],
 but in my case it is optional, and for the evaluations in my paper (both 
encoding-wise and language-wise) I called the {{detect(byte[] rawHtmlByteSequence, 
boolean... lookInMeta)}} method with {{false}} for the {{lookInMeta}} 
argument, because …

1) it seems that there is no charset information available (neither in the HTTP 
header at crawl time nor in Meta tags in offline mode) for almost half of 
all HTML documents, see *#primitive URLs* and *#sites with valid charset in 
HTTP header* in the [Language-Wise 
Evaluation|https://github.com/shabanali-faghani/IUST-HTMLCharDet#language-wise-evaluation]
 table, and …
2) as you know, for the other half, where charset information is available, 
there is no 100% guarantee that this information is valid.

So, to have a fair evaluation/comparison, the potential charsets in Meta tags 
should not be involved in the detection process. Hence, for computing the accuracy 
of Tika-EncodingDetector, the first step of your algorithm should be ignored. 
This can be done either …
* by removing ... (for each document in the corpus)
** the value of the {{content}} attribute that contains {{charset=xyz}} in a meta 
tag, see 
[this|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlEncodingDetector.java#L50],
 and
** the value of the {{charset}} attribute of a {{meta}} tag (html5), see 
[this|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlEncodingDetector.java#L52]
* by directly calling the 2nd and 3rd steps of your algorithm (not reliable, 
because there may be some intermediate processing), or
* simply by depending on the Tika source code and commenting out some code in it!



[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2016-07-31 Thread Shabanali Faghani (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15401595#comment-15401595
 ] 

Shabanali Faghani edited comment on TIKA-2038 at 8/1/16 6:42 AM:
-

I was astonished by these results at first look! They are far better than what 
I had seen before, I mean when I tested Tika. Then I remembered that 
almost all of the test files in my corpus have charset information in their 
Meta tags… and according to the order of your algorithm, as you've stated in 
the first comment in this issue, it [looks for a charset in Meta 
tags|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlEncodingDetector.java#L103]
 before anything else. The same thing [is done in my 
algorithm|https://github.com/shabanali-faghani/IUST-HTMLCharDet/blob/master/src/main/java/ir/ac/iust/htmlchardet/HTMLCharsetDetector.java#L49],
 but in my case it is optional, and for the evaluations in my paper (both 
encoding-wise and language-wise) I called the {{detect(byte[] rawHtmlByteSequence, 
boolean... lookInMeta)}} method with {{false}} for the {{lookInMeta}} 
argument, because …

1) it seems that there is no charset information available (neither in the HTTP 
header at crawl time nor in Meta tags in offline mode) for almost half of 
all HTML documents, see *#primitive URLs* and *#sites with valid charset in 
HTTP header* in the [Language-Wise 
Evaluation|https://github.com/shabanali-faghani/IUST-HTMLCharDet#language-wise-evaluation]
 table, and …
2) as you know, for the other half, where charset information is available, 
there is no 100% guarantee that this information is valid.

So, to have a fair evaluation/comparison, the potential charsets in Meta tags 
should not be involved in the detection process. Hence, for computing the accuracy 
of Tika-EncodingDetector, the first step of your algorithm should be ignored. 
This can be done either …
* by removing ... (for each document in the corpus)
** the value of the {{content}} attribute that contains {{charset=xyz}} in a meta 
tag, see 
[this|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlEncodingDetector.java#L50],
 and
** the value of the {{charset}} attribute of a {{meta}} tag (html5), see 
[this|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlEncodingDetector.java#L52]
* by directly calling the 2nd and 3rd steps of your algorithm (not reliable, 
because there may be some intermediate processing), or
* simply by depending on the Tika source code and commenting out some code in it!
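
As a concrete illustration of the evaluation call mentioned above, a minimal 
usage sketch of IUST-HTMLCharDet (assuming {{detect}} is static and returns the 
detected charset name as a String; both assumptions should be checked against 
the library's README):
{noformat}
import java.nio.file.Files;
import java.nio.file.Paths;

import ir.ac.iust.htmlchardet.HTMLCharsetDetector;

public class DetectWithoutMeta {
    public static void main(String[] args) throws Exception {
        // read a raw (undecoded) HTML file from disk
        byte[] rawHtml = Files.readAllBytes(Paths.get(args[0]));

        // lookInMeta = false: skip the <meta> charset hint, as in the paper's evaluation
        String charset = HTMLCharsetDetector.detect(rawHtml, false);
        System.out.println(charset);
    }
}
{noformat}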



[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2016-08-01 Thread Shabanali Faghani (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15401715#comment-15401715
 ] 

Shabanali Faghani edited comment on TIKA-2038 at 8/1/16 8:45 AM:
-

OK, so to give more details about my library to this community, and also in 
response to your concerns, I would say:

1) You are right, my repo on GitHub is fairly new (less than 1 year old), but its 
algorithm is not new. I developed this library 4 years ago to be used 
in a large-scale project… and it has worked well from that time till now. It was 
under a load of ~1.2 billion pages at peak. The bug that I fixed 
last week was just a tiny mistake that was introduced while refactoring the code 
before the first release.

2) Since accuracy was much more important than performance for us, I 
haven't done a thorough performance test. Nevertheless, below I've provided 
the results of a small test that was done on my laptop (Intel Core i3, Java 6, 
Xmx: default (don't care)):
||Subdirectory||#docs||Total Size (KB)||Average Size (KB)||Detection Time (ms)||Average Time (ms)||
|UTF-8|657|32,216|49|26658|40|
|Windows-1251|314|30,941|99|4423|14|
|GBK|419|43,374|104|20317|48|
|Windows-1256|645|66,592|103|9451|14|
|Shift_JIS|640|25,973|41|7617|11|

Let's take a slightly more precise look at these results. Due to the logic of 
my algorithm, for the first row of this table, i.e. UTF-8, only Mozilla JCharDet 
was used (no JSoup and no ICU4J). But as you can see from this table, the 
required time is greater than in three of the other cases, in which documents were 
parsed using JSoup and both JCharDet and ICU4J were involved in the 
detection process. It means that if the encoding of a page is UTF-8, the 
time required for a positive response from Mozilla JCharDet is often greater 
than the time required to … 
* get a negative response from Mozilla JCharDet +
* decode the input byte array using “ISO-8859-1” +
* parse that doc and create a DOM tree +
* extract text from the DOM tree +
* encode the extracted text using “ISO-8859-1” +
* detect its encoding using icu4j
… when the encoding of a page is not UTF-8!! In brief, 40 > 14, 11, … in the 
table above.
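
For illustration, a rough sketch of this two-stage flow (this is not the 
library's actual code; the JCharDet step is replaced here by a simple strict 
UTF-8 decode test, while the Jsoup and ICU4J calls use those libraries' real APIs):
{noformat}
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

import org.jsoup.Jsoup;

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class TwoStageDetectionSketch {

    static String detect(byte[] rawHtml) {
        // Stage 1: in the real library this is Mozilla JCharDet; a strict
        // UTF-8 decode test stands in for it in this sketch.
        if (looksLikeUtf8(rawHtml)) {
            return "UTF-8";
        }
        // Stage 2: decode losslessly as ISO-8859-1, strip markup with Jsoup,
        // re-encode the visible text, and let ICU4J decide.
        String decoded = new String(rawHtml, StandardCharsets.ISO_8859_1);
        String visibleText = Jsoup.parse(decoded).text();
        byte[] textBytes = visibleText.getBytes(StandardCharsets.ISO_8859_1);

        CharsetDetector icu = new CharsetDetector();
        icu.setText(textBytes);
        CharsetMatch match = icu.detect();
        return match == null ? null : match.getName();
    }

    // Stand-in for the JCharDet check: true if the bytes decode as valid UTF-8.
    static boolean looksLikeUtf8(byte[] rawHtml) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(rawHtml));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
{noformat}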

Now let's have a look at the [distribution of character encodings for 
websites|https://w3techs.com/technologies/history_overview/character_encoding]. Since 
~87% of all websites use UTF-8, if we compute the weighted average time for 
detecting the encoding of a typical HTML document, I think we would get a 
similar estimate for both IUST-HTMLCharDet and Tika-EncodingDetector (with the 
numbers above, roughly 0.87 × 40 ms + 0.13 × ~20 ms ≈ 37 ms per document for my 
library, where ~20 ms is the rough average of the non-UTF-8 rows). That is 
because this estimate is strongly biased by Mozilla JCharDet, and as we know 
this tool is used in both algorithms in a similar way. Nevertheless, for 
performance optimization I will run some tests on …
* using a regex instead of navigating the DOM tree to look for charsets in Meta 
tags
* stripping HTML markup, scripts, and embedded CSS directly instead of using an 
HTML parser

3) For computing the accuracy of Tika's legacy method I've provided a comment 
below your current evaluation results. As I've explained there, the results of 
your current evaluation can't be compared with my evaluation.

bq. Perhaps we could add some code to do that?
Of course, but from experience, when I use open-source projects in my own 
projects, due to versioning and updating considerations I don't move my code 
into them unless there is no other suitable solution/option. But pulling a part 
of their code into my projects is *another story*! :)



[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2016-08-01 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15401870#comment-15401870
 ] 

Tim Allison edited comment on TIKA-2038 at 8/1/16 11:17 AM:


bq. Then I remembered that almost all of the test files in my corpus have 
charset information in their Meta tags
To clarify, you're saying that almost all of the test files in the first corpus 
have charset information.  However, to confirm, in the second corpus (language 
dependent), that number drops to 50%, right?




[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2016-08-01 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15401985#comment-15401985
 ] 

Tim Allison edited comment on TIKA-2038 at 8/1/16 12:58 PM:


This includes the encodings as detected by: 1) Tika default, 2) HTML alone, 3) 
UniversalCharDet alone, 4) ICU4J alone

There are only 77 files for which the HTML detector is not able to extract an 
encoding in this set.  If we make the assumption that the html meta-header is 
most often correct, and use that as "ground truth" (with caveats!), we see the 
following when comparing to the other two detectors.

Many of these differences don't matter.  Of concern are those where 
Windows-1251 and Windows-1256 are misidentified.  From a handful of tests, it 
looks like ICU4J gets the correct encoding for those two encodings when we 
remove the markup.

Comparisons of UniversalCharDet to the HTMLDetector:
||HTMLDetector||UniversalEncodingDetector||Count||
|UTF-8|windows-1252|437|
|windows-1256|windows-1252|340|
|GBK|GB18030|320|
|windows-1256|x-MacCyrillic|159|
|GB2312|GB18030|77|
|windows-1256|NULL|34|
|windows-1256|ISO-8859-1|22|
|windows-1256|ISO-8859-5|17|
|windows-1256|KOI8-R|16|
|UTF-8|ISO-8859-1|16|
|Shift_JIS|NULL|5|
|windows-1252|x-MacCyrillic|5|
|windows-1251|x-MacCyrillic|4|
|GBK|windows-1252|3|
|windows-1256|windows-1255|3|
|Shift_JIS|windows-1252|2|
|UTF-8|GB18030|2|
|windows-1256|UTF-8|2|
|ISO-8859-1|x-MacCyrillic|1|
|UTF-8|windows-1251|1|
|windows-1256|IBM866|1|
|ISO-8859-1|windows-1252|1|
|windows-1256|ISO-8859-7|1|
|windows-1256|ISO-8859-8|1|

Comparisons of ICU4J to the HTMLDetector:
||HTMLDetector||ICU4J||Count||
|UTF-8|ISO-8859-1|465|
|windows-1256|ISO-8859-1|397|
|GBK|GB18030|314|
|windows-1251|ISO-8859-1|232|
|GB2312|GB18030|77|
|windows-1256|windows-1252|10|
|windows-1252|ISO-8859-1|7|
|GBK|ISO-8859-1|7|
|windows-1251|windows-1252|3|
|ISO-8859-1|windows-1252|2|
|UTF-8|GB18030|2|
|windows-1256|ISO-8859-2|2|
|windows-1256|UTF-16LE|1|
|ISO-8859-1|windows-1256|1|
|windows-1256|Big5|1|
|windows-1256|ISO-8859-9|1|
|GBK|windows-1252|1|
|GBK|EUC-KR|1|
|UTF-8|windows-1252|1|





[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2016-08-01 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15401985#comment-15401985
 ] 

Tim Allison edited comment on TIKA-2038 at 8/1/16 8:51 PM:
---

This includes the encodings as detected by: 1) Tika default, 2) HTML alone, 3) 
UniversalCharDet alone, 4) ICU4J alone

There are only 77 files for which the HTML detector is not able to extract an 
encoding in this set.  If we make the assumption that the html meta-header is 
most often correct, and use that as "ground truth" (with caveats!), we see the 
following when comparing to the other two detectors.

Many of these differences don't matter.  Of concern are those where 
Windows-1251 and Windows-1256 are misidentified.  From a handful of tests, it 
looks like ICU4J gets the correct encoding for those two encodings when we 
remove the markup in the  and  elements.
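
For anyone who wants to reproduce this kind of tally, a hedged sketch of the 
comparison harness (the charset found by {{HtmlEncodingDetector}} in the meta 
tag is treated as "ground truth", as above; class and variable names are 
illustrative only, not the code actually used to produce these numbers):
{noformat}
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

import org.apache.tika.detect.EncodingDetector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.html.HtmlEncodingDetector;
import org.apache.tika.parser.txt.UniversalEncodingDetector;

public class DetectorDisagreements {

    public static void main(String[] args) throws IOException {
        EncodingDetector truth = new HtmlEncodingDetector();       // meta-tag charset
        EncodingDetector other = new UniversalEncodingDetector();  // detector under test
        Map<String, Integer> counts = new TreeMap<>();

        try (DirectoryStream<Path> files = Files.newDirectoryStream(Paths.get(args[0]))) {
            for (Path file : files) {
                Charset fromMeta = detect(truth, file);
                if (fromMeta == null) {
                    continue;   // no meta charset, so no "ground truth" for this file
                }
                Charset detected = detect(other, file);
                String pair = fromMeta.name() + "|" + (detected == null ? "NULL" : detected.name());
                counts.merge(pair, 1, Integer::sum);   // tally ground truth vs. detected
            }
        }
        counts.forEach((pair, n) -> System.out.println(pair + "|" + n));
    }

    private static Charset detect(EncodingDetector detector, Path file) throws IOException {
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            return detector.detect(in, new Metadata());
        }
    }
}
{noformat}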

Comparisons of UniversalCharDet to the HTMLDetector:
||HTMLDetector||UniversalEncodingDetector||Count||
|UTF-8|windows-1252|437|
|windows-1256|windows-1252|340|
|GBK|GB18030|320|
|windows-1256|x-MacCyrillic|159|
|GB2312|GB18030|77|
|windows-1256|NULL|34|
|windows-1256|ISO-8859-1|22|
|windows-1256|ISO-8859-5|17|
|windows-1256|KOI8-R|16|
|UTF-8|ISO-8859-1|16|
|Shift_JIS|NULL|5|
|windows-1252|x-MacCyrillic|5|
|windows-1251|x-MacCyrillic|4|
|GBK|windows-1252|3|
|windows-1256|windows-1255|3|
|Shift_JIS|windows-1252|2|
|UTF-8|GB18030|2|
|windows-1256|UTF-8|2|
|ISO-8859-1|x-MacCyrillic|1|
|UTF-8|windows-1251|1|
|windows-1256|IBM866|1|
|ISO-8859-1|windows-1252|1|
|windows-1256|ISO-8859-7|1|
|windows-1256|ISO-8859-8|1|

Comparisons of ICU4J to the HTMLDetector:
||HTMLDetector||ICU4J||Count||
|UTF-8|ISO-8859-1|465|
|windows-1256|ISO-8859-1|397|
|GBK|GB18030|314|
|windows-1251|ISO-8859-1|232|
|GB2312|GB18030|77|
|windows-1256|windows-1252|10|
|windows-1252|ISO-8859-1|7|
|GBK|ISO-8859-1|7|
|windows-1251|windows-1252|3|
|ISO-8859-1|windows-1252|2|
|UTF-8|GB18030|2|
|windows-1256|ISO-8859-2|2|
|windows-1256|UTF-16LE|1|
|ISO-8859-1|windows-1256|1|
|windows-1256|Big5|1|
|windows-1256|ISO-8859-9|1|
|GBK|windows-1252|1|
|GBK|EUC-KR|1|
|UTF-8|windows-1252|1|





[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2016-08-03 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15406408#comment-15406408
 ] 

Tim Allison edited comment on TIKA-2038 at 8/3/16 6:51 PM:
---

I wrote a markup stripper that ignores content in tags, comments, 

[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2016-08-04 Thread Shabanali Faghani (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15407514#comment-15407514
 ] 

Shabanali Faghani edited comment on TIKA-2038 at 8/4/16 10:13 AM:
--

As I've said above, the URLs are available in the 
[./test-data/language-wise/|https://github.com/shabanali-faghani/IUST-HTMLCharDet/tree/master/test-data/language-wise]
 relative path of my repo. Note that you should use the last 8 files, not the 
directories. The results of my evaluation are available in the 
[results|https://github.com/shabanali-faghani/IUST-HTMLCharDet/tree/master/test-data/language-wise/results]
 sub-directory.


was (Author: faghani):
No. Maybe you’ve got the answer of this question by reading my recent comment, 
anyways for more clarifications…

*First corpus:* Since there was not any benchmark in this context, I’ve wrote a 
simple multi-threaded crawler to collect a fairly small one. I’ve used charset 
information that are available for almost half of the html pages in the HTTP 
header as validity measure. In fact the crawled pages that had charset 
information in their HTTP header were categorized in *corpus* directory by this 
information as subdirectory, e.g. GBK, Windows-1251, etc. (almost half of the 
all requested pages by my crawler), the other half were just simply ignored. 
Since, almost all html pages that HTTP servers provide clients with the 
information about their charset also have charset information in their Meta 
tags, almost all docs in the first corpus have this information, though these 
two information are not necessarily the same!

*Second corpus:* There is no second corpus as you think. That is just a 
collection of 148,297 URLs extracted from Alexa top 1 million sites by using 
[Top Level Domain 
(TLD)|https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains] names 
as the criteria for 8 languages. These URLs are available 
[here|https://github.com/shabanali-faghani/IUST-HTMLCharDet/tree/master/test-data/language-wise]
 (last 8 files, not directories). Again in this evaluation we used charset 
information in HTTP header as the validity measure/ground truth and since this 
information was available only for 85,292 URLs, the rest were ignored.
Some points…
* The actual URLs count that had charset information in HTTP header was greater 
than 85,292 but for the sake of various networking problems some of them were 
failed in fetching
* We didn’t persist these 85,292 pages, because we didn’t need to them anymore 
after the test and I think their estimated aggregate size was at least ~1.7 GIG 
(85,292 * 20 KB = 1,706 GB).

> A more accurate facility for detecting Charset Encoding of HTML documents
> -
>
> Key: TIKA-2038
> URL: https://issues.apache.org/jira/browse/TIKA-2038
> Project: Tika
>  Issue Type: Improvement
>  Components: core, detector
>Reporter: Shabanali Faghani
>Priority: Minor
> Attachments: comparisons_20160803b.xlsx, iust_encodings.zip, 
> tika_1_14-SNAPSHOT_encoding_detector.zip
>
>
> Currently, Tika uses icu4j for detecting charset encoding of HTML documents 
> as well as the other naturally text documents. But the accuracy of encoding 
> detector tools, including icu4j, in dealing with the HTML documents is 
> meaningfully less than from which the other text documents. Hence, in our 
> project I developed a library that works pretty well for HTML documents, 
> which is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within some of other Apache stuffs such as 
> Nutch, Lucene, Solr, etc. and these projects are strongly in connection with 
> the HTML documents, it seems that having such an facility in Tika also will 
> help them to become more accurate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2016-08-04 Thread Shabanali Faghani (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15407523#comment-15407523
 ] 

Shabanali Faghani edited comment on TIKA-2038 at 8/4/16 10:21 AM:
--

Maybe you've already got the answer to this question from my recent comment; 
anyway, for more clarification…

*First corpus:* Since there was no benchmark in this context, I wrote a 
simple multi-threaded crawler to collect a fairly small one. I used the charset 
information that is available in the HTTP header for almost half of the HTML 
pages as the validity measure. In fact, the crawled pages that had charset 
information in their HTTP header were categorized in the *corpus* directory with 
this information as the subdirectory name, e.g. GBK, Windows-1251, etc. (almost 
half of all the pages requested by my crawler); the other half were simply 
ignored. Since almost all HTML pages for which HTTP servers provide clients with 
charset information also have charset information in their Meta tags, almost all 
docs in the first corpus have this information, though the two values are not 
necessarily the same!

*Second corpus:* There is no second corpus as you think. It is just a 
collection of 148,297 URLs extracted from the Alexa top 1 million sites, using 
[Top Level Domain 
(TLD)|https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains] names 
as the criterion for 8 languages. These URLs are available 
[here|https://github.com/shabanali-faghani/IUST-HTMLCharDet/tree/master/test-data/language-wise]
 (the last 8 files, not the directories). Again, in this evaluation we used the 
charset information in the HTTP header as the validity measure/ground truth, and 
since this information was available only for 85,292 URLs, the rest were ignored.
Some points…
* The actual count of URLs that had charset information in the HTTP header was 
greater than 85,292, but due to various networking problems some of them 
failed to fetch.
* We didn't persist these 85,292 pages, because we didn't need them anymore 
after the test, and I think their estimated aggregate size was at least ~1.7 GB 
(85,292 × 20 KB ≈ 1.7 GB).



[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2016-08-12 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15418855#comment-15418855
 ] 

Tim Allison edited comment on TIKA-2038 at 8/12/16 1:51 PM:


bq.  But since I haven’t access to a broadband Internet connection

Oh, ok. I've been thinking about this a bit more.  I think I'd like to sample 
URLs from Common Crawl based on country codes in the URLs.  I can take care of 
this in a few weeks.

bq. Please send me your markup stripper so I can use it in my code to evaluate 
your both stripper and proposed algorithm.
I'll post that today.


bq. BTW, what is tika-eval code?

Code [here|https://github.com/tballison/tika/tree/TIKA-1302] still needs some 
work, but it evaluates the output of two runs of Tika and reports on 
differences in number of exceptions, mime detection diffs, content diff, etc.  
I was hoping to have time to get this ready for 1.14, but 1.15 is looking more 
likely.





[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2016-08-12 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15418855#comment-15418855
 ] 

Tim Allison edited comment on TIKA-2038 at 8/12/16 6:40 PM:


bq.  But since I haven’t access to a broadband Internet connection

Oh, ok. I've been thinking about this a bit more.  I think I'd like to sample 
URLs from Common Crawl based on country codes in the URLs.  I can take care of 
this in a few weeks.

bq. Please send me your markup stripper so I can use it in my code to evaluate 
your both stripper and proposed algorithm.
I'll post that today...if I have time.


bq. BTW, what is tika-eval code?

Code [here|https://github.com/tballison/tika/tree/TIKA-1302] still needs some 
work, but it evaluates the output of two runs of Tika and reports on 
differences in number of exceptions, mime detection diffs, content diff, etc.  
I was hoping to have time to get this ready for 1.14, but 1.15 is looking more 
likely.

You can see an example of the output of the comparison code 
[here|https://github.com/tballison/share/blob/master/poi_comparisons/reports_poi_3_15-beta3_reports.zip?raw=true].




[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2017-02-08 Thread Shabanali Faghani (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15857043#comment-15857043
 ] 

Shabanali Faghani edited comment on TIKA-2038 at 2/8/17 9:11 AM:
-

bq. I recognize that the mime types returned by the server are not necessarily 
correct, but this data might be useful.
Years ago, when I was a novice Java developer, I worked with mime types for a 
while, and I know they are unreliable. Hence, I’m very wary of using them for 
separating html documents. In this regard I suggest an “arrow with two targets” 
(a Persian proverb) solution! Since it seems that in this test the potential 
charset in meta headers is the only thing available to use as “ground 
truth”, if we use the 
[HtmlEncodingDetector|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlEncodingDetector.java]
 class of Tika (with the [META_TAG_BUFFER_SIZE 
|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlEncodingDetector.java#L42]
 field set to Integer.MAX_VALUE), then in addition to extracting potential 
charsets from meta headers, it will implicitly act as an html filter.
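
For illustration only, here is a minimal sketch (assumed usage, not part of the 
proposal's final code) of how Tika's existing HtmlEncodingDetector could both 
pull the declared meta charset and filter out documents that declare none. The 
helper name is made up, and raising META_TAG_BUFFER_SIZE would still require 
the change discussed above, since the shipped detector only scans the head of 
the stream.

{code:java}
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.html.HtmlEncodingDetector;

public class MetaCharsetFilter {

    // Returns the charset declared in the document's meta tags, or null if none
    // was found; a null result would drop the document from the "ground truth" set.
    public static Charset metaCharsetOrNull(byte[] rawHtml) throws IOException {
        return new HtmlEncodingDetector().detect(
                new BufferedInputStream(new ByteArrayInputStream(rawHtml)),
                new Metadata());
    }
}
{code}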

I think we must also throw away documents with multiple charsets in their meta 
headers (see TIKA-2050). This way we can also get rid of rss/feed documents 
whose mime types were set to html (we had some trouble with these documents in 
our project years ago). 

bq. If the goal is to get ~30k per tld, let's sample to obtain 50k on the 
theory that there are duplicates and other reasons for failure.
I think it would be better to use the idea in [this post| 
https://issues.apache.org/jira/browse/TIKA-2038?focusedCommentId=15422448&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15422448]
 for sampling. I will try to describe the idea in detail in the next few days.

bq. Any other tlds or mime defs we should add?
I suggest adding *.mx* (Mexico), *.co* (Colombia), and *.ar* (Argentina) in 
addition to *.es* for Spanish (the 2nd-ranked language by [native speakers| 
https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers]). 
There is also no tld in your list for Portuguese, so I suggest adding *.br* 
(Brazil) and *.pt* (Portugal). *.id* (Indonesia), *.my* (Malaysia), *.nl* 
(Netherlands), … are some other important tlds.


was (Author: faghani):
bq. I recognize that the mime types returned by the server are not necessarily 
correct, but this data might be useful.
Five years ago when I was a novice java developer I engaged with mime types for 
a while and I know they are unreliable. Hence, I’m very concerned to use them 
for separating html documents. In this regard I suggest “an arrow with two 
targets” (a Persian proverb)! It seems that in this test the potential charset 
in meta headers is the only available thing that we can use as “ground truth”. 
So, if we use the 
[HtmlEncodingDetector|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlEncodingDetector.java]
 class of Tika (with [META_TAG_BUFFER_SIZE 
|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlEncodingDetector.java#L42]
 field that is set to Integer.MAX_VALUE), in addition to extract potential 
charsets from meta headers, it implicitly will act as a html filter.

I think we must throw away documents with multiple charsets in meta headers 
(see TIKA-2050). This way we can also get rid from rss/feed documents that 
their mime type is set to html (we had some trouble with these documents in our 
project years ago). 

bq. If the goal is to get ~30k per tld, let's sample to obtain 50k on the 
theory that there are duplicates and other reasons for failure.
I think it would be better to use the idea in [this post| 
https://issues.apache.org/jira/browse/TIKA-2038?focusedCommentId=15422448&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15422448]
 for sampling. I will try to describe the idea in details in the next few days.

bq. Any other tlds or mime defs we should add?
I suggest to add *.mx* (Mexico), *.co* (Colombia), *.ar* (Argentina) in 
addition to *.es* for Spanish (the 2nd ranked language by [native speakers| 
https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers]). 
There is also no tld in your list for Portuguese, so I suggest to add *.br* 
(Brazil) and *.pt* (Portugal). *.id* (Indonesia), *.my* (Malaysia), *.nl* 
(Netherlands), … are some other important tlds.


[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2017-02-08 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15857926#comment-15857926
 ] 

Tim Allison edited comment on TIKA-2038 at 2/8/17 12:29 PM:


bq. Since it seems that in this test the potential charset in meta headers is 
the only thing available to use as “ground truth”, if we use the 
HtmlEncodingDetector class of Tika (with the META_TAG_BUFFER_SIZE field set to 
Integer.MAX_VALUE), then in addition to extracting potential charsets from meta 
headers, it will implicitly act as an html filter.

In the above sql/proposal, the mime type is what was returned in the actual 
http headers, as recorded by CommonCrawl.  They are still somewhat noisy.  
Let's put off talking about meta headers and evaluation until we gather the data.

In the attached, I applied a "dominant" language code to each country.  For 
countries with multiple "dominant" languages, I used the country code ("in" -> 
"in").  This is a very rough attempt to get decent coverage of languages.  I 
then calculated how many pages from each country we'd want to collect to get 
roughly 50k per language, as in the rough sketch below.
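
Purely for illustration, here is a toy version of that arithmetic (assumed 
logic, not the actual spreadsheet): each country gets an equal share of a 
per-language target, so the names and numbers below are hypothetical.

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TldQuotas {

    // Split a per-language page target evenly across the countries mapped to that
    // language; a real version would weight by each tld's share of Common Crawl.
    public static Map<String, Integer> quotas(Map<String, String> countryToLang,
                                              int pagesPerLanguage) {
        Map<String, List<String>> langToCountries = new HashMap<>();
        countryToLang.forEach((cc, lang) ->
                langToCountries.computeIfAbsent(lang, k -> new ArrayList<>()).add(cc));

        Map<String, Integer> quota = new HashMap<>();
        langToCountries.values().forEach(ccs ->
                ccs.forEach(cc -> quota.put(cc, pagesPerLanguage / ccs.size())));
        return quota;
    }

    public static void main(String[] args) {
        Map<String, String> countryToLang = new HashMap<>();
        countryToLang.put("es", "spanish");
        countryToLang.put("mx", "spanish");
        countryToLang.put("br", "portuguese");
        countryToLang.put("pt", "portuguese");
        countryToLang.put("de", "german");
        System.out.println(quotas(countryToLang, 50_000)); // e.g. {de=50000, es=25000, ...}
    }
}
{code}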

I added the codes you suggested above and a few others.  How does this look?



was (Author: talli...@mitre.org):
bq. Since it seems that in this test the potential charset in meta headers is 
the only available thing that we can use as “ground truth”, if we use the 
HtmlEncodingDetector class of Tika (with META_TAG_BUFFER_SIZE field that is set 
to Integer.MAX_VALUE), in addition to extract potential charsets from meta 
headers, it implicitly will act as a html filter.

In the above sql/proposal, the mime is what was returned in the actual http 
headers, as recorded by CommonCrawl.  They are still somewhat noisy.  Let's put 
off talk about metaheaders and evaluation until we gather the data.

In the attached, I applied a "dominant" language code to each country.  For 
countries with multiple "dominant" languages, I used the country code ("in" -> 
"in").  This is a very rough attempt to get decent coverage of languages.  I 
then calculate how many pages from each country we'd want to collect to get 
roughly 50k per language.

I added your country codes and a few others.  How does this look?




[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2017-02-10 Thread Shabanali Faghani (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15860415#comment-15860415
 ] 

Shabanali Faghani edited comment on TIKA-2038 at 2/10/17 9:15 AM:
--

In the attached file, the H column is a naive implementation of the idea I 
proposed before. _Starvation_ and _malnutrition_ are quite obvious for some 
tlds in this column, but altogether it properly reflects the distribution of 
the html documents for the selected tlds in Common Crawl. 


Although it’s possible to mitigate the problems of this sampling algorithm, I 
don’t think that is very important, because in my evaluations the accuracy of 
each detector algorithm converged to a specific number after processing just a 
small portion of each tld. So, I think selecting either method (mine or yours) 
for sampling won't have a meaningful effect on the results, although it will 
slightly affect the weighted aggregated results (see the + and * group bars in 
the coarse-grained result diagram of the lang-wise-eval attached files).

bq. Let's put off talking about meta headers and evaluation until we gather the 
data.

Ok.

bq. I added the codes you suggested above and a few others. How does this look?

Looks fine to me.


was (Author: faghani):
Attached, the H column is a naive implementation of the idea I’ve proposed 
before. _Starvation_ and _Malnutrition_ are quite obvious for some tlds in this 
column but altogether that properly reflects distribution of the selected tlds 
in Common Crawl. 


Although it’s possible to relieve the problems of this sampling but I think 
that isn’t so important, because as I’ve seen in my evaluations, after just a 
few percent of each tld got processed the accuracy of the all detector 
algorithms got converged. So, I think selecting either method (mine or yours) 
for sampling won't have a meaningful effect on the results, however will a bit 
affect on the weighted aggregated results (see + and * group bars in the 
coarse-grained result of the lang-wise-eval attached files).

bq. Let's put off talk about metaheaders and evaluation until we gather the 
data.

Ok.

bq. I added your the codes you added above and a few others. How does this look?

Looks fine to me, at least at this stage.



[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2017-03-04 Thread Shabanali Faghani (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15887801#comment-15887801
 ] 

Shabanali Faghani edited comment on TIKA-2038 at 3/4/17 6:31 PM:
-

Perfect reply, [~talli...@mitre.org]. Thank you!
 
bq. The current version of the stripper leaves in meta headers if they also 
include "charset". … I included the output of the stripped HTMLMeta detector as 
a sanity check … (/)
 
bq. I figure that we'll be modifying the stripper …
 
We might need the stripper to work like a SAX parser, i.e. the input should be 
an _InputStream_. This is required if we decide to be conservative about OOM 
errors or to avoid wasting resources on big html files. I know that writing a 
perfect _html stream stripper_ with minimal faults 
(false negatives/positives, exceptions, …) is very hard. As a SAX parser, 
TagSoup should be able to do this, but there are two problems: _chicken and 
egg_ and _performance_. The former can be solved by the _ISO-8859-1 
encoding-decoding_ trick, but there is no solution for the latter.
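
As a side note, here is a minimal sketch of that ISO-8859-1 round-trip trick 
under simplified assumptions; the regex stripper is just a stand-in for a real 
SAX-style pass and is not how any of the tools discussed here actually do it.

{code:java}
import java.nio.charset.StandardCharsets;

public class Latin1RoundTrip {

    // Decode the raw bytes as ISO-8859-1 (a lossless one-to-one byte-to-char
    // mapping), strip the markup on the resulting characters, then re-encode with
    // the same charset so the visible-text bytes reach the statistical detectors
    // unchanged, even though the real charset is still unknown.
    public static byte[] stripMarkup(byte[] rawHtml) {
        String pseudoText = new String(rawHtml, StandardCharsets.ISO_8859_1);
        String visibleText = pseudoText.replaceAll("<[^>]*>", " "); // naive, illustrative only
        return visibleText.getBytes(StandardCharsets.ISO_8859_1);
    }
}
{code}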

For a lightweight SAX-style stripper, I think we could ask [Jonathan Hedley| 
https://jhy.io/], the author of Jsoup, or someone else on Jsoup’s mailing list 
whether they have ever done something like this or could help us. We may also 
suggest/introduce IUST (the standalone version) to them. IIRC, in Jsoup 
1.6.1-3 (and most likely still) the charset of a page was assumed to be UTF-8 
if the http header didn’t contain any charset and no charset was specified in 
the input.
 
bq. … and possibly IUST.
 
The current version of IUST, i.e. htmlchardet-1.0.1, uses _early termination_ 
for neither JCharDet nor ICU4J! So, we would have to write a custom version of 
IUST to do so. Nevertheless, I think we can ignore this for the first version, 
because I don’t think it has a meaningful effect on the algorithm. In fact, I 
think calling the detection methods of JCharDet and ICU4J with an InputStream 
input will slightly increase efficiency at the cost of a slight decrease in 
accuracy.
 
bq. I didn't use IUST because this was a preliminary run, and I wasn't sure 
which version I should use. The one on github or the proposed modification 
above or both? Let me know which code you'd like me to run.
 
The _modified IUST_ isn’t complete yet. To complete it, we must prepare a 
thorough list of languages for which the stripping shouldn’t be done. These 
languages/tlds are determined by comparing the results of IUST with and 
without stripping. So, you should run both _htmlchardet-1.0.1.jar_ (IUST with 
stripping) with _lookInMeta=false_ and the class _IUSTWithoutMarkupElimination_ 
(IUST without stripping) from the [lang-wise-eval source code| 
https://issues.apache.org/jira/secure/attachment/12848364/lang-wise-eval_source_code.zip].
 The accuracy of the _modified IUST_ (the pseudo code above) can then be 
computed algorithmically by selecting the better of the two for each 
language/tld, as in the sketch below.
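
A toy sketch of that per-tld selection, assuming the two runs have already been 
reduced to hypothetical per-tld accuracy maps (the method and map names are 
made up for illustration):

{code:java}
import java.util.HashMap;
import java.util.Map;

public class ModifiedIustAccuracy {

    // For each tld, take the better of the "with stripping" and "without stripping"
    // accuracies; the input maps are assumed to come from the two evaluation runs.
    public static Map<String, Double> bestPerTld(Map<String, Double> withStripping,
                                                 Map<String, Double> withoutStripping) {
        Map<String, Double> best = new HashMap<>();
        withStripping.forEach((tld, acc) ->
                best.put(tld, Math.max(acc, withoutStripping.getOrDefault(tld, 0.0))));
        return best;
    }
}
{code}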
 
bq. I want to focus on accuracy first. We still have to settle on an eval 
method. But, yes, I do want to look at this. (/)


was (Author: faghani):
Perfect reply, [~talli...@mitre.org]. Thank you!
 
bq. The current version of the stripper leaves in  headers if they also 
include "charset". … I included the output of the stripped HTMLMeta detector as 
a sanity check … (/)
 
bq. I figure that we'll be modifying the stripper …
 
We might need the stripper works like a SAX parser, i.e the input should be 
_InputStream_. This is required if we decided to be too conservative about OOM 
error or avoiding from resource wasting for big html files. I know writing a 
perfect _html stream stripper_ with the minimal faults 
(false-negative/positive, exception, …) is very hard. As a SAX parser, TagSoup 
should be able to to do so but there are two problems including _chicken and 
egg_ and _performance_. The former problem can be solved by _ISO-8859-1 
encoding-decoding_ trick but there is no solution for the latter.

For a lightweight SAX-style stripper I think we can ask [Jonathan Hedley| 
https://jhy.io/], the author of Jsoup or someone else in Jsoup’s mailing list 
that if they ever have done a thing like this or could they help us. We may 
also suggest/introduce IUST (the standalone version) to them. This is quite 
like a gif entitled “_Adding a citation to a paper possibly written by the 
reviewer_” in [phd funnies| http://users.auth.gr/ksiop/phd_funny/index.html], 
mutual scratching!! IIRC, in Jsoup 1.6.1-3 (and most likely now) the charset of 
a page was supposed/considered as UTF-8 if the http header didn’t contain any 
charset or the charset was not specified in input.
 
bq. … and possibly IUST.
 
The current version of IUST, i.e htmlchardet-1.0.1, uses _early-termination_ 
for neither JCharDet nor ICU4j! So, we should write a custom version of IUST to 
do so. Oh, still a lot of works to do … :( N

[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2018-11-19 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692323#comment-16692323
 ] 

Hans Brende edited comment on TIKA-2038 at 11/19/18 10:28 PM:
--

[~faghani] 
[~talli...@apache.org] 

This issue inspired me to look into how jchardet implemented UTF-8 detection, 
since that library appears to be the key to much greater accuracy. Turns out, 
it's rather simple: it uses a UTF-8 state machine that goes into an error state 
if any invalid UTF-8 byte sequence is detected, and if not, keeps UTF-8 at 
index 0 of "probable charsets". Unfortunately, I did see that jchardet v. 1.1 
has two bugs: (1) legal code points in the Supplementary Multilingual Plane are 
counted as errors, and (2) illegal code points past 0x10FFFF are counted as 
legal.

To fix these two bugs and narrow the scope of what is needed from jchardet to 
solely UTF-8 detection, I ended up implementing an improved UTF-8 state machine 
which you might find useful here: https://github.com/HansBrende/f8. I also made 
it available on maven at: org.rypt:f8:1.0.
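
For readers unfamiliar with the approach, here is a hand-rolled sketch of that 
kind of byte-level state machine. It is not the f8 or jchardet code, and for 
brevity it ignores overlong forms, surrogates, and the U+10FFFF upper bound.

{code:java}
public class Utf8Check {

    // Walk the bytes, tracking how many continuation bytes are still expected;
    // any malformed or truncated sequence makes the input "not UTF-8".
    public static boolean looksLikeValidUtf8(byte[] bytes) {
        int pending = 0; // continuation bytes still expected
        for (byte b : bytes) {
            int v = b & 0xFF;
            if (pending > 0) {
                if ((v & 0xC0) != 0x80) {
                    return false;          // expected a 10xxxxxx continuation byte
                }
                pending--;
            } else if (v < 0x80) {
                // ASCII byte, nothing to do
            } else if ((v & 0xE0) == 0xC0) {
                pending = 1;               // 110xxxxx: 2-byte sequence
            } else if ((v & 0xF0) == 0xE0) {
                pending = 2;               // 1110xxxx: 3-byte sequence
            } else if ((v & 0xF8) == 0xF0) {
                pending = 3;               // 11110xxx: 4-byte sequence (includes the SMP)
            } else {
                return false;              // stray continuation byte or 0xF8-0xFF lead
            }
        }
        return pending == 0;               // no sequence truncated at the end
    }
}
{code}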

Peering into the source code of the IUST project, I see that the following 
lines:

{code:java}
charset = HTMLCharsetDetector.mozillaJCharDet(rawHtmlByteSequence);
if (charset.equalsIgnoreCase("UTF-8")) {
    return Charsets.normalize(charset);
}

private static String mozillaJCharDet(byte[] bytes) {
    nsDetector det = new nsDetector(nsDetector.ALL);
    det.DoIt(bytes, bytes.length, false);
    det.DataEnd();
    return det.getProbableCharsets()[0];
}
{code}

could be replaced with:
{code:java}
org.rypt.f8.Utf8Statistics stats = new org.rypt.f8.Utf8Statistics();
stats.write(rawHtmlByteSequence);
if (stats.countInvalid() == 0) {
    return "UTF-8";
}
{code}

without loss of accuracy (and in fact, with greater accuracy, due to the 2 
bugfixes).

Furthermore, by taking a hint from ICU4j (which counts an InputStream as valid 
UTF-8 as long as the number of valid UTF-8 multi-byte sequences is at least an 
*order of magnitude* greater than the number of invalid UTF-8 sequences to 
allow for possibly corrupted UTF-8 data), this method could be further improved 
by doing:

{code:java}
org.rypt.f8.Utf8Statistics stats = new org.rypt.f8.Utf8Statistics();
stats.write(rawHtmlByteSequence);
if (stats.looksLikeUtf8()) { // implemented as: countValid() > countInvalidIgnoringTruncation() * 10
    return "UTF-8";
}
{code}

Please let me know your thoughts!




was (Author: hansbrende):
[~faghani] 
[~talli...@apache.org] 

This issue inspired me to look into how jchardet implemented UTF-8 detection, 
since that library appears to be the key to much greater accuracy. Turns out, 
it's rather simple: it uses a UTF-8 state machine that goes into an error state 
if any invalid UTF-8 byte sequence is detected, and if not, keeps UTF-8 at 
index 0 of "probable charsets". Unfortunately, I did see that jchardet v. 1.1 
has two bugs: (1) legal code points in the Supplementary Multilingual Plane are 
counted as errors, and (2) illegal code points past 0x10 are counted as 
legal.

To fix these two bugs and narrow the scope of what is needed from jchardet to 
solely UTF-8 detection, I ended up implementing an improved UTF-8 state machine 
which you might find useful here: https://github.com/HansBrende/f8. I also made 
it available on maven at: org.rypt:f8:1.0.

Peering into the source code of the IUST project, I see that the following 
lines:

{code:java}
charset = HTMLCharsetDetector.mozillaJCharDet(rawHtmlByteSequence);
if (charset.equalsIgnoreCase("UTF-8")) {
return Charsets.normalize(charset);
}

private static String mozillaJCharDet(byte[] bytes) {
nsDetector det = new nsDetector(nsDetector.ALL);
det.DoIt(bytes, bytes.length, false);
det.DataEnd();
return det.getProbableCharsets()[0];
}
{code}

could be replaced with:
{code:java}
org.rypt.f8.Utf8Statistics stats = new org.rypt.f8.Utf8Statistics();
stats.write(rawHtmlByteSequence);
if (stats.countInvalid() == 0) {
return "UTF-8";
}
{code}

without loss of accuracy (and in fact, with greater accuracy, due to the 2 
bugfixes).

Futhermore, by taking a hint from ICU4j (which counts an InputStream as valid 
UTF-8 as long as the number of valid UTF-8 multi-byte sequences is at least an 
*order of magnitude* greater than the number of invalid UTF-8 sequences to 
allow for possibly corrupted UTF-8 data), this method could be further improved 
by doing:

{code:java}
org.rypt.f8.Utf8Statistics stats = new org.rypt.f8.Utf8Statistics();
stats.write(rawHtmlByteSequence);
if (stats.looksLikeUtf8()) {
return "UTF-8";
}
{code}

Please let me know your thoughts!




[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2018-11-21 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16694940#comment-16694940
 ] 

Hans Brende edited comment on TIKA-2038 at 11/21/18 5:03 PM:
-

Alternatively, you could use guava's 
{{com.google.common.base.Utf8.isWellFormed(byte[])}} method, which will do 
exactly the same thing as the jchardet implementation (minus counting 0x0E, 
0x0F, and 0x1B as illegal, and minus the two bugs I mentioned). This is 
definitely the most performant option, although you'd lack more detailed text 
statistics about the number of valid/invalid/ascii sequences.


was (Author: hansbrende):
Alternatively, you could use guava's 
{{com.google.common.base.Utf8.isWellFormed(byte[])}} method, which will do 
exactly the same thing as the jchardet implementation (minus counting 0x0E, 
0x0F, and 0x1B as illegal). This is definitely the most performant option, 
although you'd lack more detailed text statistics about the number of 
valid/invalid/ascii sequences.



[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2018-11-21 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16694940#comment-16694940
 ] 

Hans Brende edited comment on TIKA-2038 at 11/21/18 5:05 PM:
-

Alternatively, you could use guava's 
{{com.google.common.base.Utf8.isWellFormed(byte[])}} method, which will do 
exactly the same thing as the jchardet implementation (minus counting 0x0E, 
0x0F, and 0x1B as illegal, and minus the two bugs I mentioned; also, if the 
input is truncated in the middle of a valid byte sequence, it will count that 
as invalid too, whereas jchardet would count it as valid). This is definitely 
the most performant option, although you'd lack more detailed text statistics 
about the number of valid/invalid/ascii sequences.
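
A minimal sketch of that alternative (assumed usage; the helper name is made up 
and the fallback to the other detectors is left to the caller):

{code:java}
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

import com.google.common.base.Utf8;

public class GuavaUtf8Gate {

    // Report UTF-8 only when the raw bytes are well-formed UTF-8; a null result
    // signals the caller to fall back to the other detectors (ICU4J, jchardet, ...).
    public static Charset detectUtf8OrNull(byte[] rawHtmlByteSequence) {
        return Utf8.isWellFormed(rawHtmlByteSequence) ? StandardCharsets.UTF_8 : null;
    }
}
{code}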


was (Author: hansbrende):
Alternatively, you could use guava's 
{{com.google.common.base.Utf8.isWellFormed(byte[])}} method, which will do 
exactly the same thing as the jchardet implementation (minus counting 0x0E, 
0x0F, and 0x1B as illegal, and minus the two bugs I mentioned). This is 
definitely the most performant option, although you'd lack more detailed text 
statistics about the number of valid/invalid/ascii sequences.



[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2018-11-25 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698305#comment-16698305
 ] 

Hans Brende edited comment on TIKA-2038 at 11/26/18 4:54 AM:
-

[~faghani] Thanks for the response! If my understanding of the jchardet & IUST 
source code is correct, splitting off the UTF-8 detector should be possible, 
because the method
{code:java}
getProbableCharsets(){code}
does not return the charsets in the order of "best match first" (as icu4j 
does), but rather, in the order of "first tested first" (and UTF-8 is *always* 
at index 0 in this ordering if it was not detected to be invalid).


was (Author: hansbrende):
[~faghani] Thanks for the response! If my understanding of the jchardet & IUST 
source code is correct, splitting off the UTF-8 detector should be possible, 
because the method {code:java}getProbableCharsets(){code} does not return the 
charsets in the order of "best match first" (as Tika does), but rather, in the 
order of "first tested first" (and UTF-8 is *always* at index 0 in this 
ordering if it was not detected to be invalid).



[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2018-11-25 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698513#comment-16698513
 ] 

Hans Brende edited comment on TIKA-2038 at 11/26/18 7:10 AM:
-

Here's a more rigorous demonstration of my claim (by counterexample): if 
jchardet ordered {{getProbableCharsets()}} by "best match first", then we would 
expect the first element of {{getProbableCharsets()}} to always match the 
charset reported to the {{nsICharsetDetectionObserver}}. However, that is not 
the case, as evidenced by the following unit test:

{code:java}
@Test
public void checkReportedMatchesFirstProbable() {
    final byte[] testBytes = {
            0x40, 0x32, 0x2A, 0x3E, 0x13, 0x2D, 0x61, 0x35,
            0x72, 0x12, 0x1C, 0x1A, 0x2B, 0x0B, 0x6A, 0x08,
            0x55, 0x7C, 0x1F, 0x6E, 0x56, 0x7D, 0x7E, 0x7B,
            0x05, 0x32, 0x7E, 0x7D, 0x73
    };
    ArrayList<String> reportedCharsets = new ArrayList<>();
    nsICharsetDetectionObserver observer = reportedCharsets::add;

    nsDetector det = new nsDetector(nsDetector.ALL);
    det.Init(observer);
    det.DoIt(testBytes, testBytes.length, false);
    det.DataEnd();

    org.junit.Assert.assertEquals(1, reportedCharsets.size());
    org.junit.Assert.assertEquals(reportedCharsets.get(0),
            det.getProbableCharsets()[0]);
}
{code}

Results in a FAILED test:
{noformat}
org.junit.ComparisonFailure: 
Expected :HZ-GB-2312
Actual   :UTF-8

Process finished with exit code 255
{noformat}


was (Author: hansbrende):
Here's a more rigorous demonstration of my claim (by counterexample): Supposing 
jchardet ordered {{getProbableCharsets()}} by "best match first", then we would 
expect the first element of {{getProbableCharsets()}} to always match the 
charset reported to the {{nsICharsetDetectionObserver}}. However, that is not 
the case, as evident by the following unit test:

{code:java}
@Test
public void checkReportedMatchesFirstProbable() {
final byte[] testBytes = {
0x40, 0x32, 0x2A, 0x3E, 0x13, 0x2D, 0x61, 0x35,
0x72, 0x12, 0x1C, 0x1A, 0x2B, 0x0B, 0x6A, 0x08,
0x55, 0x7C, 0x1F, 0x6E, 0x56, 0x7D, 0x7E, 0x7B,
0x05, 0x32, 0x7E, 0x7D, 0x73
};
ArrayList reportedCharsets = new ArrayList<>();
nsICharsetDetectionObserver observer = reportedCharsets::add;

nsDetector det = new nsDetector(nsDetector.ALL);
det.Init(observer);
det.DoIt(testBytes, testBytes.length, false);
det.DataEnd();

org.junit.Assert.assertEquals(reportedCharsets.get(0), 
det.getProbableCharsets()[0]);
}
{code}

Results in a FAILED test:
{noformat}
org.junit.ComparisonFailure: 
Expected :HZ-GB-2312
Actual   :UTF-8

Process finished with exit code 255
{noformat}



[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2018-11-26 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698539#comment-16698539
 ] 

Hans Brende edited comment on TIKA-2038 at 11/26/18 2:38 PM:
-

The success of this IUST implementation (even if based on "mis"-using jchardet) 
makes perfect sense to me though: UTF-8 currently makes up over 92% of the web. 
Therefore, UTF-8 *should* be biased towards false positives (or rather, a lack 
of false negatives, even at the expense of false positives), as that will 
result in an average increase in accuracy, whereas, if anything other than 
UTF-8 is biased towards false positives, that is practically guaranteed to 
decrease overall accuracy.

IMHO, the absence of indicators that data is not UTF-8 encoded should be 
sufficient indication that it is. (And IUST's success seems to support this 
notion.) This probably wasn't the case even 5 years ago, but today, the only 
way to make encoding detectors more accurate is to first identify the best 
indicators that data is *not* UTF-8 encoded, and only *then* to fall back to 
other non-UTF-8 detection algorithms.


was (Author: hansbrende):
The success of this IUST implementation (even if based on "mis"-using jchardet) 
makes perfect sense to me though: UTF-8 currently makes up over 92% of the web. 
Therefore, UTF-8 *should* be biased towards false positives, as that will 
result in an average increase in accuracy, whereas, if anything other than 
UTF-8 is biased towards false positives, that is practically guaranteed to 
decrease overall accuracy.

IMHO, the absence of indicators that data is not UTF-8 encoded should be 
sufficient indication that it is. (And IUST's success seems to support this 
notion.) This probably wasn't the case even 5 years ago, but today, the only 
way to make encoding detectors more accurate is to first identify the best 
indicators that data is *not* UTF-8 encoded, and only *then* to fall back to 
other non-UTF-8 detection algorithms.



[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2018-12-02 Thread Shabanali Faghani (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698268#comment-16698268
 ] 

Shabanali Faghani edited comment on TIKA-2038 at 12/2/18 10:32 PM:
---

[~HansBrende] thank you for your interest in IUST and for your great analysis.

With regard to your work here and on TIKA-2771, and also 
[CommonCrawl3|https://wiki.apache.org/tika/CommonCrawl3] and TIKA-2750 by 
[~talli...@apache.org], it looks like it's time to resume this thread.

The algorithm of jchardet is just like what you've described. To make IUST 
more efficient and standalone with no dependencies, I also made a small attempt 
to separate jchardet's UTF-8 detector after my last comment here. If I remember 
correctly, it keeps a small list that is correlated with its detectors and, at 
the end of the detection process, it scans this list to find the best match. 
So, I thought it was impossible to split off its UTF-8 detector, because it 
might sometimes detect the charset of a page as something other than UTF-8 due 
to a higher probability elsewhere in the list, even in the presence of UTF-8. 
If that is true, then in the absence of the other detectors jchardet will 
detect these cases as UTF-8, which means its false positives for UTF-8 will 
increase (true negatives will decrease) ... I don't know, maybe dramatically!

I'll test the false-positive and true-negative rates of f8 and compare them 
with jchardet's. I hope I've been wrong.

I'll take care of this next week ... right now I'm on holiday and am typing on 
my mobile phone!


was (Author: faghani):
[~HansBrende] thank you for your interest to IUST and for your great analysis.

With regard to your work here and on Tika-2771 and also 
[CommonCrawl3|https://wiki.apache.org/tika/CommonCrawl3] and Tika-2750 by 
[~talli...@apache.org], looks like it's the time to resume this thread.

The algorithm of jchardet is just like what you've described. To make IUST more 
efficient and standalone with no dependency, I did also a small try to separate 
jchardet's UTF-8 detector after my last comment here. If I remember correctly, 
it keeps a small list that is correlated to its detectors and at the end of 
detection process it scans this list to find the best match. So, I thought it's 
impossible to split its UTF-8 detector, because sometimes it might detect the 
charset of a page something other than UTF-8 due to a higher probability in 
precence of UTF-8 in the list. If this would be true, in absence of other 
detectors jchardet will detect these cases as UTF-8 and this means that its 
false-positive for UTF-8 will be increased (true-negative will be decreased), 
... don't know maybe dramatically!

I'll test the false-positive and true-negative of f8 and compare it with 
jchardet. Hope I've been wrong.

I'll take care of this next week ... now I'm on holiday and am typing with my 
mobile phone!



[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

2018-12-02 Thread Shabanali Faghani (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16706496#comment-16706496
 ] 

Shabanali Faghani edited comment on TIKA-2038 at 12/3/18 6:44 AM:
--

{quote}UTF-8 currently makes up over 92% of the web.
{quote}
You have a good understanding of the context. I've noticed that you pointed to 
it in TIKA-2771, too. Hence, I decided to provide further information about it, 
based on the _results_ table in _AGGREGATED-RESULTS.db_ in the _fine-grained_ 
folder of [^lang-wise-eval_results.zip].
{code:sql}
SELECT
  language AS Language, count(1) AS Total_Docs,
  round(100.0 * count(CASE WHEN httpcharset = 'UTF-8' THEN 1 END ) / count(1), 
2) AS UTF_8,
  round(100.0 * count(CASE WHEN httpcharset = 'ISO-8859-1' THEN 1 END ) / 
count(1), 2) AS ISO_8859_1,
  round(100.0 * count(CASE WHEN httpcharset = 'Windows-1256' THEN 1 END ) / 
count(1), 2) AS Windows_1256,
  round(100.0 * count(CASE WHEN httpcharset = 'Windows-1252' THEN 1 END ) / 
count(1), 2) AS Windows_1252,
  round(100.0 * count(CASE WHEN httpcharset = 'ISO-8859-15' THEN 1 END ) / 
count(1), 2) AS ISO_8859_15,
  round(100.0 * count(CASE WHEN httpcharset = 'GB2312' THEN 1 END ) / count(1), 
2) AS GB2312,
  round(100.0 * count(CASE WHEN httpcharset = 'EUC-KR' THEN 1 END ) / count(1), 
2) AS EUC_KR,
  round(100.0 * count(CASE WHEN httpcharset = 'ISO-8859-9' THEN 1 END ) / 
count(1), 2) AS ISO_8859_9,
  round(100.0 * count(CASE WHEN httpcharset = 'GBK' THEN 1 END ) / count(1), 2) 
AS GBK,
  round(100.0 * count(CASE WHEN httpcharset = 'GB18030' THEN 1 END ) / 
count(1), 2) AS GB18030,
  round(100.0 * count(CASE WHEN httpcharset = 'EUC-JP' THEN 1 END ) / count(1), 
2) AS EUC_JP,
  round(100.0 * count(CASE WHEN httpcharset = 'Shift_JIS' THEN 1 END ) / 
count(1), 2) AS Shift_JIS,
  round(100.0 * count(CASE WHEN httpcharset = 'ISO-2022-JP' THEN 1 END ) / 
count(1), 2) AS ISO_2022_JP,
  round(100.0 * count(CASE WHEN httpcharset = 'US-ASCII' THEN 1 END ) / 
count(1), 2) AS US_ASCII,
  round(100.0 * count(CASE WHEN httpcharset = 'ISO-8859-2' THEN 1 END ) / 
count(1), 2) AS ISO_8859_2,
  round(100.0 * count(CASE WHEN httpcharset = 'Windows-1251' THEN 1 END ) / 
count(1), 2) AS Windows_1251,
  round(100.0 * count(CASE WHEN httpcharset = 'KOI8-R' THEN 1 END ) / count(1), 
2) AS KOI8_R
FROM RESULTS
GROUP BY language
UNION ALL
SELECT
  'X-ALL', count(1),
  round(100.0 * count(CASE WHEN httpcharset = 'UTF-8' THEN 1 END ) / count(1), 
2) AS UTF_8,
  round(100.0 * count(CASE WHEN httpcharset = 'ISO-8859-1' THEN 1 END ) / 
count(1), 2) AS ISO_8859_1,
  round(100.0 * count(CASE WHEN httpcharset = 'Windows-1256' THEN 1 END ) / 
count(1), 2) AS Windows_1256,
  round(100.0 * count(CASE WHEN httpcharset = 'Windows-1252' THEN 1 END ) / 
count(1), 2) AS Windows_1252,
  round(100.0 * count(CASE WHEN httpcharset = 'ISO-8859-15' THEN 1 END ) / 
count(1), 2) AS ISO_8859_15,
  round(100.0 * count(CASE WHEN httpcharset = 'GB2312' THEN 1 END ) / count(1), 
2) AS GB2312,
  round(100.0 * count(CASE WHEN httpcharset = 'EUC-KR' THEN 1 END ) / count(1), 
2) AS EUC_KR,
  round(100.0 * count(CASE WHEN httpcharset = 'ISO-8859-9' THEN 1 END ) / 
count(1), 2) AS ISO_8859_9,
  round(100.0 * count(CASE WHEN httpcharset = 'GBK' THEN 1 END ) / count(1), 2) 
AS GBK,
  round(100.0 * count(CASE WHEN httpcharset = 'GB18030' THEN 1 END ) / 
count(1), 2) AS GB18030,
  round(100.0 * count(CASE WHEN httpcharset = 'EUC-JP' THEN 1 END ) / count(1), 
2) AS EUC_JP,
  round(100.0 * count(CASE WHEN httpcharset = 'Shift_JIS' THEN 1 END ) / 
count(1), 2) AS Shift_JIS,
  round(100.0 * count(CASE WHEN httpcharset = 'ISO-2022-JP' THEN 1 END ) / 
count(1), 2) AS ISO_2022_JP,
  round(100.0 * count(CASE WHEN httpcharset = 'US-ASCII' THEN 1 END ) / 
count(1), 2) AS US_ASCII,
  round(100.0 * count(CASE WHEN httpcharset = 'ISO-8859-2' THEN 1 END ) / 
count(1), 2) AS ISO_8859_2,
  round(100.0 * count(CASE WHEN httpcharset = 'Windows-1251' THEN 1 END ) / 
count(1), 2) AS Windows_1251,
  round(100.0 * count(CASE WHEN httpcharset = 'KOI8-R' THEN 1 END ) / count(1), 
2) AS KOI8_R
FROM RESULTS;
 -- row-as-column queries without a pivot table tend to become verbose!
{code}
||Language||Total_Docs||UTF_8||ISO_8859_1||Windows_1256||Windows_1252||ISO_8859_15||GB2312||EUC_KR||ISO_8859_9||GBK||GB18030||EUC_JP||Shift_JIS||ISO_2022_JP||US_ASCII||ISO_8859_2||Windows_1251||KOI8_R||
||Arabic|1168|95.21|1.11|3|0.51|0.09|0.09|0|0|0|0|0|0|0|0|0|0|0|
||Chinese|3860|85.8|0.36|0|0|0|6.97|0|0|6.79|0.05|0.03|0|0|0|0|0|0|
||English|13582|96.43|3.38|0|0.09|0.07|0|0|0|0|0|0|0|0|0.01|0.01|0.01|0|
||French|8712|93.63|4.94|0|0.61|0.83|0|0|0|0|0|0|0|0|0|0|0|0|
||German|24884|91.24|7.1|0|0.23|1.39|0|0|0.01|0|0|0|0|0|0|0.02|0.01|0|
||Indian|6158|97.84|1.92|0|0|0.03|0|0.02|0.02|0|0|0|0.02|0|0|0|0.16|0|
||Italian|7371|94.44|4.95|0|0.34|0.26|0|0|0|0|0|0|0|0|0.01|0|0|0|
||Japanese|7736|89.88|0.28|0|0|0|0|0.03|0|0|0|3.94