Re: AutoDetectParser bug?

2015-10-14 Thread Ziqi Zhang
Thanks
I have created an issue.

Setting metadata.set(RESOURCE_NAME_KEY, filename) did not work either. For now 
I am telling the parser explicitly that the files are plain text. It would be 
really nice to have this addressed, though, because I would like to use the 
auto-detect ability in my app.

regards




> On 14 Oct 2015, at 11:11, Nick Burch  wrote:
> 
> On Wed, 14 Oct 2015, Ziqi Zhang wrote:
>> As for bugzilla, I was unable to create a new bug, as it is saying “first 
>> you must pick a product…” and there is no tika in the list.
> 
> Sorry, wrong project - POI uses Bugzilla, Tika uses JIRA, I wasn't paying 
> enough attention!
> 
> The starting point for reporting the bug is:
>   https://issues.apache.org/jira/browse/TIKA
> 
> Nick



Re: AutoDetectParser bug?

2015-10-14 Thread Nick Burch

On Wed, 14 Oct 2015, Ziqi Zhang wrote:
> As for bugzilla, I was unable to create a new bug, as it is saying 
> “first you must pick a product…” and there is no tika in the list.


Sorry, wrong project - POI uses Bugzilla, Tika uses JIRA, I wasn't paying 
enough attention!


The starting point for reporting the bug is:
   https://issues.apache.org/jira/browse/TIKA

Nick

Re: AutoDetectParser bug?

2015-10-14 Thread Ziqi Zhang
Many thanks

As for bugzilla, I was unable to create a new bug, as it is saying “first you 
must pick a product…” and there is no tika in the list.



> On 14 Oct 2015, at 10:40, Konstantin Gribov  wrote:
> 
> This is the result of a false-positive MIME type detection. In the first 
> case, the file starts with "ID3", which usually appears at the start of MP3 
> (audio/mpeg) files. The other two files start with P1 or P4, which appear at 
> the start of image/x-portable-bitmap files.
> 
> You can either use the text parser directly or pass the filename via 
> metadata using metadata.set(RESOURCE_NAME_KEY, filename).
> 
> -- 
> Best regards,
> Konstantin Gribov



Re: AutoDetectParser bug?

2015-10-14 Thread Konstantin Gribov
This is the result of a false-positive MIME type detection. In the first
case, the file starts with "ID3", which usually appears at the start of MP3
(audio/mpeg) files. The other two files start with P1 or P4, which appear at
the start of image/x-portable-bitmap files.

You can either use the text parser directly or pass the filename via metadata
using metadata.set(RESOURCE_NAME_KEY, filename).
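
To make the false positive concrete, here is a minimal sketch in plain Java of
how a check on leading bytes alone can misclassify text. It is only an
illustration and not Tika's actual detection code: the class name
MagicPrefixDemo, the guessType method and the three-entry signature table are
all hypothetical, and a real detector uses far more rules and hints than this.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified illustration only: this is NOT Tika's detector, just a
// prefix check over the three signatures mentioned in this thread.
final class MagicPrefixDemo {

    // Hypothetical signature table: leading characters -> guessed type.
    private static final Map<String, String> SIGNATURES = new LinkedHashMap<>();
    static {
        SIGNATURES.put("ID3", "audio/mpeg");
        SIGNATURES.put("P1", "image/x-portable-bitmap");
        SIGNATURES.put("P4", "image/x-portable-bitmap");
    }

    // Returns the type of the first signature the data starts with,
    // or text/plain when no signature matches.
    static String guessType(byte[] data) {
        int len = Math.min(data.length, 8);
        String head = new String(data, 0, len, StandardCharsets.ISO_8859_1);
        for (Map.Entry<String, String> e : SIGNATURES.entrySet()) {
            if (head.startsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return "text/plain";
    }

    public static void main(String[] args) {
        String[] samples = {
            // Opening words of one of the attached test files.
            "ID3 , C4 .5 and C5 .0 produce decision trees",
            // Made-up samples for comparison.
            "P4 is a hypothetical first line",
            "Hello, this is ordinary text"
        };
        for (String s : samples) {
            byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
            System.out.println(guessType(bytes) + " <- " + s);
        }
    }
}
```

A prefix-only check like this cannot tell a sentence about the ID3 algorithm
from an MP3 header, which is why an extra hint such as the file name or an
explicit content type can help the detector decide.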

-- 
Best regards,
Konstantin Gribov


Re: Fwd: AutoDetectParser bug?

2015-10-14 Thread Nick Burch

On Wed, 14 Oct 2015, Ziqi Zhang wrote:

> My apologies, here are the testing files attached.


Any chance you could open a bug in bugzilla, and attach these files there?

At first glance, it looks like those files have certain text patterns near 
the start which are causing them to be mis-detected as non-text. We might 
need to try altering those patterns a bit.


Otherwise, if you know a file really is a text file, you could either call 
the text parser directly, or try setting the content type to strengthen 
the hint.


Nick


Fwd: AutoDetectParser bug?

2015-10-14 Thread Ziqi Zhang
My apologies, here are the testing files attached.

Begin forwarded message:

From: Ziqi Zhang <ziqi.zh...@sheffield.ac.uk>
Date: 14 October 2015 at 10:06:33 BST
To: user@tika.apache.org
Subject: AutoDetectParser bug?

Hi

There might be a bug with the AutoDetectParser, which fails to recognise some 
plain-text files as plain text.

In the attachment are three testing files, as you can see they are all plain 
text.

The following code is used for my testing:

AutoDetectParser parser = new AutoDetectParser();
for (File f : new File("/Users/-/work/jate/experiment/bugged_corpus").listFiles()) {
    InputStream in = new BufferedInputStream(new FileInputStream(f.toString()));
    BodyContentHandler handler = new BodyContentHandler(-1);
    Metadata metadata = new Metadata();
    try {
        parser.parse(in, handler, metadata);
        String content = handler.toString();
        System.out.println(metadata); //line A
    } catch (Exception e) {
        e.printStackTrace();
    }
}

for the three testing files, I would expect line A to print “plain text”, in 
fact, it is printing the following:

X-Parsed-By=org.apache.tika.parser.EmptyParser 
Content-Type=image/x-portable-bitmap

X-Parsed-By=org.apache.tika.parser.DefaultParser 
X-Parsed-By=org.apache.tika.parser.mp3.Mp3Parser xmpDM:audioCompressor=MP3 
Content-Type=audio/mpeg

X-Parsed-By=org.apache.tika.parser.EmptyParser 
Content-Type=image/x-portable-bitmap

And as a result, variable “content” is always empty.

Any suggestions on this please?

Thanks

ID3 , C4 .5 and C5 .0 produce decision trees , RIPPER isa rule-based learner 
and the Naive Bayes algorithm computes conditional probabilities of the classes 
from the instances .
In all experiments the SVM_Light system outperformed other learning algorithms 
, which confirms Yang 's -LRB- Yang and Liu , 1999 -RRB- results for svms fed 
with Reuters data .
For LVQ the decrease may be due to the fact that no adaptations to results , 
allowing to measure the accuracy of the top five alternatives -LRB- Best5 -RRB- 
.
If a new category is introduced , the accuracy will slightly decline until 30 
documents are manually classified and the category is automatically included 
into a new classifier .
STP may use both general linguistic knowledge and linguistic algorithms or 
heuristics adapted to the application in order to extract information from 
texts that is relevant for classification .
Obviously , the change of topics can be accommodated by adding new categories 
and e-mails and producing a new classifier on the basis of old and new data .
These properties influenced the system architecture , which is presented in 
Section 3 . Various publicly available SML systems have been tested with 
different methods of STP-based preprocessing .
Call center agents judge the performance of ICC-MAIL most easily in terms of 
accuracy : In what percentage of cases does the classifier suggest the correct 
text block ?
A client\/server solution was built that allows the call center agents to 
connect as clients to the ICe-MAIL server , which implements the system 
described in Section 3 .
MorphAna : Morphological Analysis provided by sines yields the word stems of 
nouns , verbs and adjectives , as well as the full forms of unknown words .
Combined : In order to emphasize words found relevant by the STP heuristics 
without losing other information retrieved by MorphAna , the previous two 
techniques are combined .
If an e-mail contains several questions , the classification process can be 
repeated by marking each question and iteratively applying the process to the 
marked part .
the domain were made , such as adapting the number of codebook vectors , the 
initial learning parameters or the number of iterations during training -LRB- 
cf.
We noted that in six trials the accuracy could be improved in Combined compared 
to MorphAna , but in four trials , boosting led to deterioration .
This includes heuristics for the identification of multiple requests in a 
single e-mail that could be based on key words and key phrases as well as on 
the analysis of the document structure .
\* A reorganization of the existing three-level cate - null gory system into a 
semantically consistent tree structure would allow us to explore the 
nonterminal nodes of the tree for multi-layered SML .
The whole process brings about high costs in analyzing and modeling the 
application domain , especially if it is to take into account the problem of 
changing categories in the present application .
The implementation and usage of the system including the graphical user 
interface is presented in Section 5 . We conclude by giving an outlook to 
further expected improvements -LRB- Section 6 -RRB- .
In the categorization phase , the new document is preprocessed , and a result 
vector is built as described above and handed over to the categorizer -LRB- cf. 
Figure 1 -RRB- .
Negations were found to describe a state to be changed or to refer to missing 
objects , as in I can

AutoDetectParser bug?

2015-10-14 Thread Ziqi Zhang
Hi

There might be a bug with the AutoDetectParser, which fails to recognise some 
plain-text files as plain text.

Attached are three test files; as you can see, they are all plain text.

The following code is used for my testing:


AutoDetectParser parser = new AutoDetectParser();
for (File f : new File("/Users/-/work/jate/experiment/bugged_corpus").listFiles()) {
    InputStream in = new BufferedInputStream(new FileInputStream(f.toString()));
    BodyContentHandler handler = new BodyContentHandler(-1);
    Metadata metadata = new Metadata();
    try {
        parser.parse(in, handler, metadata);
        String content = handler.toString();
        System.out.println(metadata); //line A
    } catch (Exception e) {
        e.printStackTrace();
    }
}

For the three test files, I would expect line A to print “plain text”; in 
fact, it prints the following:

X-Parsed-By=org.apache.tika.parser.EmptyParser 
Content-Type=image/x-portable-bitmap 

X-Parsed-By=org.apache.tika.parser.DefaultParser 
X-Parsed-By=org.apache.tika.parser.mp3.Mp3Parser xmpDM:audioCompressor=MP3 
Content-Type=audio/mpeg 

X-Parsed-By=org.apache.tika.parser.EmptyParser 
Content-Type=image/x-portable-bitmap 

And as a result, variable “content” is always empty.

Any suggestions on this please?

Thanks