[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-10-26 Thread Bin Hawking (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184464#comment-14184464
 ] 

Bin Hawking commented on TIKA-1446:
---

I created a pull request:
https://github.com/thaichat04/tika/pull/1
https://github.com/binhawking/tika

> CHM parser : wrong decompression of aligned blocks
> --
>
> Key: TIKA-1446
> URL: https://issues.apache.org/jira/browse/TIKA-1446
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.7
> Reporter: Bin Hawking
> Priority: Critical
> Attachments: chm.zip
>
>
> If an embedded file contains aligned blocks, the parser outputs garbled or 
> empty text for that file.
> I have fixed this myself by correcting decompressAlignedBlock() and its 
> preparation methods. The bug is mostly due to misuse of the main tree, the 
> aligned tree, and the length tree. In addition, some of the trees were built 
> incorrectly.
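
For readers unfamiliar with the LZX compression used inside CHM files, the sketch below illustrates where the aligned tree comes into play when a match offset is decoded in an aligned-offset block. It follows the publicly documented LZX format as I understand it and is not the code from the patch. The class name, the method name, and the two callbacks (readBits, decodeAligned) are hypothetical stand-ins for a real bit reader and Huffman decoder.

```java
import java.util.function.IntSupplier;
import java.util.function.IntUnaryOperator;

public class AlignedOffsetSketch {

    /**
     * Computes the "formatted offset" for one position slot of an
     * aligned-offset block. The position slot itself is assumed to have been
     * decoded from the main tree already; basePosition and extraBits are the
     * table entries for that slot.
     *
     * readBits(n)      stand-in: returns the next n raw bits of the stream
     * decodeAligned()  stand-in: returns the next aligned-tree symbol (0..7)
     */
    static int formattedOffset(int basePosition, int extraBits,
                               IntUnaryOperator readBits,
                               IntSupplier decodeAligned) {
        if (extraBits > 3) {
            // High bits are read verbatim; the low 3 bits come from the aligned tree.
            int verbatim = readBits.applyAsInt(extraBits - 3) << 3;
            int aligned = decodeAligned.getAsInt();
            return basePosition + verbatim + aligned;
        } else if (extraBits == 3) {
            // Exactly 3 extra bits: all of them come from the aligned tree.
            return basePosition + decodeAligned.getAsInt();
        } else if (extraBits > 0) {
            // 1 or 2 extra bits: read verbatim, the aligned tree is not used.
            return basePosition + readBits.applyAsInt(extraBits);
        }
        // No extra bits for this slot.
        return basePosition;
    }

    public static void main(String[] args) {
        // Slot with base 32 and 4 extra bits: one verbatim bit (value 1)
        // and aligned symbol 5 give 32 + (1 << 3) + 5.
        System.out.println(formattedOffset(32, 4, n -> 1, () -> 5));
    }
}
```

Using the aligned tree where verbatim bits are expected (or the reverse) desynchronises the bit stream, which would be consistent with the garbled output described in the report.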



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-10-26 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-1445:
--
Attachment: TIKA-1445.Palsulich.102614.patch

Here is an updated patch with the above idea. I created a new public method in 
CompositeParser and DefaultParser -- {{getAllParsersFor(ParseContext, 
MediaType)}} -- which returns a list of all Parsers that support the given type. 
TesseractOCRParser then searches this list for a second Parser that can handle 
the image being parsed.

I created a dummy BodyContentHandler to drop all content from the second Parser.

Thoughts?
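
The lookup described in this comment can be sketched without the Tika classes. The example below is only an illustration of the idea, not the patch: SimpleParser is a hypothetical stand-in for Tika's Parser, media types are plain strings, and secondParserFor is an assumed helper name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AllParsersSketch {

    /** Hypothetical stand-in for a parser: a name plus the media types it supports. */
    static final class SimpleParser {
        final String name;
        final Set<String> supportedTypes;

        SimpleParser(String name, String... types) {
            this.name = name;
            this.supportedTypes = new HashSet<>(Arrays.asList(types));
        }
    }

    /** Returns every registered parser that supports the given media type, in registration order. */
    static List<SimpleParser> getAllParsersFor(List<SimpleParser> parsers, String mediaType) {
        List<SimpleParser> matches = new ArrayList<>();
        for (SimpleParser p : parsers) {
            if (p.supportedTypes.contains(mediaType)) {
                matches.add(p);
            }
        }
        return matches;
    }

    /** Returns the first matching parser other than the caller, or null if there is none. */
    static SimpleParser secondParserFor(List<SimpleParser> parsers, String mediaType,
                                        SimpleParser self) {
        for (SimpleParser p : getAllParsersFor(parsers, mediaType)) {
            if (p != self) {
                return p;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        SimpleParser ocr = new SimpleParser("ocr", "image/jpeg", "image/png");
        SimpleParser jpeg = new SimpleParser("jpeg", "image/jpeg");
        SimpleParser pdf = new SimpleParser("pdf", "application/pdf");
        List<SimpleParser> all = Arrays.asList(ocr, jpeg, pdf);

        SimpleParser second = secondParserFor(all, "image/jpeg", ocr);
        System.out.println(second == null ? "none" : second.name);
    }
}
```

Excluding the caller by identity keeps the OCR parser from delegating to itself. In the real patch the second parser's text output is discarded, so only its metadata is kept.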

> Figure out how to add Image metadata extraction to Tesseract parser
> ---
>
> Key: TIKA-1445
> URL: https://issues.apache.org/jira/browse/TIKA-1445
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Fix For: 1.8
>
> Attachments: TIKA-1445.Mattmann.101214.patch.txt, 
> TIKA-1445.Palsulich.102614.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types, 
> consider how to restore the metadata extraction capabilities provided by the 
> other image parsers.





[jira] [Commented] (TIKA-1452) parser.parse() throws exception after which the procesed file is not getting renamed/moved/deleted

2014-10-26 Thread Abhishek (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184895#comment-14184895
 ] 

Abhishek commented on TIKA-1452:


I do not think there is a missing dependency. Some files have been renamed 
successfully by the same code. The file.renameTo() method returns false for 
seemingly random files. Here is the sample code:
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.StringWriter;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.WriteOutContentHandler;

public class RenameAfterParse { // class name added for illustration
    public static void main(String[] args) {
        InputStream is = null;
        StringWriter writer = new StringWriter();
        Metadata metadata = new Metadata();
        Parser parser = new AutoDetectParser();
        File file = null;
        File destination = null;
        try {
            file = new File("E:\\New folder\\inputFile");
            System.out.println(file.exists());
            destination = new File("E:\\New folder\\test\\outputFile");
            is = new FileInputStream(file);
            // Throws an exception for some files.
            parser.parse(is, new WriteOutContentHandler(writer), metadata, new ParseContext());
            String contentType = metadata.get(Metadata.CONTENT_TYPE);
            System.out.println(contentType);
        } catch (Exception e1) {
            e1.printStackTrace();
        } finally {
            try {
                if (is != null) {
                    is.close();
                }
                writer.close();
            } catch (Exception e2) {
                e2.printStackTrace();
            }
        }
        // x is false for some files.
        boolean x = file.renameTo(destination);
        System.out.println(x);
    }
}
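
File.renameTo only reports true or false, so it cannot say why a rename failed. As a hedged sketch, and assuming Java 7 or later is available (the issue lists jre6), java.nio.file.Files.move throws an IOException that names the cause. The example avoids Tika entirely: StreamConsumer is a hypothetical stand-in for parser.parse(), and the class and method names are illustrative.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class MoveAfterParseSketch {

    /** Hypothetical stand-in for parser.parse(); it may fail in any way. */
    interface StreamConsumer {
        void accept(InputStream in) throws Exception;
    }

    /** Processes the file, always closes the stream, then moves the file and reports the outcome. */
    static String processThenMove(Path source, Path target, StreamConsumer consumer) {
        // try-with-resources closes the stream even if the consumer throws.
        try (InputStream in = Files.newInputStream(source)) {
            consumer.accept(in);
        } catch (Throwable t) {
            // Throwable is caught here only so that the move below is still attempted
            // when an Error such as NoClassDefFoundError is raised.
            System.err.println("Processing failed: " + t);
        }
        try {
            Files.move(source, target, StandardCopyOption.REPLACE_EXISTING);
            return "moved";
        } catch (IOException e) {
            // Unlike File.renameTo, the exception explains what went wrong.
            return "move failed: " + e;
        }
    }

    /** Runs the sketch against a temporary file with a consumer that always fails. */
    static boolean demo() {
        try {
            Path source = Files.createTempFile("input", ".bin");
            Path target = source.resolveSibling(source.getFileName() + ".done");
            String result = processThenMove(source, target,
                    in -> { throw new NoClassDefFoundError("simulated"); });
            boolean ok = result.equals("moved") && Files.exists(target) && !Files.exists(source);
            Files.deleteIfExists(target);
            return ok;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

If the move still fails on the affected files, the exception message should show whether the file is locked by another handle, which is the question the report raises.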

> parser.parse() throws exception after which the procesed file is not getting 
> renamed/moved/deleted
> --
>
> Key: TIKA-1452
> URL: https://issues.apache.org/jira/browse/TIKA-1452
> Project: Tika
> Issue Type: Bug
> Components: detector, metadata, parser
> Affects Versions: 1.6
> Environment: jre6
> Reporter: Abhishek
>
> I am passing a file as an input stream to the parser.parse() method while 
> using the Apache Tika library to convert the file to text. The method throws 
> an exception (shown below), but the input stream is closed successfully in 
> the finally block. When I then rename the file, the File.renameTo method from 
> java.io returns false. I am not able to rename, delete, or move the file 
> despite successfully closing the input stream. I suspect that another handle 
> on the file is opened while parser.parse() processes it, and that this handle 
> is not closed by the time the exception is thrown. Is that possible? If so, 
> what should I do to rename or delete the file?
> The exception thrown while checking the content type is:
> java.lang.NoClassDefFoundError: Could not initialize class com.adobe.xmp.impl.XMPMetaParser
> at com.adobe.xmp.XMPMetaFactory.parseFromBuffer(XMPMetaFactory.java:160)
> at com.adobe.xmp.XMPMetaFactory.parseFromBuffer(XMPMetaFactory.java:144)
> at com.drew.metadata.xmp.XmpReader.extract(XmpReader.java:106)
> at com.drew.imaging.jpeg.JpegMetadataReader.extractMetadataFromJpegSegmentReader(JpegMetadataReader.java:112)
> at com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(JpegMetadataReader.java:71)
> at org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:91)
> at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:121)
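
One detail of the trace is worth spelling out: java.lang.NoClassDefFoundError is a subclass of Error, not Exception, so a catch (Exception e) block does not intercept it. The self-contained sketch below demonstrates that control flow with plain Java. The class and method names are illustrative, and the failure is simulated rather than produced by Tika.

```java
public class ErrorVsExceptionSketch {

    /** Runs the task inside try/catch(Exception)/finally and records which blocks executed. */
    static String run(Runnable task) {
        StringBuilder log = new StringBuilder();
        try {
            try {
                task.run();
                log.append("ok;");
            } catch (Exception e) {
                log.append("caught Exception;");
            } finally {
                // Runs in every case, which is why the stream still gets closed.
                log.append("finally;");
            }
            // Reached only if nothing escaped the inner try statement.
            log.append("after;");
        } catch (Error e) {
            log.append("Error escaped;");
        }
        return log.toString();
    }

    public static void main(String[] args) {
        System.out.println(run(() -> { throw new IllegalStateException("simulated"); }));
        System.out.println(run(() -> { throw new NoClassDefFoundError("simulated"); }));
    }
}
```

For the Error case the code after the inner try statement is skipped, so in a program structured like the reported one, statements placed after the finally block would not run for that file unless the Error is handled somewhere.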





[jira] [Comment Edited] (TIKA-1452) parser.parse() throws exception after which the procesed file is not getting renamed/moved/deleted

2014-10-26 Thread Abhishek (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184895#comment-14184895
 ] 

Abhishek edited comment on TIKA-1452 at 10/27/14 6:54 AM:
--

I do not think there is a missing dependency. Some files have been renamed 
successfully by the same code. The file.renameTo() method returns false for 
seemingly random files. Here is the sample code:
public static void main(String[] args) {
    InputStream is = null;
    StringWriter writer = new StringWriter();
    Metadata metadata = new Metadata();
    Parser parser = new AutoDetectParser();
    File file = null;
    File destination = null;
    try {
        file = new File("E:\\New folder\\inputFile");
        System.out.println(file.exists());
        destination = new File("E:\\New folder\\test\\outputFile");
        is = new FileInputStream(file);
        // Throws an exception for some files.
        parser.parse(is, new WriteOutContentHandler(writer), metadata, new ParseContext());

        String contentType = metadata.get(Metadata.CONTENT_TYPE);

        System.out.println(contentType);
    } catch (Exception e1) {
        e1.printStackTrace();
    } finally {
        try {
            if (is != null) {
                is.close();
            }
            writer.close();
        } catch (Exception e2) {
            e2.printStackTrace();
        }
    }
    // isRenamed is false for some files.
    boolean isRenamed = file.renameTo(destination);
    System.out.println(isRenamed);
}


> parser.parse() throws exception after which the procesed file is not getting 
> renamed/moved/deleted
> --
>
> Key: TIKA-1452
> URL: https://issues.apache.org/jira/browse/TIKA-1452
> Project: Tika
> Issue Type: Bug
> Components: detector, metadata, parser
> Affects Versions: 1.6
> Environment: jre6
> Reporter: Abhishek
>
> I am passing a file as an input stream to the parser.parse() method while 
> using the Apache Tika library to convert the file to text. The method throws an exception 
> (displayed below) but the input stream is closed in the finally block 
> successfull