pdf acroform and tika

2012-02-23 Thread Allison, Timothy B.
Not sure if this is an issue for PDFBox or Tika, but I noticed that PDFBox's 
textstripper is not extracting information from the form fields in a batch of 
pdf documents I'm processing.  Is anyone else having this problem?
I regret that I'm unable to send an example document.
Inelegant solution with error handling not included:
StringBuilder sb = new StringBuilder();
//get text with text stripper and then
PDDocumentCatalog catalog = pdDoc.getDocumentCatalog();
if (catalog != null){
   PDAcroForm form = catalog.getAcroForm();
   if (form != null){
      List<PDField> fields = form.getFields();
      for (PDField field : fields){
          sb.append(field.getFullyQualifiedName() + ": " + field.getValue() + "\r\n");
      }
   }
}



BodyContentHandler and a docx embedded within a PDF

2013-05-22 Thread Allison, Timothy B.
I have a PDF document with a docx attachment.  I wasn't having luck getting the 
contents of the docx with tika.parseToString(file).

I dug around a bit in the PDFExtractor and found that when I changed this line:
embeddedExtractor.parseEmbedded(
 stream,
new EmbeddedContentHandler(new BodyContentHandler(localHandler)),
metadata, 
false);
to:

embeddedExtractor.parseEmbedded(
 stream,
 new EmbeddedContentHandler(handler),
metadata, 
false);

in other words, when I no longer required body elements, I was able to get 
the content of the attached document.

I attached the same inner document to a docx file and had luck without this 
change.   Does anyone know why this change is required in PDFExtractor?  Is 
this a bad solution?

Unfortunately, I can't share the documents.

   Best,

   Tim



RE: How to extract autoshape text in Excel 2007+

2013-07-22 Thread Allison, Timothy B.
This looks like an area for a  new feature in both Tika and POI.  I've only 
looked very briefly into the POI libraries, and I may have missed how to 
extract text from autoshapes.  I'll open an issue in both projects.

-Original Message-
From: Hiroshi Tatsumi [mailto:honekich...@comet.ocn.ne.jp] 
Sent: Sunday, July 21, 2013 10:16 AM
To: user@tika.apache.org
Subject: How to extract autoshape text in Excel 2007+

Hi,

I am using Tika 1.3 and Solr 4.3.1.
I'd like to extract autoshape text in Excel 2007+(.xlsx), but I can't.

I tried to extract from some MS office files.
The results are below.

Success (I can extract autoshape text.)
- Excel 2003(.xls)
- Word 2003(.doc)
- Word 2007+(.docx)

Failed (I cannot extract autoshape text.)
- Excel 2007+(.xlsx)

Is this a bug?
If you know, could you tell me how to extract autoshape text in Excel 2007+?

Thanks,
Hiro. 



RE: How to extract autoshape text in Excel 2007+

2013-07-22 Thread Allison, Timothy B.
This is one way to access the underlying CTShape that contains the text:

XSSFWorkbook wb = new XSSFWorkbook(new FileInputStream(f));
XSSFSheet sheet = wb.getSheetAt(0);
XSSFDrawing drawing = sheet.createDrawingPatriarch();
for (XSSFShape shape : drawing.getShapes()){
   if (shape instanceof XSSFSimpleShape){
  XSSFSimpleShape simple = ((XSSFSimpleShape)shape);
  System.out.println("CT: " + simple.getCTShape());
   }
}

Hiroshi, if this is a high priority, you could extract the txBody element with 
some bean work.  I've opened https://issues.apache.org/jira/browse/TIKA-1150 
for the longer-term fix.

There's some work going on on XSSFTextCell in POI that might make this more 
straightforward.
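
For reference, the bean work might look roughly like this (untested sketch; 
getTxBody()/getPList()/getRList()/getT() are the generated ooxml-schemas 
accessors, so check the exact names against your POI version):

// CTTextBody/CTTextParagraph/CTRegularTextRun come from the ooxml-schemas drawingml beans
private static String getShapeText(XSSFSimpleShape simple) {
    StringBuilder sb = new StringBuilder();
    CTTextBody body = simple.getCTShape().getTxBody();
    if (body != null) {
        for (CTTextParagraph p : body.getPList()) {
            for (CTRegularTextRun r : p.getRList()) {
                sb.append(r.getT());
            }
            sb.append("\n");
        }
    }
    return sb.toString();
}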

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Monday, July 22, 2013 8:50 AM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

This looks like an area for a  new feature in both Tika and POI.  I've only 
looked very briefly into the POI libraries, and I may have missed how to 
extract text from autoshapes.  I'll open an issue in both projects.

-Original Message-
From: Hiroshi Tatsumi [mailto:honekich...@comet.ocn.ne.jp] 
Sent: Sunday, July 21, 2013 10:16 AM
To: user@tika.apache.org
Subject: How to extract autoshape text in Excel 2007+

Hi,

I am using Tika 1.3 and Solr 4.3.1.
I'd like to extract autoshape text in Excel 2007+(.xlsx), but I can't.

I tried to extract from some MS office files.
The results are below.

Success (I can extract autoshape text.)
- Excel 2003(.xls)
- Word 2003(.doc)
- Word 2007+(.docx)

Failed (I cannot extract autoshape text.)
- Excel 2007+(.xlsx)

Is this a bug?
If you know, could you tell me how to extract autoshape text in Excel 2007+?

Thanks,
Hiro. 



RE: How to extract autoshape text in Excel 2007+

2013-09-26 Thread Allison, Timothy B.
Fixed now.  Build from current trunk (r1526498) or pull from 
https://builds.apache.org/job/Tika-trunk/lastStableBuild/ after Jenkins has had 
a chance to build.

Best,

   Tim

-Original Message-
From: Hiroshi Tatsumi [mailto:honekich...@comet.ocn.ne.jp] 
Sent: Wednesday, September 25, 2013 6:30 PM
To: user@tika.apache.org
Subject: Re: How to extract autoshape text in Excel 2007+

Hi,

I'm waiting for the fix of this bug.
https://issues.apache.org/jira/browse/TIKA-1100

The POI bug referenced in this issue has already been fixed.
http://issues.apache.org/bugzilla/show_bug.cgi?id=55292

It would be great if you could give me a patch.


Thanks,
Hiroshi Tatsumi



-Original Message- 
From: Allison, Timothy B.
Sent: Tuesday, July 23, 2013 5:10 AM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

Hiroshi,
   To fix this on your own will take quite a bit of work.  I give details 
below if you do want to go this route.

The longer term path, I think is:
1) https://issues.apache.org/bugzilla/show_bug.cgi?id=55292 will be 
committed to POI.
2) A new release of POI will be made.
3) Small fixes to Tika's Excel parser will be made to take advantage of the 
new functionality in POI55292.

Others on the list may have a simpler solution, but this is what I had to do 
before https://issues.apache.org/jira/browse/TIKA-1130 was committed.

This is a very unappetizing solution; beware of dragons and don't try this 
at work.  Your steps will differ somewhat because you're working with xlsx 
vs docx.  I'm sure that I don't remember each step.

1) Modify the underlying POI code to expose a getText() or similar 
functionality on the object of interest to me (in my original email, I gave 
some hint of how to do this)

2) Modify XWPFWordExtractorDecorator to take advantage of getText() in the 
underlying POI object.

There are several options for how to tie it all together.
3) I chose to copy and paste XWPFWordExtractorDecorator and OOXMLExtractorFactory 
into a different namespace.
4) Modify the copies to call your new version of XWPFWordExtractorDecorator.
5) Finally, register your new office parser in 
tika-parsers/META-INF/services/org.apache.tika.parser.Parser.
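
For step 5, that registration is just a line in the standard service-provider 
file; for example (the class name below is a placeholder for whatever you call 
your copied parser):

   # tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
   org.mycompany.tika.MyOOXMLParser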



-Original Message-
From: Hiroshi Tatsumi [mailto:honekich...@comet.ocn.ne.jp]
Sent: Monday, July 22, 2013 11:42 AM
To: user@tika.apache.org
Subject: Re: How to extract autoshape text in Excel 2007+

Thank you for your reply. I really appreciate it.
This is a high priority for me.
Because we use solr, and our customer wants to search autoshapes' text in
Excel 2007+ files.

I've been investigating the Tika source code, and trying to fix it.
I understand that I can extract text from autoshapes with XSSFWorkbook.
But first, I think the problem is ExcelExtractor's listenForAllRecords
field.
I set the listenForAllRecords field to true, as shown below, but it didn't work
properly.
-
Class: org.apache.tika.parser.microsoft.OfficeParser
Method: protected void parse(
DirectoryNode root, ParseContext context, Metadata metadata,
XHTMLContentHandler xhtml)
TargetCode:
case XLR:
   Locale locale = context.get(Locale.class, Locale.getDefault());
   ExcelExtractor ee = new ExcelExtractor(context);
   ee.setListenForAllRecords(true);
   ee.parse(root, xhtml, locale);
   // original code
   // new ExcelExtractor(context).parse(root, xhtml, locale);
   break;
-

Is this a wrong direction?
If you know which class I should fix, please let me know.



-Original Message- 
From: Allison, Timothy B.
Sent: Monday, July 22, 2013 10:27 PM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

This is one way to access the underlying CTShape that contains the text:

XSSFWorkbook wb = new XSSFWorkbook(new FileInputStream(f));
XSSFSheet sheet = wb.getSheetAt(0);
XSSFDrawing drawing = sheet.createDrawingPatriarch();
for (XSSFShape shape : drawing.getShapes()){
   if (shape instanceof XSSFSimpleShape){
  XSSFSimpleShape simple = ((XSSFSimpleShape)shape);
  System.out.println("CT: " + simple.getCTShape());
   }
}

Hiroshi, If this is a high priority, you could extract the txBody element
with some bean work.  I've opened
https://issues.apache.org/jira/browse/TIKA-1150 for the longer term fix.

There's some work going on on XSSFTextCell in POI that might make this more
straightforward.

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, July 22, 2013 8:50 AM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

This looks like an area for a  new feature in both Tika and POI.  I've only
looked very briefly into the POI libraries, and I may have missed how to
extract text from autoshapes.  I'll open an issue in both projects.

-Original Message-
From: Hiroshi Tatsumi

tika server jax-rs and recursive file processing

2014-04-30 Thread Allison, Timothy B.
All,
  As always, apologies for the cluelessness the following reveals... I'm 
starting to move from embedded Tika to a server option for greater robustness.  
Is the jax-rs server intended not to handle embedded files recursively?  If so, 
how are users currently handling multiply embedded documents with the jax-rs 
server?  Would it be worthwhile to add another service that uses 
AutoDetectParser as the embedded parser/extractor instead of 
MyEmbeddedDocumentExtractor?

Best,

   Tim

Timothy B. Allison, Ph.D.
Lead Artificial Intelligence Engineer
Group Lead
K83A/Human Language Technology
The MITRE Corporation
7515 Colshire Drive, McLean, VA  22102
703-983-2473 (phone); 703-983-1379 (fax)



RE: Stack Overflow Question

2014-06-30 Thread Allison, Timothy B.
DefaultHandler is effectively a NullHandler; it doesn't store or do anything.



Try BodyContentHandler or ToXMLHandler or maybe WriteoutHandler.





If you want to write out each embedded file as a binary, try subclassing 
EmbeddedResourceHandler.



QUOTE (from http://stackoverflow.com/questions/24495504/unable-tp-read-zipfile-using-apache-tika?sem=2):


i am using Apache Tika 1.5 for parsing the contents present in a zip file,

here's my sample code

Parser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
context.set(Parser.class, parser);
ContentHandler handler = new DefaultHandler();
Metadata metadata = new Metadata();
InputStream stream = null;
try {
    stream = TikaInputStream.get(new File(zipFilePath));
} catch (FileNotFoundException e) {
    e.printStackTrace();
}
try {
    parser.parse(stream, handler, metadata, context);
    logger.info("Content:\t" + handler.toString());
} catch (IOException e) {
    e.printStackTrace();
} catch (SAXException e) {
    e.printStackTrace();
} catch (TikaException e) {
    e.printStackTrace();
} finally {
    try {
        stream.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

in the logger statement all i see is org.xml.sax.helpers.DefaultHandler@5bd8e367

i am missing something, unable to figure it out, looking for some help




-Original Message-

From: yeshwanth kumar [mailto:yeshwant...@gmail.com]

Sent: Monday, June 30, 2014 1:28 PM

To: d...@tika.apache.org

Subject: Stack Overflow Question



Unable tp read zipfile using Apache Tika

http://stackoverflow.com/q/24495504/1899893?sem=2


RE: Stack Overflow Question

2014-06-30 Thread Allison, Timothy B.
Might want to look into RecursiveMetadata Parser
http://wiki.apache.org/tika/RecursiveMetadata

Or

https://issues.apache.org/jira/i#browse/TIKA-1329?issueKey=TIKA-1329serverRenderedViewIssue=true
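
Roughly, the parser on that wiki page decorates another parser and collects one 
Metadata object per document, including embedded ones.  A minimal sketch along 
those lines (class and method names here are illustrative, not the wiki's exact 
code):

public class RecursiveMetadataParser extends ParserDecorator {
    private final List<Metadata> collected = new ArrayList<Metadata>();

    public RecursiveMetadataParser(Parser parser) {
        super(parser);
    }

    @Override
    public void parse(InputStream stream, ContentHandler handler,
                      Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException {
        super.parse(stream, handler, metadata, context);
        collected.add(metadata);   // called once per container/embedded document
    }

    public List<Metadata> getAllMetadata() {
        return collected;
    }
}

// usage: set the wrapper as the Parser in the ParseContext so embedded docs use it too
RecursiveMetadataParser parser = new RecursiveMetadataParser(new AutoDetectParser());
ParseContext context = new ParseContext();
context.set(Parser.class, parser);
parser.parse(stream, new BodyContentHandler(-1), new Metadata(), context);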
From: yeshwanth kumar [mailto:yeshwant...@gmail.com]
Sent: Monday, June 30, 2014 3:24 PM
To: Allison, Timothy B.
Subject: Re: Stack Overflow Question

hi tim,

thanks for quick reply,

i changed the ContentHandler to BodyContentHandler and got an exception for the 
maximum word limit,
so i used -1 in the BodyContentHandler constructor,

now its another problem, filenames and content are present in the string returned 
from handler.toString()

how can i map a fileName to its content?

thanks,
yeshwanth

On Tue, Jul 1, 2014 at 12:35 AM, Allison, Timothy B. 
talli...@mitre.orgmailto:talli...@mitre.org wrote:

DefaultHandler is effectively a NullHandler; it doesn't store or do anything.



Try BodyContentHandler or ToXMLHandler or maybe WriteoutHandler.





If you want to write out each embedded file as a binary, try subclassing 
EmbeddedResourceHandler.



QUOTE (from http://stackoverflow.com/questions/24495504/unable-tp-read-zipfile-using-apache-tika?sem=2):


i am using Apache Tika 1.5 for parsing the contents present in a zip file,

here's my sample code

Parser parser = new AutoDetectParser();

ParseContext context = new ParseContext();

context.set(Parser.class, parser);

ContentHandler handler = new DefaultHandler();

Metadata metadata = new Metadata();

InputStream stream = null;

try {

stream = TikaInputStream.get(new File(zipFilePath));

} catch (FileNotFoundException e) {

e.printStackTrace();

}

try {



parser.parse(stream, handler, metadata, context);



logger.info("Content:\t" + handler.toString());

} catch (IOException e) {

e.printStackTrace();

} catch (SAXException e) {

e.printStackTrace();

} catch (TikaException e) {

e.printStackTrace();

} finally {

try {

stream.close();

} catch (IOException e) {

e.printStackTrace();

}

}

in the logger statement all i see is 
org.xml.sax.helpers.DefaultHandler@5bd8e367

i am missing something, unable to figure it out, looking for some help




-Original Message-

From: yeshwanth kumar 
[mailto:yeshwant...@gmail.commailto:yeshwant...@gmail.com]

Sent: Monday, June 30, 2014 1:28 PM

To: d...@tika.apache.orgmailto:d...@tika.apache.org

Subject: Stack Overflow Question



Unable tp read zipfile using Apache Tika

http://stackoverflow.com/q/24495504/1899893?sem=2



RE: Stack Overflow Question

2014-07-01 Thread Allison, Timothy B.
Did you try the ToXMLHandler?

From: yeshwanth kumar [mailto:yeshwant...@gmail.com]
Sent: Monday, June 30, 2014 4:50 PM
To: Allison, Timothy B.
Subject: Re: Stack Overflow Question

hi tim,

i tried in all possible ways,
instead of reading entire zip file i parsed individual zipentries,
but even then i faced exceptions such as


org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.microsoft.OfficeParser@37ba3e33
Caused by: java.io.IOException: Invalid header signature; read 
0x725020706968736E, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a 
valid OLE2 document

org.apache.tika.exception.TikaException: Unable to unpack document stream

org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.microsoft.OfficeParser@6f0ee75a

org.apache.tika.exception.TikaException: Error creating OOXML extractor


any suggestions regarding these issues,

thanks,
yeshwanth


On Tue, Jul 1, 2014 at 2:00 AM, yeshwanth kumar 
yeshwant...@gmail.commailto:yeshwant...@gmail.com wrote:

hi tim,

thanks, for sharing the resources but i am unable to figure out how to 
implement it in my code,
what i didn't understand is the flow and recursive steps, when i ran the 
RecursiveMetadataParser
it still giving the same kind of output as filenames combined with content of 
the files,

i am totally confused.

On Tue, Jul 1, 2014 at 1:29 AM, Allison, Timothy B. 
talli...@mitre.orgmailto:talli...@mitre.org wrote:
Or use the ToXMLHandler and parse the XML?

From: Allison, Timothy B. [mailto:talli...@mitre.orgmailto:talli...@mitre.org]
Sent: Monday, June 30, 2014 3:55 PM
To: yeshwanth kumar
Cc: user@tika.apache.orgmailto:user@tika.apache.org
Subject: RE: Stack Overflow Question

Might want to look into RecursiveMetadata Parser
http://wiki.apache.org/tika/RecursiveMetadata

Or

https://issues.apache.org/jira/i#browse/TIKA-1329?issueKey=TIKA-1329serverRenderedViewIssue=true
From: yeshwanth kumar [mailto:yeshwant...@gmail.com]
Sent: Monday, June 30, 2014 3:24 PM
To: Allison, Timothy B.
Subject: Re: Stack Overflow Question

hi tim,

thanks for quick reply,

i changed the contenthandler to bodyContentHandler i got exception for maximum 
word limit,
i used -1 in the bodycontenthandler constructor,

now its another problem, filenames and content are present in string returned 
from handler.tostring()

how can i map a fileName to its content.

thanks,
yeshwanth

On Tue, Jul 1, 2014 at 12:35 AM, Allison, Timothy B. 
talli...@mitre.orgmailto:talli...@mitre.org wrote:

DefaultHandler is effectively a NullHandler; it doesn't store or do anything.



Try BodyContentHandler or ToXMLHandler or maybe WriteoutHandler.





If you want to write out each embedded file as a binary, try subclassing 
EmbeddedResourceHandler.



QUOTE (from http://stackoverflow.com/questions/24495504/unable-tp-read-zipfile-using-apache-tika?sem=2):


i am using Apache Tika 1.5 for parsing the contents present in a zip file,

here's my sample code

Parser parser = new AutoDetectParser();

ParseContext context = new ParseContext();

context.set(Parser.class, parser);

ContentHandler handler = new DefaultHandler();

Metadata metadata = new Metadata();

InputStream stream = null;

try {

stream = TikaInputStream.get(new File(zipFilePath));

} catch (FileNotFoundException e) {

e.printStackTrace();

}

try {



parser.parse(stream, handler, metadata, context);



logger.info("Content:\t" + handler.toString());

} catch (IOException e) {

e.printStackTrace();

} catch (SAXException e) {

e.printStackTrace();

} catch (TikaException e) {

e.printStackTrace();

} finally {

try {

stream.close();

} catch (IOException e) {

e.printStackTrace();

}

}

in the logger statement all i see is 
org.xml.sax.helpers.DefaultHandler@5bd8e367

i am missing something, unable to figure it out, looking for some help




-Original Message-

From: yeshwanth kumar 
[mailto:yeshwant...@gmail.commailto:yeshwant...@gmail.com]

Sent: Monday, June 30, 2014 1:28 PM

To: d...@tika.apache.orgmailto:d...@tika.apache.org

Subject: Stack Overflow Question



Unable tp read zipfile using Apache Tika

http://stackoverflow.com/q/24495504/1899893?sem=2






RE: Stack Overflow Question

2014-07-01 Thread Allison, Timothy B.
Good to hear.  Let us know if you have any other questions or when you run into 
surprises.

From: yeshwanth kumar [mailto:yeshwant...@gmail.com]
Sent: Tuesday, July 01, 2014 10:23 AM
To: Allison, Timothy B.
Subject: Re: Stack Overflow Question

hi tim,

i forgot to change the BodyContentHandler to ToXMLContentHandler in 
RecursiveMetada, i changed it only in my
calling method,

now i am getting the entire document as the structure u specified.

thanks a ton.

-yeshwanth

On Tue, Jul 1, 2014 at 7:16 PM, Allison, Timothy B. 
talli...@mitre.orgmailto:talli...@mitre.org wrote:
Hmmm….

When I use the ToXMLHandler on the test doc submitted with TIKA-1329, I see 
this:

<div class="embedded" id="embed4.zip" />
<div class="package-entry"><h1>embed4.zip</h1>
<div class="embedded" id="embed4.txt" />
<div class="package-entry"><h1>embed4.txt</h1>
<p>embed_4</p>
</div>
</div>
</div>
</div>

That's a text file inside of a zip file that is itself embedded.  I could see 
doing some parsing on the XML to scrape out the <div class="package-entry"> 
contents and grab the file name from the <h1> element.

If I committed TIKA-1329, would that be of any use to you?   That returns a 
list of metadata objects.  There is one metadata object per embedded file.  The 
text content of each file can be retrieved from each metadata object by this 
key: “tika:content.”
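
As a sketch of how that list might be consumed once committed (the 
recursiveWrapper variable is hypothetical, and the key name follows this thread 
and could change before commit):

List<Metadata> metadataList = recursiveWrapper.getMetadata();
for (Metadata m : metadataList) {
    String name = m.get(Metadata.RESOURCE_NAME_KEY);  // embedded file name, when known
    String text = m.get("tika:content");              // extracted text, per the key above
}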

Best,

Tim
From: yeshwanth kumar 
[mailto:yeshwant...@gmail.commailto:yeshwant...@gmail.com]
Sent: Tuesday, July 01, 2014 9:00 AM

To: Allison, Timothy B.
Subject: Re: Stack Overflow Question

output is same even with ToXMLHandler

On Tue, Jul 1, 2014 at 5:59 PM, Allison, Timothy B. 
talli...@mitre.orgmailto:talli...@mitre.org wrote:
Did you try the ToXMLHandler?

From: yeshwanth kumar 
[mailto:yeshwant...@gmail.commailto:yeshwant...@gmail.com]
Sent: Monday, June 30, 2014 4:50 PM

To: Allison, Timothy B.
Subject: Re: Stack Overflow Question

hi tim,

i tried in all possible ways,
instead of reading entire zip file i parsed individual zipentries,
but even then i faced exceptions such as


org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.microsoft.OfficeParser@37ba3e33
Caused by: java.io.IOException: Invalid header signature; read 
0x725020706968736E, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a 
valid OLE2 document

org.apache.tika.exception.TikaException: Unable to unpack document stream

org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.microsoft.OfficeParser@6f0ee75a

org.apache.tika.exception.TikaException: Error creating OOXML extractor


any suggestions regarding these issues,

thanks,
yeshwanth


On Tue, Jul 1, 2014 at 2:00 AM, yeshwanth kumar 
yeshwant...@gmail.commailto:yeshwant...@gmail.com wrote:

hi tim,

thanks, for sharing the resources but i am unable to figure out how to 
implement it in my code,
what i didn't understand is the flow and recursive steps, when i ran the 
RecursiveMetadataParser
it still giving the same kind of output as filenames combined with content of 
the files,

i am totally confused.

On Tue, Jul 1, 2014 at 1:29 AM, Allison, Timothy B. 
talli...@mitre.orgmailto:talli...@mitre.org wrote:
Or use the ToXMLHandler and parse the XML?

From: Allison, Timothy B. [mailto:talli...@mitre.orgmailto:talli...@mitre.org]
Sent: Monday, June 30, 2014 3:55 PM
To: yeshwanth kumar
Cc: user@tika.apache.orgmailto:user@tika.apache.org
Subject: RE: Stack Overflow Question

Might want to look into RecursiveMetadata Parser
http://wiki.apache.org/tika/RecursiveMetadata

Or

https://issues.apache.org/jira/i#browse/TIKA-1329?issueKey=TIKA-1329serverRenderedViewIssue=true
From: yeshwanth kumar [mailto:yeshwant...@gmail.com]
Sent: Monday, June 30, 2014 3:24 PM
To: Allison, Timothy B.
Subject: Re: Stack Overflow Question

hi tim,

thanks for quick reply,

i changed the contenthandler to bodyContentHandler i got exception for maximum 
word limit,
i used -1 in the bodycontenthandler constructor,

now its another problem, filenames and content are present in string returned 
from handler.tostring()

how can i map a fileName to its content.

thanks,
yeshwanth

On Tue, Jul 1, 2014 at 12:35 AM, Allison, Timothy B. 
talli...@mitre.orgmailto:talli...@mitre.org wrote:

DefaultHandler is effectively a NullHandler; it doesn't store or do anything.



Try BodyContentHandler or ToXMLHandler or maybe WriteoutHandler.





If you want to write out each embedded file as a binary, try subclassing 
EmbeddedResourceHandler.



QUOTE (from http://stackoverflow.com/questions/24495504/unable-tp-read-zipfile-using-apache-tika?sem=2):


i am using Apache Tika 1.5 for parsing the contents present in a zip file,

here's my sample code

Parser parser = new AutoDetectParser();

ParseContext context = new ParseContext();

context.set(Parser.class, parser

RE: How to index the parsed content effectively

2014-07-02 Thread Allison, Timothy B.
Hi Sergey,

  I'd take a look at what the DataImportHandler in Solr does.  If you want to 
store the field, you need to create the field with a String (as opposed to a 
Reader); which means you have to have the whole thing in memory.  Also, if 
you're proposing adding a field entry in a multivalued field for a given SAX 
event, I don't think that will help, because you still have to hold the entire 
document in memory before calling addDocument() if you are storing the field.  
If you aren't storing the field, then you could try a Reader.
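
To make that concrete, the two options look roughly like this in Lucene 
(4.x-style API; the field name and the extractedText/writer variables are just 
examples):

Document doc = new Document();
// Option 1: store the extracted text -- requires the whole String in memory,
// but gives you a stored copy for highlighting
doc.add(new TextField("content", extractedText, Field.Store.YES));
// Option 2: index-only via a Reader -- nothing is stored, so no stored copy
// doc.add(new TextField("content", new StringReader(extractedText)));
writer.addDocument(doc);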
 
  Some thoughts:

  At the least, you could create a separate Lucene document for each container 
document and each of its embedded documents.
  
  You could also break large documents into logical sections and index those as 
separate documents; but that gets very use-case dependent.

In practice, for many, many use cases I've come across, you can index quite 
large documents with no problems, e.g. Moby Dick or Dream of the Red 
Chamber.  There may be a hit at highlighting time for large docs depending on 
which highlighter you use.  In the old days, there used to be a 10k default 
limit on the number of tokens, but that is now long gone.
  
  For truly large docs (probably machine generated), yes, you could run into 
problems if you need to hold the whole thing in memory.  
  
 Cheers,

  Tim
-Original Message-
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com] 
Sent: Wednesday, July 02, 2014 8:27 AM
To: user@tika.apache.org
Subject: How to index the parsed content effectively

Hi All,

We've been experimenting with indexing the parsed content in Lucene and
our initial attempt was to index the output from
ToTextContentHandler.toString() as a Lucene Text field.

This is unlikely to be effective for large files. So I wonder what
strategies exist for a more effective indexing/tokenization of the
possibly large content.

Perhaps a custom ContentHandler can index content fragments in a unique
Lucene field every time its characters(...) method is called, something
I've been planning to experiment with.

The feedback will be appreciated
Cheers, Sergey


RE: How to index the parsed content effectively

2014-07-14 Thread Allison, Timothy B.
Hi Sergey,

 Now, we already have the original PDF occupying some space, so 
duplicating it (its content) with a Document with Store.YES fields may 
not be the best idea in some cases.

In some cases, agreed, but in general, this is probably a good default idea.  
As you point out, you aren't quite duplicating the document -- one copy contains 
the original bytes, and the other contains the text (and metadata?) that was 
extracted from the document.  One reason to store the content in the field is 
for easy highlighting.  You could configure the highlighter to pull the text 
content of the document from a db or other source, but that adds complexity and 
perhaps lookup time.  What you really would not want to do from a time 
perspective is ask Tika to parse the raw bytes to pull the content for 
highlighting at search time.  In general, Lucene's storage of the content is 
very reasonable; on one big batch of text files I have, the Lucene index with 
stored fields is the same size as the uncompressed text files.

So I wonder, is it possible somehow for a given Tika Parser, lets say a 
PDF parser, report, via the Metadata, the start and end indexes of the 
content ? So the consumer will create say InputStreamReader for a 
content region and will use Store.NO and this Reader ?

I don't think I quite understand what you're proposing.  The start and end 
indexes of the extracted content?  Wouldn't that just be 0 and the length of 
the string in most cases (beyond-bmp issues aside)?  Or, are you suggesting 
that there may be start and end indexes for content within the actual raw bytes 
of the PDF?  If the latter, for PDFs at least that would effectively require a 
full reparse ... if it were possible, and it probably wouldn't save much in 
time.  For other formats, where that might work, it would create far more 
complexity than value...IMHO.

In general, I'd say store the field.  Perhaps let the user choose to not store 
the field. 

Always interested to hear input from others.

Best,

  Tim


-Original Message-
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com] 
Sent: Friday, July 11, 2014 1:38 PM
To: user@tika.apache.org
Subject: Re: How to index the parsed content effectively

Hi Tim, All.
On 02/07/14 14:32, Allison, Timothy B. wrote:
 Hi Sergey,

I'd take a look at what the DataImportHandler in Solr does.  If you want 
 to store the field, you need to create the field with a String (as opposed to 
 a Reader); which means you have to have the whole thing in memory.  Also, if 
 you're proposing adding a field entry in a multivalued field for a given SAX 
 event, I don't think that will help, because you still have to hold the 
 entire document in memory before calling addDocument() if you are storing the 
 field.  If you aren't storing the field, then you could try a Reader.

I'd like to ask something about using Tika parser and a Reader (and 
Lucene Store.NO)

Consider a case where we have a service which accepts a very large PDF 
file. This file will be stored on the disk or may be in some DB. And 
this service will also use Tika to extract content and populate a Lucene 
Document.
Now, we already have the original PDF occupying some space, so 
duplicating it (its content) with a Document with Store.YES fields may 
not be the best idea in some cases.

So I wonder, is it possible somehow for a given Tika Parser, lets say a 
PDF parser, report, via the Metadata, the start and end indexes of the 
content ? So the consumer will create say InputStreamReader for a 
content region and will use Store.NO and this Reader ?

Does it really make sense at all ? I can create a minor enhancement 
request for parsers getting the access to a low level info like the 
start/stop delimiters of the content to report it ?

Cheers, Sergey





Some thoughts:

At the least, you could create a separate Lucene document for each 
 container document and each of its embedded documents.

You could also break large documents into logical sections and index those 
 as separate documents; but that gets very use-case dependent.

  In practice, for many, many use cases I've come across, you can index 
 quite large documents with no problems, e.g. Moby Dick or Dream of the Red 
 Chamber.  There may be a hit at highlighting time for large docs depending 
 on which highlighter you use.  In the old days, there used to be a 10k 
 default limit on the number of tokens, but that is now long gone.

For truly large docs (probably machine generated), yes, you could run into 
 problems if you need to hold the whole thing in memory.

   Cheers,

Tim
 -Original Message-
 From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
 Sent: Wednesday, July 02, 2014 8:27 AM
 To: user@tika.apache.org
 Subject: How to index the parsed content effectively

 Hi All,

 We've been experimenting with indexing the parsed content in Lucene and
 our initial attempt was to index the output from

RE: Avoiding Out of Memory Errors

2014-07-18 Thread Allison, Timothy B.
I'm working on adding a daemon to Tika Server so that it will restart when it 
hits an OOM or other big problem (infinite hangs).  That won't be available 
until Tika 1.7.  

To amplify Nick's recommendations:

ForkParser or Server are your best options for now.
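
For example, ForkParser runs the actual parse in a forked JVM, so an OOM or hang 
there doesn't take down your application; a minimal sketch:

ForkParser parser = new ForkParser(ForkParser.class.getClassLoader(), new AutoDetectParser());
try {
    ContentHandler handler = new BodyContentHandler(-1);
    parser.parse(stream, handler, new Metadata(), new ParseContext());
} finally {
    parser.close();   // shuts down the forked JVM pool
}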

Are there specific files/file types that are causing the OOM?  Given the size 
of files, is the OOM surprising?  

On TIKA-1294, we found that a specific 4MB PDF would cause an OOM with -Xmx1g.  
 That was surprising and was very quickly addressed by the PDFBox developers.  
If you have specific files that are surprising, please file an issue.

Thank you!



From: Nick Burch [apa...@gagravarr.org]
Sent: Friday, July 18, 2014 4:32 AM
To: user@tika.apache.org
Subject: Re: Avoiding Out of Memory Errors

On Thu, 17 Jul 2014, Shannon Brown wrote:
 Problem:
 How to avoid Out of Memory errors during Tika parsing.

Typical approaches are either to use the ForkParser, or the Tika Server.
Both ensure that if there's a fatal problem with parsing (eg OOM) then
the JVM with your main application in it doesn't die too

For cases where it does die, log it, and if possible report a bug with the
file in question, so we can hopefully fix it for the next release!

Nick

RE: [VOTE] Apache Tika 1.6 release candidate #1

2014-07-28 Thread Allison, Timothy B.
+1

Linux version 2.6.32-431.5.1.el6.x86_64: Java 1.6 and 1.7
Windows 7, Java 1.7

I also ran Tika 1.5 and 1.6 rc1 against a random selection of 10,000 docs (all 
formats) plus all available msoffice-x files in govdocs1, yielding 10,413 docs. 
 There were several improvements in text extraction for PDFs (mostly spacing) 
and 4 fewer exceptions (2 ppt, 1 doc and 1 pdf).

There was one regression:
http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx 

Stacktrace:
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of 
range: -369073454
at java.lang.String.checkBounds(String.java:371)
at java.lang.String.<init>(String.java:415)
at 
org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
at 
org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)
at 
org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:91)
at 
org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:63)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE(AbstractOOXMLExtractor.java:250)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
at 
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:115)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
at 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)


-Original Message-
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov] 
Sent: Monday, July 28, 2014 12:22 AM
To: d...@tika.apache.org
Cc: user@tika.apache.org
Subject: [VOTE] Apache Tika 1.6 release candidate #1

Hi Folks,

A candidate for the Tika 1.6 release is available at:

http://people.apache.org/~mattmann/apache-tika-1.6/rc1/


The release candidate is a zip archive of the sources in:

http://svn.apache.org/repos/asf/tika/tags/1.6/

The SHA1 checksum of the archive is
076ad343be56a540a4c8e395746fa4fda5b5b6d3.

A Maven staging repository is available at:

https://repository.apache.org/content/repositories/orgapachetika-1003/


Please vote on releasing this package as Apache Tika 1.6.
The vote is open for the next 72 hours and passes if a majority of at
least three +1 Tika PMC votes are cast.

[ ] +1 Release this package as Apache Tika 1.6
[ ] -1 Do not release this package because…

Thank you!

Cheers,
Chris

P.S. Here is my +1!







RE: Tika - Outlook msg file with another Outlook msg as an attachment - OutlookExtractor passes empty stream

2014-07-31 Thread Allison, Timothy B.
AarKay,

  We have a unit test for an MSG embedded within an MSG in 
POIContainerExtractionTest.  I also just tried a newly created msg within an 
msg file, and I can extract the embedded content with 
TikaTest.RecursiveMetaParser.  This suggests that the issue is not within the 
OutlookParser.

  If you want the bytes of the embedded file, have you tried (or are you using) 
the Unpacker Resource?  IIRC, this gets the attachments (non-recursively!!!) 
out of each doc you send it and sends you back a zip (or tar).  You should be 
able to step through the ZipEntr(ies) and get the original attachment bytes.
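
The call is roughly the following (endpoint path per the tika-server wiki; 
host/port and file name are placeholders):

curl -X PUT -T outer.msg -H "Accept: application/zip" http://localhost:9998/unpack > attachments.zip
# /unpack/all should also include the container's own text and metadata, if memory serves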

   Best,

 Tim
  

-Original Message-
From: AarKay [mailto:ksu.wildc...@gmail.com] 
Sent: Thursday, July 31, 2014 12:30 AM
To: user@tika.apache.org
Subject: Tika - Outlook msg file with another Outlook msg as an attachment - 
OutlookExtractor passes empty stream

I am using Tika Server (TikaJaxRs) for text extraction needs.
I also have a need to extract the attachments in the file and save it to the 
disk in its native format.
I was able to do it by having CustomParser and write the file to disk using 
'stream' in parse method.

Here is the post I used as a reference for building CustomParser.
http://stackoverflow.com/questions/20172465/get-embedded-resourses-in-doc-
files-using-apache-tika

I was able to get it work fine if the attachment is anything but Outlook msg 
file.

I am running into an issue when the attachment is a Outlook msg file.
When CustomParser.parse method gets invoked the stream passed to it is empty 
because of which the file thats being written to disk is always 0 KB.

Digging through the code I noticed that in OutlookExtractor.java class the 
attachment is handled by OfficeParser because msg.attachdata is always null 
when attachment is a Outlook msg and thats where it is always sending empty 
stream to CustomParser.

Here is the snippet of code from OutlookExtractor where it iterates through 
Attachment files and uses handleEmbeddedResource method only when 
msg.attachData is not null.
But msg.attachData is always null if the Attachment is of type Outlook msg 
because of which stream is always empty when delegating the request to 
CustomParser.parse method.

Can someone please tell me how can i access the msg attachment and save it 
to disk in its Native format?

for (AttachmentChunks attachment : msg.getAttachmentFiles()) {
   xhtml.startElement("div", "class", "attachment-entry");

   String filename = null;
   if (attachment.attachLongFileName != null) {
      filename = attachment.attachLongFileName.getValue();
   } else if (attachment.attachFileName != null) {
      filename = attachment.attachFileName.getValue();
   }
   if (filename != null && filename.length() > 0) {
       xhtml.element("h1", filename);
   }
   if (attachment.attachData != null) {
      handleEmbeddedResource(
            TikaInputStream.get(attachment.attachData.getValue()),
            filename,
            null, xhtml, true
      );
   }
   if (attachment.attachmentDirectory != null) {
      handleEmbededOfficeDoc(
            attachment.attachmentDirectory.getDirectory(),
            xhtml
      );
   }
   xhtml.endElement("div");
   }


Thanks
-AarKay



RE: Apache Tika - JSON?

2014-09-26 Thread Allison, Timothy B.
I suspect, though, that what you want is not what I answered (sorry!)…namely 
entities mapped from xhtml to json.  For that, I don’t think we have anything 
available in Tika, but it wouldn’t be difficult (famous last words) to write a 
content handler to do that…

We have integrated the GSON library to serialize/deserialize Metadata objects 
in tika-serialization.
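
A rough sketch of that path (assumes the JsonMetadata helper in the 
tika-serialization module):

StringWriter writer = new StringWriter();
JsonMetadata.toJson(metadata, writer);   // Metadata -> JSON
String json = writer.toString();
// Metadata roundTripped = JsonMetadata.fromJson(new StringReader(json));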

From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Friday, September 26, 2014 6:54 AM
To: user@tika.apache.org
Subject: RE: Apache Tika - JSON?

The current json output option in the app and server only dump metadata…as you 
probably know.

I plan to add a json version of the RecursiveParserWrapper (list of Metadata 
objects with one entry for content) to the app shortly.  Would that be of any 
use?

Are you using the app, the server, or calling Tika programmatically?


From: Vineet Ghatge Hemantkumar [mailto:heman...@usc.edu]
Sent: Thursday, September 25, 2014 11:06 PM
To: user@tika.apache.orgmailto:user@tika.apache.org
Subject: Apache Tika - JSON?

Hello all,

I was wondering if there is any built-in parser to help with conversion from 
XHTML to JSON.

My research showed that there is one named org.apache.io.json, which has just one 
method implemented.  Also, I tried the GJSON library to do this, but it does not 
seem to work with Tika.  Any suggestions will be appreciated.

Regards,
Vineet


RE: Problem with content extraction

2014-10-07 Thread Allison, Timothy B.
I’ve seen this before on a few documents.  You might experiment with setting 
PDFParserConfig’s suppressDuplicateOverlappingText to true.  If that doesn’t 
work, I’d recommend running the pure PDFBox app’s ExtractText on the document.  
If you get the same doubling of letters, ask over on 
u...@pdfbox.apache.orgmailto:u...@pdfbox.apache.org.  If you don’t, let us 
know!
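
For reference, setting that programmatically looks like this (standard 
PDFParserConfig usage; pass the context into your parse call):

PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setSuppressDuplicateOverlappingText(true);
ParseContext context = new ParseContext();
context.set(PDFParserConfig.class, pdfConfig);
// then call parser.parse(stream, handler, metadata, context) as usual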

Best,

   Tim


From: Mohammad Ghufran [mailto:emghuf...@gmail.com]
Sent: Tuesday, October 07, 2014 8:37 AM
To: user@tika.apache.org
Subject: Problem with content extraction

Hello,

I am using tika to extract content of documents using tika but I've run into a 
problem. In some documents, the characters in the output are repeated several 
times. For example, while processing a PDF file, the text FORMATION is 
transformed into FFOORRMMAATTIIOONN and so on.

I tried looking through the mailing lists but didn't find any reference to 
this. I also tried with the latest version of tika but it results in the same 
output.

The only thing i can notice is that the document seems to have text written 
with some shadow - if it is useful.

I would like to know if someone has encountered this  problem before and what 
are the possible solutions, if any.

Best Regards,
Ghufran


RE: Customizing Metadata Keys

2014-10-09 Thread Allison, Timothy B.
I agree with Nick’s recommendation on post-parsing key mapping, and I’d like to 
put in a plug for the RecursiveParserWrapper, which may be of use for you.  
I’ve been intending to add that to the app commandline and to server…how are 
you handling embedded document metadata?  Would the wrapper be of any use or do 
you not have any embedded docs in your doc set?

I’ve also been meaning to dump counts of metadata keys from the govdocs1 
corpus, would that be of any use, or do you already know the keys that you care 
about?

Cheers,

 Tim
From: Can Duruk [mailto:c...@duruk.net]
Sent: Thursday, October 09, 2014 12:13 PM
To: user@tika.apache.org
Subject: Re: Customizing Metadata Keys

I'd suggest you do the mapping from Tika keys to your keys in the server.
All the parsers should return consistent keys, so the output side is
the
best place to map.

That seems to be the now-obvious solution, thanks for the suggestion.

 Perhaps a re-mapping downstream ContentHandler
 that takes in the Metadata object and will reformat
 the <meta name=".."/> section of the XHTML?

I've tried to add a step late in the pipeline, but I'm not super familiar with 
the Tika codebase so I got a bit lost.  Any pointers (examples / tutorials) you 
could guide me towards?  Chapters in the Tika book?  I want to explore this if 
the server idea doesn't pan out.

On Wed, Oct 8, 2014 at 10:25 PM, Chris Mattmann 
chris.mattm...@gmail.commailto:chris.mattm...@gmail.com wrote:

 Perhaps a re-mapping downstream ContentHandler
 that takes in the Metadata object and will reformat
 the <meta name=".."/> section of the XHTML?


 
 Chris Mattmann
 chris.mattm...@gmail.commailto:chris.mattm...@gmail.com




 -Original Message-
 From: Nick Burch apa...@gagravarr.orgmailto:apa...@gagravarr.org
 Reply-To: user@tika.apache.orgmailto:user@tika.apache.org
 Date: Thursday, October 9, 2014 at 12:32 PM
 To: user@tika.apache.orgmailto:user@tika.apache.org
 Subject: Re: Customizing Metadata Keys

 On Wed, 8 Oct 2014, Can Duruk wrote:
  My question is regarding setting the metadata keys coming from the
 parsers
  to my own keys.
 
  For my application, I am using Tika to extract the metadata for a bunch
 of
  files. I am using the embedded HTTP server which I modified for my
 needs to
  return instead of CSV. (Hoping to submit that as a patch soon)
 
  However, the keys in the JSON are all in different formats and I need
 them
  to conform to my own requirements.
 
 I'd suggest you do the mapping from Tika keys to your keys in the server.
 All the parsers should return consistent keys, so the output side is
 the
 best place to map. Trying to do it in each parser would be much more
 work.
 Just put the mapping in between where you call the parser, and where you
 output
 
 Nick




internal vs external property?

2014-11-20 Thread Allison, Timothy B.
All,
  What is the difference between an internal and an external Property?  I'm not 
(quickly) seeing how Metadata is using that Boolean.  Are there other pieces of 
code that make use of the distinction?
  Thank you.

 Best,

Tim



RE: Encrypted PDF issues build issues

2014-12-11 Thread Allison, Timothy B.
Y, sorry.  As you point out, that should be fixed in PDFBox 1.8.8.  A vote was 
just taken for that, so that will be out very soon.  Last I looked at 
integrating PDFBox 1.8.8-SNAPSHOT, the upgrade requires us to change one test 
(I think?) in Tika…which is why you’re getting a failed build.  Your error 
message is not what I was getting, but it was in that test.

In short…by early next week (I hope), Tika trunk will be good to go with PDFBox 
1.8.8.

If you’d like the one or two lines of code to change to get a Tika to build 
with 1.8.8-SNAPSHOT, let me know.

Best,

   Tim

From: Peter Bowyer [mailto:pe...@mapledesign.co.uk]
Sent: Thursday, December 11, 2014 12:43 PM
To: user@tika.apache.org
Subject: Encrypted PDF issues  build issues

Hi list,

I'm having issues with encrypted PDFs



PDF Testcases pass, but fail on my own encrypted PDF (sample file at 
https://dl.dropboxusercontent.com/u/2460167/encryption.pdf. Its password is 
'testing123')

To rule out a problem with the PDF I tested with Xpdf, and pdftotext extracts 
the text without issue. Unfortunately I need the metadata too.

$ pdftotext -opw testing123 encrypted.pdf

I'm running on Centos 6.6, and the Java packages installed are:
java-1.6.0-openjdk.x86_64   1:1.6.0.33-1.13.5.1.el6_6
java-1.6.0-openjdk-devel.x86_64 1:1.6.0.33-1.13.5.1.el6_6
java-1.7.0-openjdk.x86_64   1:1.7.0.71-2.5.3.1.el6 @updates
java-1.7.0-openjdk-devel.x86_64 1:1.7.0.71-2.5.3.1.el6 @updates


Some outputs:

$ java -jar tika-app-1.7-SNAPSHOT.jar --password=testing123 ~/sample.pdf
INFO - Document is encrypted
Exception in thread main org.apache.tika.exception.TikaException: Unable to 
extract PDF content
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:150)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:161)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:146)
at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:440)
at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:116)
Caused by: java.io.IOException: javax.crypto.IllegalBlockSizeException: Input 
length must be multiple of 16 when decrypting with padded cipher
at 
javax.crypto.CipherInputStream.getMoreData(CipherInputStream.java:115)
at javax.crypto.CipherInputStream.read(CipherInputStream.java:233)
at javax.crypto.CipherInputStream.read(CipherInputStream.java:209)
at 
org.apache.pdfbox.pdmodel.encryption.SecurityHandler.encryptData(SecurityHandler.java:312)
at 
org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptStream(SecurityHandler.java:413)
at 
org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:386)
at 
org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptObject(SecurityHandler.java:361)
at 
org.apache.pdfbox.pdmodel.encryption.SecurityHandler.proceedDecryption(SecurityHandler.java:192)
at 
org.apache.pdfbox.pdmodel.encryption.StandardSecurityHandler.decryptDocument(StandardSecurityHandler.java:158)
at 
org.apache.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:1597)
at org.apache.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:943)
at 
org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:337)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:134)
... 7 more
Caused by: javax.crypto.IllegalBlockSizeException: Input length must be 
multiple of 16 when decrypting with padded cipher
at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:750)
at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:676)
at com.sun.crypto.provider.AESCipher.engineDoFinal(AESCipher.java:420)
at javax.crypto.Cipher.doFinal(Cipher.java:1805)
at 
javax.crypto.CipherInputStream.getMoreData(CipherInputStream.java:112)
... 19 more




I searched the pdfbox issue tracker and found 
https://issues.apache.org/jira/browse/PDFBOX-2469 and 
https://issues.apache.org/jira/browse/PDFBOX-2510, which in turn link to 
related issues. The ticket status says a number of these issues are fixed in 
the 1.8.8 snapshot, and if you run using the Non-Sequential Parser.

So I edited `tika-parsers/pom.xml` and set 
<pdfbox.version>1.8.8-SNAPSHOT</pdfbox.version>.  I also edited 
`tika-parsers/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties`
 and enabled the non-sequential parser.

Now tika won't build. I change PDFParser.properties back and it won't build 
either.

Running org.apache.tika.parser.pdf.PDFParserTest
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0 
(origin 

RE: Outputting JSON from tika-server/meta

2014-12-18 Thread Allison, Timothy B.
Do you have any luck if you call /metadata instead of /meta?

That should trigger MetadataEP which will return Json, no?

I'm not sure why we have both handlers, but we do...


-Original Message-
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com] 
Sent: Thursday, December 18, 2014 9:56 AM
To: user@tika.apache.org
Subject: Re: Outputting JSON from tika-server/meta

Hi Peter
Thanks, you are too nice, it is a minor bug :-)
Cheers, Sergey
On 18/12/14 14:50, Peter Bowyer wrote:
 Thanks Sergey, I have opened TIKA-1497 for this enhancement.

 Best wishes,
 Peter

 On 18 December 2014 at 14:31, Sergey Beryozkin sberyoz...@gmail.com
 mailto:sberyoz...@gmail.com wrote:

 Hi,
 I see MetadataResource returning StreamingOutput and it has
 @Produces("text/csv") only. As such this MBW has no effect at the moment.

 We can update MetadataResource to return Metadata directly if
 application/json is requested or update MetadataResource to directly
 convert Metadata to JSON in case of JSON being accepted

 Can you please open a JIRA issue ?

 Cheers, Sergey



 On 18/12/14 13:58, Peter Bowyer wrote:

 Hi,

 I suspect this has a really simple answer, but it's eluding me.

 How do I get the response from
 curl -X PUT -T /path/to/file.pdf http://localhost:9998/meta
 to be JSON and not CSV?

 I've discovered JSONMessageBodyWriter.java
 
 (https://github.com/apache/__tika/blob/__af19f3ea04792cad81b428f1df9f5e__bbb2501913/tika-server/src/__main/java/org/apache/tika/__server/JSONMessageBodyWriter.__java
 
 https://github.com/apache/tika/blob/af19f3ea04792cad81b428f1df9f5ebbb2501913/tika-server/src/main/java/org/apache/tika/server/JSONMessageBodyWriter.java)
 so I think the functionality is present, tried adding --header
 Accept:
 application/json to the cURL call, in line with the
 documentation for
 outputting CSV, but no luck so far.

 Many thanks,
 Peter




 --
 Maple Design Ltd
 http://www.mapledesign.co.uk
 http://www.mapledesign.co.uk/+44 (0)845 123 8008

 Reg. in England no. 05920531




Tika 2.0???

2014-12-18 Thread Allison, Timothy B.
I feel Tika 2.0 coming up soon (well, April-ish?!) and the breaking of some 
other areas of back compat, esp. parser class loading - config ... 

What other areas for breaking or revamping do others see for 2.0?

We need a short-term fix to get the tesseract ocr integration+metadata out the 
door with 1.7, of course.


-Original Message-
From: Chris Mattmann [mailto:chris.mattm...@gmail.com] 
Sent: Thursday, December 18, 2014 10:42 AM
To: user@tika.apache.org
Subject: Re: Outputting JSON from tika-server/meta

Yeah I think we should probably combine them..and make
JSON the default (which unfortunately would break back
compat, but in my mind would make a lot more sense)


Chris Mattmann
chris.mattm...@gmail.com




-Original Message-
From: Allison, Timothy B. talli...@mitre.org
Reply-To: user@tika.apache.org
Date: Thursday, December 18, 2014 at 7:20 AM
To: user@tika.apache.org user@tika.apache.org
Subject: RE: Outputting JSON from tika-server/meta

Do you have any luck if you call /metadata instead of /meta?

That should trigger MetadataEP which will return Json, no?

I'm not sure why we have both handlers, but we do...


-Original Message-
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Thursday, December 18, 2014 9:56 AM
To: user@tika.apache.org
Subject: Re: Outputting JSON from tika-server/meta

Hi Peter
Thanks, you are too nice, it is a minor bug :-)
Cheers, Sergey
On 18/12/14 14:50, Peter Bowyer wrote:
 Thanks Sergey, I have opened TIKA-1497 for this enhancement.

 Best wishes,
 Peter

 On 18 December 2014 at 14:31, Sergey Beryozkin sberyoz...@gmail.com
 mailto:sberyoz...@gmail.com wrote:

 Hi,
 I see MetadataResource returning StreamingOutput and it has
 @Produces(text/csv) only. As such this MBW has no effect at the
moment.

 We can update MetadataResource to return Metadata directly if
 application/json is requested or update MetadataResource to directly
 convert Metadata to JSON in case of JSON being accepted

 Can you please open a JIRA issue ?

 Cheers, Sergey



 On 18/12/14 13:58, Peter Bowyer wrote:

 Hi,

 I suspect this has a really simple answer, but it's eluding me.

 How do I get the response from
 curl -X PUT -T /path/to/file.pdf http://localhost:9998/meta
 to be JSON and not CSV?

 I've discovered JSONMessageBodyWriter.java
 
(https://github.com/apache/__tika/blob/__af19f3ea04792cad81b428f1df9f5e__
bbb2501913/tika-server/src/__main/java/org/apache/tika/__server/JSONMessa
geBodyWriter.__java
 
https://github.com/apache/tika/blob/af19f3ea04792cad81b428f1df9f5ebbb250
1913/tika-server/src/main/java/org/apache/tika/server/JSONMessageBodyWrit
er.java)
 so I think the functionality is present, tried adding --header
 Accept:
 application/json to the cURL call, in line with the
 documentation for
 outputting CSV, but no luck so far.

 Many thanks,
 Peter




 --
 Maple Design Ltd
 http://www.mapledesign.co.uk
 http://www.mapledesign.co.uk/+44 (0)845 123 8008

 Reg. in England no. 05920531






RE: Outputting JSON from tika-server/meta

2014-12-18 Thread Allison, Timothy B.
Doh!  K, looks like we aren’t loading that in TikaServerCLI.

Does anyone know how we’re using MetadataEP?

From: Peter Bowyer [mailto:pe...@mapledesign.co.uk]
Sent: Thursday, December 18, 2014 10:57 AM
To: user@tika.apache.org
Subject: Re: Outputting JSON from tika-server/meta

On 18 December 2014 at 15:20, Allison, Timothy B. 
talli...@mitre.orgmailto:talli...@mitre.org wrote:
Do you have any luck if you call /metadata instead of /meta?

I have no luck with that:

Dec 18, 2014 3:55:21 PM org.apache.cxf.jaxrs.utils.JAXRSUtils findTargetMethod
WARNING: No operation matching request path /metadata is found, Relative 
Path: /metadata, HTTP Method: PUT, ContentType: */*, Accept: */*,. Please 
enable FINE/TRACE log level for more details.
Dec 18, 2014 3:55:21 PM org.apache.cxf.jaxrs.impl.WebApplicationExceptionMapper 
toResponse
WARNING: javax.ws.rs.ClientErrorException: HTTP 404 Not Found
at 
org.apache.cxf.jaxrs.utils.SpecExceptions.toHttpException(SpecExceptions.java:117)
at 
org.apache.cxf.jaxrs.utils.ExceptionUtils.toHttpException(ExceptionUtils.java:157)
at 
org.apache.cxf.jaxrs.utils.JAXRSUtils.findTargetMethod(JAXRSUtils.java:526)
at 
org.apache.cxf.jaxrs.interceptor.JAXRSInInterceptor.processRequest(JAXRSInInterceptor.java:177)
at 
org.apache.cxf.jaxrs.interceptor.JAXRSInInterceptor.handleMessage(JAXRSInInterceptor.java:77)
at 
org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
at 
org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
at 
org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:243)
at 
org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
at 
org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:370)
at 
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
at 
org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
at 
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:651)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at 
org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
at 
org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
at 
org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:745)

Best regards,
Peter


RE: Outputting JSON from tika-server/meta

2014-12-19 Thread Allison, Timothy B.
All,

With many thanks to Sergey, I added JSON and XMP to “/meta” and I folded in 
MetadataEP into MetadataResource so that users can request a specific metadata 
value(s). (TIKA-1497, TIKA-1499)

I also added a new endpoint “/rmeta” that is equivalent to tika-app’s -J 
(TIKA-1498) – JSONified view of a list of metadata objects representing the 
container document and all embedded docs…aka Jukka and Nick’s 
RecursiveParserWrapper.
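
For example, with a running server on the default port, something along these lines should now come back as JSON (see the wiki for the authoritative forms):

curl -X PUT -T /path/to/file.pdf --header "Accept: application/json" http://localhost:9998/meta
curl -X PUT -T /path/to/file.pdf http://localhost:9998/rmeta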

I also updated the jax-rs wiki to reflect these changes.

Please kick the tires and let us know if there are any surprises.

Best,

   Tim
From: Peter Bowyer [mailto:pe...@mapledesign.co.uk]
Sent: Thursday, December 18, 2014 8:58 AM
To: user@tika.apache.org
Subject: Outputting JSON from tika-server/meta

Hi,

I suspect this has a really simple answer, but it's eluding me.

How do I get the response from
curl -X PUT -T /path/to/file.pdf http://localhost:9998/meta
to be JSON and not CSV?

I've discovered JSONMessageBodyWriter.java 
(https://github.com/apache/tika/blob/af19f3ea04792cad81b428f1df9f5ebbb2501913/tika-server/src/main/java/org/apache/tika/server/JSONMessageBodyWriter.java)
 so I think the functionality is present, tried adding --header Accept: 
application/json to the cURL call, in line with the documentation for 
outputting CSV, but no luck so far.

Many thanks,
Peter


RE: Running tika-server as a service

2015-01-08 Thread Allison, Timothy B.
Peter,
  I don’t have any immediate solutions, but there are two options in the 
pipeline (probably Tika 1.8):


1)  Lewis John McGibbney on TIKA-894 is going to add a war/webapp.

2)  I plan to open an issue related to TIKA-1330 that will make our current 
jax-rs tika-server more robust to OOM and permanent hangs, i.e. the server 
process will shut itself down if it encounters either of these, and a watcher 
process will restart the server process… as currently happens in the dev 
version of TIKA-1330.

  This is an interest close to my heart, and I look forward to hearing how 
others are handling this.

  Best,

   Tim

From: Peter Bowyer [mailto:pe...@mapledesign.co.uk]
Sent: Thursday, January 08, 2015 6:47 AM
To: user@tika.apache.org
Subject: Running tika-server as a service

Hi,

I want to ensure tika-server is always running, and continues to after restarts 
etc.

I have a hacked together an init script (this being CentOS release 6.6) that 
seems to work (it's running, though not restarted the server yet to test) but 
it's an ugly way to manage things.

How do you keep tika-server running? A daemon manager like daemon tools?  
Handcrafted init.d/upstart/systemd scripts? Is anyone able to share what they 
use?

Thanks,
Peter


RE: Running tika-server as a service

2015-01-08 Thread Allison, Timothy B.
Doh!  My answer focused on my interests rather than your question.  Sorry.  By 
restart, I now assume you mean system restart…  TIKA-894 should help with that 
if you configure your server container (tomcat?) to automatically start/restart.
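
In the meantime, if you want the standalone jar to survive reboots on CentOS 6, a bare-bones upstart job is one option -- a sketch only, the install path and jar version below are assumptions:

# /etc/init/tika-server.conf
description "Apache Tika JAX-RS server"
start on runlevel [2345]
stop on runlevel [016]
respawn
exec java -jar /opt/tika/tika-server-1.7.jar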

From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, January 08, 2015 8:28 AM
To: user@tika.apache.org
Subject: RE: Running tika-server as a service

Peter,
  I don’t have any immediate solutions, but there are two options in the 
pipeline (probably Tika 1.8):


1)  Lewis John McGibbney on TIKA-894 is going to add a war/webapp.

2)  I plan to open an issue related to TIKA-1330 that will make our current 
jax-rs tika-server more robust to OOM and permanent hangs, i.e. the server 
process will shut itself down if it encounters either of these, and a watcher 
process will restart the server process… as currently happens in the dev 
version of TIKA-1330.

  This is an interest close to my heart, and I look forward to hearing how 
others are handling this.

  Best,

   Tim

From: Peter Bowyer [mailto:pe...@mapledesign.co.uk]
Sent: Thursday, January 08, 2015 6:47 AM
To: user@tika.apache.org
Subject: Running tika-server as a service

Hi,

I want to ensure tika-server is always running, and continues to after restarts 
etc.

I have a hacked together an init script (this being CentOS release 6.6) that 
seems to work (it's running, though not restarted the server yet to test) but 
it's an ugly way to manage things.

How do you keep tika-server running? A daemon manager like daemon tools?  
Handcrafted init.d/upstart/systemd scripts? Is anyone able to share what they 
use?

Thanks,
Peter


JAX-RS: SEVERE Problem with writing the data when parser hits exception?

2015-02-27 Thread Allison, Timothy B.
All,

I recently noticed that I'm getting this message logged when there is an 
exception during parsing:

SEVERE: Problem with writing the data, class 
org.apache.tika.server.TikaResource$5, ContentType: text/html

We didn't get this message with Tika 1.6, but we are getting this with Tika 1.7 
and trunk.
Is this to be expected?

Full stack trace is below.  The test document that triggered this is an 
encrypted PDF document.




WARNING: tika: Text extraction failed
org.apache.tika.exception.TikaException: Unable to extract PDF content
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:150)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:146)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256
)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:117
)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256
)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:1
20)
at org.apache.tika.server.TikaResource$5.write(TikaResource.java:368)
at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataPr
ovider.java:164)
at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.jav
a:1363)
at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage
(JAXRSOutInterceptor.java:244)
at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(
JAXRSOutInterceptor.java:117)
at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JA
XRSOutInterceptor.java:80)
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseIntercept
orChain.java:307)
at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(Out
goingChainInterceptor.java:83)
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseIntercept
orChain.java:307)
at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainIniti
ationObserver.java:121)
at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(Abstract
HTTPDestination.java:251)
at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(Je
ttyHTTPDestination.java:261)
at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTP
Handler.java:70)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandl
er.java:1088)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandle
r.java:1024)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.j
ava:135)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(Cont
extHandlerCollection.java:255)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper
.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:370)
at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(Abstrac
tHttpConnection.java:494)
at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpC
onnection.java:982)
at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.conten
t(AbstractHttpConnection.java:1043)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)

at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnecti
on.java:82)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEn
dPoint.java:696)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEnd
Point.java:53)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPoo
l.java:608)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool
.java:543)
at java.lang.Thread.run(Unknown Source)
Caused by: java.io.IOException
at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:109)
at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:379)
at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:291)
at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:22
5)
at org.apache.pdfbox.pdfparser.PDFStreamParser.init(PDFStreamParser.ja
va:117)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngi
ne.java:251)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngi
ne.java:235)
at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.
java:215)
at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.ja
va:460)
at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.j
ava:385)
at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java
:344)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:134)
... 35 more
Caused by: 

RE: Odp.: solr issue with pdf forms

2015-04-29 Thread Allison, Timothy B.
I completely agree with Erick about the utility of the TermsComponent to see 
what is actually being indexed.  If you find problems there and if you haven't 
done so already, you might also investigate further down the stack.  It might 
make sense to run the tika-app.jar (whichever version you are using in DIH or 
other mechanism?) or even the pdfbox-app.jar (ExtractText option) on your files 
outside of Solr to see what text/noise you're getting for the files that are 
causing problems.
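
For example (jar names/versions here are placeholders -- use whichever ones your setup actually has):

java -jar tika-app-1.8.jar -t problem.pdf > tika-out.txt
java -jar pdfbox-app-1.8.9.jar ExtractText problem.pdf pdfbox-out.txt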



-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Tuesday, April 28, 2015 9:07 PM
To: solr-u...@lucene.apache.org
Subject: Re: Odp.: solr issue with pdf forms

There better be.

1 go to the admin UI
2 select a core
3 select schema browser
4 select a field from the drop-down

Until you do step 4 the window will be pretty blank.

Here's the info for TermsComponent, what have you tried?

https://cwiki.apache.org/confluence/display/solr/The+Terms+Component

Best,
Erick

On Tue, Apr 28, 2015 at 1:04 PM,  steve.sch...@t-systems.com wrote:
 Thanks a lot for being patient with me. Unfortunately there is no button 
 load term info. :-(
 Can you may be help me using the TermsComponent instead? I read it is per 
 default configured.

 Thanks a lot
 Best
 Steve

 -Ursprüngliche Nachricht-
 Von: Erick Erickson [mailto:erickerick...@gmail.com]
 Gesendet: Montag, 27. April 2015 17:23
 An: solr-u...@lucene.apache.org
 Betreff: Re: Odp.: solr issue with pdf forms

 We're still not quite there. There should be a load term info button on 
 that page. Clicking that button will show you the terms in your index (as 
 opposed to the raw stored input which is what you get when you look at 
 results in the browser). My bet is that you'll see perfectly normal tokens in 
 the index that will NOT have the wonky characters you see in the display.

 If that's the case, then you have a browser issue, Solr is working perfectly 
 fine. On the other hand, if the individual terms are weird, then you have 
 something more fundamental going on.

 Which is why I mentioned the TermsComponent. That will return indexed tokens, 
 and allows you a bit more flexibility than the admin page in terms of what 
 tokens you see, but it's essentially the same information.

 Best,
 Erick

 On Sun, Apr 26, 2015 at 11:18 PM,  steve.sch...@t-systems.com wrote:
 Erick,

 thanks a lot for helping me here. In my case it is the content field which 
 is not displayed correctly. So I went to the schema browser like you pointed 
 out. Here is the information I found:
 Field: content
 Field Type: text
 Properties:  Indexed, Tokenized, Stored, TermVector Stored
 Schema:  Indexed, Tokenized, Stored, TermVector Stored
 Index:  Indexed, Tokenized, Stored, TermVector Stored Copied Into:
 spell teaser Position Increment Gap:  100 Index Analyzer:
 org.apache.solr.analysis.TokenizerChain Details Tokenizer Class:
 org.apache.solr.analysis.WhitespaceTokenizerFactory
 Filters:
 org.apache.solr.analysis.WordDelimiterFilterFactory
 args:{preserveOriginal: 1 splitOnCaseChange: 0 generateNumberParts: 1
 catenateWords: 1 luceneMatchVersion: LUCENE_36 generateWordParts: 1
 catenateAll: 0 catenateNumbers: 1 }
 org.apache.solr.analysis.LowerCaseFilterFactory
 args:{luceneMatchVersion: LUCENE_36 }
 org.apache.solr.analysis.SynonymFilterFactory args:{synonyms:
 german/synonyms.txt expand: true ignoreCase: true luceneMatchVersion:
 LUCENE_36 }
 org.apache.solr.analysis.DictionaryCompoundWordTokenFilterFactory
 args:{maxSubwordSize: 15 onlyLongestMatch: false minSubwordSize: 4
 minWordSize: 5 dictionary: german/german-common-nouns.txt
 luceneMatchVersion: LUCENE_36 }
 org.apache.solr.analysis.StopFilterFactory args:{words:
 german/stopwords.txt ignoreCase: true enablePositionIncrements: true
 luceneMatchVersion: LUCENE_36 }
 org.apache.solr.analysis.GermanNormalizationFilterFactory
 args:{luceneMatchVersion: LUCENE_36 }
 org.apache.solr.analysis.SnowballPorterFilterFactory args:{protected:
 german/protwords.txt language: German2 luceneMatchVersion: LUCENE_36 }
 org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory
 args:{luceneMatchVersion: LUCENE_36 } Query Analyzer:
 org.apache.solr.analysis.TokenizerChain Details Tokenizer Class:
 org.apache.solr.analysis.WhitespaceTokenizerFactory
 Filters:
 org.apache.solr.analysis.WordDelimiterFilterFactory
 args:{preserveOriginal: 1 splitOnCaseChange: 0 generateNumberParts: 1
 catenateWords: 0 luceneMatchVersion: LUCENE_36 generateWordParts: 1
 catenateAll: 0 catenateNumbers: 0 }
 org.apache.solr.analysis.LowerCaseFilterFactory
 args:{luceneMatchVersion: LUCENE_36 }
 org.apache.solr.analysis.StopFilterFactory args:{words:
 german/stopwords.txt ignoreCase: true enablePositionIncrements: true
 luceneMatchVersion: LUCENE_36 }
 org.apache.solr.analysis.GermanNormalizationFilterFactory
 args:{luceneMatchVersion: LUCENE_36 }
 org.apache.solr.analysis.SnowballPorterFilterFactory args:{protected:
 german/protwords.txt 

RE: Odp.: solr issue with pdf forms

2015-04-30 Thread Allison, Timothy B.
Is that a literal ^ followed by H?  Out of curiosity, is 
Bitte^Hlegen^HSie^Hdem^HAntrag indexed as one token, or is it indexed as (I 
guess it depends on your analysis chain...):
 
Bitte
Hlegen
HSie
Hdem
HAntrag

Might want to open an issue on PDFBox's jira.  Some things can be easily fixed; 
sometimes the text within the PDF file is just plain corrupt. :)

Cheers,

Tim

-Original Message-
From: steve.sch...@t-systems.com [mailto:steve.sch...@t-systems.com] 
Sent: Thursday, April 30, 2015 3:03 AM
To: solr-u...@lucene.apache.org
Subject: AW: Odp.: solr issue with pdf forms

Hey, thanks a lot for the hint with pdfbox-app.jar.
For testing purpose I now extracted a affected pdf form and a usual pdf file.
The result ist he following:

Usual pdf file:
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod 
tempor invidunt ut
labore et d

pdf form:
Bitte^Hlegen^HSie^Hdem^HAntrag Kopien aller Einkommensnachweise bei.^HDaz

Best
Steve

-Ursprüngliche Nachricht-
Von: Allison, Timothy B. [mailto:talli...@mitre.org] 
Gesendet: Mittwoch, 29. April 2015 14:16
An: solr-u...@lucene.apache.org
Cc: user@tika.apache.org
Betreff: RE: Odp.: solr issue with pdf forms

I completely agree with Erick about the utility of the TermsComponent to see 
what is actually being indexed.  If you find problems there and if you haven't 
done so already, you might also investigate further down the stack.  It might 
make sense to run the tika-app.jar (whichever version you are using in DIH or 
other mechanism?) or even the pdfbox-app.jar (ExtractText option) on your files 
outside of Solr to see what text/noise you're getting for the files that are 
causing problems.



-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Tuesday, April 28, 2015 9:07 PM
To: solr-u...@lucene.apache.org
Subject: Re: Odp.: solr issue with pdf forms

There better be.

1 go to the admin UI
2 select a core
3 select schema browser
4 select a field from the drop-down

Until you do step 4 the window will be pretty blank.

Here's the info for TermsComponent, what have you tried?

https://cwiki.apache.org/confluence/display/solr/The+Terms+Component

Best,
Erick

On Tue, Apr 28, 2015 at 1:04 PM,  steve.sch...@t-systems.com wrote:
 Thanks a lot for being patient with me. Unfortunately there is no 
 button load term info. :-( Can you may be help me using the TermsComponent 
 instead? I read it is per default configured.

 Thanks a lot
 Best
 Steve

 -Ursprüngliche Nachricht-
 Von: Erick Erickson [mailto:erickerick...@gmail.com]
 Gesendet: Montag, 27. April 2015 17:23
 An: solr-u...@lucene.apache.org
 Betreff: Re: Odp.: solr issue with pdf forms

 We're still not quite there. There should be a load term info button on 
 that page. Clicking that button will show you the terms in your index (as 
 opposed to the raw stored input which is what you get when you look at 
 results in the browser). My bet is that you'll see perfectly normal tokens in 
 the index that will NOT have the wonky characters you see in the display.

 If that's the case, then you have a browser issue, Solr is working perfectly 
 fine. On the other hand, if the individual terms are weird, then you have 
 something more fundamental going on.

 Which is why I mentioned the TermsComponent. That will return indexed tokens, 
 and allows you a bit more flexibility than the admin page in terms of what 
 tokens you see, but it's essentially the same information.

 Best,
 Erick

 On Sun, Apr 26, 2015 at 11:18 PM,  steve.sch...@t-systems.com wrote:
 Erick,

 thanks a lot for helping me here. In my case it ist he content field which 
 is displayed not correctly. So I went tot he schema browser like you pointed 
 out. Here ist he information I found:
 Field: content
 Field Type: text
 Properties:  Indexed, Tokenized, Stored, TermVector Stored
 Schema:  Indexed, Tokenized, Stored, TermVector Stored
 Index:  Indexed, Tokenized, Stored, TermVector Stored Copied Into:
 spell teaser Position Increment Gap:  100 Index Analyzer:
 org.apache.solr.analysis.TokenizerChain Details Tokenizer Class:
 org.apache.solr.analysis.WhitespaceTokenizerFactory
 Filters:
 org.apache.solr.analysis.WordDelimiterFilterFactory
 args:{preserveOriginal: 1 splitOnCaseChange: 0 generateNumberParts: 1
 catenateWords: 1 luceneMatchVersion: LUCENE_36 generateWordParts: 1
 catenateAll: 0 catenateNumbers: 1 }
 org.apache.solr.analysis.LowerCaseFilterFactory
 args:{luceneMatchVersion: LUCENE_36 } 
 org.apache.solr.analysis.SynonymFilterFactory args:{synonyms:
 german/synonyms.txt expand: true ignoreCase: true luceneMatchVersion:
 LUCENE_36 }
 org.apache.solr.analysis.DictionaryCompoundWordTokenFilterFactory
 args:{maxSubwordSize: 15 onlyLongestMatch: false minSubwordSize: 4
 minWordSize: 5 dictionary: german/german-common-nouns.txt
 luceneMatchVersion: LUCENE_36 }
 org.apache.solr.analysis.StopFilterFactory args:{words:
 german/stopwords.txt ignoreCase: true

FW: TIKA OCR not working

2015-04-27 Thread Allison, Timothy B.
Trung,

I haven't experimented with our OCR parser yet, but this should give a good 
start: https://wiki.apache.org/tika/TikaOCR .

Have you installed tesseract?

Tika colleagues,
  Any other tips?  What else has to be configured and how?
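
For what it's worth, outside of Solr a bare-bones check of the OCR path would look roughly like this (a sketch only; it assumes tesseract is installed, and the explicit path is only needed if the binary isn't already on the PATH -- the class name and path are placeholders):

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.sax.BodyContentHandler;

public class OcrSmokeTest {
    public static void main(String[] args) throws Exception {
        TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
        //assumption: directory holding the tesseract binary, if it isn't on the PATH
        ocrConfig.setTesseractPath("/usr/local/bin/");
        ParseContext context = new ParseContext();
        context.set(TesseractOCRConfig.class, ocrConfig);

        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        try (InputStream is = new FileInputStream(args[0])) {
            parser.parse(is, handler, metadata, context);
        }
        System.out.println(handler.toString());
    }
}

If that prints the text of a scanned PNG/TIFF, the Tika side is working and the question becomes how Solr Cell is wiring things up.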

-Original Message-
From: trung.ht [mailto:trung...@anlab.vn] 
Sent: Friday, April 24, 2015 11:22 PM
To: solr-u...@lucene.apache.org
Subject: Re: TIKA OCR not working

HI everyone,

Does anyone have the answer for this problem :)?


I saw the document of Tika. Tika 1.7 support OCR and Solr 5.0 use Tika 1.7,
 but it looks like it does not work. Does anyone know that TIKA OCR works
 automatically with Solr or I have to change some settings?


Trung.


 It's not clear if OCR would happen automatically in Solr Cell, or if
 changes to Solr would be needed.

 For Tika OCR info, see:

 https://issues.apache.org/jira/browse/TIKA-93
 https://wiki.apache.org/tika/TikaOCR



 -- Jack Krupansky

 On Thu, Apr 23, 2015 at 9:14 AM, Alexandre Rafalovitch 
 arafa...@gmail.com
 wrote:

  I think OCR is in Tika 1.8, so might be in Solr 5.?. But I haven't seen
 it
  in use yet.
 
  Regards,
  Alex
  On 23 Apr 2015 10:24 pm, Ahmet Arslan iori...@yahoo.com.invalid
 wrote:
 
   Hi Trung,
  
   I didn't know about OCR capabilities of tika.
   Someone who is familiar with sold-cell can inform us whether this
   functionality is added to solr or not.
  
   Ahmet
  
  
  
   On Thursday, April 23, 2015 2:06 PM, trung.ht trung...@anlab.vn
 wrote:
   Hi Ahmet,
  
   I used a png file, not a pdf file. From the document, I understand
 that
   solr will post the file to tika, and since tika 1.7, OCR is included.
 Is
   there something I misunderstood.
  
   Trung.
  
  
   On Thu, Apr 23, 2015 at 5:59 PM, Ahmet Arslan
 iori...@yahoo.com.invalid
  
   wrote:
  
Hi Trung,
   
solr-cell (tika) does not do OCR. It cannot exact text from image
 based
pdfs.
   
Ahmet
   
   
   
On Thursday, April 23, 2015 7:33 AM, trung.ht trung...@anlab.vn
  wrote:
   
   
   
Hi,
   
I want to use solr to index some scanned document, after settings
 solr
document with a two field content and filename, I tried to
 upload
  the
attached file, but it seems that the content of the file is only
 \n \n
\n.
But if I used the tesseract from command line I got the result
  correctly.
   
The log when solr receive my request:
---
INFO  - 2015-04-23 03:49:25.941;
org.apache.solr.update.processor.LogUpdateProcessor; [collection1]
webapp=/solr path=/update/extract
params={literal.groupid=2&json.nl=flat&resource.name=phplNiPrs&literal.id=4&commit=true&extractOnly=false&literal.historyid=4&omitHeader=true&literal.userid=3&literal.createddate=2015-04-22T15:00:00Z&fmap.content=content&wt=json&literal.filename=\\trunght\test\tesseract_3.png}
   

   
The document when I check on solr admin page:
-
{ groupid: 2, id: 4, historyid: 4, userid: 3,
 createddate:
2015-04-22T15:00:00Z, filename:
  trunght\\test\\tesseract_3.png,
autocomplete_text: [ trunght\\test\\tesseract_3.png ],
   content: 
\n \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n
 \n
  \n
\n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n
 \n
  ,
_version_: 1499213034586898400 }
   
---
   
Since I am a solr newbie I do not know where to look, can anyone
 give
  me
an advice for where to look for error or settings to make it work.
Thanks in advanced.
   
Trung.
   
  
 





RE: Indexing PDF and MS Office files

2015-04-16 Thread Allison, Timothy B.
This sounds like a Tika issue, let's move discussion to that list.

If you are still having problems after you upgrade to Tika 1.8, please at least 
submit the stack traces (if you can) to the Tika jira.  We may be able to find 
a document that triggers that stack trace in govdocs1 or the slice of 
CommonCrawl that Julien Nioche contributed to our eval effort.

Tika is not perfect and it will fail on some files, but we are always working 
to improve it.

Best,

  Tim

-Original Message-
From: Vijaya Narayana Reddy Bhoomi Reddy 
[mailto:vijaya.bhoomire...@whishworks.com] 
Sent: Thursday, April 16, 2015 7:44 AM
To: solr-u...@lucene.apache.org
Subject: Re: Indexing PDF and MS Office files

Thanks Allison.

I tried with the mentioned changes. But still no luck. I am using the code
from lucidworks site provided by Erick and now included the changes
mentioned by you. But still the issue persists with a small percentage of
documents (both PDF and MS Office documents) failing. Unfortunately, these
documents are proprietary and client-confidential and hence I am not sure
whether they can be uploaded into Jira.

These files normally open in Adobe Reader and MS Office tools.

Thanks  Regards
Vijay


On 16 April 2015 at 12:33, Allison, Timothy B. talli...@mitre.org wrote:

 I entirely agree with Erick -- it is best to isolate Tika in its own jvm
 if you can -- bad things can happen if you don't [1] [2].

 Erick's blog on SolrJ is fantastic.  If you want to have Tika parse
 embedded documents/attachments, make sure to set the parser in the
 ParseContext before parsing:

 ParseContext context = new ParseContext();
 //add this line:
 context.set(Parser.class, _autoParser);
  InputStream input = new FileInputStream(file);

 Tika 1.8 is soon to be released.  If that doesn't fix your problems,
 please submit stacktraces (and docs, if possible) to the Tika jira, and
 we'll try to make the fixes.

 Cheers,

 Tim

 [1]
 http://events.linuxfoundation.org/sites/events/files/slides/1s_and_0s_1.pdf
 [2]
 http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf
 -Original Message-
 From: Vijaya Narayana Reddy Bhoomi Reddy [mailto:
 vijaya.bhoomire...@whishworks.com]
 Sent: Thursday, April 16, 2015 7:10 AM
 To: solr-u...@lucene.apache.org
 Subject: Re: Indexing PDF and MS Office files

 Erick,

 I tried indexing both ways - SolrJ / Tika's AutoParser and as well as
 SolrCell's ExtractRequestHandler. Majority of the PDF and Word documents
 are getting parsed properly and indexed into Solr. However, a minority of
 them keep failing wither PDFParser or OfficeParser error.

 Not sure if this behaviour can be modified so that all the documents can be
 indexed. The business requirement we have is to index all the documents.
 However, if a small percentage of them fails, not sure what other ways
 exist to index them.

 Any help please?


 Thanks  Regards
 Vijay



 On 15 April 2015 at 15:20, Erick Erickson erickerick...@gmail.com wrote:

  There's quite a discussion here:
  https://issues.apache.org/jira/browse/SOLR-7137
 
  But, I personally am not a huge fan of pushing all the work on to Solr,
 in
  a
  production environment the Solr server is responsible for indexing,
  parsing the
  docs through Tika, perhaps searching etc. This doesn't scale all that
 well.
 
  So an alternative is to use SolrJ with Tika, which is totally independent
  of
  what version of Tika is on the Solr server. Here's an example.
 
  http://lucidworks.com/blog/indexing-with-solrj/
 
  Best,
  Erick
 
  On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy
  vijaya.bhoomire...@whishworks.com wrote:
   Thanks everyone for the responses. Now I am able to index PDF documents
   successfully. I have implemented manual extraction using Tika's
  AutoParser
   and PDF functionality is working fine. However,  the error with some MS
   office word documents still persist.
  
   The error message is java.lang.IllegalArgumentException: This
 paragraph
  is
   not the first one in the table which will eventually result in
  Unexpected
   RuntimeException from org.apache.tika.parser.microsoft.OfficeParser
  
   Upon some reading, it looks like its a bug with Tika 1.5 and seems to
  have
   been fixed with Tika 1.6 (
  https://issues.apache.org/jira/browse/TIKA-1251 ).
   I am new to Solr / Tika and hence wondering whether I can change the
 Tika
   library alone to v1.6 without impacting any of the libraries within
 Solr
   4.10.2? Please let me know your response and how to get away with this
   issue.
  
   Many thanks in advance.
  
   Thanks  Regards
   Vijay
  
  
   On 15 April 2015 at 05:14, Shyam R shyam.reme...@gmail.com wrote:
  
   Vijay,
  
   You could try different excel files with different formats to rule out
  the
   issue is with TIKA version being used.
  
   Thanks
   Murthy
  
   On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes trhodes...@gmail.com
   wrote

RE: Indexing PDF and MS Office files

2015-04-16 Thread Allison, Timothy B.
Let's move this to the Tika users' list.  

I'm aware that [1] is quite common in govdocs1, and it might (?) be the source 
of your problem with MSWord files.

If you can share a stack trace, we'll be better able to diagnose.  

Best,

Tim


[1]
org.apache.poi.hslf.exceptions.HSLFException: java.util.zip.ZipException: 
unknown compression method
at org.apache.poi.hslf.blip.WMF.getData(WMF.java:65)
at 
o.a.t.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:264)



-Original Message-
From: Vijaya Narayana Reddy Bhoomi Reddy 
[mailto:vijaya.bhoomire...@whishworks.com] 
Sent: Thursday, April 16, 2015 9:17 AM
To: solr-u...@lucene.apache.org
Subject: Re: Indexing PDF and MS Office files

For MS Word documents, one common pattern for all failed documents I
noticed is that all of them contain embedded images (like scanned signature
images embedded into the documents. These documents are much like some
letterheads where someone scanned the signature image and then embedded
into the document along with the text) with in the documents.

For other documents which completed successfully, no images were present.
Just wondering if these are causing the issue.


Thanks  Regards
Vijay



On 16 April 2015 at 12:58, Vijaya Narayana Reddy Bhoomi Reddy 
vijaya.bhoomire...@whishworks.com wrote:

 Thanks Tim.

 I shall raise a Jira with the stack trace information.

 Thanks  Regards
 Vijay


 On 16 April 2015 at 12:54, Allison, Timothy B. talli...@mitre.org wrote:

 This sounds like a Tika issue, let's move discussion to that list.

 If you are still having problems after you upgrade to Tika 1.8, please at
 least submit the stack traces (if you can) to the Tika jira.  We may be
 able to find a document that triggers that stack trace in govdocs1 or the
 slice of CommonCrawl that Julien Nioche contributed to our eval effort.

 Tika is not perfect and it will fail on some files, but we are always
 working to improve it.

 Best,

   Tim

 -Original Message-
 From: Vijaya Narayana Reddy Bhoomi Reddy [mailto:
 vijaya.bhoomire...@whishworks.com]
 Sent: Thursday, April 16, 2015 7:44 AM
 To: solr-u...@lucene.apache.org
 Subject: Re: Indexing PDF and MS Office files

 Thanks Allison.

 I tried with the mentioned changes. But still no luck. I am using the code
 from lucidworks site provided by Erick and now included the changes
 mentioned by you. But still the issue persists with a small percentage of
 documents (both PDF and MS Office documents) failing. Unfortunately, these
 documents are proprietary and client-confidential and hence I am not sure
 whether they can be uploaded into Jira.

 These files normally open in Adobe Reader and MS Office tools.

 Thanks  Regards
 Vijay


 On 16 April 2015 at 12:33, Allison, Timothy B. talli...@mitre.org
 wrote:

  I entirely agree with Erick -- it is best to isolate Tika in its own jvm
  if you can -- bad things can happen if you don't [1] [2].
 
  Erick's blog on SolrJ is fantastic.  If you want to have Tika parse
  embedded documents/attachments, make sure to set the parser in the
  ParseContext before parsing:
 
  ParseContext context = new ParseContext();
  //add this line:
  context.set(Parser.class, _autoParser)
   InputStream input = new FileInputStream(file);
 
  Tika 1.8 is soon to be released.  If that doesn't fix your problems,
  please submit stacktraces (and docs, if possible) to the Tika jira, and
  we'll try to make the fixes.
 
  Cheers,
 
  Tim
 
  [1]
 
 http://events.linuxfoundation.org/sites/events/files/slides/1s_and_0s_1.pdf
  [2]
 
 http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf
  -Original Message-
  From: Vijaya Narayana Reddy Bhoomi Reddy [mailto:
  vijaya.bhoomire...@whishworks.com]
  Sent: Thursday, April 16, 2015 7:10 AM
  To: solr-u...@lucene.apache.org
  Subject: Re: Indexing PDF and MS Office files
 
  Erick,
 
  I tried indexing both ways - SolrJ / Tika's AutoParser and as well as
  SolrCell's ExtractRequestHandler. Majority of the PDF and Word documents
  are getting parsed properly and indexed into Solr. However, a minority
 of
  them keep failing wither PDFParser or OfficeParser error.
 
  Not sure if this behaviour can be modified so that all the documents
 can be
  indexed. The business requirement we have is to index all the documents.
  However, if a small percentage of them fails, not sure what other ways
  exist to index them.
 
  Any help please?
 
 
  Thanks  Regards
  Vijay
 
 
 
  On 15 April 2015 at 15:20, Erick Erickson erickerick...@gmail.com
 wrote:
 
   There's quite a discussion here:
   https://issues.apache.org/jira/browse/SOLR-7137
  
   But, I personally am not a huge fan of pushing all the work on to
 Solr,
  in
   a
   production environment the Solr server is responsible for indexing,
   parsing the
   docs through Tika, perhaps searching etc. This doesn't scale all that
  well.
  
   So

RE: Memory issues with PDF parser

2015-06-04 Thread Allison, Timothy B.
1)  Right, the npe is caused by the exception returning null when we call 
getMessage().  In TIKA-1605, we modified all code in the project to check for 
null returned by getMessage().  So, in the fixed version, you'll still get 
your good old IOException.  I can't tell from your stacktrace what caused the 
IOException.

2)  Y, regular builds of 1.9's app (and other modules) are available via 
Jenkins here: 
https://builds.apache.org/view/Tika/job/tika-trunk-jdk1.7/org.apache.tika$tika-app/

3)  Ok, makes sense.

For kicks, you may want to change opening the file to:
is = TikaInputStream.get(file)
or maybe:
is = TikaInputStream.get(file, metadata)

And you'll want to surround your closing of the IS in a try/catch block.  Or 
use IOUtils.closeQuietly.
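
Put together, roughly (a sketch only; "file" here is the java.io.File you were opening -- "input" in your snippet):

InputStream is = null;
try {
    metadata = new Metadata();
    contenthandler = new BodyContentHandler(-1);
    is = TikaInputStream.get(file, metadata);
    pdfparser = new PDFParser();
    context = new ParseContext();
    pdfparser.parse(is, contenthandler, metadata, context);
    docBody = contenthandler.toString();
} catch (Exception e) {
    logger.log(Level.SEVERE, e.getMessage(), e);
} finally {
    if (is != null) {
        try { is.close(); } catch (IOException ignore) {}
    }
}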

Finally, are you able to share the particular file that caused the IOException?
From: Mouthgalya Ganapathy [mailto:mouthgalya.ganapa...@fitchratings.com]
Sent: Thursday, June 04, 2015 10:20 AM
To: Allison, Timothy B.; talli...@apache.org
Cc: user@tika.apache.org; Sauparna Sarkar
Subject: RE: Memory issues with PDF parser

Hi Timothy,
Thanks for the prompt reply.


1.)Wouldn't fixing the null pointer exception in turn throw the IO 
exception? I saw that the null pointer exception was thrown inside the catch 
block of the IO exception? Any root cause for the IO exception??.

Is that also fixed?



I am including the code that threw the null pointer exception in Tika 1.8



Exception:
10:53:12,218 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129)) java.lang.NullPointerException
10:53:12,219 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))at 
org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:158)



Code in the pdf parser:
catch (IOException e) {
//nonseq parser throws IOException for bad password
//At the Tika level, we want the same exception to be thrown
if (e.getMessage().contains("Error (CryptographyException)")) {
metadata.set("pdf:encrypted", Boolean.toString(true));
throw new EncryptedDocumentException(e);
}


2.)Do you have a snapshot or beta version of tika 1.9 that I could try with 
our pdf corpus? It would also help in your developer testing.

3.)For the inline images, we have just set the defaults(which is to skip 
them as you had mentioned). I have not done any memory profiling till now. I 
will also try that.



Thanks,
MG

From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, June 04, 2015 7:19 AM
To: Mouthgalya Ganapathy; talli...@apache.org
Cc: user@tika.apache.org
Subject: RE: Memory issues with PDF parser

Hi Mouthgalya,
  We fixed that NPE in https://issues.apache.org/jira/browse/TIKA-1605, and the 
fix will be available in Tika 1.9, which should be out within a week.
  As for memory issues, we worked around a memory leak in PDFBox with static 
caching of fonts for Tika 1.7 (may have been 1.8), but there may be others.  
One potential memory hog is the processing of inline images within PDFs...have 
you configured Tika to pull those out (default is to skip them)?  Other than 
that, I'd recommend dropping a note to the PDFBox users list to get help in 
diagnosing memory consumption with PDFBox.  Have you tried any memory profiling?

  Best,

Tim

From: Mouthgalya Ganapathy [mailto:mouthgalya.ganapa...@fitchratings.com]
Sent: Wednesday, June 03, 2015 3:25 PM
To: talli...@apache.org
Subject: Memory issues with PDF parser

Hi all,
I am trying to use Apache tika 1.8 for extracting contents from pdf. I have the 
below code for extracting it. It works well for few files. But if I read many 
files , I see out of memory exception.
I also see a Null pointer exception in the pdf parser. I think the null pointer 
exception is because of the memory exception.
Any suggestions?

Tika version:
  <dependency>
     <groupId>org.apache.tika</groupId>
     <artifactId>tika-server</artifactId>
     <version>1.8</version>
  </dependency>

I am running it as a part of J2EE APP in JBoss 1.7

Code:-

//Parse the pdf content using Apache Tikka
InputStream is = null;
try {
  is = new BufferedInputStream(new FileInputStream(input));
  //Disable write limit.
  contenthandler = new BodyContentHandler(-1);
   metadata = new Metadata();
  pdfparser = new PDFParser();
  context = new ParseContext();
  pdfparser.parse(is, contenthandler, metadata, context);
  docBody=contenthandler.toString();
  //System.out.println(contenthandler.toString());
}
catch (Exception e) {
   System.out.println("Exception in updating docbody for report == "
 + report.getDocID());
   if(is==null

RE: Memory issues with PDF parser

2015-06-04 Thread Allison, Timothy B.
You will get the same exception.  If you run the pure Tika app commandline on a 
triggering file, does it at least show you the caused by clause that might 
give more information?

Other question: Are you sure that you want to avoid parsing attachments?


From: Mouthgalya Ganapathy [mailto:mouthgalya.ganapa...@fitchratings.com]
Sent: Thursday, June 04, 2015 2:55 PM
To: Allison, Timothy B.
Cc: user@tika.apache.org; Sauparna Sarkar
Subject: RE: Memory issues with PDF parser

Thanks for the update Timothy,
I see that Tika 1.9.-SNAPSHOT is available in maven repo. I am going to try 
that and  will use TikaInputStreams. I will update the results.

Given below is the IO exception that I get when I use Autoparser to extract pdf 
contents. I had used Tika 1.6. and pdfbox 1.8.9. I am guessing I will get the 
same/similar exception when I am going to run it with 1.9-SNAPSHOT.

1:27:53,921 WARN  [org.hornetq.core.client.impl.ClientSessionImpl] (Thread-4 
(HornetQ-client-global-threads-248507153)) resetting session after failure
[Server:research-etl-server] 21:29:16,314 INFO  [stdout] (Thread-12 
(HornetQ-client-global-threads-248507153)) Exception in updating docbody for 
report == RPT_720610
[Server:research-etl-server] 21:29:23,817 ERROR [stderr] (Thread-12 
(HornetQ-client-global-threads-248507153)) 
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.pdf.PDFParser@29fe5969
[Server:research-etl-server] 21:29:23,818 ERROR [stderr] (Thread-12 
(HornetQ-client-global-threads-248507153)) at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:250)
[Server:research-etl-server] 21:29:23,818 ERROR [stderr] (Thread-12 
(HornetQ-client-global-threads-248507153)) at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
[Server:research-etl-server] 21:29:23,820 ERROR [stderr] (Thread-12 
(HornetQ-client-global-threads-248507153)) at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:121)
[Server:research-etl-server] 21:29:23,820 ERROR [stderr] (Thread-12 
(HornetQ-client-global-threads-248507153)) at 
com.fitch.researchapi.dao.ResearchReportMDAO.updateDocBody(ResearchReportMDAO.java:888)
[Server:research-etl-server] 21:29:23,820 ERROR [stderr] (Thread-12 
(HornetQ-client-global-threads-248507153)) at 
com.fitch.researchapi.dao.ResearchReportMDAO.loadFile_NEW(ResearchReportMDAO.java:983)
[Server:research-etl-server] 21:29:23,821 ERROR [stderr] (Thread-12 
(HornetQ-client-global-threads-248507153)) at 
com.fitch.researchapi.dao.ResearchReportMDAO.upsert_NEW(ResearchReportMDAO.java:678)
[Server:research-etl-server] 21:29:23,821 ERROR [stderr] (Thread-12 
(HornetQ-client-global-threads-248507153)) at 
com.fitch.research.ejb.ResearchReportManagerBean.processResearchReport(ResearchReportManagerBean.java:70)
[Server:research-etl-server] 21:29:23,822 ERROR [stderr] (Thread-12 
(HornetQ-client-global-threads-248507153)) at 
sun.reflect.GeneratedMethodAccessor38.invoke(Unknown Source)
[Server:research-etl-server] 21:29:23,822 WARN  
[org.hornetq.core.server.impl.ServerSessionImpl] (hornetq-failure-check-thread) 
Cleared up resources for session dc692df4-0a50-11e5-8aa3-005056900299
[Server:research-etl-server] 21:29:23,822 ERROR [stderr] (Thread-12 
(HornetQ-client-global-threads-248507153)) at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
[Server:research-etl-server] 21:29:23,823 ERROR [stderr] (Thread-12 
(HornetQ-client-global-threads-248507153)) at 
java.lang.reflect.Method.invoke(Method.java:597)
[Server:research-etl-server] 21:29:23,823 ERROR [stderr] (Thread-12 
(HornetQ-client-global-threads-248507153)) at 
org.jboss.as.ee.component.ManagedReferenceMethodInterceptorFactory$ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptorFactory.java:72)
[Server:research-etl-server] 21:29:23,823 ERROR [stderr] (Thread-12 
(HornetQ-client-global-threads-248507153)) at 
org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
[Server:research-etl-server] 21:29:23,824 ERROR [stderr] (Thread-12 
(HornetQ-client-global-threads-248507153)) at 
org.jboss.invocation.WeavedInterceptor.processInvocation(WeavedInterceptor.java:53)
[Server:research-etl-server] 21:29:23,824 ERROR [stderr] (Thread-12 
(HornetQ-client-global-threads-248507153)) at 
org.jboss.as.ee.component.interceptors.UserInterceptorFactory$1.processInvocation(UserInterceptorFactory.java:36)
[Server:research-etl-server] 21:29:23,824 ERROR [stderr] (Thread-12 
(HornetQ-client-global-threads-248507153)) at 
org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)



Thanks,
Mouthgalya Ganapathy
Product Development Team
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, June 04, 2015 12:50 PM
To: Mouthgalya Ganapathy
Cc: user@tika.apache.orgmailto:user@tika.apache.org; Sauparna Sarkar
Subject: RE: Memory issues

RE: Memory issues with PDF parser

2015-06-04 Thread Allison, Timothy B.
Hi Mouthgalya,
  We fixed that NPE in https://issues.apache.org/jira/browse/TIKA-1605, and the 
fix will be available in Tika 1.9, which should be out within a week.
  As for memory issues, we worked around a memory leak in PDFBox with static 
caching of fonts for Tika 1.7 (may have been 1.8), but there may be others.  
One potential memory hog is the processing of inline images within PDFs...have 
you configured Tika to pull those out (default is to skip them)?  Other than 
that, I'd recommend dropping a note to the PDFBox users list to get help in 
diagnosing memory consumption with PDFBox.  Have you tried any memory profiling?

  Best,

Tim

From: Mouthgalya Ganapathy [mailto:mouthgalya.ganapa...@fitchratings.com]
Sent: Wednesday, June 03, 2015 3:25 PM
To: talli...@apache.org
Subject: Memory issues with PDF parser

Hi all,
I am trying to use Apache tika 1.8 for extracting contents from pdf. I have the 
below code for extracting it. It works well for few files. But if I read many 
files , I see out of memory exception.
I also see a Null pointer exception in the pdf parser. I think the null pointer 
exception is because of the memory exception.
Any suggestions?

Tika version:
  <dependency>
     <groupId>org.apache.tika</groupId>
     <artifactId>tika-server</artifactId>
     <version>1.8</version>
  </dependency>

I am running it as a part of J2EE APP in JBoss 1.7

Code:-

//Parse the pdf content using Apache Tikka
InputStream is = null;
try {
  is = new BufferedInputStream(new FileInputStream(input));
  //Disable write limit.
  contenthandler = new BodyContentHandler(-1);
   metadata = new Metadata();
  pdfparser = new PDFParser();
  context = new ParseContext();
  pdfparser.parse(is, contenthandler, metadata, context);
  docBody=contenthandler.toString();
  //System.out.println(contenthandler.toString());
}
catch (Exception e) {
   System.out.println("Exception in updating docbody for report == "
 + report.getDocID());
   if(is==null)
  System.out.println("The input stream is a null object");
   e.printStackTrace();
  logger.log(Level.SEVERE, e.getMessage(), e);
}
finally {
if (is != null) is.close();
contenthandler=null;
metadata=null;
pdfparser=null;
context =null;
}


Exception:-
I am just including the null pointer exception in the parser below.

10:53:11,696 INFO  [stdout] (Thread-11 
(HornetQ-client-global-threads-1619682129)) Exception in updating docbody for 
report == RPT_764268
10:53:12,218 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129)) java.lang.NullPointerException
10:53:12,219 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))at 
org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:158)
10:53:12,219 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))at 
com.fitch.researchapi.dao.ResearchReportMDAO.updateDocBody(ResearchReportMDAO.java:881)
10:53:12,219 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))at 
com.fitch.researchapi.dao.ResearchReportMDAO.loadFile_NEW(ResearchReportMDAO.java:965)
10:53:12,220 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))at 
com.fitch.researchapi.dao.ResearchReportMDAO.upsert_NEW(ResearchReportMDAO.java:676)
10:53:12,220 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))at 
com.fitch.research.ejb.ResearchReportManagerBean.processResearchReport(ResearchReportManagerBean.java:70)
10:53:12,221 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))at 
sun.reflect.GeneratedMethodAccessor35.invoke(Unknown Source)
10:53:12,221 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
10:53:12,222 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))at 
java.lang.reflect.Method.invoke(Method.java:597)
10:53:12,222 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))at 
org.jboss.as.ee.component.ManagedReferenceMethodInterceptorFactory$ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptorFactory.java:72)
10:53:12,223 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))at 
org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
10:53:12,223 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))at 
org.jboss.invocation.WeavedInterceptor.processInvocation(WeavedInterceptor.java:53)
10:53:12,223 ERROR [stderr] (Thread-11 
(HornetQ-client-global-threads-1619682129))at 

RE: CSV Parser in Tika

2015-06-19 Thread Allison, Timothy B.
Y, that’s my belief.

As of now, we’re treating them as text files, which can lead to some really 
long = bogus tokens in Lucene/Solr with analyzers that don’t split on commas. ☹

Detection without filename would be difficult.





From: lewis john mcgibbney [mailto:lewi...@apache.org]
Sent: Friday, June 19, 2015 9:59 AM
To: user@tika.apache.org
Subject: CSV Parser in Tika

Hi Folks,
Am I correct in saying that we can't detect CSV in Tika?
We import commons-csv in tika-parsers/pom.xml, however I don't see a csv 
package and registered parser.
Also, when I use the webapp I get the following for a test csv file with 
semicolon ';' separators

Content-Encoding: ISO-8859-1
Content-Length: 217
Content-Type: text/plain; charset=ISO-8859-1
X-Parsed-By: org.apache.tika.parser.DefaultParser
resourceName: test-semicolon.csv
Any comments please?
Thanks
Lewis




xml vs html parser

2015-06-16 Thread Allison, Timothy B.
All,

  On govdocs1, the xml parser's exceptions accounted for nearly a quarter of 
all thrown exceptions at one point (Tika 1.7ish).  Typically, a file was 
mis-identified as xml when in fact it was sgml or some other text based file 
with some markup that wasn't meant to be xml.

  For kicks, I switched the config to use the HtmlParser for files identified 
as xml.  This got rid of the exceptions, but the content was quite different 
(ballpark 6k files out of 35k files had similarity < 0.95), mostly because of 
elisions ("the quick" -> "thequick"), and I assume this happens across tags...

  So, is there a way to make the XMLParser more lenient?  Or is there a way to 
configure the HtmlParser to add spaces for non-html tags?

  Or, is there a better solution?



 Thank you!



  Best,



 Tim



RE: Extract PDF inline images

2015-07-06 Thread Allison, Timothy B.
Hi Andrea,

  The RecursiveParserWrapper, as you found, is only for extracted content and 
metadata.   It was designed to cache metadata and content from embedded 
documents so that you can easily keep those two things together for each 
embedded document.

  To extract the raw bytes from embedded files, try implementing an 
EmbeddedDocumentExtractor and passing that into the ParseContext.  Take a look 
at 
http://svn.apache.org/repos/asf/tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/UnpackerResource.java
 and specifically the inner class MyEmbeddedDocument extractor for an example.  
As another example, look at 
http://svn.apache.org/repos/asf/tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java,
 and specifically the inner class: FileEmbeddedDocumentExtractor


Basically, in ParseEmbedded, just copy the InputStream to a FileOutputStream, 
and you should be good to go.

public boolean shouldParseEmbedded(Metadata metadata) {
return true;
}

public void parseEmbedded(InputStream inputStream, ContentHandler 
contentHandler, Metadata metadata, boolean b) throws SAXException, IOException {
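        // (a sketch of one possible body -- "outputDir" below is an assumed
        // java.io.File pointing at the folder you want the embedded files written to)
        String name = metadata.get(Metadata.RESOURCE_NAME_KEY);
        if (name == null) {
            name = "embedded-" + java.util.UUID.randomUUID().toString();
        }
        java.io.OutputStream os = new java.io.FileOutputStream(new java.io.File(outputDir, name));
        try {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = inputStream.read(buffer)) != -1) {
                os.write(buffer, 0, n);
            }
        } finally {
            os.close();
        }
    }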

  Best,

   Tim

From: Andrea Asta [mailto:asta.and...@gmail.com]
Sent: Monday, July 06, 2015 6:11 AM
To: user@tika.apache.org
Subject: Extract PDF inline images

Hello,
I'm trying to store the inline images from a PDF to a local folder, but can't 
find any valid example. I can only use the RecursiveParserWrapper to get all 
the available metadata, but not the binary image content.
This is my code:

RecursiveParserWrapper parser = new RecursiveParserWrapper(
  new AutoDetectParser(),
  new BasicContentHandlerFactory(HANDLER_TYPE.XML, -1)
);
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
PDFParserConfig config = new PDFParserConfig();
PDFParser p;
config.setExtractInlineImages(true);
config.setExtractUniqueInlineImagesOnly(false);
context.set(org.apache.tika.parser.pdf.PDFParserConfig.class, config);
context.set(org.apache.tika.parser.Parser.class, parser);

InputStream is = PdfRecursiveExample.class.getResourceAsStream("/BA200PDE.PDF");
//parsing the file
ToXMLContentHandler handler = new ToXMLContentHandler(new FileOutputStream(new 
File("out.txt")), "UTF-8");
parser.parse(is, handler, metadata, context);
How can I store each image file to a folder?
Thanks
Andrea


RE: robust Tika and Hadoop

2015-07-21 Thread Allison, Timothy B.
Thank you, Ken!

From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
Sent: Tuesday, July 21, 2015 10:23 AM
To: user@tika.apache.org
Subject: RE: robust Tika and Hadoop

Hi Tim,

Responses inline below.

-- Ken



From: Allison, Timothy B.

Sent: July 21, 2015 5:29:37am PDT

To: user@tika.apache.org

Subject: RE: robust Tika and Hadoop

Ken,
  To confirm your strategy: one new Thread for each call to Tika, add timeout 
exception handling, orphan the thread.

Correct.
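
In outline, the pattern looks roughly like this -- a sketch (java.util.concurrent) rather than the actual Bixo code; the 30-second timeout and the parser/stream/context variables are stand-ins:

FutureTask<String> task = new FutureTask<String>(new Callable<String>() {
    public String call() throws Exception {
        BodyContentHandler handler = new BodyContentHandler(-1);
        parser.parse(stream, handler, new Metadata(), context);
        return handler.toString();
    }
});
Thread t = new Thread(task);
t.setDaemon(true);          // so an orphaned thread can't keep the JVM alive
t.start();
try {
    String text = task.get(30, TimeUnit.SECONDS);
} catch (TimeoutException e) {
    task.cancel(true);      // interrupt if possible, then walk away (orphan the thread)
}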



Out of curiosity, three questions:
1)  If I had more time to read your code, the answer would be 
obvious...sorryHow are you organizing your ingest?  Are you concatenating 
files into a SequenceFile or doing something else?  Are you processing each 
file in a single map step, or batching files in your mapper?

Files are effectively concatenated, as each record (Cascading Tuple, or Hadoop 
KV pair) has the raw bytes plus a bunch of other data (headers returned, etc)

The parse phase is a map operation, so it's batch processing of all files 
successfully downloaded during that fetch loop.


2)  Somewhat related to the first question, in addition to orphaning the 
parsing thread, are you doing anything else, like setting maximum number of 
tasks per jvm?  Are you configuring max number of retries, etc?

If by tasks per JVM you mean the # of times we reuse the JVM, then yes - 
otherwise the orphan threads would eventually clog things up.

For retries, typically we don't set it (so defaults to 4), but in practice I'd 
recommend using something like 2 - so you get one retry, and then it fails, 
otherwise you typically fail four times on that error that could never possible 
happen but does.


3)  Are you adding the AutoDetectParser to your ParseContext so that you'll 
get content from embedded files?

No, not typically, as we're usually ignoring archive files. But that's a good 
point, with current versions of Tika we could now more easily handle those. It 
gets a bit tricky, though, as the UID for content is the URL, but now we'd have 
multiple sub-docs that we'd want to index separately.


From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
Sent: Monday, July 20, 2015 7:21 PM
To: user@tika.apache.org
Subject: RE: robust Tika and Hadoop

Hi Tim,

When we use Tika with Bixo (https://github.com/bixo/bixo/) we wrap it with a 
TikaCallable 
(https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/TikaCallable.java)

This lets us orphan the parsing thread if it times out 
(https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/SimpleParser.java#L187)

And provides a bit of protection against things like NoSuchMethodErrors that 
can be thrown by Tika if the mime-type detection code tries to use a parser 
that we exclude, in order to keep the Hadoop job jar size to something 
reasonable.

-- Ken



From: Allison, Timothy B.

Sent: July 15, 2015 4:38:56am PDT

To: user@tika.apache.org

Subject: robust Tika and Hadoop

All,

  I'd like to fill out our Wiki a bit more on using Tika robustly within 
Hadoop.  I'm aware of Behemoth [0], Nanite [1] and Morphlines [2].  I haven't 
looked carefully into these packages yet.

  Does anyone have any recommendations for specific configurations/design 
patterns that will defend against oom and permanent hangs within Hadoop?

  Thank you!

Best,

  Tim


[0] https://github.com/DigitalPebble/behemoth
[1] 
http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
[2] 
http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions  training
Hadoop, Cascading, Cassandra  Solr






--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions  training
Hadoop, Cascading, Cassandra  Solr







--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions  training
Hadoop, Cascading, Cassandra  Solr






--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions  training
Hadoop, Cascading, Cassandra  Solr







FW: error Unsupported Media Type : while implementing ContentStreamUpdateRequestExample from the link http://wiki.apache.org/solr/ContentStreamUpdateRequestExample

2015-07-22 Thread Allison, Timothy B.
What happens when you run straight tika-app against that pdf file?

java -jar tika-app.jar Sample.pdf

(grab tika-app from: http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.9.jar)

Do you have all of the tika jars on your classpath/properly configured within 
your Solr setup?

-Original Message-
From: Kathrincolyn [mailto:kathrinco...@yahoo.in] 
Sent: Wednesday, July 22, 2015 5:57 AM
To: tika-...@lucene.apache.org
Subject: Re: error Unsupported Media Type : while implementing 
ContentStreamUpdateRequestExample from the link 
http://wiki.apache.org/solr/ContentStreamUpdateRequestExample

public class SolrExampleTests {

  public static void main(String[] args) {
    try {
      //Solr cell can also index MS file (2003 version and 2007 version) types.
      String fileName = "c:/Sample.pdf";
      //this will be unique Id used by Solr to index the file contents.
      String solrId = "Sample.pdf";

      indexFilesSolrCell(fileName, solrId);

    } catch (Exception ex) {
      System.out.println(ex.toString());
    }
  }

  /**
   * Method to index all types of files into Solr.
   * @param fileName
   * @param solrId
   * @throws IOException
   * @throws SolrServerException
   */
  public static void indexFilesSolrCell(String fileName, String solrId)
      throws IOException, SolrServerException {

    String urlString = "http://localhost:8983/solr";
    SolrServer solr = new CommonsHttpSolrServer(urlString);

    ContentStreamUpdateRequest up
        = new ContentStreamUpdateRequest("/update/extract");

    up.addFile(new File(fileName));

    up.setParam("literal.id", solrId);
    up.setParam("uprefix", "attr_");
    up.setParam("fmap.content", "attr_content");

    up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

    solr.request(up);

    QueryResponse rsp = solr.query(new SolrQuery("*:*"));

    System.out.println(rsp);
  }
}

 Thanks
Ufindthem http://www.ufindthem.com  



--
View this message in context: 
http://lucene.472066.n3.nabble.com/error-Unsupported-Media-Type-while-implementing-ContentStreamUpdateRequestExample-from-the-link-httpe-tp4169035p4218516.html
Sent from the Apache Tika - Development mailing list archive at Nabble.com.


RE: robust Tika and Hadoop

2015-07-20 Thread Allison, Timothy B.
Thank you, Ken and Mark.  Will update wiki over the next few days!

From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
Sent: Monday, July 20, 2015 7:21 PM
To: user@tika.apache.org
Subject: RE: robust Tika and Hadoop

Hi Tim,

When we use Tika with Bixo (https://github.com/bixo/bixo/) we wrap it with a 
TikaCallable 
(https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/TikaCallable.java)

This lets us orphan the parsing thread if it times out 
(https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/SimpleParser.java#L187)

And provides a bit of protection against things like NoSuchMethodErrors that 
can be thrown by Tika if the mime-type detection code tries to use a parser 
that we exclude, in order to keep the Hadoop job jar size to something 
reasonable.

-- Ken



From: Allison, Timothy B.

Sent: July 15, 2015 4:38:56am PDT

To: user@tika.apache.org

Subject: robust Tika and Hadoop

All,

  I'd like to fill out our Wiki a bit more on using Tika robustly within 
Hadoop.  I'm aware of Behemoth [0], Nanite [1] and Morphlines [2].  I haven't 
looked carefully into these packages yet.

  Does anyone have any recommendations for specific configurations/design 
patterns that will defend against oom and permanent hangs within Hadoop?

  Thank you!

Best,

  Tim


[0] https://github.com/DigitalPebble/behemoth
[1] 
http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
[2] 
http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







RE: robust Tika and Hadoop

2015-07-21 Thread Allison, Timothy B.
Ken,
  To confirm your strategy: one new Thread for each call to Tika, add timeout 
exception handling, orphan the thread.

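For the archives, the pattern I'm picturing looks roughly like the sketch below -- not Ken's actual TikaCallable, just an illustration.  The 30-second timeout is arbitrary, and parser and stream are assumed to be in scope (and effectively final):

ExecutorService pool = Executors.newCachedThreadPool();
Future<String> future = pool.submit(new Callable<String>() {
    public String call() throws Exception {
        ContentHandler handler = new BodyContentHandler(-1);
        parser.parse(stream, handler, new Metadata(), new ParseContext());
        return handler.toString();
    }
});
String extractedText;
try {
    extractedText = future.get(30, TimeUnit.SECONDS);
} catch (TimeoutException e) {
    //give up on this document; the worker thread is orphaned if it ignores the interrupt
    future.cancel(true);
    throw e;
}
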
Out of curiosity, three questions:

1)  If I had more time to read your code, the answer would be 
obvious...sorryHow are you organizing your ingest?  Are you concatenating 
files into a SequenceFile or doing something else?  Are you processing each 
file in a single map step, or batching files in your mapper?

2)  Somewhat related to the first question, in addition to orphaning the 
parsing thread, are you doing anything else, like setting maximum number of 
tasks per jvm?  Are you configuring max number of retries, etc?

3)  Are you adding the AutoDetectParser to your ParseContext so that you'll 
get content from embedded files?

Thank you, again.

Best,

 Tim

From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
Sent: Monday, July 20, 2015 7:21 PM
To: user@tika.apache.org
Subject: RE: robust Tika and Hadoop

Hi Tim,

When we use Tika with Bixo (https://github.com/bixo/bixo/) we wrap it with a 
TikaCallable 
(https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/TikaCallable.java)

This lets us orphan the parsing thread if it times out 
(https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/SimpleParser.java#L187)

And provides a bit of protection against things like NoSuchMethodErrors that 
can be thrown by Tika if the mime-type detection code tries to use a parser 
that we exclude, in order to keep the Hadoop job jar size to something 
reasonable.

-- Ken



From: Allison, Timothy B.

Sent: July 15, 2015 4:38:56am PDT

To: user@tika.apache.org

Subject: robust Tika and Hadoop

All,

  I'd like to fill out our Wiki a bit more on using Tika robustly within 
Hadoop.  I'm aware of Behemoth [0], Nanite [1] and Morphlines [2].  I haven't 
looked carefully into these packages yet.

  Does anyone have any recommendations for specific configurations/design 
patterns that will defend against oom and permanent hangs within Hadoop?

  Thank you!

Best,

  Tim


[0] https://github.com/DigitalPebble/behemoth
[1] 
http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
[2] 
http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







robust Tika and Hadoop

2015-07-15 Thread Allison, Timothy B.
All,

  I'd like to fill out our Wiki a bit more on using Tika robustly within 
Hadoop.  I'm aware of Behemoth [0], Nanite [1] and Morphlines [2].  I haven't 
looked carefully into these packages yet.

  Does anyone have any recommendations for specific configurations/design 
patterns that will defend against oom and permanent hangs within Hadoop?

  Thank you!

Best,

  Tim


[0] https://github.com/DigitalPebble/behemoth
[1] 
http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
[2] 
http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/



RE: Inconsistent (buggy) behavior when using tika-server

2015-07-14 Thread Allison, Timothy B.
That looks like a bug in TikaUtils.

For whatever reason, when is.available() returns 0, we are then assuming that 
fileUrl is not null.  We need to check to make sure that fileUrl is not null 
before trying to open the file.

if (is.available() == 0 && !"".equals(fileUrl)) {
...

return TikaInputStream.get(new URL(fileUrl), metadata);
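
Something like the following (sketch only) should avoid trying to open a null fileUrl:

if (is.available() == 0 && fileUrl != null && !"".equals(fileUrl)) {
    return TikaInputStream.get(new URL(fileUrl), metadata);
}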

Would you mind opening a ticket on jira?

All,
  Is there a reason why an inputstream would return 0 for available() but still 
be readable?

Best,

   Tim


From: Malarout, Namrata (398M-Affiliate) [mailto:namrata.malar...@jpl.nasa.gov]
Sent: Tuesday, July 14, 2015 1:35 PM
To: user@tika.apache.org
Subject: Inconsistent (buggy) behavior when using tika-server

Hi Folks,
I am using Tika trunk (1.10-SNAPSHOT) and posting documents there. An example 
would be the following:


curl -T MOD09GA.A2014010.h30v12.005.2014012183944.vegetation_fraction.tif  
http://localhost:9998/meta --header "Accept: application/json"

...

curl -T MOD09GA.A2014010.h30v12.005.2014012183944.vegetation_fraction.tif  
http://localhost:9998/meta --header "Accept: application/rdf+xml"

...

curl -T MOD09GA.A2014010.h30v12.005.2014012183944.vegetation_fraction.tif  
http://localhost:9998/meta --header "Accept: text/csv"



I am using a python script to iterate through all the files in a folder. It 
works for about 50% to 80% of the files. For the rest it gives an error 500. 
When I post a file individually for which it previously failed (using the 
python script) it sometimes works. When done in an ad hoc manner, it works most 
of the time but fails sometimes. At times it is successful for 
application/rdf+xml format but fails for application/json format. The behavior 
is inconsistent.



Here is an example trace of when it does not work as expected [0]

A sample of the data being used can be found here [1]

Any help would be appreciated.



[0] https://paste.apache.org/lbAm



[1] 
https://drive.google.com/file/d/0B6wmo4_-H0P2eWJjdTdtYS1HRGs/view?usp=sharing



Thanks,

Namrata Malarout


RE: [VOTE] Apache Tika 1.11 Release Candidate #1

2015-10-21 Thread Allison, Timothy B.
+0 (some regressions in ppt content)

I just finished the batch comparison run on  ~1.8 million files in our govdocs1 
and commoncrawl corpora comparing Tika 1.10 to 1.11-rc1.  As a caveat, the eval 
code is still in development and there may be bugs in the reports.

Results are here: 
https://github.com/tballison/share/blob/master/tika_comparisons/tika_1_10_vs_1_11-rc1.zip
 

Key reports:
contents/content_diffs.csv (file had one corrupt row when viewing in 
Excel...manually deleted offending content)
exceptions/newExceptionsInBByMimeTypeByStackTrace.csv (small handful)
exceptions/fixedExceptionsInBByMimeType.csv  (none!)
mimes/mime_diffs_A_to_B.csv

On the positive side:
From "mime_diffs_A_to_B.csv", it looks like we are catching more pdfs as pdfs 
(that text/xhtml) than we were...great!  We're identifying more files as images 
(jpeg, pict) than as xhtml, and, from a quick look, this appears to be an 
improvement.  We have at least 9 new x-hwp-v5 (great!).

On the negative side:

1) We have a few regressions in ppt exceptions (six of the same aioobe).
2) We have regressions in ppt content (it looks like we're not adding a new 
line/word break where we need to).  The regressions are small per file, but 
they affect ~220 ppts out of ~1500 (~15%). 

Other than the regressions in ppt content, I'd be +1, but I don't think this is 
severe enough to warrant a re-spin.  Happy to look into a fix, though, if we 
want a re-spin...and even if we don't, I'll start looking into this asap.

-Original Message-
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov] 
Sent: Monday, October 19, 2015 10:23 AM
To: d...@tika.apache.org
Cc: user@tika.apache.org
Subject: [VOTE] Apache Tika 1.11 Release Candidate #1

Hi Folks,

A first candidate for the Tika 1.11 release is available at:

  https://dist.apache.org/repos/dist/dev/tika/

The release candidate is a zip archive of the sources in:
  http://svn.apache.org/repos/asf/tika/tags/1.11-rc1/

The SHA1 checksum of the archive is
d0dde7b3a4f1a2fb6ccd741552ea180dddab630a

In addition, a staged maven repository is available here:

https://repository.apache.org/content/repositories/orgapachetika-1014/


Please vote on releasing this package as Apache Tika 1.11.
The vote is open for the next 72 hours and passes if a majority of at least 
three +1 Tika PMC votes are cast.

[ ] +1 Release this package as Apache Tika 1.11
[ ] -1 Do not release this package because…

Cheers,
Chris

P.S. Of course here is my +1.



++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion 
Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department University of Southern 
California, Los Angeles, CA 90089 USA
++





RE: Tika unable to extract PDF Text

2015-10-14 Thread Allison, Timothy B.
File works with Tika trunk.  What's on your classpath: tika-app or just 
tika-core?  Is there a chance that you don't have tika-parsers on your cp?


-Original Message-
From: Adam Retter [mailto:adam.ret...@googlemail.com] 
Sent: Wednesday, October 14, 2015 12:14 PM
To: user@tika.apache.org
Subject: Tika unable to extract PDF Text

I have a PDF which was created using Apache PDF Box 2.0.0-SNAPSHOT.
Unfortunately Tika 1.10 seems unable to extract any text from the PDF, I don't 
get any exceptions or errors. The code is as simple as:

new Tika().parseToString(new FileInputStream(f))

Tika is always returning just the empty string.

The PDF is available here - http://static.adamretter.org.uk/adam-1.pdf

Any ideas?

--
Adam Retter

skype: adam.retter
tweet: adamretter
http://www.adamretter.org.uk


RE: Extract PDF inline images

2015-07-07 Thread Allison, Timothy B.
Andrea,
  I’m about to commit an example (see TIKA-1674).  In about 10 minutes, look 
for org.apache.tika.example.ExtractEmbeddedFiles in the tika-examples module.
  I’m still a bit stumped though on why my example isn’t working recursively.  
It is only pulling out the children of the input document.  Stay tuned to 
TIKA-1674 for follow up on that.

   Best,

  Tim

From: Andrea Asta [mailto:asta.and...@gmail.com]
Sent: Tuesday, July 07, 2015 6:22 AM
To: user@tika.apache.org
Subject: Re: Extract PDF inline images

Hi Tim,
thanks for your response, but I can't find a complete solution.
I've created a class using the same FileEmbeddedDocumentExtractor from TikaCLI, 
and now I'm trying to do a sample main program with a PDF containing some 
images.
This is my code, but I can't have any image stored and the methods of 
DocumentExtractor are never called using debugger.
Thanks
Andrea

RecursiveParserWrapper parser = new RecursiveParserWrapper(
  new AutoDetectParser(),
  new BasicContentHandlerFactory(HANDLER_TYPE.XML, -1)
);
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();

FileEmbeddedDocumentExtractor extractor = new FileEmbeddedDocumentExtractor();
context.set(FileEmbeddedDocumentExtractor.class, extractor);

PDFParserConfig config = new PDFParserConfig();
config.setExtractInlineImages(true);
config.setExtractUniqueInlineImagesOnly(true);
context.set(org.apache.tika.parser.pdf.PDFParserConfig.class, config);

context.set(org.apache.tika.parser.Parser.class, new AutoDetectParser());

InputStream is = PdfRecursiveExample.class.getResourceAsStream("/my.PDF");
ToXMLContentHandler handler = new ToXMLContentHandler(new FileOutputStream(new File("out.txt")), "UTF-8");
parser.parse(is, handler, metadata, context);

2015-07-06 12:59 GMT+02:00 Allison, Timothy B. talli...@mitre.org:
Hi Andrea,

  The RecursiveParserWrapper, as you found, is only for extracted content and 
metadata.   It was designed to cache metadata and content from embedded 
documents so that you can easily keep those two things together for each 
embedded document.

  To extract the raw bytes from embedded files, try implementing an 
EmbeddedDocumentExtractor and passing that into the ParseContext.  Take a look 
at 
http://svn.apache.org/repos/asf/tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/UnpackerResource.java
 and specifically the inner class MyEmbeddedDocument extractor for an example.  
As another example, look at 
http://svn.apache.org/repos/asf/tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java,
 and specifically the inner class: FileEmbeddedDocumentExtractor


Basically, in ParseEmbedded, just copy the InputStream to a FileOutputStream, 
and you should be good to go.

public boolean shouldParseEmbedded(Metadata metadata) {
    return true;
}

public void parseEmbedded(InputStream inputStream, ContentHandler contentHandler,
        Metadata metadata, boolean outputHtml) throws SAXException, IOException {
    //sketch only: pick a target name (e.g. the resource name from the metadata --
    //null/collision handling omitted) and copy the raw bytes out with java.nio Files;
    //'outputDir' stands in for whatever output directory you hold as a field
    File outputFile = new File(outputDir, metadata.get(Metadata.RESOURCE_NAME_KEY));
    Files.copy(inputStream, outputFile.toPath());
}

  Best,

   Tim

From: Andrea Asta [mailto:asta.and...@gmail.com]
Sent: Monday, July 06, 2015 6:11 AM
To: user@tika.apache.org
Subject: Extract PDF inline images

Hello,
I'm trying to store the inline images from a PDF to a local folder, but can't 
find any valid example. I can only use the RecursiveParserWrapper to get all 
the available metadata, but not the binary image content.
This is my code:

RecursiveParserWrapper parser = new RecursiveParserWrapper(
  new AutoDetectParser(),
  new BasicContentHandlerFactory(HANDLER_TYPE.XML, -1)
);
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
PDFParserConfig config = new PDFParserConfig();
PDFParser p;
config.setExtractInlineImages(true);
config.setExtractUniqueInlineImagesOnly(false);
context.set(org.apache.tika.parser.pdf.PDFParserConfig.class, config);
context.set(org.apache.tika.parser.Parser.class, parser);

InputStream is = PdfRecursiveExample.class.getResourceAsStream("/BA200PDE.PDF");
//parsing the file
ToXMLContentHandler handler = new ToXMLContentHandler(new FileOutputStream(new File("out.txt")), "UTF-8");
parser.parse(is, handler, metadata, context);
How can I store each image file to a folder?
Thanks
Andrea



RE: TikaConfig with constructor args

2015-08-27 Thread Allison, Timothy B.
That’s on my todo list (TIKA-1508).  Unfortunately, that doesn’t exist yet.  
I’d recommend for now following the pattern of the PDFParser or the 
TesseractOCRParser.  The config is driven by a properties file.

As soon as my dev laptop becomes unbricked, I’m going to turn to TIKA-1508.  
Given my schedule, I’d hope to have this into tika trunk within the next few 
weeks.


From: Andrea Asta [mailto:asta.and...@gmail.com]
Sent: Thursday, August 27, 2015 4:38 AM
To: user@tika.apache.org
Subject: TikaConfig with constructor args

Hi all,
I've developed a new Parser for my custom file type.
This parser needs some configuration to init an external connection. Is there 
a way to specify the constructor params (or bean properties to set) in the Tika 
xml format?
Thanks
Andrea


RE: tesseract issue

2015-09-09 Thread Allison, Timothy B.
You can build from source if you have an interest (and the bandwidth, time and 
disk space) or pull a nightly build if you don’t want to wait for 1.11, for 
example: 
https://builds.apache.org/view/Tika/job/tika-trunk-jdk1.7/849/org.apache.tika$tika-app/

Thank you, Christian!

Best,

Tim

From: Brian Young [mailto:bwyoung.s...@gmail.com]
Sent: Wednesday, September 09, 2015 4:09 PM
To: user@tika.apache.org
Subject: Re: tesseract issue

Ah that is very good- thank you.  Looks like it will be in 1.11.



On Wed, Sep 9, 2015 at 4:00 PM, Christian Wolfe wrote:
Brian,

I submitted a patch for this bug that was accepted by the team - 
https://github.com/apache/tika/pull/56

I don't think it has made it to any release version.

On Wed, Sep 9, 2015 at 3:55 PM, Brian Young wrote:
Hello,

On OS X at least, tesseract and tessdata may not be under a common root.  e.g.:


/opt/local/share/tessdata

/opt/local/bin/tesseract



Unfortunately it looks like TesseractOCRParser does not accommodate this since 
there is only one configuration value that is used for finding the binary 
as well as setting the TESSDATA_PREFIX environment var.



Now, TESSDATA_PREFIX does not get set if I do not pass in the path on the 
config object.  However, even though tesseract is in my path, it isn't found 
when the ProcessBuilder executes unless I've given it the full path... which of 
course sets the TESSDATA_PREFIX to the wrong thing.



It seems like maybe it would be best to handle these as two separate 
configuration values?  But short of that and a new version of Tika, does anyone 
have any other advice?



Thank you

Brian












RE: RecursiveParser returning ContentHandler

2015-09-22 Thread Allison, Timothy B.
Y, that should be easy enough.  Instead of the metadata list, we can store a 
list of Metadata+Handler pairs, the current “getMetadata()” can be syntactic 
sugar around the new getMetadataAndHandlers().

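Something along these lines (names are placeholders, not a committed API):

public class ParseRecord {
    private final Metadata metadata;
    private final ContentHandler handler;

    public ParseRecord(Metadata metadata, ContentHandler handler) {
        this.metadata = metadata;
        this.handler = handler;
    }
    public Metadata getMetadata() { return metadata; }
    public ContentHandler getHandler() { return handler; }
}
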
Please open a ticket and we can discuss there.

Thank you.

Best,

   Tim



From: Andrea Asta [mailto:asta.and...@gmail.com]
Sent: Monday, September 21, 2015 8:00 AM
To: user@tika.apache.org
Subject: RecursiveParser returning ContentHandler

Hi,
I'm trying to build a custom Conversion API using Tika: it will just add 
"something before" and "something after" the Tika parsers.

In this scenario, I would like to build a mechanism that allows a custom object 
to be built from a parsing result. This can be done easily by working 
with a custom ContentHandler "transformer", but how can I achieve this result 
using a RecursiveParserWrapper? In this case I can only set a 
ContentHandlerFactory and the parser will just call the toString method and set 
it as metadata, is that right? Can we imagine something to get the entire 
ContentHandler object for each subfile instead of the result of the toString 
method?

Thanks
Andrea


RE: Maximizing performance when parsing a lot of files

2015-09-25 Thread Allison, Timothy B.
It's best to keep Tika in its own jvm.

If you are working filesystem to filesystem... The simplest thing to do would 
be to call tika-batch via the commandline of tika-app every so often.  By 
default, tika-batch will skip files that it has already processed if you run it 
again, but you will pay the small performance cost of crawling the entire 
directory with each run and checking whether there is an output file for each 
input file.
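
For example, something like the following kicks off a batch run that mirrors the input directory into the output directory (directory names are placeholders; check tika-app's --help for the exact flags in your version):

java -jar tika-app.jar -i /path/to/incoming -o /path/to/extracted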

If you think this is a common enough use case, and I do, I'm wondering if it 
would make sense for us to experiment with adding a WatchService to 
tika-batch...Scratch that...probably wouldn't scale ("This API is not designed 
for indexing a hard drive. Most file system implementations have native support 
for file change notification."[0]).  I'm wondering if we could have the crawler 
automatically rerun from the start directory until the user tells tika-batch to 
stop or unless there have been no new files processed in X minutes.
 
If you are going db to db...that's another area for growth in tika-batch.

Finally, the real "big data" solution is probably to go with Spark and friends.

[0] https://docs.oracle.com/javase/tutorial/essential/io/notification.html
-Original Message-
From: zahlenm...@gmx.de [mailto:zahlenm...@gmx.de] 
Sent: Friday, September 25, 2015 7:33 AM
To: user@tika.apache.org
Subject: Maximizing performance when parsing a lot of files

So I have thousands of files to be run by Tika. Unfortunately, these are not 
available at once but are "created" one by one. My tests have shown that the 
creator process is faster than Tika. So now I am wondering how I should combine 
creator and parser process to speed things up.
Btw. the creator is completly separate, otherwise I would include the parser 
calls directly in it. But this is not possible.
To achieve some kind of parallelism I thought of two options:
1) Spawn a new small Java code piece which parses a file
2) Send the file to Tika Jaxrs Server
But since the creator is so fast it would fire up multiple calls to Tika per 
second. On the other hand I don't want to wait for the creator to finish 
because it runs for hours and in the meantime I could already start parsing.
Any ideas?


RE: Questions about using AutoDetect and DigestParser

2016-01-05 Thread Allison, Timothy B.
>>Question1) Shouldn't this be more specific? Like PdfParser, 
>>OpenDocumentParser and so on.

Y, make sure to call metadata.getValues("X-Parsed-By") which returns an array of 
values and then iterate through that array to see the parsers that actually 
processed your doc.  If you call metadata.get(Property p), you only get the 
first value in the array.

>> Question2) I understand that there is the DigestingParser to add Md5 and 
>> Sha1 hashes to the metadata. But how can I "combine" the AutoDetectParser 
>> and the DigestingParser?

See DigestingParserTest [0] for exact code, but basically something like this:

Metadata m = new Metadata();
CommonsDigester.DigestAlgorithm[] algos = CommonsDigester.parse("md5,sha512");
Parser d = new DigestingParser(new AutoDetectParser(), new CommonsDigester(100, algos));

d.parse(stream, new BodyContentHandler(-1), m, new ParseContext());



[0] 
http://svn.apache.org/viewvc/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/DigestingParserTest.java?view=markup
-Original Message-
From: zahlenm...@gmx.de [mailto:zahlenm...@gmx.de] 
Sent: Tuesday, January 05, 2016 3:33 AM
To: user@tika.apache.org
Subject: Questions about using AutoDetect and DigestParser

Happy New Year everyone,
I have a small program for simple text and metadata extraction. It is really 
not more than this (in Scala):

val fileParser : AutoDetectParser = new AutoDetectParser()
val handler : WriteOutContentHandler = new WriteOutContentHandler(-1)
val metadata : Metadata = new Metadata()
val context : ParseContext = new ParseContext()

try {
fileParser.parse(stream, handler, metadata, context)
} catch ...

When I look at the metadata I always have this line: X-Parsed-By: 
org.apache.tika.parser.DefaultParser
Question1) Shouldn't this be more specific? Like PdfParser, OpenDocumentParser 
and so on.

Question2) I understand that there is the DigestingParser to add Md5 and Sha1 
hashes to the metadata. But how can I "combine" the AutoDetectParser and the 
DigestingParser?

Thanks so far
Kind regards


RE: Questions about using AutoDetect and DigestParser

2016-01-08 Thread Allison, Timothy B.
Sorry I couldn't help.  Please do let us know if you figure out what's going on.

Best,

 Tim

-Original Message-
From: zahlenm...@gmx.de [mailto:zahlenm...@gmx.de] 
Sent: Friday, January 08, 2016 3:43 AM
To: user@tika.apache.org
Subject: Re: Questions about using AutoDetect and DigestParser

Actually I think the test code is quite good to get an understanding of how the 
DigestingParser works.  I tried every combination I could think of, but I 
couldn't make it work. The code mirrors the unit test as close as possible 
(only the input stream is different). As it seems it is related to my use of 
Scala. If I find the time I will try it again with Java to further pinpoint the 
problem. In the meantime I think I'll stick to java.security.MessageDigest.

Kind regards

-Original Message-
Sent: Thursday, 07 January 2016 um 18:49:09 Uhr
From: "Allison, Timothy B." <talli...@mitre.org>
To: "user@tika.apache.org" <user@tika.apache.org>
Subject: RE: Questions about using AutoDetect and DigestParser

As for 1, y, sorry, that's a bug I've been meaning to fix... 

As for 2, you're right, the test code is fairly opaque.  Sorry.  The code below 
works when I put it in DigestingParserTest.

The behavior you're seeing with AutoDetectParser() happens when the 
AutoDetectParser fails to load parsers either via the config file or via SPI, 
which reads parsers to load from the Parser class' service file.  Is there any 
reason to think you're getting different SPI behavior with, say (= I don't know 
Scala, and I'm guessing...sorry)

val fileParser : Parser = new AutoDetectParser()

vs.

val fileParser : Parser = new DigestingParser(new AutoDetectParser(), digester)


I'm sure you've tried the following for kicks...(again, apologies for guessing)
val autoParser : AutoDetectParser = new AutoDetectParser()
val fileParser : DigestingParser = new DigestingParser(autoParser, 
digester)


Java unit test that works within DigestingParserTest:

@Test
public void testSimple() throws Exception {
    CommonsDigester.DigestAlgorithm[] algos =
            CommonsDigester.parse("md5,sha256,sha384,sha512");
    Metadata metadata = new Metadata();
    Parser d = new DigestingParser(new AutoDetectParser(),
            new CommonsDigester(UNLIMITED, algos));
    ContentHandler handler = new WriteOutContentHandler(-1);
    try (InputStream input =
            DigestingParserTest.class.getResourceAsStream("/test-documents/testPDF.pdf")) {
        d.parse(input, handler, metadata, new ParseContext());
    }

    String[] parsedBy = metadata.getValues("X-Parsed-By");
    for (String v : parsedBy) {
        System.out.println("Parsed by: " + v);
    }

    assertEquals("org.apache.tika.parser.DefaultParser", parsedBy[0]);
    assertEquals("org.apache.tika.parser.pdf.PDFParser", parsedBy[1]);
}


RE: Bypassing ExtractingRequestHandler

2016-06-14 Thread Allison, Timothy B.
Oh, wow.  Y, that's probably more than we'd want to support (unless any other 
Tika devs have an interest?)...very, very cool!


-Original Message-
From: Justin Lee [mailto:lee.justi...@gmail.com] 
Sent: Monday, June 13, 2016 5:05 PM
To: solr-u...@lucene.apache.org
Subject: Re: Bypassing ExtractingRequestHandler

Thanks everyone for the help and advice.  The SolrJ example makes sense to me.  
The import of SOLR-8166 was kind of mind boggling to me, but maybe I'll revisit 
after some time.

Tim: for context, I'm ultimately trying to create an external highlighter.
See https://issues.apache.org/jira/browse/SOLR-1397.  I want to store the 
bounding box (in PDF units) for each token in the extracted text stream.
Then when I get results from Solr using the above patch, I'll convert the
UTF-16 offsets into X/Y coordinates and perform highlighting as appropriate in 
the UI.  I like this approach because I get highlighting that accurately 
reflects the search, even when the search is complex (e.g. wildcards or 
proximity searches).

I think it would take quite a bit of thinking to get something general enough 
to add into Tika.  For example, what units?  Take a look at the discussion of 
what units to report offsets in here:
https://issues.apache.org/jira/browse/SOLR-1954 (see the comments by Robert 
Muir -- although whatever issues there are here they are the same as the 
offsets reported in the Term Vector Component, it would seem to me).  As 
another example, I'm just not sure what format is general enough to make sense 
for everybody.  I think I'll just create a mapping from UTF-16 offsets into 
(x1,y1) (x2,y2) pairs, dump it into a JSON blob, and store that in a NoSQL 
store.  Then, when I get Solr results, I'll look at the matching offsets, the 
JSON blob, and the original document and be on my merry way.  I'm happy to open 
a JIRA entry in Tika if you think this is a coherent request.

The other approach, I suppose, is to try to pass the information along during 
indexing and store as a token payload.  But it seems like the indexing 
interface is really text oriented.  I have also thought about using 
DelimitedPayloadTokenFilter, which will increase the index size I imagine (how 
much, though?) and require more customization of Solr internals.  I don't know 
which is the better approach.

On Mon, Jun 13, 2016 at 7:22 AM Allison, Timothy B. <talli...@mitre.org>
wrote:

>
>
>
> >Two things: Here's a sample bit of SolrJ code, pulling out the DB 
> >stuff
> should be straightforward:
> http://searchhub.org/2012/02/14/indexing-with-solrj/
>
> +1
>
> > We tend to prefer running Tika externally as it's entirely possible 
> > that Tika will crash or hang with certain files - and that will 
> > bring down Solr if you're running Tika within it.
>
> +1
>
> >> I want to make a small modification to Tika to get and save 
> >> additional data from my PDFs
> What info do you need, and if it is common enough, could you ask over 
> on Tika's JIRA and we'll try to add it directly?
>
>
>
>


RE: Weird spacing in words

2016-05-31 Thread Allison, Timothy B.
Sorry I couldn't help.

-Original Message-
From: Augusto Ribeiro Silva [mailto:a...@unsilo.com] 
Sent: Tuesday, May 31, 2016 9:10 AM
To: user@tika.apache.org
Subject: Re: Weird spacing in words

Hi, 

I do get the same result using pdfbox. I will open an issue over there.
Thanks for the help.

Best regards,
Augusto

> On 31 May 2016, at 14:35, Allison, Timothy B. <talli...@mitre.org> wrote:
> 
> PDFs don't necessarily include spaces.  In some (many?) cases, code has to do 
> the calculation of character widths and locations on the page to determine 
> whether or not to insert spaces.  If something goes wrong with the coordinate 
> calculations, you can get extra or missing spaces.
> 
> You could experiment with changing enableAutoSpace to false via the 
> PDFParserConfig, but I doubt that would fix the problem.
> 
> If you run straight PDFBox's app [1]
> 
> java -jar pdfbox-app...jar ExtractText file.pdf
> 
> Do you get the same spacing?  If so, please open an issue on PDFBox's issue 
> tracker.
> 
> 
> [1] http://mirror.reverse.net/pub/apache/pdfbox/2.0.1/pdfbox-app-2.0.1.jar
> 
> -Original Message-
> From: Augusto Ribeiro Silva [mailto:a...@unsilo.com] 
> Sent: Tuesday, May 31, 2016 7:36 AM
> To: user@tika.apache.org
> Subject: Weird spacing in words 
> 
> Hi all,
> 
> I am using TIKA java library to read the content of some PDFs and it seems 
> like it inserts some weird (hyphen-like) spacing. For example:
> The es tab lish ment of an in te grated Part ner Re la tion ship Man age ment 
> (PRM) sys tem can po ten tially ad dress sev eral as pets
> 
> I tried to extract text from the same PDF using the pdftotext command line 
> utility it extracts the text correctly:
> The establishment of an integrated Partner Relationship Management (PRM) 
> system can potentially address several aspects 
> 
> Does somebody have any idea why TIKA behaves in this way and any tips to 
> fixing it?
> 
> Best regards, 
> Augusto



RE: Preventing OutOfMemory exception

2016-02-08 Thread Allison, Timothy B.
In your actual code, are you using one BodyContentHandler for all of your 
files?  Or are you creating a new BodyContentHandler for each file?  If the 
former, then, y, there’s a problem with your code; if the latter, that’s not 
something I’ve seen before.

From: Steven White [mailto:swhite4...@gmail.com]
Sent: Monday, February 08, 2016 4:56 PM
To: user@tika.apache.org
Subject: Re: Preventing OutOfMemory exception

Hi Tim,

The code I showed is a minimal example code to show the issue I'm running into, 
which is: memory keeps on growing.

In production, the loop that you see will read files off a file system and 
parse them using the logic close to what I sowed.  I use 
contentHandler.toString() to get back the raw text so I can save it.  Even if I 
get rid of that call, I run into OOM.

Note that, if I test the exact same code against PDF or PPT or ODP or RTF (I 
still have far more formats to test) I do *NOT* see the OOM issue even when I 
increase the loop to 1000 -- memory usage remains steady and stable.  This is 
why in my original email I asked if there is an issue with XML files or with my 
code, such as if I'm failing to close / release something.

Here is the full call stack when I get the OOM:

  Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.lang.StringBuffer.ensureCapacityImpl(StringBuffer.java:338)
at java.lang.StringBuffer.append(StringBuffer.java:114)
at java.io.StringWriter.write(StringWriter.java:106)
at 
org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:93)
at 
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at 
org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:136)
at 
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at 
org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
at 
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at 
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at 
org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
at 
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at 
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at 
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at 
org.apache.tika.sax.TextContentHandler.characters(TextContentHandler.java:55)
at 
org.apache.tika.sax.TeeContentHandler.characters(TeeContentHandler.java:102)
at 
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at 
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.xerces.parsers.AbstractSAXParser.characters(Unknown Source)
at 
org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown 
Source)
at 
org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown
 Source)
at 
org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown 
Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(Unknown Source)
at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:72)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:136)

Thanks

Steve


On Mon, Feb 8, 2016 at 3:07 PM, Allison, Timothy B. 
<talli...@mitre.org<mailto:talli...@mitre.org>> wrote:
I’m not sure why you’d want to append document contents across documents into 
one handler.  Typically, you’d use a new ContentHandler and new Metadata object 
for each parse.  Calling “toString()” does not clear the content handler, and 
you should have 20 copies of the extracted content on your final loop.

There shouldn’t be any difference across file types in the fact that you are 
appending a new copy of the extracted text with each loop.  You might not be 
seeing the memory growth if your other file types aren’t big enough and if you 
are only doing 20 loops.

But the larger question…what are you trying to accomplish?

From: Steven White [mailto:swhite4...@gmail.com]

RE: Preventing OutOfMemory exception

2016-02-08 Thread Allison, Timothy B.
I’m not sure why you’d want to append document contents across documents into 
one handler.  Typically, you’d use a new ContentHandler and new Metadata object 
for each parse.  Calling “toString()” does not clear the content handler, and 
you should have 20 copies of the extracted content on your final loop.

There shouldn’t be any difference across file types in the fact that you are 
appending a new copy of the extracted text with each loop.  You might not be 
seeing the memory growth if your other file types aren’t big enough and if you 
are only doing 20 loops.

But the larger question…what are you trying to accomplish?

From: Steven White [mailto:swhite4...@gmail.com]
Sent: Monday, February 08, 2016 1:38 PM
To: user@tika.apache.org
Subject: Preventing OutOfMemory exception

Hi everyone,

I'm integrating Tika with my application and need your help to figure out if 
the OOM I'm getting is due to the way I'm using Tika or if it is an issue with 
parsing XML files.

The following example code is causing OOM on 7th iteration with -Xmx2g.  The 
test will pass with -Xmx4g.  The XML file I'm trying to parse is 51mb in size.  
I do not see this issue with other file types that I tested so far.  Memory 
usage keeps on growing with XML file types, but stays constant with other file 
types.

public class Extractor {
private BodyContentHandler contentHandler = new BodyContentHandler(-1);
private AutoDetectParser parser = new AutoDetectParser();
private Metadata metadata = new Metadata();

public String extract(File file) throws Exception {
InputStream stream = null;
try {
stream = TikaInputStream.get(file);
parser.parse(stream, contentHandler, metadata);
return contentHandler.toString();
}
finally {
stream.close();
}
}
}

public static void main(...) {
Extractor extractor = new Extractor();
File file = new File("C:\\temp\\test.xml");
for (int i = 0; i < 20; i++) {
extractor.extract(file);
}

Any idea if this is an issue with XML files or if the issue in my code?

Thanks

Steve



RE: Preventing OutOfMemory exception

2016-02-09 Thread Allison, Timothy B.
Same parser is ok to reuse…should even be ok in multithreaded applications.

Do not reuse ContentHandler or Metadata objects.
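
In sketch form:

AutoDetectParser parser = new AutoDetectParser();        //fine to share/reuse
for (File f : files) {
    Metadata metadata = new Metadata();                  //new per file
    ContentHandler handler = new BodyContentHandler(-1); //new per file
    try (InputStream stream = TikaInputStream.get(f.toPath())) {
        parser.parse(stream, handler, metadata, new ParseContext());
    }
    //do something with handler.toString() and metadata before they go out of scope
}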

As a side note, if you are handling a bunch of files from the wild in a 
production environment, I encourage separating Tika into a separate jvm vs 
tying it into any post processing – consider tika-batch and writing separate 
text files for each file processed (not so efficient, but exceedingly robust).  
If this is demo code or you know your document set well enough, you should be 
good to go with keeping Tika and your postprocessing steps in the same jvm.

From: Steven White [mailto:swhite4...@gmail.com]
Sent: Tuesday, February 09, 2016 10:35 AM
To: user@tika.apache.org
Subject: Re: Preventing OutOfMemory exception

Thanks Tim!!  You helped me find the defect in my code.

Yes, I'm using one BodyContentHandler.  When I changed my code to create a new 
BodyContentHandler for each XML file I'm parsing, I no longer see the OOM.  It 
is weird that I see this issue with XML files only.

For completeness, can you confirm if I have an issue in re-using a single 
instance of AutoDetectParser and Metadata throughout the life of my 
application?  The reason why I'm reusing a single instance is to cut down on 
overhead (I have yet to time this).

Steve


On Mon, Feb 8, 2016 at 8:33 PM, Allison, Timothy B. 
<talli...@mitre.org<mailto:talli...@mitre.org>> wrote:
In your actual code, are you using one BodyContentHandler for all of your 
files?  Or are you creating a new BodyContentHandler for each file?  If the 
former, then, y, there’s a problem with your code; if the latter, that’s not 
something I’ve seen before.

From: Steven White [mailto:swhite4...@gmail.com<mailto:swhite4...@gmail.com>]
Sent: Monday, February 08, 2016 4:56 PM
To: user@tika.apache.org<mailto:user@tika.apache.org>
Subject: Re: Preventing OutOfMemory exception

Hi Tim,

The code I showed is a minimal example code to show the issue I'm running into, 
which is: memory keeps on growing.

In production, the loop that you see will read files off a file system and 
parse them using the logic close to what I sowed.  I use 
contentHandler.toString() to get back the raw text so I can save it.  Even if I 
get rid of that call, I run into OOM.

Note that, if I test the exact same code against PDF or PPT or ODP or RTF (I 
still have far more formats to test) I do *NOT* see the OOM issue even when I 
increase the loop to 1000 -- memory usage remains steady and stable.  This is 
why in my original email I asked if there is an issue with XML files or with my 
code, such as if I'm failing to close / release something.

Here is the full call stack when I get the OOM:

  Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.lang.StringBuffer.ensureCapacityImpl(StringBuffer.java:338)
at java.lang.StringBuffer.append(StringBuffer.java:114)
at java.io.StringWriter.write(StringWriter.java:106)
at 
org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:93)
at 
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at 
org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:136)
at 
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at 
org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
at 
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at 
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at 
org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
at 
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at 
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at 
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at 
org.apache.tika.sax.TextContentHandler.characters(TextContentHandler.java:55)
at 
org.apache.tika.sax.TeeContentHandler.characters(TeeContentHandler.java:102)
at 
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at 
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at org.apache.xerces.parsers.AbstractSAXParser.characters(Unknown Source)
at 
org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown 
Source)
at 
org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown
 Source)
at 
org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown 
Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.pars

RE: Preventing OutOfMemory exception

2016-02-09 Thread Allison, Timothy B.
Tika can fail catastrophically (permanent hangs, memory leaks, oom and other 
surprises).  These problems happen very, very rarely, and we fix problems as 
soon as we can, but really bad things can happen – see, e.g. TIKA-1132, 
TIKA-1401, SOLR-7764, PDFBOX-2200 and [0] and [1].

Tika runs within memory in Solr Cell.  The good news is that Tika works so well 
that no one has gotten around to putting it into its own jvm in Solr Cell.  I’m 
active on the Solr list and have shared potential problems with running Tika in 
the same jvm several times over there.

So, the short answer is: with the exception of TIKA-1401, I don’t _know_ of 
specific vulnerabilities that would cause serious problems with Tika.  However, 
given what we’ve seen, I have little reason to believe that these issues won’t 
happen again…very, very rarely.

I added tika-batch, which you can run from the commandline of tika-app, to 
handle these catastrophic failures.  You can also wrap your own solution via 
ForkParser or other methods.

[0] 
http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf
[1] http://www.slideshare.net/gagravarr/whats-new-with-apache-tika
[2] 
http://mail-archives.apache.org/mod_mbox/lucene-dev/201507.mbox/%3cjira.12843538.1436367863000.133708.1436382786...@atlassian.jira%3E

From: Steven White [mailto:swhite4...@gmail.com]
Sent: Tuesday, February 09, 2016 5:37 PM
To: user@tika.apache.org
Subject: Re: Preventing OutOfMemory exception

Thanks for the confirmation Tim.

This is a production code, so ...

I'm a bit surprise why you suggest I keep the Tika code out-of-process as 
standalone application vs. directly using it from my app.  Are there known 
issues with Tika to prevent it from being used in a long running process?  Does 
Solr use Tika as an out-of-process application?  See 
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
 (I will also ask this question on the Solr mailing list).

A bit background about my application.  I am writing a file system crawler that 
will run 24x7xN-days uninterrupted.  The application monitors the file system 
once every N min. where N can be anywhere from 1 min and up for new files or 
updated files.  It will then send the file to Tika to extract the raw text and 
the raw text is than sent to Solr for indexing.  My file-system-crawler will 
not be recycled or stopped unless if the OS has to be restarted.  Thus, I 
expect it to run 24x7xN-days.  Finally, the file system is expected to be busy 
where on average there will be 10 new files added / updated per minute.  
Overall, I'm expecting to make at least 10 calls to Tika per min.

Steve


On Tue, Feb 9, 2016 at 12:07 PM, Allison, Timothy B. 
<talli...@mitre.org<mailto:talli...@mitre.org>> wrote:
Same parser is ok to reuse…should even be ok in multithreaded applications.

Do not reuse ContentHandler or Metadata objects.

As a side note, if you are handling a bunch of files from the wild in a 
production environment, I encourage separating Tika into a separate jvm vs 
tying it into any post processing – consider tika-batch and writing separate 
text files for each file processed (not so efficient, but exceedingly robust).  
If this is demo code or you know your document set well enough, you should be 
good to go with keeping Tika and your postprocessing steps in the same jvm.

From: Steven White [mailto:swhite4...@gmail.com<mailto:swhite4...@gmail.com>]
Sent: Tuesday, February 09, 2016 10:35 AM

To: user@tika.apache.org<mailto:user@tika.apache.org>
Subject: Re: Preventing OutOfMemory exception

Thanks Tim!!  You helped me find the defect in my code.

Yes, I'm using one BodyContentHandler.  When I changed my code to create a new 
BodyContentHandler for each XML file I'm parsing, I no longer see the OOM.  It 
is weird that I see this issue with XML files only.

For completeness, can you confirm if I have an issue in re-using a single 
instance of AutoDetectParser and Metadata throughout the life of my 
application?  The reason why I'm reusing a single instance is to cut down on 
overhead (I have yet to time this).

Steve


On Mon, Feb 8, 2016 at 8:33 PM, Allison, Timothy B. 
<talli...@mitre.org<mailto:talli...@mitre.org>> wrote:
In your actual code, are you using one BodyContentHandler for all of your 
files?  Or are you creating a new BodyContentHandler for each file?  If the 
former, then, y, there’s a problem with your code; if the latter, that’s not 
something I’ve seen before.

From: Steven White [mailto:swhite4...@gmail.com<mailto:swhite4...@gmail.com>]
Sent: Monday, February 08, 2016 4:56 PM
To: user@tika.apache.org<mailto:user@tika.apache.org>
Subject: Re: Preventing OutOfMemory exception

Hi Tim,

The code I showed is a minimal example code to show the issue I'm running into, 
which is: memory keeps on growing.

In production, the loop that you see will read files off a file system and 

RE: How is Tika used with Solr

2016-02-11 Thread Allison, Timothy B.
x-post to Tika user's

Y and n.  If you run tika app as: 

java -jar tika-app.jar <input_directory> <output_directory>

It runs tika-batch under the hood (TIKA-1330 as part of TIKA-1302).  This 
creates a parent and child process, if the child process notices a hung thread, 
it dies, and the parent restarts it.  Or if your OS gets upset with the child 
process and kills it out of self preservation, the parent restarts the child, 
or if there's an OOM...and you can configure how often the child shuts itself 
down (with parental restarting) to mitigate memory leaks.

So, y, if your use case allows <input_directory> <output_directory>, then we now have that 
in Tika.

I've been wanting to add a similar watchdog to tika-server ... any interest in 
that?


-Original Message-
From: xavi jmlucjav [mailto:jmluc...@gmail.com] 
Sent: Thursday, February 11, 2016 2:16 PM
To: solr-user <solr-u...@lucene.apache.org>
Subject: Re: How is Tika used with Solr

I have found that when you deal with large amounts of all sort of files, in the 
end you find stuff (pdfs are typically nasty) that will hang tika. That is even 
worse that a crash or OOM.
We used aperture instead of tika because at the time it provided a watchdog 
feature to kill what seemed like a hanged extracting thread. That feature is 
super important for a robust text extracting pipeline. Has Tika gained such 
feature already?

xavier

On Wed, Feb 10, 2016 at 6:37 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> Timothy's points are absolutely spot-on. In production scenarios, if 
> you use the simple "run Tika in a SolrJ program" approach you _must_ 
> abort the program on OOM errors and the like and  figure out what's 
> going on with the offending document(s). Or record the name somewhere 
> and skip it next time 'round. Or
>
> How much you have to build in here really depends on your use case.
> For "small enough"
> sets of documents or one-time indexing, you can get by with dealing 
> with errors one at a time.
> For robust systems where you have to have indexing available at all 
> times and _especially_ where you don't control the document corpus, 
> you have to build something far more tolerant as per Tim's comments.
>
> FWIW,
> Erick
>
> On Wed, Feb 10, 2016 at 4:27 AM, Allison, Timothy B. 
> <talli...@mitre.org>
> wrote:
> > I completely agree on the impulse, and for the vast majority of the 
> > time
> (regular catchable exceptions), that'll work.  And, by vast majority, 
> aside from oom on very large files, we aren't seeing these problems 
> any more in our 3 million doc corpus (y, I know, small by today's 
> standards) from
> govdocs1 and Common Crawl over on our Rackspace vm.
> >
> > Given my focus on Tika, I'm overly sensitive to the worst case
> scenarios.  I find it encouraging, Erick, that you haven't seen these 
> types of problems, that users aren't complaining too often about 
> catastrophic failures of Tika within Solr Cell, and that this thread 
> is not yet swamped with integrators agreeing with me. :)
> >
> > However, because oom can leave memory in a corrupted state (right?),
> because you can't actually kill a thread for a permanent hang and 
> because Tika is a kitchen sink and we can't prevent memory leaks in 
> our dependencies, one needs to be aware that bad things can 
> happen...if only very, very rarely.  For a fellow traveler who has run 
> into these issues on massive data sets, see also [0].
> >
> > Configuring Hadoop to work around these types of problems is not too
> difficult -- it has to be done with some thought, though.  On 
> conventional single box setups, the ForkParser within Tika is one 
> option, tika-batch is another.  Hand rolling your own parent/child 
> process is non-trivial and is not necessary for the vast majority of use 
> cases.
> >
> >
> > [0]
> http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
> >
> >
> >
> > -Original Message-
> > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > Sent: Tuesday, February 09, 2016 10:05 PM
> > To: solr-user <solr-u...@lucene.apache.org>
> > Subject: Re: How is Tika used with Solr
> >
> > My impulse would be to _not_ run Tika in its own JVM, just catch any
> exceptions in my code and "do the right thing". I'm not sure I see any 
> real benefit in yet another JVM.
> >
> > FWIW,
> > Erick
> >
> > On Tue, Feb 9, 2016 at 6:22 PM, Allison, Timothy B. 
> > <talli...@mitre.org>
> wrote:
> >> I have one answer here [0], but I'd be interested to hear what Solr
> users/devs/integrators have experienced on this topic.
> >>
> >> [0]
> >> http://mail-archives.apache.org/mod

RE: How is Tika used with Solr

2016-02-11 Thread Allison, Timothy B.
Right.  If you can't dump to a mirrored output directory, then you'll have to 
do your own monitoring.

If you can dump to a mirrored output directory, then tika-app will do all of 
the watchdog stuff for you.

If you can't, then, y, you're on your own.

If you want to get fancy, you could try implementing FileResourceConsumer in 
tika-batch.  Look at FSFileResourceConsumer as an example.  I've done this for 
reading Tika output and indexing w/ Lucene.

You might also look at StrawmanTikaAppDriver in the tika-batch module for an 
example of some basic multithreaded code that does what you suggest below.
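
If you do roll your own, the core of it is something like the sketch below (the jar path, file variable and two-minute timeout are made up; it assumes Java 8's Process.waitFor(timeout)/destroyForcibly() and commons-io's IOUtils).  Stdout has to be drained on a separate thread so a hung child can't block the caller:

ProcessBuilder pb = new ProcessBuilder("java", "-jar", "tika-app.jar", "-t",
        file.getAbsolutePath());
pb.redirectErrorStream(true);
final Process process = pb.start();

ExecutorService pool = Executors.newSingleThreadExecutor();
Future<String> output = pool.submit(new Callable<String>() {
    public String call() throws Exception {
        return IOUtils.toString(process.getInputStream(), StandardCharsets.UTF_8);
    }
});

if (!process.waitFor(2, TimeUnit.MINUTES)) {
    process.destroyForcibly();   //the watchdog: kill the hung child
    throw new IOException("tika-app timed out on " + file.getName());
}
String extractedText = output.get();
pool.shutdown();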

-Original Message-
From: Steven White [mailto:swhite4...@gmail.com] 
Sent: Thursday, February 11, 2016 6:03 PM
To: solr-u...@lucene.apache.org
Subject: Re: How is Tika used with Solr

Tim,

In my case, I have to use Tika as follows:

java -jar tika-app.jar -t <file_name>

I will be invoking the above command from my Java app using 
Runtime.getRuntime().exec().  I will capture stdout and stderr to get back the 
raw text I need.  My app use case will not allow me to use a <input_directory> 
<output_directory>, it is out of the question.

Reading your summary, it looks like I won't get this watch-dog monitoring and 
thus I have to implement my own.  Can you confirm?

Thanks

Steve


On Thu, Feb 11, 2016 at 2:45 PM, Allison, Timothy B. <talli...@mitre.org>
wrote:

> x-post to Tika user's
>
> Y and n.  If you run tika app as:
>
> java -jar tika-app.jar <input_directory> <output_directory>
>
> It runs tika-batch under the hood (TIKA-1330 as part of TIKA-1302).  
> This creates a parent and child process, if the child process notices 
> a hung thread, it dies, and the parent restarts it.  Or if your OS 
> gets upset with the child process and kills it out of self 
> preservation, the parent restarts the child, or if there's an 
> OOM...and you can configure how often the child shuts itself down 
> (with parental restarting) to mitigate memory leaks.
>
> So, y, if your use case allows <input_directory> <output_directory>, then we now 
> have that in Tika.
>
> I've been wanting to add a similar watchdog to tika-server ... any 
> interest in that?
>
>
> -Original Message-
> From: xavi jmlucjav [mailto:jmluc...@gmail.com]
> Sent: Thursday, February 11, 2016 2:16 PM
> To: solr-user <solr-u...@lucene.apache.org>
> Subject: Re: How is Tika used with Solr
>
> I have found that when you deal with large amounts of all sorts of 
> files, in the end you find stuff (pdfs are typically nasty) that will hang 
> tika.
> That is even worse than a crash or OOM.
> We used aperture instead of tika because at the time it provided a 
> watchdog feature to kill what seemed like a hanged extracting thread. 
> That feature is super important for a robust text extracting pipeline. 
> Has Tika gained such feature already?
>
> xavier
>
> On Wed, Feb 10, 2016 at 6:37 PM, Erick Erickson 
> <erickerick...@gmail.com>
> wrote:
>
> > Timothy's points are absolutely spot-on. In production scenarios, if 
> > you use the simple "run Tika in a SolrJ program" approach you _must_ 
> > abort the program on OOM errors and the like and  figure out what's 
> > going on with the offending document(s). Or record the name 
> > somewhere and skip it next time 'round. Or
> >
> > How much you have to build in here really depends on your use case.
> > For "small enough"
> > sets of documents or one-time indexing, you can get by with dealing 
> > with errors one at a time.
> > For robust systems where you have to have indexing available at all 
> > times and _especially_ where you don't control the document corpus, 
> > you have to build something far more tolerant as per Tim's comments.
> >
> > FWIW,
> > Erick
> >
> > On Wed, Feb 10, 2016 at 4:27 AM, Allison, Timothy B.
> > <talli...@mitre.org>
> > wrote:
> > > I completely agree on the impulse, and for the vast majority of 
> > > the time
> > (regular catchable exceptions), that'll work.  And, by vast 
> > majority, aside from oom on very large files, we aren't seeing these 
> > problems any more in our 3 million doc corpus (y, I know, small by 
> > today's
> > standards) from
> > govdocs1 and Common Crawl over on our Rackspace vm.
> > >
> > > Given my focus on Tika, I'm overly sensitive to the worst case
> > scenarios.  I find it encouraging, Erick, that you haven't seen 
> > these types of problems, that users aren't complaining too often 
> > about catastrophic failures of Tika within Solr Cell, and that this 
> > thread is not yet swamped with integrators agreeing with me. :)
> > >
> > > However, because oom can leave memory in a corrupted state 
> > > (right?),
> > because

RE: Using Tika that comes with Solr 5.2

2016-02-03 Thread Allison, Timothy B.
The problem (I think) is that tika-parsers.jar includes just the Tika parsers 
(wrappers) around a boatload of actual parsers/dependencies (POI, PDFBox, etc). 
 If you are using jars, I’d recommend the tika-app.jar which includes all 
dependencies.
From: Steven White [mailto:swhite4...@gmail.com]
Sent: Tuesday, February 02, 2016 7:01 PM
To: user@tika.apache.org
Subject: Using Tika that comes with Solr 5.2

Hi everyone,

I have written a standalone application that works with Solr 5.2.  I'm using 
the existing JARs that come with Solr to index data off a file system.  My 
applications scans the file system, looking for files and then uses Tika to 
extract the raw text and then sends the raw text to Solr, using SolrJ, for 
indexing.

What I'm finding is that Tika will not extract the raw text off PDF, 
Powerpoint, etc. files, but it will off raw text files.

Here is the code for:

public static void parseWithTika() throws Exception {
  File file = new File("C:\\temp\\test.pdf");

  FileInputStream in = new FileInputStream(file);
  AutoDetectParser parser = new AutoDetectParser();
  Metadata metadata = new Metadata();
  BodyContentHandler contentHandler = new BodyContentHandler();

  parser.parse(in, contentHandler, metadata);

  String content = contentHandler.toString();  // <=== 'content' is always an empty string

  in.close();
}

In the above code, 'content' is always empty (the above is from 
https://tika.apache.org/1.8/examples.html)

Solr 5.2 comes with the following Tika JARs which I have included all of them: 
tika-core-1.7.jar, tika-java7-1.7.jar, tika-parsers-1.7.jar, tika-xmp-1.7.jar, 
vorbis-java-tika-0.6.jar, kite-morphlines-tika-core-0.12.1.jar and 
kite-morphlines-tika-decompress-0.12.1.jar

Any idea why this isn't working?

Thanks!

Steve


RE: tika is unable to extract outlook messages

2016-02-16 Thread Allison, Timothy B.
See my response to your question on the Solr users’ list here: 
https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201602.mbox/%3CCY1PR09MB0795E8DBA7B2B6603A45820EC7A80%40CY1PR09MB0795.namprd09.prod.outlook.com%3E

I don’t think this is a Tika problem.  This is the standard way that Solr’s DIH 
handles embedded documents…it concatenates all embedded documents onto one 
String.

If you want to treat each individual attachment as a separate file, you’ll have 
to do preprocessing on your pst or run Tika on your own (see the 
RecursiveParserWrapper, perhaps) and send documents to Solr via SolrJ 
(https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/).
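
Very roughly, the RecursiveParserWrapper route looks like the sketch below (Tika 1.x API; the handler type, the -1 write limit and the .pst path are just placeholders):

Parser p = new AutoDetectParser();
ContentHandlerFactory factory =
        new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1);
RecursiveParserWrapper wrapper = new RecursiveParserWrapper(p, factory);

Metadata metadata = new Metadata();
try (InputStream is = TikaInputStream.get(new File("/home/ec2-user/sateamc_0006.pst").toPath())) {
    wrapper.parse(is, new DefaultHandler(), metadata, new ParseContext());
}
// One Metadata object per embedded document (each message, attachment, etc.);
// the extracted text for each is stored under RecursiveParserWrapper.TIKA_CONTENT.
List<Metadata> metadataList = wrapper.getMetadata();

From there you can map each Metadata onto its own SolrInputDocument and send them with SolrJ.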




From: Sreenivasa Kallu [mailto:sreenivasaka...@gmail.com]
Sent: Tuesday, February 16, 2016 6:35 PM
To: user@tika.apache.org
Subject: tika is unable to extract outlook messages

Hi ,
   I am currently indexing individual Outlook messages and searching is 
working fine.
I have created the Solr core using the following command:
 ./solr create -c sreenimsg1 -d data_driven_schema_configs

I am using the following command to index individual messages.
curl  
"http://localhost:8983/solr/sreenimsg/update/extract?literal.id=msg9&uprefix=attr_&fmap.content=attr_content&commit=true"
 -F "myfile=@/home/ec2-user/msg9.msg"

This setup is working fine.

But a new requirement is to extract messages from an Outlook PST file.
I tried the following command to extract messages from the Outlook PST file.

curl  
"http://localhost:8983/solr/sreenimsg1/update/extract?literal.id=msg7&uprefix=attr_&fmap.content=attr_content&commit=true"
 -F 
"myfile=@/home/ec2-user/sateamc_0006.pst"

This command extracts only high-level tags and extracts all messages into 
one message. I am not getting all the tags I got when extracting individual messages. Is 
the above command correct? Is the problem that it is not using recursion?  How do I add 
recursion to the above command?  Is it a Tika library problem?

Please help to solve above problem.

Advanced Thanks.
--sreenivasa kallu


RE: Using tika-app-1.11.jar

2016-02-11 Thread Allison, Timothy B.
Plan C: if you’re willing to store a mirror set of directories with the text 
versions of the files, just run tika-app.jar on your “input” directory and run 
your SolrJ loader on the “text/export” directory:

java -jar tika-app.jar <input_dir> <output_dir>

And, if you’re feeling jsonic:

java -jar tika-app.jar -J -t -i <input_dir> -o <output_dir>


This method of running Tika will be robust to OOM, permanent hangs and 
OS-destroying-your-process-out-of-self-preservation incidents.


From: Steven White [mailto:swhite4...@gmail.com]
Sent: Thursday, February 11, 2016 10:18 AM
To: user@tika.apache.org
Subject: Re: Using tika-app-1.11.jar

Thank you Nick and everyone who has helped me with my questions.

I'm now understand Tika much better vs. where I was at last week when I first 
looked at it.

Steve

On Thu, Feb 11, 2016 at 8:18 AM, Nick Burch 
> wrote:
On Wed, 10 Feb 2016, Steven White wrote:
I'm including tika-app-1.11.jar with my application and see that Tika
includes "slf4j".

The Tika App single jar is intended for standalone use. It's not generally 
recommended to be included as part of a wider application, as it tends to 
include everything and the kitchen sink, to allow for easy standalone use

Generally, you should just tell Maven / Groovy / Ivy that you want to depend on 
Tika Core + Tika Parsers, then your build tool will fetch + bundle all the 
dependencies for you. That lets you have proper control over conflicting 
versions of jars etc

Nick



RE: script tags in LinkContentHandler

2016-04-06 Thread Allison, Timothy B.
On #2, I'd prefer not skipping elements.  I definitely understand the use case 
to extract what a human can see, but I suspect if your email address ends in 
'forensics.com', you'd probably like to see everything as well.

-Original Message-
From: Joseph Naegele [mailto:jnaeg...@grierforensics.com] 
Sent: Wednesday, April 06, 2016 4:14 PM
To: user@tika.apache.org
Subject: RE: script tags in LinkContentHandler

Great, sounds good. Would you like me to open a ticket?

With respect to parsing outlinks in Nutch, there are actually two problems:

1) 

RE: Jempbox runtime error

2016-04-22 Thread Allison, Timothy B.
Hi Chris,
  Good to hear from you.  We do still use Jempbox in 1.12 for the PDFParser and 
the JempboxExtractor.  The RTF must have an embedded PDF or Jpeg or another 
image file.
  Is there any chance Maven is not smiling upon you with transitive 
dependencies?  When you bundle your app are you including all dependencies?
  Very strange that it isn’t showing up in the dependency tree…

  Hmmm…
From: Chris Bamford [mailto:cbamf...@mimecast.com]
Sent: Friday, April 22, 2016 12:14 PM
To: user@tika.apache.org
Subject: Jempbox runtime error

Hi

I recently upgraded to tika 1.12 from 1.7 and read the notes about Jempbox 
being no longer used.  My pom now pulls in 1.12 versions of tika-core, 
tika-parsers, tika-xmp and tika-bundle.
The app is running well but very occasionally we see:

 java.lang.NoClassDefFoundError: org/apache/jempbox/xmp/XMPMetadata

It is happening on an RTF file which unfortunately I cannot share.

I have generated a maven dependency tree and there is no mention of jempbox in 
there at all.  Has anyone seen this issue or have any ideas of what I could try?

Thanks,

- Chris










RE: Jempbox runtime error

2016-04-22 Thread Allison, Timothy B.

That should be in our tika-parsers’ pom:

<jempbox.version>1.8.11</jempbox.version>

So, um, where did you see that we had dropped Jempbox?  I know that we wanted 
to at some point, but XMPBox only works on PDF/A so we aren’t going to move to 
that any time soon.
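
If you want to pin it explicitly in your own pom while you sort out the packaging, a sketch of the dependency (coordinates as published on Maven Central):

<dependency>
  <groupId>org.apache.pdfbox</groupId>
  <artifactId>jempbox</artifactId>
  <version>1.8.11</version>
</dependency>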

Cheers,

  Tim


From: Chris Bamford [mailto:cbamf...@mimecast.com]
Sent: Friday, April 22, 2016 12:51 PM
To: user@tika.apache.org
Subject: Re: Jempbox runtime error

Hi Tim,

Nice to hear from you too - and thanks for the quick reply!

Good to know about the dependency, will try to include it (what version do you 
recommend?).

Thanks

- Chris









On 22 Apr 2016, at 17:39, Allison, Timothy B. 
<talli...@mitre.org<mailto:talli...@mitre.org>> wrote:

Hi Chris,
  Good to hear from you.  We do still use Jempbox in 1.12 for the PDFParser and 
the JempboxExtractor.  The RTF must have an embedded PDF or Jpeg or another 
image file.
  Is there any chance Maven is not smiling upon you with transitive 
dependencies?  When you bundle your app are you including all dependencies?
  Very strange that it isn’t showing up in the dependency tree…

  Hmmm…
From: Chris Bamford [mailto:cbamf...@mimecast.com]
Sent: Friday, April 22, 2016 12:14 PM
To: user@tika.apache.org<mailto:user@tika.apache.org>
Subject: Jempbox runtime error

Hi

I recently upgraded to tika 1.12 from 1.7 and read the notes about Jempbox 
being no longer used.  My pom now pulls in 1.12 versions of tika-core, 
tika-parsers, tika-xmp and tika-bundle.
The app is running well but very occasionally we see:

 java.lang.NoClassDefFoundError: org/apache/jempbox/xmp/XMPMetadata

It is happening on an RTF file which unfortunately I cannot share.

I have generated a maven dependency tree and there is no mention of jempbox in 
there at all.  Has anyone seen this issue or have any ideas of what I could try?

Thanks,

- Chris












RE: Jempbox runtime error

2016-04-22 Thread Allison, Timothy B.
Ok, phew.  Yes, they are, but we’re not…yet. ☺

Tika 1.13 should be around the corner, and that’ll include PDFBox 2.0 (and 
Jempbox!).

Best,

  Tim

From: Chris Bamford [mailto:cbamf...@mimecast.com]
Sent: Friday, April 22, 2016 1:05 PM
To: user@tika.apache.org
Subject: Re: Jempbox runtime error

Thanks.

No, it was my confusion - PDFBox (which is also part of our app) has recently 
dropped it (see http://pdfbox.apache.org/2.0/migration.html).
So we may be actively managing it out - will revisit and hopefully all will be 
good.

Cheers,

- Chris









On 22 Apr 2016, at 17:54, Allison, Timothy B. 
<talli...@mitre.org<mailto:talli...@mitre.org>> wrote:


That should be in our tika-parsers’ pom:

<jempbox.version>1.8.11</jempbox.version>

So, um, where did you see that we had dropped Jempbox?  I know that we wanted 
to at some point, but XMPBox only works on PDF/A so we aren’t going to move to 
that any time soon.

Cheers,

  Tim


From: Chris Bamford [mailto:cbamf...@mimecast.com]
Sent: Friday, April 22, 2016 12:51 PM
To: user@tika.apache.org<mailto:user@tika.apache.org>
Subject: Re: Jempbox runtime error

Hi Tim,

Nice to hear from you too - and thanks for the quick reply!

Good to know about the dependency, will try to include it (what version do you 
recommend?).

Thanks

- Chris










On 22 Apr 2016, at 17:39, Allison, Timothy B. 
<talli...@mitre.org<mailto:talli...@mitre.org>> wrote:

Hi Chris,
  Good to hear from you.  We do still use Jempbox in 1.12 for the PDFParser and 
the JempboxExtractor.  The RTF must have an embedded PDF or Jpeg or another 
image file.
  Is there any chance Maven is not smiling upon you with transitive 
dependencies?  When you bundle your app are you including all dependencies?
  Very strange that it isn’t showing up in the dependency tree…

  Hmmm…
From: Chris Bamford [mailto:cbamf...@mimecast.com]
Sent: Friday, April 22, 2016 12:14 PM
To: user@tika.apache.org<mailto:user@tika.apache.org>
Subject: Jempbox runtime error

Hi

I recently upgraded to tika 1.12 from 1.7 and read the notes about Jempbox 
being no longer used.  My pom now pulls in 1.12 versions of tika-core, 
tika-parsers, tika-xmp and tika-bundle.
The app is running well but very occasionally we see:

 java.lang.NoClassDefFoundError: org/apache/jempbox/xmp/XMPMetadata

It is happening on an RTF file which unfortunately I cannot share.

I have generated a maven dependency tree and there is no mention of jempbox in 
there at all.  Has anyone seen this issue or have any ideas of what I could try?

Thanks,

- Chris








RE: [VOTE] Release Apache Tika 1.13 Candidate #1

2016-05-11 Thread Allison, Timothy B.
+1

Built on Windows and Linux.  I'm relying on earlier pre-release tests for no 
surprises. :)

Thank you, Dave!

-Original Message-
From: David Meikle [mailto:loo...@gmail.com] On Behalf Of David Meikle
Sent: Monday, May 9, 2016 3:35 PM
To: d...@tika.apache.org; user@tika.apache.org
Subject: [VOTE] Release Apache Tika 1.13 Candidate #1

A candidate for the Tika 1.13 release is available at:
  https://dist.apache.org/repos/dist/dev/tika/

The release candidate is a zip archive of the sources in:
  
https://git-wip-us.apache.org/repos/asf?p=tika.git;a=tag;h=18fa8213438183a249df4f52535031670f0a3eef

The SHA1 checksum of the archive is
  8a591e7ea29dca14d5f25b44b3a2a35425676c64.

In addition, a staged maven repository is available here:
  
https://repository.apache.org/content/repositories/orgapachetika-1019/org/apache/tika

Please vote on releasing this package as Apache Tika 1.13.
The vote is open for the next 72 hours and passes if a majority of at least 
three +1 Tika PMC votes are cast.

[ ] +1 Release this package as Apache Tika 1.13 [ ] -1 Do not release this 
package because…

Here is my +1 for the release.

Cheers,
Dave

P.S. For anyone looking to test using the Apache Tika Server I have put up a 
branch that pulls down the RC at 
https://github.com/LogicalSpark/docker-tikaserver/tree/1.13rc1


RE: Need Help

2016-05-11 Thread Allison, Timothy B.
Haven’t gotten around to this yet.  Sorry.

Anyone else have any input?

From: harsh kumar [mailto:kumarhars...@gmail.com]
Sent: Friday, May 6, 2016 8:48 AM
To: Allison, Timothy B. <talli...@mitre.org>
Subject: Re: Need Help

Hey Timothy,

Can you please help me with your findings of the TIKA? I would be thankful to 
you for this.

--Harsh


On Tue, Apr 19, 2016 at 6:51 PM, harsh kumar 
<kumarhars...@gmail.com<mailto:kumarhars...@gmail.com>> wrote:
Hey Timothy,

Thanks for your reply.

It would be a great help if you can give your findings to me.
Can you please help me with some specific email id to reach for the same.


-- Forwarded message ------
From: Allison, Timothy B. <talli...@mitre.org<mailto:talli...@mitre.org>>
Date: Mon, Apr 18, 2016 at 7:42 PM
Subject: RE: Need Help
To: "user@tika.apache.org<mailto:user@tika.apache.org>" 
<user@tika.apache.org<mailto:user@tika.apache.org>>
Cc: "kumarhars...@gmail.com<mailto:kumarhars...@gmail.com>" 
<kumarhars...@gmail.com<mailto:kumarhars...@gmail.com>>


Ha.  I'm in the process of comparing mimetype detection results from DROID, 
Tika and 'file' on our TIKA-1302 corpus.

After that, I was going to compare our different encoding detectors on the 
corpus...I'll have a better answer in a few weeks.

Others on this list probably have more info, but our general Encoding detector 
tries to get the encoding from an html meta charset info, then the 
UniversalEncodingDetector and then the Icu4JDetector.  It stops when the first 
encoding detector returns a non-null answer.  That order was initially set in 
July 2012, and we haven't changed it since.

In short, this is an area for further analysis.

-Original Message-
From: Mattmann, Chris A (3980) 
[mailto:chris.a.mattm...@jpl.nasa.gov<mailto:chris.a.mattm...@jpl.nasa.gov>]
Sent: Monday, April 18, 2016 9:59 AM
To: d...@tika.apache.org<mailto:d...@tika.apache.org>
Subject: Fwd: Need Help



Sent from my iPhone

Begin forwarded message:

From: harsh kumar 
<kumarhars...@gmail.com<mailto:kumarhars...@gmail.com><mailto:kumarhars...@gmail.com<mailto:kumarhars...@gmail.com>>>
Date: April 18, 2016 at 2:02:23 AM PDT
To: 
<dev-ow...@tika.apache.org<mailto:dev-ow...@tika.apache.org><mailto:dev-ow...@tika.apache.org<mailto:dev-ow...@tika.apache.org>>>
Subject: Fwd: Need Help

Hi,

I am using tika for detecting the encoding of a file. But I found that the 
results are not uniform if I use CharsetDetector and UniversalEncodingDetector 
for the same file.

Can you please brief me with the major differences between them and their 
best-fit use cases.

Looking forward to your early reply.

--
Warm Regards.*
Harsh Kumar



--
Warm Regards…..•
Harsh Kumar





--
Warm Regards…..•
Harsh Kumar




RE: My "What's new with Apache Tika 2.0" talk slides

2016-05-11 Thread Allison, Timothy B.
Great slides.  Thank you, Nick.  Wish I could be there...

Any feedback/guidance from the audience?

-Original Message-
From: Nick Burch [mailto:n...@apache.org] 
Sent: Wednesday, May 11, 2016 5:09 PM
To: user@tika.apache.org
Cc: d...@tika.apache.org
Subject: My "What's new with Apache Tika 2.0" talk slides

Hi All

For those who couldn't make it to Vancouver this week, the slides from my 
"What's new with Apache Tika 2.0" talk are now available online:
http://www.slideshare.net/NickBurch2/apache-tika-whats-new-with-20

The audio was recorded, hopefully that will be available to go with the slides 
in a few days time

Nick


RE: Tika response encoding problem

2016-05-16 Thread Allison, Timothy B.
Our AutoDetectReader does correctly identify the encoding in this case.

Do we want to add logic that checks for ??, and if that doesn’t exist 
then use our AutoDetectReader?

From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, May 16, 2016 11:15 AM
To: user@tika.apache.org
Subject: RE: Tika response encoding problem

The underlying james mime4j parser isn’t properly detecting utf-8 in the .txt 
file.  In the .eml file, the fields declare their encoding:

From: =?utf-8?Q?Philipp_Steinkr=C3=BCger?= 
philipp.steinkrue...@uni-koeln.de<mailto:philipp.steinkrue...@uni-koeln.de>

Not sure how we’d want to fix this.

From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, May 16, 2016 8:04 AM
To: user@tika.apache.org<mailto:user@tika.apache.org>
Subject: RE: Tika response encoding problem

>>I also tried to use tika-app, since I saw in --help that I can pass the 
>>--encoding parameter. So I ran:

To clarify (you may already understand this, sorry)…the encoding parameter 
specifies the output encoding; it is not a hint to Tika in encoding detection.

With trunk and 1.12 in Tika app’s gui, I’m getting proper extraction with 
“Testemail-empty-doesnotwork.eml”, but the umlauts are corrupt with 
“Test-email-empty-works.txt”.  I get the same behavior when I redirect the 
output to a file:

java -jar tika-app-1.12.jar Testemail-empty-doesnotwork.eml > testOut2.txt



Bizarrely, it looks like both files are being parsed by the RFC822Parser, and 
when I run the “detect” commandline option –d, on both files with 1.12 and 
trunk, both say RFC822.






From: Philipp Steinkrüger [mailto:philipp.steinkrue...@uni-koeln.de]
Sent: Sunday, May 15, 2016 10:12 AM
To: user@tika.apache.org<mailto:user@tika.apache.org>
Subject: Tika response encoding problem

Dear list,

I am running Tika server 1.14 on a Debian jessie. I start the server with this 
command:

java -jar tika-server-1.14-SNAPSHOT.jar

If I send a file for metadata extraction like this

curl -T email.txt http://localhost:9998/meta

The response screws up any umlauts.

The environment variables for the shell from which I start the server as well 
as execute the curl command are as follows:

LANG=en_US.UTF-8
LANGUAGE=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8

I followed this page 
(https://perlgeek.de/en/article/set-up-a-clean-utf8-environment) to set up a 
clean unicode environment. The test case mentioned on that page works fine.

I also tried to use tika-app, since I saw in --help that I can pass the 
--encoding parameter. So I ran:

(1) java -jar tika-app-1.14-SNAPSHOT.jar --encoding=unicode -m email.txt

and

(2) java -jar tika-app-1.14-SNAPSHOT.jar --encoding=UTF-8 -m email.txt

The output of umlauts does change, but in neither case is it right. For (1) the 
umlauts are represented by ‘??’; for (2) they are represented by ‘Ã¼’ (that is 
a capital A with a ~ on top, followed by the quarter sign 1/4).

How can I fix this problem? Ultimately, I want to run queries to Tika from a 
python script (with Chris Mattmann’s module). If this behaviour can be 
controlled from within python, that would be fine for me. But since I got the 
problem also using curl and tika-app, I thought that the problem is more likely 
to be found in tika itself.

I’d be very grateful for any assistance!
Best,
Philipp




RE: Tika response encoding problem

2016-05-16 Thread Allison, Timothy B.
The underlying james mime4j parser isn’t properly detecting utf-8 in the .txt 
file.  In the .eml file, the fields declare their encoding:

From: =?utf-8?Q?Philipp_Steinkr=C3=BCger?= 
philipp.steinkrue...@uni-koeln.de<mailto:philipp.steinkrue...@uni-koeln.de>

Not sure how we’d want to fix this.

From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, May 16, 2016 8:04 AM
To: user@tika.apache.org
Subject: RE: Tika response encoding problem

>>I also tried to use tika-app, since I saw in --help that I can pass the 
>>--encoding parameter. So I ran:

To clarify (you may already understand this, sorry)…the encoding parameter 
specifies the output encoding; it is not a hint to Tika in encoding detection.

With trunk and 1.12 in Tika app’s gui, I’m getting proper extraction with 
“Testemail-empty-doesnotwork.eml”, but the umlauts are corrupt with 
“Test-email-empty-works.txt”.  I get the same behavior when I redirect the 
output to a file:

java -jar tika-app-1.12.jar Testemail-empty-doesnotwork.eml > testOut2.txt



Bizarrely, it looks like both files are being parsed by the RFC822Parser, and 
when I run the “detect” commandline option –d, on both files with 1.12 and 
trunk, both say RFC822.






From: Philipp Steinkrüger [mailto:philipp.steinkrue...@uni-koeln.de]
Sent: Sunday, May 15, 2016 10:12 AM
To: user@tika.apache.org<mailto:user@tika.apache.org>
Subject: Tika response encoding problem

Dear list,

I am running Tika server 1.14 on a Debian jessie. I start the server with this 
command:

java -jar tika-server-1.14-SNAPSHOT.jar

If I send a file for metadata extraction like this

curl -T email.txt http://localhost:9998/meta

The response screws up any umlauts.

The environment variables for the shell from which I start the server as well 
as execute the curl command are as follows:

LANG=en_US.UTF-8
LANGUAGE=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8

I followed this page 
(https://perlgeek.de/en/article/set-up-a-clean-utf8-environment) to set up a 
clean unicode environment. The test case mentioned on that page works fine.

I also tried to use tika-app, since I saw in --help that I can pass the 
--encoding parameter. So I ran:

(1) java -jar tika-app-1.14-SNAPSHOT.jar --encoding=unicode -m email.txt

and

(2) java -jar tika-app-1.14-SNAPSHOT.jar --encoding=UTF-8 -m email.txt

The output of umlauts does change, but in neither case is it right. For (1) the 
umlauts are represented by ‘??’; for (2) they are represented by ‘Ã¼’ (that is 
a capital A with a ~ on top, followed by the quarter sign 1/4).

How can I fix this problem? Ultimately, I want to run queries to Tika from a 
python script (with Chris Mattmann’s module). If this behaviour can be 
controlled from within python, that would be fine for me. But since I got the 
problem also using curl and tika-app, I thought that the problem is more likely 
to be found in tika itself.

I’d be very grateful for any assistance!
Best,
Philipp




RE: Tika response encoding problem

2016-05-16 Thread Allison, Timothy B.
>>I also tried to use tika-app, since I saw in --help that I can pass the 
>>--encoding parameter. So I ran:

To clarify (you may already understand this, sorry)…the encoding parameter 
specifies the output encoding; it is not a hint to Tika in encoding detection.

With trunk and 1.12 in Tika app’s gui, I’m getting proper extraction with 
“Testemail-empty-doesnotwork.eml”, but the umlauts are corrupt with 
“Test-email-empty-works.txt”.  I get the same behavior when I redirect the 
output to a file:

java -jar tika-app-1.12.jar Testemail-empty-doesnotwork.eml > testOut2.txt



Bizarrely, it looks like both files are being parsed by the RFC822Parser, and 
when I run the “detect” commandline option –d, on both files with 1.12 and 
trunk, both say RFC822.






From: Philipp Steinkrüger [mailto:philipp.steinkrue...@uni-koeln.de]
Sent: Sunday, May 15, 2016 10:12 AM
To: user@tika.apache.org
Subject: Tika response encoding problem

Dear list,

I am running Tika server 1.14 on a Debian jessie. I start the server with this 
command:

java -jar tika-server-1.14-SNAPSHOT.jar

If I send a file for metadata extraction like this

curl -T email.txt http://localhost:9998/meta

The response screws up any umlauts.

The environment variables for the shell from which I start the server as well 
as execute the curl command are as follows:

LANG=en_US.UTF-8
LANGUAGE=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8

I followed this page 
(https://perlgeek.de/en/article/set-up-a-clean-utf8-environment) to set up a 
clean unicode environment. The test case mentioned on that page works fine.

I also tried to use tika-app, since I saw in --help that I can pass the 
--encoding parameter. So I ran:

(1) java -jar tika-app-1.14-SNAPSHOT.jar --encoding=unicode -m email.txt

and

(2) java -jar tika-app-1.14-SNAPSHOT.jar --encoding=UTF-8 -m email.txt

The output of umlauts does change, but in neither case is it right. For (1) the 
umlauts are represented by ‘??’; for (2) they are represented by ‘Ã¼’ (that is 
a capital A with a ~ on top, followed by the quarter sign 1/4).

How can I fix this problem? Ultimately, I want to run queries to Tika from a 
python script (with Chris Mattmann’s module). If this behaviour can be 
controlled from within python, that would be fine for me. But since I got the 
problem also using curl and tika-app, I thought that the problem is more likely 
to be found in tika itself.

I’d be very grateful for any assistance!
Best,
Philipp




RE: is it possible to batch extract text from pdf files within a tree of folders within a zip file ?

2016-05-02 Thread Allison, Timothy B.
>> While PDFBox is a part of TIKA and the two projects are kindof "best friends 
>> forever"
Thank you, Tilman! :)


-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Saturday, April 30, 2016 5:24 PM
To: us...@pdfbox.apache.org
Subject: Re: is it possible to batch extract text from pdf files within a tree 
of folders within a zip file ?

On 30.04.2016 at 19:46, David Green wrote:
> you may gather that I am new to this.
> my original zip files containing pdf files are on my f drive. I want 
> the unpacked text files saved in an identical directory structure on 
> my g drive. I have tried:
> java -jar tika-app.X.Y.jar -J -t -i <input dir> -o <output dir> resulted in 
> "syntax error"
> can you please suggest what I'm doing wrong

You're in the wrong mailing list. This is the PDFBox mailing list. While PDFBox 
is a part of TIKA and the two projects are kindof "best friends forever", this 
doesn't mean that PDFBox users all know how to use TIKA.

However I suspect that you actually used the "<" and ">". The "<" and ">" are 
there to explain a concept. So your command line would probably be

java -jar tika-app.X.Y.jar -J -t -i f: -o g:



Tilman


>


-
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org



RE: is it possible to batch extract text from pdf files within a tree of folders within a zip file ?

2016-05-02 Thread Allison, Timothy B.
The commandline I gave you outputs JSON files.  If you open them in a text/JSON 
editor, you should see valid data.  If they're corrupt, please let us know!

If you're able to process JSON files, you should be good to go.  Otherwise, the 
recommendation to use Java's ZipFile API and do the unzipping yourself is 
probably the best option.  

In Tika, we do have a -z option to extract embedded files, but that only 
extracts the first level of documents and it doesn't reproduce the original 
file structure. If you have zips within zips, you won't get the content.

 
-Original Message-
From: davidgreen.co...@gmail.com [mailto:davidgreen.co...@gmail.com] On Behalf 
Of David Green
Sent: Saturday, April 30, 2016 9:07 PM
To: us...@pdfbox.apache.org
Subject: Re: is it possible to batch extract text from pdf files within a tree 
of folders within a zip file ?

Sorry for using the wrong forum. Is there a Tika forum?

Your suggested command is working, after a fashion: java -jar 
c:\jars\tika-app-1.12.jar -J -t -i f: -o g:
The directory structure is being reproduced, but the zip files appear to be copied 
as zip files (I think). The copied files retain the original filename (including 
the original .zip extension) with an additional .json extension, though when I try 
to open a file using the B1 file archiver, it reports a corrupt file.


RE: Apache Tika wikipedia page

2016-04-15 Thread Allison, Timothy B.
Fantastic.  Thank you!

Have a great weekend!

-Original Message-
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov] 
Sent: Friday, April 15, 2016 7:22 PM
To: d...@tika.apache.org
Cc: user@tika.apache.org
Subject: Apache Tika wikipedia page

Hi All,

I made a Wikipedia page for Apache Tika:

https://en.wikipedia.org/wiki/Apache_Tika


Please update and edit. Thank you.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion 
Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS) Adjunct Associate 
Professor, Computer Science Department University of Southern California, Los 
Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++







RE: Need Help

2016-04-18 Thread Allison, Timothy B.
Ha.  I'm in the process of comparing mimetype detection results from DROID, 
Tika and 'file' on our TIKA-1302 corpus.

After that, I was going to compare our different encoding detectors on the 
corpus...I'll have a better answer in a few weeks.

Others on this list probably have more info, but our general Encoding detector 
tries to get the encoding from an html meta charset info, then the 
UniversalEncodingDetector and then the Icu4JDetector.  It stops when the first 
encoding detector returns a non-null answer.  That order was initially set in 
July 2012, and we haven't changed it since.

In short, this is an area for further analysis.
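
For what it's worth, a bare-bones sketch of calling the two detectors you mention side by side (the file name is a placeholder):

byte[] bytes = Files.readAllBytes(Paths.get("test.txt"));

// ICU4J-based detector
CharsetDetector icu = new CharsetDetector();
icu.setText(bytes);
CharsetMatch match = icu.detect();
System.out.println("CharsetDetector: " + match.getName()
        + " (confidence " + match.getConfidence() + ")");

// juniversalchardet-based detector; the stream must support mark/reset
UniversalEncodingDetector universal = new UniversalEncodingDetector();
try (InputStream is = new BufferedInputStream(new FileInputStream("test.txt"))) {
    Charset charset = universal.detect(is, new Metadata());  // may return null
    System.out.println("UniversalEncodingDetector: " + charset);
}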

-Original Message-
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov] 
Sent: Monday, April 18, 2016 9:59 AM
To: d...@tika.apache.org
Subject: Fwd: Need Help



Sent from my iPhone

Begin forwarded message:

From: harsh kumar >
Date: April 18, 2016 at 2:02:23 AM PDT
To: >
Subject: Fwd: Need Help

Hi,

I am using tika for detecting the encoding of a file. But I found that the 
results are not uniform if I use CharsetDetector and UniversalEncodingDetector 
for the same file.

Can you please brief me with the major differences between them and their 
best-fit use cases.

Looking forward to your early reply.

--
Warm Regards.*
Harsh Kumar



RE: Is Tika (especially CharsetDetector) considered thread-safe?

2016-07-25 Thread Allison, Timothy B.
Charset detection _should_ be thread safe.  If you can help us track down the 
problem (unit test?), we need to fix this.

Thank you for raising this.

Best,

 Tim

-Original Message-
From: c.leitin...@lirum.at [mailto:c.leitin...@lirum.at] 
Sent: Monday, July 25, 2016 6:01 PM
To: user@tika.apache.org
Subject: Is Tika (especially CharsetDetector) considered thread-safe?

Hi,

I am working in a project where Tika is getting used in a heavily 
multi-threaded environment. Lately, there have been some issues where character 
set detection in isolation gives plausible results, while running it in 
parallel gives results that are way off.

The root cause has not yet been found, but within the team, there was quite 
some finger-pointing towards Tika's thread-safety and lots of FUD especially 
around org.apache.tika.parser.txt.CharsetDetector.

But it seems no one in our team reached out or cared to either bug report or 
ask on the mailing list.

So just to get rid of the FUD: Is
org.apache.tika.parser.txt.CharsetDetector considered to be thread-safe?
(Some bugs suggest that Tika cares about thread-safety, but I could not find 
anything in the javadoc for CharsetDetector)

Thanks and Best regards,
Christian


P.S.: We're building a fresh, new CharSetDetector for each byte array that 
should have the character set encoding detected. And only the thread that 
created the CharSetDetector is using it.


P.P.S.: We're still using Tika 1.9.


RE: Is Tika (especially CharsetDetector) considered thread-safe?

2016-07-25 Thread Allison, Timothy B.
With 1.13 and this code, I'm not able to see any problems with our handful of 
test files in our unit tests.  

Exactly what code are you using?  How are you doing detection?


@Test
public void testMultiThreadedEncodingDetection() throws Exception {
    Path testDocs = Paths.get(this.getClass().getResource("/test-documents").toURI());
    List<Path> paths = new ArrayList<>();
    Map<Path, String> encodings = new ConcurrentHashMap<>();
    for (File file : testDocs.toFile().listFiles()) {
        if (file.getName().endsWith(".txt") || file.getName().endsWith(".html")) {
            String encoding = getEncoding(file.toPath());
            paths.add(file.toPath());
            encodings.put(file.toPath(), encoding);
        }
    }
    for (int i = 0; i < 100; i++) {
        new Thread(new EncodingDetector(paths, encodings)).run();
    }
    assertTrue("success!", true);
}

private class EncodingDetector implements Runnable {
    private final List<Path> paths;
    private final Map<Path, String> encodings;
    private final Random r = new Random();

    private EncodingDetector(List<Path> paths, Map<Path, String> encodings) {
        this.paths = paths;
        this.encodings = encodings;
    }

    @Override
    public void run() {
        for (int i = 0; i < 100; i++) {
            int pInd = r.nextInt(paths.size());
            String detectedEncoding = null;
            try {
                detectedEncoding = getEncoding(paths.get(pInd));
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
            String trueEncoding = encodings.get(paths.get(pInd));
            if (! detectedEncoding.equals(trueEncoding)) {
                throw new RuntimeException("detected: " + detectedEncoding +
                        " but should have been: " + trueEncoding);
            }
        }
    }
}

public String getEncoding(Path p) throws Exception {
    try (InputStream is = TikaInputStream.get(p)) {
        AutoDetectReader reader = new AutoDetectReader(is);
        String val = reader.getCharset().toString();
        if (val == null) {
            return "NULL";
        } else {
            return val;
        }
    }
}

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Monday, July 25, 2016 9:21 PM
To: user@tika.apache.org
Subject: RE: Is Tika (especially CharsetDetector) considered thread-safe?

Charset detection _should_ be thread safe.  If you can help us track down the 
problem (unit test?), we need to fix this.

Thank you for raising this.

Best,

 Tim



RE: detect corrupt file and build a list of them before indexing in solr

2016-07-15 Thread Allison, Timothy B.
Checking for 0 byte files is one option.  The other option is to configure the 
logs to capture exceptions.  I’ve attached the config files and the shell 
script that I use when running our large scale regression testing here: 
https://wiki.apache.org/tika/TikaBatchUsage?action=AttachFile=view=tika-batch-sh.zip

To run those, unzip the folder, put the tika-app.jar in the bin/ directory, 
update the shell script for your <input_dir> and your <output_dir> and you 
should be good to go.  You may need to create a “logs” directory.  Exceptions 
will be recorded in the batch-process-warn.log, and original file names are 
included along with stack traces.

From: kostali hassan [mailto:med.has.kost...@gmail.com]
Sent: Friday, July 15, 2016 5:17 AM
To: user@tika.apache.org
Subject: detect corrupt file and build a list of them before indexing in solr

I am looking to index MS Word and PDF files by uploading data with Solr Cell using 
Apache Tika.
 I just hope to use Tika to detect corrupt files before indexing and get a list of 
the corrupted files, if that's possible.
I tried running java -jar tika-app.jar <input_dir> <output_dir>.  I get in the 
output_dir all the files of <input_dir> in XML format, and all the corrupt files 
have size 0 KB (empty).


RE: Extract Text from a TIFF image

2016-07-18 Thread Allison, Timothy B.
You'll need to set up tesseract to run Optical Character Recognition.  While we 
have an integration with OCR, it is not bundled within the app.

See https://wiki.apache.org/tika/TikaOCR
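
Once tesseract is installed, something along these lines should pull the text out of the TIFF.  The setTesseractPath call is only needed if tesseract isn't already on your PATH, and the path below is just an example:

TesseractOCRConfig config = new TesseractOCRConfig();
config.setTesseractPath("C:/Program Files (x86)/Tesseract-OCR/");  // example install location

ParseContext context = new ParseContext();
context.set(TesseractOCRConfig.class, config);

AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
try (InputStream is = new FileInputStream("Maxfield-1.tiff")) {
    parser.parse(is, handler, metadata, context);
}
System.out.println(handler.toString());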

For kicks, I ran this through Tika+Tesseract; this is the output you get once 
you've set up Tesseract:

SUPPLIER: 3177  Invoice Date Description Amount Discount Net Amount 015-28339 
06/08/2015 21,318.54 0.00 21,318.54 C15-28837 06/04/2015 1,529.75 0.00 1,529.75 
01528978 06/04/2015 1,238.18 0.00 1,238.18 015-28978-01 06/04/2015 1,182.85 
0.00 1,182.85 015-28439 06/01/2015 1,113.86 0.00 1,113.86 C15-29707 06/11/2015 
886.84 0.00 886.64 C15-28978-02 06/04/2015 526.91 0.00 526.91 01529385 
06/09/2015 199.29 0.00 199.29 C15~28439~01 06/03/2015 157.34 0.00 157.34 
C15-28670 06/03/2015 136.52 0.00 136.52 C15-28314-01 06/03/2015 132.81 0.00 
132.81 015-28576 06/02/2015 61.26 0.00 61.26 015-29413 06/11/2015 22.37 0.00 
22.37 Cheque #: 83077 Cheque Date 7/14/2015 28,506.32 0.00 28,506.32  SUPPLIER: 
3177  Invoice Date Description Amount Discount Net Amount C15-28339 06/08/2015 
21,318.54 0.00 21,318.54 015-28837 06/04/2015 1,529.75 0.00 1,529.75 015-28978 
06/04/2015 1,238.18 0.00 1,238.18 015-28978-01 06I04/2015 1 ,18285 0.00 
1,182.85 C15-28439 06/01/2015 1,113.86 0.00 1,113.86 015-29707 06l11/2015 
886.64 0.00 886.64 C15-28978~02 06/04/2015 526.91 0.00 526.91 015-29385 
06/09/2015 199.29 0.00 199.29 C15-28439-01 06/03/2015 157.34 0.00 157.34 
015-28670 06/03/2015 136.52 0.00 136.52 015-28314-01 06/03/2015 132.81 0.00 
132.81 C15-28576 06/02/2015 61.26 0.00 61.26 015-29413 06/11/2015 22.37 0.00 
22.37 Cheque #1 83077 Check Daie: 7/14/2015 28,506.32 0.00 28,506.32  07142015 
MMDD  TWENTY-EIGHT THOUSAND FIVE HUNDRED SIX CAD AND 32/ 100 $ 
"**28,506.32  Trans Am Piping Canada

From: Gordon Schneider [mailto:schneid...@transampiping.com]
Sent: Monday, July 18, 2016 4:05 PM
To: 'user@tika.apache.org' 
Subject: Extract Text from a TIFF image

I have tried using the GUI for tika-app-1.13 but it shows nothing. I can see 
the metadata but that does not give me the information I need. I have attached 
the file.

Maybe it is not possible to extract the text. If so what should I be looking 
for to tell me that it cannot extract the text.

Thanks


Gordon Schneider
403-236-0601
Trans Am Piping Products Ltd.



RE: Extract Text from a TIFF image

2016-07-19 Thread Allison, Timothy B.
You might want to experiment with different -psm values; we use 1 by default.

Also, which version of Tesseract? I think I got mine from 
(https://github.com/UB-Mannheim/tesseract/wiki), version:

tesseract 3.05.00dev
leptonica-1.73
  libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.6.20 : libtiff 
4.0.6 : zlib 1.2.8 : libwebp 0.4.3 : libopenjp2 2.1.0



From: Gordon Schneider [mailto:schneid...@transampiping.com]
Sent: Tuesday, July 19, 2016 11:22 AM
To: 'user@tika.apache.org' <user@tika.apache.org>
Subject: RE: Extract Text from a TIFF image

I installed tesseract on my PC. I ran tesseract on its own using the following 
command:

tesseract.exe x:/java/PDFBox/Maxfield-1.tiff x:/java/PDFBox/Maxfield-1

The results are in the attached file. Not as clean as the results Timothy got. 
I am closer to where I want to get to, but obviously I am a number of steps from 
my ideal solution. How do I get the same results Timothy got?

Thanks

Gord


From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: July 18, 2016 2:25 PM
To: user@tika.apache.org<mailto:user@tika.apache.org>
Subject: RE: Extract Text from a TIFF image

You'll need to set up tesseract to run Optical Character Recognition.  While we 
have an integration with OCR, it is not bundled within the app.

See https://wiki.apache.org/tika/TikaOCR

For kicks, I ran this through Tika+Tesseract; this is the output you get once 
you've set up Tesseract:

SUPPLIER: 3177  Invoice Date Description Amount Discount Net Amount 015-28339 
06/08/2015 21,318.54 0.00 21,318.54 C15-28837 06/04/2015 1,529.75 0.00 1,529.75 
01528978 06/04/2015 1,238.18 0.00 1,238.18 015-28978-01 06/04/2015 1,182.85 
0.00 1,182.85 015-28439 06/01/2015 1,113.86 0.00 1,113.86 C15-29707 06/11/2015 
886.84 0.00 886.64 C15-28978-02 06/04/2015 526.91 0.00 526.91 01529385 
06/09/2015 199.29 0.00 199.29 C15~28439~01 06/03/2015 157.34 0.00 157.34 
C15-28670 06/03/2015 136.52 0.00 136.52 C15-28314-01 06/03/2015 132.81 0.00 
132.81 015-28576 06/02/2015 61.26 0.00 61.26 015-29413 06/11/2015 22.37 0.00 
22.37 Cheque #: 83077 Cheque Date 7/14/2015 28,506.32 0.00 28,506.32  SUPPLIER: 
3177  Invoice Date Description Amount Discount Net Amount C15-28339 06/08/2015 
21,318.54 0.00 21,318.54 015-28837 06/04/2015 1,529.75 0.00 1,529.75 015-28978 
06/04/2015 1,238.18 0.00 1,238.18 015-28978-01 06I04/2015 1 ,18285 0.00 
1,182.85 C15-28439 06/01/2015 1,113.86 0.00 1,113.86 015-29707 06l11/2015 
886.64 0.00 886.64 C15-28978~02 06/04/2015 526.91 0.00 526.91 015-29385 
06/09/2015 199.29 0.00 199.29 C15-28439-01 06/03/2015 157.34 0.00 157.34 
015-28670 06/03/2015 136.52 0.00 136.52 015-28314-01 06/03/2015 132.81 0.00 
132.81 C15-28576 06/02/2015 61.26 0.00 61.26 015-29413 06/11/2015 22.37 0.00 
22.37 Cheque #1 83077 Check Daie: 7/14/2015 28,506.32 0.00 28,506.32  07142015 
MMDD  TWENTY-EIGHT THOUSAND FIVE HUNDRED SIX CAD AND 32/ 100 $ 
"**28,506.32  Trans Am Piping Canada

From: Gordon Schneider [mailto:schneid...@transampiping.com]
Sent: Monday, July 18, 2016 4:05 PM
To: 'user@tika.apache.org' <user@tika.apache.org<mailto:user@tika.apache.org>>
Subject: Extract Text from a TIFF image

I have tried using the GUI for tika-app-1.13 but it shows nothing. I can see 
the metadata but that does not give me the information I need. I have attached 
the file.

Maybe it is not possible to extract the text. If so what should I be looking 
for to tell me that it cannot extract the text.

Thanks


Gordon Schneider
403-236-0601
Trans Am Piping Products Ltd.



RE: detect corrupt file and build a list of them before indexing in solr

2016-07-15 Thread Allison, Timothy B.
Rename the shell script’s extension to end in .bat and you should be good to go.


From: kostali hassan [mailto:med.has.kost...@gmail.com]
Sent: Friday, July 15, 2016 1:26 PM
To: user@tika.apache.org
Subject: Re: detect corrupt file and build a list of them before indexing in 
solr

I USE TIKA_app1.12

2016-07-15 18:20 GMT+01:00 Allison, Timothy B. 
<talli...@mitre.org<mailto:talli...@mitre.org>>:
Can you share the shell script/bat file you’re using?

From: kostali hassan 
[mailto:med.has.kost...@gmail.com<mailto:med.has.kost...@gmail.com>]
Sent: Friday, July 15, 2016 1:13 PM

To: user@tika.apache.org<mailto:user@tika.apache.org>
Subject: Re: detect corrupt file and build a list of them before indexing in 
solr

When I set the inputDir to d:\test, the log tells me: java.lang.RuntimeException: 
Crawler couldn't find this directory: D:\tika_batch_config\test
The same happens if I set the inputDir to d:\Cvs; the log says: java.lang.RuntimeException: 
Crawler couldn't find this directory: D:\tika_batch_config\Cvs

2016-07-15 17:54 GMT+01:00 kostali hassan 
<med.has.kost...@gmail.com<mailto:med.has.kost...@gmail.com>>:
I added this directory and it is still not working.

2016-07-15 17:42 GMT+01:00 Allison, Timothy B. 
<talli...@mitre.org<mailto:talli...@mitre.org>>:
Y, the log tells you that the input directory wasn’t specified correctly:

1375 2016-07-15 17:33:17,354 [Thread-2] INFO  
org.apache.tika.batch.BatchProcessDriverCLI  - BatchProcess: 
java.lang.RuntimeException: Crawler couldn't find this 
directory:D:\tika_batch_config\test

From: kostali hassan 
[mailto:med.has.kost...@gmail.com<mailto:med.has.kost...@gmail.com>]
Sent: Friday, July 15, 2016 12:40 PM

To: user@tika.apache.org<mailto:user@tika.apache.org>
Subject: Re: detect corrupt file and build a list of them before indexing in 
solr

Only -JXmx1g works, and the inputDir is empty, and I get these empty files in the logs:
batch-driver-warn.log
batch-process-warn.log
tika-batch-pdfbox.log

and these attached files.

2016-07-15 16:36 GMT+01:00 Allison, Timothy B. 
<talli...@mitre.org<mailto:talli...@mitre.org>>:
Try changing the max heap to something that will work on your computer:

-JXmx5g

To (say):

-JXmx1g
From: kostali hassan 
[mailto:med.has.kost...@gmail.com<mailto:med.has.kost...@gmail.com>]
Sent: Friday, July 15, 2016 11:27 AM
To: user@tika.apache.org<mailto:user@tika.apache.org>
Subject: Re: detect corrupt file and build a list of them before indexing in 
solr

I get these files in the logs; and when I run the script it doesn't finish, it 
restarts all the time.

2016-07-15 13:19 GMT+01:00 Allison, Timothy B. 
<talli...@mitre.org<mailto:talli...@mitre.org>>:
Sorry, you’ll get 0 byte files for an error that caused Tika batch to do a 
restart (hang/oom); and depending on cause, you may get an error logged in 
batch-process-error.xml.  If your OS kills the process or something truly 
catastrophic happens, the only trace you have is the 0 byte file.


  For regular caught exceptions, you can look in the .json file (key: 
TikaCoreProperties.TIKA_META_EXCEPTION_PREFIX+"runtime")
for the stack trace, or you can look in the logs as described below.

From: Allison, Timothy B. [mailto:talli...@mitre.org<mailto:talli...@mitre.org>]
Sent: Friday, July 15, 2016 8:11 AM
To: user@tika.apache.org<mailto:user@tika.apache.org>
Subject: RE: detect corrupt file and build a list of them before indexing in 
solr

Checking for 0 byte files is one option.  The other option is to configure the 
logs to capture exceptions.  I’ve attached the config files and the shell 
script that I use when running our large scale regression testing here: 
https://wiki.apache.org/tika/TikaBatchUsage?action=AttachFile=view=tika-batch-sh.zip

To run those, unzip the folder, put the tika-app.jar in the bin/ directory, 
update the shell script for your <input_dir> and your <output_dir> and you 
should be good to go.  You may need to create a “logs” directory.  Exceptions 
will be recorded in the batch-process-warn.log, and original file names are 
included along with stack traces.

From: kostali hassan [mailto:med.has.kost...@gmail.com]
Sent: Friday, July 15, 2016 5:17 AM
To: user@tika.apache.org<mailto:user@tika.apache.org>
Subject: detect corrupt file and build a list of them before indexing in solr

I am looking to index MS Word and PDF files by uploading data with Solr Cell using 
Apache Tika.
 I just hope to use Tika to detect corrupt files before indexing and get a list of 
the corrupted files, if that's possible.
I tried running java -jar tika-app.jar <input_dir> <output_dir>.  I get in the 
output_dir all the files of <input_dir> in XML format, and all the corrupt files 
have size 0 KB (empty).







RE: RE: PDFPaser generates gibberish

2016-07-01 Thread Allison, Timothy B.
Ah, ok, nothing we can do about it then.  Sorry.

>One more thing…
That sounds like a new line issue.  Notepad doesn’t understand \n, whereas 
WordPad and MSWord do.

From: Allison A. [mailto:alliso...@gmail.com]
Sent: Friday, July 1, 2016 1:07 AM
To: user@tika.apache.org
Subject: Re: RE: PDFPaser generates gibberish

Many thanks. Yes, PDFBox generates the gibberish. One more thing: when I 
open the extracted text with Notepad, it does not display properly, but it appears 
clearly in WordPad, MS Word, etc.

Is this an encoding issue?



RE: Rest API Documentation

2017-01-23 Thread Allison, Timothy B.
Y, our license appears to have expired.

Chris/Tyler,
  Any chance you could re-up our license?

From: ネイト・フィンドリー [mailto:nat...@zenlok.com]
Sent: Saturday, January 21, 2017 6:30 PM
To: user@tika.apache.org
Subject: Rest API Documentation

The Miredot link no longer produces documentation.  Is there another place 
these can be read?


RE: Extracting vector graphics from pdf

2017-02-28 Thread Allison, Timothy B.
Thank you, Tilman!

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Monday, February 27, 2017 9:38 AM
To: us...@pdfbox.apache.org
Cc: user@tika.apache.org
Subject: Re: Extracting vector graphics from pdf

http://stackoverflow.com/a/38933039/535646

This lets you collect the lines.  However, it won't output an image.

Tilman

Am 27.02.2017 um 13:20 schrieb Allison, Timothy B.:
> PDFBox Colleagues,
>Any recommendations?
>
>Best,
>
>   Tim
>
> -Original Message-
> From: Andisa Dewi [mailto:theknight...@yahoo.com]
> Sent: Monday, February 27, 2017 5:32 AM
> To: user@tika.apache.org
> Subject: Extracting vector graphics from pdf
>
> Hello guys,
>
> I'm currently extracting images from a whole lot of PDF files; however, some 
> of the images (or figures) are somehow not extracted. I'm thinking it might have 
> to do with the fact that those images are vector graphics (as is usually the 
> case in a lot of scientific papers). My question is: is it possible to 
> extract vector graphics from PDFs using Tika?
>
> I attached an example of the pdf (here for example, all images are extracted 
> except Figure 2).
>
> The way I'm extracting the images is the same as in the example code:
>
> Parser parser = new AutoDetectParser();
> Metadata m = new Metadata();
> ParseContext c = new ParseContext();
> ContentHandler h = new BodyContentHandler(-1);
> PDFParserConfig pdfConfig = new PDFParserConfig();
> pdfConfig.setExtractInlineImages(true);
> c.set(PDFParserConfig.class, pdfConfig);
> c.set(Parser.class, parser);
> EmbeddedDocumentExtractor ex = new MyEmbeddedDocumentExtractor(c);
> c.set(EmbeddedDocumentExtractor.class, ex);
> parser.parse(inputstream, h, m, c);
>
>
> Thanks!
>
> Regards,
>
> Eli
>
>
>
> -
> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> For additional commands, e-mail: users-h...@pdfbox.apache.org




FW: Tika calling exiftool and ffmpeg?

2016-09-01 Thread Allison, Timothy B.


From: Chris Bamford [mailto:cbamf...@mimecast.com]
Sent: Thursday, September 1, 2016 7:03 AM
To:  
Subject: Tika calling exiftool and ffmpeg?

Hi

I recently noticed on my linux box in the auditd logs that my JVM is making 
repeated attempts to call exiftool and ffmpeg.  While it is just noisy and not 
necessarily a problem, I'd like to understand more about when / why this 
happens.  I assume it is something related to the tika-external-parsers.xml 
file ?

I am using tika 1.12.

Thanks

- Chris

Chris Bamford
Lead Software Engineer, Mimecast
m: +44 7860 405292  |  p: +44 207 847 8700
www.mimecast.com









RE: Apache Tika: characters in a PDF's body text come out repeated

2016-09-14 Thread Allison, Timothy B.
Again, relying on google translate.  Y, I would think that suppressing 
overlapping characters should solve this problem.  Try pure PDFBox, and if the 
problem is there, try asking on the PDFBox list.
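
For reference, a minimal pure-PDFBox check of that setting might look like this (a sketch assuming the PDFBox 2.x API; the file path is the one from your example):

PDDocument doc = PDDocument.load(new File("/usr/local/sample.pdf"));
try {
    PDFTextStripper stripper = new PDFTextStripper();
    // drop duplicate glyphs that are drawn on top of each other ("poor man's bold")
    stripper.setSuppressDuplicateOverlappingText(true);
    System.out.println(stripper.getText(doc));
} finally {
    doc.close();
}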


Please let me jump right in and ask a question about the subject above.

I am parsing PDFs with Apache Tika in Java.
Most PDFs are read correctly, but some produce parse errors, and even when
parsing succeeds, characters in the body text come out repeated.

What I would like to ask about here is the cause of these repeated characters and how to deal with them.
Below is an excerpt showing one such pattern from the long text the parse produced.

⇒ 「(1)(1)(1)(1)林火林火林火林火DBDBDBDB」

My guess (just a guess) is that when Tika extracted the part of the PDF where
"(1)風林火山用DB" is written, it also pulled out something that is embedded in the
PDF but not visible when the file is opened normally: a PDF comment? accessibility
text? something of that sort.

Source:
-
File document = new File("/usr/local/sample.pdf");
Parser parser = new AutoDetectParser();
ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
Metadata metadata = new Metadata();
parser.parse(new FileInputStream(document), handler, metadata, new ParseContext());
String plainText = handler.toString();
System.out.println(plainText);
-

What is the cause, and is there a workaround (a Tika setting, etc.)?



Also, the above did not help.  The places where characters are repeated seem to
contain bold text or underlines, so I modified the code as shown below, but the
result did not change at all.
Do you see any problems with it, or do you have any solutions?

Source:
-
File document = new File("/usr/local/sample.pdf");
PDFParser parser = new PDFParser();
PDFParserConfig config = new PDFParserConfig();

// Whether to ignore duplicated characters where bold text is drawn by overlapping glyphs = I want them ignored
config.setSuppressDuplicateOverlappingText(true);

// Whether to ignore annotation text such as underlines = I want it ignored
config.setExtractAnnotationText(false);

// (Note: as written, this config object is never registered via a ParseContext,
// so it has no effect on the parse call below.)
parser.parse(new FileInputStream(document), handler, metadata, new ParseContext());

String plainText = handler.toString();
System.out.println(plainText);
-


Tika beginner



RE: Correction: Apache Tika garbles text when reading EUC or Shift-JIS encoded HTML

2016-09-14 Thread Allison, Timothy B.
Again, relying on Google translate.

The problem with these files is that they don't self identify their encoding 
via http metaheaders, and they contain very little content so Mozilla's 
UniversalChardet and ICU4J don't have enough to work with.  IE, Chrome and 
Firefox all fail on these files, too.

If you know that a file is EUC_JP, you can send a hint via the metadata before 
the call to parse:


Metadata metadata = new Metadata();
metadata.set(Metadata.CONTENT_TYPE, "text/html; charset=EUC_JP");
parser.parse(new FileInputStream(document), handler, metadata, new ParseContext());
String plainText = handler.toString();


-Original Message-
From: question.answer...@gmail.com [mailto:question.answer...@gmail.com] 
Sent: Wednesday, September 14, 2016 7:37 AM
To: user@tika.apache.org
Subject: Correction: Apache Tika garbles text when reading EUC or Shift-JIS encoded HTML

The file that comes out garbled when Tika reads it is the one attached to this email.

* The file attached to my previous email seems to have had its character encoding
  changed when it was saved with the Hidemaru editor, so it does not garble.

-
Hello.

I am stuck on a problem.

With Apache Tika, reading EUC or Shift-JIS encoded HTML produces garbled text.

What is the cause, and is there a workaround (a Tika setting, etc.)?

* I am attaching an HTML file that garbles when read.
  (It is an EUC-encoded file, according to the Hidemaru editor.)

Source:
-
File document = new File("/usr/local/sample.pdf");
Parser parser = new AutoDetectParser();
ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
Metadata metadata = new Metadata();
parser.parse(new FileInputStream(document), handler, metadata, new ParseContext());
String plainText = handler.toString();
System.out.println(plainText);
-


-- 
Technical beginner


RE: Apache Tika: the entire text is garbled when importing a protected PDF

2016-09-14 Thread Allison, Timothy B.
If a PDF requires a password (and it isn't the empty string) and you have the 
password, you need to send it in via the ParseContext:

ParseContext context = new ParseContext();
context.set(PasswordProvider.class, new PasswordProvider() {
    public String getPassword(Metadata metadata) {
        return "thisIsThePassword";
    }
});
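
Then pass that context into the parse call, e.g. (a minimal sketch; the path is the one from your example):

Parser parser = new AutoDetectParser();
ContentHandler handler = new BodyContentHandler(-1);
Metadata metadata = new Metadata();
try (InputStream is = new FileInputStream("/usr/local/sample.pdf")) {
    parser.parse(is, handler, metadata, context);
}
String plainText = handler.toString();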

-Original Message-
From: question.answer...@gmail.com [mailto:question.answer...@gmail.com] 
Sent: Wednesday, September 14, 2016 11:55 AM
To: user@tika.apache.org
Subject: Re: Apache Tika: the entire text is garbled when importing a protected PDF

Are you saying that the text of a protected PDF file cannot be parsed with Tika?
If that is simply how Tika works, I will give up on parsing it.


Is this expected Tika behavior?


-- 
question.answer...@gmail.com 



> Relying on google translate...  I'm not sure how protection could lead to 
> garbled text; if the file is password protected, you shouldn't get any text.
> 
> 
> Try troubleshooting with pure PDFBox:
> 
> https://wiki.apache.org/tika/Troubleshooting%20Tika#PDF_Text_Problems
> 
> 
> -Original Message-
> From: question.answer...@gmail.com [mailto:question.answer...@gmail.com] 
> Sent: Wednesday, September 14, 2016 7:22 AM
> To: user@tika.apache.org
> Subject: Apache Tika: the entire text is garbled when importing a protected PDF
> 
> Hello everyone, this is my first post here.
> 
> I am a Tika beginner.
> 
> Please let me ask a question about the subject above.
> 
> With Apache Tika, when I import a protected PDF, the entire text comes out garbled.
> Is this expected behavior?
> Is there a way (settings, etc.) to work around it and import the text without garbling?
>   * Unprotected PDFs are imported without any garbling.
> 
> What is the cause, and is there a workaround (a Tika setting, etc.)?
> 
> 
> Source:
> -
> File document = new File("/usr/local/sample.pdf");
> Parser parser = new AutoDetectParser();
> ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
> Metadata metadata = new Metadata();
> parser.parse(new FileInputStream(document), handler, metadata, new ParseContext());
> String plainText = handler.toString();
> System.out.println(plainText);
> -
> 
> 
> Note:
> - Text cannot be copied manually from the protected PDF.
> 
> 
> Tika beginner




RE: Correction: Apache Tika garbles text when reading EUC or Shift-JIS encoded HTML

2016-09-14 Thread Allison, Timothy B.
Sorry, can't tell what the question is?

-Original Message-
From: question.answer...@gmail.com [mailto:question.answer...@gmail.com] 
Sent: Wednesday, September 14, 2016 11:50 AM
To: Allison, Timothy B. <talli...@mitre.org>
Subject: Re: Correction: Apache Tika garbles text when reading EUC or Shift-JIS encoded HTML

Hi :)

How should I use the following in my Tika program?

-
Tika applies the following detectors in this order:

org.apache.tika.parser.html.HtmlEncodingDetector
org.apache.tika.parser.txt.UniversalEncodingDetector
org.apache.tika.parser.txt.Icu4jEncodingDetector

These are specified in META-INF/services/org.apache.tika.detect.EncodingDetector

Tika selects the first detector that returns a non-null value.
-


-- 
question.answer...@gmail.com <question.answer...@gmail.com>



> Ha, thank you for running google translate for me. :)
> 
> If the question is: "If I don't know the encoding before I send it to Tika, 
> how does Tika determine the encoding?"
> 
> Tika applies the following detectors in this order:
> 
> org.apache.tika.parser.html.HtmlEncodingDetector
> org.apache.tika.parser.txt.UniversalEncodingDetector
> org.apache.tika.parser.txt.Icu4jEncodingDetector
> 
> These are specified in 
> META-INF/services/org.apache.tika.detect.EncodingDetector
> 
> Tika selects the first detector that returns a non-null value.
> 
> You can modify the service loading file to run the encoders in a different 
> order or to specify your own encoding detector.
> 
> If the question is, "Why can't Tika get it right?"  Well, there are limits to 
> statistical inference on only a few observations (small amount of bytes). :)
> 
> -Original Message-
> From: question.answer...@gmail.com [mailto:question.answer...@gmail.com] 
> Sent: Wednesday, September 14, 2016 11:06 AM
> To: user@tika.apache.org
> Cc: Allison, Timothy B. <talli...@mitre.org>
> Subject: Re: Correction: Apache Tika garbles text when reading EUC or Shift-JIS encoded HTML
> 
> Thank you for your answer.
> 
> I cannot determine the file's character encoding (EUC, Shift-JIS, UTF-8, etc.) in advance.
> I would like a Java library, or Tika itself, to determine it for me.
> I want to know how to do that.
> 
> 
> Technical beginner
> 
> 
> 
> > Again, relying on Google translate.
> > 
> > The problem with these files is that they don't self identify their 
> > encoding via http metaheaders, and they contain very little content so 
> > Mozilla's UniversalChardet and ICU4J don't have enough to work with.  IE, 
> > Chrome and Firefox all fail on these files, too.
> > 
> > If you know that a file is EUC_JP, you can send a hint via the metadata 
> > before the call to parse:
> > 
> > 
> > Metadata metadata = new Metadata();
> > metadata.set(Metadata.CONTENT_TYPE, "text/html; charset=EUC_JP"); 
> > parser.parse(new FileInputStream(document), handler, metadata
> >  
> > , new ParseContext()); String plainText = handler.toString();
> > 
> > 
> > -Original Message-
> > From: question.answer...@gmail.com 
> > [mailto:question.answer...@gmail.com]
> > Sent: Wednesday, September 14, 2016 7:37 AM
> > To: user@tika.apache.org
> > Subject: Correction: Apache Tika garbles text when reading EUC or Shift-JIS encoded HTML
> > 
> > The file that comes out garbled when Tika reads it is the one attached to this email.
> > 
> > * The file attached to my previous email seems to have had its character encoding
> >   changed when it was saved with the Hidemaru editor, so it does not garble.
> > 
> > -
> > Hello.
> > 
> > I am stuck on a problem.
> > 
> > With Apache Tika, reading EUC or Shift-JIS encoded HTML produces garbled text.
> > 
> > What is the cause, and is there a workaround (a Tika setting, etc.)?
> > 
> > * I am attaching an HTML file that garbles when read.
> >   (It is an EUC-encoded file, according to the Hidemaru editor.)
> > 
> > Source:
> > -
> > File document = new File("/usr/local/sample.pdf");
> > Parser parser = new AutoDetectParser();
> > ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
> > Metadata metadata = new Metadata();
> > parser.parse(new FileInputStream(document), handler, metadata, new ParseContext());
> > String plainText = handler.toString();
> > System.out.println(plainText);
> > -
> > 
> > 
> > --
> > Technical beginner




RE: Correction: Apache Tika garbles text when reading EUC or Shift-JIS encoded HTML

2016-09-14 Thread Allison, Timothy B.
Ha, thank you for running google translate for me. :)

If the question is: "If I don't know the encoding before I send it to Tika, how 
does Tika determine the encoding?"

Tika applies the following detectors in this order:

org.apache.tika.parser.html.HtmlEncodingDetector
org.apache.tika.parser.txt.UniversalEncodingDetector
org.apache.tika.parser.txt.Icu4jEncodingDetector

These are specified in META-INF/services/org.apache.tika.detect.EncodingDetector

Tika selects the first detector that returns a non-null value.

You can modify the service loading file to run the encoders in a different 
order or to specify your own encoding detector.
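
If it helps, a minimal sketch of running those same detectors yourself, in that order (this assumes the org.apache.tika.detect.EncodingDetector API; the detectors need a mark-supported stream, hence the BufferedInputStream, and "sample.html" is just a placeholder):

EncodingDetector[] detectors = new EncodingDetector[] {
        new HtmlEncodingDetector(),
        new UniversalEncodingDetector(),
        new Icu4jEncodingDetector()
};
Metadata metadata = new Metadata();
Charset charset = null;
try (InputStream is = new BufferedInputStream(new FileInputStream("sample.html"))) {
    for (EncodingDetector detector : detectors) {
        // take the first non-null answer, as Tika does
        charset = detector.detect(is, metadata);
        if (charset != null) {
            break;
        }
    }
}
System.out.println("detected: " + charset);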

If the question is, "Why can't Tika get it right?"  Well, there are limits to 
statistical inference on only a few observations (small amount of bytes). :)

-Original Message-
From: question.answer...@gmail.com [mailto:question.answer...@gmail.com] 
Sent: Wednesday, September 14, 2016 11:06 AM
To: user@tika.apache.org
Cc: Allison, Timothy B. <talli...@mitre.org>
Subject: Re: Correction: Apache Tika garbles text when reading EUC or Shift-JIS encoded HTML

Thank you for your answer.

I cannot determine the file's character encoding (EUC, Shift-JIS, UTF-8, etc.) in advance.
I would like a Java library, or Tika itself, to determine it for me.
I want to know how to do that.


Technical beginner



> Again, relying on Google translate.
> 
> The problem with these files is that they don't self identify their encoding 
> via http metaheaders, and they contain very little content so Mozilla's 
> UniversalChardet and ICU4J don't have enough to work with.  IE, Chrome and 
> Firefox all fail on these files, too.
> 
> If you know that a file is EUC_JP, you can send a hint via the metadata 
> before the call to parse:
> 
> 
> Metadata metadata = new Metadata();
> metadata.set(Metadata.CONTENT_TYPE, "text/html; charset=EUC_JP"); 
> parser.parse(new FileInputStream(document), handler, metadata, new ParseContext());
> String plainText = handler.toString();
> 
> 
> -Original Message-
> From: question.answer...@gmail.com 
> [mailto:question.answer...@gmail.com]
> Sent: Wednesday, September 14, 2016 7:37 AM
> To: user@tika.apache.org
> Subject: Correction: Apache Tika garbles text when reading EUC or Shift-JIS encoded HTML
> 
> The file that comes out garbled when Tika reads it is the one attached to this email.
> 
> * The file attached to my previous email seems to have had its character encoding
>   changed when it was saved with the Hidemaru editor, so it does not garble.
> 
> -
> Hello.
> 
> I am stuck on a problem.
> 
> With Apache Tika, reading EUC or Shift-JIS encoded HTML produces garbled text.
> 
> What is the cause, and is there a workaround (a Tika setting, etc.)?
> 
> * I am attaching an HTML file that garbles when read.
>   (It is an EUC-encoded file, according to the Hidemaru editor.)
> 
> Source:
> -
> File document = new File("/usr/local/sample.pdf");
> Parser parser = new AutoDetectParser();
> ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
> Metadata metadata = new Metadata();
> parser.parse(new FileInputStream(document), handler, metadata, new ParseContext());
> String plainText = handler.toString();
> System.out.println(plainText);
> -
> 
> 
> --
> Technical beginner



RE: Disabling Zip bomb detection in Tika

2016-09-22 Thread Allison, Timothy B.
> I'll try to get a sample HTML yielding to this problem and attach it to Jira.

Great!  Tika 1.14 is around the corner...if this is an easy fix ... :)

Thank you.



RE: Is creating new AutoDetectParsers expensive?

2016-09-30 Thread Allison, Timothy B.
You can reuse AutoDetectParser in a multithreaded environment.  You shouldn’t 
have problems with performance or thread safety.

If you find otherwise, please let us know! ☺

From: Haris Osmanagic [mailto:haris.osmana...@gmail.com]
Sent: Friday, September 30, 2016 10:36 AM
To: user@tika.apache.org
Subject: Is creating new AutoDetectParsers expensive?

Hi all!
Let's assume there are really many files to be parsed, and the operation is 
repeated a relatively large number of times each day.
Is it, in that case, too expensive to create new AutoDetectParsers for every 
file? Or, in other words, if I were to reuse a AutoDetectParser for a large 
number of files, would I:
* Have problems with thread-safety?
* Have problems with performance?
Thank you very much!
Haris Osmanagić


RE: Is creating new AutoDetectParsers expensive?

2016-09-30 Thread Allison, Timothy B.
In an earlier version of tika-batch, we had a single AutoDetectParser per 
thread, and we had no problems.  I experimented with a single AutoDetectParser 
across the threads, and we didn’t have problems.

Because of configuration issues, tika-batch is now creating a new parser for 
each file.

In our unit test suite, last I experimented with this, the first initialization 
did take a while, but then there was no measurable extra cost to instantiating 
a new parser.   In short, we didn’t save anything by using a static 
AutoDetectParser instead of just instantiating a new one for each unit test.
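
Either way works.  If you do want to share one instance, something like this is fine (a minimal sketch; the field and method names are illustrative):

private static final AutoDetectParser PARSER = new AutoDetectParser();

public static String parseToText(File f) throws Exception {
    ContentHandler handler = new BodyContentHandler(-1);
    Metadata metadata = new Metadata();
    try (InputStream is = TikaInputStream.get(f.toPath())) {
        // the shared parser instance can be called concurrently from multiple threads
        PARSER.parse(is, handler, metadata, new ParseContext());
    }
    return handler.toString();
}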

If you are going from file system to file system, you might want to consider 
tika-batch.

java -jar tika-app.jar -i <input_directory> -o <output_directory>

If you have a whole lot of files (millions), try to isolate Tika in its own jvm 
or server or data center; bad things can happen.  See slide 17: 
http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf

And: 
http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/

From: Haris Osmanagic [mailto:haris.osmana...@gmail.com]
Sent: Friday, September 30, 2016 10:54 AM
To: user@tika.apache.org
Subject: Re: Is creating new AutoDetectParsers expensive?

I read the first sentence and thought: "Yes! I can save ourselves a bunch of 
memory!"
Then I read the second: "Oh, oh, do I dare trying it out?" : )
Thank you very much for the super-speedy response!

On Fri, Sep 30, 2016 at 4:46 PM Allison, Timothy B. 
<talli...@mitre.org> wrote:
You can reuse AutoDetectParser in a multithreaded environment.  You shouldn’t 
have problems with performance or thread safety.

If you find otherwise, please let us know! ☺

From: Haris Osmanagic 
[mailto:haris.osmana...@gmail.com]
Sent: Friday, September 30, 2016 10:36 AM
To: user@tika.apache.org
Subject: Is creating new AutoDetectParsers expensive?

Hi all!
Let's assume there are really many files to be parsed, and the operation is 
repeated a relatively large number of times each day.
Is it, in that case, too expensive to create new AutoDetectParsers for every 
file? Or, in other words, if I were to reuse a AutoDetectParser for a large 
number of files, would I:
* Have problems with thread-safety?
* Have problems with performance?
Thank you very much!
Haris Osmanagić


RE: PDF Processing

2016-11-07 Thread Allison, Timothy B.
>https://www.free-decompiler.com/flash/ and perhaps a thing for me to do would 
>be to add abstraction support for this parser to Tika.

Y, the license on that is incompatible with the Apache License[1], so we can't 
include it unless we get the authors to change the license.  Also, it looks 
like it requires native libs?  But _you_ could easily write a wrapper for it 
for your use [2].
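
A skeleton for such a wrapper would look roughly like this (a sketch only; the class name and the call-out to the decompiler are illustrative, not an existing Tika parser):

import java.io.IOException;
import java.io.InputStream;
import java.util.Collections;
import java.util.Set;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AbstractParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.XHTMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class SwfDecompilerParser extends AbstractParser {

    private static final Set<MediaType> SUPPORTED_TYPES =
            Collections.singleton(MediaType.application("x-shockwave-flash"));

    @Override
    public Set<MediaType> getSupportedTypes(ParseContext context) {
        return SUPPORTED_TYPES;
    }

    @Override
    public void parse(InputStream stream, ContentHandler handler, Metadata metadata,
                      ParseContext context) throws IOException, SAXException, TikaException {
        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
        xhtml.startDocument();
        // hand the stream to the external decompiler here and write whatever
        // text/scripts it recovers, e.g. xhtml.element("p", recoveredText);
        xhtml.endDocument();
    }
}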

> I am using Tika to extract and later scan/process components of any document 
> that may perform malicious actions ... either add enhancements or maybe add 
> new parsers.
This has been something I'm struggling with...how much is the right amount of 
processing for a general tool like Tika.  True forensic analysis is, indeed, a 
tall order, and there is an abundance of file-format-specific scripts to handle 
various aspects.  In short, be careful on relying on what we have so far and 
please do open issues as needed:

1)  We made some recent improvements to macro extraction in POI, but those 
won't be folded in until Tika 1.15.

2)  My initial patch for javascript extraction from PDFs will not handle 
the more fun obfuscation techniques [3].

   Best,

 Tim

[1] https://www.apache.org/legal/resolved#category-x
[2] https://tika.apache.org/1.13/parser_guide.html
[3] just google pdf javascript obfuscation

From: Jim Idle [mailto:ji...@proofpoint.com]
Sent: Sunday, November 6, 2016 8:41 PM
To: user@tika.apache.org
Subject: RE: PDF Processing

I forgot to answer "If there are other components that you'd like to have 
extracted, let us know, and we'll consider adding them." I am using Tika to 
extract and later scan/process components of any document that may perform 
malicious actions. So that is any script-like or macro-like construct, plus any 
binary data, embedded images and so forth. So essentially I need to break down 
all components of all documents, which is a tall order of course. But it seems 
like the collection of parsers that Tika provides is my best bet, and either 
add enhancements or maybe add new parsers.

For instance it seems that Flash is only supported via flv. There is what looks 
like a good parser here: https://www.free-decompiler.com/flash/ and perhaps a 
thing for me to do would be to add abstraction support for this parser to Tika.

Jim

From: Jim Idle [mailto:ji...@proofpoint.com]
Sent: Thursday, November 3, 2016 10:11
To: user@tika.apache.org
Subject: RE: PDF Processing

PDAction extraction is probably what I need. Embedded streams in general, 
though for non-text "pieces" it would be fine to get offset and length 
information from some event. I will take a look at your example output below.

I'll press on with Tika as an abstraction for now as I generally like what I 
see. I am just a bit worried that the one abstraction to rule them all may 
preclude me from easily handling more esoteric parts of some document formats.

I presume that the best way to request enhancements is to create a JIRA entry 
so it can be tracked?

Thanks for your help,

Jim

From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Wednesday, November 2, 2016 19:02
To: user@tika.apache.org
Subject: RE: PDF Processing

It depends (tm).  As soon as 1.14 is released, I'll add PDAction extraction 
from PDFs (TIKA-2090), and that will include javascript (as stored in 
PDActions)... that capability doesn't currently exist.  If there are other 
components that you'd like to have extracted, let us know, and we'll consider 
adding them.

If you want a look at what javascript extraction will look like, I recently 
extracted ~70k javascript elements from our 500k regression corpus:
http://162.242.228.174/embedded_files

specifically:

http://162.242.228.174/embedded_files/js_in_pdfs.tar.bz2

> entire structure of a document and extract any or all pieces from it.
Within reason(tm), that _is_ the goal of Tika.  The focus is text, but we try 
to maintain some structural information where we can, e.g. bold/italic/lists 
and paragraph boundaries in MSOffice and related formats.  We do not do full 
stylistic extraction (font name, size, etc), but the general formatting 
components that apply across formats, we try to maintain.



From: Jim Idle [mailto:ji...@proofpoint.com]
Sent: Wednesday, November 2, 2016 3:30 AM
To: user@tika.apache.org

RE: Tika server RTF processing

2016-11-28 Thread Allison, Timothy B.
This is helpful.  How are you calling tika-server?  Are you specifying a 
content-type?

If I specify type=”application/rtf” or if I don’t specify a type, all is good.  
However, I get the same stacktrace that you shared if I incorrectly specify 
“application/msword”.
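
For reference, this is roughly how I exercised it from the command line (a sketch; assumes tika-server running locally on the default port 9998 and a local test.rtf):

curl -T test.rtf http://localhost:9998/tika --header "Content-Type: application/rtf"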

From: Allison A. [mailto:alliso...@gmail.com]
Sent: Thursday, November 24, 2016 10:39 PM
To: user@tika.apache.org
Subject: Re: Tika server RTF processing

Oops, I am re-posting and attaching them. It seems Ajax calls are not passed 
properly.

On Thu, Nov 24, 2016 at 7:06 AM, Allison, Timothy B. 
<talli...@mitre.org> wrote:
There was a bug in some RTF files in 1.13, but that was fixed in 1.14 
(TIKA-1845).  We now have one rtf in our test suite for tika-server.

If you turn logging on, can you share a stacktrace, or can you share the 
offending file?

From: Allison A. [mailto:alliso...@gmail.com]
Sent: Tuesday, November 22, 2016 10:00 PM
To: user@tika.apache.org
Subject: Tika server RTF processing

I am wondering if the RTF parser is working in Tika server 1.13 or 1.14 via an 
Ajax call.

I have tried both versions, but it seems I was not able to pass an Ajax call to 
Tika server, getting a 422 error (Unprocessable Entity). It worked fine with 
other MS Office documents (Word, Excel, etc.) except RTF.

Thanks in advance.

Allison



RE: Temporary Files Location

2016-11-23 Thread Allison, Timothy B.
Have you tried via java opt:

-Djava.io.tmpdir=/someotherdir
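
e.g. (a sketch; the path and file name are illustrative):

java -Djava.io.tmpdir=/path/with/more/space -jar tika-app.jar some-huge-file.pdf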

From: Vérène Houdebine [mailto:verene.houdeb...@orange.fr]
Sent: Wednesday, November 23, 2016 8:38 AM
To: user 
Subject: Temporary Files Location


Hi!

I'm using Tika on a partitioned server that doesn't have much space in /tmp.
Since Tika stores all temporary files in /tmp and because I'm parsing huge 
files this can become a problem rapidly...
Is there a way to change the location where temporary files are being stored?

I saw a method called setTemporaryFileDirectory in TemporaryResources but it 
seems like it is never called anywhere in Tika


Thank you in advance for your help,

Verene


RE: Tika server RTF processing

2016-11-23 Thread Allison, Timothy B.
There was a bug in some RTF files in 1.13, but that was fixed in 1.14 
(TIKA-1845).  We now have one rtf in our test suite for tika-server.

If you turn logging on, can you share a stacktrace, or can you share the 
offending file?

From: Allison A. [mailto:alliso...@gmail.com]
Sent: Tuesday, November 22, 2016 10:00 PM
To: user@tika.apache.org
Subject: Tika server RTF processing

I am wondering if the RTF parser is working in Tika server 1.13 or 1.14 via an 
Ajax call.

I have tried both versions, but it seems I was not able to pass an Ajax call to 
Tika server, getting a 422 error (Unprocessable Entity). It worked fine with 
other MS Office documents (Word, Excel, etc.) except RTF.

Thanks in advance.

Allison


  1   2   >