Hello all,

Text extraction with AutoDetectParser worked for me before, but since Tika
1.11 it returns empty output.

I checked the manual and searched online to see what has changed so that I
could update my code, but without success, so I decided to email the users
list with a detailed walk-through of my code and what I see in the debugger.

Basically I was doing the following quite successfully until 1.11:

1) First I read a file into bytes:

String originalFilename = "/MyBio.doc";

InputStream stream = this.getClass().getResourceAsStream(originalFilename);
byte[] bytes = null; // initialized so it can be used after the try block
try {
  bytes = IOUtils.toByteArray(stream);
} catch (Exception e) {
  e.printStackTrace();
}

So far, so good: bytes is now filled.
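
For reference, step 1 can also be written with the JDK alone. This is only a
minimal sketch, assuming Java 9 or later for InputStream.readAllBytes(); the
class name ResourceBytes and the read() helper are made up for this example
and are not part of my real code. It also fails loudly when
getResourceAsStream() returns null, which happens when the resource is not
found on the classpath.

```java
import java.io.IOException;
import java.io.InputStream;

public class ResourceBytes {

    /** Reads the whole stream into memory and closes it afterwards. */
    static byte[] read(InputStream stream) throws IOException {
        // getResourceAsStream() returns null when the resource is missing
        if (stream == null) {
            throw new IOException("resource not found on the classpath");
        }
        try (InputStream in = stream) {
            return in.readAllBytes(); // requires Java 9+
        }
    }

    public static void main(String[] args) {
        try {
            byte[] bytes = read(ResourceBytes.class.getResourceAsStream("/MyBio.doc"));
            System.out.println(bytes.length + " bytes read");
        } catch (IOException e) {
            System.out.println("Could not read the file: " + e.getMessage());
        }
    }
}
```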

2) Then I parse the bytes. This used to work fine, but not anymore:


ByteArrayInputStream is = new ByteArrayInputStream(bytes);
Metadata metadata = new Metadata();
if (originalFilename.length() > 0) {
  metadata.set(Metadata.RESOURCE_NAME_KEY, originalFilename);
}
Parser parser = new AutoDetectParser(); // Should auto-detect!
StringWriter textBuffer = new StringWriter();
BodyContentHandler handler = new BodyContentHandler(textBuffer);
ParseContext context = new ParseContext();
parser.parse(is, handler, metadata, context);
// How I originally got the output
System.out.println(textBuffer.toString());
// Also tried this; it prints nothing either
System.out.println(handler.toString());

On the debugger all is fine. Metadata object is properly created.

I have a BodyContentHandler initialized with an empty textBuffer.

It is passed to the parser together with the ByteArrayInputStream is (which
is full), the metadata and the ParseContext.

Looking inside the method parser.parse, I can see that the variables are
correctly populated.

The mediaType is properly identified as application/msword

The Metadata object has resourceName=/MyBio.doc Content-Type=application/msword

The Stream object has the full buffer as passed on the call.

From the AutoDetectParser.parse() method:

The TikaInputStream object has the stream as passed.

The MediaType object is correct: application/msword



The SecureContentHandler is properly created at the line:

            // TIKA-216: Zip bomb prevention
            SecureContentHandler sch =
                handler != null ? new SecureContentHandler(handler, tis) : null;


From the CompositeParser instance, in the parse() method, I have:

TikaInputStream taggedStream correctly populated with the stream contents.

TaggedContentHandler taggedHandler gets the BodyContentHandler object
passed and it is not null.

However, at this check:

            if (parser instanceof ParserDecorator) {
                metadata.add("X-Parsed-By",
                    ((ParserDecorator) parser).getWrappedParser().getClass().getName());
            } else {
                metadata.add("X-Parsed-By", parser.getClass().getName());
            }

It goes to the else branch and records the EmptyParser, so the Metadata
object now reads:

X-Parsed-By=org.apache.tika.parser.EmptyParser
resourceName=/MyBio.doc Content-Type=application/msword

No exceptions are thrown.

When the original call above, parser.parse(is, handler, metadata, context),
returns, both handler.toString() and textBuffer.toString() are empty. It
used to work really well before Tika 1.11.

I wonder whether I need to do something so that the EmptyParser is not
used, since this was working before.
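
One check that might narrow this down, offered as a sketch under an
assumption: as far as I understand, Tika locates its parser implementations
through service files named META-INF/services/org.apache.tika.parser.Parser,
and falls back to EmptyParser when no parser is registered for the detected
type. The class below (ParserServiceCheck, a name made up for this example)
uses only the JDK to list every such file visible on the classpath.

```java
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ParserServiceCheck {

    // Assumption: the service file through which parser jars register
    // their org.apache.tika.parser.Parser implementations.
    static final String SERVICE_FILE =
        "META-INF/services/org.apache.tika.parser.Parser";

    /** Returns every copy of the named resource visible to the class loader. */
    static List<URL> findResources(ClassLoader loader, String name)
            throws IOException {
        return new ArrayList<>(Collections.list(loader.getResources(name)));
    }

    public static void main(String[] args) throws IOException {
        ClassLoader loader = ParserServiceCheck.class.getClassLoader();
        List<URL> found = findResources(loader, SERVICE_FILE);
        if (found.isEmpty()) {
            System.out.println("No parser service files found on the classpath");
        }
        for (URL url : found) {
            System.out.println(url); // shows which jar contributes parsers
        }
    }
}
```

If this prints no jar locations when run with the same classpath as my
application, the parser jars are probably missing at runtime, which would
fit the EmptyParser entry in X-Parsed-By.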

Thank you,

C.
