Normally, parsers always call ContentHandler.startDocument() in their
parse(... ContentHandler ...) method. This is also true in the case of an
error, which normally results in an exception being thrown.

We wrote and maintain an open source crawler library (leech crawler) based on
Tika, where we work with special content handlers that deal with the
recursive crawling issues. To detect an error during the crawl, we rely on an
exception being thrown. Conversely, when there is no error, we need to detect
that an entity was crawled (to count the crawled items, etc.). For this, we
implemented the startDocument() method inside our ContentHandler decorators.
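
To make the idea concrete, here is a minimal sketch of such a counting
decorator. It is not the actual leech code: the class name CountingHandler
and the method getDocumentCount() are made up for illustration, and it uses
only the JDK's XMLFilterImpl to forward events to the wrapped handler.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical decorator: counts startDocument() calls and forwards every
// SAX event to the wrapped handler (XMLFilterImpl does the forwarding).
class CountingHandler extends XMLFilterImpl {
    private int documentCount = 0;

    CountingHandler(ContentHandler delegate) {
        setContentHandler(delegate);
    }

    @Override
    public void startDocument() throws SAXException {
        documentCount++;          // one crawled entity was seen
        super.startDocument();    // pass the event on to the delegate
    }

    int getDocumentCount() {
        return documentCount;
    }

    public static void main(String[] args) throws SAXException {
        CountingHandler handler = new CountingHandler(new DefaultHandler());
        handler.startDocument();
        handler.endDocument();
        System.out.println(handler.getDocumentCount()); // prints 1
    }
}
```

A parser that returns without calling startDocument() leaves the counter
untouched, so the crawled entity is never counted.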


This works like a charm, but MP4Parser contains these lines of code:


Lines 146-154, parse() method:

        MovieBox moov = getOrNull(isoFile, MovieBox.class);
        if (moov == null) {
           // Bail out
           return;
        }


        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
        xhtml.startDocument();
        ......
        ......


There, when no MovieBox is found (an MP4 file without content?), marked by
the 'Bail out' comment, the parse method returns silently, at least from our
point of view.

I don't know whether this is also a problem in general (Tika itself has
plenty of ContentHandler decorators), but from our point of view Tika signals
empty content by invoking xhtml.startDocument() and xhtml.endDocument() with
nothing in between. If this moov == null situation is meant to be an error,
an exception should be thrown.


If we are right (and we hope so, because we need this ;) ), we would like to
suggest the following modification:



        MovieBox moov = getOrNull(isoFile, MovieBox.class);
        if (moov == null) {
           // Bail out
           handler.startDocument();
           handler.endDocument();

           return;
        }
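
From the crawler's side, the three possible outcomes look roughly like the
sketch below. Everything in it is a stand-in and not Tika code: StubParser,
SeenHandler, outcome() and the two parse* methods are hypothetical names that
only mimic a silent return, an empty document, and an exception.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class BailOutDemo {
    // Hypothetical stand-in for a parser's parse(... ContentHandler ...) call.
    interface StubParser {
        void parse(ContentHandler handler) throws SAXException;
    }

    // Records whether startDocument() was ever invoked.
    static class SeenHandler extends DefaultHandler {
        boolean started = false;

        @Override
        public void startDocument() {
            started = true;
        }
    }

    // Mimics the current bail-out: return without touching the handler.
    static void parseSilently(ContentHandler handler) {
    }

    // Mimics the suggested bail-out: signal an empty document.
    static void parseAsEmptyDocument(ContentHandler handler) throws SAXException {
        handler.startDocument();
        handler.endDocument();
    }

    // What a crawler can conclude from one parse call.
    static String outcome(StubParser parser) {
        SeenHandler handler = new SeenHandler();
        try {
            parser.parse(handler);
        } catch (SAXException e) {
            return "error";
        }
        return handler.started ? "crawled" : "unnoticed";
    }

    public static void main(String[] args) {
        System.out.println(outcome(BailOutDemo::parseSilently));        // prints unnoticed
        System.out.println(outcome(BailOutDemo::parseAsEmptyDocument)); // prints crawled
        System.out.println(outcome(h -> {
            throw new SAXException("no moov box");
        }));                                                             // prints error
    }
}
```

Only the silent return leaves the crawler with neither a counted entity nor
an error to report.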



Looking forward to your opinions!

Chris



-- 
______________________________________________________________________________
Christian Reuschling, Dipl.-Ing.(BA)
Software Engineer

Knowledge Management Department
German Research Center for Artificial Intelligence DFKI GmbH
Trippstadter Straße 122, D-67663 Kaiserslautern, Germany

Phone: +49.631.20575-1250
mailto:reuschl...@dfki.de  http://www.dfki.uni-kl.de/~reuschling/

------------Legal Company Information Required by German Law------------------
Geschäftsführung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
                  Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
______________________________________________________________________________
