[ 
https://issues.apache.org/jira/browse/TIKA-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob Paulin updated TIKA-1904:
-----------------------------
    Description: 
Several parsers and detectors in Tika 2.0 instantiate parsers and detectors 
that live in other modules. As a result, these modules currently depend on 
other modules, including:
tika-parser-office-module -> tika-parser-web-module, tika-parser-text-module, 
tika-parser-package-module
tika-parser-ebook-module -> tika-parser-text-module
tika-parser-journal-module -> tika-parser-pdf-module

Many of these dependencies could be made optional by introducing proxy 
parsers and detectors. A proxy would enable the functionality when all the 
dependencies are included in the project, but would not throw a 
ClassNotFoundException when the dependent module is not included (e.g. the 
parse function would do nothing).

Example
Currently, in ChmParser:
{code}
private void parsePage(byte[] byteObject, ContentHandler xhtml) throws TikaException { // throws IOException
    InputStream stream = null;
    Metadata metadata = new Metadata();
    HtmlParser htmlParser = new HtmlParser();
    ContentHandler handler = new EmbeddedContentHandler(new BodyContentHandler(xhtml)); // -1
    ParseContext parser = new ParseContext();
    try {
        stream = new ByteArrayInputStream(byteObject);
        htmlParser.parse(stream, handler, metadata, parser);
    } catch (SAXException e) {
        throw new RuntimeException(e);
    } catch (IOException e) {
        // Pushback overflow from tagsoup
    }
}
{code}

Instead, the HtmlParser could be proxied in the constructor:
{code}
private final Parser htmlProxyParser;

public ChmParser() {
    this.htmlProxyParser = new ParserProxy("org.apache.tika.parser.html.HtmlParser");
}
{code}

And parsePage would then call the proxy:

{code}

private void parsePage(byte[] byteObject, ContentHandler xhtml) throws TikaException { // throws IOException
    InputStream stream = null;
    Metadata metadata = new Metadata();
    ContentHandler handler = new EmbeddedContentHandler(new BodyContentHandler(xhtml)); // -1
    ParseContext parser = new ParseContext();
    try {
        stream = new ByteArrayInputStream(byteObject);
        htmlProxyParser.parse(stream, handler, metadata, parser);
    } catch (SAXException e) {
        throw new RuntimeException(e);
    } catch (IOException e) {
        // Pushback overflow from tagsoup
    }
}
{code}
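
One possible shape for such a proxy is sketched below. This is only an illustration of the idea, not the implementation proposed for Tika. To compile without Tika on the classpath it uses a hypothetical one-method interface, MiniParser, in place of org.apache.tika.parser.Parser, and a hypothetical UpperCaseParser in place of a parser from an optional module. The isAvailable() method is also an assumption added for the example. A real ParserProxy would presumably implement Parser and forward the four-argument parse call shown above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical stand-in for org.apache.tika.parser.Parser, reduced to one
// method so the sketch compiles without Tika.
interface MiniParser {
    void parse(InputStream stream, StringBuilder out) throws IOException;
}

// Hypothetical parser playing the role of a class from an optional module.
class UpperCaseParser implements MiniParser {
    @Override
    public void parse(InputStream stream, StringBuilder out) throws IOException {
        String text = new String(stream.readAllBytes(), StandardCharsets.UTF_8);
        out.append(text.toUpperCase(Locale.ROOT));
    }
}

// Looks the target parser up by class name. If the class cannot be loaded
// or instantiated, the proxy silently becomes a no-op.
class ParserProxy implements MiniParser {
    private final MiniParser delegate; // null when the target is unavailable

    ParserProxy(String className) {
        MiniParser resolved = null;
        try {
            Class<?> clazz = Class.forName(className);
            resolved = (MiniParser) clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | LinkageError | ClassCastException e) {
            // Optional module missing or unusable: stay in no-op mode.
        }
        this.delegate = resolved;
    }

    // Assumed helper so callers can tell whether the delegate was found.
    boolean isAvailable() {
        return delegate != null;
    }

    @Override
    public void parse(InputStream stream, StringBuilder out) throws IOException {
        if (delegate != null) {
            delegate.parse(stream, out);
        }
        // Otherwise do nothing, as described above.
    }
}

class ParserProxySketch {
    public static void main(String[] args) throws IOException {
        byte[] page = "chm page".getBytes(StandardCharsets.UTF_8);

        // Target class is on the classpath: the call is forwarded.
        StringBuilder found = new StringBuilder();
        new ParserProxy(UpperCaseParser.class.getName())
                .parse(new ByteArrayInputStream(page), found);
        System.out.println("present: [" + found + "]");

        // Target class is absent: parse does nothing and nothing is thrown.
        StringBuilder missing = new StringBuilder();
        new ParserProxy("org.example.NotOnClasspath")
                .parse(new ByteArrayInputStream(page), missing);
        System.out.println("missing: [" + missing + "]");
    }
}
```

Resolving the class once in the constructor keeps the reflection cost out of every parse call, and the same pattern would carry over to a detector proxy.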



  was:
Several parsers and detectors in Tika 2.0 instantiate parsers and detectors 
that live in other modules. As a result, these modules currently depend on 
other modules, including:
tika-parser-office-module -> tika-parser-web-module, tika-parser-text-module, 
tika-parser-package-module
tika-parser-ebook-module -> tika-parser-text-module
tika-parser-journal-module -> tika-parser-pdf-module

Many of these dependencies could be made optional by introducing proxy 
parsers and detectors. A proxy would enable the functionality when all the 
dependencies are included in the project, but would not throw a 
ClassNotFoundException when the dependent module is not included (e.g. the 
parse function would do nothing).

Example
Currently, in ChmParser:
{code}
private void parsePage(byte[] byteObject, ContentHandler xhtml) throws TikaException { // throws IOException
    InputStream stream = null;
    Metadata metadata = new Metadata();
    HtmlParser htmlParser = new HtmlParser();
    ContentHandler handler = new EmbeddedContentHandler(new BodyContentHandler(xhtml)); // -1
    ParseContext parser = new ParseContext();
    try {
        stream = new ByteArrayInputStream(byteObject);
        htmlParser.parse(stream, handler, metadata, parser);
    } catch (SAXException e) {
        throw new RuntimeException(e);
    } catch (IOException e) {
        // Pushback overflow from tagsoup
    }
}
{code}

Instead, the HtmlParser could be proxied in the constructor:
{code}
private final Parser htmlProxyParser;

public ChmParser() {
    this.htmlProxyParser = new ProxyParser("org.apache.tika.parser.html.HtmlParser");
}
{code}

And parsePage would then call the proxy:

{code}

private void parsePage(byte[] byteObject, ContentHandler xhtml) throws TikaException { // throws IOException
    InputStream stream = null;
    Metadata metadata = new Metadata();
    ContentHandler handler = new EmbeddedContentHandler(new BodyContentHandler(xhtml)); // -1
    ParseContext parser = new ParseContext();
    try {
        stream = new ByteArrayInputStream(byteObject);
        htmlProxyParser.parse(stream, handler, metadata, parser);
    } catch (SAXException e) {
        throw new RuntimeException(e);
    } catch (IOException e) {
        // Pushback overflow from tagsoup
    }
}
{code}




> Tika 2.0 - Create Proxy Parser and Detectors
> --------------------------------------------
>
>                 Key: TIKA-1904
>                 URL: https://issues.apache.org/jira/browse/TIKA-1904
>             Project: Tika
>          Issue Type: Improvement
>    Affects Versions: 2.0
>            Reporter: Bob Paulin
>            Assignee: Bob Paulin
>
> Several parsers and detectors in Tika 2.0 instantiate parsers and detectors 
> that live in other modules. As a result, these modules currently depend on 
> other modules, including:
> tika-parser-office-module -> tika-parser-web-module, tika-parser-text-module, 
> tika-parser-package-module
> tika-parser-ebook-module -> tika-parser-text-module
> tika-parser-journal-module -> tika-parser-pdf-module
> Many of these dependencies could be made optional by introducing proxy 
> parsers and detectors. A proxy would enable the functionality when all the 
> dependencies are included in the project, but would not throw a 
> ClassNotFoundException when the dependent module is not included (e.g. the 
> parse function would do nothing).
> Example
> Currently, in ChmParser:
> {code}
> private void parsePage(byte[] byteObject, ContentHandler xhtml) throws TikaException { // throws IOException
>     InputStream stream = null;
>     Metadata metadata = new Metadata();
>     HtmlParser htmlParser = new HtmlParser();
>     ContentHandler handler = new EmbeddedContentHandler(new BodyContentHandler(xhtml)); // -1
>     ParseContext parser = new ParseContext();
>     try {
>         stream = new ByteArrayInputStream(byteObject);
>         htmlParser.parse(stream, handler, metadata, parser);
>     } catch (SAXException e) {
>         throw new RuntimeException(e);
>     } catch (IOException e) {
>         // Pushback overflow from tagsoup
>     }
> }
> {code}
> Instead, the HtmlParser could be proxied in the constructor:
> {code}
> private final Parser htmlProxyParser;
>
> public ChmParser() {
>     this.htmlProxyParser = new ParserProxy("org.apache.tika.parser.html.HtmlParser");
> }
> {code}
> And parsePage would then call the proxy:
> {code}
> private void parsePage(byte[] byteObject, ContentHandler xhtml) throws TikaException { // throws IOException
>     InputStream stream = null;
>     Metadata metadata = new Metadata();
>     ContentHandler handler = new EmbeddedContentHandler(new BodyContentHandler(xhtml)); // -1
>     ParseContext parser = new ParseContext();
>     try {
>         stream = new ByteArrayInputStream(byteObject);
>         htmlProxyParser.parse(stream, handler, metadata, parser);
>     } catch (SAXException e) {
>         throw new RuntimeException(e);
>     } catch (IOException e) {
>         // Pushback overflow from tagsoup
>     }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
