[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules
[ https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171217#comment-15171217 ] Luis Filipe Nassif commented on TIKA-1824: -- Well, a PDF can also be an attachment, and office documents can be inside a zip file, yet PDF and zip are each in their own modules. So I think it is OK to create an email module. > Tika 2.0 - Create Initial Parser Modules > - > > Key: TIKA-1824 > URL: https://issues.apache.org/jira/browse/TIKA-1824 > Project: Tika > Issue Type: Improvement >Affects Versions: 2.0 >Reporter: Bob Paulin >Assignee: Bob Paulin > > Create initial break down of parser modules. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[ https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170711#comment-15170711 ] Luis Filipe Nassif commented on TIKA-1824: -- Great job [~bobpaulin]! I suggest putting MboxParser, OutlookPSTParser and RFC822Parser in a separate tika-mail-parser module. OutlookPSTParser depends on java-libpst, not on POI. MboxParser depends on RFC822Parser. Unfortunately, Outlook MSG parsing depends on POI and should stay in the tika-office-parser module.
[ https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15133821#comment-15133821 ] Konstantin Gribov commented on TIKA-1824: - I'm on vacation now, so I have reviewed this topic only briefly. Great work (y). I will take a look at the 2.x branch after I return. Do the mbox, outlook and rfc822 parsers go into one module? My +1 to prefixing `artifactId`s with `tika-parser(s)-` or at least `tika-`. I personally prefer `tika-parsers-`, which is an eloquent and meaningful name for the artifacts and makes Tika simpler to use for downstream developers.
[ https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15132503#comment-15132503 ] Tim Allison commented on TIKA-1824: ---
bq. Thanks so much for the feedback, these are great things to be discussing.
Yes, yes, indeed. Thank you, [~kkrugler], [~rgauss], and of course [~bobpaulin]! Consensus for now...keep as is? Sounds good to me.
bq. so I was considering creating projects with a bundle suffix that would embed the dependencies individually as tika-bundle did...
Interesting. So, OSGi aside for the following (sorry), for those with, um, challenged development environments (i.e. medical/financial fields where you might only be allowed to bring in publicly released jars), users who only wanted to parse PDFs, say, could then grab tika-core.jar, tika-batch.jar, orig-tika-app.jar and tika-parser-pdf-bundle.jar and be able to parse PDFs? That would be awesome from the standpoint of several use cases I've seen. Did I get this right? What do others think?
[ https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15132507#comment-15132507 ] Tim Allison commented on TIKA-1824: --- Sorry, [~grossws], [~thaichat04] and [~lfcnassif], I should have included you in the above! :)
[ https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130386#comment-15130386 ] Ray Gauss II commented on TIKA-1824:
bq. Thank you, Bob Paulin! Again, this is fantastic.
Indeed, thanks!
bq. Perhaps add "parser(s?)" to the artifactId, e.g. tika-parser-cad-module
Now that the change is in there it seems a bit redundant to have parser and module in every artifact ID. {{tika-parser-*}} follows the least to most specific precedence and they're so perhaps we could just remove module?
I had some concerns over the apparent duplication of dependencies / versions but it looks like that will be addressed in TIKA-1847.
[ https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131500#comment-15131500 ] Tim Allison commented on TIKA-1824: ---
bq. Perhaps add "parser(s?)" to the artifactId
Y, sorry, [~bobpaulin], now that I see it, I'm changing my mind... Should we get rid of "tika-parser-" entirely, e.g.:
* advanced-module
* cad-module
or perhaps:
* advanced-parsers
* cad-parsers
[ https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131650#comment-15131650 ] Bob Paulin commented on TIKA-1824: --
So before we go that way, let me explain what about your previous suggestion made me change my mind. Consider the developer looking at this in a lib directory or an IDE. If they just see advanced-parser or cad-parser, I feel we're opening the door for confusion, as there are many other libraries that do parsing. Though it's redundant to the maintainers to have tika-parser-* as a prefix, it could ease the life of the end user developer trying to sort out JAR hell on their classpath.
Second, I have the module suffix because I'm still mulling how to replace tika-bundle. Currently there are still many tika dependencies that are not OSGi friendly. We've been getting around this by embedding them in tika-bundle. The module suffix jars do not have dependencies embedded, so I was considering creating projects with a bundle suffix that would embed the dependencies individually as tika-bundle did. I'm curious what the rest of the community thinks of this approach. Naturally, if we figure out a way to eliminate the need for two variants, then I agree the module suffix is redundant and can be removed.
My 2 cents. Thanks so much for the feedback, these are great things to be discussing.
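To make the lib-directory argument above concrete, here is a minimal sketch using only the JDK. The class `JarNameGrouping` and every jar name in the usage note are hypothetical and exist only for illustration; nothing here is taken from the Tika build. It checks whether a set of artifacts ends up listed next to each other once a directory listing is sorted alphabetically.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Tika: illustrates the naming argument only.
final class JarNameGrouping {
    private JarNameGrouping() {}

    /**
     * Returns true when every name in group sits next to the others once
     * allNames is sorted alphabetically, roughly as a file manager or IDE
     * would list a lib directory. Assumes group is non-empty, has no
     * duplicates, and is a subset of allNames.
     */
    static boolean listedTogether(List<String> allNames, List<String> group) {
        List<String> sorted = new ArrayList<>(allNames);
        Collections.sort(sorted);
        int first = -1;
        int last = -1;
        for (int i = 0; i < sorted.size(); i++) {
            if (group.contains(sorted.get(i))) {
                if (first < 0) {
                    first = i;
                }
                last = i;
            }
        }
        // The group is contiguous when the span it covers has exactly its own size.
        return first >= 0 && last - first + 1 == group.size();
    }
}
```

With made-up names such as tika-parser-cad.jar and tika-parser-advanced.jar mixed in among commons-io.jar, pdfbox.jar and asm.jar, the two Tika jars sort next to each other; with cad-parsers.jar and advanced-parsers.jar they are separated by asm.jar, which is the kind of scattering the shared prefix is meant to avoid.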
[ https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131749#comment-15131749 ] Ken Krugler commented on TIKA-1824: --- As someone who regularly deals with 100s of jars in the dependency tree, I'm a big +1 for having "tika-" as a prefix for every jar. I'm less concerned about tika-cad-parsers vs. tika-parsers-cad (as an example), with a mild preference for the former. I'd rather not have the module suffix, mostly because I haven't been paying any attention to the OSGi issues, nor do I have a use case for that yet, and thus it doesn't add value for me personally. But that's a very weak -1, given my lack of background in this space.
[ https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106752#comment-15106752 ] Tim Allison commented on TIKA-1824: --- Thank you, [~bobpaulin]! Again, this is fantastic. I should have a chance to take a look later today. [~chrismattmann], [~gagravarr], [~kkrugler], [~lewismc], [~rgauss] or others, any feedback on this massive refactoring?
[ https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15103913#comment-15103913 ] Hudson commented on TIKA-1824: --
SUCCESS: Integrated in tika-2.x #13 (See [https://builds.apache.org/job/tika-2.x/13/])
TIKA-1824 - Add CTakes resource to scientific module (bob: [http://svn.apache.org/viewvc/tika/trunk/?view=rev=1725119])
* trunk/tika-parser-modules/tika-parser-scientific-module/src/main/resources/org
* trunk/tika-parser-modules/tika-parser-scientific-module/src/main/resources/org/apache
* trunk/tika-parser-modules/tika-parser-scientific-module/src/main/resources/org/apache/tika
* trunk/tika-parser-modules/tika-parser-scientific-module/src/main/resources/org/apache/tika/parser
* trunk/tika-parser-modules/tika-parser-scientific-module/src/main/resources/org/apache/tika/parser/ctakes
* trunk/tika-parser-modules/tika-parser-scientific-module/src/main/resources/org/apache/tika/parser/ctakes/CTAKESConfig.properties
TIKA-1824 - Remove CTakes resource from web module (bob: [http://svn.apache.org/viewvc/tika/trunk/?view=rev=1725118])
* trunk/tika-parser-modules/tika-parser-web-module/src/main/resources/org
[ https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15103490#comment-15103490 ] Hudson commented on TIKA-1824: -- SUCCESS: Integrated in tika-2.x #11 (See [https://builds.apache.org/job/tika-2.x/11/]) TIKA-1824 - Lowercase parent parser module (bob: [http://svn.apache.org/viewvc/tika/trunk/?view=rev=1725045]) * trunk/tika-parser-modules/pom.xml
[ https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15103450#comment-15103450 ] Hudson commented on TIKA-1824: -- SUCCESS: Integrated in tika-2.x #10 (See [https://builds.apache.org/job/tika-2.x/10/]) TIKA-1824 - Moved parser text to before module name. (bob: [http://svn.apache.org/viewvc/tika/trunk/?view=rev=1725033]) * trunk/tika-parser-bundles/tika-multimedia-bundle/pom.xml * trunk/tika-parser-modules/pom.xml * trunk/tika-parser-modules/tika-advanced-parser-module * trunk/tika-parser-modules/tika-cad-parser-module * trunk/tika-parser-modules/tika-code-parser-module * trunk/tika-parser-modules/tika-database-parser-module * trunk/tika-parser-modules/tika-ebook-parser-module * trunk/tika-parser-modules/tika-journal-parser-module * trunk/tika-parser-modules/tika-multimedia-parser-module * trunk/tika-parser-modules/tika-office-parser-module * trunk/tika-parser-modules/tika-package-parser-module * trunk/tika-parser-modules/tika-parser-advanced-module * trunk/tika-parser-modules/tika-parser-advanced-module/pom.xml * trunk/tika-parser-modules/tika-parser-advanced-module/src * trunk/tika-parser-modules/tika-parser-cad-module * trunk/tika-parser-modules/tika-parser-cad-module/pom.xml * trunk/tika-parser-modules/tika-parser-cad-module/src * trunk/tika-parser-modules/tika-parser-code-module * trunk/tika-parser-modules/tika-parser-code-module/pom.xml * trunk/tika-parser-modules/tika-parser-code-module/src * trunk/tika-parser-modules/tika-parser-database-module * trunk/tika-parser-modules/tika-parser-database-module/pom.xml * trunk/tika-parser-modules/tika-parser-database-module/src * trunk/tika-parser-modules/tika-parser-ebook-module * trunk/tika-parser-modules/tika-parser-ebook-module/pom.xml * trunk/tika-parser-modules/tika-parser-ebook-module/src * trunk/tika-parser-modules/tika-parser-journal-module * trunk/tika-parser-modules/tika-parser-journal-module/pom.xml * 
trunk/tika-parser-modules/tika-parser-journal-module/src * trunk/tika-parser-modules/tika-parser-multimedia-module * trunk/tika-parser-modules/tika-parser-multimedia-module/pom.xml * trunk/tika-parser-modules/tika-parser-office-module * trunk/tika-parser-modules/tika-parser-office-module/pom.xml * trunk/tika-parser-modules/tika-parser-office-module/src * trunk/tika-parser-modules/tika-parser-package-module * trunk/tika-parser-modules/tika-parser-package-module/pom.xml * trunk/tika-parser-modules/tika-parser-package-module/src * trunk/tika-parser-modules/tika-parser-pdf-module * trunk/tika-parser-modules/tika-parser-pdf-module/pom.xml * trunk/tika-parser-modules/tika-parser-pdf-module/src * trunk/tika-parser-modules/tika-parser-scientific-module * trunk/tika-parser-modules/tika-parser-scientific-module/pom.xml * trunk/tika-parser-modules/tika-parser-scientific-module/src * trunk/tika-parser-modules/tika-parser-text-module * trunk/tika-parser-modules/tika-parser-text-module/pom.xml * trunk/tika-parser-modules/tika-parser-text-module/src * trunk/tika-parser-modules/tika-parser-web-module * trunk/tika-parser-modules/tika-parser-web-module/pom.xml * trunk/tika-parser-modules/tika-pdf-parser-module * trunk/tika-parser-modules/tika-scientific-parser-module * trunk/tika-parser-modules/tika-text-parser-module * trunk/tika-parser-modules/tika-web-parser-module * trunk/tika-parsers/pom.xml
[ https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15103416#comment-15103416 ] Hudson commented on TIKA-1824: -- SUCCESS: Integrated in tika-2.x #9 (See [https://builds.apache.org/job/tika-2.x/9/]) TIKA-1824 - Added SVN ignores (bob: [http://svn.apache.org/viewvc/tika/trunk/?view=rev=1725015]) * trunk/tika-parser-modules/tika-advanced-parser-module * trunk/tika-parser-modules/tika-cad-parser-module * trunk/tika-parser-modules/tika-code-parser-module * trunk/tika-parser-modules/tika-database-parser-module * trunk/tika-parser-modules/tika-ebook-parser-module * trunk/tika-parser-modules/tika-journal-parser-module * trunk/tika-parser-modules/tika-office-parser-module * trunk/tika-parser-modules/tika-package-parser-module * trunk/tika-parser-modules/tika-pdf-parser-module * trunk/tika-parser-modules/tika-scientific-parser-module * trunk/tika-parser-modules/tika-text-parser-module TIKA-1824 - Big Renaming. Adding parsers to the artifact names and descriptions. 
(bob: [http://svn.apache.org/viewvc/tika/trunk/?view=rev=1725014]) * trunk/tika-parser-bundles/tika-multimedia-bundle/pom.xml * trunk/tika-parser-modules/pom.xml * trunk/tika-parser-modules/tika-advanced-module * trunk/tika-parser-modules/tika-advanced-parser-module * trunk/tika-parser-modules/tika-advanced-parser-module/pom.xml * trunk/tika-parser-modules/tika-advanced-parser-module/src * trunk/tika-parser-modules/tika-advanced-parser-module/src/main * trunk/tika-parser-modules/tika-advanced-parser-module/src/main/java * trunk/tika-parser-modules/tika-advanced-parser-module/src/main/java/org * trunk/tika-parser-modules/tika-advanced-parser-module/src/main/java/org/apache * trunk/tika-parser-modules/tika-advanced-parser-module/src/main/java/org/apache/tika * trunk/tika-parser-modules/tika-advanced-parser-module/src/main/java/org/apache/tika/parser * trunk/tika-parser-modules/tika-advanced-parser-module/src/main/java/org/apache/tika/parser/crypto * trunk/tika-parser-modules/tika-advanced-parser-module/src/main/java/org/apache/tika/parser/crypto/Pkcs7Parser.java * trunk/tika-parser-modules/tika-advanced-parser-module/src/main/java/org/apache/tika/parser/ner * trunk/tika-parser-modules/tika-advanced-parser-module/src/main/java/org/apache/tika/parser/ner/NERecogniser.java * trunk/tika-parser-modules/tika-advanced-parser-module/src/main/java/org/apache/tika/parser/ner/NamedEntityParser.java * trunk/tika-parser-modules/tika-advanced-parser-module/src/main/java/org/apache/tika/parser/ner/corenlp * trunk/tika-parser-modules/tika-advanced-parser-module/src/main/java/org/apache/tika/parser/ner/corenlp/CoreNLPNERecogniser.java * trunk/tika-parser-modules/tika-advanced-parser-module/src/main/java/org/apache/tika/parser/ner/opennlp * trunk/tika-parser-modules/tika-advanced-parser-module/src/main/java/org/apache/tika/parser/ner/opennlp/OpenNLPNERecogniser.java * 
trunk/tika-parser-modules/tika-advanced-parser-module/src/main/java/org/apache/tika/parser/ner/opennlp/OpenNLPNameFinder.java * trunk/tika-parser-modules/tika-advanced-parser-module/src/main/java/org/apache/tika/parser/ner/regex * trunk/tika-parser-modules/tika-advanced-parser-module/src/main/java/org/apache/tika/parser/ner/regex/RegexNERecogniser.java * trunk/tika-parser-modules/tika-advanced-parser-module/src/main/resources * trunk/tika-parser-modules/tika-advanced-parser-module/src/main/resources/META-INF * trunk/tika-parser-modules/tika-advanced-parser-module/src/main/resources/META-INF/services * trunk/tika-parser-modules/tika-advanced-parser-module/src/main/resources/META-INF/services/org.apache.tika.parser.Parser * trunk/tika-parser-modules/tika-advanced-parser-module/src/main/resources/org * trunk/tika-parser-modules/tika-advanced-parser-module/src/main/resources/org/apache * trunk/tika-parser-modules/tika-advanced-parser-module/src/main/resources/org/apache/tika * trunk/tika-parser-modules/tika-advanced-parser-module/src/main/resources/org/apache/tika/parser * trunk/tika-parser-modules/tika-advanced-parser-module/src/main/resources/org/apache/tika/parser/ner * trunk/tika-parser-modules/tika-advanced-parser-module/src/main/resources/org/apache/tika/parser/ner/regex * trunk/tika-parser-modules/tika-advanced-parser-module/src/main/resources/org/apache/tika/parser/ner/regex/ner-regex.txt * trunk/tika-parser-modules/tika-advanced-parser-module/src/test * trunk/tika-parser-modules/tika-advanced-parser-module/src/test/java * trunk/tika-parser-modules/tika-advanced-parser-module/src/test/java/org * trunk/tika-parser-modules/tika-advanced-parser-module/src/test/java/org/apache * trunk/tika-parser-modules/tika-advanced-parser-module/src/test/java/org/apache/tika * trunk/tika-parser-modules/tika-advanced-parser-module/src/test/java/org/apache/tika/parser * trunk/tika-parser-modules/tika-advanced-parser-module/src/test/java/org/apache/tika/parser/crypto *
[ https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097884#comment-15097884 ] Nick Burch commented on TIKA-1824: -- Tika already supports using a custom classloader for loading parser and detector classes as well as SPI files - http://tika.apache.org/1.11/api/org/apache/tika/config/TikaConfig.html#TikaConfig%28java.lang.ClassLoader%29
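As a rough illustration of how the constructor linked above could be combined with a separate lib directory, the sketch below builds a `URLClassLoader` over every jar in a directory. It uses only the JDK so that it compiles without Tika on the classpath; the class name `ParserLibLoader` and the directory layout are assumptions made for this example, not anything Tika or Solr ships.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika or Solr.
final class ParserLibLoader {
    private ParserLibLoader() {}

    /**
     * Builds a classloader that sees every *.jar file directly inside libDir
     * and delegates everything else to parent. The caller should close the
     * returned loader when it is no longer needed.
     */
    static URLClassLoader forDirectory(Path libDir, ClassLoader parent) throws IOException {
        List<URL> urls = new ArrayList<>();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(libDir, "*.jar")) {
            for (Path jar : jars) {
                urls.add(jar.toUri().toURL());
            }
        }
        return new URLClassLoader(urls.toArray(new URL[0]), parent);
    }
}
```

The resulting loader could then be handed to the `TikaConfig(ClassLoader)` constructor from the Javadoc link above. Whether that gives enough isolation for a given deployment depends on the parent loader chosen, so treat this as a starting point rather than a recommendation.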
[ https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15096668#comment-15096668 ] Uwe Schindler commented on TIKA-1824: -
Hi, as invited on TIKA-1830, here are some comments from Apache Solr:
{quote}
As already stated in the past, we would like to bundle only parsers for text document formats, because images, class files and the like are not really useful for indexing by default. Users who want to index them can still add the missing parser bundles and SPI will do the rest. Currently we have disabled some parsers by removing the JAR files (like asm-all.jar, netcdf.jar), so TIKA's SPI will disable them automatically (because of ClassNotFoundEx). This was a bit rude, but worked. The reason for this was partly also some version incompatibilities (ASM was old in TIKA, Lucene needs the newest one), but ASM is not really useful for indexing anyway! In Solr we don't use transitive dependencies in Ivy, so we decide for each JAR file which one gets bundled, so we check every release anyway during update.
{quote}
In addition, it would be a good idea to allow loading the TIKA SPI files in a separate classloader (to isolate the parser classes from others). The reason for this is JAR hell. If TIKA could load the parsers in its own classloader (optionally, e.g. by configuration), we could place all parsers and their dependencies in a separate lib directory outside Solr's lib folder.
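The "remove the JAR and the parser disappears" behaviour described above can be pictured with a small JDK-only sketch. This is not Tika's actual service-loading code; `TolerantProviderLoader` is a hypothetical class, and the sketch only assumes what the comment states: a provider whose class (or a class it needs) cannot be loaded is skipped instead of failing the whole lookup.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not Tika's service loader.
final class TolerantProviderLoader {
    private TolerantProviderLoader() {}

    /**
     * Tries to instantiate each named class through its no-argument
     * constructor and returns the instances that could be created.
     * Names that cannot be loaded or instantiated are silently skipped.
     */
    static List<Object> instantiateAvailable(List<String> classNames, ClassLoader loader) {
        List<Object> providers = new ArrayList<>();
        for (String name : classNames) {
            try {
                Class<?> type = Class.forName(name, true, loader);
                providers.add(type.getDeclaredConstructor().newInstance());
            } catch (ReflectiveOperationException | LinkageError e) {
                // The class, or something it links against, is missing from the
                // classpath (or it has no usable no-arg constructor): skip it.
            }
        }
        return providers;
    }
}
```

Given a list containing `java.util.ArrayList` and a class name that does not exist, the method returns a single instance and ignores the missing entry. A production loader would presumably log what it skipped, which this sketch leaves out for brevity.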
[ https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15090903#comment-15090903 ] Hudson commented on TIKA-1824: -- SUCCESS: Integrated in tika-2.x #6 (See [https://builds.apache.org/job/tika-2.x/6/]) TIKA-1824 - Fixed incorrect path prefix condition. (bob: [http://svn.apache.org/viewvc/tika/trunk/?view=rev=1723903]) * trunk/tika-test-resources/src/test/resources/org/apache/tika/parser/ner/opennlp/ModelGetter.groovy
[ https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15090285#comment-15090285 ] Bob Paulin commented on TIKA-1824: --
* Perhaps rename artifact names in parser sub-components to include "Parser(s?)", e.g. Apache Tika Parser Advanced Module so that the names sort more clearly (at least in the maven window in Intellij)?
I think I felt it was redundant, but in a maven repo it could be helpful, so I can make that change.
* Perhaps add "parser(s?)" to the artifactId, e.g. tika-parser-cad-module
Same as above.
* Perhaps lowercase names in parser-subcomponents so that they're inline with legacy: "Apache Tika parser advanced module"
I think I'm missing where this convention is coming from.
* Pkcs7Parser ... should that be under advanced...or somewhere else ...own crypto package?
I don't feel strongly that it needs to be under advanced, but I do want to be careful not to overdo the number of modules. Do you feel crypto has room for growth, or is this just going to forever be a one-parser project?
* iwork ...should we move that to office?
I think it could fit there too. No issues moving.
* tika-test-resources...should we move TikaTest into that and change the name to tika-test? I have a vague memory of wanting to carve out a separate test package earlier and adding TikaTest and something else...
I think it could work in tika-core or tika-test. I don't feel strongly either way.
* OutlookPSTParser...move that to office?
I'd like to keep this class with all the other mbox classes. Maybe move mbox to office?
* Does MBox belong in web? Not sure where to put it?
Move to office?
* Move CommonsDigester to core if we're willing to add a dependency on commons-codec into core?
I'm fine with this.
* Move Activator to tika-bundle?
I believe tika-bundle already has an activator. Could just remove this.
* Move pot to multimedia or add tika-parsers-multimedia-advanced-module?
Not sure I understand POT in multimedia. Can you elaborate?
* Move geo.topic to "advanced"...perhaps we rename "advanced" to ner?
Is ner only applied to geo? My understanding of this domain is limited.
* Move ctakes to "advanced/ner"?
Again, my understanding of the domain is limited on what ctakes fits with.
* Collapse web and text?
Not sure I like that, since a number of modules depend on text but not web. Seems like we'd be adding a lot of needless dependencies.
[ https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15090355#comment-15090355 ] Hudson commented on TIKA-1824: --
UNSTABLE: Integrated in tika-2.x #5 (See [https://builds.apache.org/job/tika-2.x/5/])
TIKA-1824 - Move mbox to office. (bob: [http://svn.apache.org/viewvc/tika/trunk/?view=rev=1723820])
* trunk/tika-parser-modules/tika-office-module/pom.xml
* trunk/tika-parser-modules/tika-office-module/src/main/java/org/apache/tika/parser/mbox
* trunk/tika-parser-modules/tika-office-module/src/main/java/org/apache/tika/parser/mbox/MboxParser.java
* trunk/tika-parser-modules/tika-office-module/src/main/java/org/apache/tika/parser/mbox/OutlookPSTParser.java
* trunk/tika-parser-modules/tika-office-module/src/test/java/org/apache/tika/parser/mbox
* trunk/tika-parser-modules/tika-office-module/src/test/java/org/apache/tika/parser/mbox/MboxParserTest.java
* trunk/tika-parser-modules/tika-office-module/src/test/java/org/apache/tika/parser/mbox/OutlookPSTParserTest.java
* trunk/tika-parser-modules/tika-web-module/pom.xml
* trunk/tika-parser-modules/tika-web-module/src/main/java/org/apache/tika/parser/mbox
* trunk/tika-parser-modules/tika-web-module/src/test/java/org/apache/tika/parser/mbox
[ https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15087494#comment-15087494 ] Hudson commented on TIKA-1824: -- UNSTABLE: Integrated in tika-2.x #4 (See [https://builds.apache.org/job/tika-2.x/4/]) TIKA-1824 - Adding parent path to tika-test-resources (bob: [http://svn.apache.org/viewvc/tika/trunk/?view=rev=1723536]) * trunk/tika-test-resources/pom.xml
[ https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085946#comment-15085946 ] Tim Allison commented on TIKA-1824: ---
[~bobpaulin], this is an awesome step forward. Must have been a fair amount of work. Thank you!
Few questions...not just for you, but for all. I'm happy to submit/commit patches, but I want to make sure I don't do anything objectionable to the community.
* This is probably user error, but I'm getting: \[ERROR\] Failed to execute goal org.apache.maven.plugins:maven-dependency-plugin:2.10:unpack (unpack) on project tika-text-module: Unable to find artifact. Could not find artifact org.apache.tika:tika-test-resources:jar:tests:2.0-SNAPSHOT in apache.snapshots (http://repository.apache.org/snapshots)
* Perhaps rename artifact names in parser sub-components to include "Parser(s?)", e.g. Apache Tika Parser Advanced Module so that the names sort more clearly (at least in the maven window in Intellij)?
* Perhaps add "parser(s?)" to the artifactId, e.g. tika-parser-cad-module
* Perhaps lowercase names in parser-subcomponents so that they're inline with legacy: "Apache Tika parser advanced module"
* Pkcs7Parser ... should that be under advanced...or somewhere else ...own crypto package?
* iwork ...should we move that to office?
* tika-test-resources...should we move TikaTest into that and change the name to tika-test? I have a vague memory of wanting to carve out a separate test package earlier and adding TikaTest and something else...
* OutlookPSTParser...move that to office?
* Does MBox belong in web? Not sure where to put it?
* Move CommonsDigester to core _if_ we're willing to add a dependency on commons-codec into core?
* Move Activator to tika-bundle?
* Move pot to multimedia or add tika-parsers-multimedia-advanced-module?
* Move geo.topic to "advanced"...perhaps we rename "advanced" to ner?
* Move ctakes to "advanced/ner"?
* Collapse web and text?
Again, this is fantastic. Thank you!
[ https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15086784#comment-15086784 ] Hudson commented on TIKA-1824: -- FAILURE: Integrated in tika-2.x #3 (See [https://builds.apache.org/job/tika-2.x/3/]) TIKA-1824 - Disable Dependency Reduced POM in tika-parsers. This is causing dependencies not to get pulled into tika-app. (bob: [http://svn.apache.org/viewvc/tika/trunk/?view=rev=1723453]) * trunk/tika-parsers/pom.xml
[ https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15086735#comment-15086735 ] Hudson commented on TIKA-1824: -- FAILURE: Integrated in tika-2.x #2 (See [https://builds.apache.org/job/tika-2.x/2/]) TIKA-1824 - Added tika-test-resources to module list so it is built. (bob: [http://svn.apache.org/viewvc/tika/trunk/?view=rev=1723446]) * trunk/pom.xml
[ https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15086170#comment-15086170 ] Bob Paulin commented on TIKA-1824: -- I'm in a bit of a rush today, but the answer to bullet one is that you need to build the tika-test-resources project before anything else. I think we should add tika-test-resources as a module to a parent pom so this happens automatically. Otherwise I'm sure many will hit this issue!